Claude Fable 5: SWE-bench Verified 96%, output from $50.00 / 1M tokens.
Last refreshed 2026-06-10. Next refresh: weekly.
Compare frontier models for multi-step AI agent workflows across SWE-bench Verified, tau-bench, and MultiChallenge. Built for coding agents, tool-use agents, and long-horizon task automation.
Verdict
Claude Opus 4.8 is the runner-up at 88.6%, 7.4 points behind Claude Fable 5 (96%) on SWE-bench Verified.
Agent picks mirror the first-wave home task: SWE-bench Verified is the headline podium, followed by the tau-bench and MultiChallenge tables so that tool-use and long-horizon tradeoffs stay visible.
These source-backed rows qualify for this task page, but they are not scored leaderboard picks until the category benchmark data exists.
| Model | Why it is listed | Status | Tracked price |
|---|---|---|---|
| Qwen3.7-Plus (Tools, Code execution) | Qwen3.7-Plus has agent or tool-use metadata in seed data; keep it separate from the scored agent boards until benchmarks land. | Benchmark pending: no tracked SWE-bench Verified, tau-bench, or MultiChallenge score yet. | In $0.40 / Out $1.60 |
| Holo3.1-35B-A3B (Tools) | Holo3.1-35B-A3B has agent or tool-use metadata in seed data; keep it separate from the scored agent boards until benchmarks land. | Benchmark pending: no tracked SWE-bench Verified, tau-bench, or MultiChallenge score yet. | In $0.25 / Out $1.80 |
| Step 3.7 Flash (Tools) | Step 3.7 Flash has agent or tool-use metadata in seed data; keep it separate from the scored agent boards until benchmarks land. | Benchmark pending: no tracked SWE-bench Verified, tau-bench, or MultiChallenge score yet. | In $0.20 / Out $1.15 |
| Nano Banana Pro (Gemini 3 Pro Image) (Tools) | Nano Banana Pro (Gemini 3 Pro Image) has agent or tool-use metadata in seed data; keep it separate from the scored agent boards until benchmarks land. | Benchmark pending: no tracked SWE-bench Verified, tau-bench, or MultiChallenge score yet. | In $2.00 / Out $120.00 |
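The split between pending rows and scored boards can be sketched as a simple filter. This is a minimal sketch, not the site's actual pipeline: the `ModelRow` record shape, its field names, and the `split_rows` helper are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field

# Benchmarks tracked on this page (names taken from the page text).
TRACKED_BENCHMARKS = ("SWE-bench Verified", "tau-bench", "MultiChallenge")


@dataclass
class ModelRow:
    # Hypothetical record shape, for illustration only.
    name: str
    has_tool_metadata: bool
    scores: dict = field(default_factory=dict)  # benchmark name -> score in percent


def split_rows(rows):
    """Return (scored, pending).

    scored: rows with at least one tracked benchmark score.
    pending: rows with tool-use metadata but no tracked score yet.
    """
    scored, pending = [], []
    for row in rows:
        if any(b in row.scores for b in TRACKED_BENCHMARKS):
            scored.append(row)
        elif row.has_tool_metadata:
            pending.append(row)
    return scored, pending


rows = [
    ModelRow("Claude Fable 5", True, {"SWE-bench Verified": 96.0}),
    ModelRow("Qwen3.7-Plus", True),
    ModelRow("Step 3.7 Flash", True),
]
scored, pending = split_rows(rows)
print("scored:", [r.name for r in scored])
print("pending:", [r.name for r in pending])
```

A row moves from the pending list to a scored board as soon as any one tracked benchmark score is recorded for it.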
SWE-bench Verified: agentic coding performance on human-validated GitHub issues.
| # | Model | Score | Source |
|---|---|---|---|
| 1 | Claude Fable 5 Anthropic's most capable widely released model, built for demanding reasoning and long-horizon agentic work. Claude Fable 5 is the generally available Mythos-class Claude model, supports vision, tool use, structured outputs, prompt caching, Batch API processing, adaptive thinking that is always on, a 1M-token context window, and up to 128k output tokens. It is available through the Claude API, AWS Bedrock, Vertex AI, and Microsoft Foundry as of June 9, 2026, with first-party pricing at $10 per 1M input tokens and $50 per 1M output tokens. | 96% | vals.ai |
| 2 | Claude Opus 4.8 Anthropic's most capable model for complex reasoning, long-horizon agentic coding, and high-autonomy work. Features adaptive thinking, dynamic multi-agent workflows in Claude Code, fast mode (2.5x faster at ~3x lower cost), and effort control. Knowledge cutoff January 2026. Pricing: $5/$25 per 1M tokens in/out. | 88.6% | llm-stats.com |
| 3 | Claude Opus 4.7 Claude Opus 4.7 is Anthropic's generally available flagship model with 1M context, 128K max output, adaptive thinking, and a new tokenizer with roughly 555K words per 1M tokens. | 87.6% | swebench.com |
| 4 | GPT-5.3-Codex Most capable agentic coding model from OpenAI. Optimized for long-horizon, agentic coding tasks in the Codex CLI and API. Note: GPT-5.3-Codex-Spark is a distinct ChatGPT Pro research preview (not API-accessible). | 85% | swebench.com |
| 5 | GPT-5.5 GPT-5.5 is OpenAI's fully retrained agentic model, released April 23, 2026. Optimised for agentic coding, computer use, knowledge work, and early scientific research. Achieves 82.7% on Terminal-Bench 2.0 (Codex CLI scaffold), 84.9% on GDPval, 58.6% on SWE-Bench Pro, 93.6% on GPQA Diamond, and 82.6% on SWE-Bench Verified (Vals.ai independent harness). Knowledge cutoff December 2025. Supports reasoning effort levels (none/low/medium/high/xhigh). Context window 1,050,000 tokens with a long-context surcharge above 272K tokens. Model ID: gpt-5.5. | 82.6% | vals.ai |
| 6 | Claude Opus 4.5 Claude Opus 4.5 is Anthropic's Claude 4.5 model with multimodal text and image input and an optional reasoning mode. It offers a 200K-token context window and scores 80.7 on MMMU. | 80.9% | anthropic.com |
| 7 | Claude Opus 4.6 Claude Opus 4.6 is Anthropic's Claude 4.6 model with multimodal text and image input and an optional reasoning mode. It offers a 1M-token context window and scores 80.8 on SWE-bench Verified. | 80.8% | anthropic.com |
| 8 | DeepSeek V4 Pro DeepSeek V4 Pro is DeepSeek's flagship open-weights model, released April 24 2026 under the MIT license. Architecture: 1.6T total / 49B active parameters, MoE with Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA) hybrid — requiring only 27% of inference FLOPs vs standard 1M-context transformers — plus Manifold-Constrained Hyper-Connections (mHC) and Muon Optimizer. Context window: 1,000,000 tokens; max output: 384,000 tokens (Think Max mode requires ≥384K context). Text-only (no vision/image input). Supports three reasoning modes: Non-Think, Think High, Think Max. Function calling, tool use, and structured outputs supported. Key benchmarks: SWE-bench Verified 80.6%, SWE-bench Pro 55.4%, LiveCodeBench 93.5%, GPQA Diamond 90.1%, MMLU-Pro 87.5%, Terminal-Bench 2.0 67.9%, Chatbot Arena 1460 (2026-04-28). Current API pricing: $0.435/$0.87 per 1M input/output tokens; DeepSeek made the former 75% promotional rate permanent effective 2026-05-31 15:59 UTC. | 80.6% | swebench.com |
| 9 | Gemini 3.1 Pro Preview Google: Gemini 3.1 Pro Preview available via OpenRouter. Pricing: $2/1M input, $12/1M output. | 80.6% | blog.google |
| 10 | MiniMax M3 MiniMax's frontier open-weight multimodal model with MiniMax Sparse Attention (MSA) architecture enabling economical 1M-token context (1/20th compute vs prior generation). Scored 59% SWE-bench Pro, surpassing GPT-5.5. Accepts text, image, and video input. Max output 512K tokens. Pricing: $0.60/$2.40 per 1M tokens in/out (list). Open-weight model weights release pending. | 80.5% | benchlm.ai |
| 11 | Qwen3.7-Max Alibaba's closed-weight flagship language model, announced at the 2026 Alibaba Cloud Summit (May 20). Scored 56.6 on Artificial Analysis Intelligence Index at launch—highest-ranked Chinese model. 1M-token context with prompt caching (up to 90% discount). Pricing: $2.50/$7.50 per 1M tokens in/out. | 80.4% | llm-stats.com |
| 12 | Kimi K2.6 Kimi K2.6 is Moonshot AI's multimodal agentic coding model, released April 20 2026 under a Modified MIT license. Built on a 1-trillion-parameter MoE architecture (32B active, 384 experts with 8 selected per token plus 1 shared expert, 61 layers), it features a 262K context window and up to 65,536 output tokens. Supports native image and video inputs (screenshots, PDFs, spreadsheets). Designed for long-horizon coding with agent swarms of up to 300 sub-agents and 4,000 coordinated steps; Moonshot AI cites 200–300 sequential tool calls without task drift. Key benchmarks: SWE-bench Verified 80.2%, SWE-bench Pro 58.6%, LiveCodeBench v6 89.6%, GPQA Diamond 90.5%, Terminal-Bench 2.0 66.7%. Chatbot Arena Elo 1454 (2026-04-28 snapshot). | 80.2% | swebench.com |
| 13 | MiniMax M2.5 Highspeed MiniMax M2.5 Highspeed is MiniMax's inference-optimized variant of M2.5, released simultaneously in February 2026. It delivers identical intelligence and outputs to standard M2.5 through a specialized inference engine at lower latency. The model supports a 204,800-token context window, 131,072-token max output, function calling, structured output, and reasoning. API model ID: MiniMax-M2.5-highspeed. It is designed for latency-sensitive interactive applications and automated agent pipelines. | 80.2% | minimax.io |
| 14 | GPT-5.2 GPT-5.2 is OpenAI's incremental update in the GPT-5 series offering improvements in agentic coding and long-context performance at 128K context. | 80% | swebench.com |
| 15 | Claude Sonnet 4.6 Claude Sonnet 4.6 is Anthropic's best combination of speed and intelligence. Proprietary decoder-only model with 1M-token context, 64K max output, multimodal vision, extended thinking, and function calling. Available via Anthropic API, AWS Bedrock, GCP Vertex AI, and OpenRouter at $3/1M input and $15/1M output tokens. | 79.6% | nxcode.io |
| 16 | DeepSeek V4 Flash DeepSeek V4 Flash is a 284B parameter (13B activated) Mixture-of-Experts language model with 1M-token context. Features a hybrid attention architecture combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) for efficient long-context inference. Supports thinking and non-thinking modes. Legacy API aliases deepseek-chat and deepseek-reasoner map to this model's non-thinking and thinking modes respectively. Pricing: $0.14/1M input, $0.28/1M output (cache hit: $0.0028/1M input). MIT licensed. | 79% | huggingface.co |
| 17 | Xiaomi MiMo-V2.5-Pro Xiaomi's April 22, 2026 public-beta flagship in the MiMo-V2.5 series. The official Xiaomi MiMo page describes MiMo-V2.5-Pro as its most capable model to date, focused on general agentic capability, complex software engineering, long-horizon tasks, and ultra-long-context instruction following. OpenRouter lists it as text-to-text with 1,048,576 token context, 131,072 max completion tokens, reasoning controls, tool use, and response_format support. Xiaomi says the V2.5 series will be open-sourced soon, but no public weights/license were verified at research time. | 78.9% | huggingface.co |
| 18 | Qwen3-Max Alibaba's Qwen3-Max, flagship model with improved multilingual and reasoning capabilities. | 78.8% | swebench.com |
| 19 | Qwen3.6-Plus Qwen3.6-Plus is Alibaba Cloud's GA Qwen3.6 flagship for long-context reasoning, coding, tool use, and multimodal workflows. DashScope lists it with a 1M-token context window, structured output support, and standard public token pricing. | 78.8% | benchlm.ai |
| 20 | Qwen3.6 Max Preview Qwen3.6-Max-Preview is a proprietary frontier model from Alibaba Cloud built on a sparse MoE architecture, available for preview as part of the Qwen3.6 series. | 78.8% | datalearner.com |
| 21 | Gemini 3 Flash (Preview) Gemini 3 Flash is Google's speed-optimized Gemini 3 model, available in public preview via the Gemini API and Vertex AI. It supports text, image, audio, and video inputs with a 1M token context window and is priced at $0.50 per 1M input tokens and $3.00 per 1M output tokens. | 78% | deepmind.google |
| 22 | Gemini 3.5 Flash Gemini 3.5 Flash is Google DeepMind's generally available Flash model for sustained frontier-level performance on agentic and coding tasks. It supports multimodal inputs, native thinking, tool and function calling, structured outputs, code execution, search grounding, batch processing, and long contexts up to 1M tokens. | 78% | techjacksolutions.com |
| 23 | GLM-5 Flagship open-weight foundation model from Zhipu AI with 744B parameters (40B active per token) in Mixture of Experts architecture. Trained on 28.5T tokens using DeepSeek Sparse Attention on Huawei Ascend hardware. Achieves state-of-the-art performance on coding and agentic benchmarks (SWE-bench Verified: 77.8%). Supports autonomous planning, multi-step tool use, and self-correction. | 77.8% | swebench.com |
| 24 | Mistral Medium 3.5 Mistral Medium 3.5 is Mistral AI's first flagship merged model, combining instruction-following, reasoning, coding, and vision in one dense 128B model. It supports configurable reasoning effort, text and image input, native function calling, JSON output, and a 256K context window. Released as open weights under Mistral's Modified MIT license, it can be self-hosted on as few as four H100/H200 GPUs and scores 77.6% on SWE-bench Verified. | 77.6% | mistral.ai |
| 25 | Muse Spark Muse Spark is the first model in Meta's Muse family, developed by Meta Superintelligence Labs (MSL). It is a natively multimodal reasoning model with capabilities including tool-use, visual chain-of-thought reasoning, and multi-agent orchestration. Muse Spark achieves 58% on Humanity's Last Exam and 38% on FrontierScience Research benchmarks, while being competitive with Llama 4 Maverick at over 10x less compute. Available via meta.ai and the Meta AI app; private API preview only — not open-source. | 77.4% | benchlm.ai |
| 26 | Qwen3.6-27B Open-weight dense Qwen3.6 27B model with native multimodal support across text, image, and video. Apache 2.0. | 77.2% | benchlm.ai |
| 27 | Claude Sonnet 4.5 Claude Sonnet 4.5 is Anthropic's Claude 4.5 model with multimodal text and image input and an optional reasoning mode. It offers a 200K-token context window and scores 86 on MMLU PRO. | 77.2% | anthropic.com |
| 28 | DeepSeek V3 0324 DeepSeek: DeepSeek V3 0324 available via OpenRouter. Pricing: $0.2/1M input, $0.77/1M output. | 76.8% | swebench.com |
| 29 | Kimi K2.5 Kimi K2.5 is Moonshot AI's Kimi model focused on code generation and software engineering. It offers a 256K-token context window and scores 87.9 on GPQA. | 76.8% | github.com |
| 30 | Grok 4.20 Grok 4.20 is xAI's February 2026 Grok 4-series model, first previewed under the informal Grok 4.2 beta label. Standard API variants launched around March 10, 2026 as grok-4.20-0309-reasoning and grok-4.20-0309-non-reasoning with a 1M context window. | 76.7% | benchlm.ai |
| 31 | ByteDance Doubao Seed 2.0 Pro Doubao Seed 2.0 Pro is ByteDance's flagship frontier model released February 14, 2026 via Volcano Engine. Achieves 88.9 on GPQA Diamond, 87.0 on MMLU-Pro, 87.8 on LiveCodeBench v6, and 76.5 on SWE-bench Verified. Priced at $0.47/M input and $2.37/M output, making it 3–6× cheaper than comparable frontier models. | 76.5% | digitalapplied.com |
| 32 | ByteDance Doubao Seed 2.0 Code Doubao Seed 2.0 Code is ByteDance's coding-specialized frontier model released February 14, 2026 via Volcano Engine. Achieves 87.8 on LiveCodeBench v6 and 76.5 on SWE-bench Verified. Tuned for software development tasks including code generation, debugging, repository-level understanding, and automated code review. | 76.5% | digitalapplied.com |
| 33 | Qwen3.5-Plus Qwen3.5-Plus is the flagship commercial API model of the Qwen3.5 native vision-language series, delivering outstanding performance comparable to state-of-the-art models with significant leaps in both pure-text and multimodal capabilities compared to the Qwen3 series. | 76.4% | huggingface.co |
| 34 | Qwen3.5-397B-A17B Alibaba's largest Qwen3.5 model, featuring a Mixture-of-Experts architecture with 397B total parameters and 17B active per token (using 512 total experts with 10 routed + 1 shared active). Supports 201 languages with a native 262K token context window extensible to 1M tokens via YaRN. Includes a thinking/reasoning mode, tool calling with MCP integration, and unified vision-language capabilities through early fusion training. | 76.2% | benchlm.ai |
| 35 | Gemini 3 Pro Google DeepMind's most advanced reasoning Gemini model. Part of the Gemini 3 series with frontier-class intelligence, multimodal understanding, and 1M token context window. | 76.2% | deepmind.google |
| 36 | Antigravity Agent (Preview) Antigravity Agent is Google DeepMind's preview managed agent for autonomous coding and browsing workflows. Powered by Gemini 3.5 Flash, it plans, reasons, runs code, manages files, and browses the web inside a secure Google-hosted Linux sandbox through the Interactions API. It accepts text and image input, has a 1,048,576-token input context window that compacts at about 135K tokens, and supports a 65,536-token output limit. Environment compute is not billed during preview; Google describes pricing as pay-as-you-go based on underlying Gemini model tokens and tool use. | 76.2% | ai.google.dev |
| 37 | GPT-5 OpenAI's previous intelligent reasoning model with configurable reasoning effort. Released August 2025. Supports minimal, low, medium, and high reasoning levels. Succeeded by GPT-5.1 and later models. | 74.9% | openai.com |
| 38 | Hunyuan Hy3 Preview Tencent HunYuan's Hy3 Preview is a high-efficiency Mixture-of-Experts language model for agentic and production workflows. OpenRouter lists tencent/hy3-preview as released Apr 22, 2026 with 262,144 context, preview pricing, reasoning controls, and tool-use support. Hugging Face Transformers documents Hy3-preview as a Tencent HunYuan MoE model with a dense-MoE hybrid architecture, 192 routed experts, and one always-active shared expert per MoE layer. SCMP's Apr 23 coverage reports Tencent described HY3-Preview as a new flagship model developed by the HunYuan and Yuanbao teams with 295B parameters. Treat release metadata as high confidence for existence/context and medium confidence for exact parameter count until Tencent publishes a primary technical card. | 74.4% | github.com |
| 39 | Step 3.5 Flash Step 3.5 Flash is StepFun's most capable open-source foundation model. Its sparse Mixture of Experts architecture activates only 11B of 196B total parameters per token, making it highly efficient at long contexts. Achieves 60.47% on SWE-Bench Verified and 56.2% on SWE-Pro, competing with frontier closed models on agentic coding benchmarks. Supports 256K token context. | 74.4% | static.stepfun.com |
| 40 | Ring-2.6-1T Ring-2.6-1T is InclusionAI's MIT-licensed trillion-parameter MoE reasoning model for agent workflows, engineering tasks, scientific analysis, and enterprise automation. It supports high and xhigh reasoning effort modes and entered OpenRouter's Programming top 10 in the 2026-05-18 audit. | 74% | huggingface.co |
| 41 | ByteDance Doubao Seed 2.0 Lite Doubao Seed 2.0 Lite is ByteDance's general-purpose frontier model released February 14, 2026. Achieves 85.1 on GPQA Diamond and 87.7 on MMLU-Pro, matching or exceeding many larger models. Optimized for production workloads requiring strong general reasoning at moderate cost. | 73.5% | digitalapplied.com |
| 42 | MAI-Thinking-1 MAI-Thinking-1 is Microsoft AI's flagship reasoning model, built from scratch on enterprise-grade commercially licensed data without third-party distillation. The sparse mixture-of-experts model activates about 35B parameters from roughly 1T total parameters, supports a 256K-token context window, and targets frontier reasoning and software engineering work at a mid-weight price point. Microsoft reports 97% on AIME 2025, 94.5% on AIME 2026, 84.2% on GPQA Diamond, 87.7% on LiveCodeBench v6, 73.5% on SWE-bench Verified, and 52.8% on SWE-bench Pro. In a 1,276-task Surge blind side-by-side evaluation, it narrowly beat Claude Sonnet 4.6 but trailed Claude Opus 4.6. It supports function calling and developer instructions through the Chat Completions API. | 73.5% | microsoft.ai |
| 43 | Seed 2.0 Lite ByteDance mid-tier Seed 2.0 agent model balancing strong performance with cost efficiency. Supports multimodal input (text, image, video), tool calling, and a 4-level reasoning effort system. AIME 2025: 93, Codeforces: 2233. Max output: 131K tokens. | 73.5% | help.apiyi.com |
| 44 | Qwen3.6-35B-A3B Qwen3.6-35B-A3B is an open-weight multimodal MoE model with 35B total parameters and 3B activated per token, released April 2026. It features a hybrid architecture combining Gated DeltaNet linear attention and standard Gated Attention with 256 total experts (8 routed + 1 shared), and includes a vision encoder for image and video understanding. Optimized for agentic coding, long-context reasoning, and visual tasks; supports 256K native context (extensible to ~1M via YaRN) with integrated thinking mode for multi-turn agent interactions. | 73.4% | huggingface.co |
| 45 | Claude Haiku 4.5 Claude Haiku 4.5 is Anthropic's Claude 4.5 model with multimodal text and image input. It offers a 200K-token context window and scores 73.3 on SWE-bench Verified. | 73.3% | swebench.com |
| 46 | Qwen3.5-27B Open-weight dense 27B multimodal Qwen3.5 model with native vision-language support. Apache 2.0. | 72.4% | huggingface.co |
| 47 | Qwen3.5-122B-A10B Open-weight MoE Qwen3.5 model with 122B total and 10B active parameters. Apache 2.0. | 72% | huggingface.co |
| 48 | o3 OpenAI o3 reasoning model with advanced multi-step problem-solving capabilities. | 71.7% | openai.com |
| 49 | GPT-5.4 GPT-5.4 is OpenAI's flagship frontier reasoning model, released March 5, 2026. It incorporates advances from GPT-5.3-Codex for coding and agentic workflows, and adds 'Thinking' mode with editable reasoning plans. Key capabilities include computer use (navigating interfaces via Playwright), image understanding and generation integration, full-stack web app generation, tool calling, and deep research. Knowledge cutoff is August 31, 2025. Model ID: gpt-5.4. | 71.7% | artificialanalysis.ai |
| 50 | MAI-Code-1-Flash MAI-Code-1-Flash is Microsoft AI's lightweight agentic coding model built directly inside GitHub Copilot's production harness. It is designed for fast everyday developer workflows, adaptive thinking by task complexity, multi-turn instruction following, and token-efficient coding. Microsoft reports 51.2% on SWE-bench Pro versus 35.2% for Claude Haiku 4.5 in the same Copilot harness, plus stronger results on SWE-bench Verified, SWE-bench Multilingual, and Terminal-Bench 2.0 without publishing exact scores for those secondary benchmarks. | 71.6% | llm-stats.com |
| 51 | Qwen3-Coder-Next Qwen3-Coder-Next is an ultra-sparse Mixture-of-Experts coding agent model from Alibaba's Qwen team, released February 3, 2026 under Apache 2.0. It has 80B total parameters with 3B active at inference, delivering substantially higher throughput than comparable dense models. It supports a native 256K context window, function calling, structured outputs, Claude Code, Qwen Code, Cline, Kilo, and other scaffold templates. Benchmarks reported in the DAT-3724 datapack include SWE-Bench Pro 44.3%, SWE-Bench Resolved 70.6%, and TerminalBench 2 36.2%. | 71.3% | arxiv.org |
| 52 | Grok Build 0.1 Grok Build 0.1 is xAI's agentic coding model for the Grok Build CLI and public API beta. It launched for the CLI on May 14, 2026, and for public API beta access on May 29, 2026. It replaces grok-code-fast-1 for current coding-agent workflows, supports text and image input, tool calling, structured outputs, always-on reasoning, prompt caching, and a 256K-token context window; xAI self-reports 70.8% on SWE-bench Verified using its internal harness. | 70.8% | x.ai |
| 53 | Claude 3.7 Sonnet Claude 3.7 Sonnet is Anthropic's advanced model with extended thinking capabilities, offering state-of-the-art reasoning for complex tasks. | 70.3% | anthropic.com |
| 54 | Qwen3.5-35B-A3B Alibaba's Qwen3.5-35B-A3B is a Mixture-of-Experts model released February 24, 2026, with 35B total parameters and 3B active during inference. Part of the Qwen3.5 series with a 262K native context window (extendable to ~1M tokens). Optimized for high inference throughput (78+ tokens/second on NVIDIA hardware). Open-source under Apache 2.0. | 69.2% | huggingface.co |
| 55 | DeepSeek V3.2 Exp DeepSeek: DeepSeek V3.2 Exp available via OpenRouter. Pricing: $0.27/1M input, $0.41/1M output. | 68.4% | swebench.com |
| 56 | ByteDance Doubao Seed 2.0 Mini Doubao Seed 2.0 Mini is ByteDance's compact frontier model released February 14, 2026. Achieves 79.0 on GPQA Diamond and 83.6 on MMLU-Pro. Designed for high-throughput batch processing where cost and latency are critical. | 67.9% | seed.bytedance.com |
| 57 | DeepSeek V3.1 Enhanced reasoning and grounded retrieval model from DeepSeek with multimodal text and image understanding. | 66% | swebench.com |
| 58 | Gemini 2.5 Pro Google DeepMind's most capable Gemini 2.5 model with native thinking/reasoning support. Features a 1M-token context window, multimodal inputs (text, image, audio, video), function calling, and strong performance across coding, mathematics, and scientific reasoning tasks. | 63.8% | blog.google |
| 59 | North Mini Code 1.0 A 30B-parameter sparse mixture-of-experts (3B active parameters) code generation model from Cohere, released as open weights under Apache 2.0. Designed for agentic software engineering, code generation, and terminal-based tasks. Supports a 256K token context window with 64K maximum output length, interleaved thinking, and tool use via JSON schema chat templates. The inaugural model in Cohere's North family. | 61% | huggingface.co |
| 60 | Nemotron 3 Super-120B-A12B NVIDIA Nemotron 3 Super-120B-A12B is a 120B total / 12B active hybrid Latent MoE model with interleaved Mamba-2 and MoE layers for agentic, reasoning, and conversational tasks. Fireworks lists the NVFP4 variant for on-demand deployment with 262k context. | 60.5% | llm-stats.com |
| 61 | DeepSeek R1 0528 DeepSeek R1 0528 is DeepSeek's DeepSeek R1 model with an optional reasoning mode. It offers a 130K-token context window with weights openly available for self-hosting and scores 81 on GPQA. | 57.6% | github.com |
| 62 | GPT-4.1 OpenAI's GPT-4.1 model released April 2025, excelling at coding tasks, precise instruction following, and web development. Outperforms GPT-4o in these areas with a 1 million token context window. Available via API and in ChatGPT for Plus, Pro, Team, Enterprise, and Edu users. | 54.6% | openai.com |
| 63 | Nemotron-Cascade-2-30B-A3B 30B MoE model with 3B active parameters - superior reasoning with IMO/IOI 2025 gold-medal performance | 50.2% | arxiv.org |
| 64 | DeepSeek R1 DeepSeek R1: Reasoning-optimized model with extended thinking capabilities. 128K context. | 49.2% | arxiv.org |
| 65 | Upstage Solar Pro 3 Upstage Solar Pro 3 is Upstage's Solar Pro model. It offers a 200K-token context window. | 28.6% | upstage.ai |
| 66 | GPT-4.1 Mini Fast and efficient small model from OpenAI replacing GPT-4o mini. Released April 2025 alongside GPT-4.1. Shows improvements in instruction-following, coding, and intelligence with a 1 million token context window. Available in ChatGPT for paid users. | 23.6% | openai.com |
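To weigh the score gap between the top two rows against their list prices, the arithmetic can be sketched as below. Scores and per-1M-token prices are copied from the table rows above; the `run_cost` helper and the workload size (2M input tokens, 200K output tokens) are hypothetical, and the estimate ignores prompt caching, batch discounts, and any long-context surcharges.

```python
# Scores (percent) and first-party list prices (USD per 1M tokens) from the table above.
MODELS = {
    "Claude Fable 5": {"score": 96.0, "in_per_m": 10.00, "out_per_m": 50.00},
    "Claude Opus 4.8": {"score": 88.6, "in_per_m": 5.00, "out_per_m": 25.00},
}


def run_cost(model, input_tokens, output_tokens):
    """USD cost of one run at list price (no caching or batch discounts)."""
    price = MODELS[model]
    return (
        (input_tokens / 1_000_000) * price["in_per_m"]
        + (output_tokens / 1_000_000) * price["out_per_m"]
    )


# Gap between the two SWE-bench Verified scores, in percentage points.
gap = round(MODELS["Claude Fable 5"]["score"] - MODELS["Claude Opus 4.8"]["score"], 1)
print(f"gap: {gap} points")

# Hypothetical agent run: 2M input tokens, 200K output tokens.
for name in MODELS:
    print(f"{name}: ${run_cost(name, 2_000_000, 200_000):.2f}")
```

Under these assumptions the higher-scoring model costs twice as much per run, so the 7.4-point gap has to be judged against the workload's tolerance for failed tasks.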
tau-bench: multi-turn tool use in retail and airline customer-service tasks.
| # | Model | Score | Source |
|---|---|---|---|
| 1 | Ring-2.6-1T Ring-2.6-1T is InclusionAI's MIT-licensed trillion-parameter MoE reasoning model for agent workflows, engineering tasks, scientific analysis, and enterprise automation. It supports high and xhigh reasoning effort modes and entered OpenRouter's Programming top 10 in the 2026-05-18 audit. | 95.3% | huggingface.co |
| 2 | Mistral Medium 3.5 Mistral Medium 3.5 is Mistral AI's first flagship merged model, combining instruction-following, reasoning, coding, and vision in one dense 128B model. It supports configurable reasoning effort, text and image input, native function calling, JSON output, and a 256K context window. Released as open weights under Mistral's Modified MIT license, it can be self-hosted on as few as four H100/H200 GPUs and scores 77.6% on SWE-bench Verified. | 91.4% | huggingface.co |
| 3 | ByteDance Doubao Seed 2.0 Pro Doubao Seed 2.0 Pro is ByteDance's flagship frontier model released February 14, 2026 via Volcano Engine. Achieves 88.9 on GPQA Diamond, 87.0 on MMLU-Pro, 87.8 on LiveCodeBench v6, and 76.5 on SWE-bench Verified. Priced at $0.47/M input and $2.37/M output, making it 3–6× cheaper than comparable frontier models. | 90.4% | digitalapplied.com |
| 4 | LFM2.5 8B A1B LFM2.5-8B-A1B is Liquid AI's latest on-device mixture-of-experts model, succeeding LFM2-8B-A1B. It has 8.3B total parameters with approximately 1.5B active per token (the A1B label uses a rounded ~1B figure). The architecture combines 18 double-gated LIV convolutional layers with 6 GQA attention layers, trained on 38 trillion tokens. The context window expands to 128K tokens (up from 32K in the predecessor). It is a reasoning model that generates explicit chain-of-thought steps before producing its final answer, making reasoning tokens cheap due to the MoE design. Strong tool-calling, function-calling, and instruction-following capabilities make it well-suited for agentic workflows on edge hardware. Weights are openly available on Hugging Face under the lfm1.0 license. | 88.1% | marktechpost.com |
| 5 | Claude Sonnet 4.6 Claude Sonnet 4.6 is Anthropic's best combination of speed and intelligence. Proprietary decoder-only model with 1M-token context, 64K max output, multimodal vision, extended thinking, and function calling. Available via Anthropic API, AWS Bedrock, GCP Vertex AI, and OpenRouter at $3/1M input and $15/1M output tokens. | 87.5% | taubench.com |
| 6 | Qwen3.5-397B-A17B Alibaba's largest Qwen3.5 model, featuring a Mixture-of-Experts architecture with 397B total parameters and 17B active per token (using 512 total experts with 10 routed + 1 shared active). Supports 201 languages with a native 262K token context window extensible to 1M tokens via YaRN. Includes a thinking/reasoning mode, tool calling with MCP integration, and unified vision-language capabilities through early fusion training. | 86.7% | huggingface.co |
| 7 | Command A+ Command A+ is Cohere's open-weight sparse mixture-of-experts model for enterprise agentic, multimodal, multilingual, RAG, and reasoning-heavy workloads. It combines text and image inputs, tool use, structured outputs, 48-language support, and hardware-efficient deployment that can run on one B200 or two H100 GPUs. | 85% | cohere.com |
| 8 | Claude Opus 4.6 Claude Opus 4.6 is Anthropic's Claude 4.6 model with multimodal text and image input and an optional reasoning mode. It offers a 1M-token context window and scores 80.8 on SWE-bench Verified. | 84.8% | benchlm.ai |
| 9 | GLM-5 Flagship open-weight foundation model from Zhipu AI with 744B parameters (40B active per token) in Mixture of Experts architecture. Trained on 28.5T tokens using DeepSeek Sparse Attention on Huawei Ascend hardware. Achieves state-of-the-art performance on coding and agentic benchmarks (SWE-bench Verified: 77.8%). Supports autonomous planning, multi-step tool use, and self-correction. | 82.1% | taubench.com |
| 10 | Qwen3.5-35B-A3B Alibaba's Qwen3.5-35B-A3B is a Mixture-of-Experts model released February 24, 2026, with 35B total parameters and 3B active during inference. Part of the Qwen3.5 series with a 262K native context window (extendable to ~1M tokens). Optimized for high inference throughput (78+ tokens/second on NVIDIA hardware). Open-source under Apache 2.0. | 81.2% | huggingface.co |
| 11 | Qwen3.5-122B-A10B Open-weight MoE Qwen3.5 model with 122B total and 10B active parameters. Apache 2.0. | 79.5% | huggingface.co |
| 12 | Qwen3.5-9B Qwen3.5-9B is Alibaba's Qwen3.5 model with multimodal text and image input. It offers a 256K-token context window with weights openly available for self-hosting. | 79.1% | huggingface.co |
| 13 | Qwen3.5-27B Open-weight dense 27B multimodal Qwen3.5 model with native vision-language support. Apache 2.0. | 79% | huggingface.co |
| 14 | Grok 4.20 Grok 4.20 is xAI's February 2026 Grok 4-series model, first previewed under the informal Grok 4.2 beta label. Standard API variants launched around March 10, 2026 as grok-4.20-0309-reasoning and grok-4.20-0309-non-reasoning with a 1M context window. | 78.9% | benchlm.ai |
| 15 | GPT-5.4 GPT-5.4 is OpenAI's flagship frontier reasoning model, released March 5, 2026. It incorporates advances from GPT-5.3-Codex for coding and agentic workflows, and adds 'Thinking' mode with editable reasoning plans. Key capabilities include computer use (navigating interfaces via Playwright), image understanding and generation integration, full-stack web app generation, tool calling, and deep research. Knowledge cutoff is August 31, 2025. Model ID: gpt-5.4. | 78.3% | taubench.com |
| 16 | GPT-5.3-Codex Most capable agentic coding model from OpenAI. Optimized for long-horizon, agentic coding tasks in the Codex CLI and API. Note: GPT-5.3-Codex-Spark is a distinct ChatGPT Pro research preview (not API-accessible). | 77.8% | taubench.com |
| 17 | Qwen3-Max Alibaba's Qwen3-Max, flagship model with improved multilingual and reasoning capabilities. | 76.8% | taubench.com |
| 18 | Qwen3.6-Plus Qwen3.6-Plus is Alibaba Cloud's GA Qwen3.6 flagship for long-context reasoning, coding, tool use, and multimodal workflows. DashScope lists it with a 1M-token context window, structured output support, and standard public token pricing. | 76.8% | benchlm.ai |
| 19 | Gemini 3.1 Pro Preview (Preview) Google: Gemini 3.1 Pro Preview available via OpenRouter. Pricing: $2/1M input, $12/1M output. | 76.5% | taubench.com |
| 20 | GPT-5.2 GPT-5.2 is OpenAI's incremental update in the GPT-5 series offering improvements in agentic coding and long-context performance at 128K context. | 75.1% | taubench.com |
| 21 | Kimi K2.5 Kimi K2.5 is Moonshot AI's Kimi model focused on code generation and software engineering. It offers a 256K-token context window and scores 87.9 on GPQA. | 74.2% | taubench.com |
| 22 | Gemini 3 Flash (Preview) Gemini 3 Flash is Google's speed-optimized Gemini 3 model, available in public preview via the Gemini API and Vertex AI. It supports text, image, audio, and video inputs with a 1M token context window and is priced at $0.50 per 1M input tokens and $3.00 per 1M output tokens. | 71.5% | taubench.com |
| 23 | Mistral Large 3 675B Instruct Mistral Large 3 675B Instruct is Mistral AI's Mistral Large model. It offers a 128K-token context window and scores 70.2 on τ-bench. | 70.2% | taubench.com |
| 24 | Llama 4 Maverick 17B Instruct FP8 Meta's Llama 4 Maverick 17B with 128 experts, FP8-optimized for cost-efficient inference. Supports native Model Router integration on Microsoft Foundry. | 68.5% | taubench.com |
| 25 | Mistral Small 4 Mistral Small 4 is a hybrid 119B MoE model unifying instruct, reasoning, and coding capabilities. Features configurable reasoning effort per request and native function calling with JSON output support. | 65.8% | benchlm.ai |
| 26 | Llama 4 Scout 17B-16E Instruct Meta's Llama 4 Scout is a mixture-of-experts model with 17 billion active parameters and 16 experts. Optimized for efficient inference on edge and cloud environments with strong multi-turn conversation capabilities. Available on Cloudflare Workers AI. | 62.3% | taubench.com |
| 27 | Nemotron 3 Super-120B-A12B NVIDIA Nemotron 3 Super-120B-A12B is a 120B total / 12B active hybrid Latent MoE model with interleaved Mamba-2 and MoE layers for agentic, reasoning, and conversational tasks. Fireworks lists the NVFP4 variant for on-demand deployment with 262K context. | 61.1% | huggingface.co |
MultiChallenge: multi-turn instruction retention, memory, editing, and self-coherence.
| # | Model | Score | Source |
|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview (Preview) Google: Gemini 3.1 Pro Preview available via OpenRouter. Pricing: $2/1M input, $12/1M output. | 71.4% | labs.scale.com |
| 2 | GPT-5.4 GPT-5.4 is OpenAI's flagship frontier reasoning model, released March 5, 2026. It incorporates advances from GPT-5.3-Codex for coding and agentic workflows, and adds 'Thinking' mode with editable reasoning plans. Key capabilities include computer use (navigating interfaces via Playwright), image understanding and generation integration, full-stack web app generation, tool calling, and deep research. Knowledge cutoff is August 31, 2025. Model ID: gpt-5.4. | 69.2% | labs.scale.com |
| 3 | Qwen3.5-397B-A17B Alibaba's largest Qwen3.5 model, featuring a Mixture-of-Experts architecture with 397B total parameters and 17B active per token (using 512 total experts with 10 routed + 1 shared active). Supports 201 languages with a native 262K token context window extensible to 1M tokens via YaRN. Includes a thinking/reasoning mode, tool calling with MCP integration, and unified vision-language capabilities through early fusion training. | 67.6% | llm-stats.com |
| 4 | Qwen3.5-122B-A10B Open-weight MoE Qwen3.5 model with 122B total and 10B active parameters. Apache 2.0. | 61.5% | llm-stats.com |
| 5 | Kimi K2.5 Kimi K2.5 is Moonshot AI's Kimi model focused on code generation and software engineering. It offers a 256K-token context window and scores 87.9 on GPQA. | 61.4% | labs.scale.com |
| 6 | Claude Opus 4.7 Claude Opus 4.7 is Anthropic's generally available flagship model with 1M context, 128K max output, adaptive thinking, and a new tokenizer with roughly 555K words per 1M tokens. | 58.6% | labs.scale.com |
| 7 | Claude Sonnet 4.6 Claude Sonnet 4.6 is Anthropic's best combination of speed and intelligence. Proprietary decoder-only model with 1M-token context, 64K max output, multimodal vision, extended thinking, and function calling. Available via Anthropic API, AWS Bedrock, GCP Vertex AI, and OpenRouter at $3/1M input and $15/1M output tokens. | 57.1% | labs.scale.com |
| 8 | MAI-Thinking-1 MAI-Thinking-1 is Microsoft AI's flagship reasoning model, built from scratch on enterprise-grade commercially licensed data without third-party distillation. The sparse mixture-of-experts model activates about 35B parameters from roughly 1T total parameters, supports a 256K-token context window, and targets frontier reasoning and software engineering work at a mid-weight price point. Microsoft reports 97% on AIME 2025, 94.5% on AIME 2026, 84.2% on GPQA Diamond, 87.7% on LiveCodeBench v6, 73.5% on SWE-bench Verified, and 52.8% on SWE-bench Pro. In a 1,276-task Surge blind side-by-side evaluation, it narrowly beat Claude Sonnet 4.6 but trailed Claude Opus 4.6. It supports function calling and developer instructions through the Chat Completions API. | 53% | llm-stats.com |
| 9 | Claude Haiku 4.5 Claude Haiku 4.5 is Anthropic's Claude 4.5 model with multimodal text and image input. It offers a 200K-token context window and scores 73.3 on SWE-bench Verified. | 50.5% | labs.scale.com |
| 10 | Qwen3-Max Alibaba's Qwen3-Max, flagship model with improved multilingual and reasoning capabilities. | 41.2% | labs.scale.com |
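
Several descriptions above quote prices per 1M input and output tokens. As a rough illustration of how those figures translate into the cost of a single agent run, here is a minimal sketch. The prices are copied from the rows on this page; the token counts (200K input, 20K output) and the names `TokenPrice` and `run_cost` are hypothetical, chosen only for the example. It assumes flat per-token pricing and ignores caching discounts and long-context surcharges such as the one mentioned for GPT-5.5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """USD price per 1M tokens, as quoted in the model descriptions."""
    input_per_million: float
    output_per_million: float


# Prices copied from the descriptions on this page.
PRICES = {
    "Gemini 3.1 Pro Preview": TokenPrice(2.00, 12.00),
    "Claude Sonnet 4.6": TokenPrice(3.00, 15.00),
    "Gemini 3 Flash": TokenPrice(0.50, 3.00),
}


def run_cost(price: TokenPrice, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one run under flat per-token pricing."""
    total = (input_tokens * price.input_per_million
             + output_tokens * price.output_per_million)
    return total / 1_000_000


# Hypothetical agent run: 200K input tokens, 20K output tokens.
for name, price in PRICES.items():
    print(f"{name}: ${run_cost(price, 200_000, 20_000):.2f}")
```

Because agent workflows tend to re-read large contexts across many steps, the input side often dominates the bill even when the output price per token is several times higher.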
Next seats in this ranking. The lines below come from each model's stored description in LLMReference seed data; spot-check the model page before relying on a capability claim.
GPT-5.3-Codex: Most capable agentic coding model from OpenAI. Optimized for long-horizon, agentic coding tasks in the Codex CLI and API. Note: GPT-5.3-Codex-Spark is a distinct ChatGPT Pro research preview (not API-accessible).
SWE-bench Verified: 85%
GPT-5.5 is OpenAI's fully retrained agentic model, released April 23, 2026. Optimised for agentic coding, computer use, knowledge work, and early scientific research. Achieves 82.7% on Terminal-Bench 2.0 (Codex CLI scaffold), 84.9% on GDPval, 58.6% on SWE-Bench Pro, 93.6% on GPQA Diamond, and 82.6% on SWE-Bench Verified (Vals.ai independent harness). Knowledge cutoff December 2025. Supports reasoning effort levels (none/low/medium/high/xhigh). Context window 1,050,000 tokens with a long-context surcharge above 272K tokens. Model ID: gpt-5.5.
SWE-bench Verified: 82.6%
Claude Opus 4.5 is Anthropic's Claude 4.5 model with multimodal text and image input and an optional reasoning mode. It offers a 200K-token context window and scores 80.7 on MMMU.
SWE-bench Verified: 80.9%