Compare AI models

Side-by-side comparison of any two LLMs — GPT vs Claude, Gemini vs DeepSeek, open vs proprietary — on pricing, benchmarks, API availability, context window, and release date.

Sitemap coverage: 4408+ pairs

Decision builder

Pick the pair before opening the detail page

220 selectable models

Claude Opus 4.7 vs Claude Opus 4.8

Pick Claude Opus 4.8 for higher current agentic coding and computer-use confidence; token pricing is tied on tracked $5/1M input and $25/1M output routes, so keep Claude Opus 4.7 only for already-validated prompts or coding workflow support constraints.

0% gap

Output price: $25.00 / $25.00
Context: 1M / 1M
Benchmarks: 5 shared
Providers: 6 / 6
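
The gap badges on this page are consistent with a relative difference between the two listed output prices, measured against the cheaper one. The sketch below is an assumption inferred from the listed figures, not a documented formula, and the function name `price_gap_percent` is hypothetical.

```python
def price_gap_percent(price_a: float, price_b: float) -> int:
    """Relative gap between two output prices, as a whole percent of the cheaper one.

    Assumed formula (inferred from the cards, not documented by the site):
    (higher - lower) / lower * 100, rounded to the nearest integer.
    """
    low, high = sorted((price_a, price_b))
    if low <= 0:
        raise ValueError("prices must be positive")
    return round((high - low) / low * 100)


# Output prices per 1M tokens, taken from cards on this page.
print(price_gap_percent(25.00, 25.00))  # Claude Opus 4.7 vs Claude Opus 4.8
print(price_gap_percent(30.00, 6.00))   # GPT-5.6 Sol vs Grok 4.5
print(price_gap_percent(50.00, 30.00))  # Claude Fable 5 vs GPT-5.6 Sol
```

Under this reading the gap says how much more the pricier model costs, not how much cheaper the other one is: a 400% gap means five times the output price, which is an 80% saving when moving to the cheaper model.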

Popular pairs

Browse comparisons with a decision signal attached

GPT-5.6 Sol vs Grok 4.5

Pick GPT-5.6 Sol when you want OpenAI's July 2026 frontier stack, the larger 1.05M context window, and OpenAI-sourced GA rows such as DeepSWE 1.1 at 72.7% and GPQA Diamond at 94.6%. Pick Grok 4.5 when xAI's lower standard-tier API pricing ($2/$6 per 1M tokens for prompts up to 200K) and Grok Build/Cursor distribution matter more, and run your own acceptance tests because several Grok 4.5 chart scores are xAI first-party only. Do not treat Terra or Luna as the OpenAI flagship in this pair.

400% gap, 8 benchmarks

Output price: $30.00 / $6.00
Context: 1.05M / 500K
Benchmarks: 8 shared
Providers: 2 / 3

Coding, RAG, Agents, Long context
Grok 4.5 leads SWE-bench Pro

Claude Fable 5 vs GPT-5.6 Sol

Pick GPT-5.6 Sol for a fully available OpenAI GA route with 1.05M context, lower standard output pricing at $30/M versus Fable 5's $50/M, and OpenAI launch rows you can cite directly in procurement reviews. Pick Claude Fable 5 when Anthropic's agentic coding evidence, adaptive thinking, and long-horizon workflow positioning outweigh access verification, especially after you confirm live Fable 5 availability on your provider route. Do not reuse GPT-5.6 Ultra multi-agent scores as single-model apples-to-apples results.

67% gap, 7 benchmarks

Output price: $50.00 / $30.00
Context: 1M / 1.05M
Benchmarks: 7 shared
Providers: 8 / 2

Coding, RAG, Agents, Long context
Claude Fable 5 leads SWE-bench Pro

Claude Fable 5 vs Grok 4.5

Pick Grok 4.5 when lower standard-tier API economics ($2/$6 per 1M tokens up to 200K prompt tokens), Grok Build or Cursor distribution, and xAI-first coding launch claims matter most, while accepting first-party-only chart rows for some benchmarks. Pick Claude Fable 5 when Anthropic's stronger sourced agentic coding evidence and adaptive thinking fit the workload and you have verified live Fable 5 access, accepting higher $10/$50 launch pricing and explicit safety-classifier behavior. This is the flagship xAI-versus-Anthropic route; use Sonnet 5 only for balanced-tier product comparisons.

733% gap, 7 benchmarks

Output price: $50.00 / $6.00
Context: 1M / 500K
Benchmarks: 7 shared
Providers: 8 / 3

Coding, RAG, Agents, Long context
Claude Fable 5 leads SWE-bench Pro

Claude Fable 5 vs Claude Sonnet 5

Pick Claude Fable 5 when the workload needs Anthropic's highest generally available capability tier and you can accept refusal/fallback behavior plus the $10/M input and $50/M output launch price once access is restored. Pick Claude Sonnet 5 for cost-sensitive production coding, higher throughput, and mainstream API adoption where Sonnet-tier capability is enough.

400% gap, 10 benchmarks

Output price: $50.00 / $10.00
Context: 1M / 1M
Benchmarks: 10 shared
Providers: 8 / 5

Coding, RAG, Agents, Long context
Claude Fable 5 leads SWE-bench Verified

Claude Sonnet 4.6 vs Claude Sonnet 5

Claude Sonnet 5, listed at $2/1M, is about 33% cheaper on output ($10 versus $15 per 1M tokens, a 50% premium for Claude Sonnet 4.6); pay for Claude Sonnet 4.6 only for coding workflow support.

50% gap, 4 benchmarks

Output price: $15.00 / $10.00
Context: 1M / 1M
Benchmarks: 4 shared
Providers: 6 / 5

Coding, RAG, Agents, Long context
Claude Sonnet 5 leads SWE-bench Verified

Claude Fable 5 vs Claude Opus 4.8

Pick Claude Fable 5 when you need Anthropic's most capable widely released Mythos-class route and can accept documented refusal and fallback behavior after verifying live access. Keep Claude Opus 4.8 when you need the prior Opus behavior profile, fallback predictability, or a route that may still be easier to reason about for some regulated workflows. Treat the June 9 launch, June 12 suspension, and July 1 redeploy timeline as part of the product decision, not a footnote.

100% gap, 10 benchmarks

Output price: $50.00 / $25.00
Context: 1M / 1M
Benchmarks: 10 shared
Providers: 8 / 6

Coding, RAG, Agents, Long context
Claude Fable 5 leads SWE-bench Verified

Claude Fable 5 vs GPT-5.5

On every published agentic coding benchmark, Claude Fable 5 outperforms GPT-5.5 by a wide margin: 80.3% vs 58.6% on SWE-bench Pro (+21.7 pts), 96% vs 82.6% on SWE-bench Verified (Vals.ai), and 85.0% vs 78.7% on OSWorld-Verified computer use. Fable 5 also leads on knowledge-work quality (GDPval-AA ELO: 1932 vs 1769) and agentic legal tasks (13.3% vs 2.1% on the Legal Agent Benchmark). GPT-5.5 counters with a 93.6% GPQA Diamond score, while Fable 5's GPQA score is not published, and it costs notably less at $5/$30 per 1M tokens versus $10/$50: half the input price and 60% of the output price. For pure coding and agentic workflows, Claude Fable 5 is the stronger performer once access is restored. For teams balancing cost with broad reasoning capability, GPT-5.5 is a compelling alternative, especially with output pricing 40% lower.

67% gap, 14 benchmarks

Output price: $50.00 / $30.00
Context: 1M / 1.05M
Benchmarks: 14 shared
Providers: 8 / 4

Coding, RAG, Agents, Long context
Claude Fable 5 leads SWE-bench Verified
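
For the Claude Fable 5 vs GPT-5.5 pair above, the effective price difference depends on the mix of input and output tokens, since the input and output ratios differ. The following is a minimal sketch using the $10/$50 and $5/$30 per 1M token prices quoted in the card; the function name `request_cost_usd` and the token counts are hypothetical and chosen only for illustration.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Cost of one request in USD, given per-1M-token input and output prices."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Hypothetical request shape: 20,000 prompt tokens and 2,000 completion tokens.
fable_cost = request_cost_usd(20_000, 2_000, 10.0, 50.0)
gpt_cost = request_cost_usd(20_000, 2_000, 5.0, 30.0)

print(f"Claude Fable 5: ${fable_cost:.2f}")
print(f"GPT-5.5:        ${gpt_cost:.2f}")
print(f"GPT-5.5 share of Fable 5 cost: {gpt_cost / fable_cost:.0%}")
```

Prompt-heavy requests move the ratio toward the 50% input figure, while generation-heavy requests move it toward the 60% output figure.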

Claude Opus 4.7 vs Claude Opus 4.8

0% gap, 5 benchmarks

Output price: $25.00 / $25.00
Context: 1M / 1M
Benchmarks: 5 shared
Providers: 6 / 6

Coding, RAG, Agents, Long context
Claude Opus 4.8 leads SWE-bench Verified

Claude Opus 4.8 vs Gemini 3.5 Pro

Use Claude Opus 4.8 for production agentic coding today: it has tracked provider routes, pricing, and public rows for SWE-bench Verified, SWE-bench Pro, Terminal-Bench 2.1, and GPQA Diamond. Track Gemini 3.5 Pro for long-context and multimodal workloads where the 2M-token window may beat Opus 4.8's 1M window, but do not budget or migrate production traffic until Google confirms GA pricing and public API details.

No price gap, Benchmark gap

Output price: $25.00 / Unpriced
Context: 1M / 2M
Benchmarks: No shared rows
Providers: 6 / 0

Coding, RAG, Agents, Long context

Gemini 3.5 Pro vs GPT-5.5

Pick GPT-5.5 for production decisions today because it has public pricing, multiple provider routes, and benchmark rows in the local data. Keep Gemini 3.5 Pro on the shortlist when the workload is bottlenecked by context length or Google ecosystem routing, but wait for GA pricing, public provider routes, and independent benchmark evidence before replacing GPT-5.5.

No price gap, Benchmark gap

Output price: Unpriced / $30.00
Context: 2M / 1.05M
Benchmarks: No shared rows
Providers: 0 / 4

Long context, Vision, Coding, RAG

Claude Opus 4.8 vs GPT-5.3-Codex

Pick Claude Opus 4.8 for autonomous repo work, complex multi-file engineering, computer-use agents, and long-context sessions: it leads GPT-5.3-Codex by 12.4 points on SWE-bench Pro and 18.7 points on OSWorld, with 1M context versus 400K. Pick GPT-5.3-Codex for cost-sensitive coding pipelines, OpenAI-native Codex workflows, and terminal automation where its $1.75/M input price and 77.3% Terminal-Bench 2.0 score matter more than the harder agent benchmarks.

79% gap, 2 benchmarks

Output price: $25.00 / $14.00
Context: 1M / 400K
Benchmarks: 2 shared
Providers: 6 / 3

Coding, RAG, Agents, Long context
Claude Opus 4.8 leads SWE-bench Verified

Claude Opus 4.8 vs GPT-5.5

Pick Claude Opus 4.8 for coding; GPT-5.5 is better when coding workflow support matters more.

20% gap, 14 benchmarks

Output price: $25.00 / $30.00
Context: 1M / 1.05M
Benchmarks: 14 shared
Providers: 6 / 4

Coding, RAG, Agents, Long context
Claude Opus 4.8 leads SWE-bench Verified

DeepSeek V4 Pro vs Unisound U2

Pick DeepSeek V4 Pro for production evaluation today: it has sourced context, pricing routes, and stronger public benchmark coverage. Evaluate Unisound U2 when Chinese sovereign routing, Token Hub access, or long-horizon task-execution positioning matters, but treat its GPQA 87.9 and SWE-bench Verified 75.0 rows as low-confidence vendor claims until independent leaderboards confirm them.

No price gap, Benchmark gap

Output price: $0.870 / Unpriced
Context: 1m / —
Benchmarks: No shared rows
Providers: 5 / 1

Coding, RAG, Agents, Long context

Gemini 3.5 Flash vs GPT-5.5

Gemini 3.5 Flash is safer overall; choose GPT-5.5 when coding workflow support matters.

233% gap, 12 benchmarks

Output price: $9.00 / $30.00
Context: 1.05M / 1.05M
Benchmarks: 12 shared
Providers: 4 / 4

Coding, RAG, Agents, Long context
GPT-5.5 leads SWE-bench Verified

DeepSeek V4 Pro vs GLM-5.1

Pick DeepSeek V4 Pro when cost and context length are the bottleneck: its input price is about one-third of GLM-5.1's ($0.44/M versus $1.40/M), it supports a 1M-token window versus 200K, and it leads GPQA Diamond 90.1% versus 86.2%. Pick GLM-5.1 when SWE-bench Pro scores and an in-model code execution sandbox are the priority, or when Chatbot Arena human preference scores matter (1475 versus 1456 on the text arena).

302% gap, 6 benchmarks

Output price: $0.870 / $3.50
Context: 1M / 200K
Benchmarks: 6 shared
Providers: 5 / 5

Coding, RAG, Agents, Long context
GLM-5.1 leads SWE-bench Pro

DeepSeek V4 Pro vs Kimi K2.6

Pick DeepSeek V4 Pro for pure code generation, large-codebase analysis, and the lowest per-token cost before its 75% discount expires on 2026-05-31. Pick Kimi K2.6 when your pipeline processes images, screenshots, PDFs, or spreadsheets, or when you need long agent runs with many sequential tool calls.

301% gap, 12 benchmarks

Output price: $0.870 / $3.49
Context: 1M / 262K
Benchmarks: 12 shared
Providers: 5 / 9

Coding, RAG, Agents, Long context
DeepSeek V4 Pro leads MMLU-Pro

Claude Sonnet 4.6 vs DeepSeek V4 Flash

DeepSeek V4 Flash, listed at $0.09/1M, is about 99% cheaper on output ($0.18 versus $15.00 per 1M tokens); pay for Claude Sonnet 4.6 only for coding workflow support.

8233% gap, 11 benchmarks

Output price: $15.00 / $0.180
Context: 1M / 1M
Benchmarks: 11 shared
Providers: 6 / 5

Coding, RAG, Agents, Long context
Claude Sonnet 4.6 leads MMLU-Pro

Llama 3 70B Instruct vs Llama 3.1 70B Instruct

Pick Llama 3.1 70B Instruct for coding; token pricing is tied, so keep Llama 3 70B Instruct only for already-validated prompts or route constraints.

0% gap, 2 benchmarks

Output price: $0.400 / $0.400
Context: 8K / 128K
Benchmarks: 2 shared
Providers: 18 / 13

Coding, Classification, JSON / Tool use, RAG
Llama 3.1 70B Instruct leads HumanEval

Popular comparisons

Top model matchups by recent search demand

The matchups buyers actually run before committing to a provider for coding, agents, or build automation.

Top 100