Eligibility — Chat models tied to code work with current public/self-serve availability: code specialization, tracked code-execution flags, scores on SWE-bench / HumanEval / LiveCodeBench / Aider / BigCodeBench, or known code-family slugs.
Primary ranking — SWE-bench Verified (higher is better), then HumanEval, then SWE-bench Pro.
Benchmark variants — Independent standardized rows such as Vals.ai, Scale AI SWE-bench Pro, and Nebius/OpenHands SWE-bench Verified stay separate from vendor-reported variants. Vendor-only orchestration claims do not promote a model on this page until a comparable public harness publishes the score.
Tie-breaks — Newer `release` date when benchmark scores match.
Variant collapse — We keep one row per model family (`familySlug` + parameter tier). When headline scores tie within ±0.5 pt (±10 Elo on Chatbot Arena), we pick the canonical SKU by lowest tracked input price, then GA over preview or limited access, then newest `release`. A folded sibling within the benchmark noise band can show a "Tied within margin" chip on that score cell.
Preview fallback — The podium prefers GA models. Preview or invite-only candidates can fill this page only when fewer than three GA coding primaries remain after the gate.
Pricing column — Input/output prices are the lowest tracked public commercial rate cards in seed data; partner-only pricing is kept out of the default ranking.
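The Variant collapse rule above can be sketched in a few lines. This is an illustration, not LLMReference's actual implementation: the `Sku` type, its field names, and the reading of "tie within ±0.5 pt" as "within 0.5 points of the best sibling in the family" are all assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Sku:
    # Hypothetical record; field names are illustrative, not the site's schema.
    name: str
    family_slug: str    # stands in for `familySlug`
    tier: str           # parameter tier
    score: float        # headline benchmark score, in points
    input_price: float  # USD per 1M input tokens
    is_ga: bool         # True for GA, False for preview or limited access
    release: date


def pick_canonical(siblings: list[Sku], margin: float = 0.5) -> Sku:
    """Choose one SKU for a family among those within `margin` of the best score."""
    best = max(s.score for s in siblings)
    tied = [s for s in siblings if best - s.score <= margin]
    # Lowest input price first, then GA over preview, then newest release.
    return min(
        tied,
        key=lambda s: (s.input_price, not s.is_ga, -s.release.toordinal()),
    )


def collapse(skus: list[Sku]) -> list[Sku]:
    """Keep one row per (family slug, parameter tier) group."""
    groups: dict[tuple[str, str], list[Sku]] = {}
    for s in skus:
        groups.setdefault((s.family_slug, s.tier), []).append(s)
    return [pick_canonical(group) for group in groups.values()]
```

A sibling that scores more than 0.5 points below the family's best is never picked here, even if it is cheaper; whether the real pipeline behaves that way is not stated on this page.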

New models awaiting benchmark coverage

These source-backed rows qualify for this task page, but they are not scored leaderboard picks until the category benchmark data exists.

Model	Why it is listed	Status	Tracked price
Kimi K2.7-Code HighSpeed (Tools)	Kimi K2.7-Code HighSpeed is a newly researched coding-capable model; keep it on the watchlist until category scores land.	Benchmark pending: no tracked SWE-bench Verified, HumanEval, or SWE-bench Pro score yet.	In $1.90 / Out $8.00
Kimi K2.7-Code (Tools)	Kimi K2.7-Code is a newly researched coding-capable model; keep it on the watchlist until category scores land.	Benchmark pending: no tracked SWE-bench Verified, HumanEval, or SWE-bench Pro score yet.	In $0.61 / Out $3.07
Qwen3.7-Plus (Tools, Code execution)	Qwen3.7-Plus is a newly researched coding-capable model; keep it on the watchlist until category scores land.	Benchmark pending: no tracked SWE-bench Verified, HumanEval, or SWE-bench Pro score yet.	In $0.32 / Out $1.28
Composer 2.5 (Tools, Code execution)	Composer 2.5 is Cursor-native and released 2026-05-18; use it inside Cursor, not as a standalone API.	Benchmark pending: no tracked SWE-bench Verified, HumanEval, or SWE-bench Pro score yet.	In $0.50 / Out $2.50

#	Model	SWE-bench Verified	HumanEval	SWE-bench Pro	Context	Input $/1M	Output $/1M
1	Claude Opus 4.8 (Reasoning, Vision, Tools)	88.6%	—	69.2%	1M	$5.00	$25.00
2	Claude Opus 4.7 (Reasoning, Vision, Tools)	87.6%	—	64.3%	1M	$5.00	$25.00
3	Claude Sonnet 5 (Reasoning, Vision, Tools)	85.2%	—	63.2%	1M	$2.00	$10.00
4	GPT-5.3-Codex (Reasoning, Vision, Tools)	85%	—	56.8%	400k	$1.75	$14.00
5	GPT-5.5 (Reasoning, Vision, Tools)	82.6%	94.2%	58.6%	1.05M	$5.00	$30.00
6	GPT-5.5 Pro (Reasoning, Vision, Tools)	82.6%	—	58.6%	1.05M	$30.00	$180.00
7	Claude Opus 4.5 (Reasoning, Vision, Tools)	80.9%	—	41.8%	200k	$5.00	$25.00
8	Claude Opus 4.6 (Reasoning, Vision, Tools)	80.8%	95%	53.4%	1M	$5.00	$25.00
9	Gemini 3.1 Pro Preview (Preview, Vision, Tools)	80.6%	94%	54.2%	1M	$2.00	$12.00
10	DeepSeek V4 Pro (Reasoning, Tools)	80.6%	76.8%	55.4%	1M	$0.43	$0.87
11	MiniMax M3 (Reasoning, Vision, Tools)	80.5%	—	59%	1M	$0.30	$1.20
12	Qwen3.7-Max (Reasoning, Tools)	80.4%	—	60.6%	1M	$1.25	$3.75
13	Kimi K2.6 (Reasoning, Vision, Tools)	80.2%	92%	58.6%	262k	$0.73	$3.40
14	MiniMax M2.5 Highspeed (Reasoning, Tools)	80.2%	—	—	205k	$0.60	$2.40
15	Claude Sonnet 4.6 (Reasoning, Vision, Tools)	79.6%	98%	—	1M	$3.00	$15.00
16	DeepSeek V4 Flash (Reasoning, Tools)	79%	69.5%	52.6%	1M	$0.10	$0.20
17	Xiaomi MiMo-V2.5-Pro (Tools)	78.9%	—	57.2%	1.05M	$0.43	$0.87
18	Qwen3.6 Max Preview (Preview, Reasoning, Vision, Tools)	78.8%	—	—	256k	$1.04	$6.24
19	Qwen3.6-Plus (Vision, Tools)	78.8%	—	—	1M	$0.33	$1.95
20	Qwen3-Max (Vision, Tools)	78.8%	—	—	262k	$0.78	$3.90

Honorable mentions

Next seats in this ranking. Lines below are from each model's stored description in LLMReference seed data—spot-check the model page before relying on a capability claim.

#4 GPT-5.3-Codex
Most capable agentic coding model from OpenAI. Optimized for long-horizon, agentic coding tasks in the Codex CLI and API. Note: GPT-5.3-Codex-Spark is a distinct ChatGPT Pro research preview (not API-accessible).
SWE-bench Verified: 85%
#5 GPT-5.5
GPT-5.5 is OpenAI's fully retrained agentic model, released April 23, 2026. Optimised for agentic coding, computer use, knowledge work, and early scientific research. Achieves 82.7% on Terminal-Bench 2.0 (Codex CLI scaffold), 84.9% on GDPval, 58.6% on SWE-Bench Pro, 93.6% on GPQA Diamond, and 82.6% on SWE-Bench Verified (Vals.ai independent harness). Knowledge cutoff December 2025. Supports reasoning effort levels (none/low/medium/high/xhigh). Context window 1,050,000 tokens with a long-context surcharge above 272K tokens. Model ID: gpt-5.5.
SWE-bench Verified: 82.6%
#6 GPT-5.5 Pro
GPT-5.5 Pro is OpenAI's premium extra-compute deployment of GPT-5.5, released April 23, 2026. It uses the same underlying weights as GPT-5.5 standard with additional parallel test-time compute for harder tasks. Supports text and image inputs, reasoning effort control, tool use, structured outputs, code execution, a 1,050,000-token context window, and 128K max output. Key datapack rows: Terminal-Bench 2.1 78.2%, SWE-bench Pro 58.6%, GPQA Diamond 93.6%, ARC-AGI-2 high effort 83.3%, BrowseComp Pro compute 90.1%, and FrontierMath Tier 4 39.6%. Official pricing is $30/M input, $180/M output, $10/M batch input, and $45/M batch output; native cached input discount is not listed.
SWE-bench Verified: 82.6%

Compare Top Picks

Side-by-side comparison of the top picks by price, benchmark, and API access.

Claude Opus 4.8 vs Claude Opus 4.7
Claude Opus 4.8 vs Claude Sonnet 5
Claude Opus 4.8 vs GPT-5.3-Codex
Claude Opus 4.8 vs GPT-5.5
Claude Opus 4.7 vs Claude Sonnet 5
Claude Opus 4.7 vs GPT-5.3-Codex

Frequently asked questions

Which LLM is best for code generation?

Claude Opus 4.8 is the current LLMReference top pick for code generation, based on its stored category signal: 88.6% on SWE-bench Verified. Its output pricing starts at $25.00 per 1M tokens. Review the linked model and provider pages before production use because availability and pricing can change.

How does Claude Opus 4.8 compare to Claude Opus 4.7 for code generation?

Claude Opus 4.8 leads Claude Opus 4.7 in the visible shortlist on SWE-bench Verified, 88.6% versus 87.6%. The pricing cards show the same output pricing for both models: it starts at $25.00 per 1M tokens.

How does LLMReference rank LLMs for code generation?

LLMReference ranks LLMs for code generation from stored model, benchmark, freshness, and pricing data. The current methodology summary is: Coding leaders are ordered on shipped coding-agent evidence first, then classic code generation scores, with recency as the last tie-break.
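As a rough sketch of that ordering, the sort below uses SWE-bench Verified first, then HumanEval, then SWE-bench Pro, with the newer release date as the last tie-break. The `Row` type and its field names are hypothetical, and placing a missing score after any real score is an assumption; the page does not say how missing values are treated.

```python
from datetime import date
from typing import NamedTuple, Optional


class Row(NamedTuple):
    # Hypothetical row shape; not the site's actual schema.
    name: str
    swe_verified: Optional[float]
    humaneval: Optional[float]
    swe_pro: Optional[float]
    release: date


def _desc(score: Optional[float]) -> tuple[bool, float]:
    # Higher is better; a missing score sorts after any real score (assumption).
    return (score is None, -(score if score is not None else 0.0))


def rank_key(row: Row):
    return (
        _desc(row.swe_verified),
        _desc(row.humaneval),
        _desc(row.swe_pro),
        -row.release.toordinal(),  # newer release wins the final tie-break
    )


def rank(rows: list[Row]) -> list[Row]:
    return sorted(rows, key=rank_key)
```

Encoding each benchmark as a `(missing, negated score)` pair keeps the whole rule in one sort key, so adding or reordering a tie-break only touches `rank_key`.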

How often is this list updated?

The LLM rankings on this page are updated daily as new benchmark scores, provider availability, and pricing data are tracked. The "as of" date at the top of the page shows the most recent refresh.

How do you decide which models appear in the top 3?

The podium picks are driven by the primary benchmark signal for this category (shown in the Methodology section), filtered to non-deprecated models with confirmed API availability. In ties, we prefer the more recently released model.
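A minimal sketch of that selection, combined with the Preview fallback rule from the methodology, might look like the following. The `Candidate` type and its fields are hypothetical, and seating previews after the GA picks (rather than re-sorting the filled podium by score) is an assumption.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    # Hypothetical record; field names are illustrative.
    name: str
    score: float              # primary benchmark signal for the category
    release: date
    is_ga: bool               # False for preview or invite-only
    deprecated: bool = False
    api_available: bool = True


def pick_podium(candidates: list[Candidate], size: int = 3) -> list[Candidate]:
    # Gate: non-deprecated models with confirmed API availability.
    eligible = [c for c in candidates if not c.deprecated and c.api_available]
    # Higher score first; on a tie, the more recently released model.
    ranked = sorted(eligible, key=lambda c: (-c.score, -c.release.toordinal()))
    ga = [c for c in ranked if c.is_ga]
    if len(ga) >= size:
        return ga[:size]
    # Fewer than `size` GA models remain: previews fill the open seats.
    previews = [c for c in ranked if not c.is_ga]
    return ga + previews[: size - len(ga)]
```

With three or more GA candidates, a preview model never reaches the podium in this sketch, even if it has the highest score.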

Are preview or beta models included?

Preview models appear in the "Watch list" section but are not in the main ranked podium unless the category explicitly allows it (e.g., /best/coding and /best/agents, where preview models often lead benchmarks).

Can I compare two specific models head-to-head?

Yes — use the Compare tool at llmreference.com/compare for a side-by-side breakdown of context window, pricing, benchmarks, and provider availability.

Is the pricing data real-time?

Pricing is tracked from provider documentation and updated regularly. It reflects the best available public data, not live API quotes — always verify before billing.
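For a rough budget check, the per-1M-token prices in the ranking table convert to a request cost as shown below. The function name and token counts are illustrative assumptions, and the estimate ignores long-context surcharges, caching, and batch discounts.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Estimate cost in USD from token counts and prices quoted per 1M tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Claude Opus 4.8 at the tracked $5.00 input / $25.00 output per 1M tokens,
# for a hypothetical workload of 200,000 input and 20,000 output tokens.
print(estimate_cost_usd(200_000, 20_000, 5.00, 25.00))  # prints 1.5
```

The result is only as current as the tracked prices, so it is a planning figure rather than a billing quote.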