LLM Reference

Best LLMs for Code Generation (2026)

Last refreshed 2026-05-18. Next refresh: weekly.

Compare coding-capable models by sourced software-engineering benchmarks, context window, provider coverage, and tracked token pricing.

Top three picks

Opinionated short stack for this category — scroll for the full leaderboard, pricing, and compare links.

How we rank

Coding leaders are ordered on shipped coding-agent evidence first, then classic code generation scores, with recency as the last tie-break.

  1. Eligibility: Chat models tied to code work with current public/self-serve availability: code specialization, tracked code-execution flags, scores on SWE-bench / HumanEval / LiveCodeBench / Aider / BigCodeBench, or known code-family slugs.
  2. Primary ranking: SWE-bench Verified (higher is better), then HumanEval, then SWE-bench Pro.
  3. Tie-breaks: Newer `release` date when benchmark scores match.
  4. Variant collapse: We keep one row per model family (`familySlug` + parameter tier). When headline scores tie within ±0.5 pt (±10 Elo on Chatbot Arena), we pick the canonical SKU by lowest tracked input price, then GA over preview or limited access, then newest `release`. A folded sibling within the benchmark noise band can show a "Tied within margin" chip on that score cell.
  5. Preview fallback: The podium prefers GA models. Preview or invite-only candidates can fill this page only when fewer than three GA coding primaries remain after the gate.
  6. Pricing column: Input/output prices are the lowest tracked public commercial rate cards in seed data; partner-only pricing is kept out of the default ranking.
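The ordering and variant-collapse rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the site's actual implementation: the `Model` class, its field names, the `rank_key` and `canonical_sku` helpers, and all sample rows are hypothetical. In particular, treating "tie within ±0.5 pt" as "within 0.5 pt of the best sibling" is one possible reading of the rule.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Model:
    # Hypothetical record; field names are illustrative, not the real schema.
    name: str
    family_slug: str
    swe_verified: float                 # SWE-bench Verified, percent
    release: date
    input_price: Optional[float] = None  # USD per 1M input tokens
    humaneval: Optional[float] = None
    swe_pro: Optional[float] = None
    is_ga: bool = True


def _score(value: Optional[float]) -> float:
    # A missing score sorts behind every real score.
    return value if value is not None else float("-inf")


def rank_key(m: Model):
    # SWE-bench Verified, then HumanEval, then SWE-bench Pro (all higher is
    # better), with the newer release date as the final tie-break.
    return (
        -_score(m.swe_verified),
        -_score(m.humaneval),
        -_score(m.swe_pro),
        -m.release.toordinal(),
    )


def canonical_sku(siblings: list, margin: float = 0.5) -> Model:
    # Assumption: siblings within `margin` points of the best sibling count
    # as tied. Among those: lowest input price, then GA, then newest release.
    best = max(s.swe_verified for s in siblings)
    tied = [s for s in siblings if best - s.swe_verified <= margin]
    return min(
        tied,
        key=lambda s: (
            s.input_price if s.input_price is not None else float("inf"),
            not s.is_ga,
            -s.release.toordinal(),
        ),
    )


# Made-up sample rows, for illustration only.
models = [
    Model("alpha", "alpha", 80.6, date(2026, 1, 10), input_price=0.50),
    Model("beta", "beta", 88.7, date(2026, 3, 1), input_price=5.00),
    Model("gamma", "gamma", 80.6, date(2026, 2, 20), input_price=2.00),
]
ranked = sorted(models, key=rank_key)
print([m.name for m in ranked])

family = [
    Model("fam-pro", "fam", 88.7, date(2026, 3, 1), input_price=30.00),
    Model("fam-base", "fam", 88.7, date(2026, 3, 1), input_price=5.00),
]
print(canonical_sku(family).name)
```

In this sample, `alpha` and `gamma` share the same score, so the newer release (`gamma`) ranks ahead, and the cheaper of the two tied siblings (`fam-base`) becomes the canonical row.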
| # | Model | Tags | SWE-bench Verified | Input $/1M | Output $/1M |
|---|---|---|---|---|---|
| 1 | GPT-5.5 | Reasoning, Vision, Tools | 88.7% | $5.00 | $30.00 |
| 2 | GPT-5.5 Pro | Reasoning, Vision, Tools | 88.7% | $30.00 | $180.00 |
| 3 | Claude Opus 4.7 | Reasoning, Vision, Tools | 87.6% | $5.00 | $25.00 |
| 4 | GPT-5.3-Codex | Reasoning, Vision, Tools | 85% | $1.75 | $14.00 |
| 5 | Claude Opus 4.5 | Reasoning, Vision, Tools | 80.9% | $5.00 | $25.00 |
| 6 | Claude Opus 4.6 | Reasoning, Vision, Tools | 80.8% | $5.00 | $25.00 |
| 7 | DeepSeek V4 Pro | Reasoning, Tools | 80.6% | $0.43 | $0.87 |
| 8 | Gemini 3.1 Pro Preview | Preview, Vision, Tools | 80.6% | $2.00 | $12.00 |
| 9 | Kimi K2.6 | Reasoning, Vision, Tools | 80.2% | $0.75 | $3.50 |
| 10 | GPT-5.2 | Reasoning, Vision, Tools | 80% | $1.75 | $14.00 |
| 11 | Claude Sonnet 4.6 | Reasoning, Vision, Tools | 79.6% | $3.00 | $15.00 |
| 12 | DeepSeek V4 Flash | Reasoning, Tools | 79% | $0.14 | $0.28 |
| 13 | Xiaomi MiMo-V2.5-Pro | Tools | 78.9% | $1.00 | $3.00 |
| 14 | Qwen3.6-Plus | Vision, Tools | 78.8% | $0.33 | $1.95 |
| 15 | Qwen3-Max | Vision, Tools | 78.8% | $0.78 | $3.90 |
| 16 | GLM-5 | Reasoning, Tools | 77.8% | $0.60 | $2.08 |
| 17 | Mistral Medium 3.5 | Reasoning, Vision, Tools | 77.6% | $1.50 | $7.50 |
| 18 | Muse Spark | Reasoning, Vision, Tools | 77.4% | | |
| 19 | Qwen3.6-27B | Reasoning, Vision, Tools | 77.2% | $0.32 | $3.20 |
| 20 | Grok 4.20 | Reasoning, Tools | 76.7% | $1.25 | $2.50 |
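To turn the per-million-token prices above into a per-request figure, multiply each token count by its rate and divide by one million. The sketch below uses three rows from the leaderboard; the `request_cost` helper and the workload size (200,000 input tokens, 20,000 output tokens) are assumptions chosen only for illustration.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    # Prices are quoted in USD per 1M tokens, so scale the total by 1e6.
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


# (input $/1M, output $/1M), copied from the leaderboard rows above.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "DeepSeek V4 Flash": (0.14, 0.28),
}

# Hypothetical workload: one agentic coding session.
INPUT_TOKENS = 200_000
OUTPUT_TOKENS = 20_000

costs = {
    name: request_cost(INPUT_TOKENS, OUTPUT_TOKENS, inp, out)
    for name, (inp, out) in PRICES.items()
}
for name, cost in costs.items():
    print(f"{name}: ${cost:.4f}")
```

For this workload the session costs $1.60 on GPT-5.5, $1.50 on Claude Opus 4.7, and about $0.03 on DeepSeek V4 Flash, which shows how output pricing and not only the benchmark score can drive model choice.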

Honorable mentions

Next seats in this ranking. Lines below are from each model's stored description in LLMReference seed data—spot-check the model page before relying on a capability claim.

  • Most capable agentic coding model from OpenAI. Optimized for long-horizon, agentic coding tasks in the Codex CLI and API. Note: GPT-5.3-Codex-Spark is a distinct ChatGPT Pro research preview (not API-accessible).
    SWE-bench Verified: 85%

  • Claude Opus 4.5 available on AWS Bedrock
    SWE-bench Verified: 80.9%

  • Claude Opus 4.6 available on AWS Bedrock
    SWE-bench Verified: 80.8%