Eligibility — Chat/completion models that pass the generic API leaderboard filter (no embeddings/rerankers/modality SKUs).
Primary ranking — GPQA Diamond, then MMLU when GPQA is missing or tied.
Variant collapse — We keep one row per model family (`familySlug` + parameter tier). When headline scores tie within ±0.5 pt (±10 Elo on Chatbot Arena), we pick the canonical SKU by lowest tracked input price, then GA over preview or limited access, then newest `release`. A folded sibling within the benchmark noise band can show a "Tied within margin" chip on that score cell.
Price tie-break — Lower tracked input $/1M wins only after the capability scores are exhausted.
Pricing — Rates reflect tracked provider rows; spot-check enterprise tiers separately.
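
The variant-collapse rule above can be sketched in Python. This is a minimal illustration, not LLMReference's actual implementation: the `Sku` fields, `pick_canonical`, and `collapse_variants` are hypothetical names, and the sketch assumes a "tie" means a score within 0.5 points of the best score in the same family and tier.

```python
from dataclasses import dataclass
from datetime import date
from itertools import groupby

TIE_MARGIN_PTS = 0.5  # headline-score tie band from the methodology text


@dataclass(frozen=True)
class Sku:
    name: str
    family_slug: str
    tier: str            # parameter tier (hypothetical label)
    score: float         # headline benchmark score, percentage points
    input_price: float   # tracked input price, $ per 1M tokens
    is_ga: bool          # True for GA, False for preview or limited access
    release: date


def pick_canonical(skus: list[Sku]) -> Sku:
    """Pick one row for a family/tier group (assumed reading of the rule)."""
    best = max(s.score for s in skus)
    tied = [s for s in skus if best - s.score <= TIE_MARGIN_PTS]
    # Lowest input price, then GA over preview, then newest release.
    return min(
        tied,
        key=lambda s: (s.input_price, not s.is_ga, -s.release.toordinal()),
    )


def collapse_variants(skus: list[Sku]) -> list[Sku]:
    """Keep one canonical row per (family_slug, tier) group."""
    def group_key(s: Sku):
        return (s.family_slug, s.tier)

    grouped = groupby(sorted(skus, key=group_key), key=group_key)
    return [pick_canonical(list(group)) for _, group in grouped]
```

Siblings outside the tie band never displace the best-scoring SKU in this sketch; price only decides among rows that are already within the margin.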


#	Model	Tags	Capability signal	Context	Input $/1M	Output $/1M
1	GPT-5.6 Sol	Reasoning, Vision, Tools	GPQA Diamond 94.6%	1.05M	$5.00	$30.00
2	Gemini 3.1 Pro Preview	Preview, Vision, Tools	GPQA Diamond 94.3%	1M	$2.00	$12.00
3	Claude Opus 4.7	Reasoning, Vision, Tools	GPQA Diamond 94.2%	1M	$5.00	$25.00
4	GPT-5.5	Reasoning, Vision, Tools	GPQA Diamond 93.6%	1.05M	$5.00	$30.00
5	Claude Opus 4.8	Reasoning, Vision, Tools	GPQA Diamond 93.6%	1M	$5.00	$25.00
6	GPT-5.5 Pro	Reasoning, Vision, Tools	GPQA Diamond 93.6%	1.05M	$30.00	$180.00
7	Kimi K3	Reasoning, Vision, Tools	GPQA Diamond 93.5%	1.05M	$3.00	$15.00
8	MiniMax M3	Reasoning, Vision, Tools	GPQA Diamond 92.9%	1M	$0.30	$1.20
9	Qwen3.7-Max	Reasoning, Tools	GPQA Diamond 92.4%	1M	$1.25	$3.75
10	Gemini 3.5 Flash	Reasoning, Vision, Tools	GPQA Diamond 92.2%	1.05M	$1.50	$9.00
11	GPT-5.4	Reasoning, Vision, Tools	GPQA Diamond 92%	1.05M	$2.50	$15.00
12	Gemini 3 Pro	Vision, Tools	GPQA Diamond 91.9%	1M	$1.25	$5.00
13	Claude Opus 4.6	Reasoning, Vision, Tools	GPQA Diamond 91.3%	1M	$5.00	$25.00
14	GLM-5.2	Reasoning, Tools	GPQA Diamond 91.2%	1M	$1.40	$4.40
15	Kimi K2.6	Reasoning, Vision, Tools	GPQA Diamond 90.5%	262K	$0.73	$3.40
16	Qwen3.6-Plus	Vision, Tools	GPQA Diamond 90.4%	1M	$0.33	$1.95
17	Gemini 3 Flash	Preview, Vision, Tools	GPQA Diamond 90.4%	1M	$0.50	$3.00
18	DeepSeek V4 Pro	Reasoning, Tools	GPQA Diamond 90.1%	1M	$0.43	$0.87
19	Grok 4.3	Reasoning, Vision, Tools	GPQA Diamond 90.1%	1M	$1.25	$2.50
20	Claude Sonnet 4.6	Reasoning, Vision, Tools	GPQA Diamond 89.9%	1M	$3.00	$15.00
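
To turn the per-1M rates in the table into a per-request figure, the arithmetic can be sketched as follows. The `request_cost` helper and the token counts are hypothetical; the rates are the tracked values listed for GPT-5.6 Sol and MiniMax M3, and an actual bill may differ (caching, batch discounts, enterprise tiers).

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Cost in dollars for one request at $/1M-token rates."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


# 20,000 input tokens and 2,000 output tokens (illustrative sizes only)
sol = request_cost(20_000, 2_000, input_per_m=5.00, output_per_m=30.00)
m3 = request_cost(20_000, 2_000, input_per_m=0.30, output_per_m=1.20)
print(f"GPT-5.6 Sol: ${sol:.4f}  MiniMax M3: ${m3:.4f}")
```

At these sizes the request comes to about $0.16 at the GPT-5.6 Sol rates and under one cent at the MiniMax M3 rates, which is why output-heavy workloads are worth estimating separately from input-heavy ones.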

Honorable mentions

Next seats in this ranking. The lines below come from each model's stored description in LLMReference seed data; spot-check the model page before relying on a capability claim.

#4 Claude Opus 4.8
Claude Opus 4.8 is Anthropic's flagship Claude 4.8 model, released May 28, 2026 for agentic coding, long-horizon reasoning, computer use, and professional knowledge work. It supports text and image inputs, adaptive reasoning, tool use, structured outputs, computer-use tools, prompt caching, Batch API, Dynamic Workflows parallel subagents, a 1M-token context window on Anthropic API/Bedrock/Vertex, and 128K max output. Key datapack rows: SWE-bench Pro 69.2%, SWE-bench Verified 88.6%, Terminal-Bench 2.1 74.6%, HLE with tools 57.9%, OSWorld-Verified 83.4%, GDPval-AA 1890 Elo, and MCP-Atlas 82.2%. Standard Anthropic API pricing is $5/M input and $25/M output.
GPQA Diamond: 93.6%
#5 GPT-5.5 Pro
GPT-5.5 Pro is OpenAI's premium extra-compute deployment of GPT-5.5, released April 23, 2026. It uses the same underlying weights as GPT-5.5 standard with additional parallel test-time compute for harder tasks. Supports text and image inputs, reasoning effort control, tool use, structured outputs, code execution, a 1,050,000-token context window, and 128K max output. Key datapack rows: Terminal-Bench 2.1 78.2%, SWE-bench Pro 58.6%, GPQA Diamond 93.6%, ARC-AGI-2 high effort 83.3%, BrowseComp Pro compute 90.1%, and FrontierMath Tier 4 39.6%. Official pricing is $30/M input, $180/M output, $10/M batch input, and $45/M batch output; native cached input discount is not listed.
GPQA Diamond: 93.6%
#6 Kimi K3
Kimi K3 is Moonshot AI's 2.8-trillion-parameter flagship multimodal model for long-horizon coding, knowledge work, deep reasoning, and agentic workflows. It uses Kimi Delta Attention, Attention Residuals, and a sparse MoE design (16 of 896 experts active), supports a 1,048,576-token context window, text/image/video input, always-on reasoning, ToolCalls, strict JSON Schema structured output, automatic context caching, and partial mode through Moonshot's OpenAI-compatible API.
GPQA Diamond: 93.5%

Compare Top Picks

Side-by-side comparison of the top picks by price, benchmark, and API access.

GPT-5.6 Sol vs Gemini 3.1 Pro Preview
GPT-5.6 Sol vs Claude Opus 4.7
GPT-5.6 Sol vs GPT-5.5
GPT-5.6 Sol vs Claude Opus 4.8
Gemini 3.1 Pro Preview vs Claude Opus 4.7
Gemini 3.1 Pro Preview vs GPT-5.5

Frequently asked questions

Which LLM is best for mainstream API use?

GPT-5.6 Sol is the current LLMReference top pick for mainstream API use. The verdict uses the stored category signal GPQA Diamond: 94.6%. Output pricing starts at $30.00 per 1M tokens. Review the linked model and provider pages before production use because availability and pricing can change.

How does GPT-5.6 Sol compare to Claude Opus 4.7 for mainstream API use?

GPT-5.6 Sol leads Claude Opus 4.7 in the visible shortlist on GPQA Diamond: 94.6% versus 94.2%. The pricing cards show output pricing starting at $30.00 per 1M tokens for GPT-5.6 Sol and $25.00 per 1M tokens for Claude Opus 4.7.

How does LLMReference rank LLMs for mainstream API use?

LLMReference ranks LLMs for mainstream API use from stored model, benchmark, freshness, and pricing data. Under the current methodology, mainstream API picks lead with capability: GPQA Diamond first, MMLU as the fallback, then price only as the tie-break.
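
As a rough sketch of that ordering (not LLMReference's actual code), a sort key can put GPQA Diamond first, fall back to MMLU when GPQA is missing or tied, and use tracked input price last. The `Row` fields and `rank_rows` are hypothetical names, and ranking GPQA-scored rows ahead of MMLU-only rows is an assumption the page does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Row:
    name: str
    gpqa: Optional[float]   # GPQA Diamond, percent; None if not tracked
    mmlu: Optional[float]   # MMLU, percent; None if not tracked
    input_price: float      # tracked input price, $ per 1M tokens


def rank_rows(rows: list[Row]) -> list[Row]:
    def key(r: Row):
        return (
            r.gpqa is None,                             # assumption: GPQA-scored rows first
            -(r.gpqa if r.gpqa is not None else 0.0),   # higher GPQA ranks higher
            -(r.mmlu if r.mmlu is not None else 0.0),   # MMLU fills gaps and breaks ties
            r.input_price,                              # price only after scores are exhausted
        )

    return sorted(rows, key=key)


# Scores and prices from the ranking table; MMLU is not shown there, so None.
rows = [
    Row("Kimi K3", 93.5, None, 3.00),
    Row("GPT-5.5 Pro", 93.6, None, 30.00),
    Row("Claude Opus 4.8", 93.6, None, 5.00),
]
print([r.name for r in rank_rows(rows)])
```

With these three rows the two 93.6% models sort ahead of Kimi K3, and the lower input price places Claude Opus 4.8 before GPT-5.5 Pro.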

How often is this list updated?

The LLM rankings on this page are updated daily as new benchmark scores, provider availability, and pricing data are tracked. The "as of" date at the top of the page shows the most recent refresh.

How do you decide which models appear in the top 3?

The podium picks are driven by the primary benchmark signal for this category (shown in the Methodology section), filtered to non-deprecated models with confirmed API availability. When capability scores tie, the model with the lower tracked input $/1M ranks higher.
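
The filter-then-rank step for the podium can be illustrated with a short Python sketch. The `Candidate` fields and the `podium` function are hypothetical names chosen for this example, and tie-breaking is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    primary_score: float   # primary benchmark signal for the category
    deprecated: bool
    api_available: bool    # confirmed API availability


def podium(candidates: list[Candidate], seats: int = 3) -> list[Candidate]:
    """Return the top `seats` eligible candidates by primary score."""
    eligible = [c for c in candidates if not c.deprecated and c.api_available]
    # sorted() is stable, so equal scores keep their input order here.
    return sorted(eligible, key=lambda c: c.primary_score, reverse=True)[:seats]
```

A deprecated or API-unavailable model is dropped before ranking, so it cannot take a podium seat even with the highest score.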

Are preview or beta models included?

Preview models appear in the "Watch list" section but are not in the main ranked podium unless the category explicitly allows it (e.g., /best/coding and /best/agents, where preview models often lead benchmarks).

Can I compare two specific models head-to-head?

Yes — use the Compare tool at llmreference.com/compare for a side-by-side breakdown of context window, pricing, benchmarks, and provider availability.

Is the pricing data real-time?

Pricing is tracked from provider documentation and updated regularly. It reflects the best available public data, not live API quotes — always verify before billing.