Eligibility: A model qualifies if it carries the Reasoning flag or scores above this page's editorial floor on GPQA Diamond.
Primary ranking: GPQA Diamond score (higher is better); ties go to the newer release.
Variant collapse: We keep one row per model family (`familySlug` + parameter tier). When headline scores tie within ±0.5 pt (±10 Elo on Chatbot Arena), we pick the canonical SKU by lowest tracked input price, then GA over preview or limited access, then newest `release`. A folded sibling within the benchmark noise band can show a "Tied within margin" chip on that score cell.
Pricing: Reasoning tiers are often priced separately, so confirm the provider SKU before budgeting.
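
The eligibility, primary-ranking, and variant-collapse rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not LLMReference's actual pipeline: the `Model` fields, the `GPQA_FLOOR` value (the page does not publish its floor), the handling of untracked prices, and the sample models are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

GPQA_FLOOR = 90.0  # placeholder: the page does not publish its editorial floor
TIE_BAND = 0.5     # percentage points, per the variant-collapse rule


@dataclass
class Model:
    name: str
    family_slug: str              # stands in for `familySlug`
    tier: str                     # parameter tier (hypothetical field)
    gpqa: Optional[float]         # GPQA Diamond in percent, None if unscored
    reasoning: bool               # Reasoning flag
    input_price: Optional[float]  # USD per 1M input tokens, None if untracked
    ga: bool                      # True for GA, False for preview or limited access
    release: date


def score(m: Model) -> float:
    # Unscored models sort below every scored model.
    return m.gpqa if m.gpqa is not None else float("-inf")


def eligible(m: Model) -> bool:
    return m.reasoning or score(m) > GPQA_FLOOR


def canonical_key(m: Model):
    # Lowest tracked input price, then GA, then newest release.
    # Assumption: an untracked price sorts after any tracked one.
    price = m.input_price if m.input_price is not None else float("inf")
    return (price, not m.ga, -m.release.toordinal())


def collapse(models):
    # One row per (family, tier). Assumption: only siblings within the
    # tie band of the family's best score compete for the canonical slot.
    groups = {}
    for m in models:
        groups.setdefault((m.family_slug, m.tier), []).append(m)
    rows = []
    for siblings in groups.values():
        best = max(score(s) for s in siblings)
        tied = [s for s in siblings
                if score(s) == best or best - score(s) <= TIE_BAND]
        rows.append(min(tied, key=canonical_key))
    return rows


def rank(models):
    # Filter, collapse variants, then sort by score and release date.
    rows = collapse([m for m in models if eligible(m)])
    return sorted(rows, key=lambda m: (-score(m), -m.release.toordinal()))


# Hypothetical models, not rows from this page.
SAMPLE = [
    Model("Alpha", "alpha", "large", 93.0, True, 5.0, True, date(2026, 1, 1)),
    Model("Alpha Preview", "alpha", "large", 93.3, True, 5.0, False, date(2026, 3, 1)),
    Model("Beta", "beta", "large", 93.0, False, 2.0, True, date(2026, 2, 1)),
    Model("Gamma", "gamma", "small", 80.0, False, 1.0, True, date(2026, 2, 1)),
]

print([m.name for m in rank(SAMPLE)])  # ['Beta', 'Alpha']
```

In the sample, Alpha Preview scores 0.3 pt higher than Alpha, which is inside the tie band, so the GA sibling is kept at the same input price. Beta and Alpha then tie on score and the newer release ranks first. Gamma has no Reasoning flag and falls below the placeholder floor, so it is dropped.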

GPQA Diamond

#	Model	Tags	GPQA Diamond	Context	Input $/1M	Output $/1M
1	Fugu	Reasoning, Vision, Tools	95.5%	1M	n/a	n/a
2	Fugu Ultra	Reasoning, Vision, Tools	95.5%	1M	$5.00	$30.00
3	Claude Mythos Preview	Invite-only, Reasoning, Vision, Tools	94.6%	1M	n/a	n/a
4	Gemini 3.1 Pro Preview	Preview, Vision, Tools	94.3%	1M	$2.00	$12.00
5	Claude Opus 4.7	Reasoning, Vision, Tools	94.2%	1M	$5.00	$25.00
6	Claude Opus 4.8	Reasoning, Vision, Tools	93.6%	1M	$5.00	$25.00
7	GPT-5.5	Reasoning, Vision, Tools	93.6%	1.05M	$5.00	$30.00
8	GPT-5.5 Pro	Reasoning, Vision, Tools	93.6%	1.05M	$30.00	$180.00
9	MiniMax M3	Reasoning, Vision, Tools	92.9%	1M	$0.30	$1.20
10	Qwen3.7-Max	Reasoning, Tools	92.4%	1M	$1.25	$3.75
11	Gemini 3.5 Flash	Reasoning, Vision, Tools	92.2%	1.05M	$1.50	$9.00
12	GPT-5.4	Reasoning, Vision, Tools	92%	1.05M	$2.50	$15.00
13	Gemini 3 Pro	Vision, Tools	91.9%	1M	$1.25	$5.00
14	Qwen3.6-Max	Vision	91.8%	262K	n/a	n/a
15	Claude Opus 4.6	Reasoning, Vision, Tools	91.3%	1M	$5.00	$25.00
16	GLM-5.2	Reasoning, Tools	91.2%	1M	$1.40	$4.40
17	Kimi K2.6	Reasoning, Vision, Tools	90.5%	262K	$0.73	$3.40
18	Qwen3.6-Plus	Vision, Tools	90.4%	1M	$0.33	$1.95
19	Gemini 3 Flash	Preview, Vision, Tools	90.4%	1M	$0.50	$3.00
20	Grok 4.3	Reasoning, Vision, Tools	90.1%	1M	$1.25	$2.50

Honorable mentions

Next seats in this ranking. Lines below are from each model's stored description in LLMReference seed data—spot-check the model page before relying on a capability claim.

#4 GPT-5.5
GPT-5.5 is OpenAI's fully retrained agentic model, released April 23, 2026. Optimised for agentic coding, computer use, knowledge work, and early scientific research. Achieves 82.7% on Terminal-Bench 2.0 (Codex CLI scaffold), 84.9% on GDPval, 58.6% on SWE-Bench Pro, 93.6% on GPQA Diamond, and 82.6% on SWE-Bench Verified (Vals.ai independent harness). Knowledge cutoff December 2025. Supports reasoning effort levels (none/low/medium/high/xhigh). Context window 1,050,000 tokens with a long-context surcharge above 272K tokens. Model ID: gpt-5.5.
GPQA Diamond: 93.6%
#5 GPT-5.5 Pro
GPT-5.5 Pro is OpenAI's premium extra-compute deployment of GPT-5.5, released April 23, 2026. It uses the same underlying weights as GPT-5.5 standard with additional parallel test-time compute for harder tasks. Supports text and image inputs, reasoning effort control, tool use, structured outputs, code execution, a 1,050,000-token context window, and 128K max output. Key datapack rows: Terminal-Bench 2.1 78.2%, SWE-bench Pro 58.6%, GPQA Diamond 93.6%, ARC-AGI-2 high effort 83.3%, BrowseComp Pro compute 90.1%, and FrontierMath Tier 4 39.6%. Official pricing is $30/M input, $180/M output, $10/M batch input, and $45/M batch output; native cached input discount is not listed.
GPQA Diamond: 93.6%
#6 MiniMax M3
MiniMax M3 is MiniMax's current API flagship (released June 1, 2026) with MiniMax Sparse Attention for economical 1M-token context, native multimodality, and agentic coding. Benchmark rows in LLMReference separate vendor-reported SWE-bench Pro, Terminal-Bench 2.1, MCP-Atlas, and BrowseComp scores (source: minimax.io) from third-party rows — inspect evaluator, variant, and source on each benchmark cell before comparing to leaderboard claims.
GPQA Diamond: 92.9%

Compare Top Picks

Side-by-side comparison of the top picks by price, benchmark, and API access.

Fugu vs Claude Mythos Preview
Fugu vs Gemini 3.1 Pro Preview
Fugu vs Claude Opus 4.7
Fugu Ultra vs Claude Mythos Preview
Fugu Ultra vs Gemini 3.1 Pro Preview
Fugu Ultra vs Claude Opus 4.7

Frequently asked questions

Which LLM is best for reasoning and math?

Fugu Ultra is the current LLMReference top pick for reasoning and math. The verdict rests on the stored category signal, GPQA Diamond, where Fugu Ultra scores 95.5%. Its output pricing starts at $30.00 per 1M tokens. Review the linked model and provider pages before production use because availability and pricing can change.

How does Fugu Ultra compare to Claude Opus 4.7 for reasoning and math?

Fugu Ultra leads Claude Opus 4.7 on GPQA Diamond in the visible shortlist: 95.5% versus 94.2%. Output pricing starts at $30.00 per 1M tokens for Fugu Ultra and $25.00 per 1M tokens for Claude Opus 4.7.
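
To turn those per-token prices into a rough budget, the arithmetic can be sketched as below. The prices are the Input and Output $/1M figures from the table on this page; the workload (2M input tokens, 500K output tokens) and the `estimate_cost` helper are hypothetical, and the sketch ignores long-context surcharges, caching, batch discounts, and separately priced reasoning tiers.

```python
# USD per 1M tokens as (input, output), copied from the table on this page.
PRICES = {
    "Fugu Ultra": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Flat-rate estimate in USD: tokens / 1M * price, summed over both directions."""
    price_in, price_out = PRICES[model]
    return (input_tokens / 1_000_000 * price_in
            + output_tokens / 1_000_000 * price_out)


# Hypothetical workload: 2M input tokens and 500K output tokens.
for name in PRICES:
    print(f"{name}: ${estimate_cost(name, 2_000_000, 500_000):.2f}")
# Fugu Ultra: $25.00
# Claude Opus 4.7: $22.50
```

On this workload the two models differ only in output cost, so the gap grows with the share of output tokens.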

How does LLMReference rank LLMs for reasoning and math?

LLMReference ranks LLMs for reasoning and math from stored model, benchmark, freshness, and pricing data. The current methodology summary is: Reasoning boards prioritize GPQA Diamond scores, favoring models explicitly tagged for reasoning or unusually strong GPQA.

How often is this list updated?

The LLM rankings on this page are updated daily as new benchmark scores, provider availability, and pricing data are tracked. The "as of" date at the top of the page shows the most recent refresh.

How do you decide which models appear in the top 3?

The podium picks are driven by the primary benchmark signal for this category (shown in the Methodology section), filtered to non-deprecated models with confirmed API availability. In ties, we prefer the more recently released model.

Are preview or beta models included?

Preview models appear in the "Watch list" section but are not in the main ranked podium unless the category explicitly allows it (e.g., /best/coding and /best/agents, where preview models often lead benchmarks).

Can I compare two specific models head-to-head?

Yes — use the Compare tool at llmreference.com/compare for a side-by-side breakdown of context window, pricing, benchmarks, and provider availability.

Is the pricing data real-time?

Pricing is tracked from provider documentation and updated regularly. It reflects the best available public data, not live API quotes — always verify before billing.