LLM Reference

Best Reasoning LLMs (2026)

Last refreshed 2026-05-18. Next refresh: weekly.

Top AI models for complex reasoning, math, and step-by-step problem solving, ranked by sourced reasoning benchmarks and release freshness.

How we rank

Reasoning boards rank primarily by GPQA Diamond score, and admit models that are either explicitly tagged for reasoning or post an unusually strong GPQA Diamond score.

  1. Eligibility: Reasoning flag or GPQA Diamond above the editorial floor used on this page.
  2. Primary ranking: GPQA Diamond (higher is better), then newer release.
  3. Variant collapse: We keep one row per model family (`familySlug` + parameter tier). When headline scores tie within ±0.5 pt (±10 Elo on Chatbot Arena), we pick the canonical SKU by lowest tracked input price, then GA over preview or limited access, then newest `release`. A folded sibling within the benchmark noise band can show a "Tied within margin" chip on that score cell.
  4. Pricing: Reasoning tiers are often priced separately; confirm provider SKUs.
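
The ranking rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the site's actual implementation: the `Model` fields other than `family_slug` and `release`, the `GPQA_FLOOR` value (the page does not state its editorial floor), and the choice to compare each variant against the best score in its family are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    family_slug: str                      # mirrors the page's `familySlug`
    tier: str                             # parameter tier (hypothetical field name)
    gpqa: float                           # GPQA Diamond score, in percent
    release: date                         # mirrors the page's `release`
    reasoning: bool = False               # explicit reasoning flag
    input_price: Optional[float] = None   # $ per 1M input tokens, if tracked
    ga: bool = True                       # False for preview or limited access


GPQA_FLOOR = 85.0   # ASSUMPTION: the real editorial floor is not published
TIE_BAND = 0.5      # "tie within ±0.5 pt" from the variant-collapse rule


def eligible(m: Model) -> bool:
    """Rule 1: reasoning flag, or GPQA Diamond above the floor."""
    return m.reasoning or m.gpqa > GPQA_FLOOR


def canonical(variants: list[Model]) -> Model:
    """Rule 3: among variants tied with the family's best score, prefer
    lowest input price, then GA over preview, then newest release."""
    best = max(v.gpqa for v in variants)
    tied = [v for v in variants if best - v.gpqa <= TIE_BAND]
    return min(
        tied,
        key=lambda v: (
            v.input_price if v.input_price is not None else float("inf"),
            not v.ga,                     # False (GA) sorts before True (preview)
            -v.release.toordinal(),       # newer release sorts first
        ),
    )


def rank(models: list[Model]) -> list[Model]:
    """Rules 1-3: filter, collapse to one row per family, then sort."""
    families: dict[tuple[str, str], list[Model]] = {}
    for m in filter(eligible, models):
        families.setdefault((m.family_slug, m.tier), []).append(m)
    rows = [canonical(v) for v in families.values()]
    # Rule 2: GPQA Diamond descending, then newer release.
    return sorted(rows, key=lambda m: (-m.gpqa, -m.release.toordinal()))
```

With made-up entries, two variants of one family that score 90.0 and 90.3 fall inside the tie band, so the cheaper one becomes the canonical row even though its score is slightly lower.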

| # | Model | Tags | GPQA Diamond | Input $/1M | Output $/1M |
|---|---|---|---|---|---|
| 1 | Claude Mythos Preview | Preview, Reasoning, Vision, Tools | 94.6% | not listed | not listed |
| 2 | Gemini 3.1 Pro Preview | Preview, Vision, Tools | 94.3% | $2.00 | $12.00 |
| 3 | Claude Opus 4.7 | Reasoning, Vision, Tools | 94.2% | $5.00 | $25.00 |
| 4 | GPT-5.5 | Reasoning, Vision, Tools | 93.6% | $5.00 | $30.00 |
| 5 | GPT-5.5 Pro | Reasoning, Vision, Tools | 93.6% | $30.00 | $180.00 |
| 6 | GPT-5.4 | Reasoning, Tools | 92% | $2.50 | $15.00 |
| 7 | Claude Opus 4.6 | Reasoning, Vision, Tools | 91.3% | $5.00 | $25.00 |
| 8 | Kimi K2.6 | Reasoning, Vision, Tools | 90.5% | $0.75 | $3.50 |
| 9 | DeepSeek V4 Pro | Reasoning, Tools | 90.1% | $0.43 | $0.87 |
| 10 | Claude Sonnet 4.6 | Reasoning, Vision, Tools | 89.9% | $3.00 | $15.00 |
| 11 | Muse Spark | Reasoning, Vision, Tools | 89.5% | not listed | not listed |
| 12 | Qwen3.5-397B-A17B | Reasoning, Tools | 89.3% | $0.39 | $2.34 |
| 13 | Trinity-Large-Thinking | Reasoning, Tools | 89.2% | $0.22 | $0.85 |
| 14 | ByteDance Doubao Seed 2.0 Pro | Tools | 88.9% | not listed | not listed |
| 15 | DeepSeek V4 Flash | Reasoning, Tools | 88.1% | $0.14 | $0.28 |
| 16 | Kimi K2.5 | Tools | 87.9% | $0.44 | $2.00 |
| 17 | Qwen3.6-27B | Reasoning, Vision, Tools | 87.8% | $0.32 | $3.20 |
| 18 | o3 | Reasoning | 87.7% | $2.00 | $8.00 |
| 19 | MiniMax M2.7 | Reasoning, Tools | 87.4% | $0.30 | $1.20 |
| 20 | Gemini 3.1 Flash-Lite | Vision, Tools | 86.9% | $0.25 | $1.50 |
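
Because prices are quoted per 1M tokens, a per-request cost is a simple weighted sum. The sketch below uses two price pairs from the leaderboard; the token counts are hypothetical, and it assumes any reasoning tokens are billed at the output rate, which varies by provider and should be confirmed against the provider's SKU.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price: float,    # $ per 1M input tokens
    output_price: float,   # $ per 1M output tokens
) -> float:
    """Estimated cost in dollars for one request."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# (input $/1M, output $/1M) copied from the leaderboard rows
PRICES = {
    "Claude Opus 4.7": (5.00, 25.00),
    "DeepSeek V4 Flash": (0.14, 0.28),
}

# Hypothetical workload: 10,000 input tokens and 2,000 output tokens
# (ASSUMPTION: reasoning tokens, if any, are counted as output tokens).
for name, (inp, out) in PRICES.items():
    print(f"{name}: ${request_cost(10_000, 2_000, inp, out):.4f}")
```

For this workload the two rows differ by roughly a factor of 50 in cost while sitting about six GPQA Diamond points apart, which is why the price columns are worth reading alongside the score.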

Honorable mentions

Next seats in this ranking. Lines below are from each model's stored description in LLMReference seed data; spot-check the model page before relying on a capability claim.

  • GPT-5.4 is OpenAI's flagship frontier reasoning model, released March 5, 2026. It incorporates advances from GPT-5.3-Codex for coding and agentic workflows, and adds 'Thinking' mode with editable reasoning plans. Key capabilities include computer use (navigating interfaces via Playwright), image understanding and generation integration, full-stack web app generation, tool calling, and deep research. Knowledge cutoff is August 31, 2025. Model ID: gpt-5.4.

    GPQA Diamond: 92%

  • Claude Opus 4.6 is available on AWS Bedrock.

    GPQA Diamond: 91.3%

  • Kimi K2.6 is Moonshot AI's latest agentic reasoning model, launched April 13, 2026 as a code preview for Kimi Code subscribers. Built on a 1-trillion-parameter MoE architecture (32B active, 384 experts), it inherits K2.5's 256K context window and adds enhanced reliability for long-horizon agentic workflows, supporting 200–300 sequential tool calls without drift. It is optimized for coding, multi-step agent planning, and vision-assisted tasks such as processing screenshots, PDFs, and spreadsheets.

    GPQA Diamond: 90.5%