LLM Reference

Best Multimodal LLMs for Vision (2026)

Last refreshed 2026-06-30. Next refresh: weekly.

Best vision and multimodal LLMs in 2026, ranked by image benchmarks. Covers image QA, document understanding, and video analysis.

Verdict

Use Qwen3.6-Plus for vision today.

Qwen3.5-397B-A17B is the runner-up, 1 point back on MMMU.

Researched 43d ago.

How we rank

Vision/multimodal leaders rank on standard MMMU, use MathVista only as a comparable near-tie signal, then recency. MMMU Pro is tracked separately as harder multimodal evidence, but models without standard MMMU stay benchmark-pending for this leaderboard.
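As a rough illustration of the ordering described above, the sketch below sorts by standard MMMU, lets MathVista reorder neighbours that sit within 1 point of each other, and falls back to release recency on exact ties. The `Scored` class, the field names, the single-pass neighbour swap, and all sample numbers are assumptions made for this example; they are not LLMReference's actual code or data.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

NEAR_TIE_POINTS = 1.0  # "within 1 point" on standard MMMU


@dataclass
class Scored:
    name: str                          # hypothetical model label
    mmmu: float                        # standard MMMU score, in percent
    release: date                      # release date, used for recency
    mathvista: Optional[float] = None  # None when no MathVista score is tracked


def rank(models: List[Scored]) -> List[Scored]:
    # Base order: higher MMMU first; on exact ties, newer release first.
    ordered = sorted(models, key=lambda m: (-m.mmmu, -m.release.toordinal()))
    # One pass over neighbours: inside the near-tie band, a higher MathVista
    # score moves a model ahead, but only when both models have MathVista.
    for i in range(len(ordered) - 1):
        a, b = ordered[i], ordered[i + 1]
        near_tie = abs(a.mmmu - b.mmmu) <= NEAR_TIE_POINTS
        both_scored = a.mathvista is not None and b.mathvista is not None
        if near_tie and both_scored and b.mathvista > a.mathvista:
            ordered[i], ordered[i + 1] = b, a
    return ordered


# Made-up sample rows, for illustration only.
sample = [
    Scored("model-a", 86.0, date(2026, 1, 1), mathvista=70.0),
    Scored("model-b", 85.4, date(2026, 2, 1), mathvista=75.0),
    Scored("model-c", 83.6, date(2026, 3, 1)),
    Scored("model-d", 83.6, date(2026, 4, 1)),
]
ranked_names = [m.name for m in rank(sample)]
print(ranked_names)  # ['model-b', 'model-a', 'model-d', 'model-c']
```

A pairwise "within 1 point" rule is not transitive, so this sketch uses a deterministic neighbour pass rather than a custom comparator. A production ranking may resolve chains of near-ties differently.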

  1. Eligibility: Models flagged `vision` or `multimodal` in seed data, excluding audio, speech, image-generation, and video-generation specialist models.
  2. Primary ranking: Standard MMMU score is the primary score. When two scored models are within 1 point and both have MathVista, MathVista breaks the near-tie before release recency.
  3. Benchmark pending: Recent source-backed vision models without tracked standard MMMU stay visible in a separate benchmark-pending section. MMMU Pro rows remain informational unless the methodology is changed to accept that harder benchmark as a proxy.
  4. Variant collapse: We keep one row per model family (`familySlug` + parameter tier). When headline scores tie within ±0.5 pt (±10 Elo on Chatbot Arena), we pick the canonical SKU by lowest tracked input price, then GA over preview or limited access, then newest `release`. A folded sibling within the benchmark noise band can show a "Tied within margin" chip on that score cell.
  5. Pricing: Multimodal pricing often differs by modality; use provider rows for image/video-specific tiers.
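The variant-collapse rule in step 4 can be sketched as a grouping step followed by a three-part tie-break. This is a minimal sketch under stated assumptions: the `Variant` class and its field names are hypothetical, untracked prices are treated as the most expensive, and siblings are compared against the best score in their group. The SKUs and numbers are invented for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional, Tuple

TIE_BAND_POINTS = 0.5  # headline scores within this band count as tied


@dataclass
class Variant:
    sku: str
    family_slug: str
    tier: str                     # parameter tier
    score: float                  # headline benchmark score, in points
    input_price: Optional[float]  # tracked input $/1M tokens, None if untracked
    ga: bool                      # True for GA, False for preview/limited access
    release: date


def collapse(variants: List[Variant]) -> List[Variant]:
    # Group SKUs by (family slug, parameter tier).
    groups: Dict[Tuple[str, str], List[Variant]] = {}
    for v in variants:
        groups.setdefault((v.family_slug, v.tier), []).append(v)

    canonical = []
    for members in groups.values():
        best = max(m.score for m in members)
        tied = [m for m in members if best - m.score <= TIE_BAND_POINTS]
        # Lowest input price, then GA before preview, then newest release.
        canonical.append(min(tied, key=lambda m: (
            m.input_price if m.input_price is not None else float("inf"),
            not m.ga,
            -m.release.toordinal(),
        )))
    return canonical


# Invented SKUs for illustration only.
variants = [
    Variant("demo-large", "demo", "large", 80.0, 1.25, True, date(2026, 1, 10)),
    Variant("demo-large-preview", "demo", "large", 80.3, 1.25, False, date(2026, 3, 1)),
    Variant("demo-small", "demo", "small", 70.0, 0.30, True, date(2026, 1, 10)),
]
canonical_skus = [v.sku for v in collapse(variants)]
print(canonical_skus)  # ['demo-large', 'demo-small']
```

Here the preview SKU scores 0.3 points higher, which is inside the ±0.5 pt band, so the GA sibling at the same price becomes the canonical row.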
| # | Model | Tags | MMMU | Input $/1M | Output $/1M |
| --- | --- | --- | --- | --- | --- |
| 1 | Qwen3.6-Plus | Vision, Tools | 86% | $0.33 | $1.95 |
| 2 | ByteDance Doubao Seed 2.0 Pro | Vision, Tools | 85.4% | $0.47 | $2.37 |
| 3 | Qwen3.5-397B-A17B | Reasoning, Vision, Tools | 85% | $0.39 | $2.34 |
| 4 | Gemini 3.5 Flash | Reasoning, Vision, Tools | 83.6% | $1.50 | $9.00 |
| 5 | Claude Sonnet 4.6 | Reasoning, Vision, Tools | 83.6% | $3.00 | $15.00 |
| 6 | o3 | Reasoning, Vision, Tools | 82.9% | $2.00 | $8.00 |
| 7 | GPT-5.4 | Reasoning, Vision, Tools | 82.1% | $2.50 | $15.00 |
| 8 | Qwen3.6 Max Preview | Preview, Reasoning, Vision, Tools | 82% | $1.04 | $6.24 |
| 9 | Gemini 2.5 Pro | Reasoning, Vision, Tools | 81.7% | $1.25 | $10.00 |
| 10 | Gemini 3 Pro | Vision, Tools | 81% | $1.25 | $5.00 |
| 11 | Claude Opus 4.5 | Reasoning, Vision, Tools | 80.7% | $5.00 | $25.00 |
| 12 | Gemini 2.5 Flash | Vision, Tools | 79.7% | $0.30 | $2.50 |
| 13 | Gemini 2.5 Pro Preview 05-06 | Preview, Vision | 79.6% | $1.25 | $10.00 |
| 14 | Claude Sonnet 4.5 | Reasoning, Vision, Tools | 77.8% | $3.00 | $15.00 |
| 15 | Claude Opus 4.6 | Reasoning, Vision, Tools | 76.5% | $5.00 | $25.00 |
| 16 | Command A+ | Reasoning, Vision, Tools | 75.1% | | |
| 17 | Claude 3.7 Sonnet | Reasoning, Vision, Tools | 75% | $3.00 | $15.00 |
| 18 | Llama 4 Maverick 17B Instruct FP8 | Vision | 73.4% | $0.15 | $0.60 |
| 19 | Llama 4 Scout 17B-16E Instruct | Vision | 69.4% | $0.08 | $0.22 |
| 20 | GPT-4o | Vision, Tools | 69.1% | $2.50 | $10.00 |
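To turn the per-1M-token prices in the ranking into a budget figure, a small calculation like the one below can help. It is a sketch that assumes a flat per-token rate. Image and video inputs are often billed on separate tiers, so treat the result as a text-token estimate only. The function name and the workload size (2M input tokens, 0.5M output tokens) are made up for the example; the prices are the tracked ones listed for Qwen3.6-Plus and Claude Sonnet 4.6.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the estimated cost in dollars for a flat per-1M-token price."""
    return (input_tokens / 1_000_000) * input_price_per_m \
        + (output_tokens / 1_000_000) * output_price_per_m


# Hypothetical workload: 2M input tokens and 0.5M output tokens.
qwen_cost = estimate_cost(2_000_000, 500_000, 0.33, 1.95)     # Qwen3.6-Plus
sonnet_cost = estimate_cost(2_000_000, 500_000, 3.00, 15.00)  # Claude Sonnet 4.6

print(f"Qwen3.6-Plus:      ${qwen_cost:.2f}")
print(f"Claude Sonnet 4.6: ${sonnet_cost:.2f}")
```

For this workload the estimate is about $1.64 for Qwen3.6-Plus and $13.50 for Claude Sonnet 4.6, before any modality-specific charges.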

New models awaiting benchmark coverage

These source-backed rows qualify for this task page, but they are not scored leaderboard picks until the category benchmark data exists.

| Model | Tags | Why it is listed | Status | Tracked price |
| --- | --- | --- | --- | --- |
| Claude Sonnet 5 | Tools, Code execution | Claude Sonnet 5 reports source-backed vision or multimodal capability; keep it separate from the scored vision ranking until standard MMMU data lands. | Benchmark pending. No tracked standard MMMU score yet. | In $2.00 / Out $10.00 |
| Seed 2.1 Pro | | Seed 2.1 Pro reports source-backed vision or multimodal capability; keep it separate from the scored vision ranking until standard MMMU data lands. | Benchmark pending. No tracked standard MMMU score yet. | In $0.83 / Out $4.14 |
| Seed 2.1 Turbo | | Seed 2.1 Turbo reports source-backed vision or multimodal capability; keep it separate from the scored vision ranking until standard MMMU data lands. | Benchmark pending. No tracked standard MMMU score yet. | In $0.41 / Out $2.07 |
| Fugu Ultra | Tools | Fugu Ultra reports source-backed vision or multimodal capability; keep it separate from the scored vision ranking until standard MMMU data lands. | Benchmark pending. No tracked standard MMMU score yet. | In $5.00 / Out $30.00 |
| Kimi K2.7-Code HighSpeed | Tools | Kimi K2.7-Code HighSpeed reports source-backed vision or multimodal capability; keep it separate from the scored vision ranking until standard MMMU data lands. | Benchmark pending. No tracked standard MMMU score yet. | In $1.90 / Out $8.00 |
| Claude Opus 4.8 | Tools, Code execution | Claude Opus 4.8 reports source-backed vision or multimodal capability; keep it separate from the scored vision ranking until standard MMMU data lands. | Benchmark pending. No tracked standard MMMU score yet. | In $5.00 / Out $25.00 |
| GPT-5.5 | Tools, Code execution | GPT-5.5 reports source-backed vision or multimodal capability; keep it separate from the scored vision ranking until standard MMMU data lands. | Benchmark pending. No tracked standard MMMU score yet. | In $5.00 / Out $30.00 |
| Claude Opus 4.7 | Tools, Code execution | Claude Opus 4.7 reports source-backed vision or multimodal capability; keep it separate from the scored vision ranking until standard MMMU data lands. | Benchmark pending. No tracked standard MMMU score yet. | In $5.00 / Out $25.00 |
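The split between scored rows and benchmark-pending rows follows from the eligibility and benchmark-pending rules in the methodology. A minimal sketch, assuming each model is a dictionary with a `flags` list and an optional `mmmu` value, might look like this. The record shape, the way specialist models are flagged, and the two `*-demo` entries are assumptions for illustration.

```python
from typing import Dict, List, Tuple

ELIGIBLE_FLAGS = {"vision", "multimodal"}
# Assumption: specialist models carry one of these flags in seed data.
EXCLUDED_SPECIALTIES = {"audio", "speech", "image-generation", "video-generation"}


def partition(models: List[Dict]) -> Tuple[List[str], List[str]]:
    """Return (scored, pending) model names for the vision leaderboard."""
    scored, pending = [], []
    for m in models:
        flags = set(m["flags"])
        if not (flags & ELIGIBLE_FLAGS) or (flags & EXCLUDED_SPECIALTIES):
            continue  # not eligible for this page at all
        if m.get("mmmu") is not None:
            scored.append(m["name"])   # has a standard MMMU score
        else:
            pending.append(m["name"])  # eligible, but awaiting standard MMMU
    return scored, pending


models = [
    {"name": "Qwen3.6-Plus", "flags": ["vision"], "mmmu": 86.0},
    {"name": "GPT-5.5", "flags": ["vision"], "mmmu": None},
    {"name": "speech-demo", "flags": ["multimodal", "speech"], "mmmu": None},  # hypothetical
    {"name": "text-demo", "flags": [], "mmmu": None},                          # hypothetical
]
scored, pending = partition(models)
print(scored, pending)  # ['Qwen3.6-Plus'] ['GPT-5.5']
```

An MMMU Pro score alone would not move a model out of the pending list under the current methodology, so this sketch checks only the standard MMMU field.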

Honorable mentions

Next seats in this ranking. Lines below are from each model's stored description in LLMReference seed data—spot-check the model page before relying on a capability claim.

  • Claude Sonnet 4.6 is Anthropic's best combination of speed and intelligence. Proprietary decoder-only model with 1M-token context, 64K max output, multimodal vision, extended thinking, and function calling. Available via Anthropic API, AWS Bedrock, GCP Vertex AI, and OpenRouter at $3/1M input and $15/1M output tokens.

    MMMU: 83.6%

  • #5 o3

    OpenAI o3 reasoning model with advanced multi-step problem-solving capabilities.

    MMMU: 82.9%

  • GPT-5.4 is OpenAI's flagship frontier reasoning model, released March 5, 2026. It incorporates advances from GPT-5.3-Codex for coding and agentic workflows, and adds 'Thinking' mode with editable reasoning plans. Key capabilities include computer use (navigating interfaces via Playwright), image understanding and generation integration, full-stack web app generation, tool calling, and deep research. Knowledge cutoff is August 31, 2025. Model ID: gpt-5.4.

    MMMU: 82.1%

Frequently asked questions

Which LLM is best for vision?

Qwen3.6-Plus is the current LLMReference top pick for vision. The verdict uses the stored category signal MMMU: 86%. Output pricing starts at $1.95 per 1M tokens. Review the linked model and provider pages before production use because availability and pricing can change.

How does Qwen3.6-Plus compare to Qwen3.5-397B-A17B for vision?

Qwen3.6-Plus leads Qwen3.5-397B-A17B in the visible shortlist on MMMU: 86% versus 85%. Output pricing starts at $1.95 per 1M tokens for Qwen3.6-Plus and $2.34 per 1M tokens for Qwen3.5-397B-A17B.

How does LLMReference rank LLMs for vision?

LLMReference ranks LLMs for vision from stored model, benchmark, freshness, and pricing data. The current methodology summary is: Vision/multimodal leaders rank on standard MMMU, use MathVista only as a comparable near-tie signal, then recency. MMMU Pro is tracked separately as harder multimodal evidence, but models without standard MMMU stay benchmark-pending for this leaderboard.

How often is this list updated?

The LLM rankings on this page are updated daily as new benchmark scores, provider availability, and pricing data are tracked. The "Last refreshed" date at the top of the page shows the most recent refresh.

How do you decide which models appear in the top 3?

The podium picks are driven by the primary benchmark signal for this category (shown in the Methodology section), filtered to non-deprecated models with confirmed API availability. In ties, we prefer the more recently released model.

Are preview or beta models included?

Preview models appear in the "Watch list" section but are not in the main ranked podium unless the category explicitly allows it (e.g., /best/coding and /best/agents, where preview models often lead benchmarks).

Can I compare two specific models head-to-head?

Yes — use the Compare tool at llmreference.com/compare for a side-by-side breakdown of context window, pricing, benchmarks, and provider availability.

Is the pricing data real-time?

Pricing is tracked from provider documentation and updated regularly. It reflects the best available public data, not live API quotes — always verify before billing.