LLM Reference

Best Multimodal LLMs for Vision (2026)

Last refreshed 2026-05-18. Next refresh: weekly.

Top multimodal models that understand images, video, and documents, ranked by vision benchmarks, capabilities, pricing, and context window.

Top three picks

An opinionated short list for this category. Scroll down for the full leaderboard, pricing, and comparison links.

How we rank

Vision and multimodal models are ranked by MMMU score (multimodal understanding), then by release recency.

  1. Eligibility: Models flagged `vision` or `multimodal` in seed data.
  2. Primary ranking: MMMU score, then newer release.
  3. Variant collapse: We keep one row per model family (`familySlug` + parameter tier). When headline scores tie within ±0.5 pt (±10 Elo on Chatbot Arena), we pick the canonical SKU by lowest tracked input price, then GA over preview or limited access, then newest `release`. A folded sibling within the benchmark noise band can show a "Tied within margin" chip on that score cell.
  4. Pricing: Multimodal pricing often differs by modality, so use provider rows for image/video-specific tiers.
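
The ranking steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the site's actual implementation: the `Model` fields (`tier`, `flags`, `is_ga`, `input_price`) are hypothetical names, untracked prices are assumed to sort last, models without an MMMU score are assumed to rank below scored ones, and siblings outside the ±0.5 pt band are assumed to lose to the higher-scoring variant.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

SCORE_TIE_MARGIN = 0.5  # points; the ±0.5 pt noise band from step 3


@dataclass(frozen=True)
class Model:
    name: str
    family_slug: str               # stands in for `familySlug`
    tier: str                      # parameter tier (hypothetical field name)
    flags: frozenset               # e.g. frozenset({"vision", "tools"})
    mmmu: Optional[float]          # None when no MMMU score is tracked
    input_price: Optional[float]   # USD per 1M input tokens; None if untracked
    is_ga: bool                    # False for preview or limited access
    release: date


def is_eligible(m):
    """Step 1: keep models flagged `vision` or `multimodal`."""
    return bool({"vision", "multimodal"} & m.flags)


def canonical_key(m):
    """Step 3 tie-break: lowest tracked price, then GA, then newest release."""
    price = m.input_price if m.input_price is not None else float("inf")
    return (price, not m.is_ga, -m.release.toordinal())


def collapse_variants(models):
    """Step 3: keep one row per (family, tier) group."""
    groups = {}
    for m in models:
        groups.setdefault((m.family_slug, m.tier), []).append(m)
    kept = []
    for siblings in groups.values():
        scored = [m for m in siblings if m.mmmu is not None]
        if scored:
            best = max(m.mmmu for m in scored)
            # Only siblings inside the noise band compete on the tie-break.
            tied = [m for m in scored if best - m.mmmu <= SCORE_TIE_MARGIN]
        else:
            tied = siblings
        kept.append(min(tied, key=canonical_key))
    return kept


def rank(models):
    """Steps 1-3: filter, collapse, then sort by MMMU and recency."""
    rows = collapse_variants([m for m in models if is_eligible(m)])
    return sorted(
        rows,
        key=lambda m: (m.mmmu is None, -(m.mmmu or 0.0), -m.release.toordinal()),
    )
```

Calling `rank(models)` on a list of `Model` records returns the collapsed rows in leaderboard order. The Elo-based margin for Chatbot Arena scores is left out because this leaderboard ranks on MMMU.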
| # | Model | Tags | MMMU | Input $/1M | Output $/1M |
|---|-------|------|------|------------|-------------|
| 1 | Qwen3.6-Plus | Vision, Tools | 86% | $0.33 | $1.95 |
| 2 | Qwen3.5-397B-A17B | Reasoning, Tools | 85% | $0.39 | $2.34 |
| 3 | GPT-5.4 | Reasoning, Tools | 82.1% | $2.50 | $15.00 |
| 4 | Qwen3.6 Max Preview | Preview, Reasoning, Vision, Tools | 82% | $1.04 | $6.24 |
| 5 | Gemini 3 Pro | Vision, Tools | 81% | $1.25 | $5.00 |
| 6 | Claude Opus 4.5 | Reasoning, Vision, Tools | 80.7% | $5.00 | $25.00 |
| 7 | Gemini 2.5 Flash | Vision, Tools | 79.7% | $0.30 | $2.50 |
| 8 | Claude Sonnet 4.5 | Reasoning, Vision, Tools | 77.8% | $3.00 | $15.00 |
| 9 | Claude 3.7 Sonnet | Reasoning, Vision, Tools | 75% | $3.00 | $15.00 |
| 10 | GPT-4o | Vision, Tools | 69.1% | $2.50 | $10.00 |
| 11 | Qwen2-VL-72B-Instruct | Vision | 64.5% | $0.90 | $0.90 |
| 12 | Llama 3.2 90B Vision | Vision | 60.3% | $1.35 | $1.80 |
| 13 | Llama 3.2 11B Vision | Vision | 50.7% | $0.20 | $0.27 |
| 14 | GLM-4V 9B | - | 48.3% | $0.05 | $0.25 |
| 15 | Phi 3.5 Vision Instruct | Vision | 43% | - | - |
| 16 | Sora 2 | - | - | - | - |
| 17 | Perceptron Mk1 | Reasoning, Vision | - | $0.15 | $1.50 |
| 18 | MiniCPM-V 4.6 | Vision | - | - | - |
| 19 | GPT Realtime 2 | Reasoning, Tools | - | $32.00 | $64.00 |
| 20 | GPT Realtime Translate | - | - | - | - |
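
To compare rows on cost, the per-million-token prices can be turned into a per-request estimate. The sketch below is illustrative only: the helper name `request_cost_usd` and the token counts are assumptions, and it uses only the flat input and output rates listed in the leaderboard. Image and video inputs are often billed under modality-specific tiers, so real bills may differ.

```python
# Flat text-token prices (USD per 1M tokens) copied from the leaderboard rows.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "Gemini 2.5 Flash": (0.30, 2.50),
}


def request_cost_usd(input_tokens, output_tokens, input_per_m, output_per_m):
    """Estimate cost from token counts and $/1M rates."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


# Hypothetical workload: 200,000 input tokens and 20,000 output tokens.
for name, (inp, out) in PRICES.items():
    cost = request_cost_usd(200_000, 20_000, inp, out)
    print(f"{name}: ${cost:.2f}")
# GPT-4o comes to roughly $0.70 and Gemini 2.5 Flash to roughly $0.11.
```

The estimate scales linearly with token counts, so the ratio between two models stays fixed for a given input-to-output mix.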

Honorable mentions

The next seats in this ranking. The lines below come from each model's stored description in LLM Reference seed data, so spot-check the model page before relying on a capability claim.

  • Qwen3.6-Max-Preview is a proprietary frontier model from Alibaba Cloud built on a sparse MoE architecture, available for preview as part of the Qwen3.6 series.

    MMMU: 82%

  • #5 Gemini 3 Pro

    Google DeepMind's most advanced reasoning Gemini model. Part of the Gemini 3 series with frontier-class intelligence, multimodal understanding, and 1M token context window.

    MMMU: 81%

  • Claude Opus 4.5 available on AWS Bedrock

    MMMU: 80.7%