LLM Reference

Best Multimodal LLMs for Vision (2026)

Top AI models that understand images, video, and documents. Compare GPT-4o, Gemini, Claude, and other multimodal models by capabilities and pricing.

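| #  | Model                          | Capabilities             | Input $/1M | Output $/1M |
|----|--------------------------------|--------------------------|------------|-------------|
| 1  | Mistral Small 4                | Vision, Tools            | -          | -           |
| 2  | Nemotron 3 VoiceChat           | Vision                   | -          | -           |
| 3  | GPT-4.5                        | Vision, Tools            | -          | -           |
| 4  | Phi-4 Reasoning Vision 15B     | -                        | -          | -           |
| 5  | Gemini 3.1 Flash-Lite Preview  | Vision, Tools            | $0.25      | $1.5        |
| 6  | DeepSeek V3.1                  | Vision                   | $0.56      | $1.68       |
| 7  | Transcribe (03-2026)           | -                        | -          | -           |
| 8  | Gemini 3.1 Pro Preview         | Vision, Tools            | $2         | $12         |
| 9  | Gemini 3.1 Pro                 | Vision, Tools            | $2         | $12         |
| 10 | Claude Sonnet 4.6              | Reasoning, Vision, Tools | $3         | $15         |
| 11 | Claude Sonnet 4.6 Batch        | Reasoning, Vision, Tools | $1.5       | $7.5        |
| 12 | Gemini 2.0 Ultra               | Vision, Tools            | -          | -           |
| 13 | Claude Opus 4.6                | Reasoning, Vision, Tools | $5         | $25         |
| 14 | Claude Opus 4.6 Batch          | Reasoning, Vision, Tools | $2.5       | $12.5       |
| 15 | Qwen 3 Max                     | Vision, Tools            | $0.78      | $3.9        |
| 16 | Grok-3                         | Reasoning, Vision, Tools | $3         | $15         |
| 17 | Gemini 3 Flash Preview         | Vision, Tools            | $0.5       | $3          |
| 18 | Gemini 3 Flash                 | Vision, Tools            | $0.5       | $3          |
| 19 | Mistral Small 3.1 24B Instruct | Vision                   | $0.03      | $0.11       |
| 20 | Gemini 3 Pro                   | Vision, Tools            | $2         | $12         |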
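
To make the per-million prices easier to compare, here is a minimal Python sketch that estimates the cost of a single request from a few of the rows above. It assumes the "$/1M" columns mean US dollars per one million tokens and that image or document inputs are billed as ordinary input tokens; how a provider converts an image into tokens varies and is not covered by the table. The `PRICES` dictionary and the `request_cost` helper are hypothetical names used only for illustration.

```python
# Prices copied from the listing above, as (input, output).
# Assumption: the unit is USD per 1,000,000 tokens.
PRICES = {
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Sonnet 4.6 Batch": (1.50, 7.50),
    "Gemini 3 Flash": (0.50, 3.00),
    "Mistral Small 3.1 24B Instruct": (0.03, 0.11),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the list-price cost in USD of one request.

    Assumption: image and document inputs have already been converted
    to a token count by the caller.
    """
    input_price, output_price = PRICES[model]
    total = input_tokens * input_price + output_tokens * output_price
    return total / 1_000_000


if __name__ == "__main__":
    # Compare the same workload (200k input, 50k output tokens) across models.
    for name in PRICES:
        print(f"{name}: ${request_cost(name, 200_000, 50_000):.4f}")
```

In the listing, each Batch row is priced at half of its standard counterpart, so the same workload costs half as much when it can be run as a batch job.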