LLM Reference
Capability filter · capability · beginner

multimodal

Also known as: multi-modal, multimodality

See matching models with benchmark scores and pricing.

443 matching active models
46 tracked providers
299 models with routes

model.multimodal · model.vision · model.audio

Definition

Multimodal refers to LLMs extended to process, and in some cases generate, multiple data modalities, such as text paired with images, audio, or video. Modality-specific encoders typically convert each input into token-like embeddings that the core transformer consumes, either interleaved with text tokens in a shared sequence or attended to through cross-attention. This enables tasks like visual question answering and image captioning.
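As a rough illustration of how an encoder's output can be fed into the core transformer, the sketch below uses the projector approach that some entries on this page describe (for example, the projection layers mentioned for Phi-3 Vision and PaliGemma). It is a toy under stated assumptions, not any model's actual implementation: the function names, dimensions, and random weights are all hypothetical, and real systems use learned weights and a real vision encoder.

```python
import random

def project(vec, weights):
    """Linear projection: map one encoder feature vector to the LLM embedding width."""
    return [sum(v * w for v, w in zip(vec, col)) for col in weights]

def fuse(text_embeddings, image_features, weights):
    """Project image features, then prepend them to the text embeddings as one sequence."""
    image_tokens = [project(f, weights) for f in image_features]
    return image_tokens + text_embeddings

random.seed(0)
ENCODER_DIM, MODEL_DIM = 6, 4  # hypothetical widths, chosen only for the demo

# Projection weights stored as MODEL_DIM columns of ENCODER_DIM entries (random stand-ins).
weights = [[random.uniform(-1, 1) for _ in range(ENCODER_DIM)] for _ in range(MODEL_DIM)]
# Stand-ins for a vision encoder's output (3 patches) and embedded text (5 tokens).
image_features = [[random.random() for _ in range(ENCODER_DIM)] for _ in range(3)]
text_embeddings = [[random.random() for _ in range(MODEL_DIM)] for _ in range(5)]

sequence = fuse(text_embeddings, image_features, weights)
print(len(sequence), len(sequence[0]))  # prints: 8 4
```

The transformer then sees a single sequence of eight same-width vectors; nothing downstream needs to know which positions came from the image.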

Models with multimodal

Showing the first 80 matches, sorted by decision relevance, with tracked capability and provider-route evidence.

443 matches
Mistral Medium 3 Instruct

Mistral Medium 3 Instruct is Mistral AI's Mistral Medium model. It offers a 128K-token context window.

2025-10-01

Researched 19d ago

128k

128,000 tokens

128k context · Vision · Multimodal · Batch
Mistral AI Studio

$0.400 in / $2.00 out / 1M tokens

2 routes · 1 batch

Provider docs
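Prices on these cards are quoted per 1M tokens, split into input and output. The sketch below shows how such a quote turns into a per-request cost, using the $0.400 in / $2.00 out figures from the card above. The function name and the token counts are hypothetical, and it assumes any image input has already been counted as tokens by the provider.

```python
from decimal import Decimal

def request_cost(input_tokens, output_tokens, price_in, price_out):
    """Return the USD cost of one request, given prices quoted per 1M tokens."""
    per_million = Decimal(1_000_000)
    total = Decimal(input_tokens) * Decimal(price_in) + Decimal(output_tokens) * Decimal(price_out)
    return total / per_million

# Hypothetical request: 12,000 input tokens and 800 output tokens.
cost = request_cost(12_000, 800, "0.400", "2.00")
print(cost)
```

Passing prices as strings keeps the arithmetic exact in `Decimal`, which avoids float rounding when many small requests are summed.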
Mistral Large 3 675B Instruct

Mistral Large 3 675B Instruct is Mistral AI's Mistral Large model. It offers a 128K-token context window and scores 70.2 on τ-bench.

2025-12-01

Researched 4d ago

128k

128,000 tokens

128k context · Vision · Multimodal · JSON · Batch · Prompt cache
AWS Bedrock

$0.500 in / $1.50 out / 1M tokens

6 routes · 1 batch · 1 cache

Provider docs
Qwen3.5-9B

Qwen3.5-9B is Alibaba's Qwen3.5 model with multimodal text and image input. It offers a 256K-token context window with weights openly available for self-hosting.

2026-03-02

Researched 35d ago

262k

262,144 tokens

262k context · Vision · Multimodal · Tool use · Functions · JSON
Alibaba Cloud PAI-EAS

$0.100 in / $0.150 out / 1M tokens

3 routes

Provider docs
Llama Guard 3 11B Vision

Llama Guard 3 11B Vision is Meta's Llama Guard model. Weights are openly available for self-hosting.

2024-09-25

Researched 35d ago

128k

128,000 tokens

128k context · Vision

No tracked provider route

GPT-4 Vision Preview

GPT-4 Vision Preview is OpenAI's GPT-4 model with multimodal text and image input. It is deprecated (originally released 2023-11-06); use it only for reproducing earlier results or evaluating drift over time.

2023-11-06

Researched 35d ago

128k

128,000 tokens

128k context · Vision · Multimodal · Code exec

No tracked provider route

Xiaomi MiMo-V2.5

Xiaomi MiMo-V2.5 is the lower-cost native omnimodal sibling in the MiMo-V2.5 series. OpenRouter describes it as supporting text, image, audio, and video inputs with text output, Pro-level agentic performance at roughly half the inference cost, and improved multimodal perception over MiMo-V2-Omni. Xiaomi's official April 22 release page highlights MiMo-V2.5 alongside MiMo-V2.5-Pro in benchmark data and says the V2.5 series will be open-sourced soon; no public weights/license were verified at research time.

2026-04-22

Researched 28d ago

1.05m

1,048,576 tokens

1.05m context · Reasoning · Vision · Multimodal · Tool use · Functions
OpenRouter

$0.140 in / $0.280 out / 1M tokens

2 routes · 1 cache

Provider docs
Kimi K2.6

Kimi K2.6 is Moonshot AI's multimodal agentic coding model, released April 20 2026 under a Modified MIT license. Built on a 1-trillion-parameter MoE architecture (32B active, 384 experts with 8 selected per token plus 1 shared expert, 61 layers), it features a 262K context window and up to 65,536 output tokens. Supports native image and video inputs (screenshots, PDFs, spreadsheets). Designed for long-horizon coding with agent swarms of up to 300 sub-agents and 4,000 coordinated steps; Moonshot AI cites 200–300 sequential tool calls without task drift. Key benchmarks: SWE-bench Verified 80.2%, SWE-bench Pro 58.6%, LiveCodeBench v6 89.6%, GPQA Diamond 90.5%, Terminal-Bench 2.0 66.7%. Chatbot Arena Elo 1454 (2026-04-28 snapshot).

2026-04-20

Researched 29d ago

262k

262,144 tokens

262k context · Reasoning · Vision · Multimodal · Tool use · Functions
Novita AI

$0.800 in / $3.40 out / 1M tokens

9 routes · 3 cache

Provider docs
SenseNova V6

SenseNova V6 is SenseTime's latest large model with advanced multimodal reasoning and text generation at low cost. Features multimodal long chain-of-thought reasoning and reinforcement learning enhancements.

2025-04-01

Researched 35d ago

128k

128,000 tokens

128k context · Vision · Multimodal
SenseTime API

Pricing not tracked

1 route

Provider docs
Step-2

Step-2 is StepFun's Step model with multimodal text and image input. It offers a 256K-token context window.

2024-09-01

Researched 32d ago

256k

256,000 tokens

256k context · Vision · Multimodal · Functions
StepFun

Pricing not tracked

1 route

Provider docs
Step-1.5V

Step-1.5V is StepFun's multimodal language model with vision capabilities, building on Step-1 with image understanding.

2024-06-01

Researched 173d ago

128k

128,000 tokens

128k context · Vision · Multimodal
StepFun

Pricing not tracked

1 route

Provider docs
SenseNova-U1-A3B

SenseNova-U1-A3B is SenseTime's open-source multimodal MoE model released April 28, 2026. ~3B activated parameters (MoE backbone). Shares the NEO-Unify architecture with SenseNova-U1-8B: no visual encoder or VAE, native unified text-and-image representation. Apache 2.0 license. HuggingFace: OpenSenseNova/SenseNova-U1-A3B-MoT. Source: https://github.com/OpenSenseNova/SenseNova-U1

2026-04-28

Researched 35d ago

No window data

Vision · Multimodal

No tracked provider route

Kimi K2.7-Code

Kimi K2.7-Code is Moonshot AI's coding-focused multimodal model released June 12, 2026, built on Kimi K2.6. Uses the same 1-trillion-parameter MoE architecture (32B active parameters, 384 experts with 8 selected per token, 61 layers) with a 262K context window and MoonViT vision encoder (400M parameters). Reports +21.8% on Moonshot's Kimi Code Bench v2, +11.0% on Program Bench, +31.5% on MLS Bench Lite versus K2.6, with approximately 30% fewer reasoning tokens. Forces thinking mode on by default and preserves reasoning content across multi-turn interactions for agentic use. Available via Kimi platform API and HuggingFace under Modified MIT license.

2026-06-12

Researched 3d ago

262k

262,144 tokens

262k context · Reasoning · Vision · Multimodal · Tool use · Functions
OpenRouter

$0.612 in / $3.07 out / 1M tokens

2 routes · 2 cache

Provider docs
Holo3.1-35B-A3B

Holo3.1-35B-A3B is H Company's flagship open-weights 35B (3B active) sparse MoE computer-use VLM released June 1, 2026 under Apache 2.0. Achieves 79.3% on AndroidWorld and >25% improvement on the Holotab harness over Holo3. Supports native function-calling, FP8/NVFP4 quantization for DGX Spark, and local self-hosting.

2026-06-01

Researched 16d ago

262k

262,144 tokens

262k context · Vision · Multimodal · Tool use · Functions · JSON
H Company API

$0.250 in / $1.80 out / 1M tokens

1 route

Provider docs
Holo3.1-9B

Holo3.1-9B is H Company's 9B-parameter open-weights computer-use VLM released June 1, 2026 under Apache 2.0. Part of the Holo3.1 family. Achieves 71% on AndroidWorld. Supports native function-calling and structured JSON output for web, desktop, and mobile agent workflows.

2026-06-01

Researched 16d ago

262k

262,144 tokens

262k context · Vision · Multimodal · Tool use · Functions · JSON

No tracked provider route

Holo3.1-4B

Holo3.1-4B is H Company's 4B-parameter open-weights computer-use VLM released June 1, 2026 under Apache 2.0. Part of the Holo3.1 family. Achieves 71% on AndroidWorld (up from 58% for Holo3). Supports native function-calling, web/desktop/mobile environments, and local deployment via quantized checkpoints.

2026-06-01

Researched 16d ago

262k

262,144 tokens

262k context · Vision · Multimodal · Tool use · Functions · JSON

No tracked provider route

Holo3.1-0.8B

Holo3.1-0.8B is H Company's smallest 0.8B-parameter open-weights computer-use VLM, released June 1, 2026 under Apache 2.0. Part of the Holo3.1 family spanning 0.8B-35B. Fine-tuned from Qwen3.5-0.8B. Supports native function-calling and structured JSON output for local/edge deployment.

2026-06-01

Researched 16d ago

262k

262,144 tokens

262k context · Vision · Multimodal · Tool use · Functions · JSON

No tracked provider route

Phi-3 Vision

Phi-3 Vision is Microsoft's multimodal model combining language and vision. It processes both text and images and can perform tasks such as optical character recognition, chart analysis, and image interpretation. Its architecture comprises an image encoder, a text-image connector, a projector for mapping image features, and the Phi-3 Mini language model. At 4.2 billion parameters, it competes with larger models and suits devices with limited computational power. Its 128K-token context supports complex multimodal reasoning. It was trained on high-quality and synthetic data and incorporates safety measures.

2024-05-21

Researched 173d ago

128k

128,000 tokens

128k context · Vision
Fireworks AI

$0.200 in / $0.200 out / 1M tokens

3 routes

Provider docs
Holo3-122B-A10B

Holo3-122B-A10B is H Company's flagship 122B (10B active) sparse MoE computer-use VLM. Released March 31, 2026, available via the H Company API only (no public open weights). Achieves 78.85% on OSWorld-Verified - SOTA at launch. Paid-tier API access at $0.40/$3.00 per 1M input/output tokens.

2026-03-31

Researched 16d ago

66k

65,536 tokens

Vision · Multimodal · Tool use · Functions · JSON
H Company API

$0.400 in / $3.00 out / 1M tokens

1 route

Provider docs
Amazon Nova Premier

Amazon Nova Premier is Amazon's most capable standard Bedrock Nova understanding model for complex reasoning, agentic workflows, and model distillation. It supports a 1M-token context window, text/image/video inputs, text output, reasoning, tool calling, and prompt caching; use it as the standard Bedrock Nova frontier pick instead of Nova 2 Omni early-access Forge checkpoints.

2025-03-17

Researched 34d ago

1m

1,000,000 tokens

1m context · Reasoning · Vision · Multimodal · Tool use · Functions
AWS Bedrock

$2.50 in / $12.50 out / 1M tokens

2 routes · 1 batch · 1 cache

Provider docs
Nova Pro

Nova Pro is Amazon's Nova model. It offers a 300K-token context window.

2025-03-17

Researched 19d ago

300k

300,000 tokens

300k context · Vision · Multimodal · JSON
AWS Bedrock

$0.800 in / $3.20 out / 1M tokens

2 routes

Provider docs
Reka Edge

Reka Edge is Reka's Reka model with multimodal text and image input. It offers a 64K-token context window.

2024-02-12

Researched 35d ago

64k

64,000 tokens

Multimodal · JSON
OpenRouter

$0.100 in / $0.100 out / 1M tokens

2 routes

Provider docs
Mistral Large 2 (2407)

Mistral Large 2 (2407) is Mistral AI's July 2024 flagship: a dense 123B-parameter model with a 128K-token context window, strong at complex reasoning and long-context processing.

2024-07-23

Researched 61d ago

128k

128,000 tokens

128k context · Vision · JSON
Chutes AI

$0.500 in / $1.50 out / 1M tokens

3 routes

Nova Lite

Nova Lite is Amazon's Nova model. It offers a 300K-token context window.

2025-03-17

Researched 19d ago

300k

300,000 tokens

300k context · Vision · Multimodal · JSON
AWS Bedrock

$0.060 in / $0.240 out / 1M tokens

2 routes

Provider docs
Claude 3 Sonnet

Claude 3 Sonnet is Anthropic's mid-tier Claude 3 model, balancing intelligence and speed for enterprise use cases. It sits between the more capable Opus and the faster Haiku. Sonnet handles nuanced content creation, summarization, and complex scientific queries, and is proficient in non-English languages and coding. Its vision capabilities cover visual reasoning such as interpreting charts and graphs and transcribing text from imperfect images, which is useful in retail, logistics, and finance. Operating at twice the speed of Claude 3 Opus, Sonnet suits context-sensitive customer support and multi-step workflows. It is rated AI Safety Level 2 (ASL-2) and is accessible through Claude.ai, the Claude iOS app, the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI.

2024-03-04

Researched 65d ago

200k

200,000 tokens

200k context · Reasoning · Vision · Multimodal · JSON · Code exec
AWS Bedrock

$3.00 in / $15.00 out / 1M tokens

2 routes · 1 cache

Provider docs
Reka Flash

21B multimodal model from Reka. State of the art for the 21B class; supports text, image, and video input.

2024-02-12

Researched 173d ago

128k

128,000 tokens

128k context · Multimodal
Reka Platform

$0.200 in / $0.800 out / 1M tokens

1 route

Provider docs
Kimi K2.5

Kimi K2.5 is Moonshot AI's Kimi model focused on code generation and software engineering. It offers a 256K-token context window and scores 87.9 on GPQA.

2026-03-15

Researched 19d ago

256k

256,000 tokens

256k context · Vision · Multimodal · Functions · JSON · Prompt cache
OpenRouter

$0.440 in / $2.00 out / 1M tokens

9 routes · 1 cache

Provider docs
Llama 3.2 11B Vision Instruct

Instruction-tuned 11B Llama 3.2 Vision model for image reasoning, visual question answering, document understanding, and captioning. NVIDIA NIM lists text plus image input, text output, and a 128K context window for the Llama 3.2 Vision collection.

2024-09-25

Researched 22d ago

128k

128,000 tokens

128k context · Vision · Multimodal · JSON
Vercel AI Gateway

$0.160 in / $0.160 out / 1M tokens

8 routes

Provider docs
Llama 3.2 90B Vision Instruct

Instruction-tuned 90B Llama 3.2 Vision model for higher-capability image reasoning, visual question answering, visual grounding, and captioning. NVIDIA NIM lists text plus image input, text output, and a 128K context window for the Llama 3.2 Vision collection.

2024-09-25

Researched 22d ago

128k

128,000 tokens

128k context · Vision · Multimodal
Bitdeer AI

$0.150 in / $0.450 out / 1M tokens

6 routes

Provider docs
Claude 3.7 Sonnet

Claude 3.7 Sonnet is Anthropic's advanced model with extended thinking capabilities, offering state-of-the-art reasoning for complex tasks.

2025-02-24

Researched 65d ago

200k

200,000 tokens

200k context · Reasoning · Vision · Multimodal · Tool use · Functions
AWS Bedrock

$3.00 in / $15.00 out / 1M tokens

6 routes · 1 batch

Provider docs
Qwen3.6-Plus

Qwen3.6-Plus is Alibaba Cloud's GA Qwen3.6 flagship for long-context reasoning, coding, tool use, and multimodal workflows. DashScope lists it with a 1M-token context window, structured output support, and standard public token pricing.

2026-04-01

Researched 34d ago

1m

1,000,000 tokens

1m context · Vision · Multimodal · Tool use · Functions · JSON
Alibaba Cloud PAI-EAS

$0.325 in / $1.95 out / 1M tokens

3 routes · 2 cache

Provider docs
Nemotron-Nano-12B-v2-VL

Nemotron-Nano-12B-v2-VL is NVIDIA AI's Nemotron Nano 2 model with multimodal text and image input. It was released 2025-10-28.

2025-10-28

Researched 35d ago

No window data

Vision · Multimodal · JSON
OpenRouter

$0.200 in / $0.600 out / 1M tokens

3 routes

Provider docs
o3

OpenAI o3 reasoning model with advanced multi-step problem-solving capabilities.

2025-04-16

Researched 15d ago

200k

200,000 tokens

200k context · Reasoning · Vision · Multimodal · Tool use · Functions
OpenAI API

$2.00 in / $8.00 out / 1M tokens

3 routes · 1 batch · 2 cache

Provider docs
Pixtral Large

Large multimodal Mistral model delivering advanced vision-language understanding with support for high-resolution images and complex visual reasoning.

2024-11-18

Researched 65d ago

128k

128,000 tokens

128k context · Vision · Multimodal · JSON
Mistral AI Studio

$2.00 in / $6.00 out / 1M tokens

3 routes

Provider docs
ERNIE 4.5

ERNIE 4.5 is Baidu AI's ERNIE model. It offers an 8K-token context window.

2025-03-16

Researched 19d ago

8k

8,000 tokens

Vision · Multimodal
Fireworks AI

$1.20 in / $1.20 out / 1M tokens

2 routes

Provider docs
Cosmos 3 Nano

Cosmos 3 Nano is NVIDIA's 16B-parameter omnimodel optimized for efficient inference on workstation-grade hardware (NVIDIA RTX PRO 6000). Architecture: dual-tower Mixture-of-Transformers with an 8B autoregressive Reasoner and an 8B diffusion-based Generator. The Reasoner supports up to 256K tokens of context for vision-language reasoning; the Generator produces video up to 720p at variable frame rates (default 189 frames). Natively handles text, image, video, audio (48kHz stereo), and robot action trajectories across 10+ robot embodiments including Franka Panda, UR, Google robot, and UMI. BF16 precision only. Available as open weights on Hugging Face and via the Cosmos 3 Reasoner NIM (NIM_MODEL_SIZE=nano). Intended for real-time robotics inference and edge-adjacent deployment. Robot action input/output is preserved in this description because the model schema does not have a dedicated action modality field.

2026-05-31

Researched 22d ago

256k

256,000 tokens

256k context · Reasoning · Vision · Multimodal · Audio
NVIDIA NIM

Pricing not tracked

1 route

Provider docs
Cosmos 3 Super

Cosmos 3 Super is NVIDIA's flagship 64B-parameter omnimodel for physical AI, designed for large-scale synthetic data generation and high-fidelity simulation on NVIDIA Hopper and Blackwell datacenter GPUs. Architecture: dual-tower Mixture-of-Transformers with a 32B autoregressive Reasoner and a 32B diffusion-based Generator. Supports 256K token reasoning context, 720p video generation at variable frame rates, and 10+ robot embodiment action domains. Ranked #1 among open models on Physics-IQ, PAI-Bench, R-Bench, RoboLab, RoboArena, VANTAGE-Bench, TAR, and Artificial Analysis image/video leaderboards (Computex 2026). Training data: 1.3B data points across 393 datasets (2024-2026). Inference performance (vLLM-Omni): ~55s for 50-step video on 8xH200. Available as open weights on Hugging Face and via Cosmos 3 Reasoner NIM (NIM_MODEL_SIZE=super). Robot action input/output is preserved in this description because the model schema does not have a dedicated action modality field.

2026-05-31

Researched 22d ago

256k

256,000 tokens

256k context · Reasoning · Vision · Multimodal · Audio
NVIDIA NIM

Pricing not tracked

1 route

Provider docs
Perceptron Mk1

Perceptron Mk1 is a closed-source vision-language model for image and video understanding, OCR, object detection, captioning, video QA, and embodied reasoning. Perceptron documents Mk1 with 32K context, reasoning support, and standard pricing of $0.15 per 1M input tokens and $1.50 per 1M output tokens.

2026-05-12

Researched 35d ago

33k

32,768 tokens

Reasoning · Vision · Multimodal · JSON
OpenRouter

$0.150 in / $1.50 out / 1M tokens

1 route

Provider docs
Qianfan-OCR-Fast

Qianfan-OCR-Fast is Baidu Qianfan's speed-optimized OCR specialist surfaced on OpenRouter. It builds on the Qianfan-OCR document intelligence line for image-to-text, document parsing, layout analysis, chart understanding, and OCR-heavy extraction workflows.

2026-04-20

Researched 36d ago

66k

65,536 tokens

Vision · Multimodal
OpenRouter

$0.680 in / $2.81 out / 1M tokens

1 route

Provider docs
SEA-LION V4 27B Instruct

SEA-LION V4 27B Instruct is AI Singapore's Gemma 3 27B-based regional language model for Southeast Asian language tasks. It extends the SEA-LION line with continued pretraining and post-training for Burmese, Indonesian, Khmer, Lao, Malay, Mandarin, Tagalog, Tamil, Thai, Vietnamese, and English workloads, and is available through Cloudflare Workers AI.

2025-08-25

Researched 20d ago

128k

128,000 tokens

128k context · Vision · Multimodal · Tool use · Functions · JSON
Cloudflare Workers AI

Pricing not tracked

1 route

Provider docs
ERNIE X1

Baidu ERNIE X1 advanced model with enhanced reasoning for enterprise applications.

2025-03-16

Researched 19d ago

No window data

Vision · Multimodal
Baidu Qianfan

$0.300 in / $1.18 out / 1M tokens

1 route

Provider docs
MiniMax-01

MiniMax-01 combines MiniMax-Text-01 and MiniMax-VL-01, pairing a 456B-total-parameter MoE language model with multimodal understanding for long-context text generation and vision-language tasks.

2025-01-14

Researched 1d ago

4m

4,000,000 tokens

4m context · Vision · Multimodal
OpenRouter

$0.200 in / $1.10 out / 1M tokens

1 route

Provider docs
Llama 3.2 11B Vision

Multimodal 11B parameter model balancing capability and computational efficiency

2024-09-25

Researched 65d ago

128k

128,000 tokens

128k context · Vision · JSON
AWS Bedrock

$0.200 in / $0.270 out / 1M tokens

1 route

Provider docs
Llama 3.2 90B Vision

Advanced multimodal model with image reasoning, visual question answering, and document analysis

2024-09-25

Researched 65d ago

128k

128,000 tokens

128k context · Vision · JSON
AWS Bedrock

$1.35 in / $1.80 out / 1M tokens

1 route

Provider docs
Pixtral 12B Instruct

Instruction-tuned 12B multimodal model for conversational vision-language tasks and image analysis with efficient inference.

2024-09-12

Researched 173d ago

128k

128,000 tokens

128k context · Vision · Multimodal
Vercel AI Gateway

$0.150 in / $0.150 out / 1M tokens

1 route

Provider docs
GLM-4V 9B

GLM-4V 9B is Tsinghua Knowledge Engineering Group (THUDM)'s GLM-4 model with multimodal text and image input. It offers a 128K-token context window and scores 48.3 on MMMU.

2024-06-05

Researched 35d ago

131k

131,072 tokens

131k context · Multimodal
Replicate API

$0.050 in / $0.250 out / 1M tokens

1 route

Provider docs
PaliGemma 3B 896

PaliGemma 3B 896 is a lightweight vision-language model developed by Google, designed to process and integrate both images and text. Inspired by the PaLI-3 model, it combines the SigLIP vision model and the Gemma-2B language model, with a linear projection layer connecting visual and textual inputs. It handles tasks such as image captioning, visual question answering, object detection, and segmentation, and supports multilingual text processing. It requires task-specific fine-tuning for optimal performance, and it may encounter challenges with contextual understanding, biases, and computational demands.

2024-05-14

Researched 173d ago

512

512 tokens

Vision · Multimodal
NVIDIA NIM

Pricing not tracked

1 route

Provider docs
Qwen-Max

Closed-source flagship Qwen model with advanced reasoning capabilities for agent tasks.

2024-05-11

Researched 65d ago

128k

128,000 tokens

128k context · Vision · JSON
OpenRouter

$1.04 in / $4.16 out / 1M tokens

1 route

Provider docs
Reka Core

Reka's frontier-class multimodal model for complex tasks. Approaches the performance of frontier models from OpenAI, Google, and Anthropic.

2024-04-15

Researched 173d ago

128k

128,000 tokens

128k context · Multimodal · Functions
Reka Platform

$2.00 in / $6.00 out / 1M tokens

1 route

Provider docs
DeepSeek VL 7B

DeepSeek-VL 7B is an open-source vision-language model engineered for robust real-world applications. With its general multimodal understanding capabilities, it can process logical diagrams, web pages, formulas, scientific literature, natural images, and complex scenarios involving embodied intelligence. The model features a hybrid vision encoder that integrates SigLIP-L and SAM-B, allowing it to handle high-resolution (1024 x 1024) image inputs. Built on the DeepSeek-LLM-7b-base foundation, it is pre-trained on roughly 2 trillion text tokens and further trained on approximately 400 billion vision-language tokens. A standout variant, DeepSeek-VL-7b-chat, is specially optimized for conversational tasks, enhancing both performance and user experience by addressing the limitations of existing open-source multimodal models.

2024-03-15

Researched 173d ago

No window data

Vision · Multimodal
Replicate API

$0.050 in / $0.250 out / 1M tokens

1 route

Provider docs
Gemini 1.5 Flash on Google Vertex AI

Gemini 1.5 Flash on Google Vertex AI is Google DeepMind's Gemini 1.5 model with multimodal text and image input. It offers a 1M-token context window.

2024-02-15

Researched 35d ago

1m

1,000,000 tokens

1m context · Vision · Multimodal · JSON
GCP Vertex AI

$0.035 in / $0.105 out / 1M tokens

1 route

Provider docs
Gemini 1.5 Pro on Google Vertex AI

Gemini 1.5 Pro on Google Vertex AI is Google DeepMind's Gemini 1.5 model with multimodal text and image input. It offers a 1M-token context window.

2024-02-15

Researched 35d ago

1m

1,000,000 tokens

1m context · Vision · Multimodal · JSON
GCP Vertex AI

$0.125 in / $0.375 out / 1M tokens

1 route

Provider docs
Gemini 1.0 Pro on Google Vertex AI

Gemini 1.0 Pro on Google Vertex AI is Google DeepMind's Gemini 1.0 model with multimodal text and image input. It offers a 32K-token context window.

2023-12-06

Researched 35d ago

33k

32,768 tokens

Vision · Multimodal · JSON
GCP Vertex AI

$0.125 in / $0.375 out / 1M tokens

1 route

Provider docs
LLaVA 13B

Original LLaVA (Large Language-and-Vision Assistant) 13B model. Multimodal vision+language model combining a vision encoder with a language model for visual understanding tasks.

2023-04-17

Researched 173d ago

4k

4,000 tokens

Vision · Multimodal
Replicate API

Pricing not tracked

1 route

Provider docs
Grok Build 0.1

xAI's dedicated agentic coding model. 314B MoE architecture, 100+ tokens/second. Built for web development, debugging, and MCP-native multi-step agentic tasks. Accepts text and image inputs. Available in us-east-1 and eu-west-1. API aliases: grok-code-fast-1, grok-code-fast. Public API opened June 1, 2026. Pricing: $1/$2 per 1M tokens in/out.

2026-06-01

Researched 14d ago

256k

256,000 tokens

256k context · Reasoning · Vision · Multimodal · Tool use · Functions

No tracked provider route

Cosmos 3 Nano Policy DROID

Cosmos 3 Nano Policy DROID is a 16B-parameter robotics policy model fine-tuned from Cosmos 3 Nano on the DROID dataset. Given natural language instructions and visual observations from a robot camera (image or video), it generates robot action trajectories (JSON 1D list) for manipulation and control tasks. Compatible with multiple robot embodiments including Franka Panda (single/dual), UR, Google robot, WidowX 250, UMI, and Agibot. Supports 16-400 frame action sequences in various DoF configurations (9D-57D). Intended as a reference implementation for post-training Cosmos 3 Nano on specific robot platforms. The action output modality is represented in prose because the current model schema only has text, vision, video, audio, and related capability flags.

2026-05-31

Researched 22d ago

4k

4,000 tokens

Vision · Multimodal · JSON · Fine-tune

No tracked provider route

Cosmos 3 Super Image2Video

Cosmos 3 Super Image2Video is a 64B-parameter fine-tuned variant of Cosmos 3 Super specialized for temporally coherent image-to-video generation. Takes a single image (jpg/png/webp at 256p-720p) plus an optional text prompt (up to 4096 tokens) and outputs MP4 video with 5-400 frames (default 189) at up to 720p, with optional muxed AAC stereo audio at 48kHz. Ranked #1 on Artificial Analysis image-to-video leaderboard (open models). Available via Hugging Face Diffusers and vLLM-Omni.

2026-05-31

Researched 22d ago

4k

4,000 tokens

Vision · Multimodal · Audio · Fine-tune

No tracked provider route

Cosmos 3 Super Text2Image

Cosmos 3 Super Text2Image is a 64B-parameter fine-tuned variant of Cosmos 3 Super specialized for high-fidelity text-to-image generation. Takes text prompts up to 4096 tokens and outputs JPEG images at 256p, 480p, or 720p in aspect ratios 16:9, 4:3, 1:1, 3:4, or 9:16. Ranked #1 on Artificial Analysis text-to-image leaderboard (open models). Available via Hugging Face Diffusers (DiffusionPipeline) and vLLM-Omni.

2026-05-31

Researched 22d ago

4k

4,000 tokens

Multimodal · Fine-tune

No tracked provider route

Gemini 3.5 Flash

Google's strongest agentic and coding model, outperforming Gemini 3.1 Pro on coding and agentic benchmarks (Terminal-Bench 2.1: 76.2%, GDPval-AA: 1656 Elo, CharXiv Reasoning: 84.2%). Multimodal: text, vision, video, audio input. 1M-token context, 65K max output. Pricing: $1.50/$9.00 per 1M tokens in/out.

2026-05-19

Researched 14d ago

1m

1,000,000 tokens

1m context · Reasoning · Vision · Multimodal · Tool use · Functions

No tracked provider route

SenseNova-U1-8B

SenseNova-U1-8B is SenseTime's open-source 8B multimodal model released April 28, 2026. Dense MoT (Mixture-of-Tokens) backbone. Uses the NEO-Unify architecture that eliminates both the visual encoder and VAE, enabling native unified image understanding and generation in a single representation space — the first commercially viable model with this capability. Achieves commercial-grade image generation quality comparable to Qwen-Image 2.0 Pro. Apache 2.0 license. HuggingFace: OpenSenseNova/SenseNova-U1-8B-MoT. Source: https://github.com/OpenSenseNova/SenseNova-U1

2026-04-28

Researched 35d ago

No window data

Vision · Multimodal

No tracked provider route

Holo3-35B-A3B

Holo3-35B-A3B is H Company's 35B (3B active) open-weights sparse MoE computer-use VLM released March 31, 2026 under Apache 2.0. Fine-tuned from Qwen3.5-35B-A3B, it achieves 77.8% on OSWorld-Verified and is available for self-hosting through H Company's official Hugging Face repository. The official config exposes a 262,144-token context window.

2026-03-31

Researched 16d ago

262k

262,144 tokens

262k context · Vision · Multimodal · Tool use · Functions · JSON

No tracked provider route

Nemotron 3 VoiceChat

Nemotron 3 VoiceChat is NVIDIA AI's Nemotron 3 model with multimodal text and image input. It was released 2026-03-16.

2026-03-16

Researched 35d ago

No window data

Vision · Multimodal · Audio

No tracked provider route

GPT-5 Image Mini

GPT-5 Image Mini is a cost-effective image generation model that combines GPT-5 Mini's language capabilities with image generation at 400K context.

2025-10-01

Researched 57d ago

400k

400,000 tokens

400k context · Vision · Multimodal

No tracked provider route

Gemini 2.0 Flash-Lite (Preview 02-05)

Gemini 2.0 Flash Lite Preview (02-05). Retiring June 1, 2026. Migrate to Gemini 2.5 or Gemini 3 series.

2025-02-05

Researched 19d ago

1m

1,000,000 tokens

1m context · Vision · Multimodal

No tracked provider route

Gemini 2.0 Pro (Experimental 02-05)

Gemini 2.0 Pro (Experimental 02-05) is Google DeepMind's Gemini 2.0 model. Its knowledge cutoff is 2024-08.

2025-02-05

Researched 19d ago

1m

1,000,000 tokens

1m context · Vision · Multimodal · JSON

No tracked provider route

Doubao 1.5 Pro Vision 32K

Doubao 1.5 Pro Vision 32K is ByteDance's multimodal language model released January 22, 2025 alongside Doubao 1.5 Pro. Incorporates improvements in multi-modal data synthesis, dynamic resolution, and multi-modal alignment for image understanding tasks.

2025-01-22

Researched 19d ago

32k

32,000 tokens

Vision · Multimodal

No tracked provider route

GPT-4o Audio Preview (12-17)

GPT-4o Audio Preview (12-17) is OpenAI's GPT-4o Audio model. It offers a 128K-token context window.

2024-12-17

Researched 35d ago

128k

128,000 tokens

128k context · Vision · Audio · Code exec

No tracked provider route

Doubao Vision Lite 32K

Doubao Vision Lite 32K is ByteDance's Doubao Vision model. It offers a 32K-token context window.

2024-12-01

Researched 35d ago

32k

32,000 tokens

Vision

No tracked provider route

Doubao Vision Pro 32K

Doubao Vision Pro 32K is ByteDance's Doubao Vision model. It offers a 32K-token context window.

2024-12-01

Researched 35d ago

32k

32,000 tokens

Vision

No tracked provider route

GPT-4o (11-20)

GPT-4o (11-20) is OpenAI's GPT-4o model. It offers a 128K-token context window.

2024-11-20

Researched 35d ago

128k

128,000 tokens

128k context · Vision · Code exec

No tracked provider route

GPT-4o Audio Preview (10-01)

GPT-4o model with integrated audio I/O capabilities for multimodal interactions.

2024-10-01

Researched 173d ago

128k

128,000 tokens

128k context · Vision · Audio · Code exec

No tracked provider route

Pixtral 12B Base

12B multimodal Mistral model combining text and vision capabilities for image understanding and visual reasoning tasks.

2024-09-12

Researched 173d ago

128k

128,000 tokens

128k context · Vision · Multimodal

No tracked provider route

Phi 3.5 Vision Instruct

Phi 3.5 Vision Instruct is Microsoft Research's Phi-3 model with multimodal text and image input. It offers a 128K-token context window with weights openly available for self-hosting and scores 43 on MMMU.

2024-08-20

Researched 35d ago

128k

128,000 tokens

128k context · Vision · Multimodal

No tracked provider route

ChatGPT-4o

The chatgpt-4o-latest model version continuously points to the version of GPT-4o used in ChatGPT and is updated whenever there are significant changes.

2024-05-13

Researched 173d ago

128k

128,000 tokens

128k context · Vision · Code exec

No tracked provider route

Gemini 1.0 Pro Vision

Gemini 1.0 Pro Vision is Google's multimodal large language model for tasks involving both visual and textual data. It offers visual understanding, classification, and summarization, enabling the creation of content from images and videos. The model processes a range of visual and textual inputs, such as photographs, documents, and infographics, and can generate image descriptions and identify objects. It also supports zero-shot, one-shot, and few-shot learning. Gemini 1.0 Pro Vision is slated for deprecation, with a removal date set for April 9, 2025; users are directed to newer models such as Gemini 1.5 Pro and Gemini 1.5 Flash.

2024-04-29

Researched 65d ago

12k

12,000 tokens

Vision · JSON

No tracked provider route

DeepSeek VL 1.3B

DeepSeek VL 1.3B is an advanced vision-language (VL) model that integrates multimodal understanding capabilities, enabling it to process and interpret both images and text effectively. Featuring a SigLIP-L vision encoder for 384 x 384 pixel image inputs, it is built upon a foundation of extensive training on text and vision-language tokens. This open-source model supports tasks such as image captioning, visual question answering, and multimodal document understanding, while also excelling in scenarios requiring embodied intelligence. Despite its powerful features, it is compact with 1.3 billion parameters, making it resource-efficient for real-world applications and available on platforms like Hugging Face.

2024-03-15

Researched 173d ago

No window data

Vision · Multimodal

No tracked provider route

DeepSeek VL 1.3B Chat

DeepSeek VL 1.3B Chat is DeepSeek's DeepSeek VL model. Weights are openly available for self-hosting.

2024-03-15

Researched 35d ago

No window data

Vision · Multimodal

No tracked provider route

DeepSeek VL 7B Chat

DeepSeek VL 7B Chat is DeepSeek's DeepSeek VL model. Weights are openly available for self-hosting.

2024-03-15

Researched 35d ago

No window data

Vision · Multimodal

No tracked provider route