LLM Reference
Concepts & capability filters
Capability filter · capability · beginner

Vision

Also known as: image input, visual input, vision-language

284 matching active models · 29 tracked providers · 165 models with routes

Flags: model.vision · model.multimodal

Definition

Vision capability means a model can accept image or visual inputs alongside text, enabling document understanding, screenshot analysis, visual question answering, and image-grounded extraction. In LLM Reference this concept is used as a model-selection filter when a visual input flag is present in the model seed.
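The filtering step described above can be sketched as follows. The seed schema here (field names like `vision` and `multimodal`) is an assumption for illustration; the real LLM Reference seed format may differ.

```python
# Minimal sketch of the Vision capability filter. Field names are assumed,
# not taken from the actual seed format.
models = [
    {"name": "Qwen3.5-9B", "vision": True,  "multimodal": True},
    {"name": "Reka Flash", "vision": False, "multimodal": True},
    {"name": "Step-2",     "vision": True,  "multimodal": True},
]

def has_vision(model: dict) -> bool:
    # A model matches the Vision filter when its visual-input flag is set.
    return bool(model.get("vision"))

matches = [m["name"] for m in models if has_vision(m)]
print(matches)  # ['Qwen3.5-9B', 'Step-2']
```

Any model whose seed record lacks the flag (like Reka Flash above, which is tagged Multimodal but not Vision) simply drops out of the match list.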

Models With Vision

Showing the first 80 decision-sorted matches, with model flags and provider-route evidence from seed data.

284 matches
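When calling any of the vision-capable models below, the image typically travels as a structured content part next to the text prompt. This is a minimal sketch of the widely used OpenAI-style message shape; the model name and image bytes are placeholders, and individual providers in this list may expect a different request format.

```python
import base64
import json

# Build a chat payload that pairs a text prompt with an inline image,
# using the OpenAI-style content-part shape many vision endpoints accept.
png_bytes = b"\x89PNG\r\n\x1a\n"  # stand-in for real image bytes
data_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode()

payload = {
    "model": "some-vision-model",  # hypothetical model id
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this chart show?"},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }],
}
print(json.dumps(payload)[:60], "...")
```

The same shape extends to multiple images per message; models without the vision flag generally reject the `image_url` part outright.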
Qwen3.5-9B

Open-weight small dense Qwen3.5 model. Apache 2.0.

2026-03-02

Researched 1d ago

262K context (262,144 tokens) · Vision · Multimodal · Tool use · Functions · JSON
Alibaba Cloud PAI-EAS

$0.100 in / $0.150 out / 1M tokens

3 routes

Provider docs
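The route pricing lines above quote USD per 1M tokens for input and output separately. A quick sketch of turning those rates into a per-request cost (token counts below are made up for illustration):

```python
# Estimate one request's cost from per-1M-token route rates,
# e.g. "$0.100 in / $0.150 out / 1M tokens".
def request_cost(in_tokens: int, out_tokens: int,
                 in_rate: float, out_rate: float) -> float:
    """Rates are USD per 1M tokens; returns USD for one request."""
    return (in_tokens * in_rate + out_tokens * out_rate) / 1_000_000

# 8,000 prompt tokens + 1,000 completion tokens at the Qwen3.5-9B rates above:
cost = request_cost(8_000, 1_000, 0.100, 0.150)
print(f"${cost:.6f}")  # $0.000950
```

Routes listing "Pricing not tracked" have no rates in the seed, so no such estimate is possible for them.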

2023-11-06

Researched 134d ago

128K context (128,000 tokens) · Vision · Multimodal · Code exec

No tracked provider route

Xiaomi MiMo-V2.5

Xiaomi MiMo-V2.5 is the lower-cost native omnimodal sibling in the MiMo-V2.5 series. OpenRouter describes it as supporting text, image, audio, and video inputs with text output, Pro-level agentic performance at roughly half the inference cost, and improved multimodal perception over MiMo-V2-Omni. Xiaomi's official April 22 release page highlights MiMo-V2.5 alongside MiMo-V2.5-Pro in benchmark data and says the V2.5 series will be open-sourced soon; no public weights/license were verified at research time.

2026-04-22

Researched 22d ago

1M context (1,048,576 tokens) · Reasoning · Vision · Multimodal · Tool use · Functions
OpenRouter

$0.400 in / $2.00 out / 1M tokens

1 route

Provider docs
Kimi K2.6

Kimi K2.6 is Moonshot AI's latest agentic reasoning model, launched April 13 2026 as a code preview for Kimi Code subscribers. Built on a 1-trillion-parameter MoE architecture (32B active, 384 experts), it inherits K2.5's 256K context window and adds enhanced reliability for long-horizon agentic workflows — supporting 200–300 sequential tool calls without drift. Optimized for coding, multi-step agent planning, and vision-assisted tasks such as processing screenshots, PDFs, and spreadsheets.

2026-04-20

Researched 9d ago

262K context (262,144 tokens) · Reasoning · Vision · Multimodal · Tool use · Functions
OpenRouter

$0.750 in / $3.50 out / 1M tokens

4 routes · 1 cache

Provider docs
SenseNova V6

SenseNova V6 is SenseTime's latest large model with advanced multimodal reasoning and text generation at low cost. Features multimodal long chain-of-thought reasoning and reinforcement learning enhancements.

2025-04-01

Researched 134d ago

No window data · Vision · Multimodal
SenseTime API

Pricing not tracked

1 route

Provider docs
Step-2

Updated variant of Step model from StepFun. Closed-source, API-only.

2024-09-01

Researched 134d ago

256K context (256,000 tokens) · Vision · Multimodal · Functions
StepFun

Pricing not tracked

1 route

Provider docs
Step-1.5V

Step-1.5V is StepFun's multimodal language model with vision capabilities, building on Step-1 with image understanding.

2024-06-01

Researched 134d ago

128K context (128,000 tokens) · Vision · Multimodal
StepFun

Pricing not tracked

1 route

Provider docs
SenseNova-U1-A3B

SenseNova-U1-A3B is SenseTime's open-source multimodal MoE model released April 28, 2026. ~3B activated parameters (MoE backbone). Shares the NEO-Unify architecture with SenseNova-U1-8B: no visual encoder or VAE, native unified text-and-image representation. Apache 2.0 license. HuggingFace: OpenSenseNova/SenseNova-U1-A3B-MoT. Source: https://github.com/OpenSenseNova/SenseNova-U1

2026-04-28

Researched 13d ago

No window data · Vision · Multimodal

No tracked provider route

Phi-3 Vision

Phi-3 Vision is a sophisticated multimodal AI model from Microsoft, designed to adeptly integrate language and vision capabilities. Unlike traditional language models, it processes both text and images and can perform tasks such as optical character recognition, chart analysis, and image interpretation. Its architecture features an image encoder, a text-image connector, a projector for mapping image features, and the Phi-3 Mini language model. Despite its relatively small size of 4.2 billion parameters, it competes with larger models and suits devices with limited computational power. Phi-3 Vision's ability to handle up to 128K tokens supports complex multimodal reasoning. It draws upon high-quality and synthetic data for training while incorporating essential safety measures.

2024-05-21

Researched 134d ago

128K context (128,000 tokens) · Vision
Fireworks AI

$0.200 in / $0.200 out / 1M tokens

3 routes

Provider docs
Reka Edge

7B dense multimodal model. Outperforms much larger models.

2024-02-12

Researched 26d ago

64K context (64,000 tokens) · Multimodal · JSON
OpenRouter

$0.100 in / $0.100 out / 1M tokens

2 routes

Provider docs
Mistral Large 2 (2407)

Flagship sparse MoE Mistral model (675B total, 41B active) with 128K context and multimodal capabilities. Leads benchmarks in complex reasoning and long-context processing.

2024-07-23

Researched 22d ago

128K context (128,000 tokens) · Vision · JSON
Chutes AI

$0.500 in / $1.50 out / 1M tokens

3 routes

Claude 3 Sonnet

Claude 3 Sonnet by Anthropic is a versatile large language model that balances intelligence and speed for diverse enterprise use cases. It is part of the Claude 3 family, positioned between the more powerful Opus and the faster Haiku. Sonnet excels at nuanced content creation, accurate summarization, and complex scientific queries, and is proficient in non-English languages and coding tasks. It also offers strong vision capabilities, such as interpreting charts and graphs and transcribing text from imperfect images, which benefits industries like retail, logistics, and finance. Operating at twice the speed of Claude 3 Opus, Sonnet suits context-sensitive customer support and multi-step workflows. It has achieved AI Safety Level 2 (ASL-2) and is accessible through multiple platforms, including Claude.ai, the Claude iOS app, the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI.

2024-03-04

Researched 26d ago

200K context (200,000 tokens) · Reasoning · Vision · Multimodal · JSON · Code exec
AWS Bedrock

$3.00 in / $15.00 out / 1M tokens

2 routes · 1 cache

Provider docs
Reka Flash

21B multimodal model from Reka. State-of-the-art for its 21B class; supports text, image, and video inputs.

2024-02-12

Researched 134d ago

128K context (128,000 tokens) · Multimodal
Reka Platform

$0.200 in / $0.800 out / 1M tokens

1 route

Provider docs
Claude 3.7 Sonnet

Claude 3.7 Sonnet is Anthropic's advanced model with extended thinking capabilities, offering state-of-the-art reasoning for complex tasks.

2025-02-24

Researched 26d ago

200K context (200,000 tokens) · Reasoning · Vision · Multimodal · Tool use · Functions
AWS Bedrock

$3.00 in / $15.00 out / 1M tokens

6 routes · 1 batch

Provider docs
Llama 3.2 11B Vision Instruct

Instruction-tuned 11B Llama 3.2 Vision model for image reasoning, visual question answering, document understanding, and captioning. NVIDIA NIM lists text plus image input, text output, and a 128K context window for the Llama 3.2 Vision collection.

2024-09-25

Researched 9d ago

128K context (128,000 tokens) · Vision · Multimodal · JSON
Fireworks AI

$0.200 in / $0.200 out / 1M tokens

5 routes

Provider docs
Llama 3.2 90B Vision Instruct

Instruction-tuned 90B Llama 3.2 Vision model for higher-capability image reasoning, visual question answering, visual grounding, and captioning. NVIDIA NIM lists text plus image input, text output, and a 128K context window for the Llama 3.2 Vision collection.

2024-09-25

Researched 9d ago

128K context (128,000 tokens) · Vision · Multimodal
Bitdeer AI

$0.150 in / $0.450 out / 1M tokens

4 routes

Provider docs
Qwen3.6-Plus

Qwen3.6-Plus is a state-of-the-art multimodal AI model with massive capability upgrades over Qwen3.5. Features advanced agentic coding capabilities, enhanced visual reasoning across images and videos, improved STEM reasoning, precise information extraction from ultra-long contexts, and strong multilingual support. Sets new benchmarks in coding agents, general agents, and tool usage. Integrated reasoning, memory, and execution capabilities enable complex terminal operations, automated task execution, and long-horizon planning.

2026-04-01

Researched 3d ago

1M context (1,000,000 tokens) · Vision · Multimodal · Tool use · Functions
Alibaba Cloud PAI-EAS

$0.325 in / $1.95 out / 1M tokens

2 routes

Provider docs
GPT-5.2

GPT-5.2 is OpenAI's incremental update in the GPT-5 series, offering improvements in agentic coding and long-context performance at 400K context.

2025-12-11

Researched 1d ago

400K context (400,000 tokens) · Reasoning · Vision · Multimodal · Tool use · Functions
OpenRouter

$1.75 in / $14.00 out / 1M tokens

2 routes

Provider docs
Pixtral Large

Large multimodal Mistral model delivering advanced vision-language understanding with support for high-resolution images and complex visual reasoning.

2024-11-18

Researched 26d ago

128K context (128,000 tokens) · Vision · Multimodal · JSON
Mistral AI Studio

$2.00 in / $6.00 out / 1M tokens

2 routes

Provider docs
Llama 3.2 11B Vision

Multimodal 11B parameter model balancing capability and computational efficiency.

2024-09-25

Researched 26d ago

128K context (128,000 tokens) · Vision · JSON
AWS Bedrock

$0.200 in / $0.270 out / 1M tokens

1 route

Provider docs
Llama 3.2 90B Vision

Advanced multimodal model with image reasoning, visual question answering, and document analysis.

2024-09-25

Researched 26d ago

128K context (128,000 tokens) · Vision · JSON
AWS Bedrock

$1.35 in / $1.80 out / 1M tokens

1 route

Provider docs
GLM-4V 9B

Vision-language variant of GLM-4.

2024-06-05

Researched 134d ago

131K context (131,072 tokens) · Multimodal
Replicate API

$0.050 in / $0.250 out / 1M tokens

1 route

Provider docs
PaliGemma 3B 896

PaliGemma 3B 896 is a versatile, lightweight vision-language model from Google, designed to process and integrate both images and text. Inspired by the PaLI-3 model, it combines the SigLIP vision model and the Gemma-2B language model through a linear projection layer that fuses visual and textual inputs. It handles tasks such as image captioning, visual question answering, object detection, and segmentation, and supports multilingual text. While it requires task-specific fine-tuning for optimal performance, PaliGemma demonstrates strong capabilities across vision-language applications, though it may struggle with contextual understanding, biases, and computational demands.

2024-05-14

Researched 134d ago

512-token context · Vision · Multimodal
NVIDIA NIM

Pricing not tracked

1 route

Provider docs
Qwen-Max

Closed-source flagship Qwen model with advanced reasoning capabilities for agent tasks.

2024-05-11

Researched 26d ago

128K context (128,000 tokens) · Vision · JSON
OpenRouter

$1.04 in / $4.16 out / 1M tokens

1 route

Provider docs
Reka Core

Reka's frontier-class multimodal model for complex tasks. Approaches frontier models from OpenAI, Google, and Anthropic.

2024-04-15

Researched 134d ago

128K context (128,000 tokens) · Multimodal · Functions
Reka Platform

$10.00 in / $25.00 out / 1M tokens

1 route

Provider docs
DeepSeek VL 7B

DeepSeek-VL 7B is an open-source vision-language model engineered for robust real-world applications. With its general multimodal understanding capabilities, it can process logical diagrams, web pages, formulas, scientific literature, natural images, and complex scenarios involving embodied intelligence. The model features a hybrid vision encoder that integrates SigLIP-L and SAM-B, allowing it to handle high-resolution (1024 x 1024) image inputs. Built on the DeepSeek-LLM-7b-base foundation, it is pre-trained on roughly 2 trillion text tokens and further trained on approximately 400 billion vision-language tokens. A standout variant, DeepSeek-VL-7b-chat, is specially optimized for conversational tasks, enhancing both performance and user experience by addressing the limitations of existing open-source multimodal models.

2024-03-15

Researched 134d ago

No window data · Vision · Multimodal
Replicate API

$0.050 in / $0.250 out / 1M tokens

1 route

Provider docs
LLaVA 13B

Original LLaVA (Large Language-and-Vision Assistant) 13B model. Multimodal vision+language model combining a vision encoder with a language model for visual understanding tasks.

2023-04-17

Researched 134d ago

4K context (4,000 tokens) · Vision · Multimodal
Replicate API

Pricing not tracked

1 route

Provider docs
SenseNova-U1-8B

SenseNova-U1-8B is SenseTime's open-source 8B multimodal model released April 28, 2026. Dense MoT (Mixture-of-Tokens) backbone. Uses the NEO-Unify architecture that eliminates both the visual encoder and VAE, enabling native unified image understanding and generation in a single representation space — the first commercially viable model with this capability. Achieves commercial-grade image generation quality comparable to Qwen-Image 2.0 Pro. Apache 2.0 license. HuggingFace: OpenSenseNova/SenseNova-U1-8B-MoT. Source: https://github.com/OpenSenseNova/SenseNova-U1

2026-04-28

Researched 13d ago

No window data · Vision · Multimodal

No tracked provider route

Nemotron 3 VoiceChat

12B speech-to-speech model for low-latency, full-duplex voice conversations.

2026-03-16

Researched 134d ago

No window data · Vision · Multimodal

No tracked provider route

GPT-5 Image Mini

GPT-5 Image Mini is a cost-effective image generation model that combines GPT-5 Mini's language capabilities with image generation at 400K context.

2025-10-01

Researched 18d ago

400K context (400,000 tokens) · Vision · Multimodal

No tracked provider route

Doubao 1.5 Pro Vision 32K

Doubao 1.5 Pro Vision 32K is ByteDance's multimodal language model released January 22, 2025 alongside Doubao 1.5 Pro. Incorporates improvements in multi-modal data synthesis, dynamic resolution, and multi-modal alignment for image understanding tasks.

2025-01-22

Researched 18d ago

32K context (32,000 tokens) · Vision

No tracked provider route

GPT-4o Audio Preview (12-17)

Updated GPT-4o audio model with improved multimodal audio-text understanding.

2024-12-17

Researched 134d ago

128K context (128,000 tokens) · Vision · Code exec

No tracked provider route

2024-11-20

Researched 134d ago

128K context (128,000 tokens) · Vision · Code exec

No tracked provider route

GPT-4o Audio Preview (10-01)

GPT-4o model with integrated audio I/O capabilities for multimodal interactions.

2024-10-01

Researched 134d ago

128K context (128,000 tokens) · Vision · Code exec

No tracked provider route

Pixtral 12B Base

12B multimodal Mistral model combining text and vision capabilities for image understanding and visual reasoning tasks.

2024-09-12

Researched 134d ago

128K context (128,000 tokens) · Vision · Multimodal

No tracked provider route

Pixtral 12B Instruct

Instruction-tuned 12B multimodal model for conversational vision-language tasks and image analysis with efficient inference.

2024-09-12

Researched 134d ago

128K context (128,000 tokens) · Vision · Multimodal

No tracked provider route

2024-08-20

Researched 134d ago

128K context (128,000 tokens) · Vision · Multimodal

No tracked provider route

ChatGPT-4o

The chatgpt-4o-latest model alias continuously points to the GPT-4o version used in ChatGPT and is updated frequently when there are significant changes.

2024-05-13

Researched 134d ago

128K context (128,000 tokens) · Vision · Code exec

No tracked provider route

Gemini 1.0 Pro Vision

Gemini 1.0 Pro Vision is a multimodal large language model from Google that excels at tasks involving both visual and textual data. It offers advanced visual understanding, classification, and summarization, enabling content creation from images and videos. The model processes a range of visual and textual inputs, such as photographs, documents, and infographics, and can generate image descriptions and identify objects. It also supports zero-shot, one-shot, and few-shot learning, enhancing its adaptability to diverse applications. Despite these capabilities, Gemini 1.0 Pro Vision is slated for deprecation, with a removal date of April 9, 2025, prompting users to transition to updated models like Gemini 1.5 Pro and Gemini 1.5 Flash.

2024-04-29

Researched 26d ago

12K context (12,000 tokens) · Vision · JSON

No tracked provider route

DeepSeek VL 1.3B

DeepSeek VL 1.3B is an advanced vision-language (VL) model that integrates multimodal understanding capabilities, enabling it to process and interpret both images and text effectively. Featuring a SigLIP-L vision encoder for 384 x 384 pixel image inputs, it is built upon a foundation of extensive training on text and vision-language tokens. This open-source model supports tasks such as image captioning, visual question answering, and multimodal document understanding, while also excelling in scenarios requiring embodied intelligence. Despite its powerful features, it is compact with 1.3 billion parameters, making it resource-efficient for real-world applications and available on platforms like Hugging Face.

2024-03-15

Researched 134d ago

No window data · Vision · Multimodal

No tracked provider route


Palmyra Vision

Palmyra Vision is Writer's cutting-edge multimodal LLM that excels at interpreting and generating text from images, making it an ideal solution for various enterprise applications. Its robust capabilities include extracting text from handwritten notes, classifying objects in images, and analyzing visual data like charts and graphs. Surpassing models such as GPT-4V and Gemini 1.0 Ultra with an 84.4% score on the VQAv2 benchmark, Palmyra Vision is designed for seamless integration within Writer's AI platform, enabling custom application creation with minimal engineering. It supports areas like compliance, e-commerce, finance, and healthcare while offering scalable pricing at $0.015 per image or video second, and $22.50 per million text words.

2024-02-27

Researched 134d ago

No window data · Vision

No tracked provider route

GPT-5

OpenAI's previous-generation flagship reasoning model with configurable reasoning effort. Released August 2025. Supports minimal, low, medium, and high reasoning levels. Succeeded by GPT-5.1 and later models.

2025-08-07

Researched 5d ago

400K context (400,000 tokens) · Reasoning · Vision · Multimodal · Tool use · Functions
OpenAI API

$1.25 in / $10.00 out / 1M tokens

3 routes · 1 batch · 1 cache

Provider docs
GPT-5.1 Chat

GPT-5.1 Chat is the fast, lightweight conversational member of the GPT-5.1 family, optimized for low-latency chat at 128K context.

2025-12-01

Researched 18d ago

128K context (128,000 tokens) · Vision · Multimodal · Tool use · Functions · JSON

No tracked provider route

GPT-5 Chat

GPT-5 Chat is OpenAI's conversational variant of GPT-5 designed for advanced multimodal, context-aware enterprise conversations at 128K context.

2025-10-01

Researched 18d ago

128K context (128,000 tokens) · Vision · Multimodal · Tool use · Functions · JSON

No tracked provider route

GPT-5 Mini

Near-frontier intelligence for cost-sensitive, low-latency, high-volume workloads. Released August 2025. Replaces o4-mini (shutting down Oct 2026).

2025-08-07

Researched 5d ago

400K context (400,000 tokens) · Reasoning · Vision · Multimodal · Tool use · Functions
OpenAI API

$0.250 in / $2.00 out / 1M tokens

3 routes · 1 batch · 1 cache

Provider docs
GPT-5 Pro

GPT-5 Pro is OpenAI's most advanced GPT-5 tier, offering major improvements in reasoning, code quality, and user experience for enterprise and power-user applications at 400K context.

2025-10-01

Researched 18d ago

400K context (400,000 tokens) · Vision · Multimodal · Tool use · Functions · JSON

No tracked provider route

GPT-5 Nano

Fastest, cheapest GPT-5 variant for summarization and classification tasks. Also available via Realtime API.

2025-08-07

Researched 5d ago

400K context (400,000 tokens) · Reasoning · Vision · Multimodal · Tool use · Functions
OpenAI API

$0.050 in / $0.400 out / 1M tokens

3 routes · 1 batch · 1 cache

Provider docs
GPT-5.4 Pro

Premium extended-reasoning GPT-5.4 variant producing smarter and more precise responses. Replacement for o3-deep-research and o4-mini-deep-research. No prompt caching discount.

2026-03-01

Researched 5d ago

1.1M context (1,050,000 tokens) · Reasoning · Vision · Multimodal · Tool use · Functions
OpenAI API

$30.00 in / $180.00 out / 1M tokens

2 routes · 1 batch

Provider docs
Gemini 3 Flash

Speed-optimized Gemini 3 model from Google DeepMind with frontier intelligence. Combines high performance with lower cost and latency. 1M token context window.

2025-12-17

Researched 134d ago

1M context (1,000,000 tokens) · Vision · Multimodal · Tool use · Functions · Code exec
GCP Vertex AI

$0.100 in / $0.400 out / 1M tokens

2 routes

Provider docs
Gemini 3 Pro

Google DeepMind's most advanced reasoning Gemini model. Part of the Gemini 3 series with frontier-class intelligence, multimodal understanding, and 1M token context window.

2025-12-11

Researched 134d ago

1M context (1,000,000 tokens) · Vision · Multimodal · Tool use · Functions · Code exec
GCP Vertex AI

$1.25 in / $5.00 out / 1M tokens

2 routes

Provider docs
GPT-5.2 Pro

GPT-5.2 Pro is OpenAI's most advanced GPT-5.2 tier offering major improvements in agentic coding and long-context performance for enterprise use at 400K context.

2026-01-01

Researched 18d ago

400K context (400,000 tokens) · Vision · Multimodal · Tool use · Functions · JSON

No tracked provider route

GPT-5.1 Codex

GPT-5.1-Codex is a coding-specialized version of GPT-5.1, optimized for software engineering and agentic coding workflows at 400K context.

2025-12-01

Researched 18d ago

400K context (400,000 tokens) · Vision · Multimodal · Tool use · Functions · JSON

No tracked provider route

GPT-5 Codex

GPT-5 Codex is OpenAI's coding-specialized variant of GPT-5, optimized for software engineering workflows, code generation, and agentic coding tasks at 400K context.

2025-10-01

Researched 18d ago

400K context (400,000 tokens) · Vision · Multimodal · Tool use · Functions · JSON

No tracked provider route

Gemini 3 Flash Preview

Frontier-class performance rivaling larger models at a fraction of the cost. Most intelligent Gemini model built for speed, combining frontier intelligence with superior search and grounding. $0.50 input / $3.00 output per 1M tokens.

2025-12-17

Researched 26d ago

1M context (1,000,000 tokens) · Vision · Multimodal · Tool use · Functions · JSON
GCP Vertex AI

$0.500 in / $3.00 out / 1M tokens

3 routes

Provider docs
o3-pro

Advanced o3 reasoning model for complex math, science, and coding problems. Supports tools, vision, and extended thinking. Available to Pro users. Released June 10, 2025.

2025-06-10

Researched 26d ago

No window data · Reasoning · Vision · Multimodal · Tool use · Functions · JSON
OpenAI API

$20.00 in / $80.00 out / 1M tokens

2 routes

Provider docs
GPT-5.4 Image 2

GPT-5.4 Image 2 combines OpenAI's GPT-5.4 model with state-of-the-art image generation, enabling both advanced text understanding and high-quality image creation at 256K context.

2026-03-01

Researched 18d ago

256K context (256,000 tokens) · Vision · Multimodal
OpenRouter

$8.00 in / $15.00 out / 1M tokens

1 route

Provider docs
Seed 1.6 Flash

Seed 1.6 Flash is ByteDance Seed's ultra-fast multimodal thinking model supporting text and visual understanding at 256K context, optimized for low-latency inference.

2026-03-01

Researched 18d ago

256K context (256,000 tokens) · Reasoning · Vision · Multimodal · Tool use · Functions

No tracked provider route

GLM 4.5V

GLM-4.5V is a vision-language MoE model from Z.ai designed for multimodal agent applications, handling both image understanding and text generation at 64K context.

2026-01-01

Researched 18d ago

64K context (64,000 tokens) · Vision · Multimodal · Tool use · Functions

No tracked provider route

GPT-5 Image

GPT-5 Image combines OpenAI's GPT-5 language model with state-of-the-art image generation, enabling both text understanding and image creation within a single 400K context model.

2025-10-01

Researched 18d ago

400K context (400,000 tokens) · Vision · Multimodal · Tool use · Functions

No tracked provider route

Amazon Nova 2 Lite

Amazon Nova 2 Lite is a fast, cost-effective reasoning model for everyday workloads that processes text, images, and videos at 1M token context with improved reasoning over Nova Lite v1.

2026-03-01

Researched 18d ago

1M context (1,000,000 tokens) · Reasoning · Vision · Multimodal · Tool use · Functions

No tracked provider route

Seed 1.6

Seed 1.6 is a general-purpose multimodal model from ByteDance Seed supporting text, image, and video inputs. It incorporates multimodal capabilities and deep thinking for complex tasks at 256K context.

2026-03-01

Researched 18d ago

256K context (256,000 tokens) · Reasoning · Vision · Multimodal · Tool use · Functions

No tracked provider route

o3 Deep Research

o3-deep-research is OpenAI's advanced model for deep research, designed to tackle complex multi-step research tasks by synthesizing information from multiple sources at 200K context.

2025-10-10

Researched 1d ago

200K context (200,000 tokens) · Reasoning · Vision · Multimodal · Tool use · Functions

No tracked provider route

Qwen3 VL 8B Instruct

Qwen3-VL-8B-Instruct is a compact 8B multimodal vision-language model from Alibaba, delivering high-fidelity image understanding and grounding at 128K context.

2025-09-18

Researched 1d ago

128K context (128,000 tokens) · Vision · Multimodal · Tool use · Functions · JSON

No tracked provider route

GPT-4.1

OpenAI's GPT-4.1 model released April 2025, excelling at coding tasks, precise instruction following, and web development. Outperforms GPT-4o in these areas with a 1 million token context window. Available via API and in ChatGPT for Plus, Pro, Team, Enterprise, and Edu users.

2025-04-01

Researched 5d ago

1M context (1,047,576 tokens) · Vision · Multimodal · Tool use · Functions · JSON
OpenAI API

$2.00 in / $8.00 out / 1M tokens

3 routes · 1 batch · 1 cache

Provider docs
GLM 4.6V

GLM-4.6V is Z.ai's large multimodal model for high-fidelity visual understanding and long-context reasoning across images, charts, and documents at 128K context.

2026-02-01

Researched 18d ago

128K context (128,000 tokens) · Vision · Multimodal · Tool use · Functions

No tracked provider route

Qwen3 VL 30B A3B Instruct

Qwen3-VL-30B-A3B-Instruct is a multimodal MoE model from Alibaba unifying text generation with visual understanding for images, charts, and documents at 128K context.

2025-09-18

Researched 1d ago

128K context (128,000 tokens) · Vision · Multimodal · Tool use · Functions · JSON

No tracked provider route

GPT-4.1 Mini

Fast and efficient small model from OpenAI replacing GPT-4o mini. Released April 2025 alongside GPT-4.1. Shows improvements in instruction-following, coding, and intelligence with a 1 million token context window. Available in ChatGPT for paid users.

2025-04-01

Researched 5d ago

1M context (1,047,576 tokens) · Vision · Multimodal · Tool use · Functions · JSON
OpenAI API

$0.400 in / $1.60 out / 1M tokens

3 routes · 1 cache

Provider docs
Qwen3.6-27B

Open-weight dense Qwen3.6 27B model with native multimodal support across text, image, and video. Apache 2.0.

2026-04-27

Researched 1d ago

262K context (262,144 tokens) · Reasoning · Vision · Multimodal · Tool use · Functions
Alibaba Cloud PAI-EAS

$0.320 in / $3.20 out / 1M tokens

2 routes

Provider docs