Also known as: image input, visual input, vision-language
284 matching active models
29 tracked providers
165 models with routes
Vision capability means a model can accept image or visual inputs alongside text, enabling document understanding, screenshot analysis, visual question answering, and image-grounded extraction. In LLMReference this concept is used as a model-selection filter when a visual input flag is present in the model seed.
Showing the first 80 decision-sorted matches, with model flags and provider-route evidence from seed data.
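The selection logic described above reduces to a simple predicate over the model seed. Below is a minimal sketch of that filter, assuming illustrative field names (vision, context_window, providers); the actual LLMReference seed schema is not shown on this page.

from dataclasses import dataclass, field

@dataclass
class ModelSeed:
    name: str
    vision: bool = False               # visual-input flag from the seed
    context_window: int | None = None  # max tokens; None if no window data
    providers: list[str] = field(default_factory=list)  # tracked provider routes

def vision_matches(seeds: list[ModelSeed]) -> list[ModelSeed]:
    # Keep only models whose seed carries the visual-input flag.
    return [s for s in seeds if s.vision]

seeds = [
    ModelSeed("llama-3.2-11b-vision", vision=True,
              context_window=128_000, providers=["nvidia-nim"]),
    ModelSeed("text-only-model", vision=False, context_window=32_000),
]

for m in vision_matches(seeds):
    window = f"{m.context_window:,} tokens" if m.context_window else "No window data"
    route = ", ".join(m.providers) if m.providers else "No tracked provider route"
    print(m.name, "·", window, "·", route)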
Vision-language variant of NVIDIA Nemotron Nano 8B with multimodal capabilities.
2025-03-01
Researched 134d ago
4K
4,000 tokens
Open-weight small dense Qwen3.5 model. Apache 2.0.
2026-03-02
Researched 1d ago
262K
262,144 tokens
2024-09-25
Researched 134d ago
—
No window data
No tracked provider route
2023-11-06
Researched 134d ago
128K
128,000 tokens
No tracked provider route
Xiaomi MiMo-V2.5 is the lower-cost native omnimodal sibling in the MiMo-V2.5 series. OpenRouter describes it as supporting text, image, audio, and video inputs with text output, Pro-level agentic performance at roughly half the inference cost, and improved multimodal perception over MiMo-V2-Omni. Xiaomi's official April 22 release page highlights MiMo-V2.5 alongside MiMo-V2.5-Pro in benchmark data and says the V2.5 series will be open-sourced soon; no public weights/license were verified at research time.
2026-04-22
Researched 22d ago
1M
1,048,576 tokens
Kimi K2.6 is Moonshot AI's latest agentic reasoning model, launched April 13, 2026 as a code preview for Kimi Code subscribers. Built on a 1-trillion-parameter MoE architecture (32B active, 384 experts), it inherits K2.5's 256K context window and adds enhanced reliability for long-horizon agentic workflows, supporting 200–300 sequential tool calls without drift. Optimized for coding, multi-step agent planning, and vision-assisted tasks such as processing screenshots, PDFs, and spreadsheets.
2026-04-20
Researched 9d ago
262K
262,144 tokens
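As a rough illustration of the sparsity implied by the figures quoted above (1 trillion total parameters, 32B active), only a few percent of the network is evaluated per token:

total_params = 1_000_000_000_000   # ~1 trillion total, as quoted above
active_params = 32_000_000_000     # ~32 billion routed per token
print(f"active fraction ≈ {active_params / total_params:.1%}")  # ≈ 3.2%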
SenseNova V6 is SenseTime's latest large model with advanced multimodal reasoning and text generation at low cost. Features multimodal long chain-of-thought reasoning and reinforcement learning enhancements.
2025-04-01
Researched 134d ago
—
No window data
Updated variant of Step model from StepFun. Closed-source, API-only.
2024-09-01
Researched 134d ago
256K
256,000 tokens
Step-1.5V is StepFun's multimodal language model with vision capabilities, building on Step-1 with image understanding.
2024-06-01
Researched 134d ago
128K
128,000 tokens
SenseNova-U1-A3B is SenseTime's open-source multimodal MoE model released April 28, 2026. ~3B activated parameters (MoE backbone). Shares the NEO-Unify architecture with SenseNova-U1-8B: no visual encoder or VAE, native unified text-and-image representation. Apache 2.0 license. HuggingFace: OpenSenseNova/SenseNova-U1-A3B-MoT. Source: https://github.com/OpenSenseNova/SenseNova-U1
2026-04-28
Researched 13d ago
—
No window data
No tracked provider route
Phi-3 Vision is a sophisticated multimodal AI model from Microsoft, designed to adeptly integrate language and vision capabilities. Unlike traditional language models, it processes both text and images and can perform tasks such as optical character recognition, chart analysis, and image interpretation. Its architecture features an image encoder, a text-image connector, a projector for mapping image features, and the Phi-3 Mini language model. Despite its relatively small size of 4.2 billion parameters, it competes with larger models and suits devices with limited computational power. Phi-3 Vision's ability to handle up to 128K tokens supports complex multimodal reasoning. It draws upon high-quality and synthetic data for training while incorporating essential safety measures.
2024-05-21
Researched 134d ago
128K
128,000 tokens
7B dense multimodal model. Outperforms much larger models.
2024-02-12
Researched 26d ago
64K
64,000 tokens
Flagship sparse MoE Mistral model (675B total, 41B active) with 256K context and multimodal capabilities. Leads benchmarks in complex reasoning and long-context processing.
2024-07-23
Researched 22d ago
128K
128,000 tokens
Claude 3 Sonnet by Anthropic is a versatile large language AI model, balancing intelligence and speed for diverse enterprise use cases. It is part of the Claude 3 family, positioned between the more powerful Opus and the faster Haiku models. Sonnet excels in nuanced content creation, accurate summarization, and complex scientific query handling, and is also proficient in non-English languages and coding tasks. It adds strong vision capabilities, with skills in visual reasoning such as interpreting charts and graphs and transcribing text from imperfect images, which benefits industries like retail, logistics, and finance. Operating at twice the speed of Claude 3 Opus, Sonnet is efficient in context-sensitive customer support and multi-step workflows. It has achieved AI Safety Level 2 (ASL-2) and is accessible through multiple platforms, including Claude.ai, the Claude iOS app, the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI.
2024-03-04
Researched 26d ago
200K
200,000 tokens
21B multimodal model from Reka. State-of-the-art for its size class; supports text/image/video inputs.
2024-02-12
Researched 134d ago
128K
128,000 tokens
Claude 3.7 Sonnet is Anthropic's advanced model with extended thinking capabilities, offering state-of-the-art reasoning for complex tasks.
2024-03-04
Researched 26d ago
200K
200,000 tokens
Instruction-tuned 11B Llama 3.2 Vision model for image reasoning, visual question answering, document understanding, and captioning. NVIDIA NIM lists text plus image input, text output, and a 128K context window for the Llama 3.2 Vision collection.
2024-09-25
Researched 9d ago
128K
128,000 tokens
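Vision-capable models like this one are typically called through an OpenAI-compatible chat-completions route (the style of API NVIDIA NIM exposes). The sketch below shows one common request shape for sending an image alongside a text prompt; the base URL, model id, API key, and the exact image-content format are assumptions to verify against the provider's documentation.

import base64
import requests

BASE_URL = "https://example-provider.invalid/v1"   # placeholder endpoint
MODEL_ID = "meta/llama-3.2-11b-vision-instruct"    # illustrative model id
API_KEY = "YOUR_API_KEY"

with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": MODEL_ID,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the chart in one sentence."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    "max_tokens": 256,
}

resp = requests.post(f"{BASE_URL}/chat/completions",
                     headers={"Authorization": f"Bearer {API_KEY}"},
                     json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])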
Instruction-tuned 90B Llama 3.2 Vision model for higher-capability image reasoning, visual question answering, visual grounding, and captioning. NVIDIA NIM lists text plus image input, text output, and a 128K context window for the Llama 3.2 Vision collection.
2024-09-25
Researched 9d ago
128K
128,000 tokens
Qwen3.6-Plus is a state-of-the-art multimodal AI model with massive capability upgrades over Qwen3.5. Features advanced agentic coding capabilities, enhanced visual reasoning across images and videos, improved STEM reasoning, precise information extraction from ultra-long contexts, and strong multilingual support. Sets new benchmarks in coding agents, general agents, and tool usage. Integrated reasoning, memory, and execution capabilities enable complex terminal operations, automated task execution, and long-horizon planning.
2026-04-01
Researched 3d ago
1M
1,000,000 tokens
GPT-5.2 is OpenAI's incremental update in the GPT-5 series, offering improvements in agentic coding and long-context performance at 400K context.
2025-12-11
Researched 1d ago
400K
400,000 tokens
12B vision-language model with multimodal understanding
2025-10-28
Researched 26d ago
—
No window data
Large multimodal Mistral model delivering advanced vision-language understanding with support for high-resolution images and complex visual reasoning.
2024-11-18
Researched 26d ago
128K
128,000 tokens
Multimodal 11B parameter model balancing capability and computational efficiency
2024-09-25
Researched 26d ago
128K
128,000 tokens
Advanced multimodal model with image reasoning, visual question answering, and document analysis
2024-09-25
Researched 26d ago
128K
128,000 tokens
Vision-language variant of GLM-4.
2024-06-05
Researched 134d ago
131K
131,072 tokens
PaliGemma 3B 896 is a versatile and lightweight vision-language model developed by Google, designed to process and integrate both images and text. Inspired by the PaLI-3 model, it employs components like the SigLIP vision model and the Gemma-2B language model, featuring a linear projection layer for seamless integration of visual and textual inputs. Capable of handling tasks such as image captioning, visual question answering, object detection, and segmentation, it supports multilingual text processing. Despite requiring task-specific fine-tuning for optimal performance, PaliGemma demonstrates strong capabilities across various vision-language applications, although it may encounter challenges with contextual understanding, biases, and computational demands.
2024-05-14
Researched 134d ago
512
512 tokens
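The linear projection layer mentioned above can be pictured as a single matrix that maps vision-encoder patch embeddings into the language model's embedding space, after which image and text tokens form one input sequence. The toy sketch below uses illustrative dimensions; they are not the published PaliGemma shapes or weights.

import numpy as np

rng = np.random.default_rng(0)

num_patches, vision_dim = 4096, 1152   # illustrative SigLIP patch count and width
lm_dim, num_text_tokens = 2048, 16     # illustrative LM hidden size and prompt length

patch_embeds = rng.normal(size=(num_patches, vision_dim))   # vision encoder output
W_proj = rng.normal(size=(vision_dim, lm_dim)) * 0.02       # the linear projector

image_tokens = patch_embeds @ W_proj                         # (num_patches, lm_dim)
text_tokens = rng.normal(size=(num_text_tokens, lm_dim))     # LM token embeddings

lm_input = np.concatenate([image_tokens, text_tokens], axis=0)  # one combined sequence
print(lm_input.shape)  # (4112, 2048)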
Closed-source flagship Qwen model with advanced reasoning capabilities for agent tasks.
2024-05-11
Researched 26d ago
128K
128,000 tokens
Reka's frontier-class multimodal model for complex tasks. Approaches OpenAI/Google/Anthropic frontier models.
2024-04-15
Researched 134d ago
128K
128,000 tokens
DeepSeek-VL 7B is an open-source vision-language model engineered for robust real-world applications. With its general multimodal understanding capabilities, it can process logical diagrams, web pages, formulas, scientific literature, natural images, and complex scenarios involving embodied intelligence. The model features a hybrid vision encoder that integrates SigLIP-L and SAM-B, allowing it to handle high-resolution (1024 x 1024) image inputs. Built on the DeepSeek-LLM-7b-base foundation, it is pre-trained on roughly 2 trillion text tokens and further trained on approximately 400 billion vision-language tokens. A standout variant, DeepSeek-VL-7b-chat, is specially optimized for conversational tasks, enhancing both performance and user experience by addressing the limitations of existing open-source multimodal models.
2024-03-15
Researched 134d ago
—
No window data
Gemini 1.5 Flash via Google Vertex AI (standard context)
2024-02-15
Researched 26d ago
1M
1,000,000 tokens
Gemini 1.5 Flash via Google Vertex AI (128K-1M context)
2024-02-15
Researched 26d ago
1M
1,000,000 tokens
Gemini 1.5 Pro via Google Vertex AI (standard context)
2024-02-15
Researched 26d ago
1M
1,000,000 tokens
Gemini 1.5 Pro via Google Vertex AI (128K-1M context)
2024-02-15
Researched 26d ago
1M
1,000,000 tokens
Gemini 1.0 Pro via Google Vertex AI
2023-12-06
Researched 26d ago
33K
32,768 tokens
Original LLaVA (Large Language-and-Vision Assistant) 13B model. Multimodal vision+language model combining a vision encoder with a language model for visual understanding tasks.
2023-04-17
Researched 134d ago
4K
4,000 tokens
SenseNova-U1-8B is SenseTime's open-source 8B multimodal model released April 28, 2026. Dense MoT (Mixture-of-Tokens) backbone. Uses the NEO-Unify architecture that eliminates both the visual encoder and VAE, enabling native unified image understanding and generation in a single representation space — the first commercially viable model with this capability. Achieves commercial-grade image generation quality comparable to Qwen-Image 2.0 Pro. Apache 2.0 license. HuggingFace: OpenSenseNova/SenseNova-U1-8B-MoT. Source: https://github.com/OpenSenseNova/SenseNova-U1
2026-04-28
Researched 13d ago
—
No window data
No tracked provider route
12B speech-to-speech model for low-latency full-duplex voice conversations
2026-03-16
Researched 134d ago
—
No window data
No tracked provider route
GPT-5 Image Mini is a cost-effective image generation model that combines GPT-5 Mini's language capabilities with image generation at 400K context.
2025-10-01
Researched 18d ago
400K
400,000 tokens
No tracked provider route
Doubao 1.5 Pro Vision 32K is ByteDance's multimodal language model released January 22, 2025 alongside Doubao 1.5 Pro. Incorporates improvements in multi-modal data synthesis, dynamic resolution, and multi-modal alignment for image understanding tasks.
2025-01-22
Researched 18d ago
32K
32,000 tokens
No tracked provider route
Updated GPT-4o audio model with improved multimodal audio-text understanding.
2024-12-17
Researched 134d ago
128K
128,000 tokens
No tracked provider route
2024-12-01
Researched 134d ago
32K
32,000 tokens
No tracked provider route
2024-12-01
Researched 134d ago
32K
32,000 tokens
No tracked provider route
2024-11-20
Researched 134d ago
128K
128,000 tokens
No tracked provider route
GPT-4o model with integrated audio I/O capabilities for multimodal interactions.
2024-10-01
Researched 134d ago
128K
128,000 tokens
No tracked provider route
12B multimodal Mistral model combining text and vision capabilities for image understanding and visual reasoning tasks.
2024-09-12
Researched 134d ago
128K
128,000 tokens
No tracked provider route
Instruction-tuned 12B multimodal model for conversational vision-language tasks and image analysis with efficient inference.
2024-09-12
Researched 134d ago
128K
128,000 tokens
No tracked provider route
2024-08-20
Researched 134d ago
128K
128,000 tokens
No tracked provider route
The chatgpt-4o-latest model version continuously points to the version of GPT-4o used in ChatGPT and is updated frequently when there are significant changes.
2024-05-13
Researched 134d ago
128K
128,000 tokens
No tracked provider route
Gemini 1.0 Pro Vision is a multimodal large language model crafted by Google, excelling in tasks involving both visual and textual data. It boasts advanced capabilities in visual understanding, classification, and summarization, enabling the creation of content from images and videos. The model adeptly processes a range of visual and textual inputs, such as photographs, documents, and infographics, and is capable of generating image descriptions and object identification. Moreover, it supports zero-shot, one-shot, and few-shot learning, enhancing its adaptability to diverse applications. Despite its powerful features, Gemini 1.0 Pro Vision is slated for deprecation, with a removal date set for April 9, 2025, prompting users to transition to updated models like Gemini 1.5 Pro and Gemini 1.5 Flash.
2024-04-29
Researched 26d ago
12K
12,000 tokens
No tracked provider route
DeepSeek VL 1.3B is an advanced vision-language (VL) model that integrates multimodal understanding capabilities, enabling it to process and interpret both images and text effectively. Featuring a SigLIP-L vision encoder for 384 x 384 pixel image inputs, it is built upon a foundation of extensive training on text and vision-language tokens. This open-source model supports tasks such as image captioning, visual question answering, and multimodal document understanding, while also excelling in scenarios requiring embodied intelligence. Despite its powerful features, it is compact with 1.3 billion parameters, making it resource-efficient for real-world applications and available on platforms like Hugging Face.
2024-03-15
Researched 134d ago
—
No window data
No tracked provider route
2024-03-15
Researched 134d ago
—
No window data
No tracked provider route
2024-03-15
Researched 134d ago
—
No window data
No tracked provider route
Palmyra Vision is Writer's cutting-edge multimodal LLM that excels at interpreting and generating text from images, making it an ideal solution for various enterprise applications. Its robust capabilities include extracting text from handwritten notes, classifying objects in images, and analyzing visual data like charts and graphs. Surpassing models such as GPT-4V and Gemini 1.0 Ultra with an 84.4% score on the VQAv2 benchmark, Palmyra Vision is designed for seamless integration within Writer's AI platform, enabling custom application creation with minimal engineering. It supports areas like compliance, e-commerce, finance, and healthcare while offering scalable pricing at $0.015 per image or video second, and $22.50 per million text words.
2024-02-27
Researched 134d ago
—
No window data
No tracked provider route
OpenAI's previous-generation reasoning model with configurable reasoning effort. Released August 2025. Supports minimal, low, medium, and high reasoning levels. Succeeded by GPT-5.1 and later models.
2025-08-07
Researched 5d ago
400K
400,000 tokens
GPT-5.1 Chat is the fast, lightweight conversational member of the GPT-5.1 family, optimized for low-latency chat at 128K context.
2025-12-01
Researched 18d ago
128K
128,000 tokens
No tracked provider route
GPT-5 Chat is OpenAI's conversational variant of GPT-5 designed for advanced multimodal, context-aware enterprise conversations at 128K context.
2025-10-01
Researched 18d ago
128K
128,000 tokens
No tracked provider route
Near-frontier intelligence for cost-sensitive, low-latency, high-volume workloads. Released August 2025. Replaces o4-mini (shutting down Oct 2026).
2025-08-07
Researched 5d ago
400K
400,000 tokens
GPT-5 Pro is OpenAI's most advanced GPT-5 tier, offering major improvements in reasoning, code quality, and user experience for enterprise and power-user applications at 400K context.
2025-10-01
Researched 18d ago
400K
400,000 tokens
No tracked provider route
Fastest, cheapest GPT-5 variant for summarization and classification tasks. Also available via Realtime API.
2025-08-07
Researched 5d ago
400K
400,000 tokens
Premium extended-reasoning GPT-5.4 variant producing smarter and more precise responses. Replacement for o3-deep-research and o4-mini-deep-research. No prompt caching discount.
2026-03-01
Researched 5d ago
1.1M
1,050,000 tokens
Speed-optimized Gemini 3 model from Google DeepMind with frontier intelligence. Combines high performance with lower cost and latency. 1M token context window.
2025-12-17
Researched 134d ago
1M
1,000,000 tokens
Google DeepMind's most advanced reasoning Gemini model. Part of the Gemini 3 series with frontier-class intelligence, multimodal understanding, and 1M token context window.
2025-12-11
Researched 134d ago
1M
1,000,000 tokens
GPT-5.2 Pro is OpenAI's most advanced GPT-5.2 tier offering major improvements in agentic coding and long-context performance for enterprise use at 400K context.
2026-01-01
Researched 18d ago
400K
400,000 tokens
No tracked provider route
GPT-5.1-Codex is a coding-specialized version of GPT-5.1, optimized for software engineering and agentic coding workflows at 400K context.
2025-12-01
Researched 18d ago
400K
400,000 tokens
No tracked provider route
GPT-5 Codex is OpenAI's coding-specialized variant of GPT-5, optimized for software engineering workflows, code generation, and agentic coding tasks at 400K context.
2025-10-01
Researched 18d ago
400K
400,000 tokens
No tracked provider route
Frontier-class performance rivaling larger models at a fraction of the cost. Most intelligent Gemini model built for speed, combining frontier intelligence with superior search and grounding. $0.50 input / $3.00 output per 1M tokens.
2025-12-17
Researched 26d ago
1M
1,000,000 tokens
Advanced o3 reasoning model for complex math, science, and coding problems. Supports tools, vision, and extended thinking. Available to Pro users. Released June 10, 2025.
2025-06-10
Researched 26d ago
—
No window data
GPT-5.4 Image 2 combines OpenAI's GPT-5.4 model with state-of-the-art image generation, enabling both advanced text understanding and high-quality image creation at 256K context.
2026-03-01
Researched 18d ago
256K
256,000 tokens
Seed 1.6 Flash is ByteDance Seed's ultra-fast multimodal thinking model supporting text and visual understanding at 256K context, optimized for low-latency inference.
2026-03-01
Researched 18d ago
256K
256,000 tokens
No tracked provider route
GLM-4.5V is a vision-language MoE model from Z.ai designed for multimodal agent applications, handling both image understanding and text generation at 64K context.
2026-01-01
Researched 18d ago
64K
64,000 tokens
No tracked provider route
GPT-5 Image combines OpenAI's GPT-5 language model with state-of-the-art image generation, enabling both text understanding and image creation within a single 400K context model.
2025-10-01
Researched 18d ago
400K
400,000 tokens
No tracked provider route
Amazon Nova 2 Lite is a fast, cost-effective reasoning model for everyday workloads that processes text, images, and videos at 1M token context with improved reasoning over Nova Lite v1.
2026-03-01
Researched 18d ago
1M
1,000,000 tokens
No tracked provider route
Seed 1.6 is a general-purpose multimodal model from ByteDance Seed supporting text, image, and video inputs. It incorporates multimodal capabilities and deep thinking for complex tasks at 256K context.
2026-03-01
Researched 18d ago
256K
256,000 tokens
No tracked provider route
o3-deep-research is OpenAI's advanced model for deep research, designed to tackle complex multi-step research tasks by synthesizing information from multiple sources at 200K context.
2025-10-10
Researched 1d ago
200K
200,000 tokens
No tracked provider route
Qwen3-VL-8B-Instruct is a compact 8B multimodal vision-language model from Alibaba, delivering high-fidelity image understanding and grounding at 128K context.
2025-09-18
Researched 1d ago
128K
128,000 tokens
No tracked provider route
OpenAI's GPT-4.1 model released April 2025, excelling at coding tasks, precise instruction following, and web development. Outperforms GPT-4o in these areas with a 1 million token context window. Available via API and in ChatGPT for Plus, Pro, Team, Enterprise, and Edu users.
2025-04-01
Researched 5d ago
1M
1,047,576 tokens
GLM-4.6V is Z.ai's large multimodal model for high-fidelity visual understanding and long-context reasoning across images, charts, and documents at 128K context.
2026-02-01
Researched 18d ago
128K
128,000 tokens
No tracked provider route
Qwen3-VL-30B-A3B-Instruct is a multimodal MoE model from Alibaba unifying text generation with visual understanding for images, charts, and documents at 128K context.
2025-09-18
Researched 1d ago
128K
128,000 tokens
No tracked provider route
Fast and efficient small model from OpenAI replacing GPT-4o mini. Released April 2025 alongside GPT-4.1. Shows improvements in instruction-following, coding, and intelligence with a 1 million token context window. Available in ChatGPT for paid users.
2025-04-01
Researched 5d ago
1M
1,047,576 tokens
Open-weight dense Qwen3.6 27B model with native multimodal support across text, image, and video. Apache 2.0.
2026-04-27
Researched 1d ago
262K
262,144 tokens