Also known as: multi-modal, multimodality
See matching models with benchmark scores and pricing.
443 matching active models, 46 tracked providers, 299 models with routes.
Multimodal refers to LLMs extended to process, and in some cases generate, multiple data modalities, such as text paired with images, audio, or video. Modality-specific encoders are typically connected to the core transformer through projection layers, cross-attention, or unified tokenization, which enables tasks like visual question answering and captioning.
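As a rough illustration of how a modality-specific encoder can feed the core transformer, the sketch below projects image patch features into the text embedding dimension and prepends them to the text tokens. It is a toy under stated assumptions: the dimensions, weights, and function names (`project`, `build_multimodal_sequence`) are hypothetical, and real models may use cross-attention or other connectors instead of a plain linear projection.

```python
from typing import List

Vector = List[float]


def project(features: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Linear projection: map each d_in feature vector to d_model.

    `weight` has shape d_in x d_model (hypothetical toy values below).
    """
    d_model = len(weight[0])
    return [
        [sum(f[i] * weight[i][j] for i in range(len(f))) for j in range(d_model)]
        for f in features
    ]


def build_multimodal_sequence(
    image_features: List[Vector],
    text_embeddings: List[Vector],
    weight: List[Vector],
) -> List[Vector]:
    """Project image patches into the text embedding space, then
    concatenate them in front of the text token embeddings."""
    image_tokens = project(image_features, weight)
    return image_tokens + text_embeddings


# Toy data: 2 image patches (3-d features), 2 text tokens (2-d embeddings).
image_features = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_embeddings = [[0.1, 0.2], [0.3, 0.4]]

sequence = build_multimodal_sequence(image_features, text_embeddings, weight)
print(len(sequence))  # 4 positions: 2 image tokens followed by 2 text tokens
```

The transformer then attends over the combined sequence as if every position were an ordinary token, which is why this style of connector needs no change to the language model itself.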
Showing the first 80 matches, sorted by decision relevance, with tracked capability and provider-route evidence.
Vision-language variant of NVIDIA Nemotron Nano 8B with multimodal capabilities.
2025-03-01
Researched 173d ago
4k
4,000 tokens
Mistral Medium 3 Instruct is MistralAI's Mistral Medium model. It offers a 128K-token context window.
2025-10-01
Researched 19d ago
128k
128,000 tokens
Mistral Large 3 675B Instruct is MistralAI's Mistral Large model. It offers a 128K-token context window and scores 70.2 on τ-bench.
2025-12-01
Researched 4d ago
128k
128,000 tokens
Qwen3.5-9B is Alibaba's Qwen3.5 model with multimodal text and image input. It offers a 256K-token context window with weights openly available for self-hosting.
2026-03-02
Researched 35d ago
262k
262,144 tokens
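The "256K" in a description and the "262k" label on a card can describe the same window: 256 × 1,024 = 262,144 tokens, which rounds to 262 in decimal thousands. The sketch below reproduces the short labels seen in this listing. The rounding rule is inferred from the cards on this page, not from documented site behavior, and `format_window` is a hypothetical helper.

```python
def format_window(tokens: int) -> str:
    """Format a token count as a short label (assumed rule:
    decimal millions with up to two decimals, else rounded thousands)."""
    if tokens >= 1_000_000:
        return f"{round(tokens / 1_000_000, 2):g}m"
    if tokens >= 1_000:
        return f"{round(tokens / 1_000)}k"
    return str(tokens)


# 256K in binary units is 256 * 1024 tokens.
binary_256k = 256 * 1024
print(binary_256k)                 # 262144
print(format_window(binary_256k))  # 262k
print(format_window(65_536))       # 66k
print(format_window(1_048_576))    # 1.05m
```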
Llama Guard 3 11B Vision is Meta's Llama Guard model. Weights are openly available for self-hosting.
2024-09-25
Researched 35d ago
128k
128,000 tokens
No tracked provider route
GPT-4 Vision Preview is OpenAI's GPT-4 model with multimodal text and image input. It is deprecated (originally released 2023-11-06); use it only for reproducing earlier results or evaluating drift over time.
2023-11-06
Researched 35d ago
128k
128,000 tokens
No tracked provider route
Xiaomi MiMo-V2.5 is the lower-cost native omnimodal sibling in the MiMo-V2.5 series. OpenRouter describes it as supporting text, image, audio, and video inputs with text output, Pro-level agentic performance at roughly half the inference cost, and improved multimodal perception over MiMo-V2-Omni. Xiaomi's official April 22 release page highlights MiMo-V2.5 alongside MiMo-V2.5-Pro in benchmark data and says the V2.5 series will be open-sourced soon; no public weights/license were verified at research time.
2026-04-22
Researched 28d ago
1.05m
1,048,576 tokens
Kimi K2.6 is Moonshot AI's multimodal agentic coding model, released April 20 2026 under a Modified MIT license. Built on a 1-trillion-parameter MoE architecture (32B active, 384 experts with 8 selected per token plus 1 shared expert, 61 layers), it features a 262K context window and up to 65,536 output tokens. Supports native image and video inputs (screenshots, PDFs, spreadsheets). Designed for long-horizon coding with agent swarms of up to 300 sub-agents and 4,000 coordinated steps; Moonshot AI cites 200–300 sequential tool calls without task drift. Key benchmarks: SWE-bench Verified 80.2%, SWE-bench Pro 58.6%, LiveCodeBench v6 89.6%, GPQA Diamond 90.5%, Terminal-Bench 2.0 66.7%. Chatbot Arena Elo 1454 (2026-04-28 snapshot).
2026-04-20
Researched 29d ago
262k
262,144 tokens
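The expert counts in the Kimi K2.6 description (384 experts, 8 selected per token, plus 1 shared expert) can be illustrated with a minimal top-k selection sketch. This is not Moonshot AI's implementation: the random router scores and the `route_token` helper are hypothetical, and a real router would also compute gating weights and combine expert outputs.

```python
import heapq
import random

NUM_ROUTED_EXPERTS = 384  # from the description above
TOP_K = 8                 # experts selected per token
NUM_SHARED_EXPERTS = 1    # always active, not chosen by the router


def route_token(router_scores, top_k=TOP_K):
    """Return the indices of the top_k highest-scoring routed experts."""
    return heapq.nlargest(
        top_k, range(len(router_scores)), key=router_scores.__getitem__
    )


# Hypothetical router scores for a single token.
rng = random.Random(0)
scores = [rng.random() for _ in range(NUM_ROUTED_EXPERTS)]

selected = route_token(scores)
active_per_token = len(selected) + NUM_SHARED_EXPERTS
print(active_per_token)  # 9 experts run for this token
```

Because only a handful of experts run per token, the active parameter count (32B here) is a small fraction of the total.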
SenseNova V6 is SenseTime's latest large model with advanced multimodal reasoning and text generation at low cost. Features multimodal long chain-of-thought reasoning and reinforcement learning enhancements.
2025-04-01
Researched 35d ago
128k
128,000 tokens
Step-2 is StepFun's Step model with multimodal text and image input. It offers a 256K-token context window.
2024-09-01
Researched 32d ago
256k
256,000 tokens
Step-1.5V is StepFun's multimodal language model with vision capabilities, building on Step-1 with image understanding.
2024-06-01
Researched 173d ago
128k
128,000 tokens
SenseNova-U1-A3B is SenseTime's open-source multimodal MoE model released April 28, 2026. ~3B activated parameters (MoE backbone). Shares the NEO-Unify architecture with SenseNova-U1-8B: no visual encoder or VAE, native unified text-and-image representation. Apache 2.0 license. HuggingFace: OpenSenseNova/SenseNova-U1-A3B-MoT. Source: https://github.com/OpenSenseNova/SenseNova-U1
2026-04-28
Researched 35d ago
—
No window data
No tracked provider route
Kimi K2.7-Code is Moonshot AI's coding-focused multimodal model released June 12, 2026, built on Kimi K2.6. Uses the same 1-trillion-parameter MoE architecture (32B active parameters, 384 experts with 8 selected per token, 61 layers) with a 262K context window and MoonViT vision encoder (400M parameters). Reports +21.8% on Moonshot's Kimi Code Bench v2, +11.0% on Program Bench, +31.5% on MLS Bench Lite versus K2.6, with approximately 30% fewer reasoning tokens. Forces thinking mode on by default and preserves reasoning content across multi-turn interactions for agentic use. Available via Kimi platform API and HuggingFace under Modified MIT license.
2026-06-12
Researched 3d ago
262k
262,144 tokens
Holo3.1-35B-A3B is H Company's flagship open-weights 35B (3B active) sparse MoE computer-use VLM released June 1, 2026 under Apache 2.0. Achieves 79.3% on AndroidWorld and >25% improvement on the Holotab harness over Holo3. Supports native function-calling, FP8/NVFP4 quantization for DGX Spark, and local self-hosting.
2026-06-01
Researched 16d ago
262k
262,144 tokens
Holo3.1-9B is H Company's 9B-parameter open-weights computer-use VLM released June 1, 2026 under Apache 2.0. Part of the Holo3.1 family. Achieves 71% on AndroidWorld. Supports native function-calling and structured JSON output for web, desktop, and mobile agent workflows.
2026-06-01
Researched 16d ago
262k
262,144 tokens
No tracked provider route
Holo3.1-4B is H Company's 4B-parameter open-weights computer-use VLM released June 1, 2026 under Apache 2.0. Part of the Holo3.1 family. Achieves 71% on AndroidWorld (up from 58% for Holo3). Supports native function-calling, web/desktop/mobile environments, and local deployment via quantized checkpoints.
2026-06-01
Researched 16d ago
262k
262,144 tokens
No tracked provider route
Holo3.1-0.8B is H Company's smallest 0.8B-parameter open-weights computer-use VLM, released June 1, 2026 under Apache 2.0. Part of the Holo3.1 family spanning 0.8B-35B. Fine-tuned from Qwen3.5-0.8B. Supports native function-calling and structured JSON output for local/edge deployment.
2026-06-01
Researched 16d ago
262k
262,144 tokens
No tracked provider route
Phi-3 Vision is Microsoft's multimodal model combining language and vision capabilities. It processes both text and images and can perform tasks such as optical character recognition, chart analysis, and image interpretation. Its architecture comprises an image encoder, a text-image connector, a projector for mapping image features, and the Phi-3 Mini language model. At 4.2 billion parameters, it competes with larger models and suits devices with limited computational power. A context window of up to 128K tokens supports complex multimodal reasoning. Training draws on high-quality and synthetic data and incorporates safety measures.
2024-05-21
Researched 173d ago
128k
128,000 tokens
Holo3-122B-A10B is H Company's flagship 122B (10B active) sparse MoE computer-use VLM. Released March 31, 2026, available via the H Company API only (no public open weights). Achieves 78.85% on OSWorld-Verified - SOTA at launch. Paid-tier API access at $0.40/$3.00 per 1M input/output tokens.
2026-03-31
Researched 16d ago
66k
65,536 tokens
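Per-million-token pricing such as the $0.40/$3.00 input/output rates quoted above translates into a per-request cost with simple arithmetic. The sketch below assumes a hypothetical request of 50,000 input tokens and 2,000 output tokens; `estimate_cost` is an illustrative helper, not part of any provider SDK.

```python
PRICE_IN_PER_M = 0.40   # USD per 1M input tokens (from the listing)
PRICE_OUT_PER_M = 3.00  # USD per 1M output tokens (from the listing)


def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    price_in: float = PRICE_IN_PER_M,
    price_out: float = PRICE_OUT_PER_M,
) -> float:
    """Estimate request cost in USD from token counts and per-1M prices."""
    return (
        input_tokens / 1_000_000 * price_in
        + output_tokens / 1_000_000 * price_out
    )


# Hypothetical request: 50,000 input tokens, 2,000 output tokens.
cost = estimate_cost(50_000, 2_000)
print(f"${cost:.4f}")  # $0.0260
```

Output tokens dominate quickly at these rates: the 2,000 output tokens cost $0.006, almost a third of what the 50,000 input tokens cost.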
Amazon Nova Premier is Amazon's most capable standard Bedrock Nova understanding model for complex reasoning, agentic workflows, and model distillation. It supports a 1M-token context window, text/image/video inputs, text output, reasoning, tool calling, and prompt caching; use it as the standard Bedrock Nova frontier pick instead of Nova 2 Omni early-access Forge checkpoints.
2025-03-17
Researched 34d ago
1m
1,000,000 tokens
Nova Pro is Amazon's Nova model. It offers a 300K-token context window.
2025-03-17
Researched 19d ago
300k
300,000 tokens
Reka Edge is Reka's Reka model with multimodal text and image input. It offers a 64K-token context window.
2024-02-12
Researched 35d ago
64k
64,000 tokens
Flagship sparse MoE Mistral model (675B total, 41B active) with 256K context and multimodal capabilities. Leads benchmarks in complex reasoning and long-context processing.
2024-07-23
Researched 61d ago
128k
128,000 tokens
Nova Lite is Amazon's Nova model. It offers a 300K-token context window.
2025-03-17
Researched 19d ago
300k
300,000 tokens
Claude 3 Sonnet is Anthropic's mid-tier Claude 3 model, balancing intelligence and speed for enterprise use cases. It sits between the more capable Opus and the faster Haiku. Sonnet handles nuanced content creation, summarization, and complex scientific queries, and is proficient in non-English languages and coding. Its vision capabilities cover visual reasoning tasks such as interpreting charts and graphs and transcribing text from imperfect images, which is useful in retail, logistics, and finance. Running at twice the speed of Claude 3 Opus, Sonnet suits context-sensitive customer support and multi-step workflows. It is rated AI Safety Level 2 (ASL-2) and is available through Claude.ai, the Claude iOS app, the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI.
2024-03-04
Researched 65d ago
200k
200,000 tokens
21B multimodal model from Reka. State-of-the-art for the 21B class; supports text, image, and video input.
2024-02-12
Researched 173d ago
128k
128,000 tokens
Kimi K2.5 is Moonshot AI's Kimi model focused on code generation and software engineering. It offers a 256K-token context window and scores 87.9 on GPQA.
2026-03-15
Researched 19d ago
256k
256,000 tokens
Instruction-tuned 11B Llama 3.2 Vision model for image reasoning, visual question answering, document understanding, and captioning. NVIDIA NIM lists text plus image input, text output, and a 128K context window for the Llama 3.2 Vision collection.
2024-09-25
Researched 22d ago
128k
128,000 tokens
Instruction-tuned 90B Llama 3.2 Vision model for higher-capability image reasoning, visual question answering, visual grounding, and captioning. NVIDIA NIM lists text plus image input, text output, and a 128K context window for the Llama 3.2 Vision collection.
2024-09-25
Researched 22d ago
128k
128,000 tokens
Claude 3.7 Sonnet is Anthropic's advanced model with extended thinking capabilities, offering state-of-the-art reasoning for complex tasks.
2025-02-24
Researched 65d ago
200k
200,000 tokens
Qwen3.6-Plus is Alibaba Cloud's GA Qwen3.6 flagship for long-context reasoning, coding, tool use, and multimodal workflows. DashScope lists it with a 1M-token context window, structured output support, and standard public token pricing.
2026-04-01
Researched 34d ago
1m
1,000,000 tokens
Nemotron-Nano-12B-v2-VL is NVIDIA AI's Nemotron Nano 2 model with multimodal text and image input. It was released 2025-10-28.
2025-10-28
Researched 35d ago
—
No window data
OpenAI o3 reasoning model with advanced multi-step problem-solving capabilities.
2025-04-16
Researched 15d ago
200k
200,000 tokens
Large multimodal Mistral model delivering advanced vision-language understanding with support for high-resolution images and complex visual reasoning.
2024-11-18
Researched 65d ago
128k
128,000 tokens
ERNIE 4.5 is Baidu AI's ERNIE model. It offers an 8K-token context window.
2025-03-16
Researched 19d ago
8k
8,000 tokens
Cosmos 3 Nano is NVIDIA's 16B-parameter omnimodel optimized for efficient inference on workstation-grade hardware (NVIDIA RTX PRO 6000). Architecture: dual-tower Mixture-of-Transformers with an 8B autoregressive Reasoner and an 8B diffusion-based Generator. The Reasoner supports up to 256K tokens of context for vision-language reasoning; the Generator produces video up to 720p at variable frame rates (default 189 frames). Natively handles text, image, video, audio (48kHz stereo), and robot action trajectories across 10+ robot embodiments including Franka Panda, UR, Google robot, and UMI. BF16 precision only. Available as open weights on Hugging Face and via the Cosmos 3 Reasoner NIM (NIM_MODEL_SIZE=nano). Intended for real-time robotics inference and edge-adjacent deployment. Robot action input/output is preserved in this description because the model schema does not have a dedicated action modality field.
2026-05-31
Researched 22d ago
256k
256,000 tokens
Cosmos 3 Super is NVIDIA's flagship 64B-parameter omnimodel for physical AI, designed for large-scale synthetic data generation and high-fidelity simulation on NVIDIA Hopper and Blackwell datacenter GPUs. Architecture: dual-tower Mixture-of-Transformers with a 32B autoregressive Reasoner and a 32B diffusion-based Generator. Supports 256K token reasoning context, 720p video generation at variable frame rates, and 10+ robot embodiment action domains. Ranked #1 among open models on Physics-IQ, PAI-Bench, R-Bench, RoboLab, RoboArena, VANTAGE-Bench, TAR, and Artificial Analysis image/video leaderboards (Computex 2026). Training data: 1.3B data points across 393 datasets (2024-2026). Inference performance (vLLM-Omni): ~55s for 50-step video on 8xH200. Available as open weights on Hugging Face and via Cosmos 3 Reasoner NIM (NIM_MODEL_SIZE=super). Robot action input/output is preserved in this description because the model schema does not have a dedicated action modality field.
2026-05-31
Researched 22d ago
256k
256,000 tokens
Perceptron Mk1 is a closed-source vision-language model for image and video understanding, OCR, object detection, captioning, video QA, and embodied reasoning. Perceptron documents Mk1 with 32K context, reasoning support, and standard pricing of $0.15 per 1M input tokens and $1.50 per 1M output tokens.
2026-05-12
Researched 35d ago
33k
32,768 tokens
Qianfan-OCR-Fast is Baidu Qianfan's speed-optimized OCR specialist surfaced on OpenRouter. It builds on the Qianfan-OCR document intelligence line for image-to-text, document parsing, layout analysis, chart understanding, and OCR-heavy extraction workflows.
2026-04-20
Researched 36d ago
66k
65,536 tokens
SEA-LION V4 27B Instruct is AI Singapore's Gemma 3 27B-based regional language model for Southeast Asian language tasks. It extends the SEA-LION line with continued pretraining and post-training for Burmese, Indonesian, Khmer, Lao, Malay, Mandarin, Tagalog, Tamil, Thai, Vietnamese, and English workloads, and is available through Cloudflare Workers AI.
2025-08-25
Researched 20d ago
128k
128,000 tokens
Baidu ERNIE X1 advanced model with enhanced reasoning for enterprise applications.
2025-03-16
Researched 19d ago
—
No window data
MiniMax-01 combines MiniMax-Text-01 and MiniMax-VL-01, pairing a 456B-total-parameter MoE language model with multimodal understanding for long-context text generation and vision-language tasks.
2025-01-14
Researched 1d ago
4m
4,000,000 tokens
Multimodal 11B parameter model balancing capability and computational efficiency
2024-09-25
Researched 65d ago
128k
128,000 tokens
Advanced multimodal model with image reasoning, visual question answering, and document analysis
2024-09-25
Researched 65d ago
128k
128,000 tokens
Instruction-tuned 12B multimodal model for conversational vision-language tasks and image analysis with efficient inference.
2024-09-12
Researched 173d ago
128k
128,000 tokens
GLM-4V 9B is Tsinghua Knowledge Engineering Group (THUDM)'s GLM-4 model with multimodal text and image input. It offers a 128K-token context window and scores 48.3 on MMMU.
2024-06-05
Researched 35d ago
131k
131,072 tokens
PaliGemma 3B 896 is a lightweight vision-language model developed by Google that processes both images and text. Inspired by PaLI-3, it combines the SigLIP vision model with the Gemma-2B language model, using a linear projection layer to connect visual and textual inputs. It handles image captioning, visual question answering, object detection, and segmentation, and supports multilingual text. It requires task-specific fine-tuning for best performance and may encounter challenges with contextual understanding, biases, and computational demands.
2024-05-14
Researched 173d ago
512
512 tokens
Closed-source flagship Qwen model with advanced reasoning capabilities for agent tasks.
2024-05-11
Researched 65d ago
128k
128,000 tokens
Reka's frontier-class multimodal model for complex tasks. Approaches the frontier models from OpenAI, Google, and Anthropic.
2024-04-15
Researched 173d ago
128k
128,000 tokens
DeepSeek-VL 7B is an open-source vision-language model engineered for robust real-world applications. With its general multimodal understanding capabilities, it can process logical diagrams, web pages, formulas, scientific literature, natural images, and complex scenarios involving embodied intelligence. The model features a hybrid vision encoder that integrates SigLIP-L and SAM-B, allowing it to handle high-resolution (1024 x 1024) image inputs. Built on the DeepSeek-LLM-7b-base foundation, it is pre-trained on roughly 2 trillion text tokens and further trained on approximately 400 billion vision-language tokens. A standout variant, DeepSeek-VL-7b-chat, is specially optimized for conversational tasks, enhancing both performance and user experience by addressing the limitations of existing open-source multimodal models.
2024-03-15
Researched 173d ago
—
No window data
Gemini 1.5 Flash on Google Vertex AI is Google DeepMind's Gemini 1.5 model with multimodal text and image input. It offers a 1M-token context window.
2024-02-15
Researched 35d ago
1m
1,000,000 tokens
Gemini 1.5 Flash on Google Vertex AI (Extended Context) is Google DeepMind's Gemini 1.5 model with multimodal text and image input. It offers a 1M-token context window.
2024-02-15
Researched 35d ago
1m
1,000,000 tokens
Gemini 1.5 Pro on Google Vertex AI is Google DeepMind's Gemini 1.5 model with multimodal text and image input. It offers a 1M-token context window.
2024-02-15
Researched 35d ago
1m
1,000,000 tokens
Gemini 1.5 Pro on Google Vertex AI (Extended Context) is Google DeepMind's Gemini 1.5 model with multimodal text and image input. It offers a 1M-token context window.
2024-02-15
Researched 35d ago
1m
1,000,000 tokens
Gemini 1.0 Pro on Google Vertex AI is Google DeepMind's Gemini 1.0 model with multimodal text and image input. It offers a 32K-token context window.
2023-12-06
Researched 35d ago
33k
32,768 tokens
Original LLaVA (Large Language-and-Vision Assistant) 13B model. Multimodal vision+language model combining a vision encoder with a language model for visual understanding tasks.
2023-04-17
Researched 173d ago
4k
4,000 tokens
xAI's dedicated agentic coding model. 314B MoE architecture, 100+ tokens/second. Built for web development, debugging, and MCP-native multi-step agentic tasks. Accepts text and image inputs. Available in us-east-1 and eu-west-1. API aliases: grok-code-fast-1, grok-code-fast. Public API opened June 1, 2026. Pricing: $1/$2 per 1M tokens in/out.
2026-06-01
Researched 14d ago
256k
256,000 tokens
No tracked provider route
Cosmos 3 Nano Policy DROID is a 16B-parameter robotics policy model fine-tuned from Cosmos 3 Nano on the DROID dataset. Given natural language instructions and visual observations from a robot camera (image or video), it generates robot action trajectories (JSON 1D list) for manipulation and control tasks. Compatible with multiple robot embodiments including Franka Panda (single/dual), UR, Google robot, WidowX 250, UMI, and Agibot. Supports 16-400 frame action sequences in various DoF configurations (9D-57D). Intended as a reference implementation for post-training Cosmos 3 Nano on specific robot platforms. The action output modality is represented in prose because the current model schema only has text, vision, video, audio, and related capability flags.
2026-05-31
Researched 22d ago
4k
4,000 tokens
No tracked provider route
Cosmos 3 Super Image2Video is a 64B-parameter fine-tuned variant of Cosmos 3 Super specialized for temporally coherent image-to-video generation. Takes a single image (jpg/png/webp at 256p-720p) plus an optional text prompt (up to 4096 tokens) and outputs MP4 video with 5-400 frames (default 189) at up to 720p, with optional muxed AAC stereo audio at 48kHz. Ranked #1 on Artificial Analysis image-to-video leaderboard (open models). Available via Hugging Face Diffusers and vLLM-Omni.
2026-05-31
Researched 22d ago
4k
4,000 tokens
No tracked provider route
Cosmos 3 Super Text2Image is a 64B-parameter fine-tuned variant of Cosmos 3 Super specialized for high-fidelity text-to-image generation. Takes text prompts up to 4096 tokens and outputs JPEG images at 256p, 480p, or 720p in aspect ratios 16:9, 4:3, 1:1, 3:4, or 9:16. Ranked #1 on Artificial Analysis text-to-image leaderboard (open models). Available via Hugging Face Diffusers (DiffusionPipeline) and vLLM-Omni.
2026-05-31
Researched 22d ago
4k
4,000 tokens
No tracked provider route
Google's strongest agentic and coding model, outperforming Gemini 3.1 Pro on coding and agentic benchmarks (Terminal-Bench 2.1: 76.2%, GDPval-AA: 1656 Elo, CharXiv Reasoning: 84.2%). Multimodal: text, vision, video, audio input. 1M-token context, 65K max output. Pricing: $1.50/$9.00 per 1M tokens in/out.
2026-05-19
Researched 14d ago
1m
1,000,000 tokens
No tracked provider route
SenseNova-U1-8B is SenseTime's open-source 8B multimodal model released April 28, 2026. Dense MoT (Mixture-of-Tokens) backbone. Uses the NEO-Unify architecture that eliminates both the visual encoder and VAE, enabling native unified image understanding and generation in a single representation space — the first commercially viable model with this capability. Achieves commercial-grade image generation quality comparable to Qwen-Image 2.0 Pro. Apache 2.0 license. HuggingFace: OpenSenseNova/SenseNova-U1-8B-MoT. Source: https://github.com/OpenSenseNova/SenseNova-U1
2026-04-28
Researched 35d ago
—
No window data
No tracked provider route
Holo3-35B-A3B is H Company's 35B (3B active) open-weights sparse MoE computer-use VLM released March 31, 2026 under Apache 2.0. Fine-tuned from Qwen3.5-35B-A3B, it achieves 77.8% on OSWorld-Verified and is available for self-hosting through H Company's official Hugging Face repository. The official config exposes a 262,144-token context window.
2026-03-31
Researched 16d ago
262k
262,144 tokens
No tracked provider route
Nemotron 3 VoiceChat is NVIDIA AI's Nemotron 3 model with multimodal text and image input. It was released 2026-03-16.
2026-03-16
Researched 35d ago
—
No window data
No tracked provider route
GPT-5 Image Mini is a cost-effective image generation model that combines GPT-5 Mini's language capabilities with image generation at 400K context.
2025-10-01
Researched 57d ago
400k
400,000 tokens
No tracked provider route
Gemini 2.0 Flash Lite Preview (02-05). Retiring June 1, 2026. Migrate to Gemini 2.5 or Gemini 3 series.
2025-02-05
Researched 19d ago
1m
1,000,000 tokens
No tracked provider route
Gemini 2.0 Pro (Experimental 02-05) is Google DeepMind's Gemini 2.0 model. Its knowledge cutoff is 2024-08.
2025-02-05
Researched 19d ago
1m
1,000,000 tokens
No tracked provider route
Doubao 1.5 Pro Vision 32K is ByteDance's multimodal language model released January 22, 2025 alongside Doubao 1.5 Pro. Incorporates improvements in multi-modal data synthesis, dynamic resolution, and multi-modal alignment for image understanding tasks.
2025-01-22
Researched 19d ago
32k
32,000 tokens
No tracked provider route
GPT-4o Audio Preview (12-17) is OpenAI's GPT-4o Audio model. It offers a 128K-token context window.
2024-12-17
Researched 35d ago
128k
128,000 tokens
No tracked provider route
Doubao Vision Lite 32K is ByteDance's Doubao Vision model. It offers a 32K-token context window.
2024-12-01
Researched 35d ago
32k
32,000 tokens
No tracked provider route
Doubao Vision Pro 32K is ByteDance's Doubao Vision model. It offers a 32K-token context window.
2024-12-01
Researched 35d ago
32k
32,000 tokens
No tracked provider route
GPT-4o (11-20) is OpenAI's GPT-4o model. It offers a 128K-token context window.
2024-11-20
Researched 35d ago
128k
128,000 tokens
No tracked provider route
GPT-4o model with integrated audio I/O capabilities for multimodal interactions.
2024-10-01
Researched 173d ago
128k
128,000 tokens
No tracked provider route
12B multimodal Mistral model combining text and vision capabilities for image understanding and visual reasoning tasks.
2024-09-12
Researched 173d ago
128k
128,000 tokens
No tracked provider route
Phi 3.5 Vision Instruct is Microsoft Research's Phi-3 model with multimodal text and image input. It offers a 128K-token context window with weights openly available for self-hosting and scores 43 on MMMU.
2024-08-20
Researched 35d ago
128k
128,000 tokens
No tracked provider route
The chatgpt-4o-latest model version continuously points to the version of GPT-4o used in ChatGPT and is updated frequently, whenever there are significant changes.
2024-05-13
Researched 173d ago
128k
128,000 tokens
No tracked provider route
Gemini 1.0 Pro Vision is Google's multimodal large language model for tasks involving both visual and textual data. It supports visual understanding, classification, and summarization, and can create content from images and videos. It processes a range of visual and textual inputs, such as photographs, documents, and infographics, and can generate image descriptions and identify objects. It also supports zero-shot, one-shot, and few-shot learning. Gemini 1.0 Pro Vision is slated for deprecation, with a removal date of April 9, 2025; users should transition to newer models such as Gemini 1.5 Pro and Gemini 1.5 Flash.
2024-04-29
Researched 65d ago
12k
12,000 tokens
No tracked provider route
DeepSeek VL 1.3B is an open-source vision-language (VL) model that processes and interprets both images and text. It uses a SigLIP-L vision encoder for 384 x 384 pixel image inputs and is trained on text and vision-language tokens. It supports image captioning, visual question answering, and multimodal document understanding, as well as scenarios requiring embodied intelligence. At 1.3 billion parameters it is compact and resource-efficient, and it is available on platforms like Hugging Face.
2024-03-15
Researched 173d ago
—
No window data
No tracked provider route
DeepSeek VL 1.3B Chat is DeepSeek's DeepSeek VL model. Weights are openly available for self-hosting.
2024-03-15
Researched 35d ago
—
No window data
No tracked provider route
DeepSeek VL 7B Chat is DeepSeek's DeepSeek VL model. Weights are openly available for self-hosting.
2024-03-15
Researched 35d ago
—
No window data
No tracked provider route