multimodal

Pricing not tracked / 1M tokens

1 route

Mistral Medium 3 Instruct

Mistral Medium 3 Instruct is Mistral AI's Mistral Medium model. It offers a 128K-token context window.

2025-10-01

Researched 19d ago

128k

128,000 tokens

128k context · Vision · Multimodal · Batch

Mistral AI Studio

$0.400 in / $2.00 out / 1M tokens

2 routes · 1 batch

Mistral Large 3 675B Instruct

Mistral Large 3 675B Instruct is Mistral AI's Mistral Large model. It offers a 128K-token context window and scores 70.2 on τ-bench.

2025-12-01

Researched 4d ago

128k

128,000 tokens

128k context · Vision · Multimodal · JSON · Batch · Prompt cache

$0.500 in / $1.50 out / 1M tokens

6 routes · 1 batch · 1 cache

Qwen3.5-9B

Qwen3.5-9B is Alibaba's Qwen3.5 model with multimodal text and image input. It offers a 256K-token context window with weights openly available for self-hosting.

2026-03-02

Researched 35d ago

262k

262,144 tokens

262k context · Vision · Multimodal · Tool use · Functions · JSON

Alibaba Cloud PAI-EAS

$0.100 in / $0.150 out / 1M tokens

3 routes

Llama Guard 3 11B Vision

Llama Guard 3 11B Vision is Meta's Llama Guard model. Weights are openly available for self-hosting.

2024-09-25

Researched 35d ago

128k

128,000 tokens

128k context · Vision

No tracked provider route

GPT-4 Vision Preview

GPT-4 Vision Preview is OpenAI's GPT-4 model with multimodal text and image input. It is deprecated (originally released 2023-11-06); use it only for reproducing earlier results or evaluating drift over time.

2023-11-06

Researched 35d ago

128k

128,000 tokens

128k context · Vision · Multimodal · Code exec

No tracked provider route

Xiaomi MiMo-V2.5

Xiaomi MiMo-V2.5 is the lower-cost native omnimodal sibling in the MiMo-V2.5 series. OpenRouter describes it as supporting text, image, audio, and video inputs with text output, Pro-level agentic performance at roughly half the inference cost, and improved multimodal perception over MiMo-V2-Omni. Xiaomi's official April 22 release page highlights MiMo-V2.5 alongside MiMo-V2.5-Pro in benchmark data and says the V2.5 series will be open-sourced soon; no public weights/license were verified at research time.

2026-04-22

Researched 28d ago

1.05m

1,048,576 tokens

1.05m context · Reasoning · Vision · Multimodal · Tool use · Functions

$0.140 in / $0.280 out / 1M tokens

2 routes · 1 cache

Kimi K2.6

Kimi K2.6 is Moonshot AI's multimodal agentic coding model, released April 20, 2026 under a Modified MIT license. Built on a 1-trillion-parameter MoE architecture (32B active, 384 experts with 8 selected per token plus 1 shared expert, 61 layers), it has a 262K context window and supports up to 65,536 output tokens. It accepts native image and video inputs (screenshots, PDFs, spreadsheets). It is designed for long-horizon coding with agent swarms of up to 300 sub-agents and 4,000 coordinated steps; Moonshot AI cites 200–300 sequential tool calls without task drift. Key benchmarks: SWE-bench Verified 80.2%, SWE-bench Pro 58.6%, LiveCodeBench v6 89.6%, GPQA Diamond 90.5%, Terminal-Bench 2.0 66.7%. Chatbot Arena Elo 1454 (2026-04-28 snapshot).

2026-04-20

Researched 29d ago

262k

262,144 tokens

262k context · Reasoning · Vision · Multimodal · Tool use · Functions

Novita AI

$0.800 in / $3.40 out / 1M tokens

9 routes · 3 cache

SenseNova V6

SenseNova V6 is SenseTime's latest large model with advanced multimodal reasoning and text generation at low cost. Features multimodal long chain-of-thought reasoning and reinforcement learning enhancements.

2025-04-01

Researched 35d ago

128k

128,000 tokens

128k context · Vision · Multimodal

SenseTime API

Pricing not tracked / 1M tokens

1 route

Step-2

Step-2 is StepFun's Step model with multimodal text and image input. It offers a 256K-token context window.

2024-09-01

Researched 32d ago

256k

256,000 tokens

256k context · Vision · Multimodal · Functions

StepFun

Pricing not tracked / 1M tokens

1 route

Step-1.5V

Step-1.5V is StepFun's multimodal language model with vision capabilities, building on Step-1 with image understanding.

2024-06-01

Researched 173d ago

128k

128,000 tokens

128k context · Vision · Multimodal

StepFun

Pricing not tracked / 1M tokens

1 route

SenseNova-U1-A3B

SenseNova-U1-A3B is SenseTime's open-source multimodal MoE model, released April 28, 2026 under the Apache 2.0 license, with about 3B activated parameters. It shares the NEO-Unify architecture with SenseNova-U1-8B: no visual encoder or VAE, and a native unified text-and-image representation. Hugging Face: OpenSenseNova/SenseNova-U1-A3B-MoT. Source: https://github.com/OpenSenseNova/SenseNova-U1

2026-04-28

Researched 35d ago

—

No window data

Vision · Multimodal

No tracked provider route

Kimi K2.7-Code

Kimi K2.7-Code is Moonshot AI's coding-focused multimodal model released June 12, 2026, built on Kimi K2.6. Uses the same 1-trillion-parameter MoE architecture (32B active parameters, 384 experts with 8 selected per token, 61 layers) with a 262K context window and MoonViT vision encoder (400M parameters). Reports +21.8% on Moonshot's Kimi Code Bench v2, +11.0% on Program Bench, +31.5% on MLS Bench Lite versus K2.6, with approximately 30% fewer reasoning tokens. Forces thinking mode on by default and preserves reasoning content across multi-turn interactions for agentic use. Available via Kimi platform API and HuggingFace under Modified MIT license.

2026-06-12

Researched 3d ago

262k

262,144 tokens

262k context · Reasoning · Vision · Multimodal · Tool use · Functions

$0.612 in / $3.07 out / 1M tokens

2 routes · 2 cache

Holo3.1-35B-A3B

Holo3.1-35B-A3B is H Company's flagship open-weights 35B (3B active) sparse MoE computer-use VLM released June 1, 2026 under Apache 2.0. Achieves 79.3% on AndroidWorld and >25% improvement on the Holotab harness over Holo3. Supports native function-calling, FP8/NVFP4 quantization for DGX Spark, and local self-hosting.

2026-06-01

Researched 16d ago

262k

262,144 tokens

262k context · Vision · Multimodal · Tool use · Functions · JSON

H Company API

$0.250 in / $1.80 out / 1M tokens

1 route

Holo3.1-9B

Holo3.1-9B is H Company's 9B-parameter open-weights computer-use VLM released June 1, 2026 under Apache 2.0. Part of the Holo3.1 family. Achieves 71% on AndroidWorld. Supports native function-calling and structured JSON output for web, desktop, and mobile agent workflows.

2026-06-01

Researched 16d ago

262k

262,144 tokens

262k context · Vision · Multimodal · Tool use · Functions · JSON

No tracked provider route

Holo3.1-4B

Holo3.1-4B is H Company's 4B-parameter open-weights computer-use VLM released June 1, 2026 under Apache 2.0. Part of the Holo3.1 family. Achieves 71% on AndroidWorld (up from 58% for Holo3). Supports native function-calling, web/desktop/mobile environments, and local deployment via quantized checkpoints.

2026-06-01

Researched 16d ago

262k

262,144 tokens

262k context · Vision · Multimodal · Tool use · Functions · JSON

No tracked provider route

Holo3.1-0.8B

Holo3.1-0.8B is H Company's smallest 0.8B-parameter open-weights computer-use VLM, released June 1, 2026 under Apache 2.0. Part of the Holo3.1 family spanning 0.8B-35B. Fine-tuned from Qwen3.5-0.8B. Supports native function-calling and structured JSON output for local/edge deployment.

2026-06-01

Researched 16d ago

262k

262,144 tokens

262k context · Vision · Multimodal · Tool use · Functions · JSON

No tracked provider route

Phi-3 Vision

Phi-3 Vision is Microsoft's multimodal model combining language and vision. It processes both text and images and can perform tasks such as optical character recognition, chart analysis, and image interpretation. Its architecture comprises an image encoder, a text-image connector, a projector for mapping image features, and the Phi-3 Mini language model. At 4.2 billion parameters, it competes with larger models and suits devices with limited computational power. Its 128K-token context window supports complex multimodal reasoning. It was trained on high-quality and synthetic data and incorporates safety measures.

2024-05-21

Researched 173d ago

128k

128,000 tokens

128k context · Vision

Fireworks AI

$0.200 in / $0.200 out / 1M tokens

3 routes

Holo3-122B-A10B

Holo3-122B-A10B is H Company's flagship 122B (10B active) sparse MoE computer-use VLM. Released March 31, 2026, it is available via the H Company API only (no public open weights). It achieves 78.85% on OSWorld-Verified, state of the art at launch. Paid-tier API access costs $0.40/$3.00 per 1M input/output tokens.

2026-03-31

Researched 16d ago

66k

65,536 tokens

Vision · Multimodal · Tool use · Functions · JSON

H Company API

$0.400 in / $3.00 out / 1M tokens

1 route

Amazon Nova Premier

Amazon Nova Premier is Amazon's most capable standard Bedrock Nova understanding model for complex reasoning, agentic workflows, and model distillation. It supports a 1M-token context window, text/image/video inputs, text output, reasoning, tool calling, and prompt caching; use it as the standard Bedrock Nova frontier pick instead of Nova 2 Omni early-access Forge checkpoints.

2025-03-17

Researched 34d ago

1,000,000 tokens

1m context · Reasoning · Vision · Multimodal · Tool use · Functions

$2.50 in / $12.50 out / 1M tokens

2 routes · 1 batch · 1 cache

Nova Pro

Nova Pro is Amazon's Nova model. It offers a 300K-token context window.

2025-03-17

Researched 19d ago

300k

300,000 tokens

300k context · Vision · Multimodal · JSON

$0.800 in / $3.20 out / 1M tokens

2 routes

Reka Edge

Reka Edge is Reka's Reka model with multimodal text and image input. It offers a 64K-token context window.

2024-02-12

Researched 35d ago

64k

64,000 tokens

Multimodal · JSON

$0.100 in / $0.100 out / 1M tokens

2 routes

Mistral Large 2 (2407)

Mistral Large 2 (2407) is Mistral AI's July 2024 flagship: a 123B-parameter dense model with a 128K-token context window, aimed at complex reasoning and long-context processing.

2024-07-23

Researched 61d ago

128k

128,000 tokens

128k context · Vision · JSON

Chutes AI

$0.500 in / $1.50 out / 1M tokens

3 routes

Nova Lite

Nova Lite is Amazon's Nova model. It offers a 300K-token context window.

2025-03-17

Researched 19d ago

300k

300,000 tokens

300k context · Vision · Multimodal · JSON

$0.060 in / $0.240 out / 1M tokens

2 routes

Claude 3 Sonnet

Claude 3 Sonnet is Anthropic's Claude 3 model balancing intelligence and speed for enterprise use cases, positioned between the more powerful Opus and the faster Haiku. It handles nuanced content creation, summarization, and complex scientific queries, and is proficient in non-English languages and coding. Its vision capabilities cover visual reasoning such as interpreting charts and graphs and transcribing text from imperfect images, which benefits industries like retail, logistics, and finance. Running at twice the speed of Claude 3 Opus, Sonnet suits context-sensitive customer support and multi-step workflows. It is rated AI Safety Level 2 (ASL-2) and is accessible through Claude.ai, the Claude iOS app, the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI.

2024-03-04

Researched 65d ago

200k

200,000 tokens

200k context · Reasoning · Vision · Multimodal · JSON · Code exec

$3.00 in / $15.00 out / 1M tokens

2 routes · 1 cache

Reka Flash

Reka Flash is Reka's 21B multimodal model. It is positioned as state of the art for the 21B class and supports text, image, and video input.

2024-02-12

Researched 173d ago

128k

128,000 tokens

128k context · Multimodal

Reka Platform

$0.200 in / $0.800 out / 1M tokens

1 route

Kimi K2.5

Kimi K2.5 is Moonshot AI's Kimi model focused on code generation and software engineering. It offers a 256K-token context window and scores 87.9 on GPQA.

2026-03-15

Researched 19d ago

256k

256,000 tokens

256k context · Vision · Multimodal · Functions · JSON · Prompt cache

$0.440 in / $2.00 out / 1M tokens

9 routes · 1 cache

Llama 3.2 11B Vision Instruct

Instruction-tuned 11B Llama 3.2 Vision model for image reasoning, visual question answering, document understanding, and captioning. NVIDIA NIM lists text plus image input, text output, and a 128K context window for the Llama 3.2 Vision collection.

2024-09-25

Researched 22d ago

128k

128,000 tokens

128k context · Vision · Multimodal · JSON

Vercel AI Gateway

$0.160 in / $0.160 out / 1M tokens

8 routes

Llama 3.2 90B Vision Instruct

Instruction-tuned 90B Llama 3.2 Vision model for higher-capability image reasoning, visual question answering, visual grounding, and captioning. NVIDIA NIM lists text plus image input, text output, and a 128K context window for the Llama 3.2 Vision collection.

2024-09-25

Researched 22d ago

128k

128,000 tokens

128k context · Vision · Multimodal

Bitdeer AI

$0.150 in / $0.450 out / 1M tokens

6 routes

Claude 3.7 Sonnet

Claude 3.7 Sonnet is Anthropic's advanced model with extended thinking capabilities, offering state-of-the-art reasoning for complex tasks.

2025-02-24

Researched 65d ago

200k

200,000 tokens

200k context · Reasoning · Vision · Multimodal · Tool use · Functions

$3.00 in / $15.00 out / 1M tokens

6 routes · 1 batch

Qwen3.6-Plus

Qwen3.6-Plus is Alibaba Cloud's GA Qwen3.6 flagship for long-context reasoning, coding, tool use, and multimodal workflows. DashScope lists it with a 1M-token context window, structured output support, and standard public token pricing.

2026-04-01

Researched 34d ago

1,000,000 tokens

1m context · Vision · Multimodal · Tool use · Functions · JSON

Alibaba Cloud PAI-EAS

$0.325 in / $1.95 out / 1M tokens

3 routes · 2 cache

Nemotron-Nano-12B-v2-VL

Nemotron-Nano-12B-v2-VL is NVIDIA AI's Nemotron Nano 2 model with multimodal text and image input. It was released 2025-10-28.

2025-10-28

Researched 35d ago

—

No window data

Vision · Multimodal · JSON

$0.200 in / $0.600 out / 1M tokens

3 routes

o3

o3 is OpenAI's reasoning model with advanced multi-step problem-solving capabilities.

2025-04-16

Researched 15d ago

200k

200,000 tokens

200k context · Reasoning · Vision · Multimodal · Tool use · Functions

OpenAI API

$2.00 in / $8.00 out / 1M tokens

3 routes · 1 batch · 2 cache

Pixtral Large

Large multimodal Mistral model delivering advanced vision-language understanding with support for high-resolution images and complex visual reasoning.

2024-11-18

Researched 65d ago

128k

128,000 tokens

128k context · Vision · Multimodal · JSON

Mistral AI Studio

$2.00 in / $6.00 out / 1M tokens

3 routes

ERNIE 4.5

ERNIE 4.5 is Baidu AI's ERNIE model. It offers an 8K-token context window.

2025-03-16

Researched 19d ago

8,000 tokens

Vision · Multimodal

Fireworks AI

$1.20 in / $1.20 out / 1M tokens

2 routes

Cosmos 3 Nano

Cosmos 3 Nano is NVIDIA's 16B-parameter omnimodel optimized for efficient inference on workstation-grade hardware (NVIDIA RTX PRO 6000). Architecture: dual-tower Mixture-of-Transformers with an 8B autoregressive Reasoner and an 8B diffusion-based Generator. The Reasoner supports up to 256K tokens of context for vision-language reasoning; the Generator produces video up to 720p at variable frame rates (default 189 frames). Natively handles text, image, video, audio (48kHz stereo), and robot action trajectories across 10+ robot embodiments including Franka Panda, UR, Google robot, and UMI. BF16 precision only. Available as open weights on Hugging Face and via the Cosmos 3 Reasoner NIM (NIM_MODEL_SIZE=nano). Intended for real-time robotics inference and edge-adjacent deployment. Robot action input/output is preserved in this description because the model schema does not have a dedicated action modality field.

2026-05-31

Researched 22d ago

256k

256,000 tokens

256k context · Reasoning · Vision · Multimodal · Audio

Pricing not tracked / 1M tokens

1 route

Cosmos 3 Super

Cosmos 3 Super is NVIDIA's flagship 64B-parameter omnimodel for physical AI, designed for large-scale synthetic data generation and high-fidelity simulation on NVIDIA Hopper and Blackwell datacenter GPUs. Architecture: dual-tower Mixture-of-Transformers with a 32B autoregressive Reasoner and a 32B diffusion-based Generator. Supports 256K token reasoning context, 720p video generation at variable frame rates, and 10+ robot embodiment action domains. Ranked #1 among open models on Physics-IQ, PAI-Bench, R-Bench, RoboLab, RoboArena, VANTAGE-Bench, TAR, and Artificial Analysis image/video leaderboards (Computex 2026). Training data: 1.3B data points across 393 datasets (2024-2026). Inference performance (vLLM-Omni): ~55s for 50-step video on 8xH200. Available as open weights on Hugging Face and via Cosmos 3 Reasoner NIM (NIM_MODEL_SIZE=super). Robot action input/output is preserved in this description because the model schema does not have a dedicated action modality field.

2026-05-31

Researched 22d ago

256k

256,000 tokens

256k context · Reasoning · Vision · Multimodal · Audio

Pricing not tracked / 1M tokens

1 route

Perceptron Mk1

Perceptron Mk1 is a closed-source vision-language model for image and video understanding, OCR, object detection, captioning, video QA, and embodied reasoning. Perceptron documents Mk1 with 32K context, reasoning support, and standard pricing of $0.15 per 1M input tokens and $1.50 per 1M output tokens.

2026-05-12

Researched 35d ago

33k

32,768 tokens

Reasoning · Vision · Multimodal · JSON

$0.150 in / $1.50 out / 1M tokens

1 route
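
As a rough illustration of how the per-1M-token rates quoted on these cards translate into a per-request cost, here is a minimal Python sketch. The function name `estimate_cost_usd` and the token counts are hypothetical; only the $0.15 in / $1.50 out rates come from the Perceptron Mk1 card above. Cache and batch discounts are not modeled, so actual bills may differ.

```python
from decimal import Decimal

# Card prices are quoted per one million tokens.
TOKENS_PER_PRICING_UNIT = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      usd_per_1m_in: str, usd_per_1m_out: str) -> Decimal:
    """Estimate the USD cost of one request under per-1M-token pricing.

    Hypothetical helper: rates are passed as strings so that Decimal
    keeps them exact instead of inheriting float rounding error.
    """
    cost_in = Decimal(input_tokens) * Decimal(usd_per_1m_in) / TOKENS_PER_PRICING_UNIT
    cost_out = Decimal(output_tokens) * Decimal(usd_per_1m_out) / TOKENS_PER_PRICING_UNIT
    return cost_in + cost_out


# Assumed request size: 20,000 input tokens and 1,000 output tokens,
# priced at the Perceptron Mk1 card rates ($0.15 in / $1.50 out per 1M).
example_cost = estimate_cost_usd(20_000, 1_000, "0.15", "1.50")
print(f"Estimated cost: ${example_cost}")
```

With those assumed token counts, the input side contributes $0.003 and the output side $0.0015, for $0.0045 in total. The same function applies to any card that lists both an input and an output rate.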

Qianfan-OCR-Fast

Qianfan-OCR-Fast is Baidu Qianfan's speed-optimized OCR specialist surfaced on OpenRouter. It builds on the Qianfan-OCR document intelligence line for image-to-text, document parsing, layout analysis, chart understanding, and OCR-heavy extraction workflows.

2026-04-20

Researched 36d ago

66k

65,536 tokens

Vision · Multimodal

$0.680 in / $2.81 out / 1M tokens

1 route

SEA-LION V4 27B Instruct

SEA-LION V4 27B Instruct is AI Singapore's Gemma 3 27B-based regional language model for Southeast Asian language tasks. It extends the SEA-LION line with continued pretraining and post-training for Burmese, Indonesian, Khmer, Lao, Malay, Mandarin, Tagalog, Tamil, Thai, Vietnamese, and English workloads, and is available through Cloudflare Workers AI.

2025-08-25

Researched 20d ago

128k

128,000 tokens

128k context · Vision · Multimodal · Tool use · Functions · JSON

Cloudflare Workers AI

Pricing not tracked / 1M tokens

1 route

ERNIE X1

ERNIE X1 is Baidu's advanced model with enhanced reasoning for enterprise applications.

2025-03-16

Researched 19d ago

—

No window data

Vision · Multimodal

Baidu Qianfan

$0.300 in / $1.18 out / 1M tokens

1 route

MiniMax-01

MiniMax-01 combines MiniMax-Text-01 and MiniMax-VL-01, pairing a 456B-total-parameter MoE language model with multimodal understanding for long-context text generation and vision-language tasks.

2025-01-14

Researched 1d ago

4,000,000 tokens

4m context · Vision · Multimodal

$0.200 in / $1.10 out / 1M tokens

1 route

Llama 3.2 11B Vision

Multimodal 11B-parameter model balancing capability and computational efficiency.

2024-09-25

Researched 65d ago

128k

128,000 tokens

128k context · Vision · JSON

$0.200 in / $0.270 out / 1M tokens

1 route

Llama 3.2 90B Vision

Advanced multimodal model with image reasoning, visual question answering, and document analysis.

2024-09-25

Researched 65d ago

128k

128,000 tokens

128k context · Vision · JSON

$1.35 in / $1.80 out / 1M tokens

1 route

Pixtral 12B Instruct

Instruction-tuned 12B multimodal model for conversational vision-language tasks and image analysis with efficient inference.

2024-09-12

Researched 173d ago

128k

128,000 tokens

128k context · Vision · Multimodal

Vercel AI Gateway

$0.150 in / $0.150 out / 1M tokens

1 route

GLM-4V 9B

GLM-4V 9B is Tsinghua Knowledge Engineering Group (THUDM)'s GLM-4 model with multimodal text and image input. It offers a 128K-token context window and scores 48.3 on MMMU.

2024-06-05

Researched 35d ago

131k

131,072 tokens

131k context · Multimodal

Replicate API

$0.050 in / $0.250 out / 1M tokens

1 route

PaliGemma 3B 896

PaliGemma 3B 896 is a lightweight vision-language model from Google that takes both images and text as input. Inspired by PaLI-3, it combines the SigLIP vision model with the Gemma-2B language model, using a linear projection layer to integrate visual and textual inputs. It handles image captioning, visual question answering, object detection, and segmentation, and supports multilingual text. It requires task-specific fine-tuning for best performance and may struggle with contextual understanding, biases, and computational demands.

2024-05-14

Researched 173d ago

512

512 tokens

Vision · Multimodal

Pricing not tracked / 1M tokens

1 route

Qwen-Max

Closed-source flagship Qwen model with advanced reasoning capabilities for agent tasks.

2024-05-11

Researched 65d ago

128k

128,000 tokens

128k context · Vision · JSON

$1.04 in / $4.16 out / 1M tokens

1 route

Reka Core

Reka Core is Reka's frontier-class multimodal model for complex tasks. It is positioned as approaching the frontier models from OpenAI, Google, and Anthropic.

2024-04-15

Researched 173d ago

128k

128,000 tokens

128k context · Multimodal · Functions

Reka Platform

$2.00 in / $6.00 out / 1M tokens

1 route

DeepSeek VL 7B

DeepSeek-VL 7B is an open-source vision-language model engineered for robust real-world applications. With its general multimodal understanding capabilities, it can process logical diagrams, web pages, formulas, scientific literature, natural images, and complex scenarios involving embodied intelligence. The model features a hybrid vision encoder that integrates SigLIP-L and SAM-B, allowing it to handle high-resolution (1024 x 1024) image inputs. Built on the DeepSeek-LLM-7b-base foundation, it is pre-trained on roughly 2 trillion text tokens and further trained on approximately 400 billion vision-language tokens. A standout variant, DeepSeek-VL-7b-chat, is specially optimized for conversational tasks, enhancing both performance and user experience by addressing the limitations of existing open-source multimodal models.

2024-03-15

Researched 173d ago

—

No window data

Vision · Multimodal

Replicate API

$0.050 in / $0.250 out / 1M tokens

1 route

Gemini 1.5 Flash on Google Vertex AI

Gemini 1.5 Flash on Google Vertex AI is Google DeepMind's Gemini 1.5 model with multimodal text and image input. It offers a 1M-token context window.

2024-02-15

Researched 35d ago

1,000,000 tokens

1m context · Vision · Multimodal · JSON

$0.035 in / $0.105 out / 1M tokens

1 route

Gemini 1.5 Flash on Google Vertex AI (Extended Context)

Gemini 1.5 Flash on Google Vertex AI (Extended Context) is Google DeepMind's Gemini 1.5 model with multimodal text and image input. It offers a 1M-token context window.

2024-02-15

Researched 35d ago

1,000,000 tokens

1m context · Vision · Multimodal · JSON

$0.070 in / $0.210 out / 1M tokens

1 route

Gemini 1.5 Pro on Google Vertex AI

Gemini 1.5 Pro on Google Vertex AI is Google DeepMind's Gemini 1.5 model with multimodal text and image input. It offers a 1M-token context window.

2024-02-15

Researched 35d ago

1,000,000 tokens

1m context · Vision · Multimodal · JSON

$0.125 in / $0.375 out / 1M tokens

1 route

Gemini 1.5 Pro on Google Vertex AI (Extended Context)

Gemini 1.5 Pro on Google Vertex AI (Extended Context) is Google DeepMind's Gemini 1.5 model with multimodal text and image input. It offers a 1M-token context window.

2024-02-15

Researched 35d ago

1,000,000 tokens

1m context · Vision · Multimodal · JSON

$0.250 in / $0.750 out / 1M tokens

1 route

Gemini 1.0 Pro on Google Vertex AI

Gemini 1.0 Pro on Google Vertex AI is Google DeepMind's Gemini 1.0 model with multimodal text and image input. It offers a 32K-token context window.

2023-12-06

Researched 35d ago

33k

32,768 tokens

Vision · Multimodal · JSON

$0.125 in / $0.375 out / 1M tokens

1 route

LLaVA 13B

Original LLaVA (Large Language-and-Vision Assistant) 13B model. Multimodal vision+language model combining a vision encoder with a language model for visual understanding tasks.

2023-04-17

Researched 173d ago

4,000 tokens

Vision · Multimodal

Replicate API

Pricing not tracked / 1M tokens

1 route