LLM Reference

Prompt caching

Also known as: context caching, cache reads, cached prompts

reuse repeated prompt tokens

See matching models with benchmark scores and pricing.

116 matching active models · 31 tracked providers · 116 models with routes

model.prompt_caching · modelProvider.cache_read · modelProvider.cache_write_*

Definition

Prompt caching lets a provider charge or execute repeated prompt prefixes differently when the same context is reused across requests. It matters for long system prompts, retrieval-heavy applications, and agent loops where stable instructions or documents are sent repeatedly.
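To make the pricing effect concrete, here is a minimal sketch of the input-cost arithmetic for requests that share one stable prefix. The `Rates` class, the `input_cost` function, and the prices used are all hypothetical and are not taken from any provider on this page. The sketch assumes the first request writes the prefix to the cache, every later request reads it, and the cache never expires in between; real providers differ on minimum prefix length, expiry, and whether cache writes carry a surcharge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical prices in USD per 1M tokens (illustrative only)."""
    input: float        # normal, uncached input tokens
    cache_read: float   # input tokens served from the cache
    cache_write: float  # input tokens written to the cache on a miss


def input_cost(prefix_tokens: int, suffix_tokens: int, requests: int,
               rates: Rates, cached: bool = True) -> float:
    """Estimate total input cost for `requests` calls sharing one prefix."""
    if requests <= 0:
        return 0.0
    per_million = 1_000_000
    if not cached:
        # Every request pays the full input rate for prefix and suffix.
        total_tokens = requests * (prefix_tokens + suffix_tokens)
        return total_tokens * rates.input / per_million
    # Assumption: one cache write, then (requests - 1) cache reads.
    write = prefix_tokens * rates.cache_write / per_million
    reads = (requests - 1) * prefix_tokens * rates.cache_read / per_million
    # The per-request suffix is never cached and pays the normal rate.
    suffix = requests * suffix_tokens * rates.input / per_million
    return write + reads + suffix


if __name__ == "__main__":
    rates = Rates(input=1.00, cache_read=0.10, cache_write=1.25)
    print(input_cost(10_000, 500, 100, rates, cached=False))
    print(input_cost(10_000, 500, 100, rates, cached=True))
```

With these placeholder rates, a 10,000-token prefix reused across 100 requests dominates the uncached bill, which is why long system prompts and agent loops benefit most. Substitute a route's actual input, cache-read, and cache-write prices before drawing conclusions.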

Models With Prompt caching

Showing the first 80 matches, sorted by decision relevance, with tracked capability and provider-route evidence.

116 matches
Mistral Large 3 675B Instruct

Mistral Large 3 675B Instruct is a Mistral Large model from Mistral AI. It offers a 128K-token context window and scores 70.2 on τ-bench.

2025-12-01

Researched 8d ago

128k

128,000 tokens

128k context · Vision · Multimodal · JSON · Batch · Prompt cache
AWS Bedrock

$0.500 in / $1.50 out / 1M tokens

6 routes · 1 batch · 1 cache

Provider docs
Xiaomi MiMo-V2.5

Xiaomi MiMo-V2.5 is the lower-cost native omnimodal sibling in the MiMo-V2.5 series. OpenRouter describes it as supporting text, image, audio, and video inputs with text output, Pro-level agentic performance at roughly half the inference cost, and improved multimodal perception over MiMo-V2-Omni. Xiaomi's official April 22 release page highlights MiMo-V2.5 alongside MiMo-V2.5-Pro in benchmark data and says the V2.5 series will be open-sourced soon; no public weights/license were verified at research time.

2026-04-22

Researched 32d ago

1.05m

1,048,576 tokens

1.05m context · Reasoning · Vision · Multimodal · Tool use · Functions
OpenRouter

$0.140 in / $0.280 out / 1M tokens

2 routes · 1 cache

Provider docs
Xiaomi MiMo-V2.5-Pro

Xiaomi's April 22, 2026 public-beta flagship in the MiMo-V2.5 series. The official Xiaomi MiMo page describes MiMo-V2.5-Pro as its most capable model to date, focused on general agentic capability, complex software engineering, long-horizon tasks, and ultra-long-context instruction following. OpenRouter lists it as text-to-text with 1,048,576 token context, 131,072 max completion tokens, reasoning controls, tool use, and response_format support. Xiaomi says the V2.5 series will be open-sourced soon, but no public weights/license were verified at research time.

2026-04-22

Researched 32d ago

1.05m

1,048,576 tokens

1.05m context · Tool use · Functions · JSON · Prompt cache
OpenRouter

$0.435 in / $0.870 out / 1M tokens

3 routes · 2 cache

Provider docs
Kimi K2.6

Kimi K2.6 is Moonshot AI's multimodal agentic coding model, released April 20 2026 under a Modified MIT license. Built on a 1-trillion-parameter MoE architecture (32B active, 384 experts with 8 selected per token plus 1 shared expert, 61 layers), it features a 262K context window and up to 65,536 output tokens. Supports native image and video inputs (screenshots, PDFs, spreadsheets). Designed for long-horizon coding with agent swarms of up to 300 sub-agents and 4,000 coordinated steps; Moonshot AI cites 200–300 sequential tool calls without task drift. Key benchmarks: SWE-bench Verified 80.2%, SWE-bench Pro 58.6%, LiveCodeBench v6 89.6%, GPQA Diamond 90.5%, Terminal-Bench 2.0 66.7%. Chatbot Arena Elo 1454 (2026-04-28 snapshot).

2026-04-20

Researched 2d ago

262k

262,144 tokens

262k context · Reasoning · Vision · Multimodal · Tool use · Functions
Novita AI

$0.800 in / $3.40 out / 1M tokens

9 routes · 3 cache

Provider docs
Kimi K2.7-Code

Kimi K2.7-Code is Moonshot AI's coding-focused multimodal model released June 12, 2026, built on Kimi K2.6. Uses the same 1-trillion-parameter MoE architecture (32B active parameters, 384 experts with 8 selected per token, 61 layers) with a 262K context window and MoonViT vision encoder (400M parameters). Reports +21.8% on Moonshot's Kimi Code Bench v2, +11.0% on Program Bench, +31.5% on MLS Bench Lite versus K2.6, with approximately 30% fewer reasoning tokens. Forces thinking mode on by default and preserves reasoning content across multi-turn interactions for agentic use. Available via Kimi platform API and HuggingFace under Modified MIT license.

2026-06-12

Researched 2d ago

262k

262,144 tokens

262k context · Reasoning · Vision · Multimodal · Tool use · Functions
OpenRouter

$0.612 in / $3.07 out / 1M tokens

2 routes · 2 cache

Provider docs
GLM-5.1

Post-training variant of GLM-5 from Z.ai (Zhipu AI) with enhanced agentic coding capabilities. Released April 7, 2026. 754B parameters (40B active) in Mixture of Experts architecture, 200K token context, 128K max output. Supports autonomous plan–execute–test–fix–optimize loops for up to 8 hours without human intervention. Trained entirely on Huawei Ascend hardware (no Nvidia). Key benchmarks: SWE-bench Pro 58.4 (world #1 at release, surpassing GPT-5.4 57.7 and Claude Opus 4.6 57.3), GPQA Diamond 86.2, AIME 2026 95.3, Terminal-Bench 2.0 63.5, MCP-Atlas 71.8, Chatbot Arena Elo 1475 (June 16, 2026, arena.ai). Available via Z.ai API ($1.40/$4.40 per 1M input/output tokens) and open weights on Hugging Face under MIT license.

2026-04-07

Researched 3d ago

200k

200,000 tokens

200k context · Reasoning · Tool use · Functions · JSON · Code exec
OpenRouter

$1.05 in / $3.50 out / 1M tokens

5 routes · 2 cache

Provider docs
Amazon Nova Premier

Amazon Nova Premier is Amazon's most capable standard Bedrock Nova understanding model for complex reasoning, agentic workflows, and model distillation. It supports a 1M-token context window, text/image/video inputs, text output, reasoning, tool calling, and prompt caching; use it as the standard Bedrock Nova frontier pick instead of Nova 2 Omni early-access Forge checkpoints.

2025-03-17

Researched 3d ago

1m

1,000,000 tokens

1m context · Reasoning · Vision · Multimodal · Tool use · Functions
AWS Bedrock

$2.50 in / $12.50 out / 1M tokens

2 routes · 1 batch · 1 cache

Provider docs
Claude 3 Sonnet

Claude 3 Sonnet by Anthropic is a versatile large language AI model, balancing intelligence and speed for diverse enterprise use cases. It is part of the Claude 3 family, positioned between the powerful Opus and the faster Haiku models. Sonnet excels in nuanced content creation, accurate summarization, and complex scientific query handling while also showcasing proficiency in non-English languages and coding tasks. Additionally, it enhances vision capabilities with exceptional skills in visual reasoning, such as interpreting charts, graphs, and transcribing text from imperfect images, which benefits industries like retail, logistics, and finance. Operated at twice the speed of Claude 3 Opus, Sonnet is efficient in context-sensitive customer support and multi-step workflows. It has achieved AI Safety Level 2 (ASL-2) and is accessible through multiple platforms, including Claude.ai, the Claude iOS app, the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI.

2024-03-04

Researched 69d ago

200k

200,000 tokens

200k context · Reasoning · Vision · Multimodal · JSON · Code exec
AWS Bedrock

$3.00 in / $15.00 out / 1M tokens

2 routes · 1 cache

Provider docs
Kimi K2.5

Kimi K2.5 is Moonshot AI's Kimi model focused on code generation and software engineering. It offers a 256K-token context window and scores 87.9 on GPQA.

2026-03-15

Researched 23d ago

256k

256,000 tokens

256k context · Vision · Multimodal · Functions · JSON · Prompt cache
OpenRouter

$0.440 in / $2.00 out / 1M tokens

9 routes · 1 cache

Provider docs
GLM-5

Flagship open-weight foundation model from Zhipu AI with 744B parameters (40B active per token) in Mixture of Experts architecture. Trained on 28.5T tokens using DeepSeek Sparse Attention on Huawei Ascend hardware. Achieves state-of-the-art performance on coding and agentic benchmarks (SWE-bench Verified: 77.8%). Supports autonomous planning, multi-step tool use, and self-correction.

2026-02-11

Researched 69d ago

200k

200,000 tokens

200k context · Reasoning · Tool use · Functions · JSON · Prompt cache
OpenRouter

$0.600 in / $2.08 out / 1M tokens

7 routes · 1 cache

Provider docs
Nemotron 3 Super-120B-A12B

NVIDIA Nemotron 3 Super-120B-A12B is a 120B total / 12B active hybrid Latent MoE model with interleaved Mamba-2 and MoE layers for agentic, reasoning, and conversational tasks. Fireworks lists the NVFP4 variant for on-demand deployment with 262k context.

2026-03-11

Researched 26d ago

1.05m

1,048,576 tokens

1.05m context · JSON · Prompt cache
OpenRouter

$0.090 in / $0.450 out / 1M tokens

6 routes · 1 cache

Provider docs
DeepSeek V4 Pro

DeepSeek V4 Pro is DeepSeek's flagship open-weights model, released April 24 2026 under the MIT license. Architecture: 1.6T total / 49B active parameters, MoE with Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA) hybrid — requiring only 27% of inference FLOPs vs standard 1M-context transformers — plus Manifold-Constrained Hyper-Connections (mHC) and Muon Optimizer. Context window: 1,000,000 tokens; max output: 384,000 tokens (Think Max mode requires >=384K context). Text-only (no vision/image input). Supports three reasoning modes: Non-Think, Think High, Think Max. Function calling, tool use, and structured outputs supported. Key benchmarks: SWE-bench Verified 80.6%, SWE-bench Pro 55.4%, LiveCodeBench 93.5%, GPQA Diamond 90.1%, MMLU-Pro 87.5%, Terminal-Bench 2.0 59.1% on BenchLM's independent June 2026 harness, and Chatbot Arena 1456 (2026-06-16). Current API pricing: $0.435/$0.87 per 1M input/output tokens; DeepSeek made the former 75% promotional rate permanent in May 2026.

2026-04-24

Researched 3d ago

1m

1,000,000 tokens

1m context · Reasoning · Tool use · Functions · JSON · Prompt cache
DeepSeek Platform

$0.435 in / $0.870 out / 1M tokens

5 routes · 3 cache

Provider docs
SOLAR 10.7B

SOLAR 10.7B is a robust large language model created by Upstage AI in South Korea, featuring 10.7 billion parameters. It is tailored for high efficiency and performance through its innovative "Depth Up-Scaling" (DUS) approach, which deepens the model's layers rather than widening them, allowing for enhanced capabilities without significantly increasing computational costs. This method distinguishes it from other models that utilize more complex techniques like Mixture of Experts. By integrating pre-trained weights from the Mistral 7B model with the Llama 2 framework, SOLAR 10.7B achieves notable performance, outpacing even some models with up to 30 billion parameters. Available under the Apache 2.0 license, it also includes a finely-tuned instruction-based variant under CC-BY-NC-4.0, optimized for single-turn conversations and diverse NLP tasks, albeit with limitations in handling multi-turn dialogue and complex context. The model is grounded in the transformer architecture, widely adopted in advanced language models.

2024-06-24

Researched 26d ago

4k

4,000 tokens

JSON · Prompt cache
Fireworks AI

$0.200 in / $0.200 out / 1M tokens

5 routes · 1 cache

Provider docs
Qwen3.7-Max

Alibaba's closed-weight flagship language model, announced at the 2026 Alibaba Cloud Summit (May 20). Scored 56.6 on Artificial Analysis Intelligence Index at launch—highest-ranked Chinese model. 1M-token context with prompt caching (up to 90% discount). Pricing: $2.50/$7.50 per 1M tokens in/out.

2026-05-19

Researched today

1m

1,000,000 tokens

1m context · Reasoning · Tool use · Functions · JSON · Code exec
Novita AI

$1.25 in / $3.75 out / 1M tokens

4 routes · 3 cache

Provider docs
Qwen3.6-Plus

Qwen3.6-Plus is Alibaba Cloud's GA Qwen3.6 flagship for long-context reasoning, coding, tool use, and multimodal workflows. DashScope lists it with a 1M-token context window, structured output support, and standard public token pricing.

2026-04-01

Researched 38d ago

1m

1,000,000 tokens

1m context · Vision · Multimodal · Tool use · Functions · JSON
Alibaba Cloud PAI-EAS

$0.325 in / $1.95 out / 1M tokens

3 routes · 2 cache

Provider docs
o3

OpenAI o3 reasoning model with advanced multi-step problem-solving capabilities.

2025-04-16

Researched 19d ago

200k

200,000 tokens

200k context · Reasoning · Vision · Multimodal · Tool use · Functions
OpenAI API

$2.00 in / $8.00 out / 1M tokens

3 routes · 1 batch · 2 cache

Provider docs
Ring-2.6-1T

Ring-2.6-1T is InclusionAI's MIT-licensed trillion-parameter MoE reasoning model for agent workflows, engineering tasks, scientific analysis, and enterprise automation. It supports high and xhigh reasoning effort modes and entered OpenRouter's Programming top 10 in the 2026-05-18 audit.

2026-05-08

Researched 39d ago

262k

262,144 tokens

262k context · Reasoning · Tool use · Functions · JSON · Prompt cache
OpenRouter

$0.075 in / $0.625 out / 1M tokens

2 routes · 1 cache

Provider docs
GPT-5

OpenAI's previous intelligent reasoning model with configurable reasoning effort. Released August 2025. Supports minimal, low, medium, and high reasoning levels. Succeeded by GPT-5.1 and later models.

2025-08-07

Researched 48d ago

400k

400,000 tokens

400k context · Reasoning · Vision · Multimodal · Tool use · Functions
OpenAI API

$1.25 in / $10.00 out / 1M tokens

4 routes · 1 batch · 2 cache

Provider docs
GPT-5 Chat

GPT-5 Chat is OpenAI's conversational variant of GPT-5 designed for advanced multimodal, context-aware enterprise conversations at 128K context.

2025-10-01

Researched 61d ago

128k

128,000 tokens

128k context · Vision · Multimodal · Tool use · Functions · JSON
OpenRouter

$1.25 in / $10.00 out / 1M tokens

2 routes · 2 cache

Provider docs
GPT-5.1 Chat

GPT-5.1 Chat is the fast, lightweight conversational member of the GPT-5.1 family, optimized for low-latency chat at 128K context.

2025-12-01

Researched 61d ago

128k

128,000 tokens

128k context · Vision · Multimodal · Tool use · Functions · JSON
OpenRouter

$1.25 in / $10.00 out / 1M tokens

1 route · 1 cache

Provider docs
GPT-5 Mini

Near-frontier intelligence for cost-sensitive, low-latency, high-volume workloads. Released August 2025. Replaces o4-mini (shutting down Oct 2026).

2025-08-07

Researched 48d ago

400k

400,000 tokens

400k context · Reasoning · Vision · Multimodal · Tool use · Functions
OpenAI API

$0.250 in / $2.00 out / 1M tokens

4 routes · 1 batch · 2 cache

Provider docs
gpt-oss-120b

OpenAI open-weight model with 120 billion parameters. Text-only model supporting reasoning, function calling, and structured outputs. Free for self-hosting. Released August 5, 2025.

2025-08-05

Researched 69d ago

131k

131,072 tokens

131k context · Tool use · Functions · JSON · Prompt cache
OpenRouter

$0.039 in / $0.180 out / 1M tokens

10 routes · 1 cache

Provider docs
Gemini 3 Flash

Gemini 3 Flash is Google's speed-optimized Gemini 3 model, available in public preview via the Gemini API and Vertex AI. It supports text, image, audio, and video inputs with a 1M token context window and is priced at $0.50 per 1M input tokens and $3.00 per 1M output tokens.

2025-12-17

Researched 41d ago

1m

1,000,000 tokens

1m context · Vision · Multimodal · Audio · Tool use · Functions
GCP Vertex AI

$0.500 in / $3.00 out / 1M tokens

4 routes · 1 cache

Provider docs
GPT-5 Nano

Fastest, cheapest GPT-5 variant for summarization and classification tasks. Also available via Realtime API.

2025-08-07

Researched 48d ago

400k

400,000 tokens

400k context · Reasoning · Vision · Multimodal · Tool use · Functions
OpenAI API

$0.050 in / $0.400 out / 1M tokens

4 routes · 1 batch · 2 cache

Provider docs
GPT OSS Safeguard 20B

GPT OSS Safeguard 20B is OpenAI's GPT-OSS model focused on content moderation and safety classification. It offers a 128K-token context window with weights openly available for self-hosting.

2025-08-05

Researched 39d ago

131k

131,072 tokens

131k context · Tool use · Functions · JSON · Prompt cache
AWS Bedrock

$0.070 in / $0.200 out / 1M tokens

4 routes · 1 cache

Provider docs
GPT-5.1 Codex

GPT-5.1-Codex is a coding-specialized version of GPT-5.1, optimized for software engineering and agentic coding workflows at 400K context.

2025-12-01

Researched 61d ago

400k

400,000 tokens

400k context · Vision · Multimodal · Tool use · Functions · JSON
OpenRouter

$1.25 in / $10.00 out / 1M tokens

2 routes · 2 cache

Provider docs
GPT-5 Codex

GPT-5 Codex is OpenAI's coding-specialized variant of GPT-5, optimized for software engineering workflows, code generation, and agentic coding tasks at 400K context.

2025-10-01

Researched 61d ago

400k

400,000 tokens

400k context · Vision · Multimodal · Tool use · Functions · JSON
OpenRouter

$1.25 in / $10.00 out / 1M tokens

2 routes · 2 cache

Provider docs
GLM 4.5V

GLM-4.5V is a vision-language MoE model from Z.ai designed for multimodal agent applications, handling both image understanding and text generation at 64K context.

2026-01-01

Researched 36d ago

64k

64,000 tokens

Vision · Multimodal · Tool use · Functions · Prompt cache
Novita AI

$0.600 in / $1.80 out / 1M tokens

2 routes · 1 cache

Provider docs
Seed 1.6

Seed 1.6 is a general-purpose multimodal model from ByteDance Seed supporting text, image, and video inputs. It incorporates multimodal capabilities and deep thinking for complex tasks at 256K context.

2026-03-01

Researched 61d ago

256k

256,000 tokens

256k context · Reasoning · Vision · Multimodal · Tool use · Functions
Vercel AI Gateway

$0.250 in / $2.00 out / 1M tokens

1 route · 1 cache

Provider docs
Amazon Nova 2 Lite

Amazon Nova 2 Lite is a fast, cost-effective reasoning model for everyday workloads that processes text, images, and videos at 1M token context with improved reasoning over Nova Lite v1.

2025-12-02

Researched 3d ago

1m

1,000,000 tokens

1m context · Reasoning · Vision · Multimodal · Tool use · Functions
Vercel AI Gateway

$0.300 in / $2.50 out / 1M tokens

1 route · 1 cache

Provider docs
o3 Deep Research

o3-deep-research is OpenAI's advanced model for deep research, designed to tackle complex multi-step research tasks by synthesizing information from multiple sources at 200K context.

2025-10-10

Researched 44d ago

200k

200,000 tokens

200k context · Reasoning · Vision · Multimodal · Tool use · Functions
Vercel AI Gateway

$10.00 in / $40.00 out / 1M tokens

1 route · 1 cache

Provider docs
GPT-4.1

OpenAI's GPT-4.1 model released April 2025, excelling at coding tasks, precise instruction following, and web development. Outperforms GPT-4o in these areas with a 1 million token context window. Available via API and in ChatGPT for Plus, Pro, Team, Enterprise, and Edu users.

2025-04-01

Researched 48d ago

1.05m

1,047,576 tokens

1.05m context · Vision · Multimodal · Tool use · Functions · JSON
OpenAI API

$2.00 in / $8.00 out / 1M tokens

4 routes · 1 batch · 2 cache

Provider docs
GLM 4.6V

GLM-4.6V is Z.ai's large multimodal model for high-fidelity visual understanding and long-context reasoning across images, charts, and documents at 128K context.

2026-02-01

Researched 36d ago

128k

128,000 tokens

128k context · Vision · Multimodal · Tool use · Functions · Prompt cache
Novita AI

$0.300 in / $0.900 out / 1M tokens

2 routes · 1 cache

Provider docs
GPT-4.1 Mini

Fast and efficient small model from OpenAI replacing GPT-4o mini. Released April 2025 alongside GPT-4.1. Shows improvements in instruction-following, coding, and intelligence with a 1 million token context window. Available in ChatGPT for paid users.

2025-04-01

Researched 48d ago

1.05m

1,047,576 tokens

1.05m context · Vision · Multimodal · Tool use · Functions · JSON
OpenAI API

$0.400 in / $1.60 out / 1M tokens

4 routes · 2 cache

Provider docs
KAT Coder Pro V2

KAT-Coder-Pro V2 is Kwaipilot's flagship agentic coding model, achieving 79.6% on SWE-Bench Verified (March 2026). Designed for complex enterprise software engineering tasks, multi-system coordination, and SaaS integration. Uses a 'Specialize-then-Unify' training paradigm with five specialized expert domains. Context: 256K tokens. Max output: 256K tokens (on Streamlake endpoint). Available via Vercel AI Gateway and OpenRouter.

2026-03-27

Researched 36d ago

256k

256,000 tokens

256k context · Tool use · Functions · JSON · Code exec · Prompt cache
Novita AI

$0.300 in / $1.20 out / 1M tokens

3 routes · 1 cache

Provider docs
Claude 3.5 Haiku

Claude 3.5 Haiku is Anthropic's latest AI model, known for its speed and efficiency while maintaining high intelligence. It is optimized for applications needing rapid response, like interactive chatbots and real-time content moderation. It was initially text-only, with image input planned for the future. It excels in delivering fast, accurate code suggestions, processing and categorizing information swiftly, and handling large volumes of user interactions. Priced accessibly, it offers advanced coding, tool use, and reasoning abilities. Though initially surpassing Claude 3 Haiku in benchmarks, its pricing reflects its enhanced performance.

2024-10-22

Researched 26d ago

200k

200,000 tokens

200k context · Reasoning · Vision · JSON · Code exec · Batch
Anthropic

$0.800 in / $4.00 out / 1M tokens

6 routes · 1 batch · 2 cache

Provider docs
Claude Sonnet 4.5

Claude Sonnet 4.5 is Anthropic's Claude 4.5 model with multimodal text and image input and an optional reasoning mode. It offers a 200K-token context window and scores 86 on MMLU PRO.

2025-09-29

Researched 39d ago

200k

200,000 tokens

200k context · Reasoning · Vision · Multimodal · Tool use · Functions
Anthropic

$3.00 in / $15.00 out / 1M tokens

8 routes · 1 batch · 2 cache

Provider docs
DeepSeek V3.1

Enhanced reasoning and grounded retrieval model from DeepSeek with multimodal text and image understanding.

2025-08-21

Researched 36d ago

64k

64,000 tokens

Vision · Multimodal · JSON · Code exec · Prompt cache
Novita AI

$0.270 in / $1.00 out / 1M tokens

8 routes · 1 cache

Provider docs
Claude Opus 4.5

Claude Opus 4.5 is Anthropic's Claude 4.5 model with multimodal text and image input and an optional reasoning mode. It offers a 200K-token context window and scores 80.7 on MMMU.

2025-11-01

Researched 39d ago

200k

200,000 tokens

200k context · Reasoning · Vision · Multimodal · Tool use · Functions
Anthropic

$5.00 in / $25.00 out / 1M tokens

6 routes · 1 batch · 2 cache

Provider docs
Claude 3.5 Sonnet

Claude 3.5 Sonnet, the latest in Anthropic's line of large language models, merges state-of-the-art reasoning, coding, and natural language understanding capabilities with advanced multi-modal processing. First released in June 2024 and upgraded in October 2024, it excels in benchmarks against previous models and competitors, thanks to its scalable attention mechanisms and massive neural network architecture. Its dynamic routing enables specialization in various tasks, supporting applications from software development and data analysis to customer support and content creation. Users benefit from its "Artifacts" feature for real-time collaborative workflows and can access the model through platforms like Claude.ai and APIs at competitive pricing rates.

2024-06-20

Researched 69d ago

200k

200,000 tokens

200k context · Reasoning · Vision · Multimodal · Functions · JSON
Anthropic

$3.00 in / $15.00 out / 1M tokens

6 routes · 1 cache

Provider docs
GPT-4o

OpenAI GPT-4o: Flagship multimodal model with vision, function calling, and broad capability. $2.50/M input, $10/M output.

2024-05-13

Researched 48d ago

128k

128,000 tokens

128k context · Vision · Multimodal · Tool use · Functions · JSON
OpenAI API

$2.50 in / $10.00 out / 1M tokens

5 routes · 1 batch · 2 cache

Provider docs
Qwen3-Max

Alibaba's Qwen3-Max, flagship model with improved multilingual and reasoning capabilities.

2025-04-28

Researched 69d ago

262k

262,144 tokens

262k context · Vision · Multimodal · Tool use · Functions · JSON
OpenRouter

$0.780 in / $3.90 out / 1M tokens

3 routes · 1 cache

Provider docs
Qwen3.7-Plus

Alibaba's multimodal agentic model with text, image, and video input. Combines vision-language understanding with full agentic capabilities: deep reasoning, self-programming, tool invocation, and autonomous iteration. GUI grounding: 79.0 on ScreenSpot Pro. Max output 66K tokens. Pricing: $0.40/$1.60 per 1M tokens in/out.

2026-06-03

Researched 18d ago

1m

1,000,000 tokens

1m context · Reasoning · Vision · Multimodal · Tool use · Functions
OpenRouter

$0.320 in / $1.28 out / 1M tokens

2 routes · 1 cache

Provider docs
Claude Sonnet 4.6

Claude Sonnet 4.6 is Anthropic's best combination of speed and intelligence. Proprietary decoder-only model with 1M-token context, 64K max output, multimodal vision, extended thinking, and function calling. Available via Anthropic API, AWS Bedrock, GCP Vertex AI, and OpenRouter at $3/1M input and $15/1M output tokens.

2026-02-17

Researched 15d ago

1m

1,000,000 tokens

1m context · Reasoning · Vision · Multimodal · Tool use · Functions
Anthropic

$3.00 in / $15.00 out / 1M tokens

6 routes · 1 batch · 3 cache

Provider docs
Claude Opus 4.7

Claude Opus 4.7 is Anthropic's generally available flagship model with 1M context, 128K max output, adaptive thinking, and a new tokenizer with roughly 555K words per 1M tokens.

2026-04-16

Researched today

1m

1,000,000 tokens

1m context · Reasoning · Vision · Multimodal · Tool use · Functions
Anthropic

$5.00 in / $25.00 out / 1M tokens

6 routes · 1 batch · 3 cache

Provider docs
Claude Opus 4.6

Claude Opus 4.6 is Anthropic's Claude 4.6 model with multimodal text and image input and an optional reasoning mode. It offers a 1M-token context window and scores 80.8 on SWE-bench Verified.

2026-02-05

Researched 39d ago

1m

1,000,000 tokens

1m context · Reasoning · Vision · Multimodal · Tool use · Functions
Anthropic

$5.00 in / $25.00 out / 1M tokens

6 routes · 1 batch · 4 cache

Provider docs
Qwen3.6 Max Preview

Qwen3.6-Max-Preview is a proprietary frontier model from Alibaba Cloud built on a sparse MoE architecture, available for preview as part of the Qwen3.6 series.

2026-04-20

Researched 46d ago

256k

256,000 tokens

256k context · Reasoning · Vision · Multimodal · Tool use · Functions
Alibaba Cloud PAI-EAS

$1.04 in / $6.24 out / 1M tokens

3 routes · 1 cache

Provider docs
Claude Opus 4.8

Claude Opus 4.8 is Anthropic's flagship Claude 4.8 model, released May 28, 2026 for agentic coding, long-horizon reasoning, computer use, and professional knowledge work. It supports text and image inputs, adaptive reasoning, tool use, structured outputs, computer-use tools, prompt caching, Batch API, Dynamic Workflows parallel subagents, a 1M-token context window on Anthropic API/Bedrock/Vertex, and 128K max output. Key datapack rows: SWE-bench Pro 69.2%, SWE-bench Verified 88.6%, Terminal-Bench 2.1 74.6%, HLE with tools 57.9%, OSWorld-Verified 83.4%, GDPval-AA 1890 Elo, and MCP-Atlas 82.2%. Standard Anthropic API pricing is $5/M input and $25/M output.

2026-05-28

Researched 3d ago

1m

1,000,000 tokens

1m context · Reasoning · Vision · Multimodal · Tool use · Functions
Anthropic

$5.00 in / $25.00 out / 1M tokens

6 routes · 1 batch · 1 cache

Provider docs
GLM 4.7

GLM-4.7 is Z.ai's flagship text model featuring enhanced programming capabilities and deeper reasoning at 200K context, succeeding GLM-4.6.

2026-03-01

Researched 36d ago

200k

200,000 tokens

200k context · Tool use · Functions · JSON · Code exec · Prompt cache
Fireworks AI

$0.600 in / $2.20 out / 1M tokens

3 routes · 1 cache

Provider docs
DeepSeek V4 Flash

DeepSeek V4 Flash is a 284B parameter (13B activated) Mixture-of-Experts language model with 1M-token context. Features a hybrid attention architecture combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) for efficient long-context inference. Supports thinking and non-thinking modes. Legacy API aliases deepseek-chat and deepseek-reasoner map to this model's non-thinking and thinking modes respectively. Pricing: $0.14/1M input, $0.28/1M output (cache hit: $0.0028/1M input). MIT licensed.

2026-04-24

Researched 14d ago

1m

1,000,000 tokens

1m context · Reasoning · Tool use · Functions · JSON · Prompt cache
OpenRouter

$0.0983 in / $0.1966 out / 1M tokens

5 routes · 2 cache

Provider docs
Step 3.7 Flash

Step 3.7 Flash is StepFun's open-weights multimodal Mixture-of-Experts model for agentic coding, tool use, long-context reasoning, image understanding, and video understanding. It combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder, activates about 11B parameters per token, supports a 256K-token context window, and exposes low, medium, and high reasoning levels for speed/depth tradeoffs. StepFun reports leading open-model results on ClawEval-1.1, SimpleVQA with Search, and SWE-bench Pro at launch. Weights are available on Hugging Face under Apache 2.0.

2026-05-29

Researched 29d ago

256k

256,000 tokens

256k context · Reasoning · Vision · Multimodal · Tool use · Functions
OpenRouter

$0.200 in / $1.15 out / 1M tokens

3 routes · 1 cache

Provider docs
MAI-Code-1-Flash

MAI-Code-1-Flash is Microsoft AI's lightweight agentic coding model built directly inside GitHub Copilot's production harness. It is designed for fast everyday developer workflows, adaptive thinking by task complexity, multi-turn instruction following, and token-efficient coding. Microsoft reports 51.2% on SWE-bench Pro versus 35.2% for Claude Haiku 4.5 in the same Copilot harness, plus stronger results on SWE-bench Verified, SWE-bench Multilingual, and Terminal-Bench 2.0 without publishing exact scores for those secondary benchmarks.

2026-06-02

Researched 16d ago

256k

256,000 tokens

256k context · Reasoning · Tool use · Prompt cache
Microsoft Foundry

$0.750 in / $4.50 out / 1M tokens

1 route · 1 cache

Provider docs
DeepSeek V3.2

DeepSeek V3.2 is a model in DeepSeek's V3 family. It offers a 160K-token context window with weights openly available for self-hosting and scores 70 on SWE-bench Verified.

2025-12-01

Researched 38d ago

160k

160,000 tokens

160k context · JSON · Code exec · Prompt cache
OpenRouter

$0.252 in / $0.378 out / 1M tokens

7 routes · 1 cache

Provider docs
DeepSeek R1 0528

DeepSeek R1 0528 is a model in DeepSeek's R1 family with an optional reasoning mode. It offers a 130K-token context window with weights openly available for self-hosting and scores 81 on GPQA.

2025-05-28

Researched 37d ago

130k

130,000 tokens

130k context · Reasoning · JSON · Code exec · Prompt cache
Fireworks AI

$0.560 in / $1.68 out / 1M tokens

7 routes · 2 cache

Provider docs
Kimi K2 Thinking

Extended thinking variant of Kimi K2 with native reasoning capabilities. 256K context.

2025-01-01

Researched 23d ago

256k

256,000 tokens

256k context · Reasoning · JSON · Prompt cache
AWS Bedrock

$0.600 in / $2.50 out / 1M tokens

7 routes · 1 cache

Provider docs
Qwen3-Coder-480B-A35B-Instruct

Qwen3-Coder-480B-A35B-Instruct is Alibaba's flagship open-source code generation and agentic model, released July 22, 2025 under the Apache 2.0 license. The model has 480 billion total parameters with 35 billion active parameters per token, organized across 62 transformer layers with 160 specialized expert networks and 8 experts activated per token. It uses Grouped Query Attention (GQA) with 96 query heads and 8 key-value heads and supports a native context window of 262,144 tokens, extendable to 1 million tokens via YaRN position scaling. The model is purpose-built for software engineering tasks and agentic workflows: code generation, code review, test writing, multi-step debugging, and browser-based agentic task execution. On release, it achieved state-of-the-art results among open models on Agentic Coding, Agentic Browser-Use, and Agentic Tool-Use benchmarks, with performance comparable to Claude Sonnet 4 on these tasks. Available via Fireworks AI, Google Vertex AI, NVIDIA NIM, AWS Bedrock, Novita AI, and the Vercel AI Gateway.

2025-07-22

Researched 8d ago

262k

262,144 tokens

262k context · Tool use · Functions · JSON · Code exec · Prompt cache
Novita AI

$0.380 in / $1.55 out / 1M tokens

6 routes · 1 cache

Provider docs
Gemini 3.1 Pro Preview

Google: Gemini 3.1 Pro Preview available via OpenRouter. Pricing: $2/1M input, $12/1M output.

2026-02-19

Researched 8d ago

1m

1,000,000 tokens

1m context · Vision · Multimodal · Tool use · Functions · JSON
GCP Vertex AI

$2.00 in / $12.00 out / 1M tokens

5 routes · 1 cache

Provider docs
MiniMax M2

MiniMax M2 is a language model from MiniMax. It offers a 197K-token context window.

2025-10-01

Researched 36d ago

197k

197,000 tokens

197k context · JSON · Prompt cache
Fireworks AI

$0.900 in / $0.900 out / 1M tokens

5 routes · 1 cache

Provider docs
Gemini 2.5 Flash

Gemini 2.5 Flash is a Google model available via OpenRouter, priced at $0.30 per 1M input tokens and $2.50 per 1M output tokens.

2025-06-17

Researched 69d ago

1m

1,000,000 tokens

1m context · Vision · Multimodal · Tool use · Functions · JSON
GCP Vertex AI

$0.300 in / $2.50 out / 1M tokens

5 routes · 1 cache

Provider docs
MiniMax M2.5

MiniMax M2.5 (free) is a MiniMax model available via OpenRouter.

2025-03-01

Researched 36d ago

197k

197,000 tokens

197k context · JSON · Prompt cache
OpenRouter

$0.150 in / $1.15 out / 1M tokens

5 routes · 1 cache

Provider docs
Gemini 3.5 Flash

Gemini 3.5 Flash is Google DeepMind's generally available Flash model for sustained frontier-level performance on agentic and coding tasks. It supports multimodal inputs, native thinking, tool and function calling, structured outputs, code execution, search grounding, batch processing, and long contexts up to 1M tokens.

2026-05-19

Researched 15d ago

1.05m

1,048,576 tokens

1.05m context · Reasoning · Vision · Multimodal · Audio · Tool use
GCP Vertex AI

$1.50 in / $9.00 out / 1M tokens

4 routes · 2 batch · 3 cache

Provider docs
MiniMax M2.7

MiniMax M2.7 is MiniMax's self-improving frontier model, released March 18, 2026. It introduces native multi-agent collaboration, complex skill orchestration, and early recursive self-improvement capabilities. The model uses 10B active parameters, supports a 204,800-token context window, and was released alongside MiniMax-M2.7-highspeed, a 66% faster latency-optimized variant. Public provider listings price standard M2.7 at $0.30 per 1M input tokens and $1.20 per 1M output tokens. MiniMax M3 supersedes it for million-token, multimodal, and computer-use workflows, while M2.7 remains the lower-cost text-only route when 200K context is enough.

2026-03-18

Researched 26d ago

205k

204,800 tokens

205k context · Reasoning · Tool use · Functions · JSON · Prompt cache
OpenRouter

$0.279 in / $1.20 out / 1M tokens

4 routes · 1 cache

Provider docs
DeepSeek V3.1 Terminus

DeepSeek V3.1 Terminus is a DeepSeek model available via OpenRouter, priced at $0.21 per 1M input tokens and $0.79 per 1M output tokens.

2025-09-22

Researched 37d ago

164k

163,800 tokens

164k context · JSON · Prompt cache
OpenRouter

$0.210 in / $0.790 out / 1M tokens

4 routes · 1 cache

Provider docs
Gemini 2.5 Flash Lite

Gemini 2.5 Flash Lite is a Google model available via OpenRouter, priced at $0.10 per 1M input tokens and $0.40 per 1M output tokens.

2025-07-22

Researched 69d ago

1m

1,000,000 tokens

1m context · Vision · Multimodal · Tool use · Functions · JSON
GCP Vertex AI

$0.100 in / $0.400 out / 1M tokens

4 routes · 1 cache

Provider docs
Gemini 2.5 Pro

Google DeepMind's most capable Gemini 2.5 model with native thinking/reasoning support. Features a 1M-token context window, multimodal inputs (text, image, audio, video), function calling, and strong performance across coding, mathematics, and scientific reasoning tasks.

2025-06-17

Researched 22d ago

1m

1,000,000 tokens

1m context · Reasoning · Vision · Multimodal · Tool use · Functions
GCP Vertex AI

$1.25 in / $10.00 out / 1M tokens

4 routes · 2 batch · 3 cache

Provider docs
GLM-4.5

GLM-4.5 is Tsinghua Knowledge Engineering Group (THUDM)'s GLM-4 model. It offers a 128K-token context window.

2025-01-01

Researched 36d ago

128k

128,000 tokens

128k context · JSON · Prompt cache
Fireworks AI

$0.900 in / $0.900 out / 1M tokens

4 routes · 1 cache

Provider docs
GLM-4.5-Air

GLM-4.5-Air is Tsinghua Knowledge Engineering Group (THUDM)'s GLM-4 model. It offers a 128K-token context window.

2025-01-01

Researched 36d ago

128k

128,000 tokens

128k context · JSON · Prompt cache
Novita AI

$0.130 in / $0.850 out / 1M tokens

4 routes · 1 cache

Provider docs
GLM-4.6

GLM-4.6 is Tsinghua Knowledge Engineering Group (THUDM)'s GLM-4 model. It offers a 198K-token context window.

2025-01-01

Researched 36d ago

198k

198,000 tokens

198k context · JSON · Prompt cache
Fireworks AI

$0.900 in / $0.900 out / 1M tokens

4 routes · 1 cache

Provider docs
GPT-4o-mini

GPT-4o-mini is an OpenAI model available via OpenRouter, priced at $0.15 per 1M input tokens and $0.60 per 1M output tokens.

2024-07-18

Researched 48d ago

128k

128,000 tokens

128k context · JSON · Prompt cache · Batch · Fine-tune
OpenAI API

$0.150 in / $0.600 out / 1M tokens

4 routes · 2 cache

Provider docs
Gemini 3.1 Flash-Lite

Gemini 3.1 Flash-Lite is Google's generally available low-latency Gemini 3.1 model, launched May 7, 2026. It is optimized for high-volume, cost-sensitive workloads with text, image, and video inputs, a 1M token context window, and a 66K token maximum output. The GA model uses the stable API ID gemini-3.1-flash-lite and replaces gemini-3.1-flash-lite-preview, which is scheduled to shut down on May 25, 2026. Pricing is $0.25 per 1M input tokens and $1.50 per 1M output tokens.

2026-05-07

Researched 8d ago

1.05m

1,048,576 tokens

1.05m context · Vision · Multimodal · Tool use · Functions · JSON
Google AI Studio

$0.250 in / $1.50 out / 1M tokens

3 routes · 1 cache

Provider docs
MiniMax M2.5 Highspeed

MiniMax M2.5 Highspeed is MiniMax's inference-optimized variant of M2.5, released simultaneously in February 2026. It delivers identical intelligence and outputs to standard M2.5 through a specialized inference engine at lower latency. The model supports a 204,800-token context window, 131,072-token max output, function calling, structured output, and reasoning. API model ID: MiniMax-M2.5-highspeed. It is designed for latency-sensitive interactive applications and automated agent pipelines.

2026-02-12

Researched 36d ago

205k

204,800 tokens

205k context · Reasoning · Tool use · Functions · JSON · Prompt cache
Novita AI

$0.600 in / $2.40 out / 1M tokens

3 routes · 1 cache

Provider docs
MiniMax M2.7 Highspeed

MiniMax M2.7 Highspeed is the inference-optimized variant of MiniMax M2.7, released simultaneously on March 18, 2026. It reaches 100 tokens per second output speed, about 66% faster than standard M2.7, while preserving identical intelligence and outputs through engine optimization rather than weight changes. It supports a 204,800-token context window, 131,072-token max output, function calling, structured output, and reasoning. API model ID: MiniMax-M2.7-highspeed.

2026-03-18

Researched 54d ago

205k

204,800 tokens

205k context · Reasoning · Tool use · Functions · JSON · Prompt cache
Vercel AI Gateway

$0.600 in / $2.40 out / 1M tokens

2 routes · 1 cache

Provider docs
GPT-5.4 Nano

GPT-5.4 Nano is the smallest and fastest variant in the GPT-5.4 family, optimized for edge deployment and low-latency tasks. Model ID: gpt-5.4-nano.

2026-03-05

Researched 23d ago

400k

400,000 tokens

400k context · Vision · Multimodal · Tool use · Functions · JSON
OpenAI API

$0.200 in / $1.25 out / 1M tokens

3 routes · 1 batch · 3 cache

Provider docs
GLM-5V-Turbo

First native multimodal variant of GLM-5 with CogViT visual encoder. Specialized for design-to-code tasks—converts mockups, screenshots, Figma exports, and hand-drawn sketches into HTML, CSS, and JavaScript. Trained with reinforcement learning across 30+ task types with INT8 quantization. Achieved 94.8 on Design2Code benchmark (vs Claude Opus 4.6: 77.3). Supports image, video, and text inputs natively.

2026-04-01

Researched 69d ago

200k

200,000 tokens

200k context · Reasoning · Vision · Multimodal · Tool use · Functions
OpenRouter

$1.20 in / $4.00 out / 1M tokens

2 routes · 1 cache

Provider docs
GPT Image 1.5

GPT Image 1.5 is OpenAI's improved image generation model, released December 16, 2025. Delivers 4× faster generation than GPT Image 1, improved text rendering, precise logo and face preservation, and better instruction following. Approximately 20% cheaper than GPT Image 1 with per-image pricing at $0.009/$0.034/$0.133 for 1024×1024 (low/medium/high). Became the default for ChatGPT Images on release. Supports landscape (1536×1024) and portrait (1024×1536) orientations.

2025-12-16

Researched 48d ago

No window data

Vision · Multimodal · Prompt cache
Vercel AI Gateway

$5.00 in / $32.00 out / 1M tokens

2 routes · 1 cache

Provider docs
GPT-5.6 Luna

OpenAI's fast, low-cost GPT-5.6 model. Luna is the affordable tier in the Sol, Terra, and Luna lineup, optimized for latency-sensitive applications such as summarization, drafting, autocomplete, and routine automation. It is available only to select trusted partners until broad API access launches.

2026-06-26

Researched 1d ago

No window data

Vision · Multimodal · Tool use · Functions · Prompt cache
OpenAI API

$1.00 in / $6.00 out / 1M tokens

1 route · 1 cache

Provider docs
GPT-5.4 Mini

GPT-5.4 Mini is a smaller, cost-efficient variant of GPT-5.4 with a 400K token context window. Designed for tasks requiring long-context processing at lower cost. Model ID: gpt-5.4-mini.

2026-03-05

Researched 14d ago

400k

400,000 tokens

400k context · Reasoning · Vision · Multimodal · Tool use · Functions
OpenAI API

$0.750 in / $4.50 out / 1M tokens

3 routes · 1 batch · 3 cache

Provider docs
GLM-5 Turbo

Purpose-built variant of GLM-5 optimized for agent orchestration and complex automated workflows. Features native agent-friendly training emphasizing tool use, command following, and persistent task execution. Designed for OpenClaw (Lobster Agent) workflows with ~0.67% tool-call error rate. Achieves 20% higher inference performance vs base GLM-5.

2026-03-01

Researched 69d ago

200k

200,000 tokens

200k context · Reasoning · Tool use · Functions · JSON · Prompt cache
OpenRouter

$1.20 in / $4.00 out / 1M tokens

2 routes · 1 cache

Provider docs
GPT Image 1 Mini

GPT Image 1 Mini is OpenAI's cost-efficient image generation model, released at OpenAI DevDay 2025 on October 6, 2025. Approximately 80% cheaper than GPT Image 1 with per-image pricing at $0.005/$0.011/$0.036 (low/medium/high, 1024×1024). Targets high-volume, cost-sensitive API workflows. Supports the same quality tiers as the main GPT Image line.

2025-10-06

Researched 48d ago

No window data

Vision · Multimodal · Prompt cache
Vercel AI Gateway

$2.00 in / $8.00 out / 1M tokens

2 routes · 1 cache

Provider docs