LLM Reference
Concepts & capability filters
Capability filter · capability · intermediate

Prompt caching

Also known as: context caching, cache reads, cached prompts

Reuse repeated prompt tokens.

30 matching active models · 14 tracked providers · 30 models with routes

Fields: model.prompt_caching · modelProvider.cache_read · modelProvider.cache_write_*

Definition

Prompt caching lets a provider store a prompt prefix and bill or process it differently when the same context is reused across requests: cache reads are typically charged at a fraction of the normal input rate and skip recomputation of the cached tokens. It matters for long system prompts, retrieval-heavy applications, and agent loops where stable instructions or documents are sent repeatedly.
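As a rough cost model, the blended input price is a weighted average of the cached and uncached rates. The sketch below uses hypothetical prices; actual rates, minimum cacheable prefix sizes, and cache TTLs vary by provider.

```python
def effective_input_price(base_price: float, cached_price: float, hit_rate: float) -> float:
    """Blended price per 1M input tokens when a fraction `hit_rate`
    of prompt tokens is read from the provider's cache."""
    return cached_price * hit_rate + base_price * (1.0 - hit_rate)

# Hypothetical rates (USD per 1M input tokens): $3.00 uncached, $0.30 cached.
# With 80% of prompt tokens cached, the blended rate is ~$0.84 per 1M tokens.
blended = effective_input_price(3.00, 0.30, hit_rate=0.8)
```

The same formula applies per request: only the stable prefix counts toward `hit_rate`, so the payoff grows with the ratio of reused context to fresh tokens.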

Models with Prompt Caching

Sorted by decision relevance, with model flags and provider-route evidence from seed data.

30 matches
Kimi K2.6

Kimi K2.6 is Moonshot AI's latest agentic reasoning model, launched April 13, 2026 as a code preview for Kimi Code subscribers. Built on a 1-trillion-parameter MoE architecture (32B active, 384 experts), it inherits K2.5's 256K context window and adds enhanced reliability for long-horizon agentic workflows — supporting 200–300 sequential tool calls without drift. Optimized for coding, multi-step agent planning, and vision-assisted tasks such as processing screenshots, PDFs, and spreadsheets.

2026-04-20

Researched 9d ago

262K

262,144 tokens

262K context · Reasoning · Vision · Multimodal · Tool use · Functions
OpenRouter

$0.750 in / $3.50 out / 1M tokens

4 routes · 1 cache

Provider docs
Claude 3 Sonnet

Claude 3 Sonnet by Anthropic is a versatile large language AI model, balancing intelligence and speed for diverse enterprise use cases. It is part of the Claude 3 family, positioned between the powerful Opus and the faster Haiku models. Sonnet excels in nuanced content creation, accurate summarization, and complex scientific query handling while also showcasing proficiency in non-English languages and coding tasks. Additionally, it enhances vision capabilities with exceptional skills in visual reasoning, such as interpreting charts, graphs, and transcribing text from imperfect images, which benefits industries like retail, logistics, and finance. Operating at twice the speed of Claude 3 Opus, Sonnet is efficient in context-sensitive customer support and multi-step workflows. It has achieved AI Safety Level 2 (ASL-2) and is accessible through multiple platforms, including Claude.ai, the Claude iOS app, the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI.

2024-03-04

Researched 26d ago

200K

200,000 tokens

200K context · Reasoning · Vision · Multimodal · JSON · Code exec
AWS Bedrock

$3.00 in / $15.00 out / 1M tokens

2 routes · 1 cache

Provider docs
DeepSeek V4 Pro

DeepSeek V4 Pro is the flagship 1.6T parameter (49B activated) Mixture-of-Experts language model with 1M-token context. Features hybrid attention (CSA+HCA) requiring only 27% of inference FLOPs vs DeepSeek-V3.2 at 1M context, Manifold-Constrained Hyper-Connections (mHC), and Muon Optimizer for training stability. Achieves 93.5% on LiveCodeBench, 89.8% on IMOAnswerBench, and 90.1% on MMLU. Supports Non-Think, Think High, and Think Max reasoning modes. Pricing: $1.74/1M input, $3.48/1M output (cache hit: $0.145/1M input). MIT licensed. Pricing note: DeepSeek API docs state that deepseek-v4-pro is currently offered at a 75% discount, extended until 2026/05/31 15:59 UTC.

2026-04-24

Researched 1d ago

1M

1,000,000 tokens

1M context · Reasoning · Tool use · Functions · JSON · Prompt cache
DeepSeek Platform

$0.435 in / $0.870 out / 1M tokens

3 routes · 1 cache

Provider docs
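The cache-hit rate quoted in the DeepSeek V4 Pro listing above ($0.145 vs $1.74 per 1M input tokens) makes the savings easy to quantify. A back-of-envelope comparison for an agent loop with a stable prefix; the session shape is hypothetical, and the example assumes the prefix is fully cached after the first request, which real cache TTLs and eviction policies may not guarantee:

```python
# Rates from the DeepSeek V4 Pro listing (USD per 1M input tokens).
base, cache_hit = 1.74, 0.145

# Illustrative agent loop: a 50K-token stable prefix resent over 20 turns.
prefix_tokens, turns = 50_000, 20

# Without caching, the full prefix is billed at the base rate every turn.
uncached_cost = base * prefix_tokens * turns / 1e6

# With caching: base rate on turn 1, cache-hit rate on the remaining 19.
cached_cost = (base * prefix_tokens + cache_hit * prefix_tokens * (turns - 1)) / 1e6

savings = 1 - cached_cost / uncached_cost  # roughly 87% on the prefix
```

Per-turn fresh tokens (user messages, tool results) are billed at the base rate either way, so the real-world saving depends on how large the stable prefix is relative to the rest of the prompt.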
o3

OpenAI o3 reasoning model with advanced multi-step problem-solving capabilities.

2025-03-31

Researched 5d ago

200K

200,000 tokens

200K context · Reasoning · JSON · Code exec · Prompt cache · Batch
OpenAI API

$2.00 in / $8.00 out / 1M tokens

2 routes · 1 batch · 1 cache

Provider docs
GPT-5

OpenAI's previous intelligent reasoning model with configurable reasoning effort. Released August 2025. Supports minimal, low, medium, and high reasoning levels. Succeeded by GPT-5.1 and later models.

2025-08-07

Researched 5d ago

400K

400,000 tokens

400K context · Reasoning · Vision · Multimodal · Tool use · Functions
OpenAI API

$1.25 in / $10.00 out / 1M tokens

3 routes · 1 batch · 1 cache

Provider docs
GPT-5 Mini

Near-frontier intelligence for cost-sensitive, low-latency, high-volume workloads. Released August 2025. Replaces o4-mini (shutting down Oct 2026).

2025-08-07

Researched 5d ago

400K

400,000 tokens

400K context · Reasoning · Vision · Multimodal · Tool use · Functions
OpenAI API

$0.250 in / $2.00 out / 1M tokens

3 routes · 1 batch · 1 cache

Provider docs
GPT-5 Nano

Fastest, cheapest GPT-5 variant for summarization and classification tasks. Also available via Realtime API.

2025-08-07

Researched 5d ago

400K

400,000 tokens

400K context · Reasoning · Vision · Multimodal · Tool use · Functions
OpenAI API

$0.050 in / $0.400 out / 1M tokens

3 routes · 1 batch · 1 cache

Provider docs
GPT-4.1

OpenAI's GPT-4.1 model released April 2025, excelling at coding tasks, precise instruction following, and web development. Outperforms GPT-4o in these areas with a 1 million token context window. Available via API and in ChatGPT for Plus, Pro, Team, Enterprise, and Edu users.

2025-04-01

Researched 5d ago

1M

1,047,576 tokens

1M context · Vision · Multimodal · Tool use · Functions · JSON
OpenAI API

$2.00 in / $8.00 out / 1M tokens

3 routes · 1 batch · 1 cache

Provider docs
GPT-4.1 Mini

Fast and efficient small model from OpenAI replacing GPT-4o mini. Released April 2025 alongside GPT-4.1. Shows improvements in instruction-following, coding, and intelligence with a 1 million token context window. Available in ChatGPT for paid users.

2025-04-01

Researched 5d ago

1M

1,047,576 tokens

1M context · Vision · Multimodal · Tool use · Functions · JSON
OpenAI API

$0.400 in / $1.60 out / 1M tokens

3 routes · 1 cache

Provider docs
Claude 3.5 Haiku

Claude 3.5 Haiku is Anthropic's latest AI model, known for its speed and efficiency while maintaining high intelligence. It is optimized for applications needing rapid response, like interactive chatbots and real-time content moderation. Initially text-only, future plans include image input capabilities. It excels in delivering fast, accurate code suggestions, processing and categorizing information swiftly, and handling large volumes of user interactions. Priced accessibly, it offers advanced coding, tool use, and reasoning abilities. It surpasses Claude 3 Haiku in benchmarks, and its pricing reflects the enhanced performance.

2024-10-22

Researched 26d ago

200K

200,000 tokens

200K context · Reasoning · Vision · JSON · Code exec · Batch
Anthropic

$0.800 in / $4.00 out / 1M tokens

5 routes · 1 batch · 1 cache

Provider docs
Claude 3.5 Sonnet

Claude 3.5 Sonnet, the latest in Anthropic's line of large language models, merges state-of-the-art reasoning, coding, and natural language understanding capabilities with advanced multi-modal processing. Released in October 2024, it excels in benchmarks against previous models and competitors, thanks to its scalable attention mechanisms and massive neural network architecture. Its dynamic routing enables specialization in various tasks, supporting applications from software development and data analysis to customer support and content creation. Users benefit from its "Artifacts" feature for real-time collaborative workflows and can access the model through platforms like Claude.ai and APIs at competitive pricing rates.

2024-06-20

Researched 26d ago

200K

200,000 tokens

200K context · Reasoning · Vision · Multimodal · Functions · JSON
Anthropic

$3.00 in / $15.00 out / 1M tokens

6 routes · 1 cache

Provider docs
GPT-4o

OpenAI GPT-4o: Flagship multimodal model with vision, function calling, and broad capability. $2.50/M input, $10/M output.

2024-05-13

Researched 5d ago

128K

128,000 tokens

128K context · Vision · Multimodal · Tool use · Functions · JSON
OpenAI API

$2.50 in / $10.00 out / 1M tokens

4 routes · 1 batch · 1 cache

Provider docs
Claude Sonnet 4.6

Claude Sonnet 4.6 is Anthropic's best combination of speed and intelligence. Proprietary decoder-only model with 1M-token context, 64K max output, multimodal vision, extended thinking, and function calling. Available via Anthropic API, AWS Bedrock, GCP Vertex AI, and OpenRouter at $3/1M input and $15/1M output tokens.

2026-02-17

Researched 7d ago

1M

1,000,000 tokens

1M context · Reasoning · Vision · Multimodal · Tool use · Functions
Anthropic

$3.00 in / $15.00 out / 1M tokens

4 routes · 1 batch · 1 cache

Provider docs
Claude Opus 4.7

Claude Opus 4.7 is Anthropic's generally available flagship model with 1M context, 128K max output, adaptive thinking, and a new tokenizer with roughly 555K words per 1M tokens.

2026-04-16

Researched 1d ago

1M

1,000,000 tokens

1M context · Reasoning · Vision · Multimodal · Tool use · Functions
Anthropic

$5.00 in / $25.00 out / 1M tokens

5 routes · 1 batch · 1 cache

Provider docs
Claude Opus 4.6

Claude Opus 4.6, available on AWS Bedrock.

2026-02-05

Researched 26d ago

1M

1,000,000 tokens

1M context · Reasoning · Vision · Multimodal · Tool use · Functions
Anthropic

$5.00 in / $25.00 out / 1M tokens

4 routes · 1 batch · 1 cache

Provider docs
DeepSeek V4 Flash

DeepSeek V4 Flash is a 284B parameter (13B activated) Mixture-of-Experts language model with 1M-token context. Features a hybrid attention architecture combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) for efficient long-context inference. Supports thinking and non-thinking modes. Legacy API aliases deepseek-chat and deepseek-reasoner map to this model's non-thinking and thinking modes respectively. Pricing: $0.14/1M input, $0.28/1M output (cache hit: $0.0028/1M input). MIT licensed.

2026-04-24

Researched 1d ago

1M

1,000,000 tokens

1M context · Reasoning · Tool use · Functions · JSON · Prompt cache
DeepSeek Platform

$0.140 in / $0.280 out / 1M tokens

2 routes · 1 cache

Provider docs
GPT-4o-mini

OpenAI: GPT-4o-mini available via OpenRouter. Pricing: $0.15/1M input, $0.60/1M output.

2024-07-18

Researched 5d ago

128K

128,000 tokens

128K context · JSON · Prompt cache · Batch · Fine-tune
OpenAI API

$0.150 in / $0.600 out / 1M tokens

3 routes · 1 cache

Provider docs
GPT-5.4 Nano

GPT-5.4 Nano is the smallest and fastest variant in the GPT-5.4 family, optimized for edge deployment and low-latency tasks. Model ID: gpt-5.4-nano.

2026-03-05

Researched 5d ago

400K

400,000 tokens

400K context · Multimodal · Tool use · Functions · JSON · Code exec
OpenAI API

$0.200 in / $1.25 out / 1M tokens

2 routes · 1 batch · 1 cache

Provider docs
GPT-5.4 Mini

GPT-5.4 Mini is a smaller, cost-efficient variant of GPT-5.4 with a 400K token context window. Designed for tasks requiring long-context processing at lower cost. Model ID: gpt-5.4-mini.

2026-03-05

Researched 5d ago

400K

400,000 tokens

400K context · Reasoning · Multimodal · Tool use · Functions · JSON
OpenAI API

$0.750 in / $4.50 out / 1M tokens

2 routes · 1 batch · 1 cache

Provider docs
GPT-5.5

GPT-5.5 is OpenAI's fully retrained agentic model, released April 23, 2026. Optimized for agentic coding, computer use, knowledge work, and early scientific research. Achieves 82.7% on Terminal-Bench 2.0, 84.9% on GDPval, and 58.6% on SWE-Bench Pro. Individual factual claims are 23% more likely to be correct versus GPT-5.4, with factual errors 3% less frequent. Uses fewer tokens than GPT-5.4 for equivalent tasks. Supports text and image inputs. Available to ChatGPT Plus, Business, and Enterprise subscribers; API access coming soon. Model ID: gpt-5.5.

2026-04-23

Researched 5d ago

1.1M

1,050,000 tokens

1.1M context · Reasoning · Vision · Multimodal · Tool use · Functions
OpenAI API

$5.00 in / $30.00 out / 1M tokens

2 routes · 1 batch · 1 cache

Provider docs
GPT-5.4

GPT-5.4 is OpenAI's flagship frontier reasoning model, released March 5, 2026. It incorporates advances from GPT-5.3-Codex for coding and agentic workflows, and adds 'Thinking' mode with editable reasoning plans. Key capabilities include computer use (navigating interfaces via Playwright), image understanding and generation integration, full-stack web app generation, tool calling, and deep research. Knowledge cutoff is August 31, 2025. Model ID: gpt-5.4.

2026-03-05

Researched 5d ago

1.1M

1,050,000 tokens

1.1M context · Reasoning · Multimodal · Tool use · Functions · JSON
OpenRouter

$2.50 in / $15.00 out / 1M tokens

2 routes · 1 batch · 1 cache

Provider docs
GPT-5.3-Codex

Most capable agentic coding model from OpenAI. Optimized for long-horizon, agentic coding tasks in the Codex CLI and API. Note: GPT-5.3-Codex-Spark is a distinct ChatGPT Pro research preview (not API-accessible).

2026-02-05

Researched 5d ago

400K

400,000 tokens

400K context · Reasoning · Vision · Tool use · Functions · JSON
OpenAI API

$1.75 in / $14.00 out / 1M tokens

2 routes · 1 cache

Provider docs
Grok 4.3

xAI's Grok 4.3 is the GA API model that supersedes the Grok 4.3 Beta identifier. It supports native video input processing, document generation workflows, tool use, structured outputs, and Grok Computer integration. xAI's current model metadata lists API ID grok-4.3 with a 1,000,000 token prompt window.

2026-05-05

Researched 1d ago

1M

1,000,000 tokens

1M context · Reasoning · Vision · Multimodal · Tool use · Functions
OpenRouter

$1.25 in / $2.50 out / 1M tokens

2 routes · 2 cache

Provider docs
GPT Realtime 2

GPT Realtime 2 is OpenAI's second-generation real-time voice model, released May 7, 2026. It is a GPT-5-class speech-to-speech model for voice agents with five reasoning intensity levels, parallel tool calls, spoken preambles, and recovery behavior on failed tasks. The model supports audio and text interaction through the Realtime API with a 128K token context window. Audio token pricing is $32 per 1M input tokens, $0.40 per 1M cached input tokens, and $64 per 1M output tokens.

2026-05-07

Researched 5d ago

131K

131,072 tokens

131K context · Reasoning · Multimodal · Audio · Tool use · Functions
OpenAI API

$32.00 in / $64.00 out / 1M tokens

1 route · 1 cache

Provider docs
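In the GPT Realtime 2 listing above, the cached audio-input rate ($0.40/1M) is 1.25% of the uncached rate ($32/1M), so cache hits dominate the savings in long voice sessions. A back-of-envelope sketch using the listed rates; the session shape is hypothetical:

```python
# Audio token rates from the GPT Realtime 2 listing (USD per 1M tokens):
# $32 input, $0.40 cached input, $64 output.
rate_in, rate_cached, rate_out = 32.00, 0.40, 64.00

# Hypothetical voice session: 40K audio input tokens, 30K of them cache
# hits (e.g. repeated system/tool context), plus 10K audio output tokens.
cost = (10_000 * rate_in + 30_000 * rate_cached + 10_000 * rate_out) / 1e6
```

Without the cache, the same 40K input tokens would cost $1.28 on their own; with it, the whole session comes in under a dollar.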
GPT-5.5 Instant

GPT-5.5 Instant is OpenAI's latest Instant model used in ChatGPT, released May 5, 2026 as the new default ChatGPT model and exposed in the API as chat-latest. OpenAI says the update improves factuality, image analysis, STEM answers, web-search decisions, personalization from past chats/files/connected Gmail, and concise conversational style. OpenAI reports 52.5% fewer hallucinated claims than GPT-5.3 Instant on high-stakes prompts and 37.3% fewer inaccurate claims on difficult conversations flagged for factual errors.

2026-05-05

Researched 5d ago

400K

400,000 tokens

400K context · Vision · Multimodal · Tool use · Functions · JSON
OpenAI API

$1.50 in / $6.00 out / 1M tokens

1 route

Provider docs
GPT Image 2

OpenAI's image generation model succeeding DALL-E 3, released April 21, 2026. Uses an autoregressive architecture with native reasoning — the model plans structure and composition before generating pixels. Supports up to 4K (4096×4096) resolution, achieves ~99% character-level text accuracy across Latin, CJK, Hindi, and Bengali scripts, and generates images ~2× faster than DALL-E 3. Debuted at #1 on the Image Arena leaderboard by a +242-point margin. Powered by the GPT-5.4 backbone. API: $8/M image input tokens, $2/M cached, $30/M image output tokens. ChatGPT GA April 22, 2026; API access May 2026.

2026-04-21

Researched 5d ago

No window data

Reasoning · Vision · Multimodal · Prompt cache
OpenAI API

$5.00 in / $30.00 out / 1M tokens

1 route · 1 cache

Provider docs
chatgpt-image-latest

Latest ChatGPT image generation model alias. Points to the current default image generation model in the API.

2025-12-16

Researched 5d ago

No window data

Multimodal · Prompt cache
OpenAI API

$5.00 in / $10.00 out / 1M tokens

1 route · 1 cache

Provider docs
gpt-realtime

Realtime model capable of text and audio inputs and outputs via the Realtime API.

2025-10-06

Researched 5d ago

32K

32,000 tokens

Multimodal · Audio · Prompt cache
OpenAI API

$4.00 in / $16.00 out / 1M tokens

1 route · 1 cache

Provider docs
gpt-realtime-mini

Cost-efficient realtime voice model for the Realtime API.

2025-10-06

Researched 5d ago

32K

32,000 tokens

Multimodal · Audio · Prompt cache
OpenAI API

$0.600 in / $2.40 out / 1M tokens

1 route · 1 cache

Provider docs