Capability filter · capability · beginner

Audio

Also known as: speech, voice, realtime audio

speech or audio I/O

model.audio

Definition

Audio capability covers model support for speech, realtime voice, transcription, translation, or other audio input/output. In model selection it is a modality filter: apply it when the workload needs audio input or output, then verify that the provider route and product surface (for example, the Chat Completions API versus the Realtime API) actually expose that support.
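The two-step selection described above (filter on the audio flag, then check for a provider route) can be sketched in a few lines. This is only an illustration: the `Model` class, its `audio` and `route_count` fields, and the `audio_candidates` helper are hypothetical names, loosely mirroring the `model.audio` key shown on this page, and the three sample rows are copied from the listing below rather than from any real API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    audio: bool       # hypothetical flag, loosely mirroring the page's model.audio key
    route_count: int  # number of tracked provider routes for this model


def audio_candidates(models):
    """Keep models flagged for audio that also have at least one tracked route."""
    return [m for m in models if m.audio and m.route_count > 0]


# Sample rows taken from the listing on this page (assumed structure).
CATALOG = [
    Model("GPT Audio", audio=True, route_count=2),
    Model("GPT Realtime Whisper", audio=True, route_count=1),
    Model("Voxtral TTS", audio=True, route_count=0),  # no tracked provider route
]

selected = [m.name for m in audio_candidates(CATALOG)]
print(selected)
```

A model such as Voxtral TTS passes the modality filter but drops out at the route check, which is why the second verification step matters.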

Models With Audio

Sorted by decision relevance, with model flags and provider-route evidence from seed data.

14 matches

Model · Release · Context · Capabilities · Provider route

GPT Audio

Audio model that accepts audio inputs and produces audio outputs via the Chat Completions API. Replaces the deprecated gpt-4o-audio-preview-2024-12-17.

2024-10-01

Researched 5d ago

128K

128,000 tokens

128K context · Multimodal · Audio

OpenAI API

$2.50 in / $10.00 out / 1M tokens

2 routes

Provider docs

GPT Audio Mini

Cost-efficient audio model that accepts audio inputs and produces audio outputs via the Chat Completions API.

2024-10-01

Researched 5d ago

128K

128,000 tokens

128K context · Multimodal · Audio

OpenAI API

$0.600 in / $2.40 out / 1M tokens

2 routes

Provider docs

GPT Realtime 2

GPT Realtime 2 is OpenAI's second-generation real-time voice model, released May 7, 2026. It is a GPT-5-class speech-to-speech model for voice agents with five reasoning intensity levels, parallel tool calls, spoken preambles, and recovery behavior on failed tasks. The model supports audio and text interaction through the Realtime API with a 128K token context window. Audio token pricing is $32 per 1M input tokens, $0.40 per 1M cached input tokens, and $64 per 1M output tokens.

2026-05-07

Researched 5d ago

131K

131,072 tokens

131K context · Reasoning · Multimodal · Audio · Tool use · Functions

OpenAI API

$32.00 in / $64.00 out / 1M tokens

1 route · 1 cache

Provider docs

GPT Realtime Translate

GPT Realtime Translate is OpenAI's live speech-to-speech translation model, released May 7, 2026. It translates spoken input from 70+ languages into 13 output languages in real time without requiring speakers to pause or complete full sentences. The model is exposed through the /v1/realtime/translations endpoint and is priced per minute at $0.034 rather than per token.

2026-05-07

Researched 5d ago

—

No window data

Multimodal · Audio

OpenAI API

Per-token pricing not tracked; priced per minute ($0.034 per minute)

1 route

Provider docs

GPT Realtime Whisper

GPT Realtime Whisper is OpenAI's streaming speech-to-text model, released May 7, 2026. It transcribes spoken audio live as a speaker talks rather than waiting for utterance completion, making it suitable for live captions, meeting notes, classroom transcripts, and real-time agent pipelines. The model is exposed through /v1/realtime/transcription_sessions and is priced per minute at $0.017 rather than per token.

2026-05-07

Researched 5d ago

—

No window data

Audio

OpenAI API

Per-token pricing not tracked; priced per minute ($0.017 per minute)

1 route

Provider docs

Gemini Deep Research Max Preview

Maximum-comprehensiveness version of Google's Deep Research agent, built on Gemini 3.1 Pro and released April 21, 2026. Spends more compute than the standard preview to consult more sources, refine reports, and capture nuanced details. Designed for accuracy-critical long-form investigations synthesizing information from hundreds of sources. Supports MCP servers, File Search, and multi-step planning. Context: 1M tokens; max output: 65,536 tokens. Runs at Gemini 3.1 Pro rates ($2.00/$12.00 per MTok). API ID: deep-research-max-preview-04-2026.

2026-04-21

Researched 13d ago

1,000,000 tokens

1M context · Vision · Multimodal · Audio · Tool use · Functions

Google AI Studio

$2.00 in / $12.00 out / 1M tokens

1 route

Provider docs

Gemini Deep Research Preview

Google's agentic deep research model built on Gemini 3.1 Pro, released April 21, 2026. Designed for speed and efficiency in autonomous multi-step research: ingests text, images, PDFs, audio, and video to produce comprehensive cited reports from public web sources and private workspace data. Supports collaborative planning, visualization, MCP servers, and File Search. Context window: 1M tokens; max output: 65,536 tokens. Runs at Gemini 3.1 Pro rates ($2.00/$12.00 per MTok). API ID: deep-research-preview-04-2026.

2026-04-21

Researched 13d ago

1,000,000 tokens

1M context · Vision · Multimodal · Audio · Tool use · Functions

Google AI Studio

$2.00 in / $12.00 out / 1M tokens

1 route

Provider docs

Qwen3.5-Omni Flash

Qwen3.5-Omni Flash is Alibaba's lower-latency omnimodal API model, released March 30, 2026. It keeps the Qwen3.5-Omni text, image, audio, and video input surface while reducing cost and latency for short video analysis and high-throughput multimodal workloads. API model ID: qwen3.5-omni-flash.

2026-03-30

Researched 11d ago

262K

262,144 tokens

262K context · Vision · Multimodal · Audio · Tool use · Functions

Alibaba Cloud PAI-EAS

$0.100 in / $0.800 out / 1M tokens

1 route

Provider docs

Qwen3.5-Omni Plus

Qwen3.5-Omni Plus is Alibaba's flagship omnimodal API model, released March 30, 2026. It processes text, images, audio, and video simultaneously and returns text or speech responses. The Plus variant targets maximum quality for multimodal and real-time interaction workloads with API model ID qwen3.5-omni-plus.

2026-03-30

Researched 11d ago

262K

262,144 tokens

262K context · Reasoning · Vision · Multimodal · Audio · Tool use

Alibaba Cloud PAI-EAS

$0.400 in / $4.80 out / 1M tokens

1 route

Provider docs

Voxtral Mini Transcribe 2

Batch speech-to-text transcription model with speaker diarization. Public Mistral pricing is $0.003 per minute.

2026-02-04

Researched 1d ago

33K

32,768 tokens

Multimodal · Audio

Mistral AI Studio

Per-token pricing not tracked; priced per minute ($0.003 per minute)

1 route

Provider docs

ERNIE 5.0

ERNIE 5.0 is Baidu's fifth-generation flagship foundation model, officially launched January 22, 2026 (preview at Baidu World November 13, 2025). It is a fully native multimodal model supporting text, image, audio, and video understanding and generation under a unified autoregressive framework, trained simultaneously across modalities from scratch. With 2.4 trillion total parameters and ultra-sparse MoE activation engaging <3% of parameters per inference, it delivers frontier performance at high efficiency. Available on Baidu AI Cloud's Qianfan MaaS platform. API model IDs: ernie-5.0; ernie-5.0-thinking-preview (thinking mode).

2026-01-22

Researched 6d ago

128K

128,000 tokens

128K context · Reasoning · Vision · Multimodal · Audio · Tool use

Baidu Qianfan

$0.890 in / $3.54 out / 1M tokens

1 route

Provider docs

gpt-realtime

Realtime model capable of text and audio inputs and outputs via the Realtime API.

2025-10-06

Researched 5d ago

32K

32,000 tokens

Multimodal · Audio · Prompt cache

OpenAI API

$4.00 in / $16.00 out / 1M tokens

1 route · 1 cache

Provider docs

gpt-realtime-mini

Cost-efficient realtime voice model for the Realtime API.

2025-10-06

Researched 5d ago

32K

32,000 tokens

Multimodal · Audio · Prompt cache

OpenAI API

$0.600 in / $2.40 out / 1M tokens

1 route · 1 cache

Provider docs

Voxtral TTS

Text-to-speech model with zero-shot voice cloning, multilingual output, and real-time streaming support.

2026-03-23

Researched 1d ago

—

No window data

Multimodal · Audio

No tracked provider route
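
The entries above quote prices in two different units: per 1M tokens (for example GPT Realtime 2 at $32.00 in / $64.00 out) and per minute (for example GPT Realtime Whisper at $0.017). A rough sketch of how the two might be estimated side by side follows. The function names and the workload figures (50,000 input tokens, 20,000 output tokens, 30 minutes of audio) are made up for illustration, and the sketch ignores cached-input pricing.

```python
def per_token_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Cost in USD when prices are quoted per 1M tokens."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000


def per_minute_cost(minutes, price_per_minute):
    """Cost in USD when the price is quoted per minute of audio."""
    return minutes * price_per_minute


# Hypothetical workloads; prices are the ones listed on this page.
realtime_cost = per_token_cost(50_000, 20_000, 32.00, 64.00)  # GPT Realtime 2
transcribe_cost = per_minute_cost(30, 0.017)                   # GPT Realtime Whisper

print(f"GPT Realtime 2: ${realtime_cost:.2f}")
print(f"GPT Realtime Whisper: ${transcribe_cost:.2f}")
```

The two figures are not directly comparable without an assumption about how many tokens a minute of audio consumes, which this page does not state.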