Also known as: speech, voice, realtime audio, speech or audio I/O
14 matching active models · 6 tracked providers · 13 models with routes
Audio capability covers model support for speech, realtime voice, transcription, translation, or audio input/output. In model selection it is a modality filter: use it when the workload needs sound, then verify the provider route and product surface.
Sorted by decision relevance, with model flags and provider-route evidence from seed data.
Audio model for inputs/outputs via Chat Completions API. Replaces deprecated gpt-4o-audio-preview-2024-12-17.
Released 2024-10-01 · Researched 5d ago · Context window: 128K (128,000 tokens)
Cost-efficient audio model for inputs/outputs via Chat Completions API.
Released 2024-10-01 · Researched 5d ago · Context window: 128K (128,000 tokens)
GPT Realtime 2 is OpenAI's second-generation real-time voice model, released May 7, 2026. It is a GPT-5-class speech-to-speech model for voice agents with five reasoning intensity levels, parallel tool calls, spoken preambles, and recovery behavior on failed tasks. The model supports audio and text interaction through the Realtime API with a 128K token context window. Audio token pricing is $32 per 1M input tokens, $0.40 per 1M cached input tokens, and $64 per 1M output tokens.
Released 2026-05-07 · Researched 5d ago · Context window: 131K (131,072 tokens)
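The audio token rates quoted above translate directly into per-session costs. A minimal sketch of that arithmetic, assuming simple linear pricing at the three quoted rates ($32 input, $0.40 cached input, $64 output, each per 1M audio tokens); the function name and example token counts are illustrative:

```python
# Estimate GPT Realtime 2 audio cost from the quoted per-token rates.
# Rates in USD per 1M audio tokens.
RATE_INPUT = 32.00
RATE_CACHED = 0.40
RATE_OUTPUT = 64.00

def realtime2_audio_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for one session's audio tokens."""
    return (
        input_tokens * RATE_INPUT
        + cached_tokens * RATE_CACHED
        + output_tokens * RATE_OUTPUT
    ) / 1_000_000

# Example: 50K fresh input, 200K cached input, 30K output audio tokens.
print(f"${realtime2_audio_cost(50_000, 200_000, 30_000):.2f}")  # → $3.60
```

Note how heavily cached input is discounted (80x cheaper than fresh input), which rewards long-running voice agents that reuse session context.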
GPT Realtime Translate is OpenAI's live speech-to-speech translation model, released May 7, 2026. It translates spoken input from 70+ languages into 13 output languages in real time without requiring speakers to pause or complete full sentences. The model is exposed through the /v1/realtime/translations endpoint and is priced per minute at $0.034 rather than per token.
Released 2026-05-07 · Researched 5d ago · Context window: no data
GPT Realtime Whisper is OpenAI's streaming speech-to-text model, released May 7, 2026. It transcribes spoken audio live as a speaker talks rather than waiting for utterance completion, making it suitable for live captions, meeting notes, classroom transcripts, and real-time agent pipelines. The model is exposed through /v1/realtime/transcription_sessions and is priced per minute at $0.017 rather than per token.
Released 2026-05-07 · Researched 5d ago · Context window: no data
Maximum-comprehensiveness version of Google's Deep Research agent, built on Gemini 3.1 Pro and released April 21, 2026. Spends more compute than the standard preview to consult more sources, refine reports, and capture nuanced details. Designed for accuracy-critical long-form investigations synthesizing information from hundreds of sources. Supports MCP servers, File Search, and multi-step planning. Context: 1M tokens; max output: 65,536 tokens. Runs at Gemini 3.1 Pro rates ($2.00/$12.00 per MTok). API ID: deep-research-max-preview-04-2026.
Released 2026-04-21 · Researched 13d ago · Context window: 1M (1,000,000 tokens)
Google's agentic deep research model built on Gemini 3.1 Pro, released April 21, 2026. Designed for speed and efficiency in autonomous multi-step research: ingests text, images, PDFs, audio, and video to produce comprehensive cited reports from public web sources and private workspace data. Supports collaborative planning, visualization, MCP servers, and File Search. Context window: 1M tokens; max output: 65,536 tokens. Runs at Gemini 3.1 Pro rates ($2.00/$12.00 per MTok). API ID: deep-research-preview-04-2026.
Released 2026-04-21 · Researched 13d ago · Context window: 1M (1,000,000 tokens)
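Since both Deep Research variants bill at Gemini 3.1 Pro token rates, a per-call cost can be bounded from the quoted figures. A sketch assuming simple linear pricing at $2.00 input / $12.00 output per 1M tokens and the quoted 65,536-token output cap; actual agentic runs may consume additional tokens across planning and tool-use steps, so treat this as a floor per model call, not a full-job estimate:

```python
# Bound the cost of one Deep Research call from the quoted Gemini 3.1 Pro rates.
RATE_IN, RATE_OUT = 2.00, 12.00  # USD per 1M tokens
MAX_OUTPUT_TOKENS = 65_536       # per-call output cap from the listing

def report_cost(input_tokens: int, output_tokens: int = MAX_OUTPUT_TOKENS) -> float:
    """Estimated USD cost for one call; defaults to the maximum-length output."""
    output_tokens = min(output_tokens, MAX_OUTPUT_TOKENS)
    return (input_tokens * RATE_IN + output_tokens * RATE_OUT) / 1_000_000

# Filling the full 1M-token context and emitting a maximum-length report:
print(f"${report_cost(1_000_000):.2f}")  # → $2.79
```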
Qwen3.5-Omni Flash is Alibaba's lower-latency omnimodal API model, released March 30, 2026. It keeps the Qwen3.5-Omni text, image, audio, and video input surface while reducing cost and latency for short video analysis and high-throughput multimodal workloads. API model ID: qwen3.5-omni-flash.
Released 2026-03-30 · Researched 11d ago · Context window: 262K (262,144 tokens)
Qwen3.5-Omni Plus is Alibaba's flagship omnimodal API model, released March 30, 2026. It processes text, images, audio, and video simultaneously and returns text or speech responses. The Plus variant targets maximum quality for multimodal and real-time interaction workloads with API model ID qwen3.5-omni-plus.
Released 2026-03-30 · Researched 11d ago · Context window: 262K (262,144 tokens)
Batch speech-to-text transcription model with speaker diarization. Public Mistral pricing is $0.003 per minute.
Released 2026-02-04 · Researched 1d ago · Context window: 33K (32,768 tokens)
ERNIE 5.0 is Baidu's fifth-generation flagship foundation model, officially launched January 22, 2026 (preview at Baidu World November 13, 2025). It is a fully native multimodal model supporting text, image, audio, and video understanding and generation under a unified autoregressive framework, trained simultaneously across modalities from scratch. With 2.4 trillion total parameters and ultra-sparse MoE activation engaging <3% of parameters per inference, it delivers frontier performance at high efficiency. Available on Baidu AI Cloud's Qianfan MaaS platform. API model IDs: ernie-5.0; ernie-5.0-thinking-preview (thinking mode).
Released 2026-01-22 · Researched 6d ago · Context window: 128K (128,000 tokens)
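The ultra-sparse MoE figures above imply a small active-parameter budget per token. A quick check of that arithmetic from the quoted numbers (2.4T total parameters, <3% activated per inference):

```python
# Upper bound on ERNIE 5.0's active parameters per inference step,
# from the quoted 2.4T total parameters and <3% activation ratio.
TOTAL_PARAMS = 2.4e12
ACTIVATION_RATIO = 0.03  # "<3%" per the listing, so this is an upper bound

active_upper_bound = TOTAL_PARAMS * ACTIVATION_RATIO
print(f"active params < {active_upper_bound / 1e9:.0f}B")  # → active params < 72B
```

So despite the trillion-scale total, each token engages fewer than ~72B parameters, which is what makes the quoted efficiency claim plausible.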
Realtime model capable of text and audio inputs and outputs via the Realtime API.
Released 2025-10-06 · Researched 5d ago · Context window: 32K (32,000 tokens)
Cost-efficient realtime voice model for the Realtime API.
Released 2025-10-06 · Researched 5d ago · Context window: 32K (32,000 tokens)
Text-to-speech model with zero-shot voice cloning, multilingual output, and real-time streaming support.
Released 2026-03-23 · Researched 1d ago · Context window: no data
No tracked provider route