LLM Reference

Audio

Also known as: speech, voice, realtime audio

speech or audio I/O


105 matching active models · 22 tracked providers · 68 models with routes

model.audio

Definition

Audio capability covers model support for speech, realtime voice, transcription, translation, or audio input/output. In model selection it is a modality filter: use it when the workload needs sound, then verify the provider route and product surface.
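The filter-then-verify step in the definition can be sketched as follows. This is a minimal illustration rather than the site's implementation: `ModelRecord`, its fields, and `audio_candidates` are hypothetical names, and the text-only record is invented for contrast.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRecord:
    """Hypothetical shape of one catalog row."""
    name: str
    capabilities: set[str]
    routes: list[str] = field(default_factory=list)


def audio_candidates(models, require_route=True):
    """Keep models tagged "audio"; optionally drop those with no provider route."""
    return [
        m for m in models
        if "audio" in m.capabilities and (bool(m.routes) or not require_route)
    ]


catalog = [
    ModelRecord("GPT Audio", {"multimodal", "audio"}, ["OpenAI API"]),
    ModelRecord("Granite 4.0 1B Speech", {"audio"}),  # no tracked provider route
    ModelRecord("Text-only example", {"reasoning"}, ["Example route"]),  # invented row
]

print([m.name for m in audio_candidates(catalog)])
```

Passing `require_route=False` keeps models that are tagged Audio but have no tracked route, which is useful when open weights will be self-hosted.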

Models With Audio

Showing the first 80 matches, sorted by decision relevance, with tracked capability and provider-route evidence.

105 matches
Cosmos 3 Nano

Cosmos 3 Nano is NVIDIA's 16B-parameter omnimodel optimized for efficient inference on workstation-grade hardware (NVIDIA RTX PRO 6000). Architecture: dual-tower Mixture-of-Transformers with an 8B autoregressive Reasoner and an 8B diffusion-based Generator. The Reasoner supports up to 256K tokens of context for vision-language reasoning; the Generator produces video up to 720p at variable frame rates (default 189 frames). Natively handles text, image, video, audio (48kHz stereo), and robot action trajectories across 10+ robot embodiments including Franka Panda, UR, Google robot, and UMI. BF16 precision only. Available as open weights on Hugging Face and via the Cosmos 3 Reasoner NIM (NIM_MODEL_SIZE=nano). Intended for real-time robotics inference and edge-adjacent deployment. Robot action input/output is preserved in this description because the model schema does not have a dedicated action modality field.

2026-05-31

Researched 28d ago

256k

256,000 tokens

256k context · Reasoning · Vision · Multimodal · Audio
NVIDIA NIM

Pricing not tracked / 1M tokens

1 route

Provider docs
Cosmos 3 Super

Cosmos 3 Super is NVIDIA's flagship 64B-parameter omnimodel for physical AI, designed for large-scale synthetic data generation and high-fidelity simulation on NVIDIA Hopper and Blackwell datacenter GPUs. Architecture: dual-tower Mixture-of-Transformers with a 32B autoregressive Reasoner and a 32B diffusion-based Generator. Supports 256K token reasoning context, 720p video generation at variable frame rates, and 10+ robot embodiment action domains. Ranked #1 among open models on Physics-IQ, PAI-Bench, R-Bench, RoboLab, RoboArena, VANTAGE-Bench, TAR, and Artificial Analysis image/video leaderboards (Computex 2026). Training data: 1.3B data points across 393 datasets (2024-2026). Inference performance (vLLM-Omni): ~55s for 50-step video on 8xH200. Available as open weights on Hugging Face and via Cosmos 3 Reasoner NIM (NIM_MODEL_SIZE=super). Robot action input/output is preserved in this description because the model schema does not have a dedicated action modality field.

2026-05-31

Researched 28d ago

256k

256,000 tokens

256k context · Reasoning · Vision · Multimodal · Audio
NVIDIA NIM

Pricing not tracked / 1M tokens

1 route

Provider docs
Cosmos 3 Super Image2Video

Cosmos 3 Super Image2Video is a 64B-parameter fine-tuned variant of Cosmos 3 Super specialized for temporally coherent image-to-video generation. Takes a single image (jpg/png/webp at 256p-720p) plus an optional text prompt (up to 4096 tokens) and outputs MP4 video with 5-400 frames (default 189) at up to 720p, with optional muxed AAC stereo audio at 48kHz. Ranked #1 on Artificial Analysis image-to-video leaderboard (open models). Available via Hugging Face Diffusers and vLLM-Omni.

2026-05-31

Researched 28d ago

4k

4,000 tokens

Vision · Multimodal · Audio · Fine-tune

No tracked provider route

Nemotron 3 VoiceChat

Nemotron 3 VoiceChat is NVIDIA AI's Nemotron 3 model with multimodal text and image input. It was released 2026-03-16.

2026-03-16

Researched 41d ago

No window data

Vision · Multimodal · Audio

No tracked provider route

GPT-4o Audio Preview (12-17)

GPT-4o Audio Preview (12-17) is OpenAI's GPT-4o Audio model. It offers a 128K-token context window.

2024-12-17

Researched 41d ago

128k

128,000 tokens

128k context · Vision · Audio · Code exec

No tracked provider route

GPT-4o Audio Preview (10-01)

GPT-4o model with integrated audio I/O capabilities for multimodal interactions.

2024-10-01

Researched 179d ago

128k

128,000 tokens

128k context · Vision · Audio · Code exec

No tracked provider route

Cosmos 3 Edge

Cosmos 3 Edge is NVIDIA's announced preview size variant for real-time edge inference in the Cosmos 3 family. It is intended for lower-compute physical AI workloads than Cosmos 3 Nano. NVIDIA has not disclosed parameters, context window, license terms, public weights, or API access as of 2026-06-01; the released Cosmos 3 rows cover Nano, Super, Text2Image, Image2Video, and Nano Policy DROID.

-

Researched 28d ago

No window data

Vision · Multimodal · Audio

No tracked provider route

Gemini 3 Flash

Gemini 3 Flash is Google's speed-optimized Gemini 3 model, available in public preview via the Gemini API and Vertex AI. It supports text, image, audio, and video inputs with a 1M token context window and is priced at $0.50 per 1M input tokens and $3.00 per 1M output tokens.

2025-12-17

Researched 43d ago

1m

1,000,000 tokens

1m context · Vision · Multimodal · Audio · Tool use · Functions
GCP Vertex AI

$0.500 in / $3.00 out / 1M tokens

4 routes · 1 cache

Provider docs
Granite 4.0 1B Speech

IBM Granite 4.0 1B Speech is a multilingual ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) model. Supports English, French, German, Spanish, Portuguese, and Japanese with higher English ASR accuracy and faster inference via speculative decoding. Features keyword biasing for names and acronyms.

2026-03-06

Researched 179d ago

No window data

Audio

No tracked provider route

Gemini 3.5 Flash

Gemini 3.5 Flash is Google DeepMind's generally available Flash model for sustained frontier-level performance on agentic and coding tasks. It supports multimodal inputs, native thinking, tool and function calling, structured outputs, code execution, search grounding, batch processing, and long contexts up to 1M tokens.

2026-05-19

Researched 17d ago

1.05m

1,048,576 tokens

1.05m context · Reasoning · Vision · Multimodal · Audio · Tool use
GCP Vertex AI

$1.50 in / $9.00 out / 1M tokens

4 routes · 2 batch · 3 cache

Provider docs
GPT Audio

GPT Audio is OpenAI's model for audio input and output via the Chat Completions API. It replaces the deprecated gpt-4o-audio-preview-2024-12-17.

2024-10-01

Researched 50d ago

128k

128,000 tokens

128k context · Multimodal · Audio
OpenAI API

$2.50 in / $10.00 out / 1M tokens

2 routes

Provider docs
GPT Audio Mini

GPT Audio Mini is OpenAI's GPT Audio model with multimodal text and image input. It offers a 128K-token context window.

2024-10-01

Researched 41d ago

128k

128,000 tokens

128k context · Multimodal · Audio
OpenAI API

$0.600 in / $2.40 out / 1M tokens

2 routes

Provider docs
GPT-4o Audio

GPT-4o Audio is OpenAI's audio model, available via OpenRouter. Pricing: $2.50/1M input tokens, $10.00/1M output tokens.

2024-10-01

Researched 71d ago

128k

128,000 tokens

128k context · Audio
OpenRouter

$2.50 in / $10.00 out / 1M tokens

1 route

Provider docs
Whisper

Whisper 1 is OpenAI's general-purpose speech recognition API model, based on Whisper large-v2, released March 2023. Supports multilingual transcription across 50+ languages, speech translation into English, and language identification. Priced at $0.006/min of audio (flat per-minute rate). Exposed via /v1/audio/transcriptions and /v1/audio/translations. For new applications, gpt-4o-transcribe offers better accuracy. API ID: whisper-1.

2023-03-01

Researched 22d ago

No window data

Multimodal · Audio
OpenAI API

Pricing not tracked / 1M tokens

1 route

Provider docs
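A call to the transcription endpoint named above might look like the sketch below. It assumes the official `openai` Python package, whose client exposes `audio.transcriptions.create`; the helper name `transcribe` and the file path in the usage comment are placeholders, not part of the source.

```python
def transcribe(client, audio_file, model="whisper-1"):
    """Send an open binary audio file for transcription and return the text.

    `client` is expected to be an `openai.OpenAI()` instance (assumption: the
    SDK is installed and OPENAI_API_KEY is set in the environment).
    """
    result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text


# Usage (not executed here; "meeting.mp3" is a placeholder path):
# from openai import OpenAI
# with open("meeting.mp3", "rb") as f:
#     print(transcribe(OpenAI(), f))
```

Taking the client as a parameter keeps the helper easy to exercise with a stub and lets the same function target other transcription model IDs by changing `model`.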
MAI-Transcribe-1.5

MAI-Transcribe-1.5 is Microsoft AI's second-generation speech-to-text transcription model. It supports 43 languages and domain-specific terminology recognition; Microsoft reports transcription five times faster than competing models while maintaining state-of-the-art accuracy. Streaming support was announced as coming soon at launch.

2026-06-02

Researched 27d ago

No window data

Multimodal · Audio
Microsoft Foundry

Pricing not tracked / 1M tokens

1 route

Provider docs
MAI-Voice-2

MAI-Voice-2 is Microsoft AI's second-generation text-to-speech and voice synthesis model. It supports natural speech generation across 15+ languages, voice adaptation from short audio samples, a broader emotional range, and built-in safeguards against misuse. Microsoft announced a Flash variant as coming soon, but that unreleased variant is intentionally excluded from this seed integration.

2026-06-02

Researched 27d ago

No window data

Audio
Microsoft Foundry

Pricing not tracked / 1M tokens

1 route

Provider docs
StepAudio 2.5 ASR

StepAudio 2.5 ASR is StepFun's automatic speech recognition model. At 4B parameters, it introduces Multi-Token Prediction (MTP) technology to predict multiple tokens in parallel per decoding step, enabling transcription of 5 minutes of audio in approximately 1 second. Achieves 400% higher throughput and 60% lower latency compared to prior StepFun ASR systems while maintaining state-of-the-art accuracy. Supports Chinese and English; accepts PCM, OGG, MP3, and WAV formats. Available via the StepFun API (model: stepaudio-2.5-asr). Part of the unified StepAudio 2.5 architecture described in arXiv:2605.23463.

2026-05-22

Researched 33d ago

No window data

Multimodal · Audio
StepFun

Pricing not tracked / 1M tokens

1 route

Provider docs
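The "5 minutes of audio in approximately 1 second" figure can be restated as a speed-up over real time. The helper below is only arithmetic on the numbers quoted in the description; `realtime_factor` is a hypothetical name.

```python
def realtime_factor(audio_seconds, processing_seconds):
    """Return how many seconds of audio are processed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# 5 minutes of audio transcribed in roughly 1 second
print(realtime_factor(5 * 60, 1))  # 300.0
```

The same ratio makes throughput claims stated in other units (for example, minutes of audio per minute) directly comparable.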
TTS-1

TTS-1 is OpenAI's text-to-speech model optimized for real-time, low-latency applications, released at DevDay November 2023. Accepts text input up to 4,096 characters and outputs audio in MP3, OPUS, AAC, FLAC, WAV, or PCM. Offers 6 preset voices (alloy, echo, fable, onyx, nova, shimmer). Lower output quality than TTS-1 HD but faster. Priced at $15.00/1M characters. API ID: tts-1.

2023-11-06

Researched 22d ago

No window data

Audio
OpenAI API

$15.00 in / - out / 1M characters

1 route

Provider docs
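Because TTS-1 accepts at most 4,096 characters per request, longer text has to be split before synthesis. The sketch below is one simple way to do that, greedily packing whole words into chunks; `split_for_tts` is a hypothetical helper, and splitting on whitespace (rather than sentence boundaries) is an assumption made for brevity.

```python
def split_for_tts(text: str, max_chars: int = 4096) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_chars."""
    chunks, current = [], ""
    for word in text.split():
        if len(word) > max_chars:
            raise ValueError("a single word exceeds max_chars")
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The current chunk is full: emit it and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


print(split_for_tts("one two three", max_chars=7))  # ['one two', 'three']
```

Splitting at sentence boundaries would usually give more natural prosody at chunk joins; the word-level version is shown only because it is short and easy to verify.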
Eleven Multilingual v2

Eleven Multilingual v2 is ElevenLabs' stable production TTS model, optimized for long-form content in 29 languages with emotionally rich synthesis. Maximum input: 10,000 characters per request. The default high-quality model prior to Eleven v3 GA. Priced at $0.10/1K characters ($100/1M chars). API ID: eleven_multilingual_v2.

2023-01-01

Researched 22d ago

No window data

Audio
ElevenLabs API

$100.00 in / - out / 1M characters

1 route

Provider docs
TTS-1 HD

TTS-1 HD is OpenAI's high-quality text-to-speech model, optimized for audio quality over speed, released at DevDay November 2023. Uses the same 6 preset voices as TTS-1 but produces noticeably better audio fidelity at twice the per-character cost. Accepts text up to 4,096 characters. Priced at $30.00/1M characters. API ID: tts-1-hd.

2023-11-06

Researched 22d ago

No window data

Audio
OpenAI API

$30.00 in / - out / 1M characters

1 route

Provider docs
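The per-character rates quoted in the TTS-1, TTS-1 HD, and Eleven Multilingual v2 descriptions can be compared for a given job with a few lines of arithmetic. This is a sketch using only those listed rates; the table and function names are hypothetical, and real invoices may differ (plans, minimums, rounding).

```python
# USD per 1M input characters, as quoted in the descriptions above.
USD_PER_MILLION_CHARS = {
    "tts-1": 15.00,
    "tts-1-hd": 30.00,
    "eleven_multilingual_v2": 100.00,
}


def tts_cost_usd(model_id: str, n_chars: int) -> float:
    """Estimated synthesis cost for n_chars characters of input text."""
    return USD_PER_MILLION_CHARS[model_id] * n_chars / 1_000_000


for model_id in USD_PER_MILLION_CHARS:
    print(model_id, round(tts_cost_usd(model_id, 250_000), 2))
```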
GPT-4o Mini TTS

GPT-4o Mini TTS is OpenAI's instructable text-to-speech model built on GPT-4o mini, released March 20, 2025. Supports natural language instructions to control tone, style, pacing, and emotion (instructable TTS) — the model follows conversational prompts to adjust delivery rather than relying on static voice presets. OpenAI's recommended cost-efficient TTS for production. Input: text tokens + instructions at $0.60/1M tokens. Output: audio tokens at $12.00/1M tokens. Accepts up to 2,000 input tokens. API ID: gpt-4o-mini-tts.

2025-03-20

Researched 22d ago

2k

2,000 tokens

Multimodal · Audio
OpenAI API

$0.600 in / $12.00 out / 1M tokens

1 route

Provider docs
StepAudio 2.5 Realtime

StepAudio 2.5 Realtime is StepFun's end-to-end real-time conversational voice model. It handles speech input and produces speech output through a single unified architecture with no intermediate ASR/TTS pipeline steps. Key capabilities include persona-consistent roleplay via dedicated RLHF training on million-scale persona data, paralinguistic comprehension (detecting and responding to tone, emotion, and speaking rate), and low-latency dialogue. Supports Chinese and English. Available via WebSocket API (step-2.5-realtime). Analogous in function to OpenAI's GPT Realtime models.

2026-05-24

Researched 33d ago

No window data

Multimodal · Audio
StepFun

Pricing not tracked / 1M tokens

1 route

Provider docs
Scribe v2 Realtime

Scribe v2 Realtime is ElevenLabs' streaming speech-to-text model for voice agents, released November 2025. Delivers ~150ms latency with WebSocket streaming across 90+ languages. Supports Voice Activity Detection (VAD), manual commit control, and the same language coverage as Scribe v2. Priced at $0.39/hr of audio. API ID: scribe_v2_realtime.

2025-11-01

Researched 22d ago

No window data

Audio
ElevenLabs API

Pricing not tracked / 1M tokens

1 route

Provider docs
GPT-4o Mini Transcribe

GPT-4o Mini Transcribe is OpenAI's cost-efficient speech-to-text model based on GPT-4o mini, released March 20, 2025. Offers substantially better accuracy than Whisper at roughly half the price of gpt-4o-transcribe. Supports batch, Realtime transcription, and Assistants endpoints. Input: $1.25/1M audio tokens. Output: $5.00/1M text tokens. Practical: ~$0.003/min. API ID: gpt-4o-mini-transcribe.

2025-03-20

Researched 22d ago

16k

16,000 tokens

Multimodal · Audio · Batch
OpenAI API

$1.25 in / $5.00 out / 1M tokens

1 route

Provider docs
Eleven Flash v2.5

Eleven Flash v2.5 is ElevenLabs' ultra-low-latency TTS model for real-time applications and voice agents, released December 18, 2024. Achieves ~75ms latency with 32 languages (adds Hungarian, Norwegian, Vietnamese over Flash v2). Maximum input: 40,000 characters per request. Priced at $0.05/1K characters ($50/1M chars) — 50% cheaper than Multilingual v2. API ID: eleven_flash_v2_5.

2024-12-18

Researched 22d ago

No window data

Audio
ElevenLabs API

$50.00 in / - out / 1M characters

1 route

Provider docs
Deepgram Nova-2

Nova-2 is Deepgram's previous-generation flagship speech-to-text model, released September 2023. Delivers ~36% WER improvement over Whisper Large across tested domains (8.4% median WER), with improved entity recognition, punctuation, and capitalization. Supports 36+ languages and 10 domain-specific variants (general, meeting, phonecall, voicemail, finance, conversationalai, video, medical, drivethru, automotive). Batch: $0.0043/min; streaming: $0.0077/min. API ID: nova-2.

2023-09-19

Researched 22d ago

No window data

Multimodal · Audio
Deepgram API

Pricing not tracked / 1M tokens

1 route

Provider docs
GPT-4o Transcribe Diarize

GPT-4o Transcribe Diarize is OpenAI's automatic speech recognition model with integrated speaker diarization, released October 18, 2025. Identifies and labels who is speaking at each moment in multi-speaker audio, producing a diarized_json response with speaker labels and segment timestamps. Optionally accepts 2–10 second reference audio clips for up to 4 known speakers. Requires chunking for audio >30 seconds. Same pricing as gpt-4o-transcribe: $2.50/1M audio tokens in, $10.00/1M text tokens out. API ID: gpt-4o-transcribe-diarize.

2025-10-18

Researched 22d ago

16k

16,000 tokens

Multimodal · Audio
OpenAI API

$2.50 in / $10.00 out / 1M tokens

1 route

Provider docs
StepAudio 2.5 TTS

StepAudio 2.5 TTS is StepFun's contextual text-to-speech model with fine-grained expressive control. Unlike tag-based TTS systems, it accepts plain natural language instructions to control emotion, pacing, pauses, and delivery. Supports zero-shot voice cloning with full timbre and emotion control. Priced at $0.85 per 10,000 characters (input text). Supports Chinese and English. Available via StepFun API (model: step-audio-2.5-tts). Part of the unified StepAudio 2.5 architecture described in arXiv:2605.23463.

2026-04-16

Researched 33d ago

No window data

Multimodal · Audio
StepFun

Pricing not tracked / 1M tokens

1 route

Provider docs
Eleven v3

Eleven v3 is ElevenLabs' most expressive text-to-speech model, generally available February 2, 2026 (API preview from June 8, 2025). Supports dramatic delivery, emotional nuance, multi-speaker dialogue, and 70+ languages. Maximum input: 5,000 characters per request. Preferred over Multilingual v2 with 72% preference rate in testing. Priced at $0.10/1K characters ($100/1M chars). API ID: eleven_v3.

2026-02-02

Researched 22d ago

No window data

Audio
ElevenLabs API

$100.00 in / - out / 1M characters

1 route

Provider docs
Scribe v2

Scribe v2 is ElevenLabs' current state-of-the-art batch speech-to-text model, released January 12, 2026. Improves on v1 for long-form audio, extended silences, and tone changes. Supports 90+ languages, word-level timestamps, 32-speaker diarization, 56 entity types, and keyterm prompting (up to 1,000 terms). Base pricing: $0.22/hr; entity detection add-on: $0.07/hr; keyterm prompting add-on: $0.05/hr. API ID: scribe_v2.

2026-01-12

Researched 22d ago

No window data

Audio
ElevenLabs API

Pricing not tracked / 1M tokens

1 route

Provider docs
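The base rate and the two add-ons quoted for Scribe v2 combine additively per hour of audio. The sketch below assumes exactly that (add-ons billed per audio hour on top of the base rate), which the description implies but does not spell out; the constant and function names are hypothetical.

```python
BASE_USD_PER_HOUR = 0.22
ENTITY_DETECTION_USD_PER_HOUR = 0.07
KEYTERM_PROMPTING_USD_PER_HOUR = 0.05


def scribe_v2_cost_usd(audio_hours, entity_detection=False, keyterm_prompting=False):
    """Estimated cost: base hourly rate plus any enabled add-on rates."""
    rate = BASE_USD_PER_HOUR
    if entity_detection:
        rate += ENTITY_DETECTION_USD_PER_HOUR
    if keyterm_prompting:
        rate += KEYTERM_PROMPTING_USD_PER_HOUR
    return audio_hours * rate


# 10 hours of audio with both add-ons enabled
print(round(scribe_v2_cost_usd(10, entity_detection=True, keyterm_prompting=True), 2))
```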
Qwen3 Omni 30B A3B

Qwen3 Omni 30B A3B is Alibaba's natively end-to-end omnimodal MoE model from the Qwen3 generation, capable of processing text, audio, images, and video while generating real-time streaming text and speech responses. Achieves SOTA on 22 of 36 audio/video benchmarks and open-source SOTA on 32 of 36. Available in Instruct and Thinking (reasoning) variants. Released September 22, 2025.

2025-09-22

Researched 38d ago

66k

65,536 tokens

Reasoning · Vision · Multimodal · Audio · Tool use · Functions
Novita AI

$0.250 in / $0.970 out / 1M tokens

1 route

Provider docs
Deepgram Aura-2

Aura-2 is Deepgram's enterprise-grade text-to-speech model, released April 2025. Delivers sub-200ms time-to-first-byte (TTFB) with domain-specific pronunciation for drug names, legal terms, alphanumeric identifiers, dates, and currency. Outperformed ElevenLabs, Cartesia, OpenAI, Azure, Google, and PlayHT in enterprise voice preference testing. Offers 43+ English voices and multilingual support (Spanish, Dutch, French, German, Italian, Japanese). Priced at $0.030/1K characters ($30/1M chars). Available as cloud, VPC, and on-premises. API ID format: aura-2-[voice]-[lang] (e.g. aura-2-thalia-en).

2025-04-01

Researched 22d ago

No window data

Audio
Deepgram API

$30.00 in / - out / 1M characters

1 route

Provider docs
GPT-4o Transcribe

GPT-4o Transcribe is OpenAI's flagship speech-to-text model based on GPT-4o, released March 20, 2025. Delivers substantially better word error rates than Whisper — especially for accented speech, background noise, and variable speaking rates. Supports batch, streaming (Realtime API), and Assistants endpoints. Input: $2.50/1M audio tokens. Output: $10.00/1M text tokens. Practical: ~$0.006/min. API ID: gpt-4o-transcribe.

2025-03-20

Researched 22d ago

16k

16,000 tokens

Multimodal · Audio · Batch
OpenAI API

$2.50 in / $10.00 out / 1M tokens

1 route

Provider docs
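The speech-to-text entries above quote prices in different units (per minute for some, per hour for others). Converting everything to USD per audio hour makes them comparable. The sketch below uses only rates quoted in the descriptions above, including the approximate "practical" per-minute figures for the GPT-4o transcription models; the dictionary and function names are hypothetical.

```python
# Rates as quoted in the descriptions above.
PER_MINUTE_USD = {
    "whisper-1": 0.006,
    "gpt-4o-mini-transcribe": 0.003,  # approximate "practical" rate
    "gpt-4o-transcribe": 0.006,       # approximate "practical" rate
    "nova-2 (batch)": 0.0043,
}
PER_HOUR_USD = {
    "scribe_v2": 0.22,
    "scribe_v2_realtime": 0.39,
}


def hourly_rates():
    """Return all rates normalized to USD per audio hour, cheapest first."""
    rates = {name: per_min * 60 for name, per_min in PER_MINUTE_USD.items()}
    rates.update(PER_HOUR_USD)
    return dict(sorted(rates.items(), key=lambda item: item[1]))


for name, usd in hourly_rates().items():
    print(f"{name}: ${usd:.3f}/hr")
```

Price is only one axis: the entries also differ in accuracy claims, language coverage, and whether streaming is supported, so the cheapest row is not automatically the right choice.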
Deepgram Nova-3

Nova-3 is Deepgram's current-generation speech-to-text model, released February 12, 2025. Achieves 6.84% median WER streaming and 5.26% batch, a ~54% improvement over competitors. Supports real-time multilingual transcription across 31 languages with self-serve keyterm prompting (up to 100 terms) and real-time redaction (up to 50 entity types). Also includes nova-3-medical specialization. Streaming: $0.0048/min (monolingual), $0.0058/min (multilingual); pre-recorded: $0.0077/min (monolingual). API ID: nova-3.

2025-02-12

Researched 22d ago

No window data

Multimodal · Audio
Deepgram API

Pricing not tracked / 1M tokens

1 route

Provider docs
Gemma 4 12B IT

Instruction-tuned version of Gemma 4 12B. Open weight (Apache 2.0), 12B parameters, encoder-free multimodal (text, image, audio). Optimized for chat and instruction-following. Runs on a 16GB laptop.

2026-06-03

Researched 19d ago

256k

256,000 tokens

256k context · Reasoning · Vision · Multimodal · Audio · Tool use
Hugging Face Inference Endpoints

Pricing not tracked / 1M tokens

2 routes

Provider docs
Gemma 4 12B

Google DeepMind's 12B open-weight multimodal model (Apache 2.0), designed to run on a 16GB laptop. First medium-sized model with native audio ingestion alongside text and image. Unified encoder-free decoder-only architecture. Supports 140+ languages. MMLU Pro: 77.2%.

2026-06-03

Researched 18d ago

256k

256,000 tokens

256k context · Reasoning · Vision · Multimodal · Audio · Tool use
Hugging Face Inference Endpoints

Pricing not tracked / 1M tokens

2 routes

Provider docs
Deepgram Flux

Flux is Deepgram's conversational speech recognition (CSR) model, released October 15, 2025. Built specifically for voice agents: integrates turn detection natively into the ASR model, eliminating the need for a separate VAD/endpointing layer. Achieves end-of-turn detection within 1.5s (p95) and reduces interruptions ~30% and latency 200–600ms versus traditional pipelines. Emits voice-agent events: StartOfTurn, EndOfTurn, EagerEndOfTurn, TurnResumed. English: $0.0065/min; Multilingual (10 languages, GA April 2026): $0.0078/min. API ID: flux-general-en / flux-general-multi.

2025-10-15

Researched 22d ago

No window data

Multimodal · Audio
Deepgram API

Pricing not tracked / 1M tokens

1 route

Provider docs
Transcribe (03-2026)

Cohere's state-of-the-art automatic speech recognition (ASR) model. Transcribe is a 2B parameter Conformer-based encoder-decoder model trained from scratch for high-fidelity transcription across 14 languages: English, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Chinese (Mandarin), Japanese, Korean, Vietnamese, and Arabic. Can process 525 minutes of audio per minute. Achieves 5.42 WER on Hugging Face Open ASR leaderboard.

2026-03-01

Researched 179d ago

No window data

Multimodal · Audio

No tracked provider route

Lyria 3 Clip

Lyria 3 Clip is Google DeepMind's Lyria model focused on audio understanding and generation. It was released 2025-01-01.

2025-01-01

Researched 41d ago

No window data

Vision · Multimodal · Audio
OpenRouter

Free in / Free out / 1M tokens

3 routes

Provider docs
Lyria 3 Pro

Lyria 3 Pro is Google DeepMind's Lyria model focused on audio understanding and generation. It was released 2025-01-01.

2025-01-01

Researched 41d ago

No window data

Vision · Multimodal · Audio
OpenRouter

Free in / Free out / 1M tokens

3 routes

Provider docs
Higgs Audio v3 TTS

Higgs Audio v3 TTS is Boson AI's 4B-parameter text-to-speech model released June 4, 2026. It supports 102 languages (85 at production quality with WER/CER <5%), zero-shot voice cloning, and inline control tokens for emotion (21 types), style (singing/shouting/whispering), sound effects, and prosody. Audio output is 24kHz MP3 or PCM. Open weights available under a non-commercial license; hosted API is in free public preview.

2026-06-04

Researched 24d ago

8k

8,192 tokens

Audio
Boson AI API

Pricing not tracked / 1M tokens

1 route

Provider docs
Grok Imagine Video 1.5 Preview

Grok Imagine Video 1.5 Preview is xAI's preview image-to-video API model announced publicly on June 3, 2026. xAI's model-specific docs state it currently does not support text-to-video; it animates a still image from a natural-language motion prompt, outputs H.264 MP4 clips at 24 FPS with synchronized audio, supports 480p and 720p output, and allows 6-15 second clips across common aspect ratios. The canonical LLMReference route slug uses dashes, while the official API model ID is grok-imagine-video-1.5-preview with alias grok-imagine-video-1.5-2026-05-30. Pricing is $0.01 per image input plus $0.08/second for 480p video or $0.14/second for 720p video on the xAI API.

2026-06-03

Researched 22d ago

No window data

Vision · Multimodal · Audio
xAI Console

Pricing not tracked / 1M tokens

1 route

Provider docs
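The per-image and per-second prices quoted for Grok Imagine Video 1.5 Preview can be turned into a per-clip estimate. The sketch below uses only the figures in the description (one image input, 6-15 second clips, 480p or 720p); `clip_cost_usd` and the constants are hypothetical names, and the range check simply mirrors the stated clip limits.

```python
USD_PER_IMAGE = 0.01
USD_PER_SECOND = {"480p": 0.08, "720p": 0.14}


def clip_cost_usd(resolution: str, seconds: float, images: int = 1) -> float:
    """Estimated cost of one clip: image input fee plus per-second video fee."""
    if resolution not in USD_PER_SECOND:
        raise ValueError("resolution must be '480p' or '720p'")
    if not 6 <= seconds <= 15:
        raise ValueError("clip length must be between 6 and 15 seconds")
    return images * USD_PER_IMAGE + seconds * USD_PER_SECOND[resolution]


print(round(clip_cost_usd("720p", 10), 2))  # 1.41
```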
GPT Realtime 2

GPT Realtime 2 is OpenAI's second-generation real-time voice model, released May 7, 2026. It is a GPT-5-class speech-to-speech model for voice agents with five reasoning intensity levels, parallel tool calls, spoken preambles, and recovery behavior on failed tasks. The model supports audio and text interaction through the Realtime API with a 128K token context window. Audio token pricing is $32 per 1M input tokens, $0.40 per 1M cached input tokens, and $64 per 1M output tokens.

2026-05-07

Researched 44d ago

131k

131,072 tokens

131k context · Reasoning · Multimodal · Audio · Tool use · Functions
OpenAI API

$32.00 in / $64.00 out / 1M tokens

1 route · 1 cache

Provider docs
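The three audio-token rates quoted for GPT Realtime 2 can be combined into a session estimate. This sketch assumes cached input tokens are billed at the cached rate instead of the standard input rate, which is the usual reading but is not stated in the description; the function and constant names are hypothetical.

```python
# USD per 1M audio tokens, as quoted in the description above.
USD_PER_MILLION = {"input": 32.00, "cached_input": 0.40, "output": 64.00}


def realtime_audio_cost_usd(input_tokens, cached_input_tokens, output_tokens):
    """Estimate session cost; input_tokens counts uncached input only (assumption)."""
    total = (
        input_tokens * USD_PER_MILLION["input"]
        + cached_input_tokens * USD_PER_MILLION["cached_input"]
        + output_tokens * USD_PER_MILLION["output"]
    )
    return total / 1_000_000


# Example: 200k fresh input, 800k cached input, 100k output audio tokens
print(round(realtime_audio_cost_usd(200_000, 800_000, 100_000), 2))
```

With these rates, cached input costs 1/80 of fresh input, so long sessions that reuse a stable prefix are dominated by output and fresh-input tokens.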
GPT Realtime Translate

GPT Realtime Translate is OpenAI's live speech-to-speech translation model, released May 7, 2026. It translates spoken input from 70+ languages into 13 output languages in real time without requiring speakers to pause or complete full sentences. The model is exposed through the /v1/realtime/translations endpoint and is priced per minute at $0.034 rather than per token.

2026-05-07

Researched 44d ago

No window data

Multimodal · Audio
OpenAI API

Pricing not tracked / 1M tokens

1 route

Provider docs
GPT Realtime Whisper

GPT Realtime Whisper is OpenAI's streaming speech-to-text model, released May 7, 2026. It transcribes spoken audio live as a speaker talks rather than waiting for utterance completion, making it suitable for live captions, meeting notes, classroom transcripts, and real-time agent pipelines. The model is exposed through /v1/realtime/transcription_sessions and is priced per minute at $0.017 rather than per token.

2026-05-07

Researched 44d ago

No window data

Audio
OpenAI API

Pricing not tracked / 1M tokens

1 route

Provider docs
Nemotron 3 Nano Omni

NVIDIA Nemotron 3 Nano Omni is an open-weight 30B hybrid MoE multimodal model (3B active parameters) that natively accepts text, image, video, and audio inputs in a single inference loop. Built on a hybrid Mamba-Transformer architecture with 23 Mamba-2 layers, 23 MoE layers (128 experts, 6+1 active), and 6 GQA layers, plus Conv3D video layers and Efficient Video Sampling (EVS). Delivers up to 9x higher throughput than comparable omni models. Supports a 256K context window and a 16,384 reasoning budget. Open weights, datasets, and training recipes released under a permissive license.

2026-04-28

Researched 46d ago

262k

262,144 tokens

262k context · Multimodal · Audio
OpenRouter

Free in / Free out / 1M tokens

1 route

Provider docs
Xiaomi MiMo-V2.5-TTS-Series

Xiaomi's MiMo home page lists Xiaomi MiMo-V2.5-TTS-Series as part of the newly released V2.5 series, with Token Plan availability, high-quality TTS voices, style-instruction following, voice design, and voice cloning. Precise parameter count, context/window limits, and per-token or per-character pricing were not disclosed in the accessible official page, so those fields are left null rather than inferred.

2026-04-23

Researched 34d ago

No window data

Multimodal · Audio
Xiaomi

Pricing not tracked / 1M tokens

1 route

Provider docs
Gemini Deep Research Max Preview

Maximum-comprehensiveness version of Google's Deep Research agent, built on Gemini 3.1 Pro and released April 21, 2026. Spends more compute than the standard preview to consult more sources, refine reports, and capture nuanced details. Designed for accuracy-critical long-form investigations synthesizing information from hundreds of sources. Supports MCP servers, File Search, and multi-step planning. Context: 1M tokens; max output: 65,536 tokens. Runs at Gemini 3.1 Pro rates ($2.00/$12.00 per MTok). API ID: deep-research-max-preview-04-2026.

2026-04-21

Researched 58d ago

1m

1,000,000 tokens

1m context · Vision · Multimodal · Audio · Tool use · Functions
Google AI Studio

$2.00 in / $12.00 out / 1M tokens

1 route

Provider docs
Gemini Deep Research Preview

Google's agentic deep research model built on Gemini 3.1 Pro, released April 21, 2026. Designed for speed and efficiency in autonomous multi-step research: ingests text, images, PDFs, audio, and video to produce comprehensive cited reports from public web sources and private workspace data. Supports collaborative planning, visualization, MCP servers, and File Search. Context window: 1M tokens; max output: 65,536 tokens. Runs at Gemini 3.1 Pro rates ($2.00/$12.00 per MTok). API ID: deep-research-preview-04-2026.

2026-04-21

Researched 58d ago

1m

1,000,000 tokens

1m context · Vision · Multimodal · Audio · Tool use · Functions
Google AI Studio

$2.00 in / $12.00 out / 1M tokens

1 route

Provider docs
Gemini 3.1 Flash TTS Preview

Google's cost-efficient expressive text-to-speech model, released April 15, 2026. Supports 70+ languages, native multi-speaker dialogue, and 200+ audio style tags for precise delivery control. SynthID watermarking included. Achieves Elo 1,211 on the Artificial Analysis TTS leaderboard. Priced at $1.00/M input tokens and $20.00/M audio output tokens. API ID: gemini-3.1-flash-tts-preview.

2026-04-15

Researched 65d ago

16k

16,000 tokens

Audio
Google AI Studio

$1.00 in / $20.00 out / 1M tokens

1 route

Provider docs
MOSS-Audio 4B Instruct

MOSS-Audio 4B Instruct is the instruction-following 4.6B variant of MOSI AI and OpenMOSS Team's open-weight audio understanding model. It combines a MOSS-Audio encoder with a Qwen3-4B language backbone for speech, environmental sound, music, captioning, time-aware question answering, timestamped ASR, and audio-grounded reasoning.

2026-04-13

Researched 25d ago

No window data

Multimodal · Audio
Hugging Face Inference Endpoints

Pricing not tracked / 1M tokens

1 route

Provider docs
MOSS-Audio 4B Thinking

MOSS-Audio 4B Thinking is the reasoning-tuned 4.6B variant of MOSI AI and OpenMOSS Team's open-weight audio understanding model. It uses the MOSS-Audio encoder and Qwen3-4B backbone, adding chain-of-thought-oriented post-training for stronger complex audio reasoning while retaining speech, sound, music, timestamp, captioning, and QA coverage.

2026-04-13

Researched 25d ago

No window data

Reasoning · Multimodal · Audio
Hugging Face Inference Endpoints

Pricing not tracked / 1M tokens

1 route

Provider docs
MOSS-Audio 8B Instruct

MOSS-Audio 8B Instruct is the instruction-following 8.6B variant of MOSI AI and OpenMOSS Team's open-weight audio understanding model. It pairs the MOSS-Audio encoder with a Qwen3-8B language backbone and is positioned for stronger open-source speech, sound, music, audio captioning, ASR, timestamp, and QA workloads.

2026-04-13

Researched 25d ago

No window data

Multimodal · Audio
Hugging Face Inference Endpoints

Pricing not tracked / 1M tokens

1 route

Provider docs
MOSS-Audio 8B Thinking

MOSS-Audio 8B Thinking is the reasoning-tuned 8.6B variant of MOSI AI and OpenMOSS Team's open-weight audio understanding model. It uses the MOSS-Audio encoder and Qwen3-8B backbone, with Thinking post-training for complex audio reasoning over speech, environmental sound, music, timestamps, captions, and question answering.

2026-04-13

Researched 25d ago

No window data

Reasoning · Multimodal · Audio
Hugging Face Inference Endpoints

Pricing not tracked / 1M tokens

1 route

Provider docs
MAI-Transcribe-1

Microsoft AI speech-to-text model supporting the top 25 languages. Offers 2.5x faster batch transcription than Azure Fast and is optimized for real-world noisy environments.

2026-04-02

Researched 41d ago

No window data

Multimodal · Audio
Microsoft Foundry

$0.360 in / - out / 1M tokens

1 route

Provider docs
MAI-Voice-1

Microsoft AI voice generation with emotional nuance and speaker identity preservation. Generates 60 seconds of audio in 1 second. Supports custom voice creation from brief audio samples.

2026-04-02

Researched 41d ago

No window data

Multimodal · Audio
Microsoft Foundry

$22.00 in / - out / 1M tokens

1 route

Provider docs
Qwen3.5-Omni Flash

Qwen3.5-Omni Flash is Alibaba's lower-latency omnimodal API model, released March 30, 2026. It keeps the Qwen3.5-Omni text, image, audio, and video input surface while reducing cost and latency for short video analysis and high-throughput multimodal workloads. API model ID: qwen3.5-omni-flash.

2026-03-30

Researched 56d ago

262k

262,144 tokens

262k context · Vision · Multimodal · Audio · Tool use · Functions
Alibaba Cloud PAI-EAS

$0.100 in / $0.800 out / 1M tokens

1 route

Provider docs
Qwen3.5-Omni Plus

Qwen3.5-Omni Plus is Alibaba's flagship omnimodal API model, released March 30, 2026. It processes text, images, audio, and video simultaneously and returns text or speech responses. The Plus variant targets maximum quality for multimodal and real-time interaction workloads with API model ID qwen3.5-omni-plus.

2026-03-30

Researched 56d ago

262k

262,144 tokens

262k context · Reasoning · Vision · Multimodal · Audio · Tool use
Alibaba Cloud PAI-EAS

$0.400 in / $4.80 out / 1M tokens

1 route

Provider docs
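As a rough illustration of how the listed per-1M-token prices translate into a per-request cost, the sketch below applies the Qwen3.5-Omni Flash and Plus rates shown above to one hypothetical request. The function name `estimate_cost` and the token counts are assumptions made for illustration. How a provider counts audio or video input as tokens is not stated on this page, so treat the result as an estimate and confirm against the provider route.

```python
from decimal import Decimal


def estimate_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Estimate request cost in USD from per-1M-token prices.

    Prices are passed as strings so Decimal keeps them exact.
    """
    million = Decimal(1_000_000)
    input_cost = Decimal(input_tokens) * Decimal(in_price_per_m)
    output_cost = Decimal(output_tokens) * Decimal(out_price_per_m)
    return (input_cost + output_cost) / million


# Hypothetical request: 50,000 input tokens and 2,000 output tokens.
flash_cost = estimate_cost(50_000, 2_000, "0.100", "0.800")
plus_cost = estimate_cost(50_000, 2_000, "0.400", "4.80")

print(f"Qwen3.5-Omni Flash: ${flash_cost:.4f}")
print(f"Qwen3.5-Omni Plus:  ${plus_cost:.4f}")
```

For this hypothetical request the Plus rates come out at roughly four to five times the Flash rates, so the output-heavy share of a workload matters when choosing between them.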
MiMo-V2-Omni

MiMo-V2-Omni is Xiaomi's multimodal language model in the MiMo V2 series. The Omni variant adds multimodal (image) understanding. It is distinct from MiMo V2.5, which focuses on math reasoning.

2026-03-18

Researched 56d ago

262k

262,144 tokens

262k context · Vision · Multimodal · Audio
OpenRouter

$0.400 in / $2.00 out / 1M tokens

1 route

Provider docs
MOVA 360p

MOVA 360p is the lower-resolution open-weight MOVA checkpoint for synchronized video-audio generation. MOSI AI and the OpenMOSS Team describe MOVA as a 32B-parameter mixture-of-experts model with 18B active parameters during inference, designed for native image-to-video-audio and text-to-video-audio generation with synchronized audio, lip sync, and sound effects.

2026-01-29

Researched 25d ago

No window data

Vision · Multimodal · Audio
Hugging Face Inference Endpoints

Pricing not tracked / 1M tokens

1 route

Provider docs
MOVA 720p

MOVA 720p is the higher-resolution open-weight MOVA checkpoint for synchronized video-audio generation. MOSI AI and the OpenMOSS Team describe MOVA as a 32B-parameter mixture-of-experts model with 18B active parameters during inference, designed for native image-to-video-audio and text-to-video-audio generation with synchronized audio, lip sync, and sound effects.

2026-01-29

Researched 25d ago

No window data

Vision · Multimodal · Audio
Hugging Face Inference Endpoints

Pricing not tracked / 1M tokens

1 route

Provider docs
ERNIE 5.0

ERNIE 5.0 is Baidu's fifth-generation flagship foundation model, officially launched January 22, 2026 (preview at Baidu World November 13, 2025). It is a fully native multimodal model supporting text, image, audio, and video understanding and generation under a unified autoregressive framework, trained simultaneously across modalities from scratch. With 2.4 trillion total parameters and ultra-sparse MoE activation engaging <3% of parameters per inference, it delivers frontier performance at high efficiency. Available on Baidu AI Cloud's Qianfan MaaS platform. API model IDs: ernie-5.0; ernie-5.0-thinking-preview (thinking mode).

2026-01-22

Researched 5d ago

128k

128,000 tokens

128k context · Reasoning · Vision · Multimodal · Audio · Tool use
Baidu Qianfan

$0.890 in / $3.54 out / 1M tokens

1 route

Provider docs
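The ERNIE 5.0 description gives 2.4 trillion total parameters with under 3% engaged per inference. A minimal sketch of the upper bound this implies on active parameters follows. The variable names are illustrative, and the exact active count is not stated in the description, only the "<3%" ceiling.

```python
# Figures taken from the ERNIE 5.0 description above.
TOTAL_PARAMETERS = 2_400_000_000_000  # 2.4 trillion
MAX_ACTIVE_PERCENT = 3                # "<3%" is a ceiling, not an exact value

# Integer arithmetic avoids floating-point rounding on large counts.
active_upper_bound = TOTAL_PARAMETERS * MAX_ACTIVE_PERCENT // 100

print(f"Active parameters per inference: fewer than {active_upper_bound / 1e9:.0f}B")
```

Under these figures, fewer than 72B parameters are engaged per inference.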
gpt-realtime

gpt-realtime is OpenAI's realtime model, capable of text and audio input and output via the Realtime API.

2025-10-06

Researched 50d ago

32k

32,000 tokens

Multimodal · Audio · Prompt cache
OpenAI API

$4.00 in / $16.00 out / 1M tokens

1 route · 1 cache

Provider docs
gpt-realtime-mini

gpt-realtime-mini is OpenAI's GPT Realtime model with multimodal text and image input. It offers a 32K-token context window.

2025-10-06

Researched 41d ago

32k

32,000 tokens

Multimodal · Audio · Prompt cache
OpenAI API

$0.600 in / $2.40 out / 1M tokens

1 route · 1 cache

Provider docs
Gemini 2.5 Flash Live API

Gemini 2.5 Flash Live API is Google DeepMind's Gemini 2.5 model with multimodal text and image input. It offers a 128K-token context window.

2025-04-01

Researched 41d ago

128k

128,000 tokens

128k context · Vision · Multimodal · Audio · Tool use · Functions
GCP Vertex AI

$0.500 in / - out / 1M tokens

1 route

Provider docs
Gemini 2.5 Flash TTS Preview

Gemini 2.5 Flash TTS Preview is Google DeepMind's Gemini 2.5 model focused on audio understanding and generation. It offers a 128K-token context window.

2025-04-01

Researched 41d ago

128k

128,000 tokens

128k context · Audio · Tool use · Functions · Batch
Google AI Studio

$0.500 in / - out / 1M tokens

1 route · 1 batch

Provider docs
Gemini 2.5 Pro TTS Preview

Gemini 2.5 Pro TTS Preview is Google DeepMind's Gemini 2.5 model focused on audio understanding and generation. It offers a 128K-token context window.

2025-04-01

Researched 41d ago

128k

128,000 tokens

128k context · Audio · Tool use · Functions · Batch
Google AI Studio

$1.00 in / - out / 1M tokens

1 route · 1 batch

Provider docs
Gemini 2.0 Flash Live API

Gemini 2.0 Flash Live API is Google DeepMind's Gemini 2.0 model with multimodal text and image input. It offers a 1M-token context window.

2025-03-01

Researched 41d ago

1m

1,000,000 tokens

1m context · Vision · Multimodal · Audio · Tool use · Functions
GCP Vertex AI

$0.500 in / - out / 1M tokens

1 route

Provider docs
Lyria RealTime

Lyria RealTime is Google DeepMind's Lyria model focused on audio understanding and generation. It was released on 2024-11-01.

2024-11-01

Researched 41d ago

No window data

Audio
Google AI Studio

Pricing not tracked / 1M tokens

1 route

Provider docs
MiniMax Speech 2.8

MiniMax Speech 2.8 is a speech model from MiniMax. It was released on 2024-10-01.

2024-10-01

Researched 41d ago

No window data

Audio
Runware

Pricing not tracked / 1M tokens

1 route

Provider docs
Lyria 2

Lyria 2 is Google's music generation model for creating original musical compositions.

2024-06-01

Researched 71d ago

No window data

Audio
GCP Vertex AI

Pricing not tracked / 1M tokens

1 route

Provider docs
MOSS-TTS-v1.5

MOSS-TTS-v1.5 is an open-weight multilingual text-to-speech model from MOSI AI and the OpenMOSS Team. The 8B-parameter MossTTSDelay model supports zero-shot voice cloning, long-form speech generation, explicit pause control with [pause X.Ys] markers, and language-tagged multilingual synthesis across 31 languages. Version 1.5 improves on MOSS-TTS v1.0 with stronger multilingual synthesis, more stable voice cloning, better long-reference short-text handling, and punctuation-driven prosody. The model weights are available on Hugging Face under Apache 2.0; no hosted token-priced API route is confirmed in the June 2026 research handoff.

2026-05-26

Researched 25d ago

No window data

Audio

No tracked provider route
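The MOSS-TTS-v1.5 description mentions explicit pause control with `[pause X.Ys]` markers. The sketch below shows one way input text might be prepared with such markers. It assumes that `X.Y` is a duration in seconds with one decimal place and that markers are separated from the surrounding text by spaces; neither detail is confirmed here, and the helper names `pause_marker` and `join_with_pauses` are hypothetical. Check the model card for the exact accepted syntax.

```python
def pause_marker(seconds: float) -> str:
    """Format a pause marker in the [pause X.Ys] shape (assumed: one decimal)."""
    if seconds < 0:
        raise ValueError("pause duration must be non-negative")
    return f"[pause {seconds:.1f}s]"


def join_with_pauses(segments: list[str], seconds: float) -> str:
    """Join text segments, placing the same pause marker between each pair."""
    separator = f" {pause_marker(seconds)} "
    return separator.join(segments)


text = join_with_pauses(["Hello.", "Welcome back."], 0.5)
print(text)  # Hello. [pause 0.5s] Welcome back.
```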

Granite Speech 4.1 2B

IBM Granite Speech 4.1 2B is a multilingual ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) model trained on 174,000 hours of audio. ASR: English, French, German, Spanish, Portuguese, Japanese. Translation: X→English (French, German, Spanish, Portuguese, Japanese) and English→X (French, German, Spanish, Italian, Japanese, Mandarin Chinese). Features: punctuation/truecasing, keyword biasing, dual-head CTC encoder. Architecture: 16 conformer blocks + 2-layer window Q-former + Granite 4.0 1B LLM base (128K context). Variants: granite-speech-4.1-2b-plus (adds speaker-attributed ASR, word timestamps), granite-speech-4.1-2b-nar (non-autoregressive, higher throughput). Apache 2.0.

2026-04-29

Researched 61d ago

128k

128,000 tokens

128k context · Multimodal · Audio

No tracked provider route

Granite Speech 4.1 2B NAR

IBM Granite Speech 4.1 2B NAR (Non-AutoRegressive) is a high-throughput speech recognition model that generates transcriptions in a single forward pass rather than token-by-token. Architecture: 440M CTC speech encoder (16-layer Conformer) + 160M Q-Former projector + 1B bidirectional LLM editor (LoRA-adapted Granite-4.0-1b-base). It achieves a ~1820x real-time factor on a single H100 GPU at batch size 128 and 1.29% WER on LibriSpeech Clean, and is optimized for latency-sensitive production deployments. Apache 2.0 license.

2026-04-29

Researched 60d ago

No window data

Multimodal · Audio

No tracked provider route
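To make the ~1820x real-time factor quoted for Granite Speech 4.1 2B NAR concrete, the sketch below converts it into wall-clock processing time. It assumes the figure means seconds of audio processed per second of compute, aggregated across the batch of 128 on one H100, so it describes throughput rather than the latency of a single request. The function name `processing_seconds` is illustrative.

```python
def processing_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock seconds needed to process audio at a given real-time speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# One hour of audio at the quoted ~1820x figure.
one_hour = processing_seconds(3600, 1820)
print(f"1 hour of audio: about {one_hour:.2f} s of compute")

# Audio hours processed per wall-clock hour equals the speedup itself.
print("Audio hours per wall-clock hour: about 1820")
```

Under that reading, an hour of audio takes roughly two seconds of GPU time when the batch is kept full.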

Granite Speech 4.1 2B Plus

IBM Granite Speech 4.1 2B Plus is an enhanced speech-to-text model supporting ASR (automatic speech recognition), speaker-attributed ASR with speaker labels, and word-level timestamps. It extends the base Granite Speech 4.1 2B with richer transcription features, including keyword list biasing (KWB) and incremental decoding. It achieves 5.71% average WER on the Hugging Face Open ASR Leaderboard and supports English, French, German, Spanish, and Portuguese. Apache 2.0 license.

2026-04-29

Researched 60d ago

No window data

Multimodal · Audio

No tracked provider route
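The Granite Speech cards quote word error rate (WER) figures. As a reminder of what that metric measures, here is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is not the leaderboard's scoring code; public leaderboards typically normalize text (casing, punctuation) before scoring, which this sketch does not do, and the function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.2%}")
```

A 5.71% WER therefore corresponds to a little under six word errors per hundred reference words, averaged over the evaluation sets.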

Grok Voice Think Fast 1.0

Grok Voice Think Fast 1.0 is xAI's flagship full-duplex voice agent model for complex, multi-step enterprise workflows. It ranks #1 on the τ-voice Bench leaderboard at 67.3%, outperforming GPT Realtime and Gemini Voice. The model supports 25+ languages with real-time background reasoning that does not increase response latency. It excels at structured data capture (names, addresses, phone numbers, account numbers) even with accents or speech disfluencies, and supports up to 28 concurrent tool integrations. It powers Starlink's customer support line, with a reported 20% sales conversion rate and 70% autonomous inquiry resolution. API ID: grok-voice-think-fast-1.0.

2026-04-23

Researched 64d ago

No window data

Reasoning · Multimodal · Audio · Tool use · Functions · JSON

No tracked provider route