StepAudio 2.5 Models by StepFun
About
StepAudio 2.5 is StepFun's unified audio-language foundation model family, introduced in May 2026 (arXiv:2605.23463). It covers three API-accessible capabilities: text-to-speech (TTS), automatic speech recognition (ASR), and real-time conversational voice (Realtime), all built on a shared decoder architecture. The family claims top scores across five voice AI benchmarks, surpassing GPT Realtime and Gemini Live on the tested dimensions. The models support Chinese and English.
Current Variants
Use-when guidance is derived from seed capabilities, context, release, and replacement fields.
| Model | Use when | Released | Signals | Status |
|---|---|---|---|---|
| StepAudio 2.5 Realtime | Use when the workload needs voice, multimodal inputs, and audio. | 2026-05 | voice, multimodal inputs, audio | Current |
| StepAudio 2.5 ASR | Use when the workload needs speech recognition, 4B parameters, and multimodal inputs. | 2026-05 | speech recognition, 4B parameters, multimodal inputs | Current |
| StepAudio 2.5 TTS | Use when the workload needs text to speech, multimodal inputs, and audio. | 2026-04 | text to speech, multimodal inputs, audio | Current |
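The table above can be read as a simple capability lookup. The following is a minimal sketch, assuming the signals listed for each variant are treated as plain tags. The `Variant` class and `pick_variants` helper are hypothetical names introduced for illustration and are not part of any StepFun API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One row of the variant table (hypothetical structure)."""
    name: str
    released: str          # release month as YYYY-MM, copied from the table
    signals: frozenset     # capability tags copied from the Signals column


# Data transcribed from the table above; nothing here queries a live service.
VARIANTS = [
    Variant("StepAudio 2.5 Realtime", "2026-05",
            frozenset({"voice", "multimodal inputs", "audio"})),
    Variant("StepAudio 2.5 ASR", "2026-05",
            frozenset({"speech recognition", "4B parameters", "multimodal inputs"})),
    Variant("StepAudio 2.5 TTS", "2026-04",
            frozenset({"text to speech", "multimodal inputs", "audio"})),
]


def pick_variants(required):
    """Return names of variants whose signals include every required tag."""
    needed = set(required)
    return [v.name for v in VARIANTS if needed <= v.signals]


if __name__ == "__main__":
    print(pick_variants(["speech recognition"]))
    print(pick_variants(["audio"]))
```

A tag set that no single variant covers, such as `["voice", "speech recognition"]`, yields an empty list, which suggests combining variants rather than picking one.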
Release Timeline
2 release groups
Specifications (3 models)
| Model | Released | Parameters | Multimodal |
|---|---|---|---|
| StepAudio 2.5 Realtime | 2026-05 | — | Yes |
| StepAudio 2.5 ASR | 2026-05 | 4B | Yes |
| StepAudio 2.5 TTS | 2026-04 | — | Yes |
Available From (1 provider)
Frequently Asked Questions
- What is StepAudio 2.5 used for?
- StepAudio 2.5 is used for voice, speech recognition, and text to speech. The family description and listed model capabilities point to those workloads as the best fit.
- How does StepAudio 2.5 compare to Step?
- StepAudio 2.5 by StepFun is strongest where you need voice, while Step by StepFun is the closest related family to check for vision and multimodal work. StepAudio 2.5 has 3 listed variants, while Step reaches up to 256K context, so compare the specs and pricing tables before choosing a production model.
- Which StepAudio 2.5 model should I use?
- If price is the main constraint, check the pricing table first, because the local data does not include complete provider pricing for StepAudio 2.5. For the most capable and most recent option in the local data, evaluate StepAudio 2.5 Realtime, which is listed with multimodal inputs.
