LLM Reference

StepAudio 2.5 Models by StepFun

StepFun | Proprietary
3 models | 2026

About

StepAudio 2.5 is StepFun's unified audio-language foundation model family, introduced in May 2026 (arXiv:2605.23463). It covers three API-accessible capabilities, all built on a shared decoder architecture: text-to-speech (TTS), automatic speech recognition (ASR), and real-time conversational voice (Realtime). StepFun claims top scores across five voice AI benchmarks, surpassing GPT Realtime and Gemini Live on the tested dimensions. The family supports Chinese and English.

Current Variants

Use-when guidance is derived from seed capabilities, context, release, and replacement fields.


StepAudio 2.5 Realtime
Use when the workload needs voice, multimodal inputs, and audio.
2026-05 | voice | multimodal inputs | audio

StepAudio 2.5 ASR
Use when the workload needs speech recognition with multimodal inputs; the model has 4B parameters.
2026-05 | speech recognition | 4B parameters | multimodal inputs

StepAudio 2.5 TTS
Use when the workload needs text to speech, multimodal inputs, and audio.
2026-04 | text to speech | multimodal inputs | audio

Release Timeline

2 release groups

2026-05 (2 current)
StepAudio 2.5 ASR: speech recognition, 4B parameters, multimodal inputs (Current)
StepAudio 2.5 Realtime: voice, multimodal inputs, audio (Current)

2026-04 (1 current)
StepAudio 2.5 TTS: text to speech, multimodal inputs, audio (Current)

Specifications (3 models)

StepAudio 2.5 model specifications comparison

Model                   Released  Parameters  Multimodal
StepAudio 2.5 Realtime  2026-05   not listed  Yes
StepAudio 2.5 ASR       2026-05   4B          Yes
StepAudio 2.5 TTS       2026-04   not listed  Yes

Available From (1 provider)

Frequently Asked Questions

What is StepAudio 2.5 used for?
StepAudio 2.5 is used for voice, speech recognition, and text to speech. The family description and listed model capabilities point to those workloads as the best fit.
How does StepAudio 2.5 compare to Step?
StepAudio 2.5 by StepFun is strongest where you need voice, while Step by StepFun is the closest related family to check for vision and multimodal work. StepAudio 2.5 lists 3 variants, and Step reaches up to 256K context. Compare the specs and pricing tables before choosing a production model.
Which StepAudio 2.5 model should I use?
If price is the main constraint, check the pricing table first, because the local data does not include complete provider pricing for StepAudio 2.5. For the most capable and most recent model in the local data, evaluate StepAudio 2.5 Realtime, which accepts multimodal inputs.

Models (3)