StepAudio 2.5 ASR
step-audio-2-5-asr
Last refreshed 2026-05-27. Next refresh: weekly.
StepAudio 2.5 ASR is worth evaluating for vision when its provider route and context window match the workload.
Decision context: Vision task fit, 1 tracked provider route, and research from 2026-05-27.
Use it for
- Teams evaluating vision
- Buyers comparing 1 tracked provider route
Do not use it for
- Strict JSON or tool-calling flows
Cheapest output: - (StepFun, per 1M tokens)
Provider routes: 1 (tracked API hosts)
Quality / dollar: Unknown (no task benchmark coverage yet)
Freshness: 2026-05-27 (researched today)
Top use-case fit: Vision (included by capability and metadata signals in the decision map)
Provider price ladder
| Provider | Input / 1M | Output / 1M | Route |
|---|---|---|---|
| StepFun | - | - | Serverless, Partial |
Benchmark peer bars for Vision
No task-mapped benchmark peers are available for this model yet.
Migration checks
No linked migration route is available for this model yet.
About
StepAudio 2.5 ASR is StepFun's automatic speech recognition model. The 4B-parameter model uses Multi-Token Prediction (MTP) to predict multiple tokens in parallel at each decoding step, which enables transcription of 5 minutes of audio in approximately 1 second. It achieves 400% higher throughput and 60% lower latency than prior StepFun ASR systems while maintaining state-of-the-art accuracy. It supports Chinese and English and accepts PCM, OGG, MP3, and WAV audio. It is available via the StepFun API (model: stepaudio-2.5-asr) and is part of the unified StepAudio 2.5 architecture described in arXiv:2605.23463.
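To put the figures above in more familiar terms, the following is a minimal Python sketch that converts them into multipliers and checks a file name against the listed audio formats. The helper names (`is_accepted_format`, `realtime_speedup`, `throughput_multiplier`, `latency_fraction`) and the mapping from format names to file extensions are illustrative assumptions, not part of the StepFun API, and the sketch makes no network calls.

```python
from pathlib import Path

# Formats named in the description (PCM, OGG, MP3, WAV).
# Mapping them to these file extensions is an assumption for illustration.
ACCEPTED_EXTENSIONS = {".pcm", ".ogg", ".mp3", ".wav"}


def is_accepted_format(filename: str) -> bool:
    """Return True if the file extension matches one of the listed formats."""
    return Path(filename).suffix.lower() in ACCEPTED_EXTENSIONS


def realtime_speedup(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are transcribed per second of processing."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def throughput_multiplier(percent_higher: float) -> float:
    """Convert 'N% higher throughput' into a multiplier over the baseline."""
    return 1 + percent_higher / 100


def latency_fraction(percent_lower: float) -> float:
    """Convert 'N% lower latency' into the remaining fraction of the baseline."""
    return 1 - percent_lower / 100


if __name__ == "__main__":
    # 5 minutes of audio in roughly 1 second, as quoted in the description.
    print(realtime_speedup(5 * 60, 1))
    print(throughput_multiplier(400))
    print(latency_fraction(60))
    print(is_accepted_format("meeting.WAV"), is_accepted_format("clip.flac"))
```

Read this way, 5 minutes in about 1 second is roughly 300 times faster than real time, 400% higher throughput is 5 times the prior throughput, and 60% lower latency leaves about 0.4 of the prior latency. These are restatements of the quoted claims, not independent measurements.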