StepAudio 2.5 TTS
step-audio-2-5-tts
Last refreshed 2026-05-27. Next refresh: weekly.
StepAudio 2.5 TTS is worth evaluating for vision when its provider route and context window match the workload.
Decision context: Vision task fit, 1 tracked provider route, and research from 2026-05-27.
Use it for
- Teams evaluating vision
- Buyers comparing 1 tracked provider route
Do not use it for
- Strict JSON or tool-calling flows
Cheapest output
-
StepFun per 1M tokens
Provider routes
1
Tracked API hosts
Quality / dollar
Unknown
No task benchmark coverage yet
Freshness
2026-05-27
Researched today
Top use-case fit
Vision
Included by capability and metadata signals in the decision map.
Provider price ladder
| Provider | Input / 1M | Output / 1M | Route |
|---|---|---|---|
| StepFun | - | - | ServerlessPartial |
Benchmark peer barsfor Vision
No task-mapped benchmark peers are available for this model yet.
Migration checks
No linked migration route is available for this model yet.
About
StepAudio 2.5 TTS is StepFun's contextual text-to-speech model with fine-grained expressive control. Unlike tag-based TTS systems, it accepts plain natural language instructions to control emotion, pacing, pauses, and delivery. Supports zero-shot voice cloning with full timbre and emotion control. Priced at $0.85 per 10,000 characters (input text). Supports Chinese and English. Available via StepFun API (model: step-audio-2.5-tts). Part of the unified StepAudio 2.5 architecture described in arXiv:2605.23463.