StepAudio 2.5 Realtime
step-audio-2-5-realtime
Last refreshed 2026-05-27. Next refresh: weekly.
StepAudio 2.5 Realtime is worth evaluating for vision when its provider route and context window match the workload.
Decision context: Vision task fit, 1 tracked provider route, and research from 2026-05-27.
Use it for
- Teams evaluating vision
- Buyers comparing 1 tracked provider route
Do not use it for
- Strict JSON or tool-calling flows
Cheapest output
Not listed
StepFun per 1M tokens
Provider routes
1
Tracked API hosts
Quality / dollar
Unknown
No task benchmark coverage yet
Freshness
2026-05-27
Researched today
Top use-case fit
Vision
Included by capability and metadata signals in the decision map.
Provider price ladder
| Provider | Input / 1M | Output / 1M | Route |
|---|---|---|---|
| StepFun | - | - | Serverless, Partial |
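Because the ladder currently lists "-" for both prices, any cost estimate built on it has to treat the price as unknown rather than zero. A minimal sketch of that handling follows; the helper names `parse_price` and `estimate_cost` are hypothetical and exist only for illustration, and the non-StepFun price in the usage note is a made-up input, not a quoted rate.

```python
from typing import Optional


def parse_price(cell: str) -> Optional[float]:
    """Parse a price-per-1M-tokens cell; return None when the price is unlisted."""
    text = cell.strip()
    if text in ("", "-"):
        return None  # unlisted price: unknown, not free
    return float(text)


def estimate_cost(tokens: int, price_per_million: Optional[float]) -> Optional[float]:
    """Estimate the cost of `tokens` tokens; propagate None when the price is unknown."""
    if price_per_million is None:
        return None
    return tokens / 1_000_000 * price_per_million


# The one tracked row from the ladder: (provider, input price, output price).
provider, input_cell, output_cell = ("StepFun", "-", "-")
output_price = parse_price(output_cell)
cost = estimate_cost(250_000, output_price)
print(provider, "output cost:", "unknown" if cost is None else cost)
```

Returning `None` instead of `0.0` keeps an unlisted price from being mistaken for a free tier when routes are compared; with a hypothetical listed price such as `parse_price("0.50")`, `estimate_cost(2_000_000, 0.5)` evaluates to `1.0`.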
Benchmark peer bars for Vision
No task-mapped benchmark peers are available for this model yet.
Migration checks
No linked migration route is available for this model yet.
About
StepAudio 2.5 Realtime is StepFun's end-to-end real-time conversational voice model. It handles speech input and produces speech output through a single unified architecture, with no intermediate ASR/TTS pipeline steps. Key capabilities include persona-consistent roleplay via dedicated RLHF training on million-scale persona data, paralinguistic comprehension (detecting and responding to tone, emotion, and speaking rate), and low-latency dialogue. It supports Chinese and English and is available via a WebSocket API (step-2.5-realtime). It is analogous in function to OpenAI's GPT Realtime models.
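Speech-in, speech-out WebSocket APIs commonly stream microphone audio as a sequence of small frames rather than one upload. The sketch below shows only that client-side framing step, under loud assumptions: this page does not give StepFun's message schema, endpoint, or audio format, so the event name `audio.append`, the `audio` field, and the 16 kHz 16-bit mono PCM format are all placeholders, not the documented protocol. No network call is made.

```python
import base64
import json
from typing import Iterator

# Assumption: 16 kHz, 16-bit mono PCM, so 100 ms of audio is 16000 * 2 * 0.1 = 3200 bytes.
CHUNK_BYTES = 3200


def audio_frames(pcm: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[str]:
    """Yield JSON text frames, each carrying one base64-encoded slice of PCM audio."""
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "audio.append",  # hypothetical event name, not StepFun's schema
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


# Usage: frame 7000 bytes of silence; a real client would send each frame over the socket.
frames = list(audio_frames(b"\x00" * 7000))
print(len(frames), "frames ready to send")
```

Small fixed-size frames keep latency low because the server can start processing before the speaker has finished; the right frame size and encoding depend on the provider's actual specification.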