LLM Reference

StepAudio 2.5 TTS

step-audio-2-5-tts

Researched today

Last refreshed 2026-05-27. Next refresh: weekly.

Proprietary · Multimodal · Vision

StepAudio 2.5 TTS is worth evaluating for vision when its provider route and context window match the workload.

Decision context: Vision task fit, 1 tracked provider route, and research from 2026-05-27.

Use it for

  • Teams evaluating vision
  • Buyers comparing 1 tracked provider route

Do not use it for

  • Strict JSON or tool-calling flows

Cheapest output

-

StepFun per 1M tokens

Provider routes

1

Tracked API hosts

Quality / dollar

Unknown

No task benchmark coverage yet

Freshness

2026-05-27

Researched today

fresh

Top use-case fit

Vision

Included by capability and metadata signals in the decision map.

Provider price ladder

Provider | Input / 1M | Output / 1M | Route
StepFun | - | - | Serverless, Partial

Benchmark peer bars for Vision

No task-mapped benchmark peers are available for this model yet.

Migration checks

No linked migration route is available for this model yet.

About

StepAudio 2.5 TTS is StepFun's contextual text-to-speech model with fine-grained expressive control. Unlike tag-based TTS systems, it accepts plain natural-language instructions to control emotion, pacing, pauses, and delivery. It supports zero-shot voice cloning with full timbre and emotion control, and it handles Chinese and English. Pricing is $0.85 per 10,000 characters of input text. The model is available via the StepFun API (model: step-audio-2.5-tts) and is part of the unified StepAudio 2.5 architecture described in arXiv:2605.23463.
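Because the listed price is per character rather than per token, a rough budget can be worked out directly from text length. The sketch below is illustrative only: it assumes billing is prorated linearly from the $0.85 per 10,000 characters figure, that each Unicode code point (including each Chinese character) counts as one character, and that there is no minimum charge. None of these assumptions is confirmed by the listing, and the function name is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Price taken from the listing: $0.85 per 10,000 input characters.
PRICE_PER_UNIT_USD = Decimal("0.85")
CHARS_PER_UNIT = 10_000


def estimate_tts_cost_usd(text: str) -> Decimal:
    """Estimate synthesis cost for `text` in USD.

    Assumptions (not confirmed by the provider):
    - cost scales linearly with character count, with no minimum charge;
    - one Unicode code point counts as one character.
    """
    char_count = len(text)  # len() counts code points, not bytes
    cost = PRICE_PER_UNIT_USD * char_count / CHARS_PER_UNIT
    # Round to four decimal places for display purposes.
    return cost.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    script = "Hello, world. " * 500  # 7,000 characters of sample text
    print(len(script), "characters ->", estimate_tts_cost_usd(script), "USD")
```

Whether natural-language style instructions or voice-cloning reference audio count toward the billed characters is not stated in the listing, so treat the result as a lower-bound estimate until checked against the provider's pricing page.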

Capabilities

Multimodal · Audio

Rankings

Specifications

Released: 2026-04-16
Specialization: text-to-speech
License: Proprietary
Fine-tuning: 0

Created by

StepFun, one of China's leading AI 'Six Tigers'.

Shanghai, China
Founded 2023

Providers (1)