What is Step used for?

Step is used for vision and multimodal work, reasoning, and agent workflows and tool use. The family description and listed model capabilities point to those workloads as the best fit.

How does Step compare to StepAudio 2.5?

Step by StepFun is strongest where you need vision and multimodal work, while StepAudio 2.5 by StepFun is the closest related family to check for voice. Step has 11 listed variants and reaches up to 256k context, so compare the specs and pricing tables before choosing a production model.

Which Step model should I use?

For the lowest listed input price, start with Step 3.5 Flash through OpenRouter at $0.1/1M input tokens. For the most capable/latest local choice, evaluate Step 3.7 Flash with 256k context and reasoning, tool use, function calling, structured outputs, and multimodal inputs.

Step Models by StepFun

StepFunProprietary

11 models2024–2026Up to 256k ctxFrom $0.1/1M input

Details

ResearcherStepFun

LicenseProprietary

Commercial useCommercial use: conditional

Models11

Released2024–2026

Max context256k

Capabilities

Vision3 of 11 models

Multimodal5 of 11 models

Reasoning2 of 11 models

Function Calling2 of 11 models

Tool Use1 of 11 models

Structured Outputs1 of 11 models

Links

Website HuggingFace

About

The Step family of large language and multimodal models from StepFun (阶跃星辰). The series spans proprietary API models and open-weight Flash releases, including Step 3.7 Flash, a 198B-parameter sparse MoE vision-language model with 256K context, Apache 2.0 weights, and selectable reasoning levels for agentic coding, tool use, image, and video workflows.

Current Variants

Use-when guidance is based on each model's tracked capabilities, context window, release date, and replacement status.

11 in view

Step 3.7 FlashCurrent

Use when the workload needs 256k context, reasoning, and tool use.

2026-05256k contextreasoningtool use

Step 3.5 FlashCurrent

Use when the workload needs 256k context and reasoning.

2026-01256k contextreasoning

StepFun Step-2Current

Use when the workload needs 128k context.

2025-10128k context

StepFun Step-1Current

Use when the workload needs 128k context.

2025-08128k context

Step-2Current

Use when the workload needs 256k context, function calling, and multimodal inputs.

2024-09256k contextfunction callingmultimodal inputs

Step-1V TurboCurrent

Use when the workload needs multimodal inputs.

2024-07multimodal inputs

Step-1.5VCurrent

Use when the workload needs 128k context and multimodal inputs.

2024-06128k contextmultimodal inputs

Step-1Current

Use when the workload needs 128k context.

2024-04128k context

Step-1VCurrent

Use when the workload needs multimodal inputs.

2024-03multimodal inputs

Step-InstructCurrent

Use when provider availability and model metadata match the workload.

2024-03

Step-MathCurrent

Use when provider availability and model metadata match the workload.

2024-03

Current Step variants with use-when guidance and lifecycle status
Model	Use when	Released	Signals	Status
Step 3.7 Flash	Use when the workload needs 256k context, reasoning, and tool use.	2026-05	256k contextreasoningtool use	Current
Step 3.5 Flash	Use when the workload needs 256k context and reasoning.	2026-01	256k contextreasoning	Current
StepFun Step-2	Use when the workload needs 128k context.	2025-10	128k context	Current
StepFun Step-1	Use when the workload needs 128k context.	2025-08	128k context	Current
Step-2	Use when the workload needs 256k context, function calling, and multimodal inputs.	2024-09	256k contextfunction callingmultimodal inputs	Current
Step-1V Turbo	Use when the workload needs multimodal inputs.	2024-07	multimodal inputs	Current
Step-1.5V	Use when the workload needs 128k context and multimodal inputs.	2024-06	128k contextmultimodal inputs	Current
Step-1	Use when the workload needs 128k context.	2024-04	128k context	Current
Step-1V	Use when the workload needs multimodal inputs.	2024-03	multimodal inputs	Current
Step-Instruct	Use when provider availability and model metadata match the workload.	2024-03	—	Current
Step-Math	Use when provider availability and model metadata match the workload.	2024-03	—	Current

Release Timeline

9 release groups

2026-05

1 current

Step 3.7 Flash

256k contextreasoningtool use

Current

2026-01

1 current

Step 3.5 Flash

256k contextreasoning

Current

2025-10

1 current

StepFun Step-2

128k context

Current

2025-08

1 current

StepFun Step-1

128k context

Current

2024-09

1 current

Step-2

256k contextfunction callingmultimodal inputs

Current

2024-07

1 current

Step-1V Turbo

multimodal inputs

Current

2024-06

1 current

Step-1.5V

128k contextmultimodal inputs

Current

2024-04

1 current

Step-1

128k context

Current

Specifications(11 models)

Step model specifications comparison
Model	Released	Context	Parameters	Vision	Multimodal	Reasoning	Fn Calling	Tool Use	Structured Outputs
Step 3.7 Flash	2026-05	256k	198B (11B active)	Yes	Yes	Yes	Yes	Yes	Yes
Step 3.5 Flash	2026-01	256k	196B (11B active)	No	No	Yes	No	No	No
StepFun Step-2	2025-10	128k	1T (MoE)*	No	No	No	No	No	No
StepFun Step-1	2025-08	128k	—	No	No	No	No	No	No
Step-2	2024-09	256k	1T (MoE)*	Yes	Yes	No	Yes	No	No
Step-1V Turbo	2024-07	—	—	No	Yes	No	No	No	No
Step-1.5V	2024-06	128k	—	Yes	Yes	No	No	No	No
Step-1	2024-04	128k	—	No	No	No	No	No	No
Step-1V	2024-03	—	—	No	Yes	No	No	No	No
Step-Instruct	2024-03	—	—	No	No	No	No	No	No
Step-Math	2024-03	—	—	No	No	No	No	No	No

Available From(3 providers)

NVIDIA NIM

OpenRouter

StepFun

Pricing

Step model pricing by provider
Model	Provider	Input / 1M	Output / 1M	Type
Step 3.5 Flash	OpenRouter	$0.1	$0.3	Serverless
Step 3.7 Flash	StepFun	$0.2	$1.15	Serverless
Step 3.7 Flash	OpenRouter	$0.2	$1.15	Serverless

Popular comparisons in this family

Comparisons

All comparisons →

Frequently Asked Questions

What is Step used for?: Step is used for vision and multimodal work, reasoning, and agent workflows and tool use. The family description and listed model capabilities point to those workloads as the best fit.
How does Step compare to StepAudio 2.5?: Step by StepFun is strongest where you need vision and multimodal work, while StepAudio 2.5 by StepFun is the closest related family to check for voice. Step has 11 listed variants and reaches up to 256k context, so compare the specs and pricing tables before choosing a production model.
Which Step model should I use?: For the lowest listed input price, start with Step 3.5 Flash through OpenRouter at $0.1/1M input tokens. For the most capable/latest local choice, evaluate Step 3.7 Flash with 256k context and reasoning, tool use, function calling, structured outputs, and multimodal inputs.

Models(11)

Step 3.7 Flash

2026-05256k198B (11B active)3 providers

MultimodalReasoning

Step 3.5 Flash

2026-01256k196B (11B active)1 provider

Reasoning

StepFun Step-2

2025-10128k1T (MoE)*

StepFun Step-1

2025-08128k

Step-2

2024-09256k1T (MoE)*1 provider

Multimodal

Step-1V Turbo

2024-07

Multimodal

Step-1.5V

2024-06128k1 provider

Multimodal

Step-1

2024-04128k1 provider

Step-1V

Step-Instruct

Step-Math

Step Models by StepFun

Details

Capabilities

Links

About

Current Variants

Release Timeline

Specifications(11 models)

Available From(3 providers)

Pricing

Popular comparisons in this family

Comparisons

Frequently Asked Questions

Related Model Families

Models(11)