LLM Reference

Step Models by StepFun

StepFun · Proprietary
11 models · 2024–2026 · Up to 256k context · From $0.1/1M input tokens

Details

Researcher: StepFun
License: Proprietary
Commercial use: Permitted with conditions
Models: 11
Released: 2024–2026
Max context: 256k

Capabilities

Vision: 3 of 11 models
Multimodal: 5 of 11 models
Reasoning: 2 of 11 models
Function Calling: 2 of 11 models
Tool Use: 1 of 11 models
Structured Outputs: 1 of 11 models

About

Step is the family of large language and multimodal models from StepFun (阶跃星辰). The series spans proprietary API models and open-weight Flash releases. Among the latter is Step 3.7 Flash, a 198B-parameter sparse MoE vision-language model with 256K context, Apache 2.0 weights, and selectable reasoning levels for agentic coding, tool use, image, and video workflows.

Current Variants

Use-when guidance is derived from seed capabilities, context, release, and replacement fields.
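As a rough illustration of how guidance sentences like the ones in this section could be generated from a model's metadata, the sketch below builds the sentence from a context size and a list of capability labels. The function name `use_when` and its parameters are hypothetical, and the release and replacement fields are not modeled; it only reproduces the sentence patterns visible on this page and is not the site's actual logic.

```python
from typing import Optional, Sequence


def use_when(context_k: Optional[int], capabilities: Sequence[str]) -> str:
    """Build a "Use when ..." sentence from a context size (in k tokens)
    and capability labels. Hypothetical helper for illustration only."""
    needs = []
    if context_k:
        needs.append(f"{context_k}k context")
    needs.extend(capabilities)

    # Fallback wording used on this page when no metadata is listed.
    if not needs:
        return "Use when provider availability and model metadata match the workload."

    if len(needs) == 1:
        joined = needs[0]
    elif len(needs) == 2:
        joined = " and ".join(needs)
    else:
        # Three or more items: comma-separated with a final "and".
        joined = ", ".join(needs[:-1]) + ", and " + needs[-1]
    return f"Use when the workload needs {joined}."


print(use_when(256, ["reasoning", "tool use"]))
# Use when the workload needs 256k context, reasoning, and tool use.
print(use_when(None, []))
# Use when provider availability and model metadata match the workload.
```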


Step 3.7 Flash (Current)

Use when the workload needs 256k context, reasoning, and tool use.

2026-05 · 256k context · reasoning · tool use

Step 3.5 Flash (Current)

Use when the workload needs 256k context and reasoning.

2026-01 · 256k context · reasoning

StepFun Step-2 (Current)

Use when the workload needs 128k context.

2025-10 · 128k context

StepFun Step-1 (Current)

Use when the workload needs 128k context.

2025-08 · 128k context
Step-2 (Current)

Use when the workload needs 256k context, function calling, and multimodal inputs.

2024-09 · 256k context · function calling · multimodal inputs

Step-1V Turbo (Current)

Use when the workload needs multimodal inputs.

2024-07 · multimodal inputs
Step-1.5V (Current)

Use when the workload needs 128k context and multimodal inputs.

2024-06 · 128k context · multimodal inputs
Step-1 (Current)

Use when the workload needs 128k context.

2024-04 · 128k context
Step-1V (Current)

Use when the workload needs multimodal inputs.

2024-03 · multimodal inputs

Step-Instruct

Use when provider availability and model metadata match the workload.

2024-03
Step-Math (Current)

Use when provider availability and model metadata match the workload.

2024-03

Release Timeline

9 release groups
2026-05 (1 current)
Step 3.7 Flash · 256k context · reasoning · tool use · Current

2026-01 (1 current)
Step 3.5 Flash · 256k context · reasoning · Current

2025-10 (1 current)
StepFun Step-2 · 128k context · Current

2025-08 (1 current)
StepFun Step-1 · 128k context · Current

2024-09 (1 current)
Step-2 · 256k context · function calling · multimodal inputs · Current

2024-07 (1 current)
Step-1V Turbo · multimodal inputs · Current

2024-06 (1 current)
Step-1.5V · 128k context · multimodal inputs · Current

2024-04 (1 current)
Step-1 · 128k context · Current

Specifications (11 models)

Step model specifications comparison
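| Model | Released | Context | Parameters | Vision | Multimodal | Reasoning | Fn Calling | Tool Use | Structured Outputs |
|---|---|---|---|---|---|---|---|---|---|
| Step 3.7 Flash | 2026-05 | 256k | 198B (11B active) | Yes | Yes | Yes | Yes | Yes | Yes |
| Step 3.5 Flash | 2026-01 | 256k | 196B (11B active) | No | No | Yes | No | No | No |
| StepFun Step-2 | 2025-10 | 128k | 1T (MoE)* | No | No | No | No | No | No |
| StepFun Step-1 | 2025-08 | 128k | - | No | No | No | No | No | No |
| Step-2 | 2024-09 | 256k | 1T (MoE)* | Yes | Yes | No | Yes | No | No |
| Step-1V Turbo | 2024-07 | - | - | No | Yes | No | No | No | No |
| Step-1.5V | 2024-06 | 128k | - | Yes | Yes | No | No | No | No |
| Step-1 | 2024-04 | 128k | - | No | No | No | No | No | No |
| Step-1V | 2024-03 | - | - | No | Yes | No | No | No | No |
| Step-Instruct | 2024-03 | - | - | No | No | No | No | No | No |
| Step-Math | 2024-03 | - | - | No | No | No | No | No | No |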

Available From (3 providers)

Pricing

Step model pricing by provider
| Model | Provider | Input / 1M | Output / 1M | Type |
|---|---|---|---|---|
| Step 3.5 Flash | OpenRouter | $0.1 | $0.3 | Serverless |
| Step 3.7 Flash | StepFun | $0.2 | $1.15 | Serverless |
| Step 3.7 Flash | OpenRouter | $0.2 | $1.15 | Serverless |
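
To make the per-million-token prices above concrete, here is a minimal sketch that estimates the cost of a workload from its input and output token counts. The prices are copied from the pricing table; the function name `estimate_cost` and the example token counts are hypothetical, and real bills may differ (for example through caching, minimums, or price changes).

```python
from decimal import Decimal

# USD per 1M tokens as (input, output), copied from the pricing table above.
PRICES = {
    ("Step 3.5 Flash", "OpenRouter"): (Decimal("0.1"), Decimal("0.3")),
    ("Step 3.7 Flash", "StepFun"): (Decimal("0.2"), Decimal("1.15")),
    ("Step 3.7 Flash", "OpenRouter"): (Decimal("0.2"), Decimal("1.15")),
}

MILLION = Decimal(1_000_000)


def estimate_cost(model: str, provider: str,
                  input_tokens: int, output_tokens: int) -> Decimal:
    """Return the estimated cost in USD for the given token counts."""
    in_price, out_price = PRICES[(model, provider)]
    total = Decimal(input_tokens) * in_price + Decimal(output_tokens) * out_price
    return total / MILLION


# Hypothetical workload: 2M input tokens and 500k output tokens.
print(estimate_cost("Step 3.5 Flash", "OpenRouter", 2_000_000, 500_000))  # 0.35
print(estimate_cost("Step 3.7 Flash", "StepFun", 2_000_000, 500_000))     # 0.975
```

Decimal arithmetic is used here so that small per-token prices do not pick up binary floating-point rounding noise.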

Frequently Asked Questions

What is Step used for?
Step is used for vision and multimodal work, reasoning, and agentic workflows that rely on tool use. The family description and the listed model capabilities point to those workloads as the best fit.
How does Step compare to StepAudio 2.5?
Step is the StepFun family to consider for vision and multimodal work, while StepAudio 2.5, also from StepFun, is the closest related family to check for voice workloads. Step lists 11 variants with up to 256k context; compare the specifications and pricing tables before choosing a production model.
Which Step model should I use?
For the lowest listed input price, start with Step 3.5 Flash through OpenRouter at $0.1/1M input tokens. For the most capable and most recent model, which is also an open-weight release suited to local deployment, evaluate Step 3.7 Flash: it offers 256k context plus reasoning, tool use, function calling, structured outputs, and multimodal inputs.
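
The selection advice above can also be expressed as a simple filter over the specifications table. The following is a minimal sketch, not part of any StepFun or provider API: the `StepModel` record, the capability labels, and the `shortlist` function are hypothetical names, and the rows are transcribed from the table on this page (a context of `None` marks rows where no context size is listed).

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class StepModel:
    name: str
    released: str                 # YYYY-MM, as listed
    context_k: Optional[int]      # None where the table lists no context size
    capabilities: FrozenSet[str]


def _m(name, released, context_k, *caps):
    return StepModel(name, released, context_k, frozenset(caps))


# Rows transcribed from the specifications table, newest first.
MODELS = [
    _m("Step 3.7 Flash", "2026-05", 256, "vision", "multimodal", "reasoning",
       "function_calling", "tool_use", "structured_outputs"),
    _m("Step 3.5 Flash", "2026-01", 256, "reasoning"),
    _m("StepFun Step-2", "2025-10", 128),
    _m("StepFun Step-1", "2025-08", 128),
    _m("Step-2", "2024-09", 256, "vision", "multimodal", "function_calling"),
    _m("Step-1V Turbo", "2024-07", None, "multimodal"),
    _m("Step-1.5V", "2024-06", 128, "vision", "multimodal"),
    _m("Step-1", "2024-04", 128),
    _m("Step-1V", "2024-03", None, "multimodal"),
    _m("Step-Instruct", "2024-03", None),
    _m("Step-Math", "2024-03", None),
]


def shortlist(min_context_k: int = 0, required: Iterable[str] = ()) -> List[str]:
    """Names of models meeting the context and capability needs, newest first.

    Models with no listed context size are treated as 0k, so they only
    match when no minimum context is requested.
    """
    needed = frozenset(required)
    return [
        m.name
        for m in MODELS
        if (m.context_k or 0) >= min_context_k and needed <= m.capabilities
    ]


print(shortlist(256, ["reasoning"]))
# ['Step 3.7 Flash', 'Step 3.5 Flash']
print(shortlist(required=["function_calling", "multimodal"]))
# ['Step 3.7 Flash', 'Step-2']
```

A shortlist like this only narrows the candidates; price and provider availability from the pricing table still need to be checked separately.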