Step 3.7 Flash
Last refreshed 2026-05-29. Next refresh: weekly.
Step 3.7 Flash is worth evaluating for coding, RAG, and agent workloads when its provider route and context window match the workload.
Decision context: Coding task fit, 3 tracked provider routes, and research from 2026-05-29.
Use it for
- Teams evaluating models for coding, RAG, and agents
- Workloads that can use a 256k context window
- Buyers comparing 3 tracked provider routes
Do not use it for
- Workloads where another current model has stronger sourced task evidence
Cheapest output
$1.15
OpenRouter per 1M tokens
Provider routes
3
Tracked API hosts
Quality / dollar
Grade C
Ranked by benchmark score divided by cheapest output price
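The ratio behind this ranking can be sketched as follows. This is a minimal illustration, not the page's actual scoring code: the function name `quality_per_dollar` is hypothetical, and using the SWE-bench Pro score (56.3) as the coding benchmark is an assumption, since the page does not say which benchmark feeds the grade or how a ratio maps to a letter.

```python
def quality_per_dollar(benchmark_score: float, output_price_per_million: float) -> float:
    """Return benchmark score divided by output price (USD per 1M tokens)."""
    if output_price_per_million <= 0:
        raise ValueError("output price must be positive")
    return benchmark_score / output_price_per_million


# Assumption: SWE-bench Pro (56.3) stands in for the coding benchmark.
# $1.15 is the cheapest output price listed on this page.
ratio = quality_per_dollar(56.3, 1.15)
print(round(ratio, 2))  # 48.96
```

The ratio is only comparable across models when the same benchmark and the same price basis are used for each.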
Freshness
2026-05-29
Researched today
Top use-case fit
Coding
Q/$ grade C. 1 relevant benchmark in the decision map.
RAG
Included by capability and metadata signals in the decision map.
Agents
Included by capability and metadata signals in the decision map.
Provider price ladder
| Provider | Input / 1M | Output / 1M | Cache | Route |
|---|---|---|---|---|
| OpenRouter | $0.200 | $1.15 | - | Serverless |
| StepFun | $0.200 | $1.15 | read $0.040 | Serverless |
| NVIDIA NIM | - | - | - | Provisioned (partial) |
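To compare the priced routes for a concrete request, the per-1M rates in the table can be turned into a per-request estimate. The sketch below is illustrative only: the `Route` class and `request_cost` function are hypothetical names, and it assumes that cached input tokens are billed at the cache read rate instead of the input rate, and that a route with no listed cache price bills all input at the normal input rate. NVIDIA NIM is left out because the table lists no prices for it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    name: str
    input_per_m: float                        # USD per 1M input tokens
    output_per_m: float                       # USD per 1M output tokens
    cache_read_per_m: Optional[float] = None  # USD per 1M cached input tokens, if listed


# Prices copied from the provider table on this page.
ROUTES = [
    Route("OpenRouter", 0.200, 1.15),
    Route("StepFun", 0.200, 1.15, cache_read_per_m=0.040),
]


def request_cost(route: Route, input_tokens: int, output_tokens: int,
                 cached_input_tokens: int = 0) -> float:
    """Estimate the USD cost of one request on the given route."""
    if not 0 <= cached_input_tokens <= input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")
    # Assumption: without a listed cache price, cached tokens cost the input rate.
    cache_rate = route.cache_read_per_m
    if cache_rate is None:
        cache_rate = route.input_per_m
    uncached = input_tokens - cached_input_tokens
    total = (uncached * route.input_per_m
             + cached_input_tokens * cache_rate
             + output_tokens * route.output_per_m)
    return total / 1_000_000


# Example: 100,000 input tokens (80,000 of them cache hits) and 2,000 output tokens.
for route in ROUTES:
    print(f"{route.name}: ${request_cost(route, 100_000, 2_000, 80_000):.4f}")
```

Under these assumptions, the two serverless routes cost the same without caching, and the StepFun route becomes cheaper as the share of cached input grows.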
Benchmark peer bars for Coding
Migration checks
No linked migration route is available for this model yet.
About
Step 3.7 Flash is StepFun's open-weights multimodal Mixture-of-Experts model for agentic coding, tool use, long-context reasoning, image understanding, and video understanding. It combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder, activates about 11B parameters per token, supports a 256K-token context window, and exposes low, medium, and high reasoning levels for speed/depth tradeoffs. StepFun reports leading open-model results on ClawEval-1.1, SimpleVQA with Search, and SWE-bench Pro at launch. Weights are available on Hugging Face under Apache 2.0.
Step 3.7 Flash has a 256k-token context window.
Step 3.7 Flash input tokens cost $0.20 per 1M, and output tokens cost $1.15 per 1M.
Capabilities
Benchmark Scores (14)
| Benchmark | Score | Notes | Source |
|---|---|---|---|
| ClawEval-1.1 | 67.1 | 1st among open models at release | https://static.stepfun.com/blog/step-3.7-flash/ |
| SimpleVQA with Search Tool | 79.2 | 1st at release (GPT-5.5: 79.1) | https://static.stepfun.com/blog/step-3.7-flash/ |
| V* with Python | 95.3 | 2nd at release (Kimi K2.6: 96.9) | https://huggingface.co/stepfun-ai/Step-3.7-Flash |
| SWE-bench Pro | 56.3 | 2nd at release (Claude Opus 4.7: 64.3, GPT-5.5: 58.6) | https://static.stepfun.com/blog/step-3.7-flash/ |
| Terminal-Bench 2.1 | 59.5 | Comparison: Step 3.5 Flash 53.37%, DeepSeek V4 Flash 62.0%, Gemini 3.5 Flash 76.2%, GPT-5.5 82.7%, Claude Opus 4.7 69.4% | https://static.stepfun.com/blog/step-3.7-flash/ |
| Toolathlon | 49.5 | — | https://huggingface.co/stepfun-ai/Step-3.7-Flash |
| Humanity's Last Exam | 47.2 | — | https://huggingface.co/stepfun-ai/Step-3.7-Flash |
| GDPval-AA | 45.8 | — | https://huggingface.co/stepfun-ai/Step-3.7-Flash |
| WorldVQA | 58.1 | Comparison: Kimi K2.6 55.98% | https://static.stepfun.com/blog/step-3.7-flash/ |
| HR-Bench 4K | 89.1 | Comparison: Kimi K2.6 91.25% | https://static.stepfun.com/blog/step-3.7-flash/ |
| Android Daily | 61.9 | Comparison: Gemini 3 Flash 63.21% | https://static.stepfun.com/blog/step-3.7-flash/ |
| DeepSearchQA | 92.8 | — | https://static.stepfun.com/blog/step-3.7-flash/ |
| BrowseComp | 75.8 | — | https://static.stepfun.com/blog/step-3.7-flash/ |
| ResearchRubrics | 71.7 | — | https://static.stepfun.com/blog/step-3.7-flash/ |
API Versions
step-3.7-flash