UI-TARS Models by ByteDance
1 model, released 2026, up to 128k context
Details
Researcher: ByteDance
License: Apache 2.0 (OSI)
Commercial use: Allowed
Models: 1
Released: 2026
Max context: 128k
Capabilities
Vision: All models
Multimodal: All models
Tool Use: All models
About
UI-TARS is ByteDance's multimodal vision-language agent series optimized for GUI automation across desktop, web, and mobile environments.
Current Variants
Use-when guidance is derived from seed capabilities, context, release, and replacement fields.
| Model | Use when | Released | Signals | Status |
|---|---|---|---|---|
| UI-TARS 1.5 7B | Use when the workload needs agents, 128k context, and 7B parameters. | 2026-02 | agents, 128k context, 7B parameters | Current |
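The use-when sentence in the table above reads like text generated from model fields. The following is a minimal sketch of how such a sentence might be assembled, assuming a hypothetical `ModelRecord` shape and hypothetical `signals` and `use_when` helpers. The catalog's real schema and generation logic are not shown on this page, and the release and replacement fields it mentions are left out here.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRecord:
    # Hypothetical record shape; the catalog's actual fields are not shown here.
    name: str
    capabilities: list[str] = field(default_factory=list)
    context: str = ""
    parameters: str = ""


def signals(model: ModelRecord) -> list[str]:
    """Collect the short signal labels shown next to a model."""
    out = list(model.capabilities)
    if model.context:
        out.append(f"{model.context} context")
    if model.parameters:
        out.append(f"{model.parameters} parameters")
    return out


def use_when(model: ModelRecord) -> str:
    """Join the signals into a single use-when sentence."""
    parts = signals(model)
    if not parts:
        return "No use-when guidance available."
    if len(parts) == 1:
        joined = parts[0]
    elif len(parts) == 2:
        joined = " and ".join(parts)
    else:
        joined = ", ".join(parts[:-1]) + ", and " + parts[-1]
    return f"Use when the workload needs {joined}."


# Values taken from the table row above.
ui_tars = ModelRecord("UI-TARS 1.5 7B", ["agents"], "128k", "7B")
print(use_when(ui_tars))
```

With the values from the table row, the sketch reproduces the same sentence the page lists for UI-TARS 1.5 7B.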
Release Timeline
1 release group (2026-02), 1 current
- UI-TARS 1.5 7B: Current; agents, 128k context, 7B parameters
Specifications (1 model)
| Model | Released | Context | Parameters | Vision | Multimodal | Tool Use |
|---|---|---|---|---|---|---|
| UI-TARS 1.5 7B | 2026-02 | 128k | 7B | Yes | Yes | Yes |
Frequently Asked Questions
- What is UI-TARS used for?
- UI-TARS is used for GUI automation agents, vision and multimodal work, and tool use. The family description and the listed model capabilities point to those workloads as the best fit.
- How does UI-TARS compare to Seed?
- UI-TARS by ByteDance is strongest where you need agents, while Seed by ByteDance is the closest related family to check for vision and multimodal work. UI-TARS has 1 listed variant and reaches up to 128k context, while Seed reaches up to 256k context, so compare the specs and pricing tables before choosing a production model.
- Which UI-TARS model should I use?
- UI-TARS 1.5 7B is the only listed variant, so it is also the most capable and latest local choice: it offers 128k context, tool use, and multimodal inputs. If price is the main constraint, check the pricing table first, because UI-TARS does not have complete provider pricing in the local data.






