Qwen2-VL Models by Alibaba
1 model2025Up to 32k ctxFrom $0.9/1M input
Details
ResearcherAlibaba
LicenseApache 2.0(OSI)
Commercial useCommercial use allowed
Models1
Released2025
Max context32k
Capabilities
VisionAll models
MultimodalAll models
About
Qwen2-VL is a family of 1 AI model by Alibaba, released in 2025.
Current Variants
Use-when guidance is derived from seed capabilities, context, release, and replacement fields.
1 in view
Qwen2-VL-72B-InstructCurrent
Use when the workload needs multimodal, 32k context, and 72B parameters.
2025-01multimodal32k context72B parameters
| Model | Use when | Released | Signals | Status |
|---|---|---|---|---|
| Qwen2-VL-72B-Instruct | Use when the workload needs multimodal, 32k context, and 72B parameters. | 2025-01 | multimodal32k context72B parameters | Current |
Release Timeline
1 release group2025-01
1 current
Qwen2-VL-72B-Instruct
Currentmultimodal32k context72B parameters
Specifications(1 models)
| Model | Released | Context | Parameters | Vision | Multimodal |
|---|---|---|---|---|---|
| Qwen2-VL-72B-Instruct | 2025-01 | 32k | 72B | Yes | Yes |
Available From(1 provider)
Pricing
| Model | Provider | Input / 1M | Output / 1M | Type |
|---|---|---|---|---|
| Qwen2-VL-72B-Instruct | Fireworks AI | $0.9 | $0.9 | Serverless |
Frequently Asked Questions
- What is Qwen2-VL used for?
- Qwen2-VL is used for multimodal and vision and multimodal work. The family description and listed model capabilities point to those workloads as the best fit.
- How does Qwen2-VL compare to Tongyi DeepResearch?
- Qwen2-VL by Alibaba is strongest where you need multimodal, while Tongyi DeepResearch by Alibaba is the closest related family to check for adjacent model selection. Qwen2-VL has 1 listed variant and reaches up to 32k context, while Tongyi DeepResearch reaches up to 131k context, so compare the specs and pricing tables before choosing a production model.
- Which Qwen2-VL model should I use?
- For the lowest listed input price, start with Qwen2-VL-72B-Instruct through Fireworks AI at $0.9/1M input tokens. For the most capable/latest local choice, evaluate Qwen2-VL-72B-Instruct with 32k context and multimodal inputs.






