DeepSeek VL Models by DeepSeek
About
DeepSeek-VL is an open-source family of vision-language models built for real-world applications. It comes in 1.3B and 7B parameter sizes, each with a "base" and a "chat" variant. Its hybrid vision encoder processes high-resolution 1024 x 1024 images while keeping computational cost low. To preserve language ability, vision-language data is integrated strategically during training so that multimodal training does not degrade language performance. The pretraining dataset draws on Common Crawl, web code, e-books, and educational content, and the models achieve competitive or state-of-the-art results across various benchmarks. The family aims to narrow the performance gap between open-source and closed-source models and is available on platforms such as Hugging Face.
Current Variants
Use-when guidance is derived from seed capabilities, context, release, and replacement fields.
| Model | Use when | Released | Signals | Status |
|---|---|---|---|---|
| DeepSeek VL 7B | Use when the workload needs 7B parameters and multimodal inputs. | 2024-03 | 7B parameters, multimodal inputs | Current |
| DeepSeek VL 1.3B | Use when the workload needs 1.3B parameters and multimodal inputs. | 2024-03 | 1.3B parameters, multimodal inputs | Current |
| DeepSeek VL 7B Chat | Use when the workload needs 7B parameters and multimodal inputs. | 2024-03 | 7B parameters, multimodal inputs | Current |
| DeepSeek VL 1.3B Chat | Use when the workload needs 1.3B parameters and multimodal inputs. | 2024-03 | 1.3B parameters, multimodal inputs | Current |
Release Timeline
1 release group
Specifications (4 models)
| Model | Released | Parameters | Vision | Multimodal |
|---|---|---|---|---|
| DeepSeek VL 7B | 2024-03 | 7B | Yes | Yes |
| DeepSeek VL 1.3B | 2024-03 | 1.3B | Yes | Yes |
| DeepSeek VL 7B Chat | 2024-03 | 7B | Yes | Yes |
| DeepSeek VL 1.3B Chat | 2024-03 | 1.3B | Yes | Yes |
Available From (1 provider)
Pricing
| Model | Provider | Input / 1M | Output / 1M | Type |
|---|---|---|---|---|
| DeepSeek VL 7B | Replicate API | $0.05 | $0.25 | Serverless |
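As a rough illustration of how the per-million-token prices above translate into a bill, the sketch below estimates the cost of a workload from its token counts. The function name `estimate_cost` is hypothetical, and the sketch assumes billing is linear in input and output tokens at the listed rates. How image inputs are counted toward tokens is not stated in the table, so treat the result as an estimate rather than a quote.

```python
from decimal import Decimal

# Prices copied from the pricing table (USD per 1M tokens, DeepSeek VL 7B via Replicate API).
INPUT_PRICE_PER_M = Decimal("0.05")
OUTPUT_PRICE_PER_M = Decimal("0.25")
TOKENS_PER_M = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate USD cost, assuming purely linear per-token billing (an assumption)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = Decimal(input_tokens) * INPUT_PRICE_PER_M
    output_cost = Decimal(output_tokens) * OUTPUT_PRICE_PER_M
    return (input_cost + output_cost) / TOKENS_PER_M


if __name__ == "__main__":
    # Example workload: 2M input tokens and 400k output tokens.
    print(f"Estimated cost: ${estimate_cost(2_000_000, 400_000)}")
```

Because output tokens are listed at five times the input price, long generations tend to dominate the total even when prompts are large. `Decimal` is used here to avoid floating-point rounding on small currency amounts.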
Frequently Asked Questions
- What is DeepSeek VL used for?
  - DeepSeek VL is used for vision and multimodal work, and for coding. The family description and the listed model capabilities point to those workloads as the best fit.
- How does DeepSeek VL compare to Janus?
  - DeepSeek VL is the stronger fit for vision and multimodal work, while Janus, also by DeepSeek, is the closest related family to check for image generation. DeepSeek VL has 4 listed variants, so compare the specifications and pricing tables before choosing a production model.
- Which DeepSeek VL model should I use?
  - For the lowest listed input price, start with DeepSeek VL 7B through Replicate API at $0.05 per 1M input tokens. For the most capable local option, evaluate DeepSeek VL 7B with multimodal inputs.





