DeepSeek VL 1.3B
Multimodal
About
DeepSeek VL 1.3B is a vision-language (VL) model with multimodal understanding capabilities, enabling it to process and interpret both images and text. It pairs a SigLIP-L vision encoder, which takes 384 x 384 pixel image inputs, with a language model trained on a large corpus of text and vision-language tokens. This open-source model supports tasks such as image captioning, visual question answering, and multimodal document understanding, and also performs well in scenarios requiring embodied intelligence. At 1.3 billion parameters it is compact and resource-efficient for real-world applications, and it is available on platforms like Hugging Face.
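As a minimal sketch of the preprocessing implied by the SigLIP-L encoder's fixed input resolution, the snippet below resizes an arbitrary image to 384 x 384 pixels with Pillow. This is an illustrative standalone step, not the model's own pipeline; in practice the Hugging Face processor shipped with the model handles this internally.

```python
from PIL import Image

# Assumed target resolution, taken from the card's stated SigLIP-L
# input size of 384 x 384 pixels.
TARGET_SIZE = (384, 384)

def prepare_image(image: Image.Image) -> Image.Image:
    """Convert to RGB and resize to the resolution the vision encoder expects."""
    return image.convert("RGB").resize(TARGET_SIZE)

# Example with a synthetic image standing in for a real photo.
img = Image.new("RGB", (1024, 768), color=(128, 64, 32))
resized = prepare_image(img)
print(resized.size)  # (384, 384)
```

Aspect ratio is not preserved here; a production pipeline might instead pad or center-crop before resizing, depending on how the model was trained.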
Capabilities
Multimodal
Function Calling
Tool Use
JSON Mode
Specifications
Family: DeepSeek VL
Released: 2024-03-15
Parameters: 1.3B
Architecture: Decoder Only
Specialization: general