DeepSeek VL 1.3B
Multimodal
About
DeepSeek VL 1.3B is a vision-language (VL) model with multimodal understanding capabilities, enabling it to process and interpret both images and text. It pairs a SigLIP-L vision encoder, which takes 384 x 384 pixel image inputs, with a language model trained on a large corpus of text and vision-language tokens. This open-source model supports tasks such as image captioning, visual question answering, and multimodal document understanding, and also performs well in scenarios requiring embodied intelligence. At 1.3 billion parameters it is compact and resource-efficient for real-world applications, and it is available on platforms like Hugging Face.
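As a minimal sketch of the preprocessing implied by the SigLIP-L encoder's fixed input resolution, the snippet below resizes an arbitrary image to 384 x 384 pixels with Pillow. This is an illustrative standalone step, not the model's own pipeline; in practice the Hugging Face processor shipped with the model handles this internally.

```python
from PIL import Image

# Assumed target resolution, taken from the card's stated SigLIP-L
# input size of 384 x 384 pixels.
TARGET_SIZE = (384, 384)

def prepare_image(image: Image.Image) -> Image.Image:
    """Convert to RGB and resize to the resolution the vision encoder expects."""
    return image.convert("RGB").resize(TARGET_SIZE)

# Example with a synthetic image standing in for a real photo.
img = Image.new("RGB", (1024, 768), color=(128, 64, 32))
resized = prepare_image(img)
print(resized.size)  # (384, 384)
```

Aspect ratio is not preserved here; a production pipeline might instead pad or center-crop before resizing, depending on how the model was trained.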
Capabilities
Multimodal
Function Calling
Tool Use
JSON Mode
Specifications
Family: DeepSeek VL
Released: 2024-03-15
Parameters: 1.3B
Architecture: Decoder Only
Specialization: general