DeepSeek VL 7B

Multimodal

About

DeepSeek-VL 7B is an open-source vision-language model engineered for robust real-world applications. Its general multimodal understanding lets it process logical diagrams, web pages, formulas, scientific literature, natural images, and embodied-intelligence scenarios. The model pairs a hybrid vision encoder, combining SigLIP-L and SAM-B, with the DeepSeek-LLM-7b-base language model, allowing it to handle high-resolution (1024 × 1024) image inputs. It is pre-trained on roughly 2 trillion text tokens and further trained on approximately 400 billion vision-language tokens. A chat-tuned variant, DeepSeek-VL-7b-chat, is optimized for conversational use and targets the performance and user-experience gaps of earlier open-source multimodal models.
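
For local inference, the chat variant is typically loaded through Hugging Face transformers together with the deepseek_vl package from the official DeepSeek-VL GitHub repository (https://github.com/deepseek-ai/DeepSeek-VL). The sketch below follows that repository's published usage pattern; the image path and prompt are placeholders, and a CUDA GPU with enough memory for a 7B model in bfloat16 is assumed.

```python
# Minimal local-inference sketch for DeepSeek-VL-7b-chat, following the
# usage pattern in the official DeepSeek-VL repository. Assumes the
# deepseek_vl package (installed from github.com/deepseek-ai/DeepSeek-VL)
# and a CUDA GPU with room for a 7B model in bfloat16.
import torch
from transformers import AutoModelForCausalLM
from deepseek_vl.models import VLChatProcessor
from deepseek_vl.utils.io import load_pil_images

model_path = "deepseek-ai/deepseek-vl-7b-chat"
processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = processor.tokenizer

model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
model = model.to(torch.bfloat16).cuda().eval()

# One user turn with an attached image; "<image_placeholder>" marks where
# the image embeddings are spliced into the prompt.
conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>Describe this image.",
        "images": ["./example.jpg"],  # placeholder path
    },
    {"role": "Assistant", "content": ""},
]

pil_images = load_pil_images(conversation)
inputs = processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(model.device)

# Run the hybrid vision encoder and project image features into the
# language model's embedding space, then generate the reply.
inputs_embeds = model.prepare_inputs_embeds(**inputs)
outputs = model.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
)

print(tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True))
```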

Capabilities

Multimodal, Function Calling, Tool Use, JSON Mode

Providers (1)

Provider        Input (per 1M)   Output (per 1M)   Type
Replicate API   —                —                 Serverless
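
The Replicate deployment is serverless, so it can be called through Replicate's Python client without provisioning hardware. The sketch below is illustrative only: the model slug and the input field names are assumptions, so confirm the exact identifier and schema on Replicate's catalog page before running it.

```python
# Illustrative call to a serverless DeepSeek-VL deployment on Replicate.
# Requires `pip install replicate` and REPLICATE_API_TOKEN in the environment.
import replicate

output = replicate.run(
    "deepseek-ai/deepseek-vl-7b-chat",  # hypothetical slug; check replicate.com
    input={
        "image": open("example.jpg", "rb"),  # assumed input field names
        "prompt": "Describe this image.",
        "max_new_tokens": 512,
    },
)
print(output)
```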

Specifications

Released:        2024-03-15
Parameters:      7B
Architecture:    Decoder-only
Specialization:  General