
NVLM
About
NVLM 1.0 is a family of multimodal large language models from NVIDIA, designed for vision-language tasks. The models rival leading proprietary models such as GPT-4o and compare favorably with open-access models such as Llama 3-V 405B. Notably, NVLM 1.0 improves its text-only performance after multimodal training, whereas many multimodal models lose some text capability. The family comprises three architectures: NVLM-D (decoder-only), NVLM-X (cross-attention-based), and NVLM-H (hybrid), each targeting a different aspect of multimodal processing. NVIDIA supports open research by releasing the model weights and plans to share the training code. NVLM 1.0 performs strongly on OCR, multimodal reasoning, and coding, showing capabilities that extend beyond traditional text-only tasks.