LLM Reference

NeVA 43B

About

NeVA 43B, developed by NVIDIA, is a multimodal vision-language model built on a decoder-only GPT architecture with 48 layers and trained on 1.1 trillion tokens. Its ability to understand images and generate text stems from coupling a frozen CLIP image encoder with a GPT language model. NeVA excels at visual question answering, image captioning, and image-related instruction following. Its development included pre-training on image-caption pairs from datasets such as CC-3M, followed by fine-tuning on GPT-4-generated instruction data. Running on NVIDIA's Hopper and Ampere/Turing hardware, NeVA serves inference via the Triton Inference Server. Despite its strong performance, it retains typical limitations, including biases inherited from its training data and challenges in model interpretability.
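As an illustration of how a visual question answering request to such a model is typically assembled, the sketch below pairs a base64-encoded image with a text prompt in a JSON payload. The schema, field names, and `max_tokens` parameter here are assumptions for illustration, not the documented NeVA or Triton Inference Server API:

```python
import base64
import json

def build_vqa_request(image_bytes: bytes, question: str) -> str:
    """Build an illustrative JSON payload for a visual question
    answering request.

    NOTE: the payload shape below is an assumption made for this
    sketch, not the documented NeVA / Triton request schema.
    """
    payload = {
        "messages": [
            {
                "role": "user",
                # Text prompt accompanying the image.
                "content": question,
                # Images are commonly transmitted base64-encoded in JSON.
                "image": base64.b64encode(image_bytes).decode("ascii"),
            }
        ],
        "max_tokens": 256,
    }
    return json.dumps(payload)

request_body = build_vqa_request(b"\x89PNG...", "What objects are in this image?")
```

The resulting string would then be sent to wherever the model is deployed; the exact endpoint and authentication depend on the serving setup.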

Capabilities

Multimodal, Function Calling, Tool Use, JSON Mode

Specifications

Family: NeVA
Parameters: 43B
Architecture: Decoder Only
Specialization: general