LLM Reference

NeVA 8B

About

NeVA is NVIDIA's version of LLaVA, a multimodal vision-language model engineered to interpret and respond to inputs that combine text and images. Built on a transformer architecture, NeVA pairs a GPT language model, available in 8B, 22B, and 43B parameter versions, with a CLIP vision encoder (ViT-L/14). A projection layer maps the visual features into the language model's embedding space so visual and textual information can be processed together. NeVA's two-stage training process involves pretraining on image-caption pairs followed by finetuning on synthetic instruction data, enabling it to handle complex multimodal prompts. It excels at answering questions about images, visual comprehension, and generating textual descriptions of visual content. The model is trained with the NeMo framework and deployed using NVIDIA's Triton Inference Server.
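The projection layer described above can be sketched as a simple linear map from vision-encoder features into the language model's token-embedding space. This is an illustrative sketch, not NVIDIA's implementation: the patch count, the 8B model's hidden size, and the random weights are assumptions for demonstration (CLIP ViT-L/14 does produce 1024-dim features).

```python
import numpy as np

# Illustrative sketch (not NVIDIA's implementation): project CLIP ViT-L/14
# patch features into the language model's embedding space, so image tokens
# can be interleaved with text tokens in the decoder's input sequence.
rng = np.random.default_rng(0)

VISION_DIM = 1024   # CLIP ViT-L/14 feature width
LLM_DIM = 4096      # hypothetical hidden size for the 8B language model

# Learned linear projection; random weights stand in for trained ones here.
W = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02

def project_image_features(patch_features: np.ndarray) -> np.ndarray:
    """Map [num_patches, VISION_DIM] vision features to [num_patches, LLM_DIM]."""
    return patch_features @ W

# 256 patch tokens from the vision encoder, projected into LLM space.
patches = rng.standard_normal((256, VISION_DIM))
image_tokens = project_image_features(patches)
print(image_tokens.shape)  # (256, 4096)
```

In the real model these projected "image tokens" are concatenated with the text token embeddings before being fed to the decoder.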

Capabilities

Multimodal, Function Calling, Tool Use, JSON Mode
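For multimodal use, hosted vision-language endpoints of this kind typically accept an OpenAI-style chat payload with the image inlined as a base64 data URI in the message text. The sketch below only constructs such a payload; the endpoint URL, model naming, and inline `<img>` convention are assumptions for illustration, and sending the request would require a valid API key.

```python
import base64
import json

# Assumed endpoint for illustration only; consult the official docs for the
# actual URL, model id, and request schema.
INVOKE_URL = "https://ai.api.nvidia.com/v1/vlm/nvidia/neva-8b"  # assumption

# Stand-in bytes; in practice, read a real PNG/JPEG file.
image_b64 = base64.b64encode(b"<raw PNG bytes here>").decode()

payload = {
    "messages": [
        {
            "role": "user",
            # Image inlined as a base64 data URI alongside the text prompt.
            "content": f'Describe this image. <img src="data:image/png;base64,{image_b64}" />',
        }
    ],
    "max_tokens": 256,
    "temperature": 0.2,
}

# Sending the request (not executed here) would look roughly like:
# requests.post(INVOKE_URL,
#               headers={"Authorization": f"Bearer {api_key}"},
#               json=payload)
print(json.dumps(payload)[:80])
```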

Specifications

Family: NeVA
Parameters: 8B
Architecture: Decoder Only
Specialization: General