LLM Reference

NeVA 43B

About

NeVA 43B, developed by NVIDIA, is a multimodal vision-language model built on a 48-layer, decoder-only GPT architecture and trained on 1.1 trillion tokens. It pairs a frozen CLIP image encoder with a GPT language model, which lets it understand images and generate text about them. NeVA excels at visual question answering, image captioning, and image-related instruction following. Its development involved pre-training on image-caption pairs from datasets such as CC-3M, followed by fine-tuning on GPT-4-generated instruction data. Running on NVIDIA Hopper, Ampere, and Turing hardware, NeVA serves inference efficiently through the Triton Inference Server. Despite its strong performance, it retains typical limitations, including biases inherited from its training data and challenges in model interpretability.
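The frozen-encoder design described above can be illustrated with a toy sketch. This is a minimal, hedged illustration of the general pattern (frozen vision encoder, trainable projection, visual tokens prepended to text tokens for a decoder-only LM), not NeVA's actual implementation; all function names, dimensions, and numeric details here are invented for clarity.

```python
from typing import List

EMBED_DIM = 8          # toy embedding width (the real model is vastly larger)
NUM_IMAGE_TOKENS = 4   # toy number of visual tokens per image (assumption)

def frozen_clip_encode(image: bytes) -> List[List[float]]:
    """Stand-in for the frozen CLIP image encoder: returns patch embeddings.
    Frozen means its weights are not updated during NeVA training."""
    return [[float((b + i) % 7) for b in image[:EMBED_DIM]]
            for i in range(NUM_IMAGE_TOKENS)]

def project_to_lm_space(patches: List[List[float]]) -> List[List[float]]:
    """Trainable projection aligning vision features with the LM embedding space
    (here a trivial scaling, purely for illustration)."""
    return [[x * 0.5 for x in row] for row in patches]

def embed_text(tokens: List[int]) -> List[List[float]]:
    """Stand-in for the GPT token embedding table."""
    return [[float(t % 5)] * EMBED_DIM for t in tokens]

def build_decoder_input(image: bytes, prompt_tokens: List[int]) -> List[List[float]]:
    """Projected visual tokens are prepended to the text embeddings; the
    decoder-only GPT then attends over the combined sequence."""
    visual = project_to_lm_space(frozen_clip_encode(image))
    textual = embed_text(prompt_tokens)
    return visual + textual

seq = build_decoder_input(bytes(range(8)), [10, 11, 12])
print(len(seq), len(seq[0]))  # 4 visual + 3 text tokens, each EMBED_DIM wide
```

The key design choice mirrored here is that only the projection (and the LM, during fine-tuning) is trained, while the CLIP encoder stays fixed, which keeps visual grounding stable and cheap to train.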

Capabilities

Vision, Multimodal, Reasoning, Function Calling, Tool Use, Structured Outputs, Code Execution

Specifications

Family: NeVA
Released: 2024-03-01
Parameters: 43B
Architecture: Decoder Only
Specialization: General
Training: Fine-tuning

Created by

NVIDIA: accelerated AI for enterprise solutions

Santa Clara, California, United States
Founded 1993
Website