LLM Reference

LLaVA Llama 2 7B

About

LLaVA (Large Language and Vision Assistant) is a multimodal AI model that connects the Llama 2 7B language model to a CLIP-based vision encoder through a projection layer, either a single linear projection matrix or a small multilayer perceptron (MLP). This lets LLaVA process text and images together, supporting tasks such as visual question answering, image captioning, optical character recognition (OCR), and multimodal dialogue. Training follows a two-stage process: feature-alignment pre-training of the projector, followed by fine-tuning on multimodal instruction-following data. Despite a relatively small training dataset, LLaVA performs strongly, and the same recipe adapts to different language model backbones; later versions such as LLaVA-NeXT improve image resolution handling and reasoning ability.
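
To make the architecture concrete, the sketch below shows how a projector of the kind described above might map frozen vision-encoder patch features into the language model's embedding space. It is an illustrative sketch, not the reference implementation: the dimensions (1024 for a CLIP ViT-L/14 encoder, 4096 for Llama 2 7B) and the two-layer MLP are assumed typical values, and `VisionProjector` is a hypothetical name.

```python
# Illustrative sketch (not the official LLaVA code): projecting CLIP patch
# features into the Llama 2 embedding space so they can be fed to the decoder.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP projector (LLaVA-1.5 style); the original LLaVA used a single linear layer."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):  # assumed typical dims
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the frozen vision encoder
        return self.mlp(patch_features)  # (batch, num_patches, llm_dim)

# The projected patch tokens are concatenated with the text token embeddings
# and consumed by the decoder-only language model as one sequence.
patches = torch.randn(1, 576, 1024)        # e.g. 24x24 patches from a ViT-L/14 image
image_tokens = VisionProjector()(patches)  # (1, 576, 4096), ready to prepend to text embeddings
```

During stage-one feature alignment, only this projector is trained while the vision encoder and language model stay frozen; stage two then fine-tunes on multimodal instruction-following data.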

Capabilities

Multimodal, Function Calling, Tool Use, JSON Mode
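
As a usage illustration of the multimodal and JSON-mode capabilities, the sketch below sends an image plus a prompt to a LLaVA Llama 2 7B deployment through an OpenAI-compatible chat API. The `base_url`, API key, model identifier, and file name are placeholders, and JSON mode works only if the serving stack supports it; treat this as an assumed setup, not a documented endpoint.

```python
# Hypothetical usage sketch: visual question answering with JSON output against
# an assumed OpenAI-compatible server hosting the model.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # placeholder endpoint

with open("receipt.png", "rb") as f:  # placeholder image file
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="llava-llama2-7b",  # placeholder model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Read the total amount and date from this receipt. Reply as JSON."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    response_format={"type": "json_object"},  # JSON mode, if supported by the server
)
print(response.choices[0].message.content)
```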

Specifications

Family: LLaVA
Parameters: 7B
Architecture: Decoder Only
Specialization: General