LLM Reference

LLaVA Llama 2 7B

About

LLaVA, short for Large Language and Vision Assistant, is a multimodal AI model that connects the Llama 2 7B language model to a vision encoder (typically CLIP) through a projection layer or multilayer perceptron (MLP). This combination lets LLaVA process both textual and visual input, enabling tasks such as visual question answering, image captioning, optical character recognition (OCR), and multimodal dialogue. Training follows a two-stage process: feature alignment pre-training followed by fine-tuning on multimodal instruction-following data. Despite a relatively small training dataset, LLaVA demonstrates strong performance and adapts readily to different base language models, with subsequent versions such as LLaVA-NeXT offering higher image resolution and improved reasoning [1][2][8][13].
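To make the wiring concrete, the sketch below illustrates the core idea under stated assumptions: a frozen CLIP-style encoder produces patch features, a small projector (here a two-layer MLP, as used in later LLaVA versions; the original release used a single linear layer) maps them into the Llama 2 embedding space, and the projected image tokens are prepended to the text-token embeddings. Dimension values, module names, and the helper function are illustrative, not taken from the reference implementation.

```python
# Minimal sketch of a LLaVA-style pipeline: frozen vision features are
# projected into the language model's embedding space and prepended to
# the text tokens. Names and dimensions here are assumptions for
# illustration, not the official implementation.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps vision-encoder features to the LLM embedding space.

    The original LLaVA used a single linear projection; later versions
    (e.g. LLaVA-1.5 / LLaVA-NeXT) use a two-layer MLP with GELU.
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim)
        return self.proj(image_features)

def build_multimodal_inputs(image_features, text_embeddings, projector):
    """Prepend projected image tokens to the text-token embeddings,
    producing the sequence the Llama 2 decoder attends over."""
    image_tokens = projector(image_features)           # (B, P, llm_dim)
    return torch.cat([image_tokens, text_embeddings], dim=1)

# Stage 1 (feature alignment): only the projector is trained; the vision
# encoder and the LLM stay frozen. Stage 2 (instruction tuning): the
# projector and the LLM are updated on multimodal instruction data.
if __name__ == "__main__":
    projector = VisionProjector()
    img = torch.randn(1, 576, 1024)   # e.g. 24x24 patches from CLIP ViT-L/14 at 336px
    txt = torch.randn(1, 32, 4096)    # placeholder text-token embeddings
    seq = build_multimodal_inputs(img, txt, projector)
    print(seq.shape)                  # torch.Size([1, 608, 4096])
```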

Capabilities

Vision, Multimodal, Reasoning, Function Calling, Tool Use, Structured Outputs, Code Execution

Specifications

Family: LLaVA
Released: 2023-04-17
Parameters: 7B
Architecture: Decoder-only
Specialization: General
Training: Fine-tuning

Created by

Academic research team focused on vision-language models (University of Wisconsin-Madison, in collaboration with Microsoft Research)
