LLM Reference

LLaVA Vicuna 13B

About

LLaVA Vicuna 13B is an open-source multimodal chatbot built on the Vicuna-13B language model. A pre-trained CLIP ViT-L/14 vision encoder is connected to the language model through a projection matrix, allowing the model to process both text and images. The projection layer is pre-trained on 558K curated image-text pairs, and the full model is then fine-tuned on GPT-4-generated multimodal instruction-following and VQA samples. The model performs well on tasks such as visual question answering and image captioning, approaching GPT-4 on some multimodal instruction-following evaluations. Later iterations, such as LLaVA-NeXT, improve on it with better training data and stronger base language models, although benchmark scores are not uniformly reported across datasets [1, 2, 5, 7].
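The projection-matrix coupling described above can be sketched in a few lines of numpy. The dimensions roughly match the published architecture (1024-dim CLIP ViT-L/14 patch features, 5120-dim Vicuna-13B embeddings, 256 patch tokens at 224 px input), but this is an illustrative sketch of the idea, not LLaVA's actual implementation:

```python
import numpy as np

# Dimensions roughly matching the paper: CLIP ViT-L/14 emits 256 patch
# tokens of dim 1024 at 224 px input; Vicuna-13B's hidden size is 5120.
CLIP_DIM, LLM_DIM, NUM_PATCHES = 1024, 5120, 256

rng = np.random.default_rng(0)
# In LLaVA this projection matrix is a trainable parameter; here it is
# just random weights standing in for the learned mapping.
W = rng.standard_normal((CLIP_DIM, LLM_DIM)) * 0.02

def project_visual_tokens(clip_features: np.ndarray) -> np.ndarray:
    """Map CLIP patch features into the LLM's token-embedding space."""
    return clip_features @ W

def build_multimodal_sequence(visual_tokens: np.ndarray,
                              text_embeddings: np.ndarray) -> np.ndarray:
    """Prepend projected image tokens to the text embeddings (LLaVA-style)."""
    return np.concatenate([visual_tokens, text_embeddings], axis=0)

# Stand-in activations: one image's patch features and a 32-token prompt.
clip_out = rng.standard_normal((NUM_PATCHES, CLIP_DIM))
text_emb = rng.standard_normal((32, LLM_DIM))

seq = build_multimodal_sequence(project_visual_tokens(clip_out), text_emb)
print(seq.shape)  # (288, 5120): 256 image tokens followed by 32 text tokens
```

The combined sequence is then consumed by the decoder-only language model exactly like an ordinary token-embedding sequence, which is why only the small projection needs to be trained in the first stage.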

Capabilities

Vision · Multimodal · Reasoning · Function Calling · Tool Use · Structured Outputs · Code Execution

Specifications

Family: LLaVA
Released: 2023-04-17
Parameters: 13B
Architecture: Decoder-only
Specialization: General
Training: Fine-tuning

Created by

Academic researcher focused on vision models
