LLM Reference

LLaVA Vicuna 13B

About

LLaVA Vicuna 13B is an open-source multimodal chatbot built on the Vicuna-13B language model. It integrates a pre-trained CLIP ViT-L/14 vision encoder, connected to the language model through a projection matrix, so it can process both text and images. Its training data includes 558K curated image-text pairs for vision-language alignment along with GPT-4-generated multimodal instruction-following and VQA samples. The model performs well on tasks such as visual question answering and image captioning, with reported results approaching GPT-4 on some multimodal benchmarks. Later iterations such as LLaVA-NeXT add improved data processing and stronger base language models, although detailed benchmark scores are not uniformly available across datasets.
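The projection-matrix connector described above can be sketched numerically. This is an illustrative NumPy sketch, not the real weights or pipeline; the dimensions assume CLIP ViT-L/14's 1024-d patch features and Vicuna-13B's 5120-d embedding space, and the patch count assumes a 224x224 input split into 14x14-pixel patches.

```python
import numpy as np

# Illustrative sketch of LLaVA's vision-language connector (random stand-in
# weights, not the trained model). CLIP ViT-L/14 emits one 1024-d feature per
# image patch; a learned linear projection maps each patch feature into the
# Vicuna-13B embedding space (5120-d), so patches can be fed to the language
# model as if they were ordinary token embeddings.

VISION_DIM = 1024   # CLIP ViT-L/14 hidden size (assumption)
TEXT_DIM = 5120     # Vicuna-13B hidden size (assumption)
NUM_PATCHES = 256   # 224x224 image, 14x14-pixel patches -> 16x16 grid

rng = np.random.default_rng(0)
patch_features = rng.standard_normal((NUM_PATCHES, VISION_DIM))
W_proj = rng.standard_normal((VISION_DIM, TEXT_DIM)) * 0.01  # stand-in for trained weights

# Each projected row is one "visual token" in the language model's input space.
visual_tokens = patch_features @ W_proj
print(visual_tokens.shape)  # (256, 5120)
```

In the real model these projected visual tokens are concatenated with the text token embeddings before being passed to the decoder.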

Capabilities

Multimodal, Function Calling, Tool Use, JSON Mode

Specifications

Family: LLaVA
Parameters: 13B
Architecture: Decoder Only
Specialization: general