LLaVA Vicuna 13B
About
LLaVA Vicuna 13B is an open-source multimodal chatbot built on the Vicuna-13B language model. It pairs a pre-trained CLIP ViT-L/14 vision encoder with the language model through a learned projection matrix, allowing it to process both text and images. Its training combines roughly 558K curated image-text pairs for vision-language feature alignment with GPT-4-generated multimodal instruction-following and VQA samples. The model performs strongly on tasks such as visual question answering and image captioning, approaching GPT-4-level results on some multimodal benchmarks. Later iterations such as LLaVA-NeXT add improved data mixtures and stronger language backbones, though detailed benchmark scores are not uniformly reported across datasets.
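The projection-matrix connector described above can be sketched as follows. This is a minimal illustration, not LLaVA's actual code: random arrays stand in for trained weights and real CLIP features, and the 576-patch count assumes a 336x336 input split into 14x14 patches.

```python
import numpy as np

# Toy sketch of LLaVA's vision-language connector: a single learned
# projection matrix W maps each CLIP image-patch feature into the
# language model's token-embedding space. Dimensions are illustrative:
# CLIP ViT-L/14 emits 1024-dim patch features; Vicuna-13B uses
# 5120-dim token embeddings.
rng = np.random.default_rng(0)

VISION_DIM, TEXT_DIM, NUM_PATCHES = 1024, 5120, 576

W = rng.standard_normal((VISION_DIM, TEXT_DIM)) * 0.02   # stand-in for trained weights
patch_features = rng.standard_normal((NUM_PATCHES, VISION_DIM))  # stand-in for CLIP output

# Project patch features into the LLM embedding space; the resulting
# "visual tokens" are concatenated with the text token embeddings
# before being fed to the language model.
visual_tokens = patch_features @ W
print(visual_tokens.shape)  # (576, 5120)
```

For real use, pretrained checkpoints can be loaded through libraries such as Hugging Face Transformers rather than reimplementing the connector by hand.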