LLaVA Llama 2 13B
About
LLaVA (Large Language and Vision Assistant) Llama 2 13B is an open-source multimodal chatbot that combines a vision encoder with a large language model to process text and images together. It is a fine-tuned version of Llama 2, trained on a large dataset of image-text pairs and GPT-generated multimodal instruction-following data. The architecture pairs a CLIP ViT-L/14 vision encoder with the Llama 2 language model, connected by a projection matrix that maps visual features into the language model's embedding space (earlier LLaVA releases used Vicuna as the language backbone; this variant uses Llama 2 13B). LLaVA handles open-ended conversation and visual reasoning, and can be combined with models such as GPT-4 for more complex tasks. Training proceeds in two stages: a feature-alignment stage that trains the projection matrix to align visual and language features, followed by end-to-end fine-tuning for instruction following. The model achieves state-of-the-art results on some multimodal benchmarks, though it can still struggle with complex reasoning and factual precision.
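The projection-matrix connection described above can be sketched in a few lines of PyTorch. This is an illustrative toy, not the actual LLaVA code: the dimensions (1024 for CLIP ViT-L/14 patch features, 5120 for the Llama 2 13B hidden size) and the `VisionProjector` class name are assumptions for the example, and random tensors stand in for real encoder outputs.

```python
import torch
import torch.nn as nn

# Assumed dimensions: CLIP ViT-L/14 patch features are 1024-d,
# and the Llama 2 13B hidden size is 5120.
VISION_DIM, LLM_DIM, NUM_PATCHES = 1024, 5120, 256

class VisionProjector(nn.Module):
    """Hypothetical projector: maps vision-encoder patch features
    into the language model's embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.proj(patch_features)

# Dummy stand-ins for real encoder outputs.
image_patches = torch.randn(1, NUM_PATCHES, VISION_DIM)  # from the vision encoder
text_embeds = torch.randn(1, 32, LLM_DIM)                # from the LLM's token embeddings

projector = VisionProjector(VISION_DIM, LLM_DIM)
image_tokens = projector(image_patches)

# Projected image tokens are concatenated with text embeddings
# and fed to the language model as one sequence.
llm_input = torch.cat([image_tokens, text_embeds], dim=1)
print(llm_input.shape)  # torch.Size([1, 288, 5120])
```

In the two-stage training scheme, the first stage would update only a projector like this while the vision encoder and language model stay frozen; the second stage fine-tunes the language model as well.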