
LLaVA 1.5
About
LLaVA 1.5 is a family of open-source large multimodal models created primarily for research. The models build on the LLaMA/Vicuna architecture and are fine-tuned on GPT-generated multimodal instruction-following data. The key change in LLaVA 1.5 is a two-layer multilayer perceptron (MLP) vision-language connector, which replaces the earlier linear projection and improves performance. Together with task-oriented visual question answering data, this change markedly improves results across diverse benchmarks. Available in 7B and 13B parameter variants, LLaVA 1.5 models are noted for their data efficiency, reaching state-of-the-art results on multiple benchmarks with relatively small training datasets. They are suited to applications that require joint visual and language understanding, such as visual chat, image captioning, and visual question answering.
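
To make the connector concrete, the sketch below shows what a two-layer MLP vision-language projector looks like in PyTorch. The hidden sizes (1024 for the vision features, 4096 for the language model embedding space), the GELU activation, and the 576-patch example input are illustrative assumptions chosen to mirror a typical LLaVA 1.5-style configuration, not a reproduction of the released code.

```python
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Two-layer MLP that maps vision-encoder features into the LLM embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Assumed layout: Linear -> GELU -> Linear, replacing a single linear projection.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),  # first projection layer
            nn.GELU(),                       # non-linearity between the two layers
            nn.Linear(llm_dim, llm_dim),     # second projection layer
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim) from the image encoder
        return self.proj(vision_features)


if __name__ == "__main__":
    # Example: project 576 patch tokens (a 24x24 grid, an assumed resolution) into the LLM space.
    projector = MLPProjector()
    patches = torch.randn(1, 576, 1024)
    tokens = projector(patches)
    print(tokens.shape)  # torch.Size([1, 576, 4096])
```

The projected tokens are then concatenated with the text embeddings and fed to the language model, which is what lets the 7B and 13B variants reason jointly over image patches and instructions.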