
LLaVA 1.6
About
LLaVA 1.6 is a large multimodal model that pairs a vision encoder with a large language model for combined visual and language understanding 12. Designed for multimodal chatbot applications, it improves on its predecessor, LLaVA 1.5, in several ways. It accepts input images with four times as many pixels, supporting resolutions up to 672x672 as well as other aspect ratios 2. A refined instruction tuning data mixture strengthens its visual reasoning and OCR capabilities 2. It is built on several LLM backbones, including Mistral-7B and Vicuna-13B, and some of these backbones bring more permissive commercial licensing and bilingual support 4. The largest variant, with 34B parameters, is reported to train efficiently and to outperform commercial models such as Google's Gemini Pro on several benchmarks. An online demo showcases its chat and visual question answering capabilities 10.
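The fourfold resolution figure is easiest to read as a pixel count rather than a side length. The short Python check below is only an illustrative sketch. It assumes a 336x336 base input for LLaVA 1.5 and adds two non-square resolutions (336x1344 and 1344x336) that come from the LLaVA 1.6 release notes rather than from the text above. The names `BASE`, `CANDIDATES`, and `pixel_ratio` are hypothetical helpers, not part of any LLaVA codebase.

```python
# Assumed base input size of LLaVA 1.5 (not stated in the text above).
BASE = (336, 336)

# Resolutions attributed to LLaVA 1.6. Only 672x672 is named in the text
# above; the two non-square entries are assumptions from the release notes.
CANDIDATES = [(672, 672), (336, 1344), (1344, 336)]


def pixel_ratio(size, base=BASE):
    """Return how many times more pixels `size` has than `base`."""
    return (size[0] * size[1]) / (base[0] * base[1])


# Map each candidate resolution to its pixel ratio against the base.
ratios = {size: pixel_ratio(size) for size in CANDIDATES}

for size, ratio in ratios.items():
    print(f"{size[0]}x{size[1]}: {ratio:g}x the pixels of {BASE[0]}x{BASE[1]}")
```

Under these assumptions every listed resolution holds the same number of pixels, so "fourfold" describes the total pixel budget, while the side length of the square case only doubles.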