LLaVA 1.6

About

LLaVA 1.6 is a large multimodal model that couples a vision encoder with a large language model for combined visual and language understanding [1][2]. Designed for multimodal chatbot applications, it improves on its predecessor, LLaVA 1.5, in several ways. Input images can carry four times as many pixels, with support for up to 672x672 and other resolutions [2]. A refined instruction-tuning data mixture strengthens visual reasoning and OCR [2]. It uses LLMs such as Vicuna-13B, Mistral-7B, and Nous-Hermes-2-Yi-34B as backbones; the latter two bring more flexible commercial-use terms and strong bilingual support [4]. The largest model, at 34B parameters, is efficient to train and outperforms commercial models such as Google's Gemini Pro on several benchmarks. An online demo showcases its chat and visual question answering capabilities [10].
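The "fourfold" resolution figure refers to pixel count, not side length. The short sketch below works through that arithmetic. It assumes a 336x336 baseline input for LLaVA 1.5, which the text above does not state, and the helper names `pixel_ratio` and `tile_grid` are hypothetical, introduced only for illustration.

```python
# Assumption: LLaVA 1.5 takes 336x336 inputs; this baseline is not stated in the text above.
BASE_SIDE = 336


def pixel_ratio(width: int, height: int, base_side: int = BASE_SIDE) -> float:
    """Return how many times more pixels a width x height image has than the baseline."""
    return (width * height) / (base_side * base_side)


def tile_grid(width: int, height: int, base_side: int = BASE_SIDE) -> tuple[int, int]:
    """Return (columns, rows) of baseline-sized tiles that exactly cover the image."""
    if width % base_side or height % base_side:
        raise ValueError("resolution must be a whole multiple of the base side")
    return (width // base_side, height // base_side)


if __name__ == "__main__":
    print(pixel_ratio(672, 672))  # 4.0
    print(tile_grid(672, 672))    # (2, 2)
```

Under this assumption, a 672x672 image doubles each side of the baseline and so holds four times as many pixels, which matches the fourfold claim.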

Models (4)

LLaVA 1.6 Vicuna 7B

LLaVA 1.6 Vicuna 13B (13B)

LLaVA 1.6 Mistral 7B

LLaVA 1.6 Hermes Yi 34B (2024-01, 34B)

Details

Researcher: Haotian Liu

Models: 4

Links

Website HuggingFace