LLaVA Models by Haotian Liu
About
LLaVA (Large Language and Vision Assistant) is a family of open-source large multimodal models (LMMs) developed by a collaborative team from the University of Wisconsin-Madison, Microsoft Research, and Columbia University [1][2][6]. The models connect a vision encoder, such as CLIP ViT-L/14, to large language models such as Vicuna, Mistral, and Nous-Hermes for combined visual and language understanding [1][2][6]. A key feature of LLaVA is end-to-end training on multimodal instruction-following data generated with GPT-4 [1][2]. Later releases include LLaVA-1.5, which added an MLP vision-language connector and academic task-oriented data, and LLaVA-NeXT (1.6), which raised the input image resolution and broadened LLM support [6]. The models emphasize data efficiency and are openly available for research [1][2].
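Because the vision encoder turns each image into a sequence of patch tokens that share the language model's context window, it can help to estimate how much of a 4k context an image takes up. The following is a rough sketch under stated assumptions, not a description of how any particular LLaVA checkpoint is configured: the input resolutions (224 and 336 pixels) are assumed, "4k" is read as 4096 tokens, and every 14x14 patch is assumed to become exactly one token with no extra class token. The function names are hypothetical.

```python
def patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Patch tokens a ViT produces for a square image (assumes no class token)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def remaining_context(context_len: int, image_size: int, n_images: int = 1) -> int:
    """Tokens left for text after reserving room for n_images images."""
    return context_len - n_images * patch_tokens(image_size)


CONTEXT_LEN = 4096  # assumption: "4k context" means 4096 tokens

for size in (224, 336):  # assumed input resolutions, not taken from this page
    print(size, patch_tokens(size), remaining_context(CONTEXT_LEN, size))
# 224 256 3840
# 336 576 3520
```

Under these assumptions a single image uses a modest share of the window, but several images in one prompt reduce the space left for text quickly.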
Current Variants
Use-when guidance is derived from seed capabilities, context, release, and replacement fields.
| Model | Use when | Released | Signals | Status |
|---|---|---|---|---|
| LLaVA Vicuna 13B | Use when the workload needs 13B parameters. | 2023-04 | 13B parameters | Current |
| LLaVA Llama 2 13B | Use when the workload needs 13B parameters. | 2023-04 | 13B parameters | Current |
| LLaVA Llama 2 7B | Use when the workload needs 7B parameters. | 2023-04 | 7B parameters | Current |
| LLaVA 13B | Use when the workload needs 4k context, 13B parameters, and multimodal inputs. | 2023-04 | 4k context, 13B parameters, multimodal inputs | Current |
Release Timeline
1 release group
Specifications (4 models)
| Model | Released | Context | Parameters | Vision | Multimodal |
|---|---|---|---|---|---|
| LLaVA Vicuna 13B | 2023-04 | — | 13B | No | No |
| LLaVA Llama 2 13B | 2023-04 | — | 13B | No | No |
| LLaVA Llama 2 7B | 2023-04 | — | 7B | No | No |
| LLaVA 13B | 2023-04 | 4k | 13B | Yes | Yes |
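The specifications above can also be filtered programmatically when choosing a variant. The sketch below copies the four table rows into a small data structure and selects by requirement. The `Variant` class, the `pick` helper, and its parameter names are hypothetical and exist only for illustration. The values are transcribed from the table as listed, including the "No" entries.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Variant:
    name: str
    released: str
    context_k: Optional[int]  # None where the table lists no context value
    params_b: int
    vision: bool
    multimodal: bool


# Rows transcribed from the specifications table as listed.
VARIANTS = [
    Variant("LLaVA Vicuna 13B", "2023-04", None, 13, False, False),
    Variant("LLaVA Llama 2 13B", "2023-04", None, 13, False, False),
    Variant("LLaVA Llama 2 7B", "2023-04", None, 7, False, False),
    Variant("LLaVA 13B", "2023-04", 4, 13, True, True),
]


def pick(variants: List[Variant], *, min_context_k: int = 0,
         max_params_b: Optional[int] = None,
         need_multimodal: bool = False) -> List[str]:
    """Return names of variants that satisfy every stated requirement."""
    matches = []
    for v in variants:
        # A variant with no listed context cannot satisfy a context requirement.
        if min_context_k and (v.context_k is None or v.context_k < min_context_k):
            continue
        if max_params_b is not None and v.params_b > max_params_b:
            continue
        if need_multimodal and not v.multimodal:
            continue
        matches.append(v.name)
    return matches


print(pick(VARIANTS, need_multimodal=True))  # ['LLaVA 13B']
print(pick(VARIANTS, max_params_b=7))        # ['LLaVA Llama 2 7B']
```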
Available From (1 provider)
Frequently Asked Questions
- What is LLaVA used for?
- LLaVA is used for vision and multimodal work, coding, and chatbot and role-playing use cases. The family description and listed model capabilities point to those workloads as the best fit.
- How does LLaVA compare to LLaVA 1.5?
- LLaVA by Haotian Liu is strongest for vision and multimodal work, while LLaVA 1.5 by Haotian Liu is the closest related family to check for structured outputs. LLaVA has 4 listed variants, and both families reach up to 4k context, so compare the specs and pricing tables before choosing a production model.
- Which LLaVA model should I use?
- If price is the main constraint, check the pricing table first, because the local data does not include complete provider pricing for LLaVA. For the most capable of the listed variants, evaluate LLaVA 13B, which offers 4k context and multimodal inputs.


