LLM Reference

LLaVA Models by Haotian Liu

4 models · 2023 · Up to 4k ctx

About

LLaVA (Large Language and Vision Assistant) is a family of open-source large multimodal models (LMMs) developed by a team from the University of Wisconsin-Madison, Microsoft Research, and Columbia University [1, 2, 6]. The models pair a vision encoder, such as CLIP ViT-L/14, with a large language model such as Vicuna, Mistral, or Nous-Hermes to support joint visual and language understanding [1, 2, 6]. A key feature of LLaVA is end-to-end training on multimodal instruction-following data generated with GPT-4 [1, 2]. Later releases include LLaVA-1.5, which added an MLP vision-language connector and academic task-oriented data, and LLaVA-NeXT (1.6), which increased image resolution and broadened LLM support [6]. The models emphasize data efficiency and are accessible for research use [1, 2].
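The data flow described above (vision encoder features passed through an MLP connector into the language model's input sequence) can be illustrated with a toy sketch. This is not the project's implementation: the dimensions, the random weights, the two-layer GELU layout, the helper names (`mlp_connector`, `build_llm_input`), and the choice to place image tokens before text tokens are all assumptions made for illustration, and the encoder and language model themselves are replaced by random vectors.

```python
import math
import random


def gelu(x: float) -> float:
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(vec, weight, bias):
    """Apply one dense layer; weight has out_dim rows of in_dim values."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def init_layer(in_dim, out_dim, rng):
    """Create small random weights and zero biases (placeholder for trained values)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def mlp_connector(patch_features, layer1, layer2):
    """Project each vision patch feature into the (assumed) LLM embedding width."""
    projected = []
    for patch in patch_features:
        hidden = [gelu(h) for h in linear(patch, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected


def build_llm_input(image_tokens, text_embeddings):
    """Concatenate projected image tokens with text embeddings (assumed ordering)."""
    return image_tokens + text_embeddings


rng = random.Random(0)
VISION_DIM, LLM_DIM = 8, 12      # toy sizes, far smaller than any real model
NUM_PATCHES, NUM_TEXT = 4, 3     # toy sequence lengths

# Stand-ins for vision encoder output and embedded text tokens.
patch_features = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(NUM_TEXT)]

layer1 = init_layer(VISION_DIM, LLM_DIM, rng)
layer2 = init_layer(LLM_DIM, LLM_DIM, rng)

image_tokens = mlp_connector(patch_features, layer1, layer2)
sequence = build_llm_input(image_tokens, text_embeddings)
print(len(sequence), len(sequence[0]))  # 7 12
```

The point of the sketch is only that the connector changes the feature width so that image patches and text tokens can share one input sequence; everything else about training and inference is out of scope here.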

Current Variants

Use-when guidance is derived from seed capabilities, context, release, and replacement fields.


Use when the workload needs 13B parameters.

2023-04 · 13B parameters

Use when the workload needs 13B parameters.

2023-04 · 13B parameters

Use when the workload needs 7B parameters.

2023-04 · 7B parameters

LLaVA 13B (Current)

Use when the workload needs 4k context, 13B parameters, and multimodal inputs.

2023-04 · 4k context · 13B parameters · multimodal inputs

Release Timeline

1 release group
2023-04
4 current
LLaVA 13B
4k context · 13B parameters · multimodal inputs
Current
LLaVA Llama 2 13B
13B parameters
Current
LLaVA Llama 2 7B
7B parameters
Current
LLaVA Vicuna 13B
13B parameters
Current

Specifications (4 models)

LLaVA model specifications comparison

Model               Released   Context   Parameters   Vision   Multimodal
LLaVA Vicuna 13B    2023-04    -         13B          No       No
LLaVA Llama 2 13B   2023-04    -         13B          No       No
LLaVA Llama 2 7B    2023-04    -         7B           No       No
LLaVA 13B           2023-04    4k        13B          Yes      Yes

Available From (1 provider)

Frequently Asked Questions

What is LLaVA used for?
LLaVA is used for vision and multimodal work, coding, and chatbot and role-playing use cases. The family description and listed model capabilities point to those workloads as the best fit.
How does LLaVA compare to LLaVA 1.5?
LLaVA by Haotian Liu is strongest for vision and multimodal work, while LLaVA 1.5 by Haotian Liu is the closest related family to check for structured outputs. LLaVA has 4 listed variants, and both families reach up to 4k context, so compare the specs and pricing tables before choosing a production model.
Which LLaVA model should I use?
If price is the main constraint, check the pricing table first, keeping in mind that LLaVA does not have complete provider pricing in the local data. For the most capable or latest local choice, evaluate LLaVA 13B, which offers 4k context and multimodal inputs.
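As a rough illustration of the selection advice above, the listed specifications can be treated as records and filtered by requirement. The values are copied from the specifications on this page; the `Variant` class, the `select_variants` helper, and its parameter names are hypothetical and exist only for this sketch. Context is stored in thousands of tokens as listed ("4k"), with `None` where the page gives no value.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Variant:
    name: str
    released: str
    context_k: Optional[int]  # context in thousands of tokens; None where not listed
    parameters_b: int         # parameter count in billions
    multimodal: bool


# Values copied from the specifications listed on this page.
VARIANTS = [
    Variant("LLaVA Vicuna 13B", "2023-04", None, 13, False),
    Variant("LLaVA Llama 2 13B", "2023-04", None, 13, False),
    Variant("LLaVA Llama 2 7B", "2023-04", None, 7, False),
    Variant("LLaVA 13B", "2023-04", 4, 13, True),
]


def select_variants(
    variants: List[Variant],
    need_multimodal: bool = False,
    max_parameters_b: Optional[int] = None,
    min_context_k: Optional[int] = None,
) -> List[Variant]:
    """Return the variants that satisfy every requirement that was given."""
    matches = []
    for v in variants:
        if need_multimodal and not v.multimodal:
            continue
        if max_parameters_b is not None and v.parameters_b > max_parameters_b:
            continue
        # A variant with no listed context cannot be shown to meet a minimum.
        if min_context_k is not None and (v.context_k is None or v.context_k < min_context_k):
            continue
        matches.append(v)
    return matches


print([v.name for v in select_variants(VARIANTS, need_multimodal=True)])
# ['LLaVA 13B']
```

Treating a missing context value as "does not meet the minimum" is a conservative design choice for this sketch; a different policy may suit other catalogs.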

Models (4)