LLaVA 1.6 Vicuna 13B
About
LLaVA 1.6 Vicuna 13B is a multimodal language model for chatbot tasks that combine text and images. It pairs a pre-trained LLM, Vicuna-13B, with a vision encoder (likely CLIP ViT-L/14); a trainable projection matrix maps the encoder's image features into the LLM's input space, so the model can reason over textual and visual content together. Its capabilities include image captioning, visual question answering, and improved reasoning and OCR, with support for high-resolution images up to 672x672 pixels. Despite these improvements over its predecessors, it still depends on diverse training data, is computationally expensive to run, and carries inherent biases, so further refinement is needed before it reaches its potential across applications.
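The projection step described above can be illustrated with a minimal, pure-Python sketch. It is not the model's actual code: the function names (`project_patches`, `build_input_sequence`), the toy dimensions, and the random values are all assumptions chosen for readability, and the real vision and LLM embedding widths are far larger. It shows only the general idea of mapping per-patch vision features into the LLM embedding space with one matrix and placing the resulting image tokens ahead of the text tokens.

```python
import random


def project_patches(patch_features, weight):
    """Map each vision feature (length d_vision) to the LLM width (d_llm).

    `weight` is a d_vision x d_llm matrix stored as a list of rows. In a real
    model this matrix would be learned; here it is just a placeholder.
    """
    d_vision = len(weight)
    d_llm = len(weight[0])
    projected = []
    for feat in patch_features:
        if len(feat) != d_vision:
            raise ValueError("feature width does not match projection matrix")
        projected.append(
            [sum(feat[i] * weight[i][j] for i in range(d_vision)) for j in range(d_llm)]
        )
    return projected


def build_input_sequence(image_tokens, text_embeddings):
    """Concatenate projected image tokens with text token embeddings."""
    return image_tokens + text_embeddings


# Toy sizes (hypothetical, chosen only to keep the example small).
D_VISION, D_LLM = 4, 6
NUM_PATCHES, NUM_TEXT_TOKENS = 3, 2

rng = random.Random(0)
patch_features = [[rng.uniform(-1, 1) for _ in range(D_VISION)] for _ in range(NUM_PATCHES)]
weight = [[rng.uniform(-1, 1) for _ in range(D_LLM)] for _ in range(D_VISION)]
text_embeddings = [[rng.uniform(-1, 1) for _ in range(D_LLM)] for _ in range(NUM_TEXT_TOKENS)]

image_tokens = project_patches(patch_features, weight)
sequence = build_input_sequence(image_tokens, text_embeddings)

# Every token in the combined sequence now has the LLM embedding width.
print(len(sequence), len(sequence[0]))  # prints: 5 6
```

Because the image tokens end up with the same width as the text embeddings, the language model can process both in a single sequence, which is what lets one model handle captioning and visual question answering.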
Capabilities
Providers (1)
| Provider | Input (per 1M) | Output (per 1M) | Type |
|---|---|---|---|
| Replicate API | — | — | Serverless |