LLM Reference

LLaVA 1.6 Vicuna 13B

About

LLaVA 1.6 Vicuna 13B is a multimodal language model built for chatbot tasks that combine text and images. It pairs a pre-trained LLM, Vicuna-13B, with a vision encoder (most likely CLIP ViT-L/14) through a trainable projection layer that maps image features into the language model's input space, so the model can interpret textual and visual content together. Its capabilities include image captioning and visual question answering, with stronger reasoning and OCR than its predecessors, and it accepts high-resolution images of up to 672x672 pixels. It still faces limitations: dependence on diverse training data, high computational cost, and inherent biases, all of which call for further refinement [1][2][3][8].
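To make the architecture description more concrete, here is a minimal sketch of two pieces of arithmetic it implies: how many patch embeddings a ViT-L/14 style encoder would produce for an image, and how a projection step maps one patch feature into another dimension. It is an illustration under stated assumptions, not the model's actual implementation. The 336-pixel tile size, the tiling of larger images, and the extra downscaled overview tile are assumptions not stated above; `VisionConfig`, `visual_token_estimate`, and `project` are hypothetical names used only here.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class VisionConfig:
    tile_size: int = 336  # ASSUMPTION: encoder input resolution in pixels
    patch_size: int = 14  # patch edge implied by the "/14" in ViT-L/14


def patches_per_tile(cfg: VisionConfig) -> int:
    """Number of patch embeddings the encoder emits for one square tile."""
    per_side = cfg.tile_size // cfg.patch_size
    return per_side * per_side


def visual_token_estimate(width: int, height: int,
                          cfg: VisionConfig = VisionConfig()) -> int:
    """Rough token count: one token per patch over all tiles.

    ASSUMPTION: images larger than one tile are split into a grid of tiles
    and one extra downscaled overview tile is added.
    """
    tiles = math.ceil(width / cfg.tile_size) * math.ceil(height / cfg.tile_size)
    overview = 1 if tiles > 1 else 0
    return (tiles + overview) * patches_per_tile(cfg)


def project(feature: list[float], weights: list[list[float]]) -> list[float]:
    """Apply a projection matrix (one row per output dimension) to a feature."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


if __name__ == "__main__":
    print(visual_token_estimate(336, 336))  # 576
    print(visual_token_estimate(672, 672))  # 2880
    # Toy 2-dim feature projected to 3 dims; real dimensions are far larger.
    print(project([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

Under these assumptions a 672x672 image costs roughly five times as many visual tokens as a 336x336 one, which is one way to see where the computational cost mentioned above comes from.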

Capabilities

Multimodal, Function Calling, Tool Use, JSON Mode

Providers (1)

Provider        Input (per 1M)    Output (per 1M)    Type
Replicate API   -                 -                  Serverless

Specifications

Family: LLaVA 1.6
Parameters: 13B
Architecture: Decoder Only
Specialization: General