LLM Reference
Fireworks AI

FireLLaVA 13B on Fireworks AI

FireLLaVA · Fireworks AI

Serverless

Pricing

TypePrice (per 1M)
Input tokens$0.90
Output tokens$0.90

Capabilities

VisionMultimodalReasoningFunction CallingTool UseJSON ModeCode Execution

About FireLLaVA 13B

FireLLaVA 13B is an advanced vision-language model (VLM) designed to process both images and text, excelling in tasks such as image-to-text generation and visual question answering. Built by integrating CodeLlama 34B Instruct with a vision component akin to OpenAI's CLIP-ViT, this model utilizes a training set consisting of 588K lines of single and multi-turn visual question answering data. As the first commercially permissive open-source LLaVA model, FireLLaVA 13B effectively mimics the multimodal capabilities of GPT-4, showing comparable or superior performance on certain benchmarks. Despite limitations with multiple image inputs and small text recognition, it is readily accessible on Hugging Face and through APIs, requiring a specific prompt template to function optimally.

Get Started

Model Specs

Released2024-04-10
Parameters13B
ArchitectureDecoder Only