FireLLaVA 13B
About
FireLLaVA 13B is an advanced vision-language model (VLM) designed to process both images and text, excelling in tasks such as image-to-text generation and visual question answering. Built by integrating CodeLlama 34B Instruct with a vision component akin to OpenAI's CLIP-ViT, this model utilizes a training set consisting of 588K lines of single and multi-turn visual question answering data. As the first commercially permissive open-source LLaVA model, FireLLaVA 13B effectively mimics the multimodal capabilities of GPT-4, showing comparable or superior performance on certain benchmarks. Despite limitations with multiple image inputs and small text recognition, it is readily accessible on Hugging Face and through APIs, requiring a specific prompt template to function optimally.
Capabilities
Providers(1)
| Provider | Input (per 1M) | Output (per 1M) | Type | |
|---|---|---|---|---|
| Fireworks AI Platform | $0.9 | $0.9 | Serverless |