Pricing
| Type | Price (per 1M tokens) |
|---|---|
| Input tokens | $0.90 |
| Output tokens | $0.90 |
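As a quick illustration of the rates above, the sketch below estimates the cost of a single request. The token counts are made-up example values; only the $0.90-per-million rates come from the table.

```python
# Estimate request cost from the per-token rates listed above.
INPUT_RATE = 0.90 / 1_000_000   # USD per input token
OUTPUT_RATE = 0.90 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example with made-up token counts: a 2,000-token prompt and a 500-token reply.
print(f"${request_cost(2_000, 500):.6f}")  # -> $0.002250
```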
Capabilities
About FireLLaVA 13B
FireLLaVA 13B is a vision-language model (VLM) that processes both images and text, handling tasks such as image-to-text generation and visual question answering. It follows the LLaVA architecture, pairing a language model with a vision encoder akin to OpenAI's CLIP-ViT, and was trained on 588K lines of single- and multi-turn visual question answering data generated with CodeLlama 34B Instruct rather than GPT-4. As the first commercially permissive open-source LLaVA model, it performs comparably to, and on some benchmarks better than, LLaVA models trained on GPT-4-generated data. It has known limitations with multiple image inputs and small-text recognition, and it requires a specific prompt template to perform well, but it is readily available on Hugging Face and through hosted APIs.
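Since the model is served through hosted APIs, a minimal sketch of an image-plus-text request is shown below, assuming an OpenAI-compatible chat-completions endpoint. The base URL, model identifier, environment variable name, and image URL are assumptions for illustration; check your provider's documentation for the exact values and for the prompt template the model expects.

```python
# Minimal sketch of querying FireLLaVA 13B via an OpenAI-compatible API.
# Endpoint, model id, and env var name below are assumptions, not confirmed values.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # assumed endpoint
    api_key=os.environ["FIREWORKS_API_KEY"],           # assumed env var name
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/firellava-13b",   # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is shown in this image."},
                {
                    "type": "image_url",
                    # Images are typically passed by URL or as a base64 data URI.
                    "image_url": {"url": "https://example.com/photo.jpg"},  # placeholder
                },
            ],
        }
    ],
    max_tokens=256,
)

print(response.choices[0].message.content)
```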