LLM Reference

FireLLaVA 13B

About

FireLLaVA 13B is a vision-language model (VLM) that processes both images and text, excelling at tasks such as image-to-text generation and visual question answering. It pairs a language model with a vision component akin to OpenAI's CLIP-ViT, and was trained on 588K lines of single- and multi-turn visual question answering data generated with CodeLlama 34B Instruct. As the first LLaVA model released under a commercially permissive open-source license, FireLLaVA 13B delivers multimodal capabilities in the spirit of GPT-4, with comparable or superior performance on certain benchmarks. It has known limitations with multiple image inputs and small text recognition, and it requires a specific prompt template to perform optimally. The model is available on Hugging Face and through hosted APIs.
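API access can be sketched as below. This example assumes an OpenAI-compatible chat completions endpoint and the model ID used on the Fireworks AI platform; verify both against the provider's current documentation before use. The hosted API applies the model's prompt template server-side, so the request only needs structured messages.

```python
# Hedged sketch of querying FireLLaVA 13B through an OpenAI-compatible
# chat API. Endpoint URL, model ID, and payload shape are assumptions
# based on the Fireworks AI serverless platform.
import json
import os
import urllib.request

API_URL = "https://api.fireworks.ai/inference/v1/chat/completions"
MODEL_ID = "accounts/fireworks/models/firellava-13b"  # assumed model ID


def build_payload(question: str, image_url: str, max_tokens: int = 256) -> dict:
    """Build a chat request mixing text with a single image.

    FireLLaVA handles a single image per request best (multiple image
    inputs are a known limitation), so only one image_url is included.
    """
    return {
        "model": MODEL_ID,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def ask(question: str, image_url: str) -> str:
    """Send the request; requires FIREWORKS_API_KEY in the environment."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(question, image_url)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

`build_payload` is separated from the network call so the request shape can be inspected or tested without an API key.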

Capabilities

Multimodal, Function Calling, Tool Use, JSON Mode
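
Of these capabilities, JSON Mode constrains the model to emit valid JSON. The sketch below assumes the OpenAI-compatible `response_format` convention for enabling it; the exact field shape should be checked against the provider's documentation.

```python
# Hedged sketch of a JSON-mode request for FireLLaVA 13B, assuming the
# OpenAI-compatible response_format convention.
def build_json_mode_payload(question: str) -> dict:
    """Build a text-only chat request that asks for structured JSON output."""
    return {
        "model": "accounts/fireworks/models/firellava-13b",  # assumed model ID
        "response_format": {"type": "json_object"},  # constrain output to valid JSON
        "messages": [
            {
                "role": "user",
                "content": "Answer as a JSON object: " + question,
            }
        ],
    }
```

Prompting the model to "answer as a JSON object" alongside the `response_format` flag is a common practice, since the flag enforces syntax but not schema.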

Providers (1)

Provider: Fireworks AI Platform
Input (per 1M tokens): $0.90
Output (per 1M tokens): $0.90
Type: Serverless

Specifications

Family: FireLLaVA
Parameters: 13B
Architecture: Decoder Only
Specialization: General