LLM Reference

FireLLaVA 13B

About

FireLLaVA 13B is a vision-language model (VLM) that processes both images and text, handling tasks such as image-to-text generation and visual question answering. It follows the LLaVA architecture, pairing a 13-billion-parameter Llama-family language model with a vision encoder akin to OpenAI's CLIP-ViT, and was fine-tuned on roughly 588K lines of single- and multi-turn visual question answering data generated with CodeLlama 34B Instruct. As the first commercially permissive open-source LLaVA model, it delivers performance comparable to, and on some benchmarks better than, LLaVA variants trained on GPT-4-generated data. It has known limitations with multiple image inputs and small text, and it requires a specific prompt template to perform well; the model is available on Hugging Face and through hosted APIs.
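Because the model expects its prompt template, a minimal usage sketch with the Hugging Face transformers library is shown below. It assumes the repository id fireworks-ai/FireLLaVA-13b and the standard LLaVA-1.5 style "USER: <image> ... ASSISTANT:" template; verify both against the model card before relying on them.

# Minimal sketch: querying FireLLaVA 13B locally via Hugging Face transformers.
# Assumptions: repo id "fireworks-ai/FireLLaVA-13b", LLaVA-1.5 style prompt template,
# and the `accelerate` package installed for device_map="auto".
from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image

model_id = "fireworks-ai/FireLLaVA-13b"  # assumed Hugging Face repository id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# The <image> placeholder marks where the vision tokens are inserted into the prompt.
prompt = "USER: <image>\nWhat is shown in this picture? ASSISTANT:"
image = Image.open("example.jpg")  # any local image

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))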

Capabilities

Vision, Multimodal, Reasoning, Function Calling, Tool Use, Structured Outputs, Code Execution

Providers (1)

Provider | Input (per 1M tokens) | Output (per 1M tokens) | Type
Fireworks AI | $0.90 | $0.90 | Serverless
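To make the per-1M-token units concrete, here is a small, illustrative cost calculation at the listed serverless rate; the request sizes are hypothetical.

# Illustrative cost for one request at the listed Fireworks AI serverless rate.
PRICE_PER_TOKEN = 0.90 / 1_000_000             # $0.90 per 1M tokens, input and output alike
prompt_tokens, completion_tokens = 1_200, 300  # hypothetical request size
cost = (prompt_tokens + completion_tokens) * PRICE_PER_TOKEN
print(f"${cost:.5f}")  # -> $0.00135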

Specifications

Family: FireLLaVA
Released: 2024-04-10
Parameters: 13B
Architecture: Decoder-only
Specialization: General
Training: Fine-tuning

Created by

Fireworks AI
"Blazing-fast inference for generative AI"
Redwood City, California, United States
Founded 2022