LLM Reference

PaliGemma 3B 896

paligemma-3b-896

Open Source, Multimodal

About

PaliGemma 3B 896 is a lightweight vision-language model developed by Google that takes both images and text as input. Inspired by PaLI-3, it combines the SigLIP vision model with the Gemma-2B language model, using a linear projection layer to map visual features into the language model's input space. It handles tasks such as image captioning, visual question answering, object detection, and segmentation, and it supports multilingual text. The model needs task-specific fine-tuning for best performance, and it may struggle with contextual understanding, exhibit biases, and carry notable computational demands.
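To make the architecture concrete, here is a minimal sketch of how an image could be turned into tokens and joined with text before reaching the language model. The 14-pixel patch size, the tiny embedding widths, and all function names are illustrative assumptions, not values stated on this page. The 896-pixel input side is taken from the model name, assuming the suffix denotes input resolution.

```python
# Toy sketch of the image-to-text bridge described above.
# Assumptions (not stated on this page): square 896x896 input, 14-pixel patches,
# and tiny made-up embedding widths so the example runs in plain Python.

def num_image_tokens(image_side: int, patch_side: int) -> int:
    """Number of non-overlapping square patches, one token per patch."""
    if image_side % patch_side != 0:
        raise ValueError("image side must be a multiple of the patch side")
    per_side = image_side // patch_side
    return per_side * per_side

def linear_project(vectors, weight, bias):
    """Apply y = W x + b to each vector; weight has shape (out_dim, in_dim)."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in vectors
    ]

def build_input_sequence(image_embeddings, text_embeddings, weight, bias):
    """Project image embeddings to the text width, then prepend them to the text."""
    projected = linear_project(image_embeddings, weight, bias)
    return projected + text_embeddings

# Hypothetical widths: vision embeddings of size 3, language embeddings of size 2.
image_embeddings = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]  # stand-in for vision encoder output
text_embeddings = [[0.5, 0.5]]                          # stand-in for embedded prompt tokens
weight = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]             # shape (2, 3)
bias = [0.0, 1.0]

sequence = build_input_sequence(image_embeddings, text_embeddings, weight, bias)
tokens_at_896 = num_image_tokens(896, 14)
```

Under these assumptions an 896-pixel image yields 64 x 64 = 4096 image tokens, which gives a sense of why high-resolution inputs are computationally demanding.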

PaliGemma 3B 896 has a 512-token context window.
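A small sketch of what a fixed 512-token limit means in practice: the prompt and the generated answer share one budget. Whether image tokens count against this figure is not stated here, so the sketch covers text tokens only. The function name, the example prompt, and the whitespace-based token count are hypothetical stand-ins for a real tokenizer.

```python
# Sketch of budgeting text tokens against a fixed context window.
# Assumption: the 512-token figure covers prompt plus generated text tokens;
# splitting on whitespace is a crude stand-in for the model's real tokenizer.

CONTEXT_WINDOW = 512

def max_new_tokens(prompt: str, context_window: int = CONTEXT_WINDOW) -> int:
    """Return how many tokens remain for generation after the prompt."""
    prompt_tokens = len(prompt.split())
    if prompt_tokens >= context_window:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens, window is {context_window}"
        )
    return context_window - prompt_tokens

# Illustrative prompt only; a real token count will differ from a word count.
remaining = max_new_tokens("answer en What color is the car?")
```

A real tokenizer usually produces more tokens than a whitespace split, so treat the result as an upper bound on the remaining budget.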

Capabilities

Vision, Multimodal, Reasoning, Function Calling, Tool Use, Structured Outputs, Code Execution

Providers (1)

Provider      Input (per 1M)   Output (per 1M)   Type
NVIDIA NIM    not listed       not listed        Provisioned

Specifications

Family: PaliGemma
Released: 2024-05-14
Parameters: 3B
Context: 512 tokens
Architecture: Decoder Only
Specialization: general
Training: finetuned
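The parameter count gives a rough idea of the memory needed just to hold the weights. The sketch below is a back-of-the-envelope estimate that assumes exactly 3 billion parameters and ignores activations, caches, and framework overhead; the dtype table and function name are illustrative.

```python
# Back-of-the-envelope weight memory for a nominal 3B-parameter model.
# Assumption: exactly 3e9 parameters; activations and caches are ignored.

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}

def weight_gigabytes(num_params: int, dtype: str) -> float:
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

estimates = {
    dtype: weight_gigabytes(3_000_000_000, dtype) for dtype in BYTES_PER_PARAM
}
```

Actual requirements at inference time will be higher than these weights-only figures.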

Created by

Pioneering artificial intelligence research.

London, United Kingdom
Founded 2014
