LLM Reference

PaliGemma Models by Google DeepMind

Google DeepMind · Gemma · Open weights
3 models · 2024 · Up to 512 ctx

Details

Researcher: Google DeepMind
License: Gemma
Commercial use: Commercial use with conditions
Models: 3
Released: 2024
Max context: 512

Capabilities

Vision: 1 of 3 models
Multimodal: 1 of 3 models

About

PaliGemma is a family of open-weight vision-language models (VLMs) developed by Google, designed to be lightweight and efficient relative to other large language models. It is built from open components, the SigLIP vision model and the Gemma language model, and takes both images and text as input to produce text output. This makes the models well suited to tasks such as image captioning, visual question answering, and object detection. They are available at input resolutions from 224x224 to 896x896 and in pre-trained, mix, and fine-tuned versions for research and practical use. Although usable for direct inference, they perform best when fine-tuned for a specific application.
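As a rough illustration of direct inference, the sketch below shows one way an image and a text prompt might be passed to a PaliGemma checkpoint through the Hugging Face `transformers` library. Several things here are assumptions not stated on this page: the checkpoint name `google/paligemma-3b-mix-224`, the task prefixes (`caption en`, `answer en ...`, `detect ...`) commonly documented for the mix checkpoints, and the helper names `build_prompt` and `run_paligemma`, which are hypothetical. Treat it as a starting point, not a reference implementation.

```python
def build_prompt(task: str, text: str = "", lang: str = "en") -> str:
    """Build a task prefix in the style used by the 'mix' checkpoints (assumed format)."""
    if task == "caption":
        return f"caption {lang}"
    if task == "vqa":
        return f"answer {lang} {text}"
    if task == "detect":
        return f"detect {text}"
    raise ValueError(f"unknown task: {task}")


def run_paligemma(image_path: str, prompt: str,
                  model_id: str = "google/paligemma-3b-mix-224",  # assumed checkpoint name
                  max_new_tokens: int = 32) -> str:
    """Run one image + prompt through the model and return the generated text."""
    # Heavy imports stay local so build_prompt works without these packages installed.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    # Drop the prompt tokens so only the newly generated text is decoded.
    return processor.decode(output[0][prompt_len:], skip_special_tokens=True)
```

A call might look like `run_paligemma("photo.jpg", build_prompt("vqa", "What color is the car?"))`. Downloading the weights typically requires accepting the Gemma license terms first.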

Current Variants

Use-when guidance is derived from seed capabilities, context, release, and replacement fields.


PaliGemma 3B 896
Use when the workload needs 512 context, 3B parameters, and multimodal inputs.
2024-05 · 512 context · 3B parameters · multimodal inputs

PaliGemma 3B 448
Use when the workload needs 512 context and 3B parameters.
2024-05 · 512 context · 3B parameters

PaliGemma 3B 224
Use when the workload needs 128 context and 3B parameters.
2024-05 · 128 context · 3B parameters

Release Timeline

1 release group

2024-05 (3 current)
PaliGemma 3B 224: 128 context, 3B parameters (Current)
PaliGemma 3B 448: 512 context, 3B parameters (Current)
PaliGemma 3B 896: 512 context, 3B parameters, multimodal inputs (Current)

Specifications (3 models)

PaliGemma model specifications comparison

| Model | Released | Context | Parameters | Vision | Multimodal |
|---|---|---|---|---|---|
| PaliGemma 3B 896 | 2024-05 | 512 | 3B | Yes | Yes |
| PaliGemma 3B 448 | 2024-05 | 512 | 3B | No | No |
| PaliGemma 3B 224 | 2024-05 | 128 | 3B | No | No |
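The use-when guidance and the specifications listed here amount to a simple filter over context length and the multimodal flag. A minimal sketch of that filter follows, using only the values listed on this page; the `Variant` class and the `matching_variants` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Variant:
    name: str
    released: str
    context: int       # maximum context as listed on this page
    parameters: str
    multimodal: bool   # "Multimodal" flag as listed on this page


# Values copied from the specifications listed on this page.
VARIANTS = [
    Variant("PaliGemma 3B 896", "2024-05", 512, "3B", True),
    Variant("PaliGemma 3B 448", "2024-05", 512, "3B", False),
    Variant("PaliGemma 3B 224", "2024-05", 128, "3B", False),
]


def matching_variants(min_context: int, need_multimodal: bool = False) -> List[str]:
    """Return the names of variants that satisfy the requested constraints."""
    return [
        v.name
        for v in VARIANTS
        if v.context >= min_context and (v.multimodal or not need_multimodal)
    ]


if __name__ == "__main__":
    print(matching_variants(512, need_multimodal=True))
```

Returning every match instead of a single "best" model leaves the final choice, for example by price or image resolution, to the caller.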

Available From (1 provider)

Frequently Asked Questions

What is PaliGemma used for?
PaliGemma is used for vision and multimodal work, such as image captioning, visual question answering, and object detection. Both the family description and the listed model capabilities point to these workloads as the best fit.
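For the object detection use mentioned in the About section, the model's answer is plain text that has to be turned into boxes. The sketch below assumes the output convention commonly documented for PaliGemma detection, which this page does not state: four `<locNNNN>` tokens per object, in the order y_min, x_min, y_max, x_max, each a value from 0 to 1023 normalized by 1024, followed by a label, with objects separated by `;`. The function name `parse_detections` is hypothetical.

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels

# Four location tokens followed by a label; the token order is assumed to be
# y_min, x_min, y_max, x_max.
LOC_BOX = re.compile(r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^<;]+)")


def parse_detections(text: str, width: int, height: int) -> List[Tuple[str, Box]]:
    """Convert detection text into (label, pixel box) pairs for an image of the given size."""
    results = []
    for match in LOC_BOX.finditer(text):
        # Each token is assumed to be a coordinate normalized to a 1024-step grid.
        y0, x0, y1, x1 = (int(g) / 1024 for g in match.groups()[:4])
        label = match.group(5).strip()
        results.append((label, (x0 * width, y0 * height, x1 * width, y1 * height)))
    return results


if __name__ == "__main__":
    sample = "<loc0256><loc0128><loc0768><loc0512> cat ; <loc0000><loc0000><loc0512><loc0512> dog"
    for label, box in parse_detections(sample, width=1024, height=512):
        print(label, box)
```

The sample string is made up for illustration and is not real model output.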
How does PaliGemma compare to Gemma 4?
PaliGemma is strongest for vision and multimodal work, while Gemma 4, also from Google DeepMind, is the closest related family to evaluate for multimodal workloads. PaliGemma lists 3 variants with up to 512 context, whereas Gemma 4 reaches up to 256k context, so compare the specifications and pricing tables before choosing a production model.
Which PaliGemma model should I use?
PaliGemma does not have complete provider pricing in the local data, so if price is the main constraint, check the pricing table first. For the most capable local choice, evaluate PaliGemma 3B 896, which offers 512 context and multimodal inputs.