Using PaliGemma 3B 896 on NVIDIA NIM

Implementation guide · PaliGemma · Google DeepMind

ProvisionedOpen Weights

Quick Start

1
Create an account at NVIDIA NIM and generate an API key.
2
Use the NVIDIA NIM SDK or REST API to call paligemma-3b-896 — see the documentation for request format.
3
You'll be billed . See full pricing.

API Portal Documentation Pricing Model Card

Code Examples

See NVIDIA NIM documentation for integration details.

About NVIDIA NIM

NIM packages inference runtimes and model profiles into containers that expose standard API surfaces such as chat completions, completions, model listing, tokenization, health, and management endpoints. The hosted API path is useful for prototyping and catalog discovery, while the NGC/container path is the self-hosted route for teams that want GPU-hour infrastructure control, private-network deployment, Kubernetes scaling, or NVIDIA AI Enterprise support. Per-token pricing is not a universal provider-level claim in the current seed data; pricing should stay attached to sourced model-provider rows or NVIDIA's current catalog terms.

NVIDIA NIM is NVIDIA's deployment platform for GPU-accelerated inference microservices. Developers can try hosted NIM APIs through the NVIDIA API Catalog on build.nvidia.com, then move the same model families into self-hosted NIM containers on NVIDIA GPUs in a data center, private cloud, public cloud, or workstation. The catalog positions NIM around optimized open and NVIDIA models, including chat, coding, reasoning, retrieval, vision, speech, and safety use cases, with downloadable model cards and API endpoints where NVIDIA exposes them.

View all models on NVIDIA NIM →

Pricing on NVIDIA NIM

Capabilities

VisionMultimodal

About PaliGemma 3B 896

PaliGemma 3B 896 is a versatile and lightweight vision-language model developed by Google, designed to process and integrate both images and text. Inspired by the PaLI-3 model, it employs components like the SigLIP vision model and the Gemma-2B language model, featuring a linear projection layer for seamless integration of visual and textual inputs. Capable of handling tasks such as image captioning, visual question answering, object detection, and segmentation, it supports multilingual text processing. Despite requiring task-specific fine-tuning for optimal performance, PaliGemma highlights strong capabilities across various vision-language applications, although it may encounter challenges with contextual understanding, biases, and computational demands 124.

Full model details →

Model Specs

Released2024-05-14

Parameters3B

Context512

ArchitectureDecoder Only

Provider

NVIDIA NIM

NVIDIA

Santa Clara, California, United States