LLM Reference

Llama 3.2 11B Vision Instruct

Released
2024-09-25
Last refreshed
2026-06-01
Status
Researched 3d ago
Open SourceMultimodalRAGLong contextVisionJSON / Tool use

Llama 3.2 11B Vision Instruct is worth evaluating for rag, long context, and vision when its provider route and context window match the workload.

Use it for

  • Teams evaluating rag, long context, and vision
  • Workloads that can use a 128k context window
  • Buyers comparing 4 tracked provider routes

Do not use it for

  • Workloads where another current model has stronger sourced task evidence
Specifications
Family
Llama 3.2
Released
2024-09-25
Context
128k
Parameters
10.6B
Architecture
Decoder Only
Knowledge cutoff
2024-03
Specialization
general
Training
finetuned
Created by

Large-scale open-source AI for social technologies.

Menlo Park, California, United States
Founded 2013
Website
Pricing
Output / 1M
$0.160
Input / 1M
$0.160

Cheapest of 8 routes · Vercel AI Gateway

About

Instruction-tuned 11B Llama 3.2 Vision model for image reasoning, visual question answering, document understanding, and captioning. NVIDIA NIM lists text plus image input, text output, and a 128K context window for the Llama 3.2 Vision collection.

Llama 3.2 11B Vision Instruct is Meta's entry-level multimodal model, released in September 2024 as part of the Llama 3.2 family. With 11 billion parameters, it was among the first openly available Meta models to accept image inputs alongside text, supporting a 128,000-token combined context window for text and image content. The model produces text-only output. NVIDIA NIM documents it as accepting text plus image input with text output within the Llama 3.2 Vision collection's shared context limit.

The instruction-tuned variant is fine-tuned for visual question answering, image captioning, document understanding, and figure interpretation in both single-turn and multi-turn conversational settings. It uses the same Llama 3 tokenizer and base architecture as the text-only Llama 3.2 models, extended with a vision encoder that projects image patches into the language model's embedding space.

Llama 3.2 11B Vision Instruct is available as open weights under Meta's Llama Community License and hosted on OpenRouter, Fireworks AI, NVIDIA NIM, AWS Bedrock, Azure AI Foundry, and Bitdeer. Teams needing stronger visual reasoning at the cost of higher compute should evaluate the Llama 3.2 90B Vision Instruct variant, which shares the same architecture and context window but has substantially more parameters.

Llama 3.2 11B Vision Instruct has a 128k-token context window.

Llama 3.2 11B Vision Instruct input tokens at $0.049/1M, output at $0.676/1M.

Top use-case fit

RAG

Included by capability and metadata signals in the decision map.

Long context

Included by capability and metadata signals in the decision map.

Vision

Included by capability and metadata signals in the decision map.

Provider price ladder

Compare all 8

Compare API pricing across 4 providers for input and output tokens, batch, and cached reads when available.

ProviderInput / 1MOutput / 1MRoute
Vercel AI Gateway$0.160$0.160
Serverless
Fireworks AI$0.200$0.200
Serverless
OpenRouter$0.245$0.245
Serverless
AWS Bedrock$0.200$0.270
Serverless

Capabilities

VisionMultimodalStructured Outputs

Benchmark peer barsfor RAG

No task-mapped benchmark peers are available for this model yet.

Migration checks

No linked migration route is available for this model yet.

Rankings & picks(9)