Cosmos 3 Models by NVIDIA AI
About
NVIDIA Cosmos 3 is the world's first fully open omnimodel for physical AI, built on a Mixture-of-Transformers (MoT) architecture that combines a vision-language autoregressive Reasoner tower with a diffusion-based Generator tower. The family natively understands and generates text, images, video, ambient sound, and robot action sequences with physics-grounded accuracy. It is designed for robotics, autonomous vehicles, and smart infrastructure, and it supports synthetic data generation, robot policy training, world simulation, and VLM reasoning. Announced at Computex 2026 on 2026-06-01; model weights released 2026-05-31. GitHub: https://github.com/nvidia/cosmos.
Current Variants
Use-when guidance is derived from seed capabilities, context, release, and replacement fields.
| Model | Use when | Released | Signals | Status |
|---|---|---|---|---|
| Cosmos 3 Nano | Use when the workload needs multimodal support, 256k context, and 16B parameters. | 2026-05 | multimodal, 256k context, 16B parameters | Current |
| Cosmos 3 Super | Use when the workload needs multimodal support, 256k context, and 64B parameters. | 2026-05 | multimodal, 256k context, 64B parameters | Current |
| Cosmos 3 Super Text2Image | Use when the workload needs image generation, 4k context, and 64B parameters. | 2026-05 | image generation, 4k context, 64B parameters | Current |
| Cosmos 3 Super Image2Video | Use when the workload needs video generation, 4k context, and 64B parameters. | 2026-05 | video generation, 4k context, 64B parameters | Current |
| Cosmos 3 Nano Policy DROID | Use when the workload needs robotics, 4k context, and 16B parameters. | 2026-05 | robotics, 4k context, 16B parameters | Current |
| Cosmos 3 Edge | Use when the workload needs multimodal inputs and audio. | Unknown | multimodal, multimodal inputs, audio | Current |
Release Timeline
2 release groups
Specifications (6 models)
| Model | Released | Context | Parameters | Vision | Multimodal | Reasoning | Structured Outputs |
|---|---|---|---|---|---|---|---|
| Cosmos 3 Nano | 2026-05 | 256k | 16B | Yes | Yes | Yes | No |
| Cosmos 3 Super | 2026-05 | 256k | 64B | Yes | Yes | Yes | No |
| Cosmos 3 Super Text2Image | 2026-05 | 4k | 64B | No | Yes | No | No |
| Cosmos 3 Super Image2Video | 2026-05 | 4k | 64B | Yes | Yes | No | No |
| Cosmos 3 Nano Policy DROID | 2026-05 | 4k | 16B | Yes | Yes | No | Yes |
| Cosmos 3 Edge | — | — | — | Yes | Yes | No | No |
Available From (1 provider)
Frequently Asked Questions
- What is Cosmos 3 used for?
- Cosmos 3 is used for multimodal workloads and image generation. The family description and listed model capabilities point to those workloads as the best fit.
- How does Cosmos 3 compare to NVIDIA Nemotron Nano 12B v2 VL?
- Cosmos 3 by NVIDIA AI is strongest where you need multimodal capabilities, while NVIDIA Nemotron Nano 12B v2 VL, also by NVIDIA AI, is the closest related family to check for structured outputs. Cosmos 3 has 6 listed variants and reaches up to 256k context, so compare the specs and pricing tables before choosing a production model.
- Which Cosmos 3 model should I use?
- If price is the main constraint, check the pricing table first, because Cosmos 3 does not have complete provider pricing in the local data. For the most capable or latest local choice, evaluate Cosmos 3 Nano, which offers 256k context, reasoning, and multimodal inputs.
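
The selection advice above can be sketched as a small filter over the Specifications table. This is a minimal illustration, not an NVIDIA tool: the `Variant` class, the `VARIANTS` list, and the `shortlist` function are hypothetical names, and the values are copied from the table on this page (entries shown as "—" are stored as `None`).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Variant:
    """One row of the Specifications table (hypothetical helper type)."""
    name: str
    context_k: Optional[int]  # context window in thousands of tokens; None if unlisted
    params_b: Optional[int]   # parameter count in billions; None if unlisted
    vision: bool
    reasoning: bool
    structured_outputs: bool


# Values transcribed from the Specifications table on this page.
VARIANTS = [
    Variant("Cosmos 3 Nano", 256, 16, True, True, False),
    Variant("Cosmos 3 Super", 256, 64, True, True, False),
    Variant("Cosmos 3 Super Text2Image", 4, 64, False, False, False),
    Variant("Cosmos 3 Super Image2Video", 4, 64, True, False, False),
    Variant("Cosmos 3 Nano Policy DROID", 4, 16, True, False, True),
    Variant("Cosmos 3 Edge", None, None, True, False, False),
]


def shortlist(
    min_context_k: int = 0,
    reasoning: bool = False,
    structured_outputs: bool = False,
    max_params_b: Optional[int] = None,
) -> List[str]:
    """Return the names of variants that satisfy every requested constraint.

    Variants with an unlisted context or parameter count are excluded
    whenever a constraint on that field is requested.
    """
    picks = []
    for v in VARIANTS:
        if min_context_k and (v.context_k is None or v.context_k < min_context_k):
            continue
        if reasoning and not v.reasoning:
            continue
        if structured_outputs and not v.structured_outputs:
            continue
        if max_params_b is not None and (v.params_b is None or v.params_b > max_params_b):
            continue
        picks.append(v.name)
    return picks


if __name__ == "__main__":
    # Long-context reasoning workloads
    print(shortlist(min_context_k=128, reasoning=True))
    # Workloads that need structured outputs
    print(shortlist(structured_outputs=True))
```

Pricing is left out of the filter on purpose, since the page notes that provider pricing for Cosmos 3 is incomplete.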
Models (6)
Cosmos 3 Nano
Cosmos 3 Super
Cosmos 3 Super Text2Image
Cosmos 3 Super Image2Video
Cosmos 3 Nano Policy DROID
Cosmos 3 Edge





