Cosmos 3 Nano is NVIDIA's 16B-parameter omnimodel, optimized for efficient inference on workstation-grade hardware (NVIDIA RTX PRO 6000). Architecture: a dual-tower Mixture-of-Transformers that pairs an 8B autoregressive Reasoner with an 8B diffusion-based Generator. The Reasoner supports up to 256K tokens of context for vision-language reasoning; the Generator produces video at up to 720p with variable frame rates (default clip length: 189 frames). The model natively handles text, image, video, audio (48 kHz stereo), and robot action trajectories across more than 10 robot embodiments, including Franka Panda, UR, Google robot, and UMI. Weights are BF16 only. Available as open weights on Hugging Face and via the Cosmos 3 Reasoner NIM (NIM_MODEL_SIZE=nano). Intended for real-time robotics inference and edge-adjacent deployment. Robot action input/output is noted in this description because the model schema has no dedicated action-modality field.
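The sizing figures in the description can be turned into rough back-of-the-envelope numbers. The sketch below is illustrative only: it assumes 2 bytes per BF16 parameter and counts weights alone (no activations, KV cache, or runtime overhead), and the 24 and 30 fps values are hypothetical examples, since the description states only that frame rates are variable. The function names are made up for this sketch.

```python
# Rough sizing sketch based on the figures in the description.
# Assumptions: BF16 = 2 bytes per parameter; weights only.

BYTES_PER_BF16 = 2


def weight_memory_gb(params_billion: float, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Approximate weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9


def clip_seconds(frames: int, fps: float) -> float:
    """Duration in seconds of a clip with the given frame count and frame rate."""
    return frames / fps


# Parameter counts per tower, in billions, as stated in the description.
towers = {"reasoner": 8.0, "generator": 8.0}
per_tower_gb = {name: weight_memory_gb(p) for name, p in towers.items()}
total_gb = sum(per_tower_gb.values())

if __name__ == "__main__":
    for name, gb in per_tower_gb.items():
        print(f"{name}: ~{gb:.0f} GB of BF16 weights")
    print(f"total: ~{total_gb:.0f} GB of BF16 weights")
    # Hypothetical frame rates for the default 189-frame clip.
    for fps in (24, 30):
        print(f"189 frames at {fps} fps: {clip_seconds(189, fps):.3f} s")
```

Actual memory use during inference will be higher than the weights-only figure, so treat the total as a lower bound when checking it against a given GPU.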
2026-05-31
Researched 28d ago
256k
256,000 tokens