Kosmos 2 on NVIDIA NIM

Kosmos-2 · Microsoft Research

Provisioned

Pricing

Type            Price (per 1M tokens)
Input tokens    Free
Output tokens   Free

Capabilities

Vision, Multimodal, Reasoning, Function Calling, Tool Use, JSON Mode, Code Execution

About Kosmos 2

Kosmos-2, developed by Microsoft Research, is a multimodal large language model (MLLM) that extends its predecessor, Kosmos-1. It uses a Transformer-based architecture trained on GrIT, a dataset of grounded image-text pairs, so it can process both text and images. Its key addition is grounding: the model links spans of text to specific regions of an image by emitting location tokens alongside the text. Kosmos-2 handles tasks such as image captioning, referring expression comprehension, and other perception-language tasks, which makes it useful for applications such as robotics and multimodal dialogue systems. It is described as a step toward AI systems that are more contextually aware and closer to artificial general intelligence (AGI) [1][2].
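
To illustrate how location tokens can tie a phrase to an image region, here is a minimal sketch that parses grounded output text into normalized bounding boxes. It rests on assumptions not stated on this page: that the model marks phrases with `<phrase>...</phrase>`, follows each with an `<object>` span holding pairs of `<patch_index_NNNN>` tokens (top-left bin, then bottom-right bin), and that the bins come from a 32 x 32 grid over the image. The helper names `decode_box` and `parse_grounded` are hypothetical, and the sample string is made up for the example.

```python
import re

GRID = 32  # assumption: the image is split into a 32 x 32 grid of location bins

# Assumed markup: <phrase>text</phrase><object><patch_index_0044>...</object>
ENTITY = re.compile(r"<phrase>(.*?)</phrase><object>(.*?)</object>")
PATCH = re.compile(r"<patch_index_(\d+)>")


def decode_box(top_left, bottom_right, grid=GRID):
    """Map two bin indices to a normalized (x0, y0, x1, y1) box.

    Bins are numbered row by row. The box spans from the top-left corner
    of the first bin to the bottom-right corner of the second bin.
    """
    row0, col0 = divmod(top_left, grid)
    row1, col1 = divmod(bottom_right, grid)
    return (col0 / grid, row0 / grid, (col1 + 1) / grid, (row1 + 1) / grid)


def parse_grounded(text):
    """Return a list of (phrase, [boxes]) found in grounded output text."""
    results = []
    for phrase, body in ENTITY.findall(text):
        ids = [int(n) for n in PATCH.findall(body)]
        # Consume indices two at a time; an unpaired trailing index is ignored.
        boxes = [decode_box(ids[i], ids[i + 1]) for i in range(0, len(ids) - 1, 2)]
        results.append((phrase.strip(), boxes))
    return results


sample = (
    "<grounding>An image of<phrase> a snowman</phrase>"
    "<object><patch_index_0044><patch_index_0863></object> by a fire"
)
entities = parse_grounded(sample)
print(entities)
```

Multiplying the normalized coordinates by the image width and height gives pixel boxes. Using the outer corners of the two bins is one reasonable choice for this sketch; an actual implementation may use bin centers instead.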

Model Specs

Released        2023-03-15
Parameters      1.66B
Architecture    Decoder Only