LLM Reference

Kosmos-2.5

About

Kosmos-2.5 is a multimodal literate model from Microsoft, built for reading and interpreting text-intensive images such as scanned documents and academic papers. It handles both structured and unstructured text within images and produces two kinds of output: spatially-aware text blocks (recognized text together with its position in the image) and structured markdown. The model uses a unified, Transformer-based multimodal architecture to process visual and textual data together. It can be adapted through fine-tuning, and on some benchmark tasks its results are comparable to those of larger models and GPT-4, which makes it suitable for document understanding and image-to-text conversion. Like other AI models, it can still make errors or hallucinate content.
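To make "spatially-aware text blocks" concrete, here is a minimal Python sketch of how such output might be parsed into structured records. The tag layout `<bbox><x_..><y_..><x_..><y_..></bbox>text` is an assumption about the raw output format rather than something stated on this page, and the names `TextBlock` and `parse_text_blocks` are hypothetical helpers introduced only for illustration.

```python
import re
from typing import List, NamedTuple


class TextBlock(NamedTuple):
    """One recognized text span with its bounding box (hypothetical type)."""
    x0: int
    y0: int
    x1: int
    y1: int
    text: str


# Assumed tag layout: <bbox><x_L><y_T><x_R><y_B></bbox> followed by the text.
_BBOX = re.compile(r"<bbox><x_(\d+)><y_(\d+)><x_(\d+)><y_(\d+)></bbox>")


def parse_text_blocks(raw: str) -> List[TextBlock]:
    """Split raw model output into (box, text) records."""
    matches = list(_BBOX.finditer(raw))
    blocks = []
    for i, m in enumerate(matches):
        # The text of a block runs until the next bbox tag (or end of string).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(raw)
        x0, y0, x1, y1 = (int(g) for g in m.groups())
        blocks.append(TextBlock(x0, y0, x1, y1, raw[m.end():end].strip()))
    return blocks


# Made-up sample string, not real model output.
sample = (
    "<bbox><x_10><y_20><x_200><y_40></bbox>Invoice 001\n"
    "<bbox><x_10><y_50><x_120><y_70></bbox>Total: 42"
)
blocks = parse_text_blocks(sample)
for b in blocks:
    print(b)
```

Whether the coordinates are in pixels of the original image or of a resized input depends on the preprocessing in use, so any rescaling step would have to be added by the caller.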

Capabilities

Multimodal, Function Calling, Tool Use, JSON Mode

Specifications

Parameters: 1.37B
Architecture: Decoder Only
Specialization: General