
Chameleon
About
The Chameleon family of large language models (LLMs), developed by Meta AI, marks a significant advance in multimodal AI. The models use an early-fusion architecture that integrates images and text from the start: rather than processing each modality through a separate encoder and combining the results later, Chameleon represents both as tokens in a single sequence, allowing it to process and generate interleaved image-text content in any order. This design supports tasks such as image captioning, visual question answering, and mixed-modal generation.

A key innovation is the vector-quantization approach used to tokenize images, which converts image content into discrete tokens that a transformer can handle alongside text tokens. Chameleon is available in several sizes, including publicly released 7B and 34B parameter versions, and has outperformed leading models such as Google's Gemini Pro and OpenAI's GPT-4V on specific tasks.

Separately, a distinct system that shares the Chameleon name, developed by researchers at UCLA and Microsoft, focuses on compositional reasoning with integrated tool use and has achieved state-of-the-art results on benchmarks such as ScienceQA.
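The vector-quantization idea behind the image tokenizer can be sketched as follows. This is an illustrative toy, not Chameleon's actual tokenizer: the codebook size, embedding dimension, vocabulary offset, and the `quantize_patches` helper are all assumptions made for the example. The point is that each image patch embedding is snapped to its nearest codebook entry, and the resulting discrete indices can be interleaved with text token ids in one transformer input sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not Chameleon's real values.
CODEBOOK_SIZE = 8192      # number of discrete image tokens
EMBED_DIM = 32            # patch-embedding dimension
TEXT_VOCAB_SIZE = 50000   # image-token ids are offset past the text vocabulary

# A learned codebook of vectors; here just random for the sketch.
codebook = rng.normal(size=(CODEBOOK_SIZE, EMBED_DIM))

def quantize_patches(patch_embeddings):
    """Map each patch embedding to the index of its nearest codebook vector (L2)."""
    # distances has shape (num_patches, CODEBOOK_SIZE)
    distances = np.linalg.norm(
        patch_embeddings[:, None, :] - codebook[None, :, :], axis=-1
    )
    return distances.argmin(axis=1)

# 16 patch embeddings standing in for one encoded image.
patches = rng.normal(size=(16, EMBED_DIM))

# Quantize, then shift the indices into the image-token id range.
image_tokens = quantize_patches(patches) + TEXT_VOCAB_SIZE

# Early fusion: text ids and image ids live in one flat sequence.
text_tokens = np.array([17, 402, 9981])  # toy ids for a text prefix
sequence = np.concatenate([text_tokens, image_tokens])

print(sequence.shape)  # one interleaved sequence of 3 + 16 tokens
```

Because images become ordinary token ids in a shared sequence, the same autoregressive transformer can attend across, and generate, both modalities in any order, which is what distinguishes early fusion from late-fusion designs with separate per-modality encoders.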