AI Glossary
multimodal
Definition
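Multimodal describes models, typically LLMs, that have been extended to process, and in some cases generate, more than one data modality, such as text paired with images, audio, or video. Common designs attach modality-specific encoders to the core transformer, either by projecting the encoder outputs into the same token sequence as the text (unified tokenization) or by letting the text tokens attend to them through cross-attention layers. This enables tasks such as visual question answering and image captioning.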
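The cross-attention idea can be illustrated with a minimal sketch in plain Python. It is a toy under stated assumptions, not how any particular model is implemented: the 4x4 "image", the hand-written text embeddings, and the helper names `patchify` and `cross_attention` are all hypothetical, and the learned query/key/value projections used in real transformers are replaced by the identity for brevity.

```python
import math

def patchify(image, patch):
    """Split a square 2-D grid into flattened patch x patch vectors (row-major)."""
    n = len(image)
    patches = []
    for r in range(0, n, patch):
        for c in range(0, n, patch):
            patches.append(
                [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            )
    return patches

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys.

    Assumption: identity projections (no learned W_q, W_k, W_v).
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[i] for wj, v in zip(w, values)) for i in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy 4x4 "image" standing in for the output of an image encoder.
image = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
]
image_tokens = patchify(image, 2)  # four patch vectors of dimension 4

# Hypothetical text-token embeddings with the same dimension as the patches.
text_tokens = [
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
    [-1.0, 0.5, 0.0, 0.25],
]

# Text tokens supply the queries; image patches supply the keys and values.
fused, attn = cross_attention(text_tokens, image_tokens, image_tokens)
```

Each row of `attn` is a probability distribution over the image patches for one text token, and each row of `fused` is the corresponding weighted mix of patch vectors. In a unified-tokenization design, the patch vectors would instead be appended to the text sequence and processed by ordinary self-attention.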