LLM Reference
AI Glossary

multimodal

Definition

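Multimodal describes LLMs extended to process, and in some systems generate, more than one data modality, such as text combined with images, audio, or video. A common design attaches a modality-specific encoder (for example, a vision encoder) to the core transformer, either by projecting the encoder's outputs into the same token sequence as the text or by letting the text layers attend to them through cross-attention. This enables tasks such as visual question answering and image captioning.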
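As a rough illustration of how an image can share one token sequence with text, the sketch below splits a tiny grayscale image into patches, maps each patch to an embedding vector with a linear projection, and prepends those vectors to the text token embeddings. It is a toy under stated assumptions, not any particular model's implementation: the names patchify, project, and build_sequence are hypothetical, the weights are random, and a real system would use a trained vision encoder in place of the single linear map.

```python
import random

def patchify(image, patch):
    """Split an H x W grid of pixel values into flattened, non-overlapping patches."""
    rows, cols = len(image) // patch, len(image[0]) // patch
    return [
        [image[r * patch + i][c * patch + j]
         for i in range(patch) for j in range(patch)]
        for r in range(rows) for c in range(cols)
    ]

def project(vec, weights):
    """Linear map: vec (length n) times weights (n x d_model) -> length d_model."""
    d_model = len(weights[0])
    return [sum(v * row[k] for v, row in zip(vec, weights)) for k in range(d_model)]

def build_sequence(image, token_ids, text_embed, image_proj, patch):
    """Return image-patch embeddings followed by text-token embeddings."""
    image_tokens = [project(p, image_proj) for p in patchify(image, patch)]
    text_tokens = [text_embed[t] for t in token_ids]
    return image_tokens + text_tokens  # one sequence for the transformer

# Toy sizes, chosen only for illustration (assumptions, not real model values).
random.seed(0)
d_model, vocab_size, patch = 3, 10, 2
image = [[random.random() for _ in range(4)] for _ in range(4)]      # 4 x 4 grayscale
text_embed = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab_size)]
image_proj = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(patch * patch)]
token_ids = [7, 1]                                                   # hypothetical token ids

seq = build_sequence(image, token_ids, text_embed, image_proj, patch)
print(len(seq), len(seq[0]))  # 6 3  (4 image patches + 2 text tokens, each of width 3)
```

Once both modalities live in one sequence of equal-width vectors, the transformer can treat them uniformly. The cross-attention alternative keeps the image features in a separate sequence that the text layers query instead.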