AI Glossary

Tokenization

Definition

Tokenization is the process of breaking input text into smaller units called tokens (e.g., words, subwords, or characters), each of which is mapped to an integer ID the model can process numerically. It is the first step in preparing data for LLMs, and the tokenization scheme determines both the vocabulary size and the sequence length a given text produces.
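A minimal sketch of the idea, using simple word-level tokenization with a toy vocabulary (the function and vocabulary here are illustrative, not from any specific library; production LLMs typically use subword schemes such as BPE):

```python
def tokenize(text, vocab):
    """Split text into word-level tokens and map each to an integer ID.

    Unknown words fall back to the <unk> ID. Real tokenizers instead
    split rare words into known subword pieces.
    """
    tokens = text.lower().split()
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

# Toy vocabulary: token string -> integer ID (assumed for this example).
vocab = {"<unk>": 0, "the": 1, "model": 2, "reads": 3, "tokens": 4}

ids = tokenize("The model reads tokens", vocab)
print(ids)  # [1, 2, 3, 4]
```

Note how vocabulary size and sequence length trade off: a character-level vocabulary is tiny but produces long sequences, while a large word-level vocabulary keeps sequences short at the cost of many parameters in the embedding table.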

Models Using Tokenization