
Tokenization

Definition

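Tokenization is the process of splitting input text into smaller units called tokens (e.g., words, subwords, or characters), each of which is mapped to an integer ID that the model can process. It is the first step in preparing text for an LLM, and the choice of token granularity trades vocabulary size against sequence length: finer units such as characters keep the vocabulary small but produce longer sequences, while coarser units such as whole words shorten sequences but enlarge the vocabulary.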
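To make the definition concrete, here is a minimal Python sketch that tokenizes one sentence at word level and at character level, then maps each token to an integer ID. The helper names (`word_tokenize`, `char_tokenize`, `build_vocab`, `encode`) are hypothetical and written for illustration only. Production LLM tokenizers typically use learned subword schemes rather than the plain splitting assumed here.

```python
def word_tokenize(text):
    # Illustrative assumption: split on whitespace only.
    return text.split()


def char_tokenize(text):
    # Every character, including spaces, becomes its own token.
    return list(text)


def build_vocab(tokens):
    # Assign IDs in order of first appearance.
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    # Replace each token with its integer ID.
    return [vocab[tok] for tok in tokens]


text = "the cat sat on the mat"

word_tokens = word_tokenize(text)
char_tokens = char_tokenize(text)

word_vocab = build_vocab(word_tokens)
char_vocab = build_vocab(char_tokens)

word_ids = encode(word_tokens, word_vocab)
char_ids = encode(char_tokens, char_vocab)

print(word_ids)        # [0, 1, 2, 3, 0, 4]
print(len(word_ids))   # 6 tokens at word level
print(len(char_ids))   # 22 tokens at character level
```

The same sentence becomes 6 tokens at word level and 22 at character level, which shows the sequence-length side of the trade-off. The vocabulary side does not show up in a single sentence; it emerges only over a large corpus, where the set of distinct characters stays bounded while the set of distinct words keeps growing.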
