Concepts & capability filters
preprocessing
Tokenization
See matching models with benchmark scores and pricing.
Definition
Tokenization is the process of breaking down input text into smaller units called tokens (e.g., words, subwords, or characters) that the model can process numerically. It is the first step in preparing data for LLMs, affecting vocabulary size and sequence length.