preprocessing

Tokenization

See matching models with benchmark scores and pricing.

Definition

Tokenization is the process of breaking down input text into smaller units called tokens (e.g., words, subwords, or characters) that the model can process numerically. It is the first step in preparing data for LLMs, affecting vocabulary size and sequence length.

Models Mentioning Tokenization(2)

Toppy M 7B2023-12 Cerebras GPT 13B2023-03