LLM Reference

direct preference optimization

DPO


Definition

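Direct preference optimization (DPO) is a lightweight alignment technique that fine-tunes an LLM directly on pairwise preference data (a preferred and a rejected response to the same prompt), without training a separate reward model or running a reinforcement learning loop. It trains the policy with a logistic loss on the difference between two log-probability ratios: the policy-to-reference ratio for the chosen response and the same ratio for the rejected response, scaled by a temperature β. Minimizing this loss raises the likelihood of chosen responses relative to rejected ones, while the frozen reference model keeps the policy from drifting too far from its starting point.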
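
As an illustration of the objective described above, here is a minimal sketch of the per-example DPO loss in plain Python. The function and parameter names (dpo_loss, policy_chosen_logp, and so on) are hypothetical and chosen for this example. The inputs are assumed to be summed token log-probabilities of each full response under the trainable policy and the frozen reference model, and beta=0.1 is only an assumed default, not a recommendation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; controls how strongly the policy is tied to the reference
) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The margin is positive when the policy prefers the chosen response
    # more strongly than the reference model does.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, so the loss equals log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy shifted toward the chosen response: the loss drops below log(2).
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

In practice the loss would be averaged over a batch and computed with an autodiff framework so that gradients flow only through the policy's log-probabilities; the reference model's values are treated as constants.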

Models mentioning direct preference optimization (8)