LLM Reference
AI Glossary

Direct Preference Optimization

DPO

Definition

DPO is a lightweight alignment technique that fine-tunes LLMs directly on pairwise preference data (a preferred and a rejected response to the same prompt) without training a separate reward model or running reinforcement learning. It optimizes the policy with a simple classification-style loss: for each pair, it maximizes the log-sigmoid of the difference between the chosen and rejected responses' log-probability ratios, each measured against a frozen reference model and scaled by a temperature β.
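The loss described above can be sketched per example from sequence log-probabilities. This is a minimal illustration, not a production implementation; the function name, argument names, and the default β are illustrative assumptions:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss (illustrative sketch).

    loss = -log sigmoid(beta * ((log pi(y_c|x) - log ref(y_c|x))
                                - (log pi(y_r|x) - log ref(y_r|x))))
    where y_c is the chosen response and y_r the rejected one.
    """
    # Log-probability ratios of policy vs. frozen reference model
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Margin by which the policy prefers the chosen response, scaled by beta
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log sigmoid(m) written as softplus(-m) for numerical stability
    return math.log1p(math.exp(-margin))
```

When the policy and reference agree (margin 0) the loss is log 2; as the policy favors the chosen response more strongly than the reference does, the loss decreases toward 0.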