Direct Preference Optimization

DPO


Definition

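DPO is a lightweight alignment technique that fine-tunes large language models (LLMs) directly on pairwise preference data (a preferred and a rejected response to the same prompt), without training a separate reward model or running a reinforcement learning loop. It trains the policy to raise the probability of the chosen response relative to the rejected one, with both probabilities measured as log-ratios against a frozen reference model so that the policy does not drift far from its starting point.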
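The objective described above can be sketched for a single preference pair. This is a minimal illustration, not a training implementation: it assumes the summed sequence log-probabilities of each response under the policy and the reference model are already computed, and the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices rather than anything specified here.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities (hypothetical helper).

    beta scales how strongly the policy is pushed away from the reference;
    0.1 is an assumed default for illustration only.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin: positive when the policy prefers the chosen response
    # more strongly than the reference model does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


# Policy identical to the reference: margin is 0, loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy shifted toward the chosen response: loss drops below the baseline.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

In practice the loss would be averaged over a batch and minimized by gradient descent on the policy parameters, with the reference model held fixed.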

Models Mentioning Direct Preference Optimization (8)

Llama 3 TenyxChat 70B (2024-08)
Phi-3 Medium 4K (2024-05)
OLMo 1B (2024-02)
Dolphin 2.6 Mixtral 8x7B (2023-12)
Snorkel Mistral PairRM (2023-11)
Zephyr 7B Alpha (2023-10)
Zephyr 7B Beta (2023-10)
Zephyr 7B Gemma (2023-10)