AI Glossary
reinforcement learning from human feedback
RLHF
Definition
RLHF aligns LLMs with human preferences through a multi-stage process: supervised fine-tuning of a base model, training a reward model on human-ranked response pairs, then using reinforcement learning to optimize the policy model against that reward. The RL stage typically employs PPO, maximizing expected reward while penalizing KL divergence from a frozen reference model so the policy does not drift too far from its starting behavior.
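The two learned objectives above can be sketched in a few lines. This is a minimal illustration, not a full training loop: `pairwise_loss` is the standard Bradley-Terry-style loss used to fit a reward model on a chosen/rejected pair, and `kl_penalized_reward` shows the per-token shaped reward (reward-model score minus a KL penalty toward the reference model) that the RL stage optimizes. Function names and the `beta` value are illustrative assumptions.

```python
import math

def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Reward-model loss on one preference pair (sketch).

    Minimizing -log sigmoid(chosen - rejected) pushes the reward
    model to score the human-preferred response higher.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def kl_penalized_reward(rm_score: float,
                        logprob_policy: list[float],
                        logprob_ref: list[float],
                        beta: float = 0.1) -> list[float]:
    """Per-token shaped reward for the RL stage (sketch).

    rm_score: scalar reward-model score for the whole response,
              credited at the final token.
    logprob_policy / logprob_ref: per-token log-probs of the sampled
              response under the policy and the frozen reference model.
    beta: KL penalty coefficient (illustrative value).
    """
    # Per-token KL estimate: log pi(a|s) - log pi_ref(a|s)
    shaped = [-beta * (lp - lr)
              for lp, lr in zip(logprob_policy, logprob_ref)]
    shaped[-1] += rm_score  # sequence-level reward at the last token
    return shaped
```

For example, if the policy assigns a token a higher log-probability than the reference does, that token's shaped reward is negative, discouraging further drift; the reward-model score then arrives only at the end of the sequence.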