proximal policy optimization

PPO


Definition

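PPO is an on-policy reinforcement learning algorithm widely used in RLHF to update the LLM policy model. It maximizes a clipped surrogate objective: the probability ratio between the new and old policy is clipped to a small interval around 1, which approximates a trust-region constraint and keeps each update stable. In RLHF, the reward is typically combined with a KL-divergence penalty against a frozen reference model, so the policy gains reward without shifting far from its starting point.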
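
The clipped objective and the KL penalty described above can be illustrated with a minimal sketch in plain Python. The function names `ppo_clip_loss` and `kl_shaped_reward`, the defaults `clip_eps=0.2` and `beta=0.1`, and the per-sample KL estimate (policy log-probability minus reference log-probability) are assumptions chosen for illustration, not taken from any particular library or from the models listed here.

```python
import math


def ppo_clip_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Negative mean clipped surrogate objective over a batch (hypothetical helper).

    new_logps / old_logps: log-probabilities of the sampled actions under the
    current policy and under the policy that collected the data.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space.
        ratio = math.exp(new_lp - old_lp)
        # Clip the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    # Return a loss to minimize, so negate the objective.
    return -total / len(advantages)


def kl_shaped_reward(reward, policy_logp, ref_logp, beta=0.1):
    """Reward minus a per-sample KL penalty against a reference model (assumed form)."""
    return reward - beta * (policy_logp - ref_logp)


if __name__ == "__main__":
    # Ratio far above 1 with a positive advantage: the clipped term caps the gain.
    print(ppo_clip_loss([0.0], [-1.0], [1.0]))
    # A sample where the policy has drifted from the reference gets a lower reward.
    print(kl_shaped_reward(1.0, policy_logp=-1.0, ref_logp=-2.0))
```

Taking the minimum of the clipped and unclipped terms means the policy receives no extra credit for pushing the ratio outside the clipping interval in the direction the advantage favors, which is what limits the size of each update.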

Models Mentioning proximal policy optimization (1)

Starling LM 7B Beta (2024-02)