RLHF — Step-by-Step Visualization

hard · AIML · Generative AI · Alignment · LLM

Step through Reinforcement Learning from Human Feedback — watch a reward model score responses, then PPO update the LLM policy toward human preferences.

Algorithm Pattern

Human Preference → Reward Signal → Policy Update

Key Idea

RLHF aligns LLMs to human values in 3 phases: (1) supervised fine-tuning on demonstrations, (2) train a reward model on human preference pairs, (3) optimize the LLM with PPO using the reward model as the reward function.

Step-by-Step Approach

  1. Phase 1 — SFT: fine-tune LLM on high-quality demonstrations.
  2. Phase 2 — Reward model: collect (prompt, preferred, rejected) pairs; train RM to score.
  3. Phase 3 — PPO: generate responses, score with RM, update LLM to maximize reward.
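  4. KL penalty: the quantity maximized is the clipped PPO objective minus β·KL(policy, ref_policy); the penalty keeps the policy close to the reference model (typically the SFT model from Phase 1) and limits over-optimization of the RM.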
  5. Human raters are needed only for preference labels; once trained, the RM scores responses to new prompts without further labeling.
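
The three quantities named in the steps above can be sketched per sample in plain Python. This is a minimal illustration, not a training loop: the function names, the scalar inputs, the log-probability-difference KL estimate, and the default clip_eps=0.2 are all assumptions made for the example, and a real implementation would operate on batched token-level tensors.

```python
import math

def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def reward_model_loss(r_preferred: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log sigmoid(r_preferred - r_rejected).

    Small when the RM scores the preferred response above the rejected one.
    """
    return softplus(-(r_preferred - r_rejected))

def kl_shaped_reward(rm_score: float, logp_policy: float,
                     logp_ref: float, beta: float) -> float:
    """RM score minus beta times a single-sample KL estimate.

    The estimate is log pi(y|x) - log pi_ref(y|x) for the sampled response
    (an assumption of this sketch; other estimators exist).
    """
    return rm_score - beta * (logp_policy - logp_ref)

def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped surrogate for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to push the ratio
    # outside the interval [1 - clip_eps, 1 + clip_eps].
    return min(ratio * advantage, clipped_ratio * advantage)

if __name__ == "__main__":
    print(reward_model_loss(2.0, 0.5))                # RM ranks the pair correctly
    print(kl_shaped_reward(1.0, -2.0, -3.0, beta=0.1))
    print(ppo_clipped_objective(math.log(1.5), 0.0, advantage=1.0))
```

In this sketch a larger β makes the shaped reward fall faster as the policy's log-probability moves away from the reference, which is the mechanism step 4 relies on.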

Common Gotchas

  • Reward hacking: policy finds ways to get high reward without being truly helpful.
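  • KL penalty tuning: the KL term against the original model keeps the policy from drifting too far; with β too small the policy over-optimizes the RM, with β too large it barely moves from the original model.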
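  • DPO (Direct Preference Optimization) trains the policy directly on preference pairs with a classification-style loss, with no separate reward model or RL loop, and can achieve similar results.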
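
For comparison with the PPO route, the DPO loss for a single preference pair can be sketched as follows. The function and argument names and the default beta=0.1 are assumptions made for illustration; the inputs are summed log-probabilities of each full response under the policy and under a frozen reference model.

```python
import math

def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dpo_loss(logp_policy_pref: float, logp_policy_rej: float,
             logp_ref_pref: float, logp_ref_rej: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (prompt, preferred, rejected) triple.

    Computes -log sigmoid(beta * (pref_logratio - rej_logratio)), where each
    log-ratio compares the policy to the frozen reference model.
    """
    pref_logratio = logp_policy_pref - logp_ref_pref
    rej_logratio = logp_policy_rej - logp_ref_rej
    margin = beta * (pref_logratio - rej_logratio)
    return softplus(-margin)

if __name__ == "__main__":
    # Policy identical to the reference: the margin is zero.
    print(dpo_loss(-5.0, -6.0, -5.0, -6.0))
    # Policy has shifted probability toward the preferred response.
    print(dpo_loss(-4.0, -7.0, -5.0, -6.0))
```

No reward model appears here: the log-ratio against the reference plays the role of an implicit reward, and β controls how strongly the policy is held near the reference.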

Related Problems