Question 1

What is the algorithm pattern for RLHF?

Accepted Answer

Human Preference → Reward Signal → Policy Update: RLHF aligns an LLM with human preferences in three phases: (1) supervised fine-tuning (SFT) on demonstrations, (2) training a reward model on human preference pairs, and (3) optimizing the LLM with PPO, using the reward model's score as the reward.

Question 2

How do you carry out RLHF step by step?

Accepted Answer

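Phase 1 (SFT): fine-tune the LLM on high-quality demonstrations. Phase 2 (reward model): collect (prompt, preferred, rejected) triples and train the RM to score the preferred response above the rejected one. Phase 3 (PPO): generate responses, score them with the RM, and update the LLM to maximize reward. KL penalty: PPO's clipped update is applied to a penalized reward, RM score − β·KL(policy ‖ ref_policy), where ref_policy is typically the frozen SFT model; the KL term prevents over-optimization against the RM. Human raters are needed only for the preference labels; once trained, the RM can score far more prompts than were labeled.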
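As a rough illustration of Phases 2 and 3, the sketch below shows a pairwise reward-model loss and a KL-penalized reward in plain Python. The answer above does not specify the loss, so the Bradley-Terry style form (negative log-sigmoid of the score margin) is an assumption, though a common choice. The function names, the sampled log-ratio KL estimate, and the default β = 0.1 are likewise illustrative rather than taken from any particular library.

```python
import math


def pairwise_rm_loss(score_preferred: float, score_rejected: float) -> float:
    """Assumed Bradley-Terry style loss: -log(sigmoid(s_preferred - s_rejected))."""
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """RM score minus beta times a sampled KL estimate for one response.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the sampled
    response under the current policy and the frozen reference policy.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate


# Hypothetical numbers: the RM loss shrinks as the preferred score pulls ahead.
loss_tied = pairwise_rm_loss(0.0, 0.0)
loss_correct = pairwise_rm_loss(2.0, 0.0)
reward = kl_penalized_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1)
```

In a real pipeline the scores and log-probabilities would come from neural networks and the penalized reward would feed a PPO update; the arithmetic here only shows how the terms combine.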

Question 3

What are common mistakes when applying RLHF?

Accepted Answer

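Reward hacking: the policy finds ways to score highly under the reward model without being truly helpful. A KL-divergence penalty against the original model limits this by keeping the policy from drifting too far. DPO (Direct Preference Optimization) is an alternative that trains on the preference pairs directly and can achieve similar results without RL.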
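For comparison with the PPO route, here is a minimal sketch of the per-pair DPO loss as it is usually written: the negative log-sigmoid of β times the difference between the policy-vs-reference log-ratios of the preferred and rejected responses. The argument names and the default β = 0.1 are assumptions for illustration. Inputs are summed sequence log-probabilities, which in practice would come from the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_logp_preferred, policy_logp_rejected,
             ref_logp_preferred, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (preferred_ratio - rejected_ratio)))."""
    preferred_logratio = policy_logp_preferred - ref_logp_preferred
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (preferred_logratio - rejected_logratio)
    # Stable softplus(-margin), equal to -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-probabilities: the loss falls once the policy favors the
# preferred response more strongly than the reference does.
loss_at_reference = dpo_loss(-10.0, -12.0, -10.0, -12.0)
loss_after_shift = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

Because the reference log-probabilities appear inside the loss, the pull toward the original model is built in, which is why no separate reward model or KL-penalized reward appears here.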
