Step through Reinforcement Learning from Human Feedback (RLHF): watch a reward model score responses, then watch PPO update the LLM policy toward human preferences.
Human Preference → Reward Signal → Policy Update
RLHF aligns LLMs with human preferences in three phases: (1) supervised fine-tuning on demonstrations, (2) training a reward model on human preference pairs, and (3) optimizing the LLM with PPO, using the reward model's score as the reward signal.
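
As a rough illustration of phases (2) and (3), the sketch below shows two scalar objectives commonly used for them: a pairwise (Bradley-Terry style) preference loss for the reward model and the PPO clipped surrogate for the policy. This is a minimal sketch on plain floats, not the implementation behind this walkthrough. The function names `preference_loss` and `ppo_clipped_objective` and the default `clip_eps=0.2` are assumptions chosen for illustration.

```python
import math


def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise reward-model loss: -log(sigmoid(r_chosen - r_rejected)).

    Hypothetical helper. The loss is small when the reward model scores
    the human-preferred response above the rejected one.
    """
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def ppo_clipped_objective(
    logp_new: float,
    logp_old: float,
    advantage: float,
    clip_eps: float = 0.2,  # assumed default, not taken from the source
) -> float:
    """PPO clipped surrogate for one sampled action (to be maximized).

    Hypothetical helper. The probability ratio is clipped to
    [1 - clip_eps, 1 + clip_eps] so a single update cannot move the
    policy too far from the one that generated the sample.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Reward model agrees with the human label: lower loss.
    print(preference_loss(2.0, 0.0))
    # Reward model disagrees with the human label: higher loss.
    print(preference_loss(0.0, 2.0))
    # Large ratio with positive advantage: the gain is capped by clipping.
    print(ppo_clipped_objective(logp_new=1.0, logp_old=0.0, advantage=1.0))
```

In practice both quantities are computed over batches of token sequences with an autodiff framework, and the advantage is derived from the reward model's score. The scalar version is only meant to make the shape of each objective visible.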