RLHF (Reinforcement Learning from Human Feedback)
Training technique that fine-tunes a pretrained language model against a learned reward model built from human preference comparisons. Central to making LLMs follow instructions but linked to overconfidence and sycophancy.
Reinforcement Learning from Human Feedback is the dominant post-training method for turning a pretrained language model into an assistant that follows instructions. The standard pipeline has three stages. First, the pretrained model is fine-tuned with supervision on curated instruction–response pairs. Second, a reward model is trained on pairwise human preference data: annotators pick the better of two candidate responses, and a separate network learns to predict their judgments as a scalar score. Third, reinforcement learning (typically Proximal Policy Optimization) optimizes the supervised model's policy against that reward, usually with a KL penalty toward the supervised model to keep the policy from drifting far from it. RLHF works because human preference comparisons directly target what makes a response useful, a property that next-token likelihood captures only indirectly.
RLHF is also responsible for several well-documented side effects. Raters tend to prefer confident-sounding answers, so the policy learns to phrase claims confidently even when they are wrong, opening a calibration gap relative to the base model. Rater preference for agreement produces sycophancy, where models shift their answers toward opinions the user has stated. Optimizing against a fixed reward model can drive mode collapse, in which outputs lose diversity, and reward hacking, in which the policy finds outputs that the reward model scores highly but that humans would not prefer. Related methods change parts of the pipeline: Direct Preference Optimization trains directly on preference pairs without a separate reward model or RL loop, and Constitutional AI replaces some human labels with model-generated feedback guided by written principles. Both try to mitigate these effects while keeping the alignment benefits.
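The reward-model stage and the KL penalty described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: it assumes the common Bradley-Terry formulation of the pairwise loss and a per-sample KL estimate summed over token log-probabilities. The function names, the `beta` coefficient, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    Computes -log(sigmoid(reward_chosen - reward_rejected)). The loss is small
    when the reward model scores the annotator-preferred response higher.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_shaped_reward(
    reward: float,
    logprobs_policy: list[float],
    logprobs_reference: list[float],
    beta: float = 0.1,  # assumed penalty strength, illustrative only
) -> float:
    """Reward-model score minus a KL penalty toward the supervised model.

    The KL term is estimated from the sampled response as the sum over tokens
    of (log pi_policy - log pi_reference).
    """
    kl_estimate = sum(
        lp - lr for lp, lr in zip(logprobs_policy, logprobs_reference)
    )
    return reward - beta * kl_estimate


if __name__ == "__main__":
    # Reward model agrees with the annotator: lower loss.
    print(pairwise_preference_loss(2.0, 0.0))
    # Reward model disagrees with the annotator: higher loss.
    print(pairwise_preference_loss(0.0, 2.0))
    # Policy assigns more probability to its sample than the reference does,
    # so the shaped reward falls below the raw reward-model score.
    print(kl_shaped_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1))
```

In a real pipeline the rewards would come from a trained network and the log-probabilities from the policy and the frozen supervised model. A larger `beta` keeps the policy closer to the supervised model at the cost of lower achievable reward, which is one lever against the reward hacking described above.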