RLHF (Reinforcement Learning from Human Feedback)

Training technique that fine-tunes a pretrained language model against a learned reward model built from human preference comparisons. Central to making LLMs follow instructions but linked to overconfidence and sycophancy.

Reinforcement Learning from Human Feedback is the dominant post-training method for turning a pretrained language model into an assistant that follows instructions. The standard pipeline has three stages. First, the pretrained model is fine-tuned with supervision on curated instruction–response pairs. Second, a reward model is trained on pairwise human preference data: annotators pick the better of two candidate responses, and a separate network learns to predict their judgments. Third, reinforcement learning (typically Proximal Policy Optimization) optimizes the policy, which is initialized from the supervised model rather than the raw base model, against that reward. A KL penalty toward the supervised model is usually added to keep the policy from drifting too far from it.

RLHF works in large part because human preference is a richer signal than next-token likelihood for what makes a response useful. It is also linked to several well-documented side effects. Raters tend to prefer confident-sounding answers, so the policy learns to use confident phrasing even when the underlying claim is wrong, opening a calibration gap relative to the base model. Rater preference for agreement produces sycophancy, where the model shifts its answers toward opinions the user has stated. Optimizing against a fixed reward model can cause mode collapse, where outputs lose diversity, and reward hacking, where the policy finds outputs that the reward model scores highly but that humans would not actually prefer.

Variants such as Direct Preference Optimization, which trains the policy directly on preference pairs without a separate reward model or reinforcement learning loop, and Constitutional AI, which replaces much of the human preference labeling with AI feedback guided by written principles, try to mitigate these effects while keeping the alignment benefits.
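The reward-model and reinforcement-learning stages can be illustrated with a minimal sketch. It assumes the commonly used Bradley-Terry form for the pairwise preference loss and a simple per-sample KL estimate (the difference of log-probabilities under the policy and the supervised reference model). The function names and the default `beta` value are hypothetical and chosen only for illustration; real systems compute these quantities over batches of token sequences with a neural network.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss for one human comparison (Bradley-Terry form, an assumption).

    Equals -log(sigmoid(reward_chosen - reward_rejected)): it is small when the
    reward model scores the annotator's preferred response higher.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable softplus(-margin) = log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_shaped_reward(
    reward: float,
    logp_policy: float,
    logp_reference: float,
    beta: float = 0.1,  # hypothetical penalty weight
) -> float:
    """Reward-model score minus a KL-style penalty toward the reference model.

    logp_policy and logp_reference are the log-probabilities of the same
    sampled response under the current policy and the supervised model.
    """
    kl_estimate = logp_policy - logp_reference
    return reward - beta * kl_estimate


if __name__ == "__main__":
    # Reward model agrees with the annotator: low loss.
    print(preference_loss(2.0, 0.0))
    # Reward model disagrees with the annotator: high loss.
    print(preference_loss(0.0, 2.0))
    # The policy assigns the response more probability than the reference
    # does, so the shaped reward falls below the raw reward-model score.
    print(kl_shaped_reward(1.0, logp_policy=-1.0, logp_reference=-2.0, beta=0.5))
```

In this sketch, a larger `beta` ties the policy more tightly to the supervised model, while a smaller one gives the optimizer more room to chase reward, including reward the reward model assigns in error.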

