Reinforcement Learning from Human Feedback (RLHF)

RLHF is a three-stage fine-tuning pipeline (supervised fine-tuning on demonstrations, a learned reward model trained on human preference comparisons, and policy optimization against that reward) that turned base LLMs into modern chat assistants. It made models far more helpful but introduced reward-model exploitation, sycophancy, and over-refusal as recurring failure modes.

Reinforcement Learning from Human Feedback (RLHF) is the training technique that turned raw large language models into the chat assistants deployed by OpenAI, Anthropic, Google DeepMind and others. It fine-tunes a pretrained model to prefer outputs that human raters rate highly, using reinforcement learning guided by a learned reward model.

The pipeline has three stages. First, supervised fine-tuning on curated instruction-following demonstrations gives the base model an initial bias toward helpful conversational behavior. Second, human raters are shown pairs or sets of model responses to the same prompt and asked to pick the better one; these comparisons train a separate reward model that learns to predict which response humans would prefer. Third, the language model is optimized against the reward model using a policy-gradient algorithm, historically Proximal Policy Optimization, usually with a KL-divergence penalty that keeps the new policy close to the supervised model and limits reward hacking. Variants such as Direct Preference Optimization skip the explicit reward model and optimize directly on preference pairs.

RLHF dramatically improved instruction following, refusal behavior on unsafe requests, and conversational fluency, and is the technique behind the InstructGPT and ChatGPT releases that brought LLMs to mass audiences.

It also has well-known failure modes: the reward model is a lossy proxy for human judgment, so models learn to exploit its quirks; raters under time pressure often prefer confident-sounding or agreeable answers over correct ones, which induces sycophancy; and the technique tends to flatten stylistic diversity and over-refuse benign requests. These limits motivated follow-on methods such as Constitutional AI, RLAIF (RL from AI feedback), and debate-style training.
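
The reward-model and policy-optimization stages described above can be made concrete with a small numerical sketch. It is illustrative only: the article does not specify loss functions, so the pairwise logistic (Bradley-Terry style) reward-model loss, the per-sample KL estimate, and the default beta of 0.1 are assumptions, and every function name here is hypothetical. The code works on plain floats (scalar rewards and sequence log-probabilities) and does not train a model.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Stage 2 (assumed form): pairwise reward-model loss.

    Equals -log(sigmoid(r_chosen - r_rejected)); it shrinks as the reward
    model scores the human-preferred response above the rejected one.
    """
    return _softplus(-(r_chosen - r_rejected))


def kl_penalized_reward(reward: float, logp_policy: float,
                        logp_reference: float, beta: float = 0.1) -> float:
    """Stage 3 (assumed form): reward-model score minus a KL penalty.

    logp_policy - logp_reference is a single-sample estimate of the KL
    divergence between the current policy and the supervised model;
    beta (hypothetical default) sets how strongly drift is penalized.
    """
    return reward - beta * (logp_policy - logp_reference)


def dpo_loss(logp_policy_chosen: float, logp_policy_rejected: float,
             logp_ref_chosen: float, logp_ref_rejected: float,
             beta: float = 0.1) -> float:
    """Direct Preference Optimization loss for one preference pair.

    No explicit reward model: the log-probability ratio between the policy
    and the reference model plays the role of the reward.
    """
    margin = ((logp_policy_chosen - logp_ref_chosen)
              - (logp_policy_rejected - logp_ref_rejected))
    return _softplus(-beta * margin)


if __name__ == "__main__":
    # A reward model that ranks the pair correctly incurs a lower loss.
    print(preference_loss(2.0, 0.0) < preference_loss(0.0, 2.0))
    # Drifting away from the reference model lowers the effective reward.
    print(kl_penalized_reward(1.0, logp_policy=-1.0, logp_reference=-3.0))
```

The KL term is what separates this from naive reward maximization: a response that scores well with the reward model but is far more likely under the new policy than under the supervised model has its effective reward reduced, which is the mechanism intended to limit exploitation of reward-model quirks.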

Related Knowledge

Sycophancy in LLM Responses
