RLHF (Reinforcement Learning from Human Feedback)
A technique for aligning AI models with human preferences: a reward model is trained on human judgments, and reinforcement learning then optimizes the model against that learned reward signal. Widely used after initial pre-training to make language models more helpful, harmless, and honest.
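A minimal sketch of the first half of that pipeline, assuming a pairwise (Bradley-Terry style) preference loss in PyTorch; the tensors here are toy placeholders standing in for a hypothetical reward model's scores, not any particular system's training code.

```python
# Sketch of the reward-model step in RLHF (illustrative only).
# `chosen` and `rejected` stand in for a hypothetical reward model's scores
# for the human-preferred and non-preferred response in each comparison pair.
import torch
import torch.nn.functional as F

def preference_loss(chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: train the reward model to score the
    human-preferred response higher than the rejected one."""
    return -F.logsigmoid(chosen - rejected).mean()

# Toy usage: scores for three comparison pairs.
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, 1.1])
print(preference_loss(chosen, rejected))  # loss falls as chosen outscores rejected
```

Minimizing this loss pushes the reward model to reproduce the human rankings, giving the reinforcement-learning step a numerical signal to optimize.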
Why It Matters
RLHF is how modern language models go from 'predicts the next word' to 'follows instructions helpfully and safely.' Understanding RLHF helps governance professionals evaluate claims about AI safety, understand model behavior, and assess alignment approaches.
Example
During Claude's training, human evaluators compare pairs of model responses and indicate which is better. A reward model learns from these preferences, and reinforcement learning optimizes Claude to produce responses that score higher on helpfulness and safety — shaping the model's behavior beyond what pre-training alone could achieve.
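A simplified sketch of the optimization step this example describes, under common published assumptions (a REINFORCE-style surrogate with a KL penalty toward the pre-trained reference model); it is not Anthropic's actual training code, and names like `policy_logprob` and `ref_logprob` are hypothetical placeholders.

```python
# Illustrative RL step: nudge the policy toward responses the reward model
# scores highly, while a KL penalty keeps it close to the reference model.
import torch

def rlhf_objective(reward: torch.Tensor,
                   policy_logprob: torch.Tensor,
                   ref_logprob: torch.Tensor,
                   kl_coef: float = 0.1) -> torch.Tensor:
    """Reward-model score minus a KL penalty, turned into a policy-gradient
    surrogate to maximize (negate it to use as a loss)."""
    kl = policy_logprob - ref_logprob          # per-response KL estimate
    shaped_reward = reward - kl_coef * kl      # penalized reward
    return (policy_logprob * shaped_reward.detach()).mean()

# Toy usage: two responses with reward-model scores and log-probabilities.
reward = torch.tensor([2.0, 0.5])
policy_lp = torch.tensor([-1.0, -2.0], requires_grad=True)
ref_lp = torch.tensor([-1.1, -1.9])
objective = rlhf_objective(reward, policy_lp, ref_lp)
objective.backward()  # gradients favor high-reward, low-drift responses
```

The KL term is the part that matters for governance discussions: it is what keeps the optimized model from drifting arbitrarily far from its pre-trained behavior while chasing reward.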
Think of it like...
RLHF is like a chef adjusting recipes based on food critic reviews — the base cooking skills (pre-training) are there, but human feedback fine-tunes the dishes to match what people actually want to eat.
Related Terms
Reinforcement Learning
A type of machine learning where an agent learns to make decisions by taking actions in an environment and receiving rewards or penalties. The agent aims to maximize cumulative reward over time through trial and error.
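A tiny illustration of that trial-and-error loop, using an epsilon-greedy agent on a three-armed bandit with made-up payout probabilities; this is not RLHF itself, just the core act-observe-update cycle the definition describes.

```python
# Epsilon-greedy bandit: act, observe a reward, update value estimates,
# and gradually favor the action with the highest estimated payoff.
import random

payout_probs = [0.2, 0.5, 0.8]   # hidden from the agent
estimates = [0.0, 0.0, 0.0]      # agent's running value estimates
counts = [0, 0, 0]
epsilon = 0.1                    # fraction of steps spent exploring
total_reward = 0.0

for step in range(1000):
    if random.random() < epsilon:                       # explore
        action = random.randrange(3)
    else:                                               # exploit best estimate
        action = max(range(3), key=lambda a: estimates[a])
    reward = 1.0 if random.random() < payout_probs[action] else 0.0
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]
    total_reward += reward

print(estimates, total_reward)  # estimates approach the true payout rates
```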
AI Alignment
The challenge of ensuring that AI systems behave in accordance with human values, intentions, and objectives — not just following instructions literally, but understanding and respecting the intent behind them. As AI systems become more capable, alignment becomes more critical and more difficult.
Fine-Tuning
The process of taking a pre-trained model and further training it on a smaller, domain-specific dataset to specialize its behavior for a particular task or domain. Fine-tuning adjusts the model's weights to improve performance on the target task.
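A minimal sketch of that process in PyTorch, assuming a small toy network and randomly generated stand-in data; the `pretrained_model.pt` checkpoint and the dataset are hypothetical placeholders.

```python
# Fine-tuning sketch: start from pre-trained weights, then continue
# gradient descent on a small, domain-specific dataset.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
# model.load_state_dict(torch.load("pretrained_model.pt"))  # hypothetical pre-trained weights

domain_x = torch.randn(64, 16)            # small domain-specific dataset (toy)
domain_y = torch.randint(0, 2, (64,))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # small LR to preserve prior knowledge
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):                     # a few passes over the small dataset
    optimizer.zero_grad()
    loss = loss_fn(model(domain_x), domain_y)
    loss.backward()
    optimizer.step()                       # weights shift toward the target task
```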