RLHF

Reinforcement Learning from Human Feedback (RLHF) is a training technique that tunes LLM behavior with preference data collected from humans. A raw pre-trained LLM is fluent but often unhelpful or unsafe; RLHF is the standard alignment step that turns that raw model into "a conversational partner people actually prefer."

Why It Matters

RLHF was the core reason ChatGPT captured public attention in 2022. OpenAI's InstructGPT paper showed a 1.3B-parameter RLHF model was preferred by humans over the 175B-parameter base model. The lesson: "align with human feedback" is a stronger lever than "make the model bigger." Nearly every commercial LLM today — Claude, GPT, Gemini, Llama — ships with some form of RLHF or a derivative.

Three Stages

1. Pre-training: Learn next-token prediction on a huge text corpus. The model is knowledgeable but bad at following instructions.

2. Supervised Fine-Tuning (SFT): Fine-tune on human-written "good question → good answer" pairs. The model learns the chat format and instruction following.

3. RLHF proper:

  • Train a reward model: Show humans two candidate responses and ask which is better. Train a reward model on these preference pairs.
  • RL optimization: Use an RL algorithm like PPO (Proximal Policy Optimization) to adjust the LLM to maximize the reward model's scores, typically with a KL penalty that keeps the policy close to the SFT model so it doesn't drift into degenerate text.

The result is still a language model — but one whose outputs are tuned toward human preferences.
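The two mathematical ideas behind this pipeline are compact enough to sketch. Below is a minimal illustration, assuming a Bradley-Terry pairwise loss for the reward model and a KL-penalized reward for the RL step; the function names and the scalar inputs are simplifications for illustration, not any library's real API (in practice these operate on logits from full models):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).
    Minimizing it pushes the reward model to score the human-preferred
    response higher than the rejected one."""
    return -math.log(sigmoid(r_chosen - r_rejected))

def rlhf_objective(reward: float, kl_to_sft: float, beta: float = 0.1) -> float:
    """PPO-style per-sample objective: maximize reward minus a KL penalty
    toward the frozen SFT model. The penalty discourages reward hacking
    by keeping outputs close to the supervised baseline."""
    return reward - beta * kl_to_sft

# The loss shrinks as the reward model ranks the preferred answer higher:
assert reward_model_loss(2.0, 0.0) < reward_model_loss(0.5, 0.0)
# With equal scores the loss is exactly log(2) (the model is guessing):
assert abs(reward_model_loss(0.0, 0.0) - math.log(2.0)) < 1e-9
```

The `beta` coefficient is the knob practitioners tune: too low and the model drifts toward reward-model exploits, too high and it barely moves from the SFT baseline.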

What RLHF Solves

Usefulness: Converts "technically predictive but useless" answers into "actually useful" ones.

Safety: Trains the model to refuse violent, discriminatory, or illegal content.

Honesty: Encourages "I don't know" over fabrication — though it doesn't solve hallucination entirely.

Tone and format: Teaches friendly style, structured responses, and cultural registers like Korean honorifics.

Limits and Criticism

Reward hacking: The model exploits weaknesses in the reward model to produce answers that look good to evaluators but aren't actually better.

Feedback bias: The cultural and personal biases of labelers get baked into the reward model.

Over-alignment: The model becomes overly cautious and refuses legitimate questions.

Cost: Collecting high-quality human feedback is slow and expensive and doesn't scale cleanly.

Hallucination tension: Some research argues RLHF can amplify hallucinations by rewarding confident-sounding answers.

Derivatives and Alternatives

DPO (Direct Preference Optimization): Skips the explicit reward model and RL loop, optimizing the LLM directly on preference pairs. Widely adopted since 2023 as a simpler, cheaper alternative to full RLHF.
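DPO's core trick can be shown in a few lines. A minimal sketch, assuming scalar log-probabilities of the chosen and rejected responses under the policy and under a frozen reference (SFT) model; the function name is illustrative, and real implementations compute these terms over whole token sequences:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss: -log sigmoid(beta * margin), where the margin compares
    how much the policy prefers the chosen response over the rejected one,
    relative to the frozen reference model. No separate reward model and
    no RL loop are needed."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy matches the reference, the margin is 0 and loss is log(2):
assert abs(dpo_loss(-2.0, -2.0, -2.0, -2.0) - math.log(2.0)) < 1e-9
# Preferring the chosen response more than the reference lowers the loss:
assert dpo_loss(-1.0, -3.0, -2.0, -2.0) < math.log(2.0)
```

The implicit reward here is `beta * (logp_policy - logp_ref)`, which is why DPO is often described as folding the reward model into the policy itself.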

Constitutional AI (CAI): Anthropic's approach — instead of human feedback, use an explicit "constitution" the model self-critiques and revises itself against.

RLAIF (RL from AI Feedback): Use another LLM to provide preference judgments instead of humans. Cheaper but more bias risk.

GEO Implications

Modern LLMs, thanks to RLHF, are aligned toward a neutral, useful tone. Blog content that tends to get cited by AI search leans into calm, informational writing rather than sensational or exaggerated copy. And because preference tuning rewards verifiable, appropriately hedged answers, fact-based content with explicit sources is more likely to be picked as a citation candidate.
