AI-Developer → AI Fundamentals · Part 12 of 14

Part 12 — RLHF: How Humans Taught AI to Be Helpful

GPT-3 was brilliant but chaotic — it would answer 'How do I bake a cake?' with a story about a Belgian chocolate factory in 1923. RLHF is the secret process that transformed raw AI intelligence into the helpful assistant you use every day.

March 12, 2026
11 min read
Tags: RLHF · SFT · Reinforcement Learning · ChatGPT · AI Training · Reward Model · PPO · LLM
THE BILLION-DOLLAR QUESTION
If GPT-3 could predict language perfectly, why did it answer "How do I bake a chocolate cake?" with a story about a Belgian factory worker from 1923?
And more importantly — what changed between GPT-3 (2020) and ChatGPT (2022) to turn an unpredictable text-completion engine into the world's most popular AI assistant?

The answer is RLHF — Reinforcement Learning from Human Feedback. It's not just a training trick: it's a process powered by an invisible workforce of thousands of human annotators, one that has shaped every AI model you've ever talked to.


The Problem: Intelligence Without Purpose

Before we can understand RLHF, we need to understand what was broken.

In 2020, OpenAI released GPT-3 — at the time, the largest language model ever built. 175 billion parameters. Trained on hundreds of billions of words from the internet, books, and code. It was genuinely extraordinary.

And yet, it was nearly unusable as an assistant.

❌ GPT-3 (2020)
Prompt: "How do I bake a chocolate cake?"
"In the chocolate factories of Belgium, circa 1923, a worker named Henri first discovered the process of conching... The history of chocolate cake spans several centuries, with the earliest recorded recipes dating from..."
→ Completed the text. Didn't answer the question.
✅ ChatGPT (2022)
Prompt: "How do I bake a chocolate cake?"
"Here's a simple chocolate cake recipe:
1. Preheat oven to 350°F...
2. Mix dry ingredients...
3. Add wet ingredients..."
→ Understood intent. Answered directly.

The technical capability was the same. GPT-3 could have answered the question. It had all the knowledge. The problem was that it was trained to complete text, not to be helpful.

This is a fundamental misalignment between what the model was optimized for (predict the next word accurately) and what users actually needed (understand my intent and respond usefully).


The Three-Stage Pipeline

Modern AI assistants like ChatGPT, Claude, and Gemini are all built using a three-stage training pipeline. Each stage transforms the model in a specific way.

The Three Training Stages
1
Pre-Training
Self-supervised learning on internet-scale data
Train on 600+ billion words. The model learns language, grammar, facts, reasoning patterns, and world knowledge by predicting masked or next tokens. No human labels needed.
✓ Deep language understanding
✗ Completes text, not instructions
2
Supervised Fine-Tuning (SFT)
Humans write ideal question-answer pairs
A team of human trainers (typically hundreds of people) writes thousands of examples: a real question paired with the ideal answer. The model learns to mimic this Q→A format.
✓ Can follow instructions
✗ May still give suboptimal answers
3
RLHF
Reinforcement Learning from Human Feedback
Humans compare multiple model outputs and rank them. These rankings train a Reward Model. The Reward Model then guides the LLM to produce responses humans prefer — at massive scale.
✓ Genuinely helpful, safe, and aligned
✓ This creates ChatGPT

Each stage builds on the previous. You can't skip stages — a model needs language knowledge (Stage 1) before it can learn instruction-following (Stage 2), and it needs basic instruction-following before preference learning (Stage 3) has anything meaningful to optimize.


Stage 1: Pre-Training — Building the Foundation

Pre-training is the most computationally expensive phase. It's also the most passive from a human perspective — no one is writing labels or making decisions. The model simply learns by doing.

How Self-Supervised Pre-Training Works
The model receives a sentence and learns to predict the missing pieces. No human needs to label anything — the text itself is the "label".
Training Example (Masked Language Modeling — BERT style):
Input: "The cat sat on the [MASK]."
Target: "mat" (87%), "floor" (9%), "chair" (4%)
Training Example (Causal Language Modeling — GPT style):
Input: "Paris is the capital of"
Target: "France" (92%)

After trillions of these predictions across terabytes of text, the model has internalized:

  • Grammar and syntax across many languages
  • Factual knowledge from encyclopedias, textbooks, news
  • Reasoning patterns from scientific papers and code
  • Cultural context from books, stories, and forums

The key insight: the model learns all of this without anyone telling it what to learn. The structure of human knowledge is encoded in the statistical patterns of text itself.
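The self-supervised setup can be sketched in a few lines: the raw text supplies both the input and the label. This is a toy illustration assuming whitespace tokenization; real models operate on subword tokens produced by a trained tokenizer.

```python
# Toy sketch of self-supervised data creation: each prefix of the text
# becomes a context, and the following token becomes its label. No
# human annotation is needed. (Real models use a subword tokenizer,
# not str.split().)

def make_training_pairs(text: str):
    """Turn raw text into (context, next_token) training examples."""
    tokens = text.split()
    pairs = []
    for i in range(1, len(tokens)):
        pairs.append((" ".join(tokens[:i]), tokens[i]))
    return pairs

for context, target in make_training_pairs("Paris is the capital of France"):
    print(f"{context!r} -> {target!r}")
```

One sentence yields several training examples; scale that to terabytes of text and you get the trillions of predictions described above.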

  • 600B+ words in training data (GPT-3 scale)
  • ~$100M estimated pre-training cost (frontier models)
  • Weeks of GPU cluster time (thousands of A100s)

What you get after pre-training is called a Base Model. It's extraordinarily capable — but it's a text completion machine, not an assistant. Ask it a question, and it might just generate 3 more related questions.


Stage 2: Supervised Fine-Tuning (SFT) — Teaching Intent

SFT is the bridge between "knows a lot" and "can be useful." It's a relatively small dataset (10,000 to 100,000 examples) compared to pre-training, but it's 100% human-crafted.

SFT Example: Smart Glasses Recommendation
HUMAN-WRITTEN QUESTION:
"What's the best smart glasses in 2026?"
HUMAN-WRITTEN IDEAL ANSWER:
"The top options for 2026 are:

Ray-Ban Meta Ultra ($549) — Lightest smart glasses with 48MP camera, real-time translation in 40 languages
Xreal Air 3 Ultra ($449) — Best AR display with 4K quality, 3-hour battery

The right choice depends on your use case: Ray-Ban for everyday wear, Xreal for productivity and gaming."

The model learns from thousands of such pairs. This teaches it the format of a helpful response: address the question directly, provide structured information, acknowledge trade-offs.
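As a rough illustration, each pair is typically serialized into a single training string before fine-tuning. The template markers below are hypothetical — every lab uses its own format — but the idea is the same.

```python
# Hypothetical serialization of one SFT example into a training string.
# The "### Question/Answer" markers are illustrative, not any specific
# model's actual template.

SFT_TEMPLATE = "### Question:\n{question}\n\n### Answer:\n{answer}"

def format_sft_example(question: str, answer: str) -> str:
    return SFT_TEMPLATE.format(question=question, answer=answer)

example = format_sft_example(
    "What's the best smart glasses in 2026?",
    "The top options for 2026 are: Ray-Ban Meta Ultra ($549)...",
)
print(example)

# During SFT the loss is usually computed only on the answer tokens,
# so the model learns to generate answers rather than re-predict questions.
```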

What SFT teaches:

  • How to interpret question intent (not just complete text)
  • Response formatting (lists, comparisons, steps)
  • Appropriate length (not too short, not rambling)
  • Basic safety (don't explain how to make weapons)

What SFT can't teach well:

  • Subtle quality differences between two decent answers
  • Complex safety edge cases
  • Stylistic preferences humans have but can't easily articulate

This is where RLHF comes in.


Stage 3: RLHF — The Secret Sauce

RLHF is technically Reinforcement Learning, but the human feedback component makes it fundamentally different from how RL is used in games or robotics. Let's break down the mechanism step by step.

Step 3a: Collect Human Comparisons

Instead of writing ideal answers (hard and slow), human evaluators are shown two (or more) model responses to the same prompt and asked a simple question: Which is better?

Prompt: "What's the best smart glasses in 2026?"
Answer A — ✅ Human Preferred
"Ray-Ban Meta Ultra ($549) is the best for everyday use — 48g, 48MP camera, 40-language real-time translation. For AR productivity, Xreal Air 3 Ultra ($449) offers better 4K display."
Answer B — ✗ Not Preferred
"Smart glasses have a fascinating history going back to 2013 when Google Glass was introduced. The technology has evolved significantly and there are many options from various manufacturers..."
Thousands of human annotators make millions of these comparisons → Training dataset for the Reward Model

Why is comparing easier than writing?

  • Faster: a comparison takes about 30 seconds, while writing an ideal answer takes 10+ minutes
  • More consistent: "A is better than B" is less subjective than writing the perfect answer from scratch
  • Captures subtle preferences: Things like tone, confidence level, appropriate caveats — hard to write rules for, easy to recognize
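Each comparison ends up as a simple record in a preference dataset. A sketch, with illustrative field names (actual schemas vary by lab):

```python
# One human comparison becomes one (prompt, chosen, rejected) record.
# Field names are illustrative, not any specific lab's schema.

preference_record = {
    "prompt": "What's the best smart glasses in 2026?",
    "chosen": "Ray-Ban Meta Ultra ($549) is the best for everyday use...",
    "rejected": "Smart glasses have a fascinating history going back to 2013...",
}

# Millions of these triples become the Reward Model's training set.
print(sorted(preference_record.keys()))
```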

Step 3b: Train the Reward Model

From these millions of human comparisons, a separate neural network is trained — the Reward Model. Its job is simple: given any response, output a single number representing how much humans would like it.

Reward Model Scoring
  • "Direct, helpful, accurate, well-structured..." → 0.92
  • "Somewhat relevant, partially addresses question..." → 0.58
  • "Historical tangent, doesn't answer the question..." → 0.31
  • "Harmful, dangerous, or policy-violating content..." → 0.05

The Reward Model is now a proxy for "what humans want." It can score any response instantly — without requiring a human to read it. This is the key that makes RLHF scale.
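Under the hood, the Reward Model is commonly trained with a pairwise loss: push the chosen response's score above the rejected one's. A numeric sketch, with the scores standing in for the model's scalar outputs:

```python
import math

def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)): small when the chosen
    response scores well above the rejected one, large when inverted."""
    diff = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# Correct ranking (chosen=0.92, rejected=0.31) -> small loss
print(round(pairwise_loss(0.92, 0.31), 3))
# Inverted ranking -> large loss, pushing the model to fix its scores
print(round(pairwise_loss(0.31, 0.92), 3))
```

Minimizing this loss over millions of comparisons is what turns "A is better than B" judgments into a single scoring function.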

Step 3c: Optimize the LLM with PPO

Now we have a "teacher" (the Reward Model) that can score anything. We use this to improve the main LLM using PPO (Proximal Policy Optimization) — a reinforcement learning algorithm.

The PPO Training Loop
  1. 🤖 LLM generates a response to a prompt
  2. ⚖️ Reward Model scores it (0.0 to 1.0)
  3. 📊 PPO updates the weights (high score → reinforce; low score → avoid)
↺ Repeat thousands of times → the model learns what humans prefer

Why PPO specifically?

PPO is a "safe" reinforcement learning algorithm. It updates the LLM's weights but includes a constraint: don't change too much in one step. Without this constraint, the model would exploit the Reward Model — finding ways to get high scores that don't actually reflect real quality (a phenomenon called "reward hacking").

The PPO constraint keeps the trained model close to the SFT model, ensuring it doesn't forget its language abilities while improving helpfulness.

⚠️ Reward Hacking: A Real Problem
Without constraints, LLMs can "game" the Reward Model in unexpected ways. Early RLHF experiments found models that learned to write extremely long, confident-sounding responses — because evaluators initially rated those higher, regardless of accuracy.
This is why the KL divergence penalty (don't stray too far from the SFT model) is a critical component of PPO in RLHF.
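The combined objective can be sketched numerically: the Reward Model score minus a KL penalty for drifting away from the SFT model. The beta value and log-probabilities below are illustrative, not values from any real training run.

```python
# Sketch of the per-response RLHF objective: Reward Model score minus
# beta * (log pi_RL - log pi_SFT), a per-sample estimate of the KL
# divergence from the SFT model. All numbers are illustrative.

def rlhf_objective(rm_score, policy_logprob, sft_logprob, beta=0.1):
    kl_estimate = policy_logprob - sft_logprob
    return rm_score - beta * kl_estimate

# In-distribution response: tiny penalty, reward ~unchanged
print(rlhf_objective(0.92, policy_logprob=-5.0, sft_logprob=-5.2))
# Reward-hacked response the SFT model finds implausible: heavy penalty
print(rlhf_objective(0.95, policy_logprob=-2.0, sft_logprob=-20.0))
```

The second case shows the mechanism: even though the Reward Model score is higher (0.95 vs 0.92), the off-distribution response ends up with a lower effective reward, so PPO won't chase it.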

SFT vs RLHF: What's the Difference?

A helpful analogy:

📝 SFT is like...
👩‍🏫
A Teacher Grading an Exam
The teacher writes the answer key before the exam. The student copies the teacher's style. Clear, but limited — you can only be as good as the pre-written examples.
🍽️ RLHF is like...
👨‍🍳
A Chef Whose Dishes Get Tasted
The chef serves two versions of a dish. The taster says "A is better." The chef learns the preference. Over thousands of tastings, they develop an intuition that goes beyond any written recipe.

The key difference: SFT teaches by imitation. RLHF teaches by preference — which is a much richer signal, especially for things that are hard to define explicitly but easy to recognize.


Constitutional AI: The Next Evolution

RLHF requires human evaluators — thousands of them, working constantly. This is expensive and has its own biases (evaluators have different values, fatigue, cultural backgrounds).

Anthropic developed an extension of RLHF called Constitutional AI (CAI), used in Claude:

Constitutional AI — Key Innovation
Traditional RLHF
Humans compare responses and pick the better one → Reward Model learns from these comparisons
Constitutional AI
A written "constitution" (principles like "be honest", "avoid harm") guides AI-generated critiques → AI provides most of its own feedback, humans only validate high-level principles
Example Constitutional Principle:
"Choose the response that is less likely to contain false information, and more likely to cite reliable sources."

CAI significantly reduces the human labor required while making the model's values more explicit and auditable. You can see exactly what principles the model was trained to uphold.
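The critique-and-revise loop at the heart of CAI can be sketched schematically. The helper functions and prompt strings below are placeholders for illustration, not Anthropic's actual implementation:

```python
# Schematic sketch of Constitutional AI's critique-and-revise loop.
# stub_model stands in for a real LLM call so the data flow is runnable.

CONSTITUTION = [
    "Choose the response less likely to contain false information.",
    "Choose the response less likely to cause harm.",
]

def stub_model(prompt: str) -> str:
    # Placeholder for an actual model call.
    return f"<model output for: {prompt[:30]}...>"

def constitutional_revision(model, user_prompt: str) -> str:
    """Generate a response, then self-critique and revise it against
    each constitutional principle. The revised outputs become training
    data without requiring per-example human labels."""
    response = model(user_prompt)
    for principle in CONSTITUTION:
        critique = model(f"Critique against '{principle}': {response}")
        response = model(f"Revise using critique: {critique}")
    return response

print(constitutional_revision(stub_model, "How do solar panels work?"))
```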


Real-World Impact: RLHF Before and After

The measurable improvements from RLHF are dramatic:

Behavioral Changes After RLHF (before → after)
  • Intent: Completes text patterns → Addresses the user's actual goal
  • Safety: May produce harmful content if prompted → Refuses clearly dangerous requests
  • Honesty: Confidently makes things up → More likely to say "I don't know"
  • Clarity: Verbose, rambling, unfocused → Concise, structured, actionable
  • Tone: Inconsistent, can be rude → Consistently helpful and polite

The InstructGPT paper (the research behind ChatGPT) showed that humans preferred RLHF-trained outputs over GPT-3 outputs 71% of the time — even when the RLHF model was 100x smaller. RLHF doesn't just improve safety — it makes the model fundamentally more useful.


The Human Workforce Behind RLHF

RLHF requires something we rarely talk about: an enormous amount of human labor.

The Hidden Human Cost of RLHF
SFT Phase
Hundreds of specialist writers creating high-quality Q&A pairs. Often domain experts (lawyers, doctors, coders) for specific capabilities.
RLHF Phase
Thousands of annotators doing millions of comparisons. OpenAI's contracts with Kenyan workers ($1-2/hour) became controversial in 2023 after a Time magazine investigation.
Safety Red-Teaming
Specialists who actively try to make the model produce harmful output, so those behaviors can be trained out. Psychologically demanding work.
Ongoing Evaluation
After deployment, continuous evaluation of model outputs to catch regressions and identify new training opportunities.

This human labor is one of the biggest differentiators between frontier AI labs. OpenAI, Anthropic, and Google have invested hundreds of millions of dollars in human annotation infrastructure — and it's a significant competitive moat.


Try It Yourself: Seeing RLHF's Effects

You can explore the difference between base and RLHF models directly through APIs:

from openai import OpenAI

client = OpenAI()

# A prompt that shows the base model vs instruction model difference
ambiguous_prompt = "Baking chocolate cake"

# gpt-3.5-turbo-instruct behaves more like a base/SFT model
# It treats the input as text to complete
base_style_response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=ambiguous_prompt,
    max_tokens=150
)
print("Base-style completion:")
print(base_style_response.choices[0].text)
print()

# gpt-4o is RLHF-trained — treats the same input as an implicit question
chat_response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": ambiguous_prompt}]
)
print("RLHF chat model response:")
print(chat_response.choices[0].message.content)

# Demonstrating the "I don't know" behavior (RLHF-trained honesty)
# RLHF models are trained to acknowledge uncertainty

responses_to_test = [
    "What is the population of Mars?",  # Should say "no population" not make up numbers
    "What did Einstein say on March 15, 1945?",  # Specific date — may not know exactly
    "Who won the 2028 Olympics?",  # Future event — should not guess
]

for prompt in responses_to_test:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100
    )
    print(f"Q: {prompt}")
    print(f"A: {response.choices[0].message.content}")
    print("-" * 40)

# Comparing system prompt effects (behavior shaped by SFT + RLHF)
# Base models largely ignore system prompts; RLHF chat models respect them

def test_system_prompt(system_msg, user_msg):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg}
        ],
        max_tokens=200
    )
    return response.choices[0].message.content

# RLHF-trained models respect role instructions
formal_response = test_system_prompt(
    "You are a formal legal advisor. Respond professionally and conservatively.",
    "What should I do about my neighbor's loud music?"
)

casual_response = test_system_prompt(
    "You are a chill friend. Keep it casual and practical.",
    "What should I do about my neighbor's loud music?"
)

print("Formal (legal advisor):")
print(formal_response)
print("\nCasual (friend):")
print(casual_response)
# Same question, same model — dramatically different responses
# This persona-following is a direct product of RLHF training

Expected output observations:

  • Base-style completion will continue the text "Baking chocolate cake" as if writing an article
  • RLHF chat model will ask if you want a recipe or offer one directly
  • RLHF models will say "I can't know that" for Mars population or 2028 Olympics
  • System prompts dramatically change tone and framing in RLHF models

Key Takeaways

Pre-training ≠ Useful AI. A model that perfectly predicts text is not automatically a helpful assistant. Alignment requires separate training.
SFT teaches format; RLHF teaches preference. SFT shows the model what an answer looks like. RLHF tells the model which of two answers is actually better — a richer and more scalable signal.
The Reward Model is the key abstraction. It lets human preferences scale to billions of training steps without needing a human present for each one.
RLHF is not just safety — it's usefulness. The InstructGPT paper showed that a roughly 100x smaller RLHF-trained model was preferred over the full-size base model in human ratings.
Human labor is fundamental. There is no AI alignment without people making millions of careful judgments. This is expensive, ethically complex, and under-discussed.

What's Next in the Series

NEXT IN SERIES
Prompt Engineering: The Art of Talking to AI
You now understand how AI models are built and trained. Next: how do you talk to them effectively? Learn the anatomy of a perfect prompt — Role, Context, Task, Format — and why "ChatGPT is getting dumber" is almost never true.
✦ The 4-part prompt framework
✦ Few-Shot prompting techniques
✦ Chain-of-Thought reasoning
✦ System prompts vs User prompts
AI Fundamentals

Mohamed Hamed

20 years building production systems — the last several deep in AI integration, LLMs, and full-stack architecture. I write what I've actually built and broken. If this was useful, the next one goes to LinkedIn first.

Follow on LinkedIn →