Part 11 of 14

Part 12 — RLHF: How Humans Taught AI to Be Helpful

GPT-3 was brilliant but chaotic — it would answer 'How do I bake a cake?' with a story about a Belgian chocolate factory in 1923. RLHF is the secret process that transformed raw AI intelligence into the helpful assistant you use every day.

March 12, 2026
11 min read
#RLHF #SFT #Reinforcement Learning #ChatGPT #AI Training #Reward Model #PPO #LLM

RLHF: The Secret of Alignment

GPT-3 knew everything but followed nothing. It was a text-completion engine, not an assistant. RLHF is the process that aligned raw intelligence with human intent, safety, and helpfulness.

Primary Objective
Supervised Fine-Tuning | Reward Models | PPO Loop | Constitutional AI

If GPT-3 could predict language perfectly, why did it answer "How do I bake a chocolate cake?" with a story about a Belgian factory worker from 1923? And what changed between GPT-3 (2020) and ChatGPT (2022) to turn an unpredictable text-completion engine into the world's most popular AI assistant?

The answer is RLHF — Reinforcement Learning from Human Feedback. It's not just a training trick; it's the invisible workforce of thousands of human annotators that shaped every AI model you've ever talked to.

💡
The Alignment Gap

Predicting the next word is not the same as being helpful. In 2020, GPT-3 would answer a cake-recipe request with a history of Belgian factories. It was accurate in language, but wrong in intent.


The Problem: Intelligence Without Purpose

In 2020, OpenAI released GPT-3 — at the time the largest language model ever built. 175 billion parameters, trained on hundreds of billions of words. It was genuinely extraordinary. And yet it was nearly unusable as an assistant.

Same Prompt: How do I bake a chocolate cake?

GPT-3 (2020)

"In the chocolate factories of Belgium, circa 1923, a worker named Henri first discovered the process of conching… The history of chocolate cake spans several centuries…"

Completed the text. Never answered the question.

ChatGPT (2022)

"Here's a simple chocolate cake recipe: 1. Preheat oven to 350°F. 2. Mix the dry ingredients. 3. Add the wet ingredients…"

Understood intent. Answered directly.

The underlying knowledge was already there: GPT-3 could have answered. The problem was that it was trained to complete text, not to be helpful. That's a fundamental misalignment between what the model was optimized for (predict the next word) and what users actually need (understand my intent and respond usefully).


The Three-Stage Pipeline

Modern AI assistants — ChatGPT, Claude, Gemini — are all built using a three-stage training pipeline. Each stage transforms the model in a specific way, and you cannot skip any step: a model needs language knowledge before it can learn instruction-following, and it needs instruction-following before preference learning has anything meaningful to optimize.

The Frontier Training Pipeline

📚
PRE-TRAINING

Self-supervised learning on 600B+ words. Model learns "how to speak." (✓ Deep language understanding · ✗ Completes text, not instructions)

📝
SFT (FINE-TUNING)

Humans write 10,000+ ideal Q&A pairs. Model learns "how to follow." (✓ Follows instructions · ✗ May still give suboptimal answers)

⚖️
RLHF

Humans rank outputs; a reward model learns "what we prefer." (✓ Genuinely helpful, safe, and aligned — this creates ChatGPT)


Stage 1: Pre-Training — Building the Foundation

Pre-training is the most computationally expensive phase — and the most passive from a human perspective. No one writes labels; the model learns by predicting missing pieces of text, where the text itself is the "label":

text
Masked Language Modeling (BERT style):
Input:  "The cat sat on the [MASK]."
Target: "mat"
Model prediction: "mat" (87%), "floor" (9%), "chair" (4%)

Causal Language Modeling (GPT style):
Input:  "Paris is the capital of"
Target: "France"
Model prediction: "France" (92%)

After trillions of these predictions, the model has internalized grammar and syntax across many languages, factual knowledge from encyclopedias and textbooks, reasoning patterns from papers and code, and cultural context from books and forums — without anyone telling it what to learn.
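To make "the text itself is the label" concrete, here is a toy sketch of causal language modeling. It is not how a transformer is trained: counting word pairs stands in for gradient descent, and the three-sentence corpus is made up for illustration.

```python
from collections import Counter, defaultdict

# Tiny illustrative corpus (an assumption; real pre-training uses web-scale text).
corpus = "paris is the capital of france . rome is the capital of italy . paris is lovely ."
tokens = corpus.split()

# Self-supervision: every (previous token, next token) pair comes from the text itself.
pairs = list(zip(tokens[:-1], tokens[1:]))

# "Training" here is just counting which token follows which.
counts = defaultdict(Counter)
for prev, nxt in pairs:
    counts[prev][nxt] += 1

def next_token_probs(prev):
    """Return the estimated distribution over the next token given the previous one."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

print(next_token_probs("of"))  # {'france': 0.5, 'italy': 0.5}
print(next_token_probs("is"))
```

No human labeled anything here: the "answers" were produced by shifting the text one position to the left.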

600B+
Words in training data
~$100M
Pre-training cost
Weeks
On thousands of GPUs

What you get after pre-training is a Base Model — extraordinarily capable, but a text-completion machine, not an assistant. Ask it a question and it might just generate three more related questions.


Stage 2: Supervised Fine-Tuning (SFT) — Teaching Intent

SFT is the bridge between "knows a lot" and "can be useful." It's a relatively small dataset (10,000–100,000 examples) compared to pre-training, but it's 100% human-crafted — a real question paired with the ideal answer:

text
HUMAN-WRITTEN QUESTION:
"What's the best smart glasses in 2026?"

HUMAN-WRITTEN IDEAL ANSWER:
"The top options for 2026 are:
 • Ray-Ban Meta Ultra ($549) — lightest, 48MP camera, 40-language translation
 • Xreal Air 3 Ultra ($449) — best AR display, 4K quality, 3-hour battery
 The right choice depends on your use case: Ray-Ban for everyday wear,
 Xreal for productivity and gaming."

What SFT teaches: how to interpret question intent (not just complete text), response formatting (lists, comparisons, steps), appropriate length, and basic safety.
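One common way to implement this (an assumption here, since the article does not go into the mechanics) is to keep the same next-token objective as pre-training but compute the loss only on the answer tokens. In the sketch below, `build_sft_example` and the "User:/Assistant:" template are hypothetical, a whitespace split stands in for a real tokenizer, and `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def build_sft_example(question, answer):
    """Pair a prompt with its ideal answer and mask the prompt out of the loss.

    Hypothetical helper: a whitespace split stands in for a real tokenizer,
    and the "User:/Assistant:" template is illustrative only.
    """
    prompt_tokens = f"User: {question} Assistant:".split()
    answer_tokens = answer.split() + ["<eos>"]
    input_tokens = prompt_tokens + answer_tokens
    # Labels mirror the inputs, except prompt positions are ignored:
    # the model is graded on reproducing the answer, not the question.
    labels = [IGNORE_INDEX] * len(prompt_tokens) + answer_tokens
    return input_tokens, labels

inputs, labels = build_sft_example(
    "How do I bake a chocolate cake?",
    "Preheat the oven to 350°F, then mix the dry ingredients.",
)
for tok, lab in zip(inputs, labels):
    print(f"{tok:12} -> {lab}")
```

Masking the prompt is a design choice rather than a requirement; it keeps the training signal focused on how to answer instead of how to phrase questions.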

What SFT can't teach well: subtle quality differences between two decent answers, complex safety edge cases, and stylistic preferences humans have but can't easily articulate. This is where RLHF comes in.


Stage 3: RLHF — The Secret Sauce

RLHF is technically Reinforcement Learning, but the human-feedback component makes it fundamentally different from how RL is used in games or robotics. Let's break the mechanism into three micro-steps.

Step 3a: Collect Human Comparisons

Instead of writing ideal answers (hard and slow), human evaluators see two model responses to the same prompt and answer one simple question: which is better?

Prompt: What's the best smart glasses in 2026?

Answer A — ✅ Preferred

"Ray-Ban Meta Ultra ($549) is best for everyday use — 48g, 48MP camera, 40-language real-time translation. For AR productivity, Xreal Air 3 Ultra ($449) offers a better 4K display."

Answer B — ✗ Not Preferred

"Smart glasses have a fascinating history going back to 2013 when Google Glass was introduced. The technology has evolved significantly and there are many options…"

Why is comparing easier than writing? It's faster (30 seconds vs 10+ minutes to author an ideal answer), more consistent ("A is better than B" is less subjective than writing perfection from scratch), and it captures subtle preferences — tone, confidence, appropriate caveats — that are hard to write rules for but easy to recognize.
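A comparison can be stored as a very simple record. The layout below is one possible sketch, not a documented schema: the `Comparison` class, its field names, and `ranking_to_comparisons` are hypothetical. It also shows how a single ranking of several responses can be expanded into pairwise comparisons, which yields more training signal per annotation.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Comparison:
    prompt: str
    chosen: str    # the response the annotator preferred
    rejected: str  # the response the annotator ranked lower

def ranking_to_comparisons(prompt, ranked_responses):
    """Expand a best-to-worst ranking into all pairwise comparisons."""
    # combinations() preserves input order, so `better` always outranks `worse`.
    return [
        Comparison(prompt, better, worse)
        for better, worse in combinations(ranked_responses, 2)
    ]

ranked = [
    "Direct recommendation with prices and trade-offs",
    "Generic list of brands",
    "History of Google Glass",
]
comparisons = ranking_to_comparisons("What's the best smart glasses in 2026?", ranked)
print(len(comparisons))  # 3 pairs from a ranking of 3 responses
```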

Step 3b: Train the Reward Model

From millions of these comparisons, a separate neural network is trained — the Reward Model. Its job: given any response, output a single number for how much humans would like it.

Direct, helpful, accurate, well-structured: 92%
Somewhat relevant, partially addresses the question: 58%
Historical tangent, doesn't answer the question: 31%
Harmful, dangerous, or policy-violating content: 5%

The Reward Model is now a proxy for "what humans want." It can score any response instantly, without a human reading it — and that's the key that makes RLHF scale.
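How does a network learn a score from "A is better than B"? A widely used formulation (assumed here; the article does not give the loss) pushes the preferred response's score above the rejected one by minimizing -log(sigmoid(r_chosen - r_rejected)). The sketch below swaps the neural network for a linear model over two made-up features, so its scores are unbounded rather than percentages.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    """Linear stand-in for a neural reward model: one scalar per response."""
    return sum(w * f for w, f in zip(weights, features))

# Made-up features per response: (answers_the_question, off_topic_tangent).
# Each pair is (features of the chosen response, features of the rejected one).
pairs = [
    ((1.0, 0.0), (0.0, 1.0)),
    ((1.0, 0.2), (0.3, 0.9)),
    ((0.8, 0.1), (0.2, 0.7)),
]

weights = [0.0, 0.0]
learning_rate = 0.5
for _ in range(200):
    for chosen, rejected in pairs:
        margin = reward(weights, chosen) - reward(weights, rejected)
        # Gradient of -log(sigmoid(margin)) w.r.t. the margin is -(1 - sigmoid(margin)).
        grad_scale = -(1.0 - sigmoid(margin))
        for i in range(len(weights)):
            weights[i] -= learning_rate * grad_scale * (chosen[i] - rejected[i])

direct_answer = reward(weights, (1.0, 0.0))
tangent = reward(weights, (0.0, 1.0))
print(direct_answer > tangent)  # True: the preferred style now scores higher
```

Only the difference between two scores matters to this loss, which is why comparisons are enough and no annotator ever has to assign an absolute grade.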

Step 3c: Optimize the LLM with PPO

Now we have a "teacher" (the Reward Model) that can score anything. We use it to improve the main LLM with PPO (Proximal Policy Optimization), a reinforcement-learning algorithm:

The PPO Training Loop

🤖
LLM GENERATES

The model produces a response to a random prompt.

⚖️
REWARD MODEL SCORES

The response gets a grade from 0.0 to 1.0.

📊
PPO UPDATES

Weights shift: high score → repeat the behavior; low score → avoid it.

🔒
KL CONSTRAINT

The model is penalized if its outputs drift too far from those of the SFT model, which preserves its original language ability.

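Why PPO specifically? It's a "safe" RL algorithm: its clipped objective limits how far the model's output probabilities can move in a single update, which keeps training stable. That alone does not stop the model from exploiting the Reward Model, that is, finding ways to score high that don't reflect real quality. Guarding against that is the job of the separate KL penalty, which ties the model to its SFT starting point.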
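The "limit the step" idea is usually expressed through PPO's clipped objective, which compares how likely a response is under the updated policy versus the policy that generated it. The sketch below is a simplified single-sample version: `clip_eps=0.2` is an assumed (though commonly used) value, and `advantage` stands for how much better than expected the reward turned out to be.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample PPO surrogate objective (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)  # how much more likely the response became
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any incentive to push the ratio beyond the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)

# A well-rewarded response (positive advantage) whose probability rose by 50%:
# the objective is capped as if it had risen by only 20%.
print(ppo_clipped_objective(math.log(0.15), math.log(0.10), advantage=1.0))

# A small change stays inside the clip range and is left untouched.
print(ppo_clipped_objective(math.log(0.11), math.log(0.10), advantage=1.0))
```

Once the ratio leaves the clip range, the objective stops improving, so the gradient for that sample vanishes and the update cannot run away in one step.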

⚠️
Reward Hacking: A Real Problem

Without constraints, LLMs can "game" the Reward Model. Early RLHF experiments produced models that wrote extremely long, confident-sounding responses — because evaluators initially rated those higher, regardless of accuracy. This is why the KL-divergence penalty (don't stray too far from the SFT model) is critical.
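In practice the KL penalty is often folded directly into the reward that PPO optimizes. Here is a minimal sketch, assuming a per-response penalty of the form score - beta * (log p_policy - log p_SFT); the function name, the coefficient `beta`, and all numbers are illustrative.

```python
def kl_shaped_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """Reward model score minus a penalty for drifting from the SFT model.

    logp_policy / logp_sft: log-probability of the sampled response under the
    current policy and under the frozen SFT model. Their difference is a
    single-sample estimate of the KL divergence between the two.
    """
    return rm_score - beta * (logp_policy - logp_sft)

# A response the SFT model also finds plausible: small penalty.
plausible = kl_shaped_reward(rm_score=0.90, logp_policy=-42.0, logp_sft=-44.0)

# A reward-hacked response: slightly higher score, but far from the SFT model.
hacked = kl_shaped_reward(rm_score=0.95, logp_policy=-30.0, logp_sft=-80.0)

print(plausible, hacked)
print(plausible > hacked)  # True: the penalty outweighs the small score gain
```

The choice of `beta` is a trade-off: too small and the model can game the Reward Model, too large and it barely moves away from the SFT model.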


SFT vs. RLHF: What's the Difference?

Two Ways to Teach

📝 SFT is like… a Teacher Grading an Exam

The teacher writes the answer key before the exam; the student copies the teacher's style. Clear, but limited — you can only be as good as the pre-written examples. SFT teaches by imitation.

🍽️ RLHF is like… a Chef Getting Tasted

The chef tries two versions of a dish; the taster says "A is better." Over thousands of tastings, intuition develops that goes beyond any written recipe. RLHF teaches by preference — a much richer signal.


Constitutional AI: The Next Evolution

RLHF requires thousands of human evaluators working constantly — expensive, and subject to evaluator fatigue, differing values, and cultural bias. Anthropic developed an extension called Constitutional AI (CAI), used in Claude.

RLHF vs. Constitutional AI

Traditional RLHF

Humans compare responses and pick the better one → the Reward Model learns from these comparisons.

Constitutional AI

A written "constitution" (principles like be honest, avoid harm) guides AI-generated critiques → the AI provides most of its own feedback; humans only validate high-level principles.

💡
Example Constitutional Principle

"Choose the response that is less likely to contain false information, and more likely to cite reliable sources."

CAI significantly reduces the human labor required while making the model's values more explicit and auditable — you can see exactly what principles it was trained to uphold.
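The critique-and-revise idea can be sketched as a loop. This is a schematic under stated assumptions, not Anthropic's implementation: `generate` is a hypothetical stand-in for a call to a language model, stubbed with canned strings so the example runs offline, and the principles are shortened, illustrative wordings.

```python
CONSTITUTION = [
    "Choose the response that is less likely to contain false information.",
    "Choose the response that avoids helping with clearly dangerous requests.",
]

def generate(prompt):
    """Hypothetical stand-in for an LLM call; returns canned text for the demo."""
    if prompt.startswith("Critique"):
        return "The draft states a guess as fact."
    if prompt.startswith("Revise"):
        return "I'm not certain, but here is what is known: ..."
    return "The answer is definitely 42."

def constitutional_revision(user_prompt, principles):
    """Draft a response, then critique and revise it once per principle."""
    response = generate(user_prompt)
    for principle in principles:
        critique = generate(
            f"Critique this response using the principle: {principle}\n{response}"
        )
        response = generate(
            f"Revise the response to address this critique: {critique}\n{response}"
        )
    return response

revised = constitutional_revision("What is the answer?", CONSTITUTION)
print(revised)
```

The revised answers can then be used as fine-tuning data, and the same principles can guide AI-generated comparisons that take the place of human rankings when training the reward model.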


Real-World Impact: Before and After RLHF

Dimension | Before RLHF (Base/SFT) | After RLHF
Intent | Completes text patterns | Addresses the user's actual goal
Safety | May produce harmful content if prompted | Refuses clearly dangerous requests
Honesty | Confidently makes things up | More likely to say "I don't know"
Clarity | Verbose, rambling, unfocused | Concise, structured, actionable
Tone | Inconsistent, can be rude | Consistently helpful and polite

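The InstructGPT paper (the research behind ChatGPT) showed that human labelers preferred outputs from a 1.3B-parameter RLHF-trained model over outputs from the 175B-parameter GPT-3, even though the RLHF model was more than 100× smaller. At equal size (175B), the RLHF-trained outputs were preferred about 85% of the time, and 71% of the time against a few-shot-prompted GPT-3. RLHF doesn't just improve safety; it makes the model fundamentally more useful.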


The Human Workforce Behind RLHF

RLHF requires something we rarely talk about: an enormous amount of human labor.

The Hidden Human Cost

SFT Phase

Hundreds of specialist writers crafting high-quality Q&A pairs — often domain experts (lawyers, doctors, coders) for specific capabilities.

RLHF Phase

Thousands of annotators doing millions of comparisons. OpenAI's outsourcing of data labeling to Kenyan workers earning under $2/hour became controversial in 2023 after a Time investigation.

Safety Red-Teaming

Specialists who actively try to make the model produce harmful output, so those behaviors can be trained out. Psychologically demanding work.

Ongoing Evaluation

Continuous post-deployment evaluation to catch regressions and find new training opportunities.

This human labor is one of the biggest differentiators between frontier labs — OpenAI, Anthropic, and Google have each invested hundreds of millions in annotation infrastructure, and it's a significant competitive moat.


Try It Yourself: Seeing RLHF's Effects

You can explore the difference between base and RLHF models directly through APIs:

python
from openai import OpenAI
client = OpenAI()

ambiguous_prompt = "Baking chocolate cake"

# gpt-3.5-turbo-instruct behaves more like a base/SFT model:
# it treats the input as text to complete.
base_style = client.completions.create(
    model="gpt-3.5-turbo-instruct", prompt=ambiguous_prompt, max_tokens=150
)
print("Base-style completion:\n", base_style.choices[0].text)

# gpt-4o is RLHF-trained — it treats the same input as an implicit question.
chat = client.chat.completions.create(
    model="gpt-4o", messages=[{"role": "user", "content": ambiguous_prompt}]
)
print("RLHF chat response:\n", chat.choices[0].message.content)

Things to watch for:

  • The base-style completion will tend to continue the text "Baking chocolate cake" as if writing an article.
  • The RLHF chat model will offer a recipe or ask what you want.
  • Asked for the population of Mars or the results of the 2028 Olympics, RLHF models will usually say "I can't know that", reflecting trained honesty about uncertainty.
  • System prompts dramatically change tone in RLHF models (base models can't follow them at all).

Key Takeaways

01
Pre-training ≠ Useful AI

A model that perfectly predicts text is not automatically a helpful assistant. Alignment requires separate training.

02
SFT teaches format; RLHF teaches preference

SFT shows the model what an answer looks like. RLHF tells it which of two answers is actually better — a richer, more scalable signal.

03
The Reward Model is the key abstraction

It lets human preferences scale to billions of training steps without a human present for each one.

04
RLHF is usefulness, not just safety

InstructGPT showed a 1.3B-parameter RLHF model beating the 175B-parameter GPT-3, a base model more than 100× larger, in human preference ratings.

05
Human labor is fundamental

There is no AI alignment without people making millions of careful judgments — expensive, ethically complex, and under-discussed.


What's Next in the Series

💡
Next: Prompt Engineering

You now understand how AI models are built and trained. Next: how do you talk to them effectively? Learn the anatomy of a perfect prompt (Role, Context, Task, Format), few-shot and chain-of-thought techniques, system vs. user prompts, and why "ChatGPT is getting dumber" is almost never true.


Mohamed Hamed

20 years building production systems — the last several deep in AI integration, LLMs, and full-stack architecture. I write what I've actually built and broken. If this was useful, the next one goes to LinkedIn first.
