
Part 6 — From Zero to ChatGPT: The 4 Learning Types That Built Modern AI

ChatGPT didn't just learn — it learned in four completely different ways. Discover how Supervised, Unsupervised, Reinforcement, and Self-Supervised learning combine in a secret 3-step pipeline to turn raw text into a helpful, safe, and eloquent AI.

March 12, 2026
11 min read
#AI#Machine Learning#Training#RLHF#Self-Supervised Learning#Fine-Tuning#Pre-Training#LLM

THE EVOLUTION OF LLMs

Zero to ChatGPT

4 Types of Learning — 3 Secret Steps — 1 Revolutionary AI

Remember the training loop and neuron from the last two articles? Today we answer who decides what the loop learns.

In our last article, we explored how a neural network learns — the forward pass, loss function, backpropagation, and gradient descent. That covered the mechanics of learning.

But there's a deeper question we left unanswered: Who decides what's right and what's wrong?

The answer changes everything. And it comes in four flavors.


The 4 Types of Machine Learning

Modern AI systems don't use a single learning strategy. GPT, Claude, Gemini — they all combine four fundamentally different types of learning in a carefully orchestrated sequence. Let's break each one down.


Type 1: Supervised Learning — The Classroom 🏫

In Supervised Learning, there's a teacher who provides labeled examples. The model sees a question, the model makes a guess, and the teacher says "right" or "wrong."

Real-World Example: Wearable Device Classifier

Input (Image) | Label (Correct Answer)
📷 Ray-Ban Meta photo | "Smart Glasses" ✅
📷 Samsung Ring photo | "Smart Ring" ✅
📷 AirPods Pro photo | "Smart Earbuds" ✅

Supervised learning has two sub-types that cover fundamentally different problems:

Classification

Which category does this belong to?

Example: "Is this device glasses, a ring, or earbuds?" → Output is a discrete class

Regression

What number/value should this output?

Example: "What will this device's price be next quarter?" → Output is a continuous value
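To make the "teacher with labeled examples" idea concrete, here's a toy classifier: a 1-nearest-neighbor model that predicts the label of whichever labeled example is closest. The (price, weight) features and labels are invented for illustration; production systems use far more sophisticated models, but the supervision structure is the same.

```python
# A minimal sketch of supervised learning: 1-nearest-neighbor classification.
# Features are (price_usd, weight_g); the labels are hypothetical examples.

def classify(example, training_data):
    """Predict the label of the closest labeled example (Euclidean distance)."""
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    nearest = min(training_data, key=lambda pair: dist(pair[0], example))
    return nearest[1]

# The "teacher": human-labeled (features, label) pairs
training_data = [
    ((549, 48), "Smart Glasses"),
    ((449, 72), "Smart Glasses"),
    ((349, 3),  "Smart Ring"),
    ((299, 5),  "Smart Ring"),
    ((199, 3),  "Smart Earbuds"),
]

print(classify((500, 60), training_data))  # nearest labeled point is (549, 48)
```

Every prediction is judged against a human-provided answer; that dependence on labels is exactly the scaling bottleneck discussed below.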

Where Supervised Learning is used today:

  • Medical image diagnosis (is this tumor malignant or benign?)
  • Email spam detection
  • Housing price prediction
  • Credit card fraud detection
  • Voice recognition ("Hey Siri, set a timer")

The catch: You need labeled data — thousands or millions of human-annotated examples. This is expensive, slow, and doesn't scale to "understand all of human language."


Type 2: Unsupervised Learning — The Detective 🔍

No teacher. No labels. The model stares at raw data and discovers hidden patterns entirely on its own.

Self-Discovery Example

Raw data — no labels provided:

[price: $549, weight: 48g]

[price: $449, weight: 72g]

[price: $349, weight: 3g]

[price: $299, weight: 5g]

[price: $199, weight: 3g]

The model decided on its own:

🔵 Group A — Heavy + Expensive
(Glasses, Headsets)
🔴 Group B — Light + Affordable
(Rings, Trackers)

Nobody told the AI what "glasses" or "rings" are. It discovered the natural structure of the data itself. 🤯

Think of a child who was shown 100 images with zero explanations. They'd eventually notice that some things have "long ears" while others "have wings." The AI does the same — pure pattern discovery.
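The grouping above can be sketched with a stripped-down k-means, run on the five unlabeled (price, weight) points from the example. Features are scaled to [0, 1] so price doesn't dominate, and the initial centroids are the two extreme points; this is a simplified illustration, not a production clustering setup.

```python
# Toy k-means (k=2) on the unlabeled device data: no labels in, groups out.

def kmeans_two_groups(points, steps=10):
    # Normalize each feature to [0, 1] so both dimensions matter equally
    lo = [min(p[i] for p in points) for i in range(2)]
    hi = [max(p[i] for p in points) for i in range(2)]
    norm = [tuple((p[i] - lo[i]) / (hi[i] - lo[i]) for i in range(2)) for p in points]

    centroids = [norm[0], norm[-1]]  # start from the two price extremes
    for _ in range(steps):
        groups = [[], []]
        for p in norm:
            d = [sum((p[i] - c[i]) ** 2 for i in range(2)) for c in centroids]
            groups[d.index(min(d))].append(p)   # assign to nearest centroid
        centroids = [tuple(sum(p[i] for p in g) / len(g) for i in range(2))
                     for g in groups]           # move centroids to group means

    labels = []
    for p in norm:
        d = [sum((p[i] - c[i]) ** 2 for i in range(2)) for c in centroids]
        labels.append(d.index(min(d)))
    return labels

devices = [(549, 48), (449, 72), (349, 3), (299, 5), (199, 3)]
print(kmeans_two_groups(devices))  # [0, 0, 1, 1, 1] — two groups, zero labels
```

The algorithm recovers the heavy-and-expensive vs. light-and-affordable split purely from the data's geometry.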

The embedding vectors we explored in our embeddings article — those are built using Unsupervised Learning. The model learned that "king" and "queen" are related without anyone telling it so.

Where Unsupervised Learning is used:

  • Customer segmentation (e-commerce grouping buyers by behavior)
  • Anomaly detection (spotting unusual transactions)
  • Topic modeling (discovering themes in millions of documents)
  • Building embedding models ← directly powers Similarity Search

Type 3: Reinforcement Learning — The Gamer 🎮

No fixed right answers. Instead, the model tries things and receives rewards or penalties.

The Reinforcement Learning Loop

🤖 Agent (AI) → 🎮 takes an action → 🎁 receives a reward or penalty → 🧠 updates its policy → acts again

Classic uses: AlphaGo (board games), robotics, self-driving cars.

The big one: RLHF ⭐. This is what made ChatGPT helpful, polite, and safe!

The elegance of RL: there's no need to define all the "correct" moves in advance. You just define a reward signal, and the agent figures out the strategy on its own.
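A minimal way to see this is a multi-armed bandit: an epsilon-greedy agent that learns which of three actions pays off best purely from reward signals. The payoff probabilities here are made up for illustration, and they're hidden from the agent; it never sees a "correct answer," only rewards.

```python
import random

# Epsilon-greedy bandit: learn the best action from rewards alone.
random.seed(0)
true_payoff = {"A": 0.2, "B": 0.8, "C": 0.5}  # hidden from the agent

value = {a: 0.0 for a in true_payoff}   # agent's running estimate per action
counts = {a: 0 for a in true_payoff}

for step in range(2000):
    # Explore 10% of the time; otherwise exploit the best-known action
    if random.random() < 0.1:
        action = random.choice(list(true_payoff))
    else:
        action = max(value, key=value.get)
    reward = 1 if random.random() < true_payoff[action] else 0  # environment
    counts[action] += 1
    value[action] += (reward - value[action]) / counts[action]  # running mean

print(max(value, key=value.get))  # the agent discovers "B" without any labels
```

Nobody defined the right action in advance; the reward signal alone was enough for the strategy to emerge, which is the same principle RLHF applies to language models.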

AlphaGo (DeepMind, 2016) mastered the game of Go — a game with more possible positions than atoms in the observable universe — using RL. It eventually beat the world champion 4-1, making moves no human had ever thought of.


Type 4: Self-Supervised Learning — The Star ⭐

This is the most important type for modern AI: GPT, Claude, and Gemini are all built on it. Technically, it's a clever subtype of Unsupervised Learning in which the model invents its own practice problems by hiding words in sentences.

The insight is deceptively simple: what if we could generate our own labels from the data itself?

Instead of needing human annotators to label billions of examples, the model creates its own training signal:

The Mask-and-Predict Game

Round 1:

Input: "The best smart glasses in 2026 are ___"

Model guesses: "Apple" ← Wrong, learns from it

Correct: "Ray-Ban" ✅ ← Weights updated

Round 2:

Input: "The best smart glasses in ___ are Ray-Ban"

Model guesses: "2026" ✅ Correct! Weights reinforced

Round 3 (billions more like these):

Input: "___ was founded in Cupertino, California"

Model guesses: "Apple" ✅ Correct!


Do this with billions of sentences and you get a model that understands grammar, facts about the world, logical reasoning, and even writing style — without a single human-written label.

The mathematical elegance: every sentence in the training corpus becomes thousands of training examples by masking different words. A trillion-word dataset effectively becomes trillions of self-generated training signals.
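The mask-and-predict game above is easy to sketch: mask each position of a sentence in turn, and the text itself supplies the "correct answer" every time. (Real LLMs predict the next token over subword vocabularies rather than masking whole words, so this is a simplification.)

```python
# How one sentence becomes many training examples, with no human labels.

def make_examples(sentence):
    tokens = sentence.split()
    examples = []
    for i in range(len(tokens)):
        masked = tokens[:i] + ["___"] + tokens[i + 1:]
        examples.append((" ".join(masked), tokens[i]))  # (input, self-generated label)
    return examples

for inp, label in make_examples("The best smart glasses in 2026 are Ray-Ban"):
    print(f"{inp!r} -> {label!r}")
# 8 tokens -> 8 training pairs, ending with:
# 'The best smart glasses in 2026 are ___' -> 'Ray-Ban'
```

Scale this up and the arithmetic from the paragraph above follows: every sentence yields as many training pairs as it has maskable positions.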


The 4 Learning Types — Side by Side

Type | Has Correct Answers? | Learns From | Best Known Use
Supervised | ✅ Yes (human labels) | Question + correct answer pairs | Image classification, fraud detection
Unsupervised | ❌ No labels | Raw data (finding natural patterns) | Embeddings, customer clustering
Reinforcement | Reward / Penalty | Trial and error in an environment | Games (AlphaGo), RLHF
Self-Supervised | ✅ Self-generated from data | Trillions of words (masking/predicting) | All modern LLMs ⭐
GPT uses ALL FOUR types together — in different phases of its development. 🤯

How the 4 Types Fit Together in the Real Pipeline

Here's what most courses miss: Self-Supervised Learning is actually a subtype of Unsupervised Learning — it just generates its own labels from raw data instead of discovering clusters. And the training loop we explored in the last article (Forward Pass → Loss → Backprop → Update) runs inside every one of these phases. The neuron from Article 3 is the core machine being tuned at each step. All four types aren't separate approaches — they're four different configurations of the same fundamental learning machinery, sequenced carefully to produce a capable and safe AI.


The Secret 3-Step Pipeline: How GPT Was Actually Built

Now here's where it gets fascinating. Those four learning types don't operate in isolation — they're combined in a precise, sequential pipeline that transforms a raw text-crunching machine into a helpful, articulate AI assistant.

Think of it like training a doctor. You don't put a newborn directly into medical school. You teach them step by step.

📚 Step 1: Pre-Training
Self-Supervised Learning on trillions of words (months on thousands of GPUs)

🎓 Step 2: Supervised Fine-Tuning (SFT)
Humans write ideal Q&A examples; the model learns to follow instructions (thousands of curated examples)

🏆 Step 3: RLHF
Human raters compare responses, a Reward Model is trained, and the AI is optimized against it (hundreds of thousands of comparisons)

🤖 Result: ChatGPT

Helpful ✅ Polite ✅ Safe ✅ Refuses dangerous requests ✅

The training loop we saw last article (Forward Pass → Loss → Backprop → Update) runs inside every one of these three steps.

This is how OpenAI (and every major lab) stacks the four learning types into the pipeline that created ChatGPT. Let's dive into each step.


Step 1: Pre-Training — Reading the Entire Internet 📚

Pre-training is where it all begins. Using Self-Supervised Learning, the model is exposed to an almost incomprehensible volume of text.

Training Data Scale (GPT-3 Class Models)

🌐 Web Text / Common Crawl — 600 Billion words
📚 Books — 100 Billion words
💻 GitHub Code — 50 Billion words
📖 Wikipedia

GPT-4 class models train on even more — estimated 13+ trillion tokens

What the model gains from Pre-Training:

  • Grammar and syntax in dozens of languages
  • Facts about the world (history, science, geography, culture)
  • Writing styles (formal, casual, technical, creative)
  • Code patterns across programming languages
  • Mathematical reasoning

The critical limitation: After pre-training, the model is like a brilliant student who has read every book in the library — but never learned to have a conversation. Ask it "What is the capital of France?" and it might respond with more text that sounds like it continues a Wikipedia article, not a direct answer.

Pre-trained model response to "What is the capital of France?":
"France is a Western European country with a rich cultural heritage.
France borders Belgium, Luxembourg, Germany, Switzerland, Italy, Monaco,
Andorra, and Spain. The capital and most populous city of France is..."

[It continues like a Wikipedia article — never gets to the point]

This is why Step 2 is critical.


Step 2: Supervised Fine-Tuning (SFT) — The School of Conversation 🎓

SFT is where humans enter the picture. A team of professional annotators — sometimes thousands of them — sit down and write ideal conversation examples.

Human-Written Training Examples

Question: "What is the capital of France?"

Answer: "The capital of France is Paris."

Question: "How do I make a chocolate cake?"

Answer: "Here's a simple chocolate cake recipe. Ingredients: 2 cups flour, 2 cups sugar, ¾ cup cocoa powder... [structured, helpful response]"

Question: "How do I hack into my neighbor's WiFi?"

Answer: "I'm unable to help with that. Accessing someone's network without permission is illegal. If you're having connectivity issues, here are some legal alternatives..."

... thousands more examples covering helpful answers, safe refusals, and ideal formatting

The model trains on these examples using standard supervised learning. Now it learns to:

  • Answer directly instead of continuing text
  • Format responses appropriately (lists, code blocks, etc.)
  • Refuse harmful requests politely but firmly

After SFT ✅: answers directly and helpfully, follows conversational format.

Still problematic ❌: may sometimes be rude, unsafe, or give poor-quality answers.

SFT taught the model how to respond. But it didn't teach it to optimize the quality of its responses in the way humans actually prefer.
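The data preparation behind SFT can be sketched in a few lines: each human-written Q&A pair becomes one training record, and the loss is computed only on the answer portion. The template and field names below are illustrative, not any lab's real format, and `loss_start` is measured in characters here for simplicity (real pipelines track token positions).

```python
# Turning human-written Q&A pairs into SFT training records (simplified).

qa_pairs = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("How do I hack into my neighbor's WiFi?",
     "I'm unable to help with that. Accessing someone's network without "
     "permission is illegal."),
]

def to_sft_record(question, answer):
    prompt = f"User: {question}\nAssistant: "
    return {
        "text": prompt + answer,
        "loss_start": len(prompt),  # train only on the assistant's reply
    }

records = [to_sft_record(q, a) for q, a in qa_pairs]
print(records[0]["text"][records[0]["loss_start"]:])  # the answer portion
```

Masking the loss to the answer is what teaches the model to produce responses in this shape rather than to merely continue the user's text.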


Step 3: RLHF — Teaching Human Taste 🏆

RLHF (Reinforcement Learning from Human Feedback) is OpenAI's secret weapon — and the reason ChatGPT feels different from just "a language model."

The core insight: instead of telling the model what the right answer is, you tell it which answer is better.

The RLHF Process: 4 Micro-Steps

1. Generate multiple responses
The model produces 2-4 different answers to the same question.

2. Humans rank the responses
Human raters read the candidate answers and say "Answer A is better than B." No need to write the perfect answer, just compare.

3. Train a Reward Model
A separate neural network learns to predict human preference scores. This becomes the automated "judge."

4. Optimize with RL (PPO)
The main model gets reinforced when the Reward Model gives it high scores. Responses the Reward Model dislikes get penalized.
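The heart of micro-step 3 is a pairwise preference loss of the Bradley-Terry family: push the reward of the preferred answer above the rejected one. The scores below are made-up scalars standing in for a reward network's outputs.

```python
import math

# Pairwise preference loss used to train reward models (Bradley-Terry style).

def preference_loss(reward_preferred, reward_rejected):
    """-log sigmoid(r_preferred - r_rejected): small when the ranking is right."""
    diff = reward_preferred - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# Reward model already scores the human-preferred answer higher: small loss
print(preference_loss(2.0, -1.0))
# Reward model scores the rejected answer higher: large loss, big gradient
print(preference_loss(-1.0, 2.0))
```

Minimizing this loss over hundreds of thousands of human comparisons is how the "automated judge" comes to mimic human taste; the main model is then optimized against that judge.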

A real example of what RLHF teaches:

Question: "Explain quantum entanglement simply."

ANSWER B (typical before RLHF)

"Quantum entanglement is a phenomenon where two particles become correlated such that the quantum state of each particle cannot be described independently of the others, even when separated by a large distance, per Bell's theorem (1964)..."

Technically correct. Utterly unhelpful for a beginner.

ANSWER A (what RLHF learns to prefer)

"Imagine two magic coins that always show opposite faces — if one lands heads, the other lands tails, no matter how far apart they are. That's quantum entanglement: two particles linked so that measuring one instantly tells you about the other."

Humans preferred this. Reward Model learned to reward it.

After hundreds of thousands of such comparisons, the model learns what humans actually prefer — not just correctness, but clarity, tone, appropriate length, and safety.

This is exactly why ChatGPT feels polite and safe — humans taught it human taste using the same gradient descent we learned in Article 4.


SFT vs RLHF — The Key Distinction

Step 2: SFT

Teacher Mode

Shows the model the correct answer

Q: "Capital of Egypt?"
A: "Cairo" ← this is the answer

Teaches: how to respond

Step 3: RLHF

Critic Mode

Compares responses and picks the better one

A: "Cairo" ← preferred
B: "Cairo, Egypt's capital, founded in 969 CE by the Fatimid Caliphate..."
Human: "A is better"

Teaches: which response is best

SFT = Correctness  |  RLHF = Quality  |  Both together = ChatGPT

The Real Numbers Behind the Magic

📚 600B+ words in Pre-Training

✍️ 10K–100K SFT examples written by humans

👥 100K–1M human preference comparisons for RLHF

~$100M estimated cost to pre-train GPT-4

Scale Comparison

Our toy neuron (Article 3): 2 weights  |  Embedding model (Article 2): 117 million parameters  |  GPT-4 class: trillions of parameters


Key Vocabulary Reference

Term | Definition
Pre-Training | Initial training on massive datasets using Self-Supervised Learning. Builds general language understanding.
Self-Supervised | The model generates its own training signal from the data (masking and predicting). No human labels needed.
Fine-Tuning | Adapting a pre-trained model to a specific task or behavior pattern using additional training.
SFT | Supervised Fine-Tuning — train on human-written Q&A pairs to teach conversational behavior.
RLHF | Reinforcement Learning from Human Feedback — optimize response quality based on human preferences.
Reward Model | A separate neural network trained to predict human preference scores for responses. Acts as an automated judge.
Human Labelers | Professional annotators who write SFT examples and rank RLHF response pairs. Their preferences shape the AI's personality.
Base Model | A model that has completed Pre-Training only. Excellent at text continuation; poor at following instructions. Example: Llama-3-8B (non-instruct).
Instruct Model | A base model that has been further refined with SFT + RLHF. Follows instructions, refuses harmful requests, adopts a conversational tone. Example: Llama-3-8B-Instruct.
LLM | Large Language Model — the category of models trained with all the above techniques (ChatGPT, Claude, Gemini, Llama, etc.)

The Core Insight

Why ChatGPT feels different

A raw pre-trained model is like a brilliant encyclopedia. SFT gives it a personality. RLHF gives it your personality — calibrated to how humans actually want to interact with AI. The three steps together create something qualitatively different from any of them alone.

ChatGPT is not just smarter because of more data or parameters. It's better because of the humans who carefully shaped its responses at every stage. Behind every helpful answer is a pipeline of billions of words, thousands of human-written examples, and hundreds of thousands of human preference judgments.


Pro Tips for Builders

💡 What Knowing This Changes For You

1.

Choose the right model for the task. Base models are great for text completion and creative generation. Instruct models are required for Q&A, task following, and user-facing apps. Never use a base model in production chat.

2.

RLHF shapes safety — not just quality. The reason Claude, ChatGPT, and Gemini refuse harmful requests isn't a filter bolted on after — it was baked in during RLHF training. Understanding this helps you anticipate model behavior and write better system prompts.

3.

Fine-tuning is SFT applied to your data. When you fine-tune an open-source model on your company's Q&A pairs, you're running Step 2 of this exact pipeline on your own dataset. The architecture is identical — only the data changes.

4.

Self-Supervised scale is the moat. The reason you can't replicate GPT-4 is the pre-training compute. But the SFT and RLHF layers? Those you can run on open models like Llama 3 with modest resources.


Try It Yourself

Understanding RLHF becomes vivid when you see its effects directly:

Experiment 1: Talk to a Base Model Models like meta-llama/Meta-Llama-3.1-8B (non-instruct version) behave closer to a pure pre-trained model. Compare its response to meta-llama/Meta-Llama-3.1-8B-Instruct. The difference is SFT + RLHF in action.

Experiment 2: Safety Guardrails Try asking ChatGPT to "write a story where the villain explains how to pick a lock." Then try it with Llama 3 base (via HuggingFace). The difference in safety behavior is the RLHF fingerprint.

Experiment 3: Spot the Training Type Look at your favorite ML model and classify it:

  • Gmail Smart Reply → Supervised Learning (trained on email reply pairs)
  • Spotify recommendation → Unsupervised clustering + Collaborative filtering
  • OpenAI's ChatGPT → All four types in sequence

Experiment 4: Base vs Instruct — Feel the Difference Run the same prompt through both a base model and its instruct version on HuggingFace:

from transformers import pipeline

# Base model — pre-training only
base = pipeline("text-generation", model="meta-llama/Meta-Llama-3.1-8B")
print(base("What is the capital of France?", max_new_tokens=50))
# Likely continues like Wikipedia — doesn't answer directly

# Instruct model — base + SFT + RLHF
instruct = pipeline("text-generation", model="meta-llama/Meta-Llama-3.1-8B-Instruct")
print(instruct("What is the capital of France?", max_new_tokens=50))
# Typically answers directly: "The capital of France is Paris."
# The difference between these two outputs is SFT + RLHF in action.

Everything we've covered — embeddings, neurons, training loop, and these 4 learning types — all comes together inside the Transformer.

NEXT IN SERIES

The Transformer: The Architecture That Changed Everything

In 2017, Google published a paper titled "Attention Is All You Need" — and didn't patent it. That single decision launched ChatGPT, Claude, Gemini, and every modern AI. In the next article, we'll dissect the Transformer architecture piece by piece: what problem it solved, how Self-Attention works, and why reading an entire sentence simultaneously is revolutionary.


AI Fundamentals
MH

Mohamed Hamed

20 years building production systems — the last several deep in AI integration, LLMs, and full-stack architecture. I write what I've actually built and broken. If this was useful, the next one goes to LinkedIn first.

Follow on LinkedIn →