Skip to main content
AI-Developer/AI Fundamentals
Part 6 of 14

Part 6 — From Zero to ChatGPT: The 4 Learning Types That Built Modern AI

ChatGPT didn't just learn — it learned in four completely different ways. Discover how Supervised, Unsupervised, Reinforcement, and Self-Supervised learning combine in a secret 3-step pipeline to turn raw text into a helpful, safe, and eloquent AI.

March 12, 2026
11 min read
#AI#Machine Learning#Training#RLHF#Self-Supervised Learning#Fine-Tuning#Pre-Training#LLM

The 4 Learning Types of Modern AI

ChatGPT is not a single model; it's the result of a precise, sequential pipeline that combines four fundamentally different ways of learning. This is how raw text becomes an assistant.

Primary Objective
Supervised | Unsupervised | Reinforcement | Self-Supervised

Remember the training loop and the neuron from the last two articles? In our last article, we explored how a neural network learns — the forward pass, the loss function, backpropagation, and gradient descent. That covered the mechanics of learning.

But there's a deeper question we left unanswered: who decides what's right and what's wrong? The answer changes everything — and it comes in four flavors.

💡
The Evolution of Intelligence

Modern AI systems combine multiple strategies. GPT, Claude, and Gemini are not just "trained"—they are carefully orchestrated through a sequence of learning paradigms.


The Four Flavors of Machine Learning

Modern AI systems don't use a single learning strategy. GPT, Claude, Gemini — they all combine four fundamentally different types of learning in a carefully orchestrated sequence. Understanding these categories is the first step to understanding how any AI application actually works.

The ML Paradigms

🏫SUPERVISED
  • Metaphor: The Classroom.
  • Signal: Human labels.
  • Use Case: Classification (Spam vs Not Spam).
🔍UNSUPERVISED
  • Metaphor: The Detective.
  • Signal: Natural patterns.
  • Use Case: Clustering (Grouping similar items).
🎮REINFORCEMENT
  • Metaphor: The Gamer.
  • Signal: Reward/Penalty.
  • Use Case: Games (AlphaGo), RLHF.
SELF-SUPERVISED
  • Metaphor: The Star.
  • Signal: Mask-and-Predict.
  • Use Case: Pre-training all LLMs.

Let's break each one down.

Type 1: Supervised Learning — The Classroom 🏫

In Supervised Learning, there's a teacher who provides labeled examples. The model sees a question, makes a guess, and the teacher says "right" or "wrong." Imagine a wearable-device classifier learning from labeled photos:

text
1234
Input (Image)            →   Label (Correct Answer)
📷 Ray-Ban Meta photo    →   "Smart Glasses"  ✅
📷 Samsung Ring photo    →   "Smart Ring"     ✅
📷 AirPods Pro photo     →   "Smart Earbuds"  ✅

Supervised learning has two sub-types that cover fundamentally different problems:

Two Sub-Types

Classification
  • Question: Which category does this belong to?
  • Example: "Is this device glasses, a ring, or earbuds?"
  • Output: A discrete class.
Regression
  • Question: What number/value should this output?
  • Example: "What will this device's price be next quarter?"
  • Output: A continuous value.

Where supervised learning is used today: medical image diagnosis (is this tumor malignant or benign?), email spam detection, housing price prediction, credit card fraud detection, and voice recognition ("Hey Siri, set a timer").

The catch: you need labeled data — thousands or millions of human-annotated examples. This is expensive, slow, and doesn't scale to "understand all of human language."

Type 2: Unsupervised Learning — The Detective 🔍

No teacher. No labels. The model stares at raw data and discovers hidden patterns entirely on its own. Given a pile of unlabeled devices, it might invent its own groupings:

text
12345
Raw data — no labels provided:        The model decided on its own:
[price: $549, weight: 48g]            🔵 Group A — Heavy + Expensive
[price: $449, weight: 72g]      →        (Glasses, Headsets)
[price: $349, weight:  3g]            🔴 Group B — Light + Affordable
[price: $199, weight:  3g]               (Rings, Trackers)

Nobody told the AI what "glasses" or "rings" are — it discovered the natural structure of the data itself. Think of a child shown 100 images with zero explanations: they'd eventually notice that some things have "long ears" while others "have wings." The AI does the same: pure pattern discovery.

The embedding vectors we explored in Part 2 — Embeddings are built this way. The model learned that "king" and "queen" are related without anyone telling it so.

Where unsupervised learning is used: customer segmentation (grouping buyers by behavior), anomaly detection (spotting unusual transactions), topic modeling (discovering themes across millions of documents), and building embedding models — which directly power semantic search.

Type 3: Reinforcement Learning — The Gamer 🎮

No fixed right answers. Instead, the model tries things and receives rewards or penalties, then adjusts its strategy.

The Reinforcement Learning Loop

🤖
AGENT

The AI observes the current state of its environment.

🎮
TAKES ACTION

It picks a move — at first, more or less at random.

🎁
REWARD / PENALTY

The environment returns a score: +1 for good outcomes, −1 for bad.

🧠
UPDATES POLICY

The agent nudges its strategy toward actions that earned reward.

The elegance of RL: you don't need to define all the "correct" moves in advance. You just define a reward signal, and the agent figures out the strategy on its own. AlphaGo (DeepMind, 2016) mastered the game of Go — which has more possible positions than atoms in the observable universe — this way, eventually beating the world champion 4–1 with moves no human had ever conceived. Its uses range from robotics and self-driving cars to the big one: RLHF — the technique that made ChatGPT helpful, polite, and safe.

Type 4: Self-Supervised Learning — The Star ⭐

This is the most important type for modern AI. GPT, Claude, Gemini — all built on this. It's technically a clever subtype of Unsupervised Learning, where the model invents its own practice problems by hiding words in sentences.

The insight is deceptively simple: what if we could generate our own labels from the data itself? Instead of needing human annotators, the model creates its own training signal with a "mask-and-predict" game:

text
123456789
Round 1:  "The best smart glasses in 2026 are ___"
          Model guesses: "Apple"   ← wrong, learns from it
          Correct:       "Ray-Ban" ← weights updated

Round 2:  "The best smart glasses in ___ are Ray-Ban"
          Model guesses: "2026"    ✅ correct, weights reinforced

Round 3:  "___ was founded in Cupertino, California"
          Model guesses: "Apple"   ✅ correct

Do this with billions of sentences and you get a model that understands grammar, world facts, logical reasoning, and even writing style — without a single human-written label. The mathematical elegance: every sentence becomes thousands of training examples by masking different words. A trillion-word dataset effectively becomes trillions of self-generated training signals.


The 4 Learning Types — Side by Side

TypeHas Correct Answers?Learns FromBest Known Use
Supervised✅ Yes (human labels)Question + correct-answer pairsImage classification, fraud detection
Unsupervised❌ No labelsRaw data (finding natural patterns)Embeddings, customer clustering
ReinforcementReward / PenaltyTrial and error in an environmentGames (AlphaGo), RLHF
Self-Supervised✅ Self-generated from dataTrillions of words (masking/predicting)All modern LLMs ⭐
💡
The Big Reveal

GPT uses all four types together — in different phases of its development. 🤯

How the 4 Types Fit Together

Here's what most courses miss: Self-Supervised Learning is actually a subtype of Unsupervised Learning — it just generates its own labels from raw data instead of discovering clusters. And the training loop we explored last article (Forward Pass → Loss → Backprop → Update) runs inside every one of these phases. The neuron from Part 3 is the core machine being tuned at each step. All four types aren't separate approaches — they're four configurations of the same fundamental learning machinery, sequenced carefully to produce a capable and safe AI.


The Secret 3-Step Pipeline: How GPT Was Actually Built

Those four learning types don't operate in isolation — they're combined in a precise, sequential pipeline that transforms a raw text-crunching machine into a helpful, articulate assistant. Think of it like training a doctor: you don't put a newborn directly into medical school. You teach them step by step.

From Raw Text to Assistant

📚
PRE-TRAINING

Type: Self-Supervised. Trillions of words from the web. Builds "World Knowledge." (Months on thousands of GPUs.)

🎓
SFT (FINE-TUNING)

Type: Supervised. 10,000+ human-written examples. Builds "Instruction Following."

🏆
RLHF

Type: Reinforcement. 100,000+ preference rankings. Builds "Human Taste" — and safety.

The training loop from the last article runs inside every one of these three steps. Let's dive into each.


Step 1: Pre-Training — Reading the Entire Internet 📚

Pre-training is where it all begins. Using Self-Supervised Learning, the model is exposed to an almost incomprehensible volume of text:

Training Data Scale (GPT-3 Class Models)
100
Web Text / Common Crawl
60
Books
40
GitHub Code
12
Wikipedia

That's roughly 600 billion words of web text, 100 billion from books, 50 billion of code, and more — and GPT-4 class models train on even more, an estimated 13+ trillion tokens. From this, the model gains grammar and syntax across dozens of languages, facts about the world, writing styles, code patterns, and mathematical reasoning.

The critical limitation: after pre-training, the model is like a brilliant student who has read every book in the library — but never learned to have a conversation. Ask it "What is the capital of France?" and it might just continue like a Wikipedia article instead of answering:

text
1234567
Pre-trained model response to "What is the capital of France?":

"France is a Western European country with a rich cultural heritage.
France borders Belgium, Luxembourg, Germany, Switzerland, Italy, Monaco,
Andorra, and Spain. The capital and most populous city of France is..."

[It continues like an encyclopedia — never gets to the point]

This is why Step 2 is critical.


Step 2: Supervised Fine-Tuning (SFT) — The School of Conversation 🎓

SFT is where humans enter the picture. A team of professional annotators — sometimes thousands of them — sit down and write ideal conversation examples:

text
123456789101112
Q: "What is the capital of France?"
A: "The capital of France is Paris."

Q: "How do I make a chocolate cake?"
A: "Here's a simple recipe. Ingredients: 2 cups flour, 2 cups sugar,
    ¾ cup cocoa powder... [structured, helpful response]"

Q: "How do I hack into my neighbor's WiFi?"
A: "I'm unable to help with that. Accessing someone's network without
    permission is illegal. Here are some legal alternatives..."

... thousands more covering helpful answers, safe refusals, and ideal formatting

Training on these examples with standard supervised learning teaches the model to answer directly instead of continuing text, format responses appropriately, and refuse harmful requests politely.

The State After SFT

After SFT ✅
  • Answers directly and helpfully.
  • Follows a conversational format.
Still Problematic ❌
  • May sometimes be rude, unsafe, or verbose.
  • Gives correct-but-low-quality answers.

SFT taught the model how to respond. But it didn't teach it to optimize the quality of its responses the way humans actually prefer.


Step 3: RLHF — Teaching Human Taste 🏆

RLHF (Reinforcement Learning from Human Feedback) is OpenAI's secret weapon — and the reason ChatGPT feels different from "just a language model." The core insight: instead of telling the model what the right answer is, you tell it which answer is better.

The RLHF Process

✍️
GENERATE

The model produces 2–4 different answers to the same question.

👥
HUMANS RANK

Raters say "Answer A is better than B." No need to write the perfect answer — just compare.

⚖️
REWARD MODEL

A separate neural network learns to predict human preference scores — the automated "judge."

🏆
OPTIMIZE (PPO)

The main model is reinforced when the Reward Model scores it highly, penalized when it doesn't.

Here's a real example of what RLHF teaches:

Question: Explain quantum entanglement simply.

Before RLHF

"Quantum entanglement is a phenomenon where two particles become correlated such that the quantum state of each cannot be described independently, even when separated by a large distance, per Bell's theorem (1964)..."

Technically correct. Utterly unhelpful for a beginner.

After RLHF (preferred)

"Imagine two magic coins that always show opposite faces — if one lands heads, the other lands tails, no matter how far apart they are. That's entanglement: two particles linked so that measuring one instantly tells you about the other."

Humans preferred this. The Reward Model learned to reward it.

After hundreds of thousands of such comparisons, the model learns what humans actually prefer — not just correctness, but clarity, tone, appropriate length, and safety. This is exactly why ChatGPT feels polite and safe: humans taught it human taste, using the same gradient descent we learned in Part 4.


SFT vs. RLHF — The Key Distinction

Two Modes of Teaching

Step 2: SFT — Teacher Mode
  • Shows the model the correct answer.
  • Q: "Capital of Egypt?"A: "Cairo" ← this is the answer.
  • Teaches how to respond.
Step 3: RLHF — Critic Mode
  • Compares responses and picks the better one.
  • A: "Cairo" (preferred) vs B: "Cairo, founded in 969 CE by the Fatimid Caliphate..."
  • Teaches which response is best.

In one line: SFT = Correctness · RLHF = Quality · Both together = ChatGPT.


The Real Numbers Behind the Magic

600B+
Words in Pre-Training
10K–100K
SFT examples by humans
100K–1M
RLHF preference pairs
~$100M
Cost to pre-train GPT-4
💡
Scale Comparison

Our toy neuron (Part 3): 2 weights · An embedding model (Part 2): ~117 million parameters · GPT-4 class: trillions of parameters.


Base Model vs. Instruct Model

This pipeline is exactly why you should never use a raw model for a chat application.

Model Personality

📖BASE MODEL
  • Training: Pre-training only.
  • Behavior: Continues text (Wikipedia style).
  • Result: Q: "What is 2+2?" A: "Addition is a basic..."
  • Example: Llama-3-8B (non-instruct).
🤖INSTRUCT MODEL
  • Training: SFT + RLHF added.
  • Behavior: Answers questions directly.
  • Result: Q: "What is 2+2?" A: "4."
  • Example: Llama-3-8B-Instruct.

Key Vocabulary Reference

TermDefinition
Pre-TrainingInitial training on massive datasets via Self-Supervised Learning. Builds general language understanding.
Self-SupervisedThe model generates its own training signal from the data (mask-and-predict). No human labels needed.
Fine-TuningAdapting a pre-trained model to a specific task or behavior with additional training.
SFTSupervised Fine-Tuning — training on human-written Q&A pairs to teach conversational behavior.
RLHFReinforcement Learning from Human Feedback — optimizing response quality based on human preferences.
Reward ModelA separate network trained to predict human preference scores. Acts as an automated judge.
Base ModelPre-training only. Great at text continuation, poor at following instructions.
Instruct ModelA base model refined with SFT + RLHF. Follows instructions, refuses harm, adopts a conversational tone.
LLMLarge Language Model — the category trained with all the above (ChatGPT, Claude, Gemini, Llama).

The Core Insight

💡
Why ChatGPT Feels Different

A raw pre-trained model is like a brilliant encyclopedia. SFT gives it a personality. RLHF gives it your personality — calibrated to how humans actually want to interact with AI. The three steps together create something qualitatively different from any of them alone. ChatGPT isn't better just because of more data or parameters — it's better because of the humans who shaped its responses at every stage.


Pro Tips for Builders

⚠️
What Knowing This Changes For You
  • 1. Choose the right model for the task. Base models are great for text completion and creative generation. Instruct models are required for Q&A, task-following, and user-facing apps. Never ship a base model in production chat.
  • 2. RLHF shapes safety — not just quality. Claude, ChatGPT, and Gemini refuse harmful requests because it was baked in during RLHF, not bolted on as a filter. Knowing this helps you anticipate behavior and write better system prompts.
  • 3. Fine-tuning is SFT applied to your data. When you fine-tune an open model on your company's Q&A pairs, you're running Step 2 of this exact pipeline on your own dataset. The architecture is identical — only the data changes.
  • 4. Self-supervised scale is the moat. You can't replicate GPT-4's pre-training compute — but the SFT and RLHF layers you can run on open models like Llama 3 with modest resources.

Try It Yourself

Understanding RLHF becomes vivid when you see its effects directly:

  1. Talk to a base model. Compare meta-llama/Meta-Llama-3.1-8B (base) to ...-8B-Instruct. The difference is SFT + RLHF in action.
  2. Temperature vs. safety. Ask ChatGPT to "write a story where the villain explains how to pick a lock," then try the same with a base model. The gap in safety behavior is the RLHF fingerprint.
  3. Spot the training type. Classify your favorite models: Gmail Smart Reply → Supervised; Spotify recommendations → Unsupervised clustering + collaborative filtering; ChatGPT → all four in sequence.
python
1234567891011
from transformers import pipeline

# Base model — trained only with Self-Supervised (pre-training)
base = pipeline("text-generation", model="meta-llama/Meta-Llama-3.1-8B")
print(base("What is the capital of France?", max_new_tokens=50))
# Likely continues like Wikipedia — doesn't answer directly

# Instruct model — base + SFT + RLHF
instruct = pipeline("text-generation", model="meta-llama/Meta-Llama-3.1-8B-Instruct")
print(instruct("What is the capital of France?", max_new_tokens=50))
# Answers: "The capital of France is Paris."

Key Takeaways

01
01
Self-Supervision is the Moat

The reason you can't replicate GPT-4 in your basement is the pre-training scale. But you can apply SFT and RLHF to open models to create your own specialty AI.

02
02
Correctness vs. Quality

SFT (Step 2) teaches the model how to be correct. RLHF (Step 3) teaches it how to be high-quality and aligned with human preferences.

03
03
RLHF is the Safety Layer

Safety isn't a filter bolted on after training. It's baked into the model's "taste" during the final reinforcement learning phase.


Up Next in the Series

💡
Next: The Transformer

Everything we've covered — embeddings, neurons, the training loop, and these 4 learning types — all comes together inside the Transformer. In 2017, Google published "Attention Is All You Need" and didn't patent it — a single decision that launched ChatGPT, Claude, Gemini, and every modern AI. Next, we dissect the architecture piece by piece: what problem it solved, how Self-Attention works, and why reading an entire sentence simultaneously is revolutionary. Read Part 7 →

MH

Mohamed Hamed

20 years building production systems — the last several deep in AI integration, LLMs, and full-stack architecture. I write what I've actually built and broken. If this was useful, the next one goes to LinkedIn first.

Follow on LinkedIn →