The 4 Learning Types of Modern AI
ChatGPT is not a single model; it's the result of a precise, sequential pipeline that combines four fundamentally different ways of learning. This is how raw text becomes an assistant.
Remember the training loop and the neuron from the last two articles? In our last article, we explored how a neural network learns — the forward pass, the loss function, backpropagation, and gradient descent. That covered the mechanics of learning.
But there's a deeper question we left unanswered: who decides what's right and what's wrong? The answer changes everything — and it comes in four flavors.
Modern AI systems combine multiple strategies. GPT, Claude, and Gemini are not just "trained"—they are carefully orchestrated through a sequence of learning paradigms.
The Four Flavors of Machine Learning
Modern AI systems don't use a single learning strategy. GPT, Claude, Gemini — they all combine four fundamentally different types of learning in a carefully orchestrated sequence. Understanding these categories is the first step to understanding how any AI application actually works.
The ML Paradigms
- Metaphor: The Classroom.
- Signal: Human labels.
- Use Case: Classification (Spam vs Not Spam).
- Metaphor: The Detective.
- Signal: Natural patterns.
- Use Case: Clustering (Grouping similar items).
- Metaphor: The Gamer.
- Signal: Reward/Penalty.
- Use Case: Games (AlphaGo), RLHF.
- Metaphor: The Star.
- Signal: Mask-and-Predict.
- Use Case: Pre-training all LLMs.
Let's break each one down.
Type 1: Supervised Learning — The Classroom 🏫
In Supervised Learning, there's a teacher who provides labeled examples. The model sees a question, makes a guess, and the teacher says "right" or "wrong." Imagine a wearable-device classifier learning from labeled photos:
Input (Image) → Label (Correct Answer)
📷 Ray-Ban Meta photo → "Smart Glasses" ✅
📷 Samsung Ring photo → "Smart Ring" ✅
📷 AirPods Pro photo → "Smart Earbuds" ✅Supervised learning has two sub-types that cover fundamentally different problems:
Two Sub-Types
- Question: Which category does this belong to?
- Example: "Is this device glasses, a ring, or earbuds?"
- Output: A discrete class.
- Question: What number/value should this output?
- Example: "What will this device's price be next quarter?"
- Output: A continuous value.
Where supervised learning is used today: medical image diagnosis (is this tumor malignant or benign?), email spam detection, housing price prediction, credit card fraud detection, and voice recognition ("Hey Siri, set a timer").
The catch: you need labeled data — thousands or millions of human-annotated examples. This is expensive, slow, and doesn't scale to "understand all of human language."
Type 2: Unsupervised Learning — The Detective 🔍
No teacher. No labels. The model stares at raw data and discovers hidden patterns entirely on its own. Given a pile of unlabeled devices, it might invent its own groupings:
Raw data — no labels provided: The model decided on its own:
[price: $549, weight: 48g] 🔵 Group A — Heavy + Expensive
[price: $449, weight: 72g] → (Glasses, Headsets)
[price: $349, weight: 3g] 🔴 Group B — Light + Affordable
[price: $199, weight: 3g] (Rings, Trackers)Nobody told the AI what "glasses" or "rings" are — it discovered the natural structure of the data itself. Think of a child shown 100 images with zero explanations: they'd eventually notice that some things have "long ears" while others "have wings." The AI does the same: pure pattern discovery.
The embedding vectors we explored in Part 2 — Embeddings are built this way. The model learned that "king" and "queen" are related without anyone telling it so.
Where unsupervised learning is used: customer segmentation (grouping buyers by behavior), anomaly detection (spotting unusual transactions), topic modeling (discovering themes across millions of documents), and building embedding models — which directly power semantic search.
Type 3: Reinforcement Learning — The Gamer 🎮
No fixed right answers. Instead, the model tries things and receives rewards or penalties, then adjusts its strategy.
The Reinforcement Learning Loop
The AI observes the current state of its environment.
It picks a move — at first, more or less at random.
The environment returns a score: +1 for good outcomes, −1 for bad.
The agent nudges its strategy toward actions that earned reward.
The elegance of RL: you don't need to define all the "correct" moves in advance. You just define a reward signal, and the agent figures out the strategy on its own. AlphaGo (DeepMind, 2016) mastered the game of Go — which has more possible positions than atoms in the observable universe — this way, eventually beating the world champion 4–1 with moves no human had ever conceived. Its uses range from robotics and self-driving cars to the big one: RLHF — the technique that made ChatGPT helpful, polite, and safe.
Type 4: Self-Supervised Learning — The Star ⭐
This is the most important type for modern AI. GPT, Claude, Gemini — all built on this. It's technically a clever subtype of Unsupervised Learning, where the model invents its own practice problems by hiding words in sentences.
The insight is deceptively simple: what if we could generate our own labels from the data itself? Instead of needing human annotators, the model creates its own training signal with a "mask-and-predict" game:
Round 1: "The best smart glasses in 2026 are ___"
Model guesses: "Apple" ← wrong, learns from it
Correct: "Ray-Ban" ← weights updated
Round 2: "The best smart glasses in ___ are Ray-Ban"
Model guesses: "2026" ✅ correct, weights reinforced
Round 3: "___ was founded in Cupertino, California"
Model guesses: "Apple" ✅ correctDo this with billions of sentences and you get a model that understands grammar, world facts, logical reasoning, and even writing style — without a single human-written label. The mathematical elegance: every sentence becomes thousands of training examples by masking different words. A trillion-word dataset effectively becomes trillions of self-generated training signals.
The 4 Learning Types — Side by Side
| Type | Has Correct Answers? | Learns From | Best Known Use |
|---|---|---|---|
| Supervised | ✅ Yes (human labels) | Question + correct-answer pairs | Image classification, fraud detection |
| Unsupervised | ❌ No labels | Raw data (finding natural patterns) | Embeddings, customer clustering |
| Reinforcement | Reward / Penalty | Trial and error in an environment | Games (AlphaGo), RLHF |
| Self-Supervised | ✅ Self-generated from data | Trillions of words (masking/predicting) | All modern LLMs ⭐ |
GPT uses all four types together — in different phases of its development. 🤯
How the 4 Types Fit Together
Here's what most courses miss: Self-Supervised Learning is actually a subtype of Unsupervised Learning — it just generates its own labels from raw data instead of discovering clusters. And the training loop we explored last article (Forward Pass → Loss → Backprop → Update) runs inside every one of these phases. The neuron from Part 3 is the core machine being tuned at each step. All four types aren't separate approaches — they're four configurations of the same fundamental learning machinery, sequenced carefully to produce a capable and safe AI.
The Secret 3-Step Pipeline: How GPT Was Actually Built
Those four learning types don't operate in isolation — they're combined in a precise, sequential pipeline that transforms a raw text-crunching machine into a helpful, articulate assistant. Think of it like training a doctor: you don't put a newborn directly into medical school. You teach them step by step.
From Raw Text to Assistant
Type: Self-Supervised. Trillions of words from the web. Builds "World Knowledge." (Months on thousands of GPUs.)
Type: Supervised. 10,000+ human-written examples. Builds "Instruction Following."
Type: Reinforcement. 100,000+ preference rankings. Builds "Human Taste" — and safety.
The training loop from the last article runs inside every one of these three steps. Let's dive into each.
Step 1: Pre-Training — Reading the Entire Internet 📚
Pre-training is where it all begins. Using Self-Supervised Learning, the model is exposed to an almost incomprehensible volume of text:
That's roughly 600 billion words of web text, 100 billion from books, 50 billion of code, and more — and GPT-4 class models train on even more, an estimated 13+ trillion tokens. From this, the model gains grammar and syntax across dozens of languages, facts about the world, writing styles, code patterns, and mathematical reasoning.
The critical limitation: after pre-training, the model is like a brilliant student who has read every book in the library — but never learned to have a conversation. Ask it "What is the capital of France?" and it might just continue like a Wikipedia article instead of answering:
Pre-trained model response to "What is the capital of France?":
"France is a Western European country with a rich cultural heritage.
France borders Belgium, Luxembourg, Germany, Switzerland, Italy, Monaco,
Andorra, and Spain. The capital and most populous city of France is..."
[It continues like an encyclopedia — never gets to the point]This is why Step 2 is critical.
Step 2: Supervised Fine-Tuning (SFT) — The School of Conversation 🎓
SFT is where humans enter the picture. A team of professional annotators — sometimes thousands of them — sit down and write ideal conversation examples:
Q: "What is the capital of France?"
A: "The capital of France is Paris."
Q: "How do I make a chocolate cake?"
A: "Here's a simple recipe. Ingredients: 2 cups flour, 2 cups sugar,
¾ cup cocoa powder... [structured, helpful response]"
Q: "How do I hack into my neighbor's WiFi?"
A: "I'm unable to help with that. Accessing someone's network without
permission is illegal. Here are some legal alternatives..."
... thousands more covering helpful answers, safe refusals, and ideal formattingTraining on these examples with standard supervised learning teaches the model to answer directly instead of continuing text, format responses appropriately, and refuse harmful requests politely.
The State After SFT
- Answers directly and helpfully.
- Follows a conversational format.
- May sometimes be rude, unsafe, or verbose.
- Gives correct-but-low-quality answers.
SFT taught the model how to respond. But it didn't teach it to optimize the quality of its responses the way humans actually prefer.
Step 3: RLHF — Teaching Human Taste 🏆
RLHF (Reinforcement Learning from Human Feedback) is OpenAI's secret weapon — and the reason ChatGPT feels different from "just a language model." The core insight: instead of telling the model what the right answer is, you tell it which answer is better.
The RLHF Process
The model produces 2–4 different answers to the same question.
Raters say "Answer A is better than B." No need to write the perfect answer — just compare.
A separate neural network learns to predict human preference scores — the automated "judge."
The main model is reinforced when the Reward Model scores it highly, penalized when it doesn't.
Here's a real example of what RLHF teaches:
Question: Explain quantum entanglement simply.
"Quantum entanglement is a phenomenon where two particles become correlated such that the quantum state of each cannot be described independently, even when separated by a large distance, per Bell's theorem (1964)..."
Technically correct. Utterly unhelpful for a beginner.
"Imagine two magic coins that always show opposite faces — if one lands heads, the other lands tails, no matter how far apart they are. That's entanglement: two particles linked so that measuring one instantly tells you about the other."
Humans preferred this. The Reward Model learned to reward it.
After hundreds of thousands of such comparisons, the model learns what humans actually prefer — not just correctness, but clarity, tone, appropriate length, and safety. This is exactly why ChatGPT feels polite and safe: humans taught it human taste, using the same gradient descent we learned in Part 4.
SFT vs. RLHF — The Key Distinction
Two Modes of Teaching
- Shows the model the correct answer.
Q: "Capital of Egypt?"→A: "Cairo"← this is the answer.- Teaches how to respond.
- Compares responses and picks the better one.
A: "Cairo"(preferred) vsB: "Cairo, founded in 969 CE by the Fatimid Caliphate..."- Teaches which response is best.
In one line: SFT = Correctness · RLHF = Quality · Both together = ChatGPT.
The Real Numbers Behind the Magic
Our toy neuron (Part 3): 2 weights · An embedding model (Part 2): ~117 million parameters · GPT-4 class: trillions of parameters.
Base Model vs. Instruct Model
This pipeline is exactly why you should never use a raw model for a chat application.
Model Personality
- Training: Pre-training only.
- Behavior: Continues text (Wikipedia style).
- Result: Q: "What is 2+2?" A: "Addition is a basic..."
- Example: Llama-3-8B (non-instruct).
- Training: SFT + RLHF added.
- Behavior: Answers questions directly.
- Result: Q: "What is 2+2?" A: "4."
- Example: Llama-3-8B-Instruct.
Key Vocabulary Reference
| Term | Definition |
|---|---|
| Pre-Training | Initial training on massive datasets via Self-Supervised Learning. Builds general language understanding. |
| Self-Supervised | The model generates its own training signal from the data (mask-and-predict). No human labels needed. |
| Fine-Tuning | Adapting a pre-trained model to a specific task or behavior with additional training. |
| SFT | Supervised Fine-Tuning — training on human-written Q&A pairs to teach conversational behavior. |
| RLHF | Reinforcement Learning from Human Feedback — optimizing response quality based on human preferences. |
| Reward Model | A separate network trained to predict human preference scores. Acts as an automated judge. |
| Base Model | Pre-training only. Great at text continuation, poor at following instructions. |
| Instruct Model | A base model refined with SFT + RLHF. Follows instructions, refuses harm, adopts a conversational tone. |
| LLM | Large Language Model — the category trained with all the above (ChatGPT, Claude, Gemini, Llama). |
The Core Insight
A raw pre-trained model is like a brilliant encyclopedia. SFT gives it a personality. RLHF gives it your personality — calibrated to how humans actually want to interact with AI. The three steps together create something qualitatively different from any of them alone. ChatGPT isn't better just because of more data or parameters — it's better because of the humans who shaped its responses at every stage.
Pro Tips for Builders
- 1. Choose the right model for the task. Base models are great for text completion and creative generation. Instruct models are required for Q&A, task-following, and user-facing apps. Never ship a base model in production chat.
- 2. RLHF shapes safety — not just quality. Claude, ChatGPT, and Gemini refuse harmful requests because it was baked in during RLHF, not bolted on as a filter. Knowing this helps you anticipate behavior and write better system prompts.
- 3. Fine-tuning is SFT applied to your data. When you fine-tune an open model on your company's Q&A pairs, you're running Step 2 of this exact pipeline on your own dataset. The architecture is identical — only the data changes.
- 4. Self-supervised scale is the moat. You can't replicate GPT-4's pre-training compute — but the SFT and RLHF layers you can run on open models like Llama 3 with modest resources.
Try It Yourself
Understanding RLHF becomes vivid when you see its effects directly:
- Talk to a base model. Compare
meta-llama/Meta-Llama-3.1-8B(base) to...-8B-Instruct. The difference is SFT + RLHF in action. - Temperature vs. safety. Ask ChatGPT to "write a story where the villain explains how to pick a lock," then try the same with a base model. The gap in safety behavior is the RLHF fingerprint.
- Spot the training type. Classify your favorite models: Gmail Smart Reply → Supervised; Spotify recommendations → Unsupervised clustering + collaborative filtering; ChatGPT → all four in sequence.
from transformers import pipeline
# Base model — trained only with Self-Supervised (pre-training)
base = pipeline("text-generation", model="meta-llama/Meta-Llama-3.1-8B")
print(base("What is the capital of France?", max_new_tokens=50))
# Likely continues like Wikipedia — doesn't answer directly
# Instruct model — base + SFT + RLHF
instruct = pipeline("text-generation", model="meta-llama/Meta-Llama-3.1-8B-Instruct")
print(instruct("What is the capital of France?", max_new_tokens=50))
# Answers: "The capital of France is Paris."Key Takeaways
The reason you can't replicate GPT-4 in your basement is the pre-training scale. But you can apply SFT and RLHF to open models to create your own specialty AI.
SFT (Step 2) teaches the model how to be correct. RLHF (Step 3) teaches it how to be high-quality and aligned with human preferences.
Safety isn't a filter bolted on after training. It's baked into the model's "taste" during the final reinforcement learning phase.
Up Next in the Series
Everything we've covered — embeddings, neurons, the training loop, and these 4 learning types — all comes together inside the Transformer. In 2017, Google published "Attention Is All You Need" and didn't patent it — a single decision that launched ChatGPT, Claude, Gemini, and every modern AI. Next, we dissect the architecture piece by piece: what problem it solved, how Self-Attention works, and why reading an entire sentence simultaneously is revolutionary. Read Part 7 →