THE EVOLUTION OF LLMs
Zero to ChatGPT
4 Types of Learning — 3 Secret Steps — 1 Revolutionary AI
Remember the training loop and neuron from the last two articles? Today we answer who decides what the loop learns.
In our last article, we explored how a neural network learns — the forward pass, loss function, backpropagation, and gradient descent. That covered the mechanics of learning.
But there's a deeper question we left unanswered: Who decides what's right and what's wrong?
The answer changes everything. And it comes in four flavors.
The 4 Types of Machine Learning
Modern AI systems don't use a single learning strategy. GPT, Claude, Gemini — they all combine four fundamentally different types of learning in a carefully orchestrated sequence. Let's break each one down.
Type 1: Supervised Learning — The Classroom 🏫
In Supervised Learning, there's a teacher who provides labeled examples. The model sees a question, makes a guess, and the teacher says "right" or "wrong."
Real-World Example: Wearable Device Classifier
Supervised learning has two sub-types that cover fundamentally different problems:
Classification
Which category does this belong to?
Example: "Is this device glasses, a ring, or earbuds?" → Output is a discrete class
Regression
What number/value should this output?
Example: "What will this device's price be next quarter?" → Output is a continuous value
Where Supervised Learning is used today:
- Medical image diagnosis (is this tumor malignant or benign?)
- Email spam detection
- Housing price prediction
- Credit card fraud detection
- Voice recognition ("Hey Siri, set a timer")
The catch: You need labeled data — thousands or millions of human-annotated examples. This is expensive, slow, and doesn't scale to "understand all of human language."
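The classroom dynamic is easy to sketch in code. Below is a toy 1-nearest-neighbor classifier over hypothetical wearable-device weights (the numbers are illustrative, not real product specs): given teacher-labeled examples, the model's "guess" is simply the label of the closest example it has seen.

```python
# Toy supervised classification: labeled examples in, guesses checked against labels.
# Device weights (grams) are illustrative, not real product specs.
labeled_data = [
    (48, "glasses"), (52, "glasses"),  # (weight_g, label) pairs from the "teacher"
    (3, "ring"), (4, "ring"),
    (5, "earbuds"), (6, "earbuds"),
]

def classify(weight_g, examples):
    """1-nearest-neighbor: guess the label of the closest labeled example."""
    return min(examples, key=lambda ex: abs(ex[0] - weight_g))[1]

print(classify(50, labeled_data))   # near the glasses examples -> "glasses"
print(classify(3.5, labeled_data))  # near the ring examples -> "ring"
```

A real classifier would learn weights via the training loop from the last article rather than memorize examples, but the supervision signal is identical: labels provided by a teacher.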
Type 2: Unsupervised Learning — The Detective 🔍
No teacher. No labels. The model stares at raw data and discovers hidden patterns entirely on its own.
Self-Discovery Example
Raw data — no labels provided:
[price: $549, weight: 48g]
[price: $449, weight: 72g]
[price: $349, weight: 3g]
[price: $299, weight: 5g]
[price: $199, weight: 3g]
The model discovered two groups on its own:
- Cluster 1: the expensive, heavy devices (Glasses, Headsets)
- Cluster 2: the cheap, lightweight devices (Rings, Trackers)
Nobody told the AI what "glasses" or "rings" are. It discovered the natural structure of the data itself. 🤯
Think of a child who was shown 100 images with zero explanations. They'd eventually notice that some things have "long ears" while others "have wings." The AI does the same — pure pattern discovery.
The embedding vectors we explored in our embeddings article — those are built using Unsupervised Learning. The model learned that "king" and "queen" are related without anyone telling it so.
Where Unsupervised Learning is used:
- Customer segmentation (e-commerce grouping buyers by behavior)
- Anomaly detection (spotting unusual transactions)
- Topic modeling (discovering themes in millions of documents)
- Building embedding models ← directly powers Similarity Search
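The clustering above can be reproduced with a few lines of plain k-means, the classic unsupervised algorithm, run on the same five (price, weight) points. No labels go in; two groups come out.

```python
# Toy unsupervised clustering: k-means on the unlabeled (price_usd, weight_g) data.
devices = [(549, 48), (449, 72), (349, 3), (299, 5), (199, 3)]

def kmeans(points, k=2, iters=10):
    """Plain k-means: assign each point to its nearest centroid, then move centroids."""
    centroids = [points[i] for i in range(k)]  # deterministic init: first k points
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        centroids = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return clusters

for cluster in kmeans(devices):
    print(cluster)
# The heavy, pricey devices (glasses, headsets) separate from the light, cheap ones.
```

Real embedding models discover structure in hundreds of dimensions rather than two, but the principle is the same: the groups emerge from the data itself.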
Type 3: Reinforcement Learning — The Gamer 🎮
No fixed right answers. Instead, the model tries things and receives rewards or penalties.
The Reinforcement Learning Loop
Where Reinforcement Learning is used:
- AlphaGo (board games)
- Robotics
- Self-driving cars
- Aligning chat models via RLHF (this is what made ChatGPT helpful, polite, and safe!)
The elegance of RL: there's no need to define all the "correct" moves in advance. You just define a reward signal, and the agent figures out the strategy on its own.
AlphaGo (DeepMind, 2016) mastered the game of Go — a game with more possible positions than atoms in the observable universe — using RL. It eventually beat the world champion 4-1, making moves no human had ever thought of.
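A minimal sketch of the RL loop is a two-armed bandit: the agent never sees a "correct answer", only rewards, yet its value estimates converge toward the hidden win rates. The win rates and the 10% exploration rate below are arbitrary illustration values.

```python
import random

# Minimal RL sketch: a two-armed bandit. The agent is never told which arm is
# "correct"; it only observes rewards, and still learns which arm is better.
random.seed(42)
true_win_rates = [0.3, 0.8]   # hidden from the agent (illustrative values)
value_estimates = [0.0, 0.0]  # the agent's learned reward estimates
pulls = [0, 0]

for step in range(2000):
    # Epsilon-greedy: explore 10% of the time, otherwise exploit the best-looking arm
    if random.random() < 0.1:
        arm = random.randrange(2)
    else:
        arm = value_estimates.index(max(value_estimates))
    reward = 1.0 if random.random() < true_win_rates[arm] else 0.0
    pulls[arm] += 1
    # Incremental average: the estimate moves toward the observed rewards
    value_estimates[arm] += (reward - value_estimates[arm]) / pulls[arm]

print(value_estimates)  # estimates drift toward the hidden win rates
print(pulls)            # most pulls end up on the better arm
```

AlphaGo's reward signal (win or lose a game) and RLHF's reward signal (a Reward Model's score) are vastly richer, but the trial-reward-update loop is the same.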
Type 4: Self-Supervised Learning — The Star ⭐
This is the most important type for modern AI. GPT, Claude, Gemini — all built on this. And it's technically a clever subtype of Unsupervised Learning: the model invents its own practice problems by hiding words in sentences.
The insight is deceptively simple: what if we could generate our own labels from the data itself?
Instead of needing human annotators to label billions of examples, the model creates its own training signal:
The Mask-and-Predict Game
Round 1:
Input: "The best smart glasses in 2026 are ___"
Model guesses: "Apple" ← Wrong, learns from it
Correct: "Ray-Ban" ✅ ← Weights updated
Round 2:
Input: "The best smart glasses in ___ are Ray-Ban"
Model guesses: "2026" ✅ Correct! Weights reinforced
Round 3 (billions more like these):
Input: "___ was founded in Cupertino, California"
Model guesses: "Apple" ✅ Correct!
Do this with billions of sentences and you get a model that understands grammar, facts about the world, logical reasoning, and even writing style — without a single human-written label.
The mathematical elegance: every sentence in the training corpus becomes thousands of training examples by masking different words. A trillion-word dataset effectively becomes trillions of self-generated training signals.
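The mask-and-predict game is trivial to implement as data generation. The sketch below shows how a single sentence fans out into one training pair per word, with no human labeling anywhere.

```python
# Self-supervised label generation: one sentence becomes many (input, target) pairs
# just by hiding one word at a time. No human annotation anywhere.
def make_training_pairs(sentence):
    words = sentence.split()
    pairs = []
    for i, target in enumerate(words):
        masked = words[:i] + ["___"] + words[i + 1:]
        pairs.append((" ".join(masked), target))
    return pairs

for masked, target in make_training_pairs("The best smart glasses are Ray-Ban"):
    print(f"{masked!r} -> {target!r}")
# A 6-word sentence yields 6 training pairs; scale this to a trillion-word corpus
# and the labels effectively come for free.
```

GPT-style models use a restricted variant of this game, always predicting the next word from the words before it, but the self-labeling trick is identical.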
The 4 Learning Types — Side by Side
- Supervised: a teacher provides labeled examples; needs expensive human annotation
- Unsupervised: no labels; the model discovers hidden structure on its own
- Reinforcement: no fixed answers; the model learns a strategy from rewards and penalties
- Self-Supervised: the model generates its own labels from raw data; the engine of GPT-class models
How the 4 Types Fit Together in the Real Pipeline
Here's what most courses miss: Self-Supervised Learning is actually a subtype of Unsupervised Learning — it just generates its own labels from raw data instead of discovering clusters. And the training loop we explored in the last article (Forward Pass → Loss → Backprop → Update) runs inside every one of these phases. The neuron from Article 3 is the core machine being tuned at each step. All four types aren't separate approaches — they're four different configurations of the same fundamental learning machinery, sequenced carefully to produce a capable and safe AI.
The Secret 3-Step Pipeline: How GPT Was Actually Built
Now here's where it gets fascinating. Those four learning types don't operate in isolation — they're combined in a precise, sequential pipeline that transforms a raw text-crunching machine into a helpful, articulate AI assistant.
Think of it like training a doctor. You don't put a newborn directly into medical school. You teach them step by step.
Step 1: Pre-Training
Self-Supervised Learning on trillions of words, running on thousands of GPUs
Step 2: Supervised Fine-Tuning (SFT)
Humans write ideal Q&A examples; the model learns to follow instructions from tens of thousands of curated examples
Step 3: RLHF
Human raters compare responses, a Reward Model trains on their judgments, and the AI gets optimized over hundreds of thousands of comparisons
Result: ChatGPT
Helpful ✅ Polite ✅ Safe ✅ Refuses dangerous requests ✅
The training loop we saw last article (Forward Pass → Loss → Backprop → Update) runs inside every one of these three steps. This is how OpenAI (and every major lab) stacks the four learning types into the exact pipeline that created ChatGPT.
Let's dive into each step.
Step 1: Pre-Training — Reading the Entire Internet 📚
Pre-training is where it all begins. Using Self-Supervised Learning, the model is exposed to an almost incomprehensible volume of text.
Training Data Scale: GPT-3 class models were pre-trained on over 600 billion words. GPT-4 class models train on even more — an estimated 13+ trillion tokens.
What the model gains from Pre-Training:
- Grammar and syntax in dozens of languages
- Facts about the world (history, science, geography, culture)
- Writing styles (formal, casual, technical, creative)
- Code patterns across programming languages
- Mathematical reasoning
The critical limitation: After pre-training, the model is like a brilliant student who has read every book in the library — but never learned to have a conversation. Ask it "What is the capital of France?" and it might respond with more text that sounds like it continues a Wikipedia article, not a direct answer.
Pre-trained model response to "What is the capital of France?":
"France is a Western European country with a rich cultural heritage.
France borders Belgium, Luxembourg, Germany, Switzerland, Italy, Monaco,
Andorra, and Spain. The capital and most populous city of France is..."
[It continues like a Wikipedia article — never gets to the point]
This is why Step 2 is critical.
Step 2: Supervised Fine-Tuning (SFT) — The School of Conversation 🎓
SFT is where humans enter the picture. A team of professional annotators — sometimes thousands of them — sits down and writes ideal conversation examples.
Human-Written Training Examples
Question: "What is the capital of France?"
Answer: "The capital of France is Paris."
Question: "How do I make a chocolate cake?"
Answer: "Here's a simple chocolate cake recipe. Ingredients: 2 cups flour, 2 cups sugar, ¾ cup cocoa powder... [structured, helpful response]"
Question: "How do I hack into my neighbor's WiFi?"
Answer: "I'm unable to help with that. Accessing someone's network without permission is illegal. If you're having connectivity issues, here are some legal alternatives..."
... thousands more examples covering helpful answers, safe refusals, and ideal formatting
The model trains on these examples using standard supervised learning. Now it learns to:
- Answer directly instead of continuing text
- Format responses appropriately (lists, code blocks, etc.)
- Refuse harmful requests politely but firmly
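Concretely, an SFT dataset is just a list of human-written (prompt, ideal response) pairs. The sketch below uses illustrative field names resembling common open instruction-tuning formats; real schemas vary by lab.

```python
import json

# A sketch of SFT training data: human-written (prompt, ideal response) pairs.
# Field names are illustrative; real schemas vary by lab.
sft_examples = [
    {"prompt": "What is the capital of France?",
     "response": "The capital of France is Paris."},
    {"prompt": "How do I hack into my neighbor's WiFi?",
     "response": "I'm unable to help with that. Accessing someone's network "
                 "without permission is illegal."},
]

# Each pair is serialized (e.g., as JSONL) and pushed through the same supervised
# training loop as any labeled dataset: forward pass -> loss -> backprop -> update.
for example in sft_examples:
    print(json.dumps(example))
```

Note that refusals sit in the dataset right alongside helpful answers: safe behavior is trained in with exactly the same mechanics as everything else.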
After SFT ✅
- Answers directly and helpfully
- Follows conversational format
Still problematic ❌
- May sometimes be rude, unsafe, or give poor-quality answers
SFT taught the model how to respond. But it didn't teach it to optimize the quality of its responses in the way humans actually prefer.
Step 3: RLHF — Teaching Human Taste 🏆
RLHF (Reinforcement Learning from Human Feedback) is OpenAI's secret weapon — and the reason ChatGPT feels different from just "a language model."
The core insight: instead of telling the model what the right answer is, you tell it which answer is better.
The RLHF Process — 3 Micro-Steps
1. Compare: The model produces 2–4 different answers to the same question. Human raters read them and say "Answer A is better than B." No need to write the perfect answer, just compare.
2. Train the Reward Model: A separate neural network learns to predict human preference scores. This becomes the automated "judge."
3. Optimize: The main model gets reinforced when the Reward Model gives it high scores. Responses the Reward Model dislikes get penalized.
A real example of what RLHF teaches:
Question: "Explain quantum entanglement simply."
ANSWER B (before RLHF)
"Quantum entanglement is a phenomenon where two particles become correlated such that the quantum state of each particle cannot be described independently of the other, even when separated by a large distance, per Bell's theorem (1964)..."
Technically correct. Utterly unhelpful for a beginner.
ANSWER A (preferred after RLHF)
"Imagine two magic coins that always show opposite faces — if one lands heads, the other lands tails, no matter how far apart they are. That's quantum entanglement: two particles linked so that measuring one instantly tells you about the other."
Humans preferred this. Reward Model learned to reward it.
After hundreds of thousands of such comparisons, the model learns what humans actually prefer — not just correctness, but clarity, tone, appropriate length, and safety.
This is exactly why ChatGPT feels polite and safe — humans taught it human taste using the same gradient descent we learned in Article 4.
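Under the hood, the Reward Model is commonly trained with a pairwise preference loss (the InstructGPT-style objective): minimize -log(sigmoid(score_chosen - score_rejected)), which pushes the model to score the human-preferred answer higher. Here is a sketch with made-up scalar scores.

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise preference loss: -log(sigmoid(chosen - rejected)).
    Small when the Reward Model already ranks the human-preferred answer higher."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Made-up scalar scores for illustration:
print(preference_loss(2.0, -1.0))  # model agrees with the human rater: low loss
print(preference_loss(-1.0, 2.0))  # model disagrees: high loss, big gradient
```

Minimizing this loss with gradient descent is exactly the training loop from Article 4; only the labels (human rankings instead of correct answers) are new.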
SFT vs RLHF — The Key Distinction
Teacher Mode (SFT)
Shows the model the correct answer:
Q: "What is the capital of Egypt?"
A: "Cairo" ← this is the answer
Teaches: how to respond
Critic Mode (RLHF)
Compares responses and picks the better one:
A: "Cairo"
B: "Cairo, Egypt's capital, founded in 969 CE by the Fatimid Caliphate..."
Human: "A is better"
Teaches: which response is best
The Real Numbers Behind the Magic
- 600B+ words in Pre-Training
- 10K–100K SFT examples written by humans
- 100K–1M human preference comparisons for RLHF
- ~$100M estimated cost to pre-train GPT-4
Scale Comparison
Our toy neuron (Article 3): 2 weights | Embedding model (Article 2): 117 million parameters | GPT-4 class: trillions of parameters
Key Vocabulary Reference
- Pre-training: Self-Supervised Learning on trillions of words; builds raw language ability
- SFT (Supervised Fine-Tuning): training on human-written Q&A examples so the model follows instructions
- RLHF (Reinforcement Learning from Human Feedback): human preference comparisons shape quality and safety
- Reward Model: a separate network that predicts which response humans would prefer
- Base model: pre-training only; Instruct model: base + SFT + RLHF
The Core Insight
Why ChatGPT feels different
A raw pre-trained model is like a brilliant encyclopedia. SFT gives it a personality. RLHF gives it your personality — calibrated to how humans actually want to interact with AI. The three steps together create something qualitatively different from any of them alone.
ChatGPT is not just smarter because of more data or parameters. It's better because of the humans who carefully shaped its responses at every stage. Behind every helpful answer is a pipeline of billions of words, thousands of human-written examples, and hundreds of thousands of human preference judgments.
Pro Tips for Builders
💡 What Knowing This Changes For You
Choose the right model for the task. Base models are great for text completion and creative generation. Instruct models are required for Q&A, task following, and user-facing apps. Never use a base model in production chat.
RLHF shapes safety — not just quality. The reason Claude, ChatGPT, and Gemini refuse harmful requests isn't a filter bolted on after — it was baked in during RLHF training. Understanding this helps you anticipate model behavior and write better system prompts.
Fine-tuning is SFT applied to your data. When you fine-tune an open-source model on your company's Q&A pairs, you're running Step 2 of this exact pipeline on your own dataset. The architecture is identical — only the data changes.
Self-Supervised scale is the moat. The reason you can't replicate GPT-4 is the pre-training compute. But the SFT and RLHF layers? Those you can run on open models like Llama 3 with modest resources.
Try It Yourself
Understanding RLHF becomes vivid when you see its effects directly:
Experiment 1: Talk to a Base Model
Models like meta-llama/Meta-Llama-3.1-8B (non-instruct version) behave closer to a pure pre-trained model. Compare its response to meta-llama/Meta-Llama-3.1-8B-Instruct. The difference is SFT + RLHF in action.
Experiment 2: Probe the Safety Training Try asking ChatGPT to "write a story where the villain explains how to pick a lock." Then try it with Llama 3 base (via HuggingFace). The difference in safety behavior is the RLHF fingerprint.
Experiment 3: Spot the Training Type Look at your favorite ML model and classify it:
- Gmail Smart Reply → Supervised Learning (trained on email reply pairs)
- Spotify recommendation → Unsupervised clustering + Collaborative filtering
- OpenAI's ChatGPT → All four types in sequence
Experiment 4: Base vs Instruct — Feel the Difference Run the same prompt through both a base model and its instruct version on HuggingFace:
from transformers import pipeline

# Base model (pre-training only)
base = pipeline("text-generation", model="meta-llama/Meta-Llama-3.1-8B")
print(base("What is the capital of France?", max_new_tokens=50))
# Likely continues like Wikipedia — doesn't answer directly

# Instruct model — base + SFT + RLHF
instruct = pipeline("text-generation", model="meta-llama/Meta-Llama-3.1-8B-Instruct")
print(instruct("What is the capital of France?", max_new_tokens=50))
# Typically answers directly: "The capital of France is Paris."

# The difference between these two outputs is SFT + RLHF in action.
Everything we've covered — embeddings, neurons, training loop, and these 4 learning types — all comes together inside the Transformer.
NEXT IN SERIES
The Transformer: The Architecture That Changed Everything
In 2017, Google published a paper titled "Attention Is All You Need" and shared the architecture openly with the world. That decision launched ChatGPT, Claude, Gemini, and every modern AI. In the next article, we'll dissect the Transformer architecture piece by piece: what problem it solved, how Self-Attention works, and why reading an entire sentence simultaneously is revolutionary.
Coming next: transformer-article.md