Part 5 of 14

Part 5 — How AI Actually Learns: The Training Loop Explained

AI isn't programmed — it's trained. Four steps, repeated millions of times: guess, measure the mistake, find who's responsible, fix it. Here's exactly how that works.

April 7, 2026
11 min read
#AI #Training #Backpropagation #Gradient Descent #Neural Networks #Loss Function #Overfitting

The Training Loop: Learning by Failing

Nobody wrote rules for how ChatGPT writes poetry or translates Arabic. The AI figured it all out by failing—and failing—and failing—until it didn't. This is the training loop.

Primary Objective
Forward Pass | Loss Function | Backpropagation | Gradient Descent

In the previous article we built an artificial neuron and learned it has weights — importance multipliers that determine how much each input influences the output. The open question: how does the AI learn the right weights? The answer is the Training Loop. Think about how a child learns to walk: try a step, fall, figure out what went wrong, adjust, repeat — until it's automatic. An AI learns exactly the same way; the only difference is speed: it can "fall" and "adjust" millions of times in a few hours.

💡
Intelligence isn't Programmed

In traditional software, we write if/else rules. In AI, we write a learning algorithm and feed it data. The intelligence isn't in the code; it's in the repetition of the training loop.


The 4-Step Cycle

The Learning Loop

📸
FORWARD PASS

Make a guess. Data flows from input to output through (initially random) weights.

📊
LOSS

Measure the mistake. How far was the guess from the true answer?

🔍
BACKPROP

Find who's responsible. Calculate which weights caused the error.

⚙️
WEIGHT UPDATE

Fix a little bit. Adjust weights slightly using Gradient Descent — then repeat millions of times.

Let's walk each step with a concrete example: classifying whether a device is smart glasses or a smart ring.


Step 1: Forward Pass — Make a Guess

Data enters at the input layer and flows forward through every neuron to produce an output. At the start of training all weights are random, so the output is essentially a random guess:

Network Prediction for a Ray-Ban Image (true label: Glasses)
  • 👓 Glasses: 60%
  • 💍 Ring: 25%
  • 🎧 Earbuds: 15%

It should be 100% Glasses; it got 60%. The network isn't "bad" for being wrong — it starts wrong. The whole point of training is to make it less wrong, step by step.
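To make "data flows forward" concrete, here is a minimal sketch of a forward pass through one hidden layer and a softmax output. The layer sizes, the two input features, and the helper names (`dense`, `relu`, `softmax`, `forward`) are illustrative assumptions, not the network behind the chart above. Because the weights are random, the printed probabilities are just an untrained guess.

```python
import math
import random

def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Keep positive values, replace negatives with zero."""
    return [max(0.0, v) for v in values]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, params):
    """Input -> hidden layer -> class probabilities."""
    hidden = relu(dense(x, params["w1"], params["b1"]))
    return softmax(dense(hidden, params["w2"], params["b2"]))

def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)                                  # fixed seed so the "random guess" is repeatable
params = {
    "w1": random_matrix(4, 2), "b1": [0.0] * 4,   # 2 inputs  -> 4 hidden neurons
    "w2": random_matrix(3, 4), "b2": [0.0] * 3,   # 4 hidden  -> 3 classes
}

probs = forward([0.55, 0.48], params)           # hypothetical feature values for one image
for label, p in zip(["glasses", "ring", "earbuds"], probs):
    print(f"{label:8s} {p:.2f}")
```

Training never changes this function. It only changes the numbers stored in `params`.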

Step 2: Loss — Measure the Mistake

"How wrong was the guess?" is the job of the Loss Function. One of the simplest is the squared error, which becomes Mean Squared Error (MSE) when averaged over many examples:

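python
true_label = 1.0     # the image really is glasses
prediction = 0.60    # the network said 60% glasses

# Squared error for one example; MSE is this value averaged over many examples
loss = (true_label - prediction) ** 2
print(round(loss, 2))   # 0.16 (higher is worse, zero is perfect)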

Common Loss Functions

📐MSE
  • Mean Squared Error.
  • Best for regression (predicting numbers).
🎯CROSS-ENTROPY
  • Categorical Cross-Entropy.
  • Best for classification (glasses vs. rings) — handles probability distributions and trains faster.
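To see why cross-entropy tends to suit classification, the sketch below scores the same three-class guess with both losses, then scores a confidently wrong guess. The function names and the `confident_wrong` vector are made up for illustration; the first prediction reuses the 60/25/15 guess from Step 1.

```python
import math

def mean_squared_error(target, predicted):
    """Average of the squared differences across all classes."""
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)

def cross_entropy(target, predicted, eps=1e-12):
    """Negative log-probability assigned to the true class (eps avoids log(0))."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

target = [1.0, 0.0, 0.0]                # one-hot: the true class is glasses
predicted = [0.60, 0.25, 0.15]          # the guess from Step 1
confident_wrong = [0.01, 0.98, 0.01]    # hypothetical: very sure it is a ring

for name, guess in [("step-1 guess", predicted), ("confident wrong", confident_wrong)]:
    print(f"{name:16s} MSE={mean_squared_error(target, guess):.3f} "
          f"CE={cross_entropy(target, guess):.3f}")
```

MSE is bounded for probability vectors, while cross-entropy keeps growing as the probability of the true class approaches zero, so confident mistakes produce a much stronger training signal.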

Across training, the loss steadily falls toward zero:

Loss Over Training (lower = better), chart values by epoch:
Epoch 0: 48 | Epoch 25: 36 | Epoch 50: 24 | Epoch 75: 12 | Epoch 99: 2

Step 3: Backpropagation — Find Who's Responsible

The magic step: once we know the total loss, which weights caused the error, and by how much? Imagine a factory of 1,000 workers and a defective product. You don't blame everyone equally — you ask each worker "how much did you contribute to the defect?" and adjust the biggest contributors most. Backpropagation is the mathematical version, using the calculus chain rule. It's like tracing a string of Christmas lights backwards from the dead end to find the one broken bulb. The output is a gradient for each weight: "if I nudge this weight, how much does the loss change?"
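Here is a small sketch of the chain rule at work on the tiniest possible "deep" network: one input, one hidden neuron with a sigmoid, one output weight. The network shape, the starting values, and the function names are assumptions chosen for illustration. The last lines compare the backprop gradients with a numerical estimate, which is a common way to check that the blame was assigned correctly.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(w1, w2, x, y):
    """Forward pass: x -> hidden neuron -> prediction -> squared error."""
    hidden = sigmoid(w1 * x)
    prediction = w2 * hidden
    return (y - prediction) ** 2

def backprop(w1, w2, x, y):
    """Apply the chain rule from the loss back to each weight."""
    hidden = sigmoid(w1 * x)                 # forward pass, keeping intermediates
    prediction = w2 * hidden
    dloss_dpred = 2.0 * (prediction - y)     # d(loss)/d(prediction)
    grad_w2 = dloss_dpred * hidden           # prediction = w2 * hidden
    dloss_dhidden = dloss_dpred * w2         # pass the blame back through w2
    grad_w1 = dloss_dhidden * hidden * (1.0 - hidden) * x   # sigmoid'(z) = h * (1 - h)
    return grad_w1, grad_w2

def numeric_grad(f, value, step=1e-6):
    """Central-difference estimate of df/dvalue."""
    return (f(value + step) - f(value - step)) / (2.0 * step)

w1, w2, x, y = 0.5, -0.3, 1.5, 1.0
g1, g2 = backprop(w1, w2, x, y)
n1 = numeric_grad(lambda w: loss_fn(w, w2, x, y), w1)
n2 = numeric_grad(lambda w: loss_fn(w1, w, x, y), w2)
print(f"w1: backprop={g1:.6f} numeric={n1:.6f}")
print(f"w2: backprop={g2:.6f} numeric={n2:.6f}")
```

Frameworks such as PyTorch automate exactly this bookkeeping for millions of weights at once.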

Step 4: Weight Update — Gradient Descent

Now we know which way to adjust each weight, but how much?

text
new_weight = old_weight - (learning_rate × gradient)
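A minimal sketch of that update rule in action, on a made-up one-weight loss whose minimum sits at 3. The toy loss and the starting weight are assumptions for illustration. With a starting weight of 0 and a learning rate of 0.1, the first update moves the weight to 0.6.

```python
def loss(w):
    return (w - 3.0) ** 2            # toy loss with its minimum at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)           # derivative of the toy loss

learning_rate = 0.1
weight = 0.0
for step in range(50):
    # new_weight = old_weight - (learning_rate * gradient)
    weight = weight - learning_rate * gradient(weight)
    if step in (0, 9, 49):
        print(f"step {step + 1:2d}: weight={weight:.4f} loss={loss(weight):.6f}")
```

Each step removes a fixed fraction of the remaining distance, so early steps are large and later ones shrink as the gradient flattens near the minimum.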

The Learning Rate is the size of each step downhill on the loss landscape:

Choosing the Learning Rate

LR = 0.9 (too large)

Giant steps overshoot the minimum and bounce around forever — unstable.

LR = 0.0001 (too small)

Tiny steps will get there eventually — in weeks. Inefficient.

LR = 0.01 (just right)

Steady, efficient progress toward the minimum. ✅

The Loss Landscape
  • Learning Rate: The size of each step.
  • Too Large: You overshoot the minimum and bounce around (unstable).
  • Too Small: You take forever to reach the bottom (inefficient).
  • Just Right: Steady, efficient progress toward a good minimum (gradient descent does not guarantee the global one).
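The three regimes can be reproduced on a toy loss. One caveat, stated up front: which number counts as "too large" depends on how steep the loss is, so the toy loss below (1.5 * w**2) was picked as an assumption so that the three rates from this section land in the three regimes. On a different problem the thresholds move.

```python
def gradient(w):
    return 3.0 * w                   # derivative of the toy loss 1.5 * w**2

def descend(learning_rate, steps=200, start=1.0):
    """Run plain gradient descent and return the final weight (the minimum is at 0)."""
    w = start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w

results = {lr: descend(lr) for lr in (0.9, 0.0001, 0.01)}
for lr, w in results.items():
    print(f"lr={lr:<7} distance from minimum after 200 steps: {abs(w):.3e}")
```

With the large rate every step overshoots by more than it corrects, so the distance grows. With the tiny rate the weight has barely moved after 200 steps. The middle rate closes almost all of the gap.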

Key Training Vocabulary

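Term | Meaning
Epoch | One complete pass through the entire training dataset. With 10,000 images, one epoch means all 10,000 have been seen once.
Batch | A small group (e.g. 32 examples) processed together before one weight update, which is far more efficient than 32 single-example updates.
Iteration | One batch processed, i.e. one weight update. 1,000 images at batch size 32 is 31.25 batches: 31 full batches plus one partial batch, so 31 to 32 iterations per epoch and roughly 3,100 to 3,200 updates over 100 epochs.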
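The arithmetic behind these terms is easy to check in code. This is a small sketch; `iterations_per_epoch` and `batches` are hypothetical helper names, and `drop_last` mirrors the common option of discarding a final partial batch.

```python
import math

def iterations_per_epoch(num_examples, batch_size, drop_last=False):
    """How many weight updates happen in one pass over the data."""
    if drop_last:
        return num_examples // batch_size        # ignore the partial final batch
    return math.ceil(num_examples / batch_size)  # keep the partial final batch

def batches(data, batch_size):
    """Yield consecutive slices of the dataset, one batch at a time."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

dataset = list(range(1000))                      # stand-in for 1,000 images
sizes = [len(batch) for batch in batches(dataset, 32)]

print(iterations_per_epoch(1000, 32))                  # 32 (31 full batches + 1 partial)
print(iterations_per_epoch(1000, 32, drop_last=True))  # 31
print(sizes[-1])                                       # 8 examples in the final batch
print(iterations_per_epoch(1000, 32) * 100)            # 3200 updates over 100 epochs
```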

The Trap: Overfitting

A network can get great at the training data while becoming terrible on real-world data it's never seen — it memorized the answers instead of learning the pattern. (This is exactly why the embedding model from Part 2 trained on billions of pairs — a small dataset would memorize phrases rather than learn the geometry of meaning.)

The Overfitting Spectrum

📉UNDERFITTING

The student who didn't study. Fails everything.

📚OVERFITTING

The student who memorized the exam. Fails any new question.

🎯JUST RIGHT

The student who understood the concepts. Passes any test.

Four ways to fix overfitting:
  • More data: usually the most reliable fix. The more varied examples there are, the harder they are to memorize, which pushes the network toward the general pattern.
  • Dropout: randomly disable some neurons on each training pass so the network can't rely on any single one.
  • Early stopping: when validation loss starts rising while training loss keeps falling, stop.
  • Data augmentation: flip or rotate images, paraphrase text, so the model learns the concept, not the presentation.
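Early stopping is the easiest of these to sketch in a few lines. The version below is one common formulation, assumed here for illustration: stop once the validation loss has failed to improve for `patience` epochs in a row, and keep the weights from the best epoch. The function name and the loss values are made up.

```python
def early_stopping(val_losses, patience=3):
    """Return (stop_epoch, best_epoch); stop_epoch is None if training never stops early."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Hypothetical validation losses: improving at first, then rising as overfitting sets in
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
stop_epoch, best_epoch = early_stopping(val_losses, patience=3)
print(f"stop at epoch {stop_epoch}, restore weights from epoch {best_epoch}")
# stop at epoch 6, restore weights from epoch 3
```

The `patience` window exists because validation loss is noisy; stopping at the first uptick would often quit too early.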

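python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.Dropout(p=0.3),          # zeroes 30% of activations at random, in training mode only
    # Softmax outputs probabilities; omit it when training with nn.CrossEntropyLoss,
    # which expects raw logits
    nn.Linear(64, 3), nn.Softmax(dim=1),
)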
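What a dropout layer does can be sketched without any framework. The function below is an illustrative re-implementation of "inverted" dropout, not PyTorch's code: during training it zeroes each activation with probability `p` and scales the survivors by 1 / (1 - p) so the expected value stays the same. At inference it passes values through unchanged.

```python
import random

def dropout(activations, p=0.3, training=True, rng=None):
    """Inverted dropout: drop with probability p, rescale the rest by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(activations)             # inference: nothing is dropped
    rng = rng or random
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                        # seeded generator for a repeatable demo
activations = [1.0] * 10_000
dropped = dropout(activations, p=0.3, training=True, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(f"zeroed: {zero_fraction:.1%}, mean after dropout: {mean_after:.3f}")
print(dropout([0.2, 0.7], training=False))    # unchanged at inference time
```

This is also why frameworks distinguish training mode from evaluation mode: forgetting to switch modes leaves neurons randomly disabled at prediction time.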

Complete Python Implementation

The full four-step loop classifying glasses vs. rings:

python
import numpy as np

# Each row holds two numeric features for one device
X_train = np.array([[0.55, 0.48], [0.45, 0.72], [0.35, 0.03], [0.20, 0.03]])
y_train = np.array([1, 1, 0, 0])          # 1 = glasses, 0 = ring
weights, bias, lr = np.array([0.5, 0.5]), 0.0, 0.1

for epoch in range(100):
    total_loss = 0.0
    for x, y_true in zip(X_train, y_train):
        prediction = np.clip(np.dot(x, weights) + bias, 0, 1)   # 1. Forward pass
        error = y_true - prediction
        total_loss += error ** 2                                # 2. Loss (squared error)
        # 3. Backprop: d(loss)/d(weights) = -2 * error * x
        #    (the constant 2 is folded into lr; the clip is ignored for simplicity)
        # 4. Update: step against the gradient
        weights += lr * error * x
        bias += lr * error
    if epoch % 25 == 0 or epoch == 99:
        print(f"Epoch {epoch:3d}  Loss={total_loss:.4f}")
# Epoch 0 prints a loss of about 0.46; the value should shrink on later epochs.

test = np.array([0.50, 0.55])             # an unseen device with glasses-like features
pred = np.clip(np.dot(test, weights) + bias, 0, 1)
print("Glasses ✅" if pred > 0.5 else "Ring ❌")

The network learned to separate glasses from rings using just 4 examples and 100 epochs, without a single rule written explicitly. With so few examples this is a toy, but the mechanics are the real thing.


Real-World Scale

GPT-3 Training Stats
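  • Data: About 300 billion training tokens, drawn from filtered web crawl data, books, and Wikipedia.
  • Time: Months on thousands of GPUs in parallel.
  • Cost: Millions of dollars in compute; a widely cited third-party estimate puts it at about $4.6 million.
  • Single GPU: The same estimate works out to roughly 355 years on a single data-center GPU (an NVIDIA V100).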

Our example: 4 examples, 100 epochs, a fraction of a second. The math is identical; only the scale differs, by many orders of magnitude.

How This Loop Created the 384-Dimensional Embeddings

In Part 2 we used a model that turned any sentence into a 384-dimensional vector. Now you know exactly how it was built: Data — billions of multilingual pairs labeled similar/different; Loss — contrastive loss that pulls similar sentences together and pushes different ones apart; Loop — this same 4-step loop run for millions of iterations until the 384 output neurons encoded meaning as geometry. The training loop is how embeddings are made.
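For a feel of what "pull together, push apart" means numerically, here is a sketch of the classic margin-based contrastive loss on toy 3-dimensional vectors. This is an assumption for illustration: the model from Part 2 may well use a different contrastive variant, and real embeddings have 384 dimensions, not 3.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(a, b, similar, margin=1.0):
    """Similar pairs are penalized for distance; different pairs for being closer than the margin."""
    distance = euclidean(a, b)
    if similar:
        return distance ** 2                     # pull together
    return max(0.0, margin - distance) ** 2      # push apart, up to the margin

# Hypothetical embeddings
smart_glasses = [0.1, 0.9, 0.0]
ar_eyewear    = [0.2, 0.8, 0.0]    # close to smart_glasses
pizza_recipe  = [0.9, -0.5, 0.3]   # far from smart_glasses

print(contrastive_loss(smart_glasses, ar_eyewear, similar=True))     # small: already close
print(contrastive_loss(smart_glasses, pizza_recipe, similar=False))  # 0.0: already beyond the margin
print(contrastive_loss(smart_glasses, ar_eyewear, similar=False))    # large: too close for a "different" pair
```

Backpropagation then turns each of these loss values into gradients that move the vectors, which is the same four-step loop with a different definition of "mistake".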


The Core Insight

💡
Training Isn't Programming — It's Controlled Failure at Scale

Guess → Measure → Blame → Fix → Repeat. The intelligence isn't in any single step — it's in the repetition. Every AI capability you've used (image recognition, translation, text generation, code completion) is this loop running billions of times on massive data.


Pro Tips for Builders

⚠️
Training Pro Tips
  • Start with lr=0.01: a reasonable first try for plain SGD (Adam's usual default is 0.001); tune from there, ideally with a learning-rate scheduler.
  • Watch both losses: track training and validation loss. Training down + validation up = overfitting.
  • Batch size affects generalization: smaller batches (16–32) add helpful noise; larger batches train faster per epoch but can generalize worse.
  • Prefer Adam over plain SGD as a default: it adapts the step size per weight automatically and usually converges faster in practice.
  • The 4-step loop is universal: fine-tuning GPT or training a 2-neuron toy, the loop is identical. Only the scale changes.
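To show what "adapts the learning rate per weight" means, here is a single-weight sketch of the Adam update with its commonly cited default hyperparameters. The class is a teaching aid written for this example; real frameworks apply the same arithmetic to every weight tensor. The toy loss (minimum at 3) is an assumption.

```python
import math

class Adam:
    """Adam for one scalar weight."""

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0   # running average of gradients (momentum)
        self.v = 0.0   # running average of squared gradients (scale)
        self.t = 0     # step counter, used for bias correction

    def step(self, w, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)   # correct the zero-initialization bias
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return w - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)

def gradient(w):
    return 2.0 * (w - 3.0)          # derivative of the toy loss (w - 3)**2

optimizer = Adam(lr=0.1)
w = 0.0
first_step = None
for i in range(500):
    w = optimizer.step(w, gradient(w))
    if i == 0:
        first_step = w
print(f"after 1 step: {first_step:.4f}, after 500 steps: {w:.4f}")
```

Note the first step: the gradient is -6, yet the weight moves by about 0.1, the learning rate itself. Dividing by the running gradient scale is what makes the step size largely independent of how big the raw gradient is.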

Try It Yourself

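Experiment with the learning rate in the code above: set lr = 0.9 and look for overshooting (the loss may bounce instead of falling smoothly); set lr = 0.001 and watch it crawl; lr = 0.1 should give smooth convergence. Then add a 5th example that slightly contradicts the pattern and watch the loss floor rise. That isn't overfitting: a model this small can't memorize the contradiction, so the conflicting label simply sets a floor on the loss. It's a miniature view of what noisy labels do to training.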
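To make the experiment quicker to run, one option is to wrap the loop in a function that takes the learning rate and returns the loss per epoch. This is a dependency-free sketch of the same toy model; the names `train`, `clip01`, `X_TRAIN`, and `Y_TRAIN` are introduced here for illustration.

```python
X_TRAIN = [(0.55, 0.48), (0.45, 0.72), (0.35, 0.03), (0.20, 0.03)]
Y_TRAIN = [1, 1, 0, 0]                      # 1 = glasses, 0 = ring

def clip01(value):
    """Clamp a prediction into the range [0, 1]."""
    return min(1.0, max(0.0, value))

def train(lr, epochs=100):
    """Run the four-step loop and return the total loss recorded at each epoch."""
    weights, bias = [0.5, 0.5], 0.0
    history = []
    for _ in range(epochs):
        total_loss = 0.0
        for x, y_true in zip(X_TRAIN, Y_TRAIN):
            prediction = clip01(weights[0] * x[0] + weights[1] * x[1] + bias)  # forward pass
            error = y_true - prediction
            total_loss += error ** 2                                            # loss
            weights[0] += lr * error * x[0]                                     # update
            weights[1] += lr * error * x[1]
            bias += lr * error
        history.append(total_loss)
    return history

for lr in (0.9, 0.1, 0.001):
    history = train(lr)
    print(f"lr={lr:<6} first epoch={history[0]:.4f} last epoch={history[-1]:.4f}")
```

Plotting each returned `history` (or just printing every tenth value) makes the difference between bouncing, crawling, and converging easy to see.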


Key Takeaways

01
Repetition is the Key

The intelligence isn't in any single step. It's in the billions of repetitions of the Guess → Measure → Blame → Fix loop.

02
Measure, Don't Guess

You cannot "feel" your way to a better model. You must use a loss function and follow the gradient (mathematical direction) toward the answer.

03
Generalization is the Goal

The point isn't to get 100% on the training data. The point is to learn the underlying pattern so the model works on data it's never seen.


Up Next in the Series

💡
Next: How GPT Was Actually Built

Four types of learning, three secret training stages, and the human-feedback process that turned a chaotic language model into a polite, helpful assistant.


Mohamed Hamed

20 years building production systems — the last several deep in AI integration, LLMs, and full-stack architecture. I write what I've actually built and broken. If this was useful, the next one goes to LinkedIn first.
