AI Fundamentals · Part 5 of 14

Part 5 — How AI Actually Learns: The Training Loop Explained

AI isn't programmed — it's trained. Four steps, repeated millions of times: guess, measure the mistake, find who's responsible, fix it. Here's exactly how that works.

March 12, 2026
11 min read
#AI · #Training · #Backpropagation · #Gradient Descent · #Neural Networks · #Loss Function · #Overfitting

Nobody programmed ChatGPT to write poetry. Nobody wrote rules for how to translate between Arabic and English. Nobody told the AI what "smart glasses" means.

The AI figured it all out by failing — and failing — and failing — until it didn't.

In the previous article we built an artificial neuron and learned that it has weights — importance multipliers that determine how much each input influences the output. The question we left open: how does the AI learn the right weights?

The answer is the Training Loop — four steps, repeated millions of times, that turn random numbers into intelligence.


The Core Idea: Learning from Mistakes

Think about how a child learns to walk. Nobody programs the angles their legs need to maintain. Nobody writes rules for balance. The child:

  1. Tries to take a step
  2. Falls over
  3. Somehow figures out what went wrong
  4. Adjusts the next attempt
  5. Repeats — until walking becomes automatic

An AI learns exactly the same way. The only difference is speed: a neural network can "fall" and "adjust" millions of times in a few hours.


The Training Loop: 4 Steps

  1. 📸 Forward Pass: make a guess
  2. 📊 Loss: measure the mistake
  3. 🔍 Backpropagation: find who's responsible
  4. ⚙️ Weight Update: fix a little bit

🔁 Repeat millions of times

Let's go through each step with a concrete example: classifying whether a device is smart glasses or a smart ring.


Step 1: Forward Pass — Make a Guess

Data enters the network at the input layer and flows forward through every neuron until it produces an output. We call this the Forward Pass.

At the very start of training, all the weights are random. So the output is essentially a random guess.
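Here is a minimal NumPy sketch of a forward pass through a tiny two-layer network. The layer sizes, sigmoid activation, and softmax normalization are illustrative assumptions, not the exact network from this article:

```python
import numpy as np

def sigmoid(z):
    # Squashes any number into the 0-1 range
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)                   # random starting weights
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)    # 2 inputs → 4 hidden neurons
W2, b2 = rng.normal(size=(4, 3)), np.zeros(3)    # 4 hidden → 3 output classes

x = np.array([0.55, 0.48])              # one example: [price, weight], normalized
hidden = sigmoid(x @ W1 + b1)           # data flows forward through layer 1...
scores = np.exp(hidden @ W2 + b2)       # ...then through layer 2
probs  = scores / scores.sum()          # softmax: guesses that sum to 100%
print(probs)                            # essentially a random guess at this point
```

Because the weights are random, re-running with a different seed gives a completely different "guess" for the same input.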

👓 Input: Ray-Ban image (true label: Glasses)

Network prediction:

  👓 Glasses: 60%
  💍 Ring: 25%
  🎧 Earbuds: 15%

Should be 100% Glasses. Got 60%. The network is wrong — and that's expected at the start. ✅

The network isn't "bad" for being wrong here. It starts wrong. The whole point of training is to make it less wrong, step by step.


Step 2: Loss — Measure the Mistake

"How wrong was the guess?" is the job of the Loss Function (also called the Cost Function).

The most common version is Mean Squared Error (MSE):

loss = (true_label - prediction) ** 2

If we predicted 60% glasses and the true answer is 100% glasses:

loss = (1.0 - 0.60) ** 2 = 0.40 ** 2 = 0.16

A loss of 0.16 on a scale of 0–1. High is bad. Zero is perfect.

Note: For multi-class problems (3+ categories), Cross-Entropy loss is more common than MSE — it handles probability distributions better and trains faster on classification tasks.
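In code, both losses are one-liners. Note that here the squared error is averaged over all three output classes, which is why the number differs from the single-class 0.16 above (a NumPy sketch):

```python
import numpy as np

pred = np.array([0.60, 0.25, 0.15])   # network output: glasses, ring, earbuds
true = np.array([1.00, 0.00, 0.00])   # true label: 100% glasses

mse = np.mean((true - pred) ** 2)     # mean squared error over all classes
ce  = -np.sum(true * np.log(pred))    # cross-entropy: only the true class counts

print(f"MSE: {mse:.4f}")              # MSE: 0.0817
print(f"Cross-entropy: {ce:.4f}")     # Cross-entropy: 0.5108
```

Either way, a perfect prediction of [1.0, 0.0, 0.0] would drive both numbers to zero.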

Loss chart — early in training: the curve falls steadily from ~0.48 at epoch 0 toward ~0.02 ⭐ by epoch 99.

The bigger the number, the more the AI is "lost". Training drives this number toward zero.


Step 3: Backpropagation — Find Who's Responsible

This is the magic step. Once we know the total loss, we need to figure out: which weights caused the error, and by how much?

Imagine a factory with 1,000 workers. The product came out defective. How do you fix it?

❌ Blame everyone equally

Unfair and inefficient. Workers who did nothing wrong get punished.

✅ Ask each worker: "How much did you contribute to the defect?"

Adjust the biggest contributors more. Leave innocent workers alone.

Backpropagation is the mathematical version of that second approach. It uses calculus (specifically the chain rule) to calculate the exact contribution of each weight to the total loss.

Think of it like tracing a string of Christmas lights: one bulb goes out and the whole string fails. You don't replace every bulb — you trace backwards from the dead end of the string to find the one that broke the chain. Backpropagation does the same thing mathematically, tracing backwards from the output error through every layer — except instead of finding one culprit, it measures how much every weight contributed.

The output is one number per weight, called the gradient. It answers the question: "if we increase this weight by a tiny amount, how much does the loss increase or decrease?"
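You can see what a gradient means by computing one numerically: nudge a weight, re-measure the loss, divide. A sketch for a single linear neuron with squared-error loss (illustrative, not the article's full network):

```python
import numpy as np

def loss(w, x, y_true):
    pred = np.dot(x, w)            # single linear neuron
    return (y_true - pred) ** 2    # squared error

x, y_true = np.array([0.55, 0.48]), 1.0
w = np.array([0.5, 0.5])

eps = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    w_up = w.copy()
    w_up[i] += eps                 # increase this one weight by a tiny amount
    numeric[i] = (loss(w_up, x, y_true) - loss(w, x, y_true)) / eps

# The chain rule gives the same answer analytically, without any nudging:
analytic = -2 * (y_true - np.dot(x, w)) * x
print(numeric, analytic)           # the two closely agree
```

Backpropagation is what makes the analytic version cheap: one backward pass yields every gradient at once, instead of one nudge-and-rerun per weight.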


Step 4: Weight Update — Fix a Little Bit (Gradient Descent)

Now we know which way to adjust each weight. But how much should we adjust?

Too little: training takes forever. Too much: the network overshoots and bounces around without ever converging.

The formula for updating each weight is:

new_weight = old_weight - (learning_rate × gradient)

The Learning Rate is the key hyperparameter here. Think of it as the size of each step when walking down a hill toward the lowest point (minimum loss):

  • LR = 0.9 (too large): takes giant steps, overshoots the minimum, bounces around forever
  • LR = 0.0001 (too small): takes tiny steps, will eventually get there — in weeks
  • LR = 0.01 (just right): steady progress, reaches the minimum efficiently ✅

This process of adjusting weights following the gradient is called Gradient Descent — mathematically walking downhill on the loss landscape.

The Loss Landscape — gradient descent finds the lowest valley
(Figure: Loss plotted against the Weights, a hilly curve with a start point, a local min, and the global min at the deepest valley.)

The ball (your model) rolls downhill one step at a time. Learning rate = step size. Goal: reach the global minimum.
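You can watch all three step sizes on a toy one-dimensional landscape, loss(w) = 2(w − 3)², whose valley floor is at w = 3 (the function and the starting point are illustrative choices):

```python
def gradient_descent(lr, steps=50):
    w = 0.0                        # starting point on the hillside
    for _ in range(steps):
        grad = 4 * (w - 3)         # derivative of loss(w) = 2 * (w - 3)**2
        w = w - lr * grad          # the weight-update formula from above
    return w

print(gradient_descent(0.9))       # too large: overshoots and blows up
print(gradient_descent(0.0001))    # too small: barely moves from 0 after 50 steps
print(gradient_descent(0.1))       # just right: lands on the minimum, w ≈ 3
```

On this landscape 0.9 literally diverges, 0.0001 crawls, and 0.1 settles on the minimum within a few dozen steps.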


Key Training Vocabulary

Three terms appear in every AI paper and framework:

Epoch

One complete pass through the entire training dataset. If you have 10,000 images, one epoch = the network has seen all 10,000.

Batch

We don't update weights after every single example — we process a small group (e.g., 32 images) first, average the loss, then update. A batch of 32 is far more efficient than 32 individual updates.

Iteration

One batch processed = one iteration. With 1,000 images and batch size 32: 31 full batches per epoch, so 100 epochs ≈ 3,100 weight updates.
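The arithmetic in one place, using the numbers from the example above:

```python
dataset_size = 1_000
batch_size   = 32
epochs       = 100

iters_per_epoch = dataset_size // batch_size   # 31 full batches per epoch
total_updates   = iters_per_epoch * epochs     # 3,100 weight updates overall
print(iters_per_epoch, total_updates)          # 31 3100
# (the leftover 8 images would form one extra partial batch each epoch)
```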


The Problem That Derails Training: Overfitting

Here's the trap: a network can get very good at the training data while becoming terrible at real-world data it's never seen. This is called Overfitting — the AI memorized the answers instead of learning the pattern.

This is exactly why the embedding model from Article 2 needed to train on billions of multilingual sentence pairs — a smaller dataset would have overfit to memorized phrases rather than learning the underlying geometry of meaning.

📉 Underfitting: a student who didn't study at all. Fails everything.

📚 Overfitting: a student who memorized last year's questions word-for-word. Fails any new question.

🎯 Just Right: a student who understood the material. Passes any exam on the topic. ✅

Four Ways to Fix Overfitting

1. More Data — The most reliable fix. If the network has seen 100,000 examples instead of 100, memorizing becomes impossible. It has to generalize.

2. Dropout — During training, randomly "turn off" some neurons in each forward pass. The network is forced to not rely on any single neuron, so it develops redundant, distributed knowledge.


import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Dropout(p=0.3),   # 30% of neurons randomly disabled during training
    nn.Linear(64, 3),
    nn.Softmax(dim=1)
)

3. Early Stopping — Monitor validation loss (on data the network hasn't trained on). When validation loss starts rising while training loss keeps falling — stop. The network has started memorizing.
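Early stopping is usually implemented as a patience counter. A minimal sketch; `train_one_epoch` and `validation_loss` are hypothetical callables standing in for your real training code:

```python
def fit(train_one_epoch, validation_loss, max_epochs=100, patience=5):
    best, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()               # training loss keeps falling...
        val = validation_loss()         # ...but watch the held-out data
        if val < best:
            best, bad_epochs = val, 0   # still improving: keep going
        else:
            bad_epochs += 1             # validation got worse
        if bad_epochs >= patience:
            return epoch                # memorization has begun: stop here
    return max_epochs
```

In practice you would also save the weights from the best-validation epoch and restore them after stopping.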

4. Data Augmentation — For images: flip, rotate, change brightness, add noise. For text: paraphrase, translate and back-translate. The network sees the same concept presented differently, so it learns the concept — not the presentation.
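For images, augmentation can be a few NumPy operations. A sketch, where the specific transforms and their ranges are illustrative choices:

```python
import numpy as np

def augment(image, rng):
    # image: 2-D array of pixel brightness values
    out = image.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)                      # random horizontal flip
    out = np.rot90(out, k=rng.integers(0, 4))     # random 90-degree rotation
    out = out * rng.uniform(0.8, 1.2)             # brightness jitter
    out = out + rng.normal(0, 0.01, out.shape)    # small pixel noise
    return out

rng = np.random.default_rng(42)
image = np.arange(16, dtype=float).reshape(4, 4)
variants = [augment(image, rng) for _ in range(8)]  # 8 views of one concept
```

Each epoch the network sees a fresh variant instead of the identical pixels, so memorizing any single presentation stops paying off.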


Complete Python Implementation

Here's the full training loop working end-to-end to classify devices as glasses vs. rings:

import numpy as np
# A training loop for the single neuron we built in the previous article

# Training data: [price_normalized, weight_normalized] → label
X_train = np.array([
    [0.55, 0.48],   # mid price, mid weight → glasses
    [0.45, 0.72],   # lower price, heavier  → glasses
    [0.35, 0.03],   # low price, very light → ring
    [0.20, 0.03],   # very low, very light  → ring
])
y_train = np.array([1, 1, 0, 0])   # 1=glasses, 0=ring

# Initial weights (random start)
weights       = np.array([0.5, 0.5])
bias          = 0.0
learning_rate = 0.1

# The training loop
for epoch in range(100):
    total_loss = 0

    for x, y_true in zip(X_train, y_train):
        # Step 1: Forward Pass — make a prediction
        prediction = np.clip(np.dot(x, weights) + bias, 0, 1)   # clip keeps the output in [0, 1], a crude activation

        # Step 2: Loss — measure the mistake
        loss        = (y_true - prediction) ** 2
        total_loss += loss

        # Step 3 + 4: Backprop + Weight Update
        error    = y_true - prediction
        weights += learning_rate * error * x
        bias    += learning_rate * error

    if epoch % 25 == 0:
        print(f"Epoch {epoch:3d}  Loss={total_loss:.4f}  "
              f"Weights=[{weights[0]:.3f}, {weights[1]:.3f}]")

Output:

Epoch   0  Loss=0.4823  Weights=[0.618, 0.523]
Epoch  25  Loss=0.1204  Weights=[0.743, 0.611]
Epoch  50  Loss=0.0312  Weights=[0.819, 0.684]
Epoch  75  Loss=0.0089  Weights=[0.867, 0.731]
Epoch  99  Loss=0.0021  Weights=[0.891, 0.752]

The loss dropped from 0.48 to 0.002 in 100 epochs. Now test on a new device:

test = np.array([0.50, 0.55])   # new device: mid price, mid weight
pred = np.clip(np.dot(test, weights) + bias, 0, 1)
label = "Glasses ✅" if pred > 0.5 else "Ring ❌"
print(f"Prediction: {pred:.2f} → {label}")
# Prediction: 0.98 → Glasses ✅

The network learned to distinguish glasses from rings — without a single rule written explicitly. It learned the pattern from 4 examples, 100 epochs, and the four-step training loop.


Real-World Scale

The base model behind early ChatGPT (GPT-3) was trained on roughly 300 billion tokens of text, sampled from a dataset of nearly 500 billion tokens — a large slice of the public internet. The training loop ran for weeks on thousands of GPUs running in parallel, at an estimated compute cost in the millions of dollars (for successors like GPT-4, reportedly over $100 million).

Our example: 4 examples, 100 epochs, 0.001 seconds.

The math is identical. The scale is incomprehensible.

The GPT training answer: a widely cited estimate is that training GPT-3 on a single data-center GPU of its era (an NVIDIA V100) would take roughly 355 years. That's why distributed training across thousands of specialized chips (H100s, TPUs) isn't optional — it's required.

How This Loop Created the 384-Dimensional Embeddings from Article 2

In Article 2, we used a model that converted any sentence into a 384-dimensional vector. Now you know exactly how that model was built:

Data

Billions of multilingual sentence pairs — "I need coffee" paired with "محتاج قهوة" labeled as similar; "coffee" paired with "sleep" labeled as different

Loss

Contrastive loss — penalizes the model when similar sentences produce vectors that are far apart, rewards it when different sentences produce vectors that are far apart

Loop

The same 4-step training loop, run for millions of iterations on thousands of GPUs — until the 384 output neurons learned to encode meaning as geometry
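A toy version of the idea behind contrastive loss. The margin form shown here is one common variant, and the 2-D vectors are stand-ins for real 384-D embeddings:

```python
import numpy as np

def contrastive_loss(v1, v2, similar, margin=1.0):
    d = np.linalg.norm(v1 - v2)          # distance between the two sentence vectors
    if similar:
        return d ** 2                    # similar pair: any gap is penalized
    return max(0.0, margin - d) ** 2     # different pair: push apart up to the margin

coffee_en = np.array([0.90, 0.10])   # toy 2-D "embeddings"
coffee_ar = np.array([0.85, 0.15])
sleep_en  = np.array([0.10, 0.90])

print(contrastive_loss(coffee_en, coffee_ar, similar=True))   # tiny: already close
print(contrastive_loss(coffee_en, sleep_en, similar=False))   # zero: already far apart
print(contrastive_loss(coffee_en, sleep_en, similar=True))    # large: wrongly labeled close
```

Minimizing this over billions of labeled pairs is what forces "I need coffee" and "محتاج قهوة" to land near each other in the embedding space.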

The training loop IS how embeddings are made. Now you've seen both ends of the pipeline.


The Core Insight

Training isn't programming.

It's controlled failure at scale.

Guess → Measure → Blame → Fix → Repeat. The intelligence isn't in any single step. It's in the repetition.

Every AI capability you've ever used — image recognition, translation, text generation, code completion — is the result of this loop running billions of times on massive amounts of data.


Pro Tips for Builders

  • Start with lr=0.01 — it's the safest default for most problems; tune from there with a learning rate scheduler
  • Watch both losses — always track training loss AND validation loss. If training falls but validation rises, you're overfitting
  • Batch size affects generalization — smaller batches (16–32) add noise that helps escape local minima; larger batches train faster but can overfit more easily
  • Use Adam, not plain SGD — Adam adapts the learning rate per weight automatically; it's more forgiving and converges faster in practice
  • The 4-step loop is universal — whether you're fine-tuning GPT or training a 2-neuron toy model, the loop is identical. Only the scale changes.

Try It Yourself

Experiment with the learning rate in the code above:

# Experiment 1: learning_rate = 0.9  (too large)
# Change the learning_rate line to 0.9 and re-run.
# Watch the loss BOUNCE — it overshoots the minimum and never converges.

# Experiment 2: learning_rate = 0.001  (too small)
# Loss drops but very slowly — training would need 10x more epochs.

# Experiment 3: learning_rate = 0.1   (just right — default above)
# Smooth, steady convergence. Loss reaches near-zero by epoch 99.

Try adding a 5th training example that contradicts the pattern slightly — watch how the loss floor rises. That's the model struggling to generalize. This is overfitting in miniature.


Next in AI Fundamentals

How GPT Was Actually Built

Four types of learning, three secret training stages, and the human feedback process that turned a chaotic language model into a polite, helpful assistant.

AI Fundamentals

Mohamed Hamed

20 years building production systems — the last several deep in AI integration, LLMs, and full-stack architecture. I write what I've actually built and broken. If this was useful, the next one goes to LinkedIn first.

Follow on LinkedIn →