The Training Loop: Learning by Failing
Nobody wrote rules for how ChatGPT writes poetry or translates Arabic. The AI figured it all out by failing—and failing—and failing—until it didn't. This is the training loop.
In the previous article we built an artificial neuron and learned it has weights — importance multipliers that determine how much each input influences the output. The open question: how does the AI learn the right weights? The answer is the Training Loop. Think about how a child learns to walk: try a step, fall, figure out what went wrong, adjust, repeat — until it's automatic. An AI learns exactly the same way; the only difference is speed: it can "fall" and "adjust" millions of times in a few hours.
In traditional software, we write if/else rules. In AI, we write a learning algorithm and feed it data. The intelligence isn't in the code; it's in the repetition of the training loop.
The 4-Step Cycle
The Learning Loop
1. Make a guess. Data flows from input to output through (initially random) weights.
2. Measure the mistake. How far was the guess from the true answer?
3. Find who's responsible. Calculate which weights caused the error.
4. Fix a little bit. Adjust weights slightly using Gradient Descent, then repeat millions of times.
Let's walk each step with a concrete example: classifying whether a device is smart glasses or a smart ring.
Step 1: Forward Pass — Make a Guess
Data enters at the input layer and flows forward through every neuron to produce an output. At the start of training all weights are random, so the output is essentially a random guess. Suppose the network is shown a pair of smart glasses and outputs "60% glasses".
It should be 100% glasses; it got 60%. The network isn't "bad" for being wrong; it starts wrong. The whole point of training is to make it less wrong, step by step.
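A minimal sketch of a forward pass for a single neuron. The two feature values, the fixed seed, the sigmoid output, and the helper names `sigmoid` and `forward` are illustrative assumptions, not part of the original example.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, bias):
    """Forward pass of one neuron: weighted sum, then sigmoid."""
    return sigmoid(np.dot(x, weights) + bias)

rng = np.random.default_rng(seed=0)   # fixed seed so the run is repeatable
weights = rng.normal(size=2)          # untrained weights are just random numbers
bias = 0.0

x = np.array([0.55, 0.48])            # hypothetical features of one device
p_glasses = forward(x, weights, bias)
print(f"P(glasses) = {p_glasses:.2f}")  # some value in (0, 1): a guess, not knowledge
```

Whatever number comes out, it reflects the random starting weights rather than anything learned, which is why the next three steps exist.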
Step 2: Loss — Measure the Mistake
"How wrong was the guess?" is the job of the Loss Function. The most common is Mean Squared Error (MSE):
true_label, prediction = 1.0, 0.60  # true answer: 100% glasses; predicted: 60%
loss = (true_label - prediction) ** 2
print(loss)  # ≈ 0.16; high is bad, zero is perfect
Common Loss Functions
- Mean Squared Error: best for regression (predicting numbers).
- Categorical Cross-Entropy: best for classification (glasses vs. rings); it handles probability distributions and trains faster.
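To make the difference concrete, here is a small sketch comparing the two losses on the same predictions for a true "glasses" label. It uses the two-class (binary) form of cross-entropy; the function names and the clamping constant `eps` are assumptions made for illustration.

```python
import math

def squared_error(y_true, p):
    """Squared difference between the label and the predicted probability."""
    return (y_true - p) ** 2

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Negative log-likelihood of the true label under prediction p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))

# True label is 1.0 (glasses); try increasingly wrong predictions.
for p in (0.60, 0.10, 0.01):
    print(f"p={p:.2f}  squared error={squared_error(1.0, p):.3f}  "
          f"cross-entropy={binary_cross_entropy(1.0, p):.3f}")
```

Squared error can never exceed 1 here, while cross-entropy keeps growing as a confident prediction gets more wrong, so badly wrong guesses produce a stronger training signal.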
Across training, the loss should fall toward zero, usually with a few bumps along the way.
Step 3: Backpropagation — Find Who's Responsible
The magic step: once we know the total loss, which weights caused the error, and by how much? Imagine a factory of 1,000 workers and a defective product. You don't blame everyone equally — you ask each worker "how much did you contribute to the defect?" and adjust the biggest contributors most. Backpropagation is the mathematical version, using the calculus chain rule. It's like tracing a string of Christmas lights backwards from the dead end to find the one broken bulb. The output is a gradient for each weight: "if I nudge this weight, how much does the loss change?"
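For a single sigmoid neuron the chain rule is short enough to write out by hand. The sketch below computes the gradient analytically and then checks it by literally nudging a weight; the input values, starting weights, and function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(weights, bias, x, y_true):
    p = sigmoid(np.dot(x, weights) + bias)
    return (y_true - p) ** 2

def gradients(weights, bias, x, y_true):
    """Chain rule: dL/dw = dL/dp * dp/dz * dz/dw."""
    p = sigmoid(np.dot(x, weights) + bias)
    dL_dp = -2.0 * (y_true - p)   # how the loss reacts to the prediction
    dp_dz = p * (1.0 - p)         # slope of the sigmoid
    dL_dz = dL_dp * dp_dz
    return dL_dz * x, dL_dz       # dz/dw = x, dz/db = 1

x, y_true = np.array([0.55, 0.48]), 1.0
weights, bias = np.array([0.5, -0.3]), 0.1

grad_w, grad_b = gradients(weights, bias, x, y_true)

# Sanity check: nudge the first weight and measure how much the loss changes.
h = 1e-6
nudged = weights.copy()
nudged[0] += h
numeric = (loss_fn(nudged, bias, x, y_true) - loss_fn(weights, bias, x, y_true)) / h
print(grad_w[0], numeric)  # the two numbers should agree closely
```

In a deep network the same product of local slopes is carried backwards layer by layer, which is all "backpropagation" means.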
Step 4: Weight Update — Gradient Descent
Now we know which way to adjust each weight, but how much?
new_weight = old_weight - learning_rate * gradient
The Learning Rate is the size of each step downhill on the loss landscape.
Choosing the Learning Rate
- Too Large: giant steps overshoot the minimum and bounce around, possibly forever (unstable).
- Too Small: tiny steps get there eventually, but it can take weeks (inefficient).
- Just Right: steady, efficient progress toward a minimum. ✅
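The three regimes are easy to see on a toy problem. This sketch runs gradient descent on the one-weight loss `w**2`, whose minimum is at `w = 0`; the function name `descend` and the specific learning rates are assumptions chosen for illustration.

```python
def descend(lr, steps=20, w=1.0):
    """Gradient descent on loss(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - lr * (2 * w)
    return w

for lr in (1.1, 0.001, 0.1):
    print(f"lr={lr:<5}  w after 20 steps = {descend(lr):.4f}")
```

With `lr=1.1` every step multiplies `w` by -1.2, so it flips sign and grows; with `lr=0.001` it shrinks by only 0.2% per step; with `lr=0.1` it shrinks by 20% per step and is close to zero after 20 steps. Which value counts as "too large" depends on the problem, so these thresholds apply to this toy loss only.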
Key Training Vocabulary
| Term | Meaning |
|---|---|
| Epoch | One complete pass through the entire training dataset. 10,000 images → one epoch means all 10,000 seen. |
| Batch | A small group (e.g. 32 examples) processed before updating weights — far more efficient than 32 single updates. |
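| Iteration | One weight update, i.e. one batch processed. 1,000 images at batch size 32 is 31.25 batches, so 32 iterations/epoch counting the final partial batch of 8 (31 if it is dropped); 100 epochs ≈ 3,200 updates. |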
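
The arithmetic behind the table can be sketched in a few lines. The helper `iterations_per_epoch` is hypothetical; its `drop_last` flag borrows the name PyTorch's `DataLoader` uses for discarding an incomplete final batch.

```python
import math

def iterations_per_epoch(num_examples, batch_size, drop_last=False):
    """Number of weight updates needed to see every example once."""
    if drop_last:
        return num_examples // batch_size       # ignore the incomplete final batch
    return math.ceil(num_examples / batch_size)  # count the incomplete final batch

n, batch, epochs = 1_000, 32, 100
print(iterations_per_epoch(n, batch))                  # 32
print(iterations_per_epoch(n, batch, drop_last=True))  # 31
print(iterations_per_epoch(n, batch) * epochs)         # 3200
```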
The Trap: Overfitting
A network can get great at the training data while becoming terrible on real-world data it's never seen — it memorized the answers instead of learning the pattern. (This is exactly why the embedding model from Part 2 trained on billions of pairs — a small dataset would memorize phrases rather than learn the geometry of meaning.)
The Overfitting Spectrum
- Underfitting: the student who didn't study. Fails everything.
- Overfitting: the student who memorized the exam. Fails any new question.
- Good fit: the student who understood the concepts. Passes any test.
Four ways to fix overfitting: (1) More data: usually the most reliable fix; the more examples there are, the harder they are to memorize, which pushes the network toward the general pattern. (2) Dropout: randomly disable some neurons on each training pass so the network can't rely on any single one. (3) Early stopping: when validation loss starts rising while training loss keeps falling, stop. (4) Data augmentation: flip or rotate images, or paraphrase text, so the network learns the concept, not the presentation.
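Early stopping (fix 3) is mostly bookkeeping. A minimal sketch, assuming you already record one validation loss per epoch; the function name, the `patience` parameter, and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop.

    Stops once the validation loss has failed to improve on its best
    value for `patience` consecutive epochs; returns the last index
    if that never happens.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1

# Made-up validation losses: improving at first, then creeping back up.
val_losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.44, 0.47, 0.51]
stop = early_stopping_epoch(val_losses)
print(f"stop at epoch {stop}, best validation loss {min(val_losses[:stop + 1]):.2f}")
# stop at epoch 6, best validation loss 0.40
```

In practice you would also keep a copy of the weights from the best epoch (epoch 3 here) and restore them after stopping.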
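In PyTorch, dropout (fix 2) is a single layer:
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.Dropout(p=0.3),  # each activation is zeroed with probability 0.3 while training
    nn.Linear(64, 3), nn.Softmax(dim=1),
)
# Call model.eval() before making predictions so dropout is switched off.
Complete Python Implementation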
The full four-step loop classifying glasses vs. rings:
import numpy as np

X_train = np.array([[0.55, 0.48], [0.45, 0.72], [0.35, 0.03], [0.20, 0.03]])
y_train = np.array([1, 1, 0, 0])  # 1=glasses, 0=ring
weights, bias, lr = np.array([0.5, 0.5]), 0.0, 0.1

for epoch in range(100):
    total_loss = 0.0
    for x, y_true in zip(X_train, y_train):
        prediction = np.clip(np.dot(x, weights) + bias, 0, 1)  # 1. Forward pass
        total_loss += (y_true - prediction) ** 2               # 2. Loss
        error = y_true - prediction  # 3. Backprop: for one neuron the gradient is proportional to -error * x
        weights += lr * error * x    # 4. Update: step against the gradient
        bias += lr * error
    if epoch % 25 == 0:
        print(f"Epoch {epoch:3d} Loss={total_loss:.4f}")
# Epoch 0 prints a loss of about 0.46; it shrinks from there as training proceeds.

test = np.array([0.50, 0.55])
pred = np.clip(np.dot(test, weights) + bias, 0, 1)
print("Glasses ✅" if pred > 0.5 else "Ring ❌")  # a score above 0.5 means glasses
The network learns to separate glasses from rings from 4 examples and 100 epochs, without a single rule written explicitly.
Real-World Scale
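- Data: hundreds of billions of tokens (GPT-3 was reported to train on about 300 billion), drawn from a filtered sample of web pages, books, and Wikipedia rather than the whole internet.
- Time: Months on thousands of GPUs in parallel.
- Cost: up to ~$100 million of compute for the largest models, by public estimates.
- Single GPU: one widely cited estimate put GPT-3 training at about 355 years on a single data-center GPU.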
Our example: 4 examples, 100 epochs, a fraction of a second. The loop is the same; only the scale differs, by many orders of magnitude.
How This Loop Created the 384-Dimensional Embeddings
In Part 2 we used a model that turned any sentence into a 384-dimensional vector. Now you know, in outline, how it was built. Data: billions of multilingual pairs labeled similar/different. Loss: a contrastive loss that pulls similar sentences together and pushes different ones apart. Loop: this same 4-step loop, run for millions of iterations until the 384 output neurons encoded meaning as geometry. The training loop is how embeddings are made.
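To give a feel for "pull together, push apart", here is one simple contrastive-style loss built on cosine similarity. It is a sketch under stated assumptions, not necessarily the loss the Part 2 model was trained with; the `margin` value, the random stand-in vectors, and the function names are all illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    """1.0 means same direction, 0.0 means unrelated, -1.0 means opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(a, b, similar, margin=0.5):
    """Low when similar pairs point the same way and when
    different pairs are already less similar than the margin."""
    sim = cosine_similarity(a, b)
    if similar:
        return (1.0 - sim) ** 2         # penalize any gap between similar sentences
    return max(0.0, sim - margin) ** 2  # penalize different sentences that sit too close

rng = np.random.default_rng(seed=0)
anchor = rng.normal(size=384)                 # stand-in for one sentence embedding
close = anchor + 0.1 * rng.normal(size=384)   # nearly the same direction
unrelated = rng.normal(size=384)              # an independent random direction

print(contrastive_loss(anchor, close, similar=True))       # tiny: already close together
print(contrastive_loss(anchor, close, similar=False))      # clearly positive: labeled different but close
print(contrastive_loss(anchor, unrelated, similar=False))  # zero: already far enough apart
```

Backpropagating a loss like this through the network is what gradually arranges the 384-dimensional space so that distance tracks meaning.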
The Core Insight
Guess → Measure → Blame → Fix → Repeat. The intelligence isn't in any single step — it's in the repetition. Every AI capability you've used (image recognition, translation, text generation, code completion) is this loop running billions of times on massive data.
Pro Tips for Builders
- Start with lr=0.01: a reasonable first guess for plain SGD (Adam's usual default is 0.001); tune from there with a learning-rate scheduler.
- Watch both losses: track training and validation loss. Training down + validation up = overfitting.
- Batch size affects generalization: smaller batches (16–32) add helpful noise; larger batches train faster but can generalize worse.
- Use Adam, not plain SGD: it adapts the step size per weight automatically and often converges faster in practice.
- The 4-step loop is universal: fine-tuning GPT or training a 2-neuron toy, the loop is identical. Only the scale changes.
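What "adapts per weight" means can be seen in a single update. The sketch below writes out one Adam step in NumPy using the commonly published default hyperparameters; `adam_step` is a hypothetical helper for illustration, not a library function, and the gradient values are made up.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new weights and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, 1.0])
grad = np.array([100.0, 0.01])                # wildly different gradient sizes
m, v = np.zeros_like(w), np.zeros_like(w)

w_adam, m, v = adam_step(w, grad, m, v, t=1)
w_sgd = w - 0.001 * grad                      # plain SGD with the same lr

print("Adam step:", w - w_adam)  # both weights move by about lr
print("SGD step: ", w - w_sgd)   # step sizes differ by a factor of 10,000
```

Because each gradient is divided by its own running magnitude, a weight with tiny gradients still makes progress and a weight with huge gradients does not blow up, which is much of why Adam is forgiving about the initial learning rate.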
Try It Yourself
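Experiment with the learning rate in the code above: set lr = 0.001 and watch the loss crawl; raise it well above 0.1 (for example lr = 0.9, or higher) and check whether the loss starts to overshoot and bounce instead of settling; lr = 0.1 should give smooth, steady progress. Then add a 5th example that slightly contradicts the pattern and watch the loss floor rise. That is noisy data in miniature: a model this simple cannot fit contradictory examples, so some error always remains.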
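One way to run that experiment without editing the script by hand is to wrap the loop in a function. This is a sketch that reuses the article's data and update rule; the `train` helper and its return value (the total loss of each epoch) are assumptions added for convenience.

```python
import numpy as np

X_train = np.array([[0.55, 0.48], [0.45, 0.72], [0.35, 0.03], [0.20, 0.03]])
y_train = np.array([1, 1, 0, 0])  # 1=glasses, 0=ring

def train(lr, epochs=100):
    """Run the four-step loop and return the total loss of each epoch."""
    weights, bias = np.array([0.5, 0.5]), 0.0
    history = []
    for _ in range(epochs):
        total_loss = 0.0
        for x, y_true in zip(X_train, y_train):
            prediction = np.clip(np.dot(x, weights) + bias, 0, 1)
            error = y_true - prediction
            total_loss += error ** 2
            weights = weights + lr * error * x
            bias = bias + lr * error
        history.append(total_loss)
    return history

for lr in (0.001, 0.1, 0.9):
    h = train(lr)
    print(f"lr={lr:<5}  first epoch={h[0]:.4f}  last epoch={h[-1]:.4f}")
```

Comparing the first and last values for each learning rate shows how far training got in 100 epochs; plotting the full `history` list makes crawling or bouncing visible at a glance.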
Key Takeaways
The intelligence isn't in any single step. It's in the billions of repetitions of the Guess → Measure → Blame → Fix loop.
You cannot "feel" your way to a better model. You must use a loss function and follow the gradient (mathematical direction) toward the answer.
The point isn't to get 100% on the training data. The point is to learn the underlying pattern so the model works on data it's never seen.
Up Next in the Series
Four types of learning, three secret training stages, and the human-feedback process that turned a chaotic language model into a polite, helpful assistant.