Attention Is All You Need.
In June 2017, eight researchers, most of them at Google Brain and Google Research, deleted the Recurrent Neural Network (RNN) and replaced it with a parallel mechanism called the Transformer. This is the engine inside ChatGPT, Claude, and every AI that matters today.
In the last article, we saw how the four learning types and the training loop built ChatGPT. Today we open the box and look at the exact architecture that made all of it possible.
June 2017. Eight researchers, most of them at Google Brain and Google Research, sat down and asked a dangerous question: "Why do we even need the RNN?" Then they deleted it. The paper they published, "Attention Is All You Need," was released openly for anyone to read and build on, and that single decision launched ChatGPT, Claude, Gemini, Llama, and every significant language model that exists today.
Before 2017, AI read word-by-word. This was slow, couldn't be parallelized, and suffered from 'memory decay'—the model would forget the start of a sentence by the time it reached the end.
Before 2017: The World Ran on RNNs
To understand why the Transformer was revolutionary, you need to understand what it replaced. The dominant architecture for language was the Recurrent Neural Network (RNN). The idea was elegant: read text the way humans do — one word at a time, remembering what came before.
The glasses → remember → are → remember → light → ... → but their battery...
By the time it reaches "battery," the beginning of the sentence has almost completely faded. The RNN had three fatal problems that held AI back for over a decade.
Problem 1: Memory Decay
The RNN maintained a "hidden state": a compressed memory updated with each new word. The trouble: each update overwrote part of the previous memory, so less and less of the word "glasses" survives as the sentence goes on.
By the time it reaches the end of the sentence ("full day"), it has forgotten the sentence started with "glasses." Engineers tried to fix this with LSTMs (1997). They helped, but didn't solve it: long documents remained an unsolved challenge.
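The fading can be sketched with a deliberately simple model. This is not how a real RNN updates its state (real updates are learned and nonlinear); it only assumes that each new word keeps a fixed fraction of the old memory. The constant `KEEP` and the function name are hypothetical, chosen for illustration.

```python
# Toy model of an RNN-style hidden state: each step keeps a fraction of the
# old memory and overwrites the rest with the new word.
# KEEP is a made-up constant for illustration; real RNNs learn their updates.
KEEP = 0.8


def first_word_share(num_words: int, keep: float = KEEP) -> float:
    """Fraction of the hidden state still attributable to word 1
    after reading `num_words` words."""
    share = 1.0                      # after word 1, the state is all word 1
    for _ in range(num_words - 1):   # every later word dilutes it
        share *= keep
    return share


for n in (1, 5, 10, 20):
    print(f"after {n:2d} words: {first_word_share(n):.1%} of 'glasses' remains")
```

Under this assumption the share shrinks geometrically, so after 20 words less than 2% of the first word is left.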
Problem 2: Sequential Processing (Speed)
RNNs are inherently sequential. Word 2 can't be processed until Word 1 is done; Word 3 waits for Word 2.
Sequential vs. Parallel
RNN (sequential):
- Word 1 → finish → Word 2 → finish → Word 3 → … 100 steps in a row.
- Even with 8,000 GPUs you can't parallelize across words, because each step depends on the previous one.
Transformer (parallel):
- All 100 words processed simultaneously across thousands of GPUs.
- One parallel step per layer instead of one hundred sequential steps, which is why it scales to billions of parameters.
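The dependency difference can be shown in a few lines. The sketch below uses plain numbers as stand-ins for word vectors, and both update rules are hypothetical: the point is only that the RNN loop needs the previous step's result, while the per-position function reads nothing but the inputs. The thread pool illustrates that independence; it is not a claim about real GPU speedups.

```python
from concurrent.futures import ThreadPoolExecutor

words = [0.2, 0.9, 0.4, 0.7]   # stand-in numbers for word vectors


def rnn_outputs(xs):
    # Sequential: each step reads the state written by the previous step.
    state, outputs = 0.0, []
    for x in xs:
        state = 0.5 * state + x      # hypothetical update rule
        outputs.append(state)
    return outputs


def position_output(i, xs):
    # Parallel-friendly: position i reads only the inputs, never another
    # position's output, so all positions can be computed at once.
    return sum(xs) / len(xs) + xs[i]  # hypothetical "look at everything" rule


def parallel_outputs(xs):
    # Every position is handed to the pool independently.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda i: position_output(i, xs), range(len(xs))))


print(rnn_outputs(words))
print(parallel_outputs(words))
```

Computing the positions in any order gives the same answer for `parallel_outputs`, which is exactly what the RNN loop cannot offer.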
Problem 3: Long-Range Dependencies
In "The glasses are red," the word "red" clearly refers to "glasses." But in "The glasses I bought from the store downtown that's been open for 20 years and everyone trusts are red," the RNN has likely forgotten "glasses" by the time it reaches "red" — it might confusingly connect "red" to "years" instead. Forgetting, slowness, and poor long-range connections were the ceiling of AI language for years.
The 2015 Band-Aid: The Original Attention Mechanism
Before the Transformer, researchers found a partial fix: Attention. The insight was brilliant in its simplicity — instead of relying on the hidden state to carry everything forward, what if at each step the model could look back at any previous word and focus on the most relevant ones?
"When you read a confusing word in a long sentence, you glance back over everything you've read and shine a mental flashlight on the words that matter most."
Attention lets the model, while processing 'battery,' look back across the whole sentence and link it directly to 'glasses' — even 100 words away.
This helped significantly. But it was still bolted onto the RNN — it didn't fix the fundamental speed problem, and it added cost on top of an already slow architecture.
2017: The Paper That Changed Everything
Eight researchers, most of them at Google Brain and Google Research (Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin), looked at all these problems and asked the audacious question.
"Why do we need the RNN at all? Let's remove it entirely — and use Attention alone!" 🚀
If Attention already lets you look at any word in the sentence, why process words sequentially? Instead, look at all words simultaneously and let them attend to each other in parallel. They called it the Transformer.
Self-Attention: The Core Innovation
The key mechanism inside the Transformer is Self-Attention. Each word simultaneously asks three questions about every other word:
The Q, K, V Roles
Query (Q): "What am I looking for?" The word's search intent.
Key (K): "What do I offer?" The word's content/identity.
Value (V): "What is my meaning?" The actual information passed forward.
The attention score for each word pair is computed with one compact formula:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) × V
In plain English: QKᵀ measures how well each query matches each key (compatibility), dividing by √dₖ keeps the dot products from getting too large, softmax converts the scores into percentages that sum to 100%, and multiplying by V blends the value vectors using those percentages. Each word votes on how much attention to pay every other word, and the information from relevant words flows through.
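To see why the division by √dₖ matters, here is a small sketch comparing softmax on raw and scaled scores. The raw scores are hypothetical; `d_k = 64` is the per-head key size used in the original paper.

```python
import math


def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


d_k = 64                      # per-head key size in the original paper
raw = [8.0, 16.0, 8.0]        # hypothetical unscaled Q·K scores
scaled = [s / math.sqrt(d_k) for s in raw]

print("unscaled:", [round(w, 3) for w in softmax(raw)])
print("scaled:  ", [round(w, 3) for w in softmax(scaled)])
```

Without scaling, nearly all of the weight lands on a single word and the others are effectively ignored; with scaling, the same ranking survives but the weights stay spread out enough to blend several words.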
The Dinner Party: Self-Attention as a Conversation
- Scene: The word `bank` sits at a table with `river` and `overflowed`.
- Process: `bank` asks `river` for context. `river` says: "I'm 25% related." `bank` is "50% me," and `overflowed` is "25% connected."
- Blend: `bank` takes a weighted sip of each meaning: a big gulp of itself, smaller sips of river and flood.
- Result: It's now "the kind of bank that hangs out with rivers and floods." A riverbank. The financial meaning never entered the picture.
That's self-attention: one ambiguous word, a room full of context, and a weighted blend that resolves the meaning, no formulas needed.
👀 Peek under the hood: the actual arithmetic
Those percentages (25% / 50% / 25%) aren't magic. They come from four lines of arithmetic. Each word carries three tiny vectors (Q, K, V):
1. Give each word a Q, K, V:
river = ([1,0], [1,0], [0.9, 0.1])
bank = ([1,1], [1,1], [0.5, 0.5])
overflowed = ([0,1], [0,1], [0.1, 0.9])
2. Match bank's Q against every K (compatibility):
scores = [1.0, 2.0, 1.0], then ÷√2 → [0.71, 1.41, 0.71]
3. Squish into percentages (softmax):
→ [0.25, 0.50, 0.25] ← the 25/50/25 split
4. Blend the V vectors by those percentages:
new_bank = [0.50, 0.50], now carrying river + flood context
The √2 just keeps numbers from exploding when vectors get big. The Q/K/V numbers are made up for teaching; in a real model they're learned during training.
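The four steps can be run directly, which is a quick way to check the percentages. This sketch uses the same made-up Q, K, V vectors; the helper names `dot` and `attend` are illustrative, not from any library.

```python
import math

# Toy Q, K, V vectors from the walkthrough (made up for teaching).
words = ["river", "bank", "overflowed"]
Q = {"river": [1, 0], "bank": [1, 1], "overflowed": [0, 1]}
K = {"river": [1, 0], "bank": [1, 1], "overflowed": [0, 1]}
V = {"river": [0.9, 0.1], "bank": [0.5, 0.5], "overflowed": [0.1, 0.9]}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(word):
    d_k = len(Q[word])
    # Step 2: compatibility of this word's query with every key, scaled by sqrt(d_k).
    scores = [dot(Q[word], K[w]) / math.sqrt(d_k) for w in words]
    # Step 3: softmax turns the scores into weights that sum to 1.
    exps = [math.exp(s) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # Step 4: blend the value vectors using those weights.
    blended = [
        sum(wt * V[w][i] for wt, w in zip(weights, words))
        for i in range(len(V[word]))
    ]
    return scores, weights, blended


scores, weights, new_bank = attend("bank")
print("scaled scores:", [round(s, 2) for s in scores])
print("weights:      ", [round(w, 2) for w in weights])
print("new_bank:     ", [round(x, 2) for x in new_bank])
```

Calling `attend("river")` or `attend("overflowed")` shows how the other two words would blend their own context.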
For a fuller sentence — "The bank by the river overflowed" — every word scores every other word at once:
| attends → | The | bank | river | overflowed |
|---|---|---|---|---|
| bank | 0.05 | 0.45 (self) | 0.92 ⭐ | 0.38 |
| river | 0.03 | 0.88 ⭐ | 0.52 (self) | 0.71 |
| overflowed | 0.02 | 0.79 ⭐ | 0.85 ⭐ | 0.60 (self) |
Higher score = stronger attention. "bank" scoring 0.92 on "river" is how the model learns this is a riverbank, not a financial institution. (Scores are illustrative; real weights are learned and sum to 1.0 per row after softmax.)
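As a rough illustration of the "sum to 1.0 per row after softmax" point, the sketch below treats the table's illustrative numbers as raw scores and normalizes each row. The scores are the article's teaching values, not real model outputs.

```python
import math

# Illustrative raw scores from the table (columns: The, bank, river, overflowed).
columns = ["The", "bank", "river", "overflowed"]
raw_scores = {
    "bank":       [0.05, 0.45, 0.92, 0.38],
    "river":      [0.03, 0.88, 0.52, 0.71],
    "overflowed": [0.02, 0.79, 0.85, 0.60],
}


def softmax(row):
    exps = [math.exp(s) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


# Normalize each row independently: every word gets its own attention budget.
attention = {word: softmax(row) for word, row in raw_scores.items()}

for word, row in attention.items():
    strongest = columns[row.index(max(row))]
    print(f"{word:<10} sum={sum(row):.2f} strongest={strongest}")
```

Softmax preserves the ranking within each row, so "bank" still attends most strongly to "river" after normalization.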
Multi-Head Attention: A Panel of Experts
The dinner-party conversation was only one type of conversation. Real sentences have many relationships at once — grammar, contrast, mood, big-picture meaning — and one conversation can't catch them all. So the Transformer hires a panel of experts, each reading the same sentence through a different lens. Take "The smart glasses are light but their battery is very weak":
The Expert Panel Reading One Sentence
Grammar expert: "'weak' describes 'battery': a clean adjective-noun pairing."
Contrast expert: "The word 'but' is the whole point. 'light' is being pitted against 'weak.' A trade-off."
Mood expert: "Something positive ('light') undercut by something negative ('weak'). The mood is disappointment."
Big-picture expert: "Zooming out, this whole sentence is a complaint about a gadget. File under 'product review.'"
Each expert writes its own attention matrix, then the model staples all the reports together into one rich representation. It's the trick a good hospital uses — a cardiologist, neurologist, and radiologist examining the same patient, pooling notes for a diagnosis sharper than any one could produce alone. And the scale is wild: GPT-3 runs 96 of these experts in parallel inside every single layer, and each one learned its niche during training — nobody assigned them.
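A minimal sketch of the "panel of experts" idea follows. It makes one big simplifying assumption: each head simply takes its own slice of every token vector and uses it as Q, K, and V, whereas real models apply learned projection matrices per head plus a final output projection. The token vectors and function names are hypothetical.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = vectors."""
    d_k = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d_k)])
    return out


def multi_head(vectors, n_heads):
    d_model = len(vectors[0])
    assert d_model % n_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // n_heads
    head_outputs = []
    for h in range(n_heads):
        # Each head sees only its own slice of every token vector.
        chunk = [v[h * d_head:(h + 1) * d_head] for v in vectors]
        head_outputs.append(self_attention(chunk))
    # "Staple the reports together": concatenate per-head results per token.
    return [sum((head[t] for head in head_outputs), []) for t in range(len(vectors))]


tokens = [
    [1.0, 0.0, 0.5, 0.5],
    [0.0, 1.0, 0.5, 0.0],
    [1.0, 1.0, 0.0, 0.5],
]
mixed = multi_head(tokens, n_heads=2)
print(len(mixed), len(mixed[0]))  # same shape as the input: 3 tokens, 4 dimensions
```

Because each head computes its own attention weights, two heads can focus on different words for the same token, which is the whole point of running several in parallel.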
Positional Encoding: Remembering Order
Reading everything in parallel creates a subtle problem: word order gets lost. Without position info, "The dog bit the man" and "The man bit the dog" look like the same bag of tokens to the attention mechanism.
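One way to restore order is to add a position-dependent vector to each word's embedding. The sketch below uses the sinusoidal scheme from the original paper; the article does not say which scheme it has in mind, so treat that choice as an assumption. The embedding values for "dog" are hypothetical, and positions are counted from 0.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding: even dimensions use sin, odd dimensions use cos."""
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe


def add_position(embedding, position):
    pe = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]


dog = [0.2, 0.7, 0.1, 0.4]              # hypothetical embedding for "dog"
dog_as_subject = add_position(dog, 1)   # "The dog bit the man"
dog_as_object = add_position(dog, 4)    # "The man bit the dog"

print(dog_as_subject)
print(dog_as_object)
print("distinct:", dog_as_subject != dog_as_object)
```

The same word now yields a different vector depending on where it sits, so the two sentences are no longer the same bag of tokens, and nothing about the computation has to become sequential.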
The Order Problem
Problem: Self-attention sees all words at once, so "dog bites man" and "man bites dog" are identical sets of tokens.
Fix: Add a unique position vector to each word's embedding: "dog" at position 1 gets a different fingerprint than "dog" at position 5.
"dog" + position 1 encoding → knows it's the subject
"bit" + position 2 encoding → knows it's the verb
"man" + position 3 encoding → knows it's the object
Now "The dog(1) bit(2) the man(3)" is mathematically distinct from "The man(1) bit(2) the dog(3)": order preserved, without losing parallelism.
Inside a Transformer Block
A complete Transformer isn't just attention — it's a stack of blocks, each containing several components, repeated N times:
The Layered Stack
Positional Encoding: Injects word order so the model knows the difference between "Dog bites man" and "Man bites dog".
Multi-Head Attention: All words attend to each other in parallel across the expert panel.
Residual Connection: The original input is added back, a safety net against information loss.
Feed-Forward Network: Each word is processed through dense neural networks for abstraction.
Repeat N Times: Stack 12 to 96+ blocks to build deeper reasoning and logic, then a final layer predicts the next token.
The Residual Connection is worth calling out: at each layer, the original input is added back to the attention output, so even if a head learns something unhelpful, the original information isn't destroyed. It's the same "add the original input back" trick that let us train deep networks in Part 4 without losing early information.
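The safety-net behavior is easy to demonstrate. In this sketch, `useless_sublayer` is a hypothetical stand-in for an attention head that contributes nothing, and `layer_norm` omits the learned gain and bias that real implementations include. The original paper normalizes after the addition, which is the order shown here.

```python
import math


def layer_norm(x, eps=1e-5):
    # Rescale a vector to zero mean and (roughly) unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual(x, sublayer):
    """Return x + sublayer(x): the input is added back to the sublayer output."""
    return [a + b for a, b in zip(x, sublayer(x))]


def useless_sublayer(x):
    # Stand-in for an attention head that learned nothing useful.
    return [0.0 for _ in x]


x = [0.5, -1.0, 2.0]
after_add = residual(x, useless_sublayer)
print(after_add)              # the original input passes through untouched
print(layer_norm(after_add))  # then gets normalized before the next sublayer
```

Even when the sublayer outputs zeros, the input reaches the next layer intact, which is why stacking dozens of blocks does not wash out early information.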
Every Concept You've Learned Lives Inside This Diagram
- ✓ The neuron (Part 3) is inside the Feed-Forward Network: every position runs through dense layers of neurons after attention.
- ✓ The training loop (Part 4: Forward → Loss → Backprop → Update) runs across all 96+ heads simultaneously during pre-training.
- ✓ The embeddings (Part 2) are what the final output layer produces; the Transformer is the machine that creates them.
- ✓ The 4 learning types (Part 6), including Self-Supervised pre-training, SFT, and RLHF, all use this exact stack as their underlying model.
The Timeline: From Research to Revolution
Language AI reads word-by-word. Long texts break. Slow. Can't parallelize.
Bolted onto the RNN. Better long-range connections, but still sequential. A partial fix.
The Google team removes the RNN entirely. Parallel processing scales to billions of parameters. Published openly for anyone to build on.
OpenAI and Google apply the Transformer at scale. First demonstrations of emergent language understanding.
The first model to show that scaling Transformers produces qualitatively new capabilities: reasoning, writing, code.
Transformer-based models enter everyday use on billions of devices. Every capability in this series became possible only because the Transformer removed the RNN bottleneck.
RNN vs. Transformer — The Final Scoreboard
| Before 2017 (RNN + Attention) | After 2017 (Transformer) |
|---|---|
| Reads word by word ❌ | Reads the whole sentence at once ✅ |
| Forgets distant words ❌ | Every word attends to every other word ✅ |
| Hard to parallelize on GPUs ❌ | Runs on thousands of GPUs simultaneously ✅ |
| Long texts cause failures ❌ | Scales to 1M+ token context windows ✅ |
| Max context ~500 tokens ❌ | Today: 1M+ tokens (Gemini 1.5 Pro) ✅ |
The Four Key Components — Summary
What Makes a Transformer
Multi-Head Attention: Sees multiple types of relationships at once, like a team of specialists each analyzing the same sentence from a different angle.
Residual Connections: Keep the original information from being lost as it passes through dozens of layers. The safety net of deep learning.
Positional Encoding: Injects word order so the model can distinguish "dog bites man" from "man bites dog" despite reading in parallel.
Stacked Layers: Early layers capture syntax; later layers capture abstract meaning and reasoning. Depth is what built ChatGPT and Claude.
The Transformer's fundamental advantage isn't just accuracy. It's scalability. Because training can process every position in a sequence in parallel, adding GPUs speeds it up far more than was ever possible with an RNN, enabling training on hundreds of billions of words in days rather than years. As models scaled, entirely new capabilities emerged (reasoning, code, creative writing) that nobody programmed explicitly. The decision to publish it openly was arguably the most consequential act of open science in AI history.
Pro Tips for Builders
1. Encoder vs. decoder matters. BERT-style (encoder-only) models are best for understanding: classification, embeddings, similarity search. GPT-style (decoder-only) models are best for generation. Knowing the architecture helps you pick the right tool.
2. Context window = Transformer memory. Context limits exist because self-attention cost scales quadratically with sequence length. 1M-token models typically rely on tricks like sparse attention and sliding windows.
3. More layers = more abstraction. Roughly speaking, early layers capture syntax, middle layers facts, late layers reasoning. This is part of why larger models are qualitatively, not just quantitatively, better.
4. Attention heads are interpretable. Tools like BertViz show which words each head attends to, one of the few places in deep learning where you can actually see what the model "thinks."
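The quadratic cost behind the context-window tip can be made concrete with a little arithmetic. This sketch counts the entries in a naive attention score matrix; the 2 bytes per score (16-bit floats) is an assumption, and real systems use various techniques to avoid storing the whole matrix at once.

```python
def attention_matrix_entries(seq_len: int) -> int:
    """Each token scores every token, so one head builds seq_len**2 scores."""
    return seq_len * seq_len


BYTES_PER_SCORE = 2  # assumes 16-bit floats; an illustrative choice

for n in (1_000, 100_000, 1_000_000):
    entries = attention_matrix_entries(n)
    gib = entries * BYTES_PER_SCORE / 2**30
    print(f"{n:>9,} tokens -> {entries:.1e} scores (~{gib:,.1f} GiB per head per layer)")
```

Doubling the sequence length quadruples the number of scores, which is why long contexts call for approaches such as sparse attention or sliding windows rather than simply more hardware.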
Try It Yourself
- Visualize attention. BertViz lets you watch how different attention heads in BERT focus on different words — syntax heads behave differently from semantic heads.
- Feel the difference. Load `bert-base-uncased` (encoder) and `gpt2` (decoder). BERT sees the whole sentence at once; GPT-2 generates one token at a time. Same building blocks, different configurations.
from transformers import pipeline
# BERT (encoder): sees the full sentence at once and fills the blank
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The bank by the [MASK] overflowed.")[0])
# Prints the top prediction as a dict, e.g. {'token_str': 'river', 'score': 0.89, ...}
# (exact token and score may differ). BERT can pick a word like "river"
# because it reads "overflowed" together with "bank".
# GPT-2 (decoder): generates tokens left-to-right
generator = pipeline("text-generation", model="gpt2")
print(generator("The bank by the river", max_new_tokens=5)[0]["generated_text"])
# e.g. "The bank by the river was flooded..." (generation may be sampled, so output can vary)
- Count attention heads.
from transformers import GPT2Config
config = GPT2Config()  # default configuration matches GPT-2 Small
print(f"GPT-2 Small: {config.n_head} heads × {config.n_layer} layers "
      f"= {config.n_head * config.n_layer} attention heads in total")
# GPT-2 Small: 12 heads × 12 layers = 144 attention heads in total
Key Takeaways
The Transformer won not just because it's smarter, but because it's parallel. It can train on internet-scale text in days rather than years.
Context limits exist because attention cost grows quadratically with length. 1M-token windows take serious engineering to reach.
Google's decision to publish the Transformer openly is what allowed OpenAI, Meta, and Anthropic to build the modern AI ecosystem.
Up Next in the Series
Everything we've covered, from one neuron to embeddings to the full Transformer, comes together when the model actually writes its answer, one token at a time. Next, we'll watch generation happen step by step.