The Context Window: AI's Field of Vision
Every LLM has a finite 'stage' where tokens perform. If it's not on the stage, the AI doesn't know it exists. Understanding this limit is the difference between a broken app and a production-grade system.
Last article we saw how the model generates one token at a time inside the Transformer. Today we explore the box that limits how much it can see while doing so. Have you ever noticed that AI handles the beginning and end of a long conversation well, but occasionally "forgets" something critical from the middle? That's not random — it's the physics of how Transformer attention distributions work.
Models pay significantly more attention to the beginning and end of your message. Information in the middle gets systematically ignored. This is the 'Lost in the Middle' effect.
What's Inside the Box?
The context window is the total amount of text an LLM can "see" at one moment. It's not memory — it's a field of vision. Crucially, the LLM has no persistent memory between API calls. Every message re-sends everything:
- 1. System Prompt: The rules and persona (The Script).
- 2. Conversation History: Every previous message (The Backstory).
- 3. RAG Data: Retrieved knowledge chunks (The Research).
- 4. New Query: Your current question (The Cue).
- Result: ALL of this must fit within the model's token limit.
Anything not inside the window does not exist as far as the model is concerned — it cannot reference, remember, or reason about it. Think of a doctor who rereads your entire medical file from scratch at every appointment. They're not remembering — they're rereading. And if the file is too thick for the box, some pages get left out.
Model Capacity Comparison (2026)
The Window Sizes
- Size: 128K tokens.
- Capacity: ~300 pages of text.
- Strength: Balanced performance and cost.
- Size: 200K tokens.
- Capacity: ~500 pages.
- Strength: High reasoning and recall accuracy.
- Size: 1M+ tokens.
- Capacity: ~2,500 pages.
- Strength: Massive document analysis.
For contrast, Llama 3 (8B) ships an 8K window — ~20 pages, limited but free to self-host. An important nuance: larger windows don't automatically mean better quality. The "Lost in the Middle" problem means even a 1M-token model can miss information buried in the center. Bigger window ≠ perfect recall.
Problem 1: Your Data Is Bigger Than the Window
The most common production problem: your data simply doesn't fit. 10,000 products × 100 words = 1,000,000 words — far beyond GPT-4o's 128K tokens, and even Gemini's 1M can't hold a large catalog, legal database, or full codebase.
Fitting the Data
Try to stuff all 10,000 products into the prompt → won't even fit. 🚫
User query → similarity search → retrieve top 5 relevant items → send only those to the model.
The fundamental insight: you don't need to fit everything in the window — you need to fit the right things. A vector database handles "finding the right things"; the window receives only the most relevant subset.
Problem 2: Growing Conversation Cost
Every new message includes the entire conversation history up to that point — so cost compounds as the chat grows:
The output-cost asymmetry from the last article compounds fast here — every turn is re-sent as expensive input and triggers more expensive output. A 100-turn conversation can cost 50× more than a 10-turn one. Three strategies manage the growth:
Conversation Management Strategies
After N messages, replace the full history with a 200-token summary: "The user asked about X and we discussed Y."
Only include the last N messages. Trades perfect recall for bounded cost — fine when recent context matters most.
Store history as embeddings and retrieve only the most relevant past turns — conversation memory as a vector database.
- ✓Transformer self-attention (Part 7) only operates on tokens inside this window — it can not attend to anything outside it.
- ✓The KV Cache (Part 8) can only store keys/values for tokens inside the window; tokens that fall out have their cache discarded.
- ✓The autoregressive loop (Part 8) appends every generated token to the window — it fills up as the model writes.
- ✓Embeddings + similarity search (Parts 2 & 3) are how RAG decides which chunks earn a place in the window — signal, not noise.
Problem 3: Lost in the Middle — The Most Dangerous Blind Spot
A 2023 paper ("Lost in the Middle: How Language Models Use Long Contexts") demonstrated something alarming: LLMs pay significantly more attention to the beginning and end of the context, while the middle gets systematically underweighted. It's a direct consequence of how self-attention scores distribute — the model anchors to the first tokens and the most recent ones.
In practice: with 20 relevant documents in context, models answer correctly ~80% of the time when the answer is at the beginning or end — but accuracy drops to ~50% when it's in the middle. For a 10-chunk RAG response, the 5th and 6th chunks may be effectively invisible.
Surviving the Dead Zone
Given the effect, prompt structure matters as much as content:
Prompt Architecture
- Dump: 15 RAG chunks in random order.
- Structure: Rules buried in the middle.
- Result: ~50% accuracy on middle-zone facts.
- Filter: 3–5 high-quality chunks instead of 15.
- Anchor: Rules repeated at the start and end.
- Result: 90%+ recall across the entire window.
| Strategy | Implementation | Impact |
|---|---|---|
| Most important info first | Put critical constraints at the top, not buried | High — peak attention zone |
| Repeat key instructions at the end | "Always respond in JSON" at both start and end | High — leverages both peaks |
| Sort RAG chunks by relevance | Most relevant docs first and last | Medium |
| Fewer, better chunks | 3–5 highly relevant chunks, not 15 mediocre | High — less middle exposure |
| Structured markers | <important_context>, <retrieved_docs>, <user_question> | Medium — helps navigation |
Input vs. Output: The Pricing Asymmetry
The window directly determines API cost, but the input/output split matters enormously:
Where the Money Goes
System prompt + history + RAG + your question. Cheap: processed in one parallelized forward pass with the KV Cache.
The generated response, one token at a time. 4× more expensive: each token needs a sequential forward pass — no parallelization.
Developer tip: adding "Be concise. Respond in 1–2 sentences unless asked for detail." to your system prompt can cut output costs 60–80% with minimal quality loss for most use cases.
Context Budget Breakdown
For a typical RAG support chatbot with 10 turns of history:
That's ~10K input tokens per call × 1,000 daily calls × $2.50/M = $25/day just in input — before a single output token.
The Core Insight
The model doesn't read your context equally. It has attention peaks at the beginning and end, and a structural dead zone in the middle. Treat your context like a theater: put the most important actors front-stage (start) and for a final bow (end). Everything in the middle risks being forgotten — so curate ruthlessly.
Pro Tips for Builders
- 1. Put your most critical instruction both first and last. The model's attention peaks at both ends — use both for the rule that must never break (e.g. "Always respond in JSON").
- 2. Retrieve 3–5 high-quality RAG chunks, not 15 mediocre ones. More chunks = more dead-zone exposure. Quality beats quantity inside the window.
- 3. Add a summarization step around message 20. Replace the full history with a 200-token summary to stay oriented without ballooning input cost.
- 4. Use structured XML tags (
<system_rules>,<retrieved_docs>,<user_question>) to help the model navigate a dense window and attend to the right sections.
Try It Yourself
Measure exactly what's filling your window:
import tiktoken
def analyze_context(system, messages, rag_chunks, query):
enc = tiktoken.encoding_for_model("gpt-4o")
parts = {
"System Prompt": len(enc.encode(system)),
"Conversation Hist": sum(len(enc.encode(m["content"])) for m in messages),
"RAG Chunks": sum(len(enc.encode(c)) for c in rag_chunks),
"User Query": len(enc.encode(query)),
}
total = sum(parts.values())
for name, n in parts.items():
print(f"{name:<20}{n:>7,} tokens ({n/total*100:.1f}%)")
print(f"{'TOTAL':<20}{total:>7,} tokens — {128_000-total:,} left in GPT-4o")Then test "Lost in the Middle" yourself: list 15 facts, put the answer at positions 1, 8, and 15, and ask which the model recalls most reliably. You'll see the pattern — and a 20-turn history that costs 8,000 input tokens can often be summarized to under 200 (a ~97% reduction).
Key Takeaways
The context window is a stage, not a dump. If you overfill it, the model's attention mechanism breaks down, focusing only on the first and last lines.
Every turn in a chat increases the input tokens for the next turn. Long conversations are expensive because you pay for the same words over and over.
You don't need a 1M token window to search 30M products. You need RAG to find the 5 best matches and place them on the context stage.
Up Next in the Series
If an AI doesn't know something, why does it confidently invent an answer instead of saying "I don't know"? The answer reveals a fundamental property of token-by-token generation — and explains why RAG is so critical. We'll trace hallucinations to their root cause, cover real case studies (an airline chatbot lawsuit, fabricated legal citations), and lay out 5 proven solutions. Continue the series →