
Part 9 — Context Window: The Invisible Limit That Shapes Every AI Conversation

Every LLM has a context window — a hard limit on how much it can 'see' at once. Understanding what fills it up, why models forget things, and the 'Lost in the Middle' effect will change how you architect every AI feature you build.

March 12, 2026
10 min read
#AI#Context Window#LLM#RAG#Lost in the Middle#Prompt Engineering#Memory#API Costs

THE NEURAL DEAD ZONE

Lost in the Middle

START ✅ · DEAD ZONE ⚠️ · END ✅

The AI pays full attention to the start and end of your message but underweights the middle — and this has massive implications for how you structure prompts.

In the last article we saw how the model generates one token at a time inside the Transformer. Today we explore the exact box that limits how much it can see while doing so.

Have you ever noticed that AI seems to handle the beginning and end of a long conversation well, but occasionally "forgets" critical information you mentioned in the middle?

That's not random. It's structural — a direct consequence of how Transformer attention is distributed across positions. And once you understand it, you'll never structure a long AI prompt the same way again.


What Is a Context Window?

The context window is the total amount of text an LLM can "see" at any one moment. It's not a memory — it's more like a field of vision.

Critical distinction: The LLM has no persistent memory between API calls. Every time you send a message, the API receives:

📋 Step 1: System Prompt

Your instructions, persona, and rules for the assistant

💬 Step 2: Full Conversation History

Every message sent and received since the conversation started

🔍 Step 3: RAG Data (if using retrieval)

Documents, product info, or knowledge base chunks retrieved for this query

✍️ Step 4: Your New Question

The actual query this turn

ALL OF THIS = ONE CONTEXT BOX 🧠

Key principle: Anything NOT inside the context window does not exist as far as the model is concerned. It cannot reference, remember, or reason about information that isn't in the window right now.

Think of it like a doctor who reads your entire medical file from scratch at every appointment. They're not remembering — they're rereading. And if the file is too thick for the file box, some pages get left out.
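A minimal sketch makes the statelessness concrete. The client, not the model, carries the memory and re-sends it every turn; the `send_to_llm` call below is a hypothetical placeholder, not a real API:

```python
# Minimal sketch of a stateless chat loop: the model "remembers" nothing;
# the client re-sends the entire accumulated history on every call.

def build_payload(system_prompt: str, history: list, user_message: str) -> list:
    """Assemble everything the model will 'see' this turn: one context box."""
    return (
        [{"role": "system", "content": system_prompt}]  # Step 1: instructions
        + history                                        # Step 2: full history
        + [{"role": "user", "content": user_message}]    # Step 4: new question
    )

history = []
for turn in ["Tell me about wearables", "Which has the best battery life?"]:
    payload = build_payload("You are a product assistant.", history, turn)
    # reply = send_to_llm(payload)   # hypothetical API call
    reply = f"(answer to: {turn})"
    history += [{"role": "user", "content": turn},
                {"role": "assistant", "content": reply}]

# By the next turn, the payload already contains both earlier turns in full:
print(len(build_payload("You are a product assistant.", history, "next")))
# 6 messages: system + 4 history messages + the new question
```

Notice that the payload grows every turn — which is exactly the cost problem covered later in this article.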


Context Window Sizes: How Models Compare in 2026

Model            Context window     Rough capacity
GPT-4o           128K tokens        ~300 pages of text
Claude 3.5+      200K tokens ✅     ~500 pages
Gemini 1.5 Pro   1M tokens 🏆       ~2,500 pages
Llama 3 (8B)     8K tokens          ~20 pages — limited but free

Larger context = longer conversations + bigger documents — but also higher cost per call

An important nuance: Larger context windows don't automatically mean better quality. The "Lost in the Middle" problem means that even models with 1M token windows may miss information buried in the middle. Bigger window ≠ perfect recall.


Problem 1: Your Data Is Bigger Than the Window

The most common context window problem in production: your data simply doesn't fit.

❌ Naive Approach

10,000 products × 100 words = 1,000,000 words
→ GPT-4o max: 128,000 tokens
→ Won't even fit! 🚫

Even Gemini 1.5's 1M window can't fit a large product catalog, legal database, or full codebase.

✅ The Solution: RAG

User query → similarity search
→ retrieve top 5 relevant items
→ send only those 5 to GPT-4o ✅

RAG (Retrieval Augmented Generation) solves this elegantly — we'll cover it in depth in a dedicated article.

The fundamental insight: you don't need to fit everything in the context window — you need to fit the right things. The vector database handles the "finding the right things" part. The context window receives only the most relevant subset.
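The retrieve-then-send flow can be sketched end to end. This toy version uses a fake word-count "embedding" and brute-force cosine similarity purely for illustration; a real system would use an embedding model and a vector database:

```python
import math

# Toy sketch of the RAG flow: embed, rank by similarity, keep only the top-k.
# The "embedding" here is a fake bag-of-words vector for illustration only.

def embed(text: str) -> dict:
    """Fake embedding: a word-count vector standing in for a real model."""
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

catalog = [
    "smart ring with 7 day battery",
    "noise cancelling headphones",
    "smartwatch with heart rate sensor",
    "mechanical keyboard",
]

query = "ring battery life"
q = embed(query)
# Rank the whole catalog, but send only the top-k into the context window.
top_k = sorted(catalog, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:2]
print(top_k[0])  # the ring description ranks first
```

The catalog can be arbitrarily large; only the two best matches ever enter the context window.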


Problem 2: Growing Conversation Cost

Every time a user sends a new message, your API call includes the entire conversation history up to that point.

Context size grows with every message:

Message 1      ~100 tokens
Message 10     ~1K tokens
Message 50     ~5K tokens
Message 100    ~50K tokens 💸

⚠️ Every new message re-sends ALL previous messages as input tokens. The cost compounds.

This is why the output-cost asymmetry from the last article compounds so fast — every turn you add to a conversation is re-sent as expensive input and triggers more expensive output. A 100-turn conversation can cost 50× more than a 10-turn one.
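The compounding is easy to check with arithmetic. This sketch assumes a flat 100 tokens per message (real conversations vary widely); the point is the quadratic shape of the cumulative total:

```python
# Sketch of how re-sending history compounds input cost. Assumes a flat
# 100 tokens per message; the price is the illustrative $2.50/M used
# later in this article, not live pricing.

TOKENS_PER_MESSAGE = 100
PRICE_PER_M_INPUT = 2.50  # USD per 1M input tokens (illustrative)

def cumulative_input_tokens(turns: int) -> int:
    """Total input tokens sent across a whole conversation of `turns` turns."""
    total = 0
    for turn in range(1, turns + 1):
        total += turn * TOKENS_PER_MESSAGE  # turn N re-sends all N messages
    return total

for turns in (10, 100):
    tokens = cumulative_input_tokens(turns)
    cost = tokens / 1_000_000 * PRICE_PER_M_INPUT
    print(f"{turns:>3} turns: {tokens:>9,} total input tokens (${cost:.4f})")

# In this toy model, a 100-turn conversation sends ~90x the input tokens
# of a 10-turn one -- roughly quadratic, not linear, growth.
```

Exact multipliers depend on message lengths, but any "re-send everything" loop has this quadratic character.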

Three strategies to manage conversation growth:

📝
Summarization

After N messages, run a summarization step. Replace the full history with "Summary: The user asked about X and we discussed Y." Reduces context dramatically while preserving key context.

🪟
Sliding Window

Only include the last N messages in the context. Trades perfect recall for bounded cost. Works well for most conversational applications where recent context matters most.

🎯
Intelligent Retrieval

Store conversation history as embeddings. For each new message, retrieve only the most relevant past exchanges. This is the most sophisticated approach — conversation memory as a vector database.
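The first two strategies can be sketched together: a sliding window over recent messages, with a summarization trigger once the history passes a threshold. The `summarize` function here is a placeholder for a real LLM call, and the thresholds are arbitrary:

```python
# Sketch of sliding window + summarization trigger. summarize() is a
# placeholder for a real LLM summarization call; thresholds are arbitrary.

WINDOW = 6            # keep only the last 6 messages verbatim
SUMMARIZE_AFTER = 20  # past this size, fold older turns into a summary

def summarize(messages: list) -> str:
    """Placeholder: in production this would be an LLM call."""
    return f"Summary of {len(messages)} earlier messages."

def prepare_history(history: list) -> list:
    if len(history) <= WINDOW:
        return history
    recent = history[-WINDOW:]
    if len(history) > SUMMARIZE_AFTER:
        # Summarization: replace everything older than the window.
        summary = {"role": "system", "content": summarize(history[:-WINDOW])}
        return [summary] + recent
    # Plain sliding window: just drop the oldest messages.
    return recent

history = [{"role": "user", "content": f"msg {i}"} for i in range(30)]
trimmed = prepare_history(history)
print(len(trimmed))  # 7: one summary message + 6 recent messages
```

The third strategy (intelligent retrieval) would replace the summary step with a similarity search over embedded past exchanges.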

How This Window Relates to Everything We've Learned So Far

The context window isn't an isolated concept — it's the stage where every previous article comes together:

🔲
Transformer self-attention (Article 6) — Self-attention only operates on tokens inside this window. The model can't attend to anything outside it. Bigger window = more tokens the attention mechanism can relate to each other.
🗄️
KV Cache (Article 7) — The KV Cache from the last article stores attention keys and values only for tokens inside this window. When tokens fall outside the window, their cached computations are discarded.
🔄
Autoregressive loop (Article 7) — Every token the model generates gets appended to the context window and becomes input for the next token. The window fills up as the model writes.
🔍
Embeddings + similarity search (Articles 2 & 3) — Remember how similarity search finds the right chunks? RAG uses that exact mechanism to decide which documents are worth placing in the context window — so the window holds signal, not noise.

Problem 3: Lost in the Middle — The Most Dangerous Blind Spot

This is the most counterintuitive and impactful context window phenomenon. A 2023 research paper ("Lost in the Middle: How Language Models Use Long Contexts") demonstrated something alarming:

LLMs pay significantly more attention to information at the beginning and end of the context window. Information in the middle gets systematically underweighted. (This is a direct consequence of how self-attention scores are distributed — the model's attention mechanism naturally anchors to the first tokens it sees and the most recent ones, leaving the middle statistically disadvantaged.)

Attention Distribution Heatmap

⚡ START (high attention) THE DEAD ZONE (low attention) ⚡ END (high attention)

What this means in practice:

The researchers showed that if you have 20 relevant documents in the context window, models answer correctly ~80% of the time when the answer is at the beginning or end — but accuracy drops to ~50% when the answer is in the middle of the document list.

For a 10,000-token RAG prompt containing 10 document chunks, the 5th and 6th chunks may effectively be invisible to the model.


Practical Strategies: Positioning Information to Survive the Dead Zone

Given the Lost in the Middle effect, prompt structure matters as much as prompt content:

Most important info first: Put critical constraints and requirements at the top of your system prompt, not buried in the middle. Impact: High — positions key info in a peak attention zone.

Repeat key instructions at the end: "Remember: always respond in JSON format" appears both at the start and end of the system prompt. Impact: High — leverages both peak attention zones.

Sort RAG chunks by relevance: Put the most relevant retrieved documents first and last, not in the middle of the chunk list. Impact: Medium — reduces dead zone impact on retrieval.

Fewer, better chunks: Retrieve 3–5 highly relevant chunks instead of 15 mediocre ones; the model reads less and retains more. Impact: High — reduces middle-zone exposure.

Structured markers: Use clear XML tags or markdown headers to demarcate sections, e.g. <system_rules>, <retrieved_docs>, <user_question>. Impact: Medium — helps the model navigate the window.
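Two of these strategies compose naturally in code: repeating the critical rule at both attention peaks, and ordering pre-scored chunks so the best ones sit at the edges of the list. A sketch under those assumptions (the function names and interleaving scheme are illustrative, not a standard algorithm):

```python
# Sketch: sandwich the critical rule at both attention peaks, and place
# relevance-ranked chunks at the edges of the list rather than the middle.
# Assumes chunks arrive already sorted best-first by the retriever.

CRITICAL_RULE = "Always respond in JSON."

def order_chunks(chunks_by_relevance: list) -> list:
    """Best chunks at the edges: ranks 1, 3, 5, ... then ... 4, 2."""
    edges_first = chunks_by_relevance[::2]        # odd ranks, front of list
    edges_last = chunks_by_relevance[1::2][::-1]  # even ranks, back of list
    return edges_first + edges_last

def build_prompt(chunks_by_relevance: list, question: str) -> str:
    ordered = order_chunks(chunks_by_relevance)
    return "\n".join(
        [CRITICAL_RULE]                      # peak attention: start
        + [f"[doc] {c}" for c in ordered]    # weakest chunks land mid-window
        + [f"Question: {question}",
           f"Remember: {CRITICAL_RULE}"]     # peak attention: end
    )

chunks = ["best match", "2nd", "3rd", "4th", "5th"]
print(build_prompt(chunks, "Which has the best battery life?"))
```

With five chunks, the top match leads the list and the second-best closes it, leaving the weakest material in the dead zone where a miss costs least.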

Input vs Output: The Pricing Asymmetry

The context window directly determines your API costs, but the split between input and output matters enormously:

Input Tokens

System prompt + conversation history + RAG data + your question

$2.50 / 1M tokens

Relatively cheap: all processed in one parallelized forward pass with KV Cache

Output Tokens

The response the model generates — one token at a time

$10.00 / 1M tokens

4x more expensive: each token requires a sequential forward pass — no parallelization possible

Input is cheap because:

KV Cache stores the attention computations. All input tokens are processed in parallel across GPU cores. One forward pass for thousands of input tokens.

Output is expensive because:

Each output token requires a full sequential forward pass. There's no parallelization — token N can't be generated until token N-1 exists — so generation time scales linearly with output length.

Developer tip: Adding "Be concise. Respond in 1-2 sentences unless asked for detail." to your system prompt can cut output costs by 60-80% with minimal quality loss for most use cases.


Context Budget Breakdown: What's Actually Taking Space?

For a typical RAG-powered customer support chatbot with a 10-turn conversation history:

System Prompt          ~2K tokens     Persona, rules, formatting instructions
Conversation History   ~5K tokens     10 turns × ~250 tokens each side
RAG Retrieved Docs     ~3K tokens     3–5 retrieved document chunks
User Query             ~100 tokens    The current question (surprisingly small)

Total: ~10K input tokens per call × 1,000 daily calls × $2.50/M = $25/day just in input
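The arithmetic is worth checking in code; all figures here are the illustrative numbers from this breakdown, not live pricing:

```python
# Sketch checking the context budget arithmetic. All values are the
# illustrative figures from this section, not live pricing.

BUDGET = {
    "system_prompt": 2_000,
    "conversation_history": 5_000,
    "rag_chunks": 3_000,
    "user_query": 100,
}
PRICE_PER_M_INPUT = 2.50  # USD per 1M input tokens (illustrative)
DAILY_CALLS = 1_000

tokens_per_call = sum(BUDGET.values())
daily_cost = tokens_per_call * DAILY_CALLS / 1_000_000 * PRICE_PER_M_INPUT
print(f"{tokens_per_call:,} tokens/call -> ${daily_cost:.2f}/day in input")
# 10,100 tokens/call -> $25.25/day in input
```

Swapping in your own token counts and current prices turns this into a quick budget estimator for any chatbot design.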

The Core Insight

The numbers are sobering — but the real power comes from treating the window as a stage rather than a dump.

The context window is not a dump — it's a stage

The model doesn't read your context equally. It has attention peaks at the beginning and end, and a structural dead zone in the middle. Treat your context like a theater: put the most important actors front of stage for the opening (start) and bring them back for the final bow (end). Everything in the middle risks being forgotten.


Pro Tips for Builders


1. Put your most critical instruction both first and last in your system prompt. The model's attention peaks at both ends — use both peaks for the rule that must never be broken (e.g., "Always respond in JSON").

2. Retrieve 3–5 high-quality RAG chunks, not 15 mediocre ones. More chunks = more dead zone exposure. Quality beats quantity every time inside the context window.

3. Add a summarization step at message 20. Replace the full history with a 200-token summary. It keeps the model oriented without ballooning your input costs.

4. Use structured XML tags to demarcate sections. <system_rules>, <retrieved_docs>, <user_question> help the model navigate a dense context window and attend to the right sections.


Try It Yourself

Experiment 1: Measure your context usage

import tiktoken

def analyze_context(system: str, messages: list, rag_chunks: list, query: str):
    enc = tiktoken.encoding_for_model("gpt-4o")

    system_tokens = len(enc.encode(system))
    history_tokens = sum(len(enc.encode(m["content"])) for m in messages)
    rag_tokens = sum(len(enc.encode(c)) for c in rag_chunks)
    query_tokens = len(enc.encode(query))
    total = system_tokens + history_tokens + rag_tokens + query_tokens

    print(f"System Prompt:      {system_tokens:>7,} tokens ({system_tokens/total*100:.1f}%)")
    print(f"Conversation Hist:  {history_tokens:>7,} tokens ({history_tokens/total*100:.1f}%)")
    print(f"RAG Chunks:         {rag_tokens:>7,} tokens ({rag_tokens/total*100:.1f}%)")
    print(f"User Query:         {query_tokens:>7,} tokens ({query_tokens/total*100:.1f}%)")
    print(f"TOTAL:              {total:>7,} tokens")
    print(f"GPT-4o capacity remaining: {128_000 - total:,} tokens")

analyze_context(
    system="You are a helpful product assistant...",
    messages=[{"role": "user", "content": "Tell me about wearables"},
              {"role": "assistant", "content": "Wearables include..."}],
    rag_chunks=["Ray-Ban Meta specs: ...", "Samsung Ring: ...", "AirPods Pro: ..."],
    query="Which has the best battery life?"
)

Experiment 2: Test Lost in the Middle yourself

Create a prompt that lists 15 facts. Across three runs, put the key fact at position 1, then 8, then 15, and ask the same question each time. Compare how reliably the model recalls each position. You'll see the pattern.

Experiment 3: Measure summarization savings

from openai import OpenAI
import tiktoken

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4o")


long_history = [
    {"role": "user", "content": "Tell me about wearables"},
    {"role": "assistant", "content": "Wearables include smartwatches, fitness trackers, AR glasses, and smart rings. Key players: Apple, Samsung, Meta, and Oura."},
    # ... imagine 18 more turns here
] * 10  # rough simulation

history_tokens = sum(len(enc.encode(m["content"])) for m in long_history)
print(f"Full history: {history_tokens:,} tokens")

# Summarize instead
summary_prompt = "Summarize this conversation in 3 bullet points:\n" + \
    "\n".join(f"{m['role']}: {m['content']}" for m in long_history[:6])

summary = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": summary_prompt}]
).choices[0].message.content

summary_tokens = len(enc.encode(summary))
print(f"Summary: {summary_tokens:,} tokens")
print(f"Savings: {history_tokens - summary_tokens:,} tokens ({(1 - summary_tokens/history_tokens)*100:.0f}% reduction)")
print(f"\nSummary:\n{summary}")

Run this and compare the token counts. A 20-turn conversation that costs 8,000 input tokens can often be summarized to under 200 — a 97% reduction with minimal context loss.


Now that you understand the limits of what the model can see, the next article explains why it sometimes confidently makes things up anyway.

NEXT IN SERIES

Hallucination: Why AI Lies With Absolute Confidence

If an AI doesn't know something, why does it confidently invent an answer instead of saying "I don't know"? The answer reveals a fundamental property of how token-by-token generation works — and explains why RAG is so critical for production AI applications. In the next article, we'll trace hallucinations to their root cause, show real case studies (including an airline chatbot lawsuit and fabricated legal citations), categorize them by severity, and lay out 5 proven solutions.


AI Fundamentals

Mohamed Hamed

20 years building production systems — the last several deep in AI integration, LLMs, and full-stack architecture. I write what I've actually built and broken. If this was useful, the next one goes to LinkedIn first.

Follow on LinkedIn →