Part 8 of 14

Part 8 — Token-by-Token: Why AI Generates Text One Word at a Time (And Why It Costs 4x More)

ChatGPT doesn't think its answer in advance — it predicts one token at a time in a recursive loop. Understanding this changes how you design prompts, control costs, and build responsive AI applications.

March 12, 2026
10 min read
#AI #LLM #Token Generation #KV Cache #Autoregressive #TTFT #Streaming #API Costs

Token-by-Token: The AI Recursive Loop

ChatGPT doesn't think its answer and then display it. It predicts one token, then another, then another—using each prediction as context for the next. This is autoregressive generation.

Primary Objective
Autoregressive | KV Cache | TTFT vs Throughput | Cost Asymmetry

In the last article we looked at the Transformer architecture. Today we watch it in action during live generation and see why the output side costs about 4× as much as the input side. The part that surprises most developers: the model is not writing out a planned answer, it is recursively predicting, with each prediction using the previous ones as context. The Transformer reads input in parallel, but generation flips that on its head. It is forced to be sequential, because each new token depends on the ones before it.

🚫
The Output Tax

Output tokens cost about 4× as much as input tokens. On GPT-4o, reading 1M tokens costs ~$2.50, while writing 1M tokens costs ~$10.00, because generation is forced to be sequential and slow.


The One Thing an LLM Does

Strip away the magic, and a Large Language Model does exactly one thing: given all the tokens it has seen so far, it scores every possible next token and picks one, usually one of the most likely.

The Generation Engine
  • 1. Context: The model reads your prompt and all tokens generated so far.
  • 2. Prediction: It runs a forward pass to find the most likely next token.
  • 3. Append: The new token is fed back into the input for the next cycle.
  • Result: Intelligence emerges from this recursive loop.

Think of it like predictive text on your phone — except instead of suggesting 3 words, it chooses from 100,000+ possible tokens, thousands of times, to build a complete response.
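The three-step loop above can be sketched in a few lines. This is a minimal illustration, not how any real model is implemented: `toy_model` is a hypothetical hard-coded lookup standing in for a forward pass, the `<prompt>` marker stands in for the tokenized question, and most of the probabilities are made up for the example.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a real model: maps the tokens seen so far
# to a probability for each candidate next token.
NextTokenFn = Callable[[List[str]], Dict[str, float]]


def toy_model(tokens: List[str]) -> Dict[str, float]:
    """Hard-coded lookup keyed on the last token (illustration only)."""
    table = {
        "<prompt>": {"Ray": 0.35, "Apple": 0.20, "Meta": 0.15},
        "Ray": {"-": 0.85, "cons": 0.05},
        "-": {"Ban": 0.95, "O": 0.02},
        "Ban": {" Meta": 0.60, "[END]": 0.30},   # made-up values
        " Meta": {" Ultra": 0.70, "[END]": 0.20},  # made-up values
        " Ultra": {"[END]": 0.90},                 # made-up values
    }
    return table[tokens[-1]]


def generate(next_token_probs: NextTokenFn,
             prompt_tokens: List[str],
             max_new_tokens: int = 16) -> List[str]:
    """Greedy autoregressive loop: predict, append, repeat."""
    tokens = list(prompt_tokens)
    generated: List[str] = []
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)      # steps 1 and 2: context -> prediction
        token = max(probs, key=probs.get)     # greedy: take the most likely token
        if token == "[END]":                  # stop token ends generation
            break
        tokens.append(token)                  # step 3: feed the token back in
        generated.append(token)
    return generated


print("".join(generate(toy_model, ["<prompt>"])))  # Ray-Ban Meta Ultra
```

Note that `generate` never sees the whole answer: each iteration only knows the tokens produced so far.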


Step-by-Step Generation

Let's trace a real example. Prompt: "What's the best smart glasses?"

Building: 'Ray-Ban Meta Ultra'
  • Step 1: [START] + Question → "Ray" (35% likely)
  • Step 2: … + "Ray" → "-" (85% likely)
  • Step 3: … + "-" → "Ban" (95% likely)
  • Step 4: … + "Ban" → "Meta" → "Ultra" → … → [END]

The final response — "Ray-Ban Meta Ultra — lightweight, 48MP camera, translates 40 languages, full-day battery" — was assembled from sequential predictions. It was never computed all at once.
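Because each step is a separate conditional prediction, the probability of a whole sequence is the product of its per-step probabilities. As a small worked example using the three step probabilities from the trace above (the later steps have no stated probabilities, so they are left out):

```python
import math

# Per-step probabilities from the trace: "Ray", then "-", then "Ban".
step_probs = {"Ray": 0.35, "-": 0.85, "Ban": 0.95}

# Chain rule: P(sequence) = P(t1) * P(t2 | t1) * P(t3 | t1, t2)
sequence_prob = math.prod(step_probs.values())

print(f"P('Ray-Ban' | prompt) = {sequence_prob:.4f}")  # 0.2826
```

Even with two very confident steps, the sequence as a whole is only about 28% likely, because the uncertain first token dominates.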


Autoregressive Generation: The Mathematical Reality

The formal name is autoregressive generation — each output token becomes part of the input for the next prediction. (It's the same next-token prediction the training loop taught the model, now happening live at inference.) This creates a critical asymmetry:

Response Length | Generation Steps | Implication
10 tokens (~8 words) | 10 sequential predictions | Fast, cheap
100 tokens (~75 words) | 100 sequential predictions | Moderate
1,000 tokens (~750 words) | 1,000 sequential predictions | Slow, expensive
4,000 tokens (a blog post) | 4,000 sequential predictions | Very slow, very expensive

This is a large part of why output tokens cost about 4× as much as input. Reading your 10,000-token prompt can be largely parallelized. But generating each output token requires its own forward pass through the full model, and those passes cannot run in parallel within one response, because each one needs the token produced by the pass before it.


The Probability Distribution: Every Token Is a Vote

At each step, the model doesn't just know the one "right" answer. It produces a probability distribution over its entire vocabulary — every possible next token with a likelihood:

Probability Distribution After 'What's the best smart...'
  • Ray: 35%
  • Apple: 20%
  • Meta: 15%
  • currently: 8%
  • ...others (100K tokens): 22%

This is the same softmax activation we saw inside the neuron and the Transformer block — here it converts raw logits over the full vocabulary into a distribution over what to say next. The model selects one token, appends it, and runs the whole process again until it emits an [END] token or hits the max length.

💡
Note

The model doesn't always pick the highest-probability token — that's controlled by Temperature, a topic for another article. Higher temperature = more random/creative; lower = more deterministic.
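To make the softmax and temperature ideas concrete, here is a minimal sketch in plain Python. The logits are made-up numbers for four candidate tokens, and `sample_token` is a hypothetical helper name; real APIs apply temperature inside the service and combine it with other sampling controls.

```python
import math
import random
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Turn raw logits into probabilities; temperature rescales them first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens: Sequence[str],
                 logits: Sequence[float],
                 temperature: float = 1.0,
                 rng: Optional[random.Random] = None) -> str:
    """Draw one token at random, weighted by its softmax probability."""
    rng = rng or random.Random()
    probs = softmax(logits, temperature)
    return rng.choices(list(tokens), weights=probs, k=1)[0]


tokens = ["Ray", "Apple", "Meta", "currently"]
logits = [2.0, 1.4, 1.1, 0.5]  # made-up scores, for illustration only

for temperature in (0.2, 1.0, 2.0):
    probs = softmax(logits, temperature)
    print(temperature, [round(p, 2) for p in probs])

print(sample_token(tokens, logits, temperature=1.0, rng=random.Random(0)))
```

Running it shows the distribution sharpening toward the top token at low temperature and flattening at high temperature, which is the "deterministic vs creative" trade-off in the note above.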


KV Cache: The Hidden Optimization That Makes This Workable

Here's the obvious problem: if each new token required the model to re-read the entire context, compute time would grow quadratically. A 1,000-token response from a 10,000-token prompt would be impossibly slow. The solution is the KV Cache (Key-Value Cache):

The KV Cache

Without KV Cache ❌

Every new token reprocesses the entire context from scratch: Token 1 reads all 10K input tokens, Token 2 reads all 10K again plus Token 1, and so on. Every step redoes the work for roughly 10,000 tokens that were already processed. Slow and expensive.

With KV Cache ⚡

The attention Keys and Values for already-processed tokens are stored in GPU memory and reused — each step only computes the K/V for the one new token. Fast and smart.

During the Transformer's attention computation, every token produces a Key and Value vector that don't change once computed. The cache stores them, so each generation step only computes K/V for the new token. This reuse is only possible because of the self-attention mechanism from the last article — without Q/K/V there'd be nothing to cache. It's also another reason reading input is cheaper than generating output: input is one parallel forward pass, output is one-at-a-time even with the cache.
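The saving can be shown with a toy single-head attention step. This is a simplified sketch under loud assumptions: `project` is a hypothetical stand-in for the learned Key/Value projections (it only scales the vector), the query is the raw token vector, and there is one layer and one head. The point is only that the cached and uncached paths give the same output while doing very different amounts of K/V work.

```python
import math
from typing import List, Tuple

Vector = List[float]


def project(x: Vector, scale: float) -> Vector:
    """Stand-in for a learned Key or Value projection (hypothetical: just scales x)."""
    return [scale * v for v in x]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over all keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    return [
        sum(w * value[i] for w, value in zip(weights, values)) / total
        for i in range(len(values[0]))
    ]


def step_without_cache(tokens: List[Vector]) -> Tuple[Vector, int]:
    """Recompute K/V for every token seen so far, then attend from the newest one."""
    keys = [project(t, 0.5) for t in tokens]
    values = [project(t, 2.0) for t in tokens]
    return attend(tokens[-1], keys, values), len(tokens)


class KVCache:
    """Keeps K/V for earlier tokens so each step projects only the new token."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def step(self, token: Vector) -> Tuple[Vector, int]:
        self.keys.append(project(token, 0.5))
        self.values.append(project(token, 2.0))
        return attend(token, self.keys, self.values), 1


sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
cache = KVCache()
work_cached = work_uncached = 0
for i, token in enumerate(sequence):
    out_cached, n_cached = cache.step(token)
    out_plain, n_plain = step_without_cache(sequence[: i + 1])
    # Same attention output either way; only the amount of K/V work differs.
    assert all(math.isclose(a, b) for a, b in zip(out_cached, out_plain))
    work_cached += n_cached
    work_uncached += n_plain

print(work_cached, work_uncached)  # 4 10
```

For four tokens the uncached path performs 1 + 2 + 3 + 4 = 10 K/V computations against 4 with the cache, and the gap widens quadratically as the sequence grows.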


TTFT and Throughput: The Two Metrics That Matter

When building AI apps, "speed" is two different numbers:

TTFT vs Throughput

TTFT — Time to First Token

How long before the user sees the first word. Dominated by: input processing (bigger prompts = longer TTFT). Why it matters: users perceive TTFT as "responsiveness" — 3s feels laggy even if generation is fast.

📊Throughput — Tokens/Second

How fast tokens come after the first. Dominated by: model size, hardware, batching. Why it matters: determines total completion time. GPT-4o ~100–150 tok/s; Gemini Flash ~300+ tok/s.

⚠️
Developer Tip: TTFT is Your UX Problem

If your system prompt is 5,000 tokens and users send 2,000-token prompts, TTFT could be 2–4 seconds before the response even begins. Consider prompt caching, smaller system prompts, or a loading indicator that accounts for TTFT specifically.
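Both numbers can be measured from any streamed response by timestamping the chunks as they arrive. The sketch below is provider-agnostic: `measure_stream` is a hypothetical helper that accepts any iterable of text chunks, and `fake_stream` simulates a model with sleeps so the example runs without an API key. Chunks are only an approximation of tokens, so treat the rate as indicative.

```python
import time
from typing import Dict, Iterable, Iterator, Optional


def measure_stream(chunks: Iterable[str]) -> Dict[str, Optional[float]]:
    """Consume a stream and report time-to-first-chunk and chunk rate."""
    start = time.perf_counter()
    first_at: Optional[float] = None
    count = 0
    for _ in chunks:
        if first_at is None:
            first_at = time.perf_counter()   # first chunk arrived: this is TTFT
        count += 1
    end = time.perf_counter()
    if first_at is None:
        return {"ttft_s": None, "chunks": 0, "chunks_per_s": None}
    decode_s = end - first_at
    # Throughput counts only the chunks that arrive after the first one.
    rate = (count - 1) / decode_s if count > 1 and decode_s > 0 else None
    return {"ttft_s": first_at - start, "chunks": count, "chunks_per_s": rate}


def fake_stream(ttft_s: float = 0.05, gap_s: float = 0.01, n: int = 5) -> Iterator[str]:
    """Simulated model: a delay before the first chunk, then a steady trickle."""
    time.sleep(ttft_s)
    for i in range(n):
        if i:
            time.sleep(gap_s)
        yield f"tok{i} "


stats = measure_stream(fake_stream())
print(stats)
```

To use it against a real API, pass a generator that yields the text of each streamed chunk instead of `fake_stream()`.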


The Real Cost Impact: A Developer's Calculator

Since output is 4× more expensive, how you instruct the model affects your bill more than how much data you send. Scenario: 1,000 API calls/day on GPT-4o.

Same Task, 5× Cost Difference

"Write a detailed response" — 500 output tokens

500 tok × 1,000 calls = 500K tokens/day → 500K ÷ 1M × $10 = $5.00/day → $150/month.

"Be concise, 1–2 sentences" — 100 output tokens

100 tok × 1,000 calls = 100K tokens/day → 100K ÷ 1M × $10 = $1.00/day → $30/month.

Same task, and for many tasks the same quality. 5× cost difference, just by controlling output length in your prompt.
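The arithmetic in this scenario is simple enough to keep as a helper. A minimal sketch, assuming the ~$10 per 1M output tokens figure used in this article and a 30-day month; `monthly_output_cost` is a hypothetical function name, and input costs are deliberately ignored.

```python
def monthly_output_cost(tokens_per_call: int,
                        calls_per_day: int,
                        price_per_million_usd: float = 10.00,  # assumed GPT-4o output price
                        days_per_month: int = 30) -> float:
    """Monthly spend on output tokens only (input cost is ignored here)."""
    daily_tokens = tokens_per_call * calls_per_day
    return daily_tokens / 1_000_000 * price_per_million_usd * days_per_month


detailed = monthly_output_cost(500, 1_000)
concise = monthly_output_cost(100, 1_000)

print(f"detailed: ${detailed:.2f}/month")       # detailed: $150.00/month
print(f"concise:  ${concise:.2f}/month")        # concise:  $30.00/month
print(f"ratio:    {detailed / concise:.1f}x")   # ratio:    5.0x
```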


Streaming: The UX Secret

Since the model generates token-by-token anyway, streaming is free — show each token as it's produced instead of waiting for the whole response.

The Streaming Difference

Without Streaming

User stares at a blank screen for 5 seconds, then the entire 400-word response appears at once. Perceived as "slow AI."

With Streaming

User sees the first word in 0.5s and watches the response build. Perceived as "fast, responsive AI" — even though total time is identical.

python
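from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What are the best smart glasses in 2026?"}],
    stream=True,      # yield chunks as tokens are generated instead of one final message
    max_tokens=150,   # cap output length, which also caps output cost
)
for chunk in stream:
    # Some chunks carry no choices or no text (e.g. a role-only first chunk), so guard both.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()  # end the streamed output with a newline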

This is literally how ChatGPT's web interface works — the streaming appearance is the model's natural behavior surfaced directly.


The One-Way Street: Generation Is Irreversible

The most important practical implication: the model cannot go back and correct a previous token. Once "Ray" is generated, every subsequent token is conditioned on it. The model can't reconsider.

The Cost of Early Mistakes

🔒
COMMITMENT

Once "Ray" is picked, every future token must be consistent with it.

🔄
PROPAGATION

If the first token is a hallucination, the model builds a "logical" lie on top of it — by the time it's "wrong," it's committed 50 tokens to a false premise.

🎯
STABILIZATION

Format instructions (JSON, Markdown tables) work by making the right opening token ({ or |) far more likely, which primes everything after it.

Three takeaways: prompt quality matters more than you think (an ambiguous prompt can commit the model to the wrong interpretation early); this is why hallucination is hard to fix (the model doesn't "realize" a mistake — it just stays consistent with it); and output-format instructions help because they shape the first-token probability distribution.


Developer Quick Reference

Concept | What It Means | Action For You
Autoregressive | Each token depends on all previous tokens | Longer outputs = more time + cost
Output costs 4× | Generating > reading (sequential vs parallel) | Use max_tokens; prompt for conciseness
KV Cache | Attention Keys/Values for processed tokens are cached & reused | Enable prompt caching for repeated system prompts
TTFT | Time to first token, perceived as "speed" | Keep prompts lean; always stream
Streaming | Show tokens as generated | Always enable in user-facing apps
Irreversible | The model can't backtrack | Use clear prompts; consider structured outputs

The Core Insight

🚫
ChatGPT Doesn't Think — It Predicts

The illusion of intelligent, fluent text is produced by a model making one probabilistic choice at a time, each choice constrained by everything before it. There is no thinking ahead. There is no revision. There is only: given all of this, what word comes next? — done hundreds or thousands of times, very fast, very convincingly.


Pro Tips for Builders

⚠️
What Token Generation Changes For You
  • 1. Always enable streaming in user-facing apps. It costs nothing extra and makes responses feel 3–5× faster — the biggest free UX win in AI development.
  • 2. Output length is your biggest cost lever. "Explain in detail" vs "explain in 2 sentences" can be a 5–10× cost cut with no quality loss for many tasks.
  • 3. Put output-format instructions first. "Respond in JSON:" as line one primes the first token to be {, which propagates through every subsequent token.
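  • 4. Enable prompt caching for repeated system prompts. If your system prompt is 5,000 tokens across 10,000 requests/day, caching can cut the cost of those repeated input tokens substantially (roughly 50–90%, depending on the provider's cached-token discount).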
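To size the caching opportunity from tip 4, a rough estimate is enough. The sketch below assumes the ~$2.50 per 1M input tokens figure used in this article, and it treats the discount as a parameter because cached-token pricing differs by provider and model. It also assumes every request hits the cache and ignores any cache-write surcharge, so real savings will be somewhat lower. `daily_system_prompt_cost` is a hypothetical helper name.

```python
def daily_system_prompt_cost(prompt_tokens: int,
                             requests_per_day: int,
                             input_price_per_million_usd: float = 2.50,  # assumed price
                             cached_discount: float = 0.0) -> float:
    """Daily cost of resending the same system prompt on every request.

    cached_discount is the fraction taken off cached input tokens
    (0.0 = no caching); the right value depends on your provider.
    """
    if not 0.0 <= cached_discount <= 1.0:
        raise ValueError("cached_discount must be between 0 and 1")
    full_price = prompt_tokens * requests_per_day / 1_000_000 * input_price_per_million_usd
    return full_price * (1.0 - cached_discount)


baseline = daily_system_prompt_cost(5_000, 10_000)
print(f"no caching:   ${baseline:.2f}/day")
for discount in (0.5, 0.8, 0.9):  # illustrative discount rates, not quoted prices
    cost = daily_system_prompt_cost(5_000, 10_000, cached_discount=discount)
    print(f"{discount:.0%} discount: ${cost:.2f}/day")
```

With these inputs the uncached system prompt alone costs $125 per day, which is why caching matters at this volume.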

Try It Yourself

Estimate cost before you build:

python
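import tiktoken

# Assumed GPT-4o prices in USD per 1M tokens (the figures used in this article).
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00

def estimate_cost(prompt: str, expected_output_words: int) -> dict:
    """Rough per-call and daily cost estimate for a GPT-4o request."""
    enc = tiktoken.encoding_for_model("gpt-4o")
    input_tokens = len(enc.encode(prompt))
    output_tokens = int(expected_output_words * 1.33)   # ~0.75 words/token
    input_cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_M
    output_cost = (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M
    total = input_cost + output_cost
    return {
        "total_per_call": f"${total:.6f}",
        "daily_1000_calls": f"${total * 1000:.2f}",
    }

print(estimate_cost("You are a helpful assistant. What are the best smart glasses?", 200))
# → a dict with both keys; the daily figure is roughly $2.69, almost all of it output cost
#   (the exact value depends on the tokenizer's count for the prompt)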

Then measure TTFT live by timestamping the first streamed chunk, and compare stream=True vs stream=False to feel the UX difference — same total time, very different perception.


Key Takeaways

01
Predicting, Not Thinking

Fluency is an illusion produced by a model making one probabilistic choice at a time. There is no revision; there is only "given all of this, what word comes next?"

02
Streaming is Free UX

Since the model generates token-by-token, streaming is the natural behavior. Always enable it in user-facing apps to make them feel 3–5× faster.

03
Output is your Cost Lever

You pay for every sequentially generated token. Moving from "Detailed" to "Concise" can cut your output costs by about 80% (500 → 100 tokens per call) without losing quality for many tasks.


Up Next in the Series

💡
Next: The Context Window

Now that you know how the model writes, the next limit you'll hit is how much it can remember at once. Every LLM has a context window, and the "Lost in the Middle" phenomenon means information in the center can get ignored even when the model technically can read it.

Mohamed Hamed

20 years building production systems — the last several deep in AI integration, LLMs, and full-stack architecture. I write what I've actually built and broken. If this was useful, the next one goes to LinkedIn first.
