THE HIDDEN TAX OF AI
Output Is King
INPUT COST
$2.50
Per 1M Tokens (GPT-4o)
OUTPUT COST
$10.00
Per 1M Tokens (GPT-4o)
The reason? The AI writes very slowly on the inside — one token at a time.
Last article we saw the Transformer architecture. Today we watch it in action during live generation — and discover why the output side is 4x more expensive.
Here's something that surprises most developers when they first hear it:
ChatGPT doesn't compose its answer in advance and then display it.
It predicts one token. Then another. Then another. Each prediction uses the previous ones as context. It's not writing — it's recursively predicting.
Remember how the Transformer reads everything in parallel (previous article)? Generation flips that on its head — now it's forced to be sequential because each new token depends on the last.
And understanding this one fact changes how you design prompts, control API costs, build streaming UIs, and debug unexpected AI behavior.
The One Thing an LLM Actually Does
Strip away all the complexity and a large language model does exactly one thing:
Given all the tokens it has seen so far, predict the single most likely next token.
Think of it like predictive text on your phone — except instead of suggesting 3 words, it's choosing from 100,000+ possible tokens, and it does this thousands of times to build a complete response.
Step-by-Step: How Generation Actually Works
Let's trace through a real example. Prompt: "What's the best smart glasses?"
<div style="display: flex; gap: 16px; align-items: center; flex-wrap: wrap;">
<div style="padding: 10px 16px; background: rgba(34,211,238,0.1); border: 1px solid #22d3ee; border-radius: 10px; font-size: 0.85rem; flex-shrink: 0;">
"What's the best smart glasses?" + [START]
</div>
<div style="color: #555; font-size: 1.3rem;">→</div>
<div style="padding: 10px 16px; background: rgba(168,85,247,0.2); border: 2px solid #a855f7; border-radius: 10px; font-weight: bold; color: #e9d5ff;">"Ray" — 35% ⭐</div>
<span style="color: #aaa; font-size: 0.8rem;">Step 1</span>
</div>
<div style="display: flex; gap: 16px; align-items: center; flex-wrap: wrap;">
<div style="padding: 10px 16px; background: #0d1117; border: 1px solid rgba(34,211,238,0.4); border-radius: 10px; font-size: 0.85rem; flex-shrink: 0;">
"What's...glasses?" + "Ray"
</div>
<div style="color: #555; font-size: 1.3rem;">→</div>
<div style="padding: 10px 16px; background: rgba(168,85,247,0.2); border: 2px solid #a855f7; border-radius: 10px; font-weight: bold; color: #e9d5ff;">"-" — 85% ⭐</div>
<span style="color: #aaa; font-size: 0.8rem;">Step 2</span>
</div>
<div style="display: flex; gap: 16px; align-items: center; flex-wrap: wrap;">
<div style="padding: 10px 16px; background: #0d1117; border: 1px solid rgba(34,211,238,0.3); border-radius: 10px; font-size: 0.85rem; flex-shrink: 0;">
"What's...glasses?" + "Ray-"
</div>
<div style="color: #555; font-size: 1.3rem;">→</div>
<div style="padding: 10px 16px; background: rgba(168,85,247,0.2); border: 2px solid #a855f7; border-radius: 10px; font-weight: bold; color: #e9d5ff;">"Ban" — 95% ⭐</div>
<span style="color: #aaa; font-size: 0.8rem;">Step 3</span>
</div>
<div style="display: flex; gap: 16px; align-items: center; flex-wrap: wrap;">
<div style="padding: 10px 16px; background: #0d1117; border: 1px solid rgba(255,255,255,0.1); border-radius: 10px; font-size: 0.85rem; flex-shrink: 0;">
"...Ray-Ban" + all prev
</div>
<div style="color: #555; font-size: 1.3rem;">→</div>
<div style="padding: 10px 16px; background: rgba(168,85,247,0.1); border: 1px dashed #a855f7; border-radius: 10px; font-size: 0.85rem; color: #c4b5fd;">"Meta" → "Ultra" → "because" → ... → [END]</div>
</div>
✅ Final response (assembled from sequential predictions):
"Ray-Ban Meta Ultra — lightweight, 48MP camera, translates 40 languages, full-day battery."
Generated token-by-token — never computed all at once.
Autoregressive Generation: The Mathematical Reality
The formal name for this process is autoregressive generation — each output token becomes part of the input for the next prediction. (This is the same "next-token prediction" that the training loop from Article 4 taught the model to do — except now it's happening live during inference.)
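The loop itself is simple enough to sketch. Here `predict_next` is a hypothetical stand-in for a real model's forward pass (a hard-coded lookup, not an actual network) — the point is the structure: each predicted token is appended to the context before the next prediction runs.

```python
# Sketch of autoregressive generation. `predict_next` stands in for the
# full Transformer forward pass; the loop structure is what matters.
END = "[END]"

def predict_next(context: list[str]) -> str:
    # Toy "model": a hard-coded continuation table keyed on the last token.
    table = {
        "[START]": "Ray", "Ray": "-", "-": "Ban",
        "Ban": "Meta", "Meta": END,
    }
    return table[context[-1]]

def generate(prompt_tokens: list[str]) -> list[str]:
    context = prompt_tokens + ["[START]"]
    output = []
    while True:
        token = predict_next(context)   # one full forward pass per token
        if token == END:
            break
        context.append(token)           # the new token becomes part of the input
        output.append(token)
    return output

print(generate(["What's", "the", "best", "smart", "glasses", "?"]))
# → ['Ray', '-', 'Ban', 'Meta']
```

Notice there is no "plan" anywhere — the response exists only as the accumulated side effect of repeated single-token predictions.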
This creates a critical asymmetry: reading your prompt (the "prefill" phase) can be heavily parallelized, but generating the response (the "decode" phase) cannot.
This is why output tokens cost 4x more than input tokens. Reading your 10,000-token prompt is largely one parallel pass. But generating each output token requires a sequential forward pass through the full model — there's no way to batch or parallelize this without changing the output.
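A quick back-of-envelope check of what that asymmetry means in dollars, using the GPT-4o prices quoted at the top of this article:

```python
# The same 10,000 tokens, priced as input vs. output (GPT-4o rates).
INPUT_PER_M, OUTPUT_PER_M = 2.50, 10.00

tokens = 10_000
as_input = tokens / 1_000_000 * INPUT_PER_M    # parallel prefill
as_output = tokens / 1_000_000 * OUTPUT_PER_M  # sequential decode

print(f"as input:  ${as_input:.3f}")   # $0.025
print(f"as output: ${as_output:.3f}")  # $0.100 — 4x more
```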
The Probability Distribution: Every Token Is a Vote
At each generation step, the model doesn't just know the one "right" answer. It produces a probability distribution over its entire vocabulary — every possible next token, each with a likelihood score.
Probability Distribution After "What's the best smart..."
⚠️ The model doesn't always pick the highest-probability token — that's controlled by Temperature (a topic for another article).
This is the same softmax activation we saw inside the neuron (Article 3) and Transformer block (Article 6) — here it converts raw logits over the full vocabulary into a probability distribution over what to say next.
The model selects one token, appends it to the context, and runs the entire prediction process again. This continues until it generates an [END] token or hits a maximum length.
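A minimal sketch of that selection step, assuming made-up logits for a handful of candidate tokens (a real model scores the full 100K+ vocabulary at every step):

```python
import math
import random

# Hypothetical raw logits for a few candidate next tokens.
logits = {"Ray": 2.0, "Apple": 1.2, "Meta": 0.8, "the": -1.0}

# Softmax: convert raw logits into a probability distribution.
exps = {tok: math.exp(v) for tok, v in logits.items()}
total = sum(exps.values())
probs = {tok: v / total for tok, v in exps.items()}
print({tok: round(p, 3) for tok, p in probs.items()})

# Greedy decoding picks the argmax; sampling draws from the distribution
# (this is where temperature would reshape the weights).
greedy = max(probs, key=probs.get)   # always "Ray" for these logits
sampled = random.choices(list(probs), weights=probs.values())[0]
print(greedy, sampled)
```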
KV Cache: The Hidden Optimization That Makes This Workable
Here's the obvious problem with autoregressive generation: if each new token required the model to re-read the entire context (your prompt + all previous outputs) from scratch, total computation would grow at least quadratically with response length. A 1,000-token response from a 10,000-token prompt would be impossibly slow.
The solution is the KV Cache (Key-Value Cache):
Without KV Cache
Every new token requires reprocessing the entire context from scratch
Token 2: read all 10K input again
Token 3: read all 10K input again
... (10,000x overhead per token)
With KV Cache
Previously computed Key/Value vectors are stored and reused
Token 2: compute K/V for 1 new token, reuse the rest
Token 3: compute K/V for 1 new token, reuse the rest
... (near-constant new work per token)
How it works technically: During the Transformer's attention computation, every token produces a Key (K) and Value (V) vector. These don't change for tokens already processed. The KV Cache stores them in GPU memory, so each new generation step only needs to compute the K and V for the one new token. This reuse is only possible because of the self-attention mechanism from the previous article — without Q/K/V, there would be nothing to cache.
This is also why reading input is cheaper than generating output — the entire input can be processed in one forward pass with full parallelization, while output tokens must be generated one at a time even with the cache.
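The reuse is easy to see in a toy attention computation. This sketch (pure Python, 1-dimensional "vectors" for readability) shows that appending one cached K/V pair per step gives exactly the same attention output as recomputing every K and V from scratch:

```python
import math

def attention(q, keys, values):
    # Scaled-down sketch of attention: one query scored against all keys,
    # softmax weights applied to the values.
    scores = [q * k for k in keys]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * v for w, v in zip(weights, values))

tokens = [0.5, -1.0, 2.0, 0.3]   # toy per-token features
k_cache, v_cache = [], []

for step, t in enumerate(tokens, start=1):
    q, k, v = t * 1.1, t * 0.9, t * 0.7   # toy Q/K/V "projections"
    k_cache.append(k)                      # cache grows by ONE entry per step
    v_cache.append(v)
    cached = attention(q, k_cache, v_cache)

    # Recompute every K and V from scratch — far more work, same answer.
    full_k = [x * 0.9 for x in tokens[:step]]
    full_v = [x * 0.7 for x in tokens[:step]]
    scratch = attention(q, full_k, full_v)
    assert abs(cached - scratch) < 1e-12

print("cached attention matches full recomputation at every step")
```

The cache trades GPU memory for compute — which is also why very long contexts are memory-hungry to serve.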
How This All Fits Inside the Transformer We Just Learned
Every generation step is a partial forward pass through the full Transformer stack:
The new token passes through Positional Encoding (Article 6) — it gets a position vector so the model knows it's token #347, not #1.
Multi-Head Self-Attention runs — but with KV Cache, only the new token's Q is computed fresh; all previous K/V pairs are retrieved from cache.
The result flows through the Feed-Forward layers (where the neurons from Article 3 live) — in every one of the model's stacked Transformer blocks (96 layers in GPT-3-class models, for example).
The final layer outputs a probability distribution via softmax over the 100K+ vocabulary — one token is selected, appended, and the loop repeats.
TTFT and Throughput: The Two Metrics That Matter
When building AI applications, two performance metrics dominate:
⚡ TTFT (Time to First Token)
How long before the user sees the first word of the response.
Dominated by: Input processing time. Bigger prompts = longer TTFT.
Why it matters: Users perceive TTFT as "responsiveness." A 3-second TTFT feels laggy even if generation speed is fast.
📊 Throughput (Tokens/Second)
How fast the model generates tokens after the first one appears.
Dominated by: Model size, hardware, and batch efficiency.
Why it matters: For long responses, throughput determines total completion time. GPT-4o: ~100-150 tok/s. Gemini Flash: ~300+ tok/s.
⚡ Developer tip: TTFT is your user experience problem
If your system prompt is 5,000 tokens and your users are sending 2,000-token prompts, your TTFT could be 2-4 seconds before the response even begins. Consider prompt caching, smaller system prompts, or showing a loading indicator that accounts for TTFT specifically.
The Real Cost Impact: A Developer's Calculator
Since output generation is 4x more expensive than input processing, how you instruct the model affects your bill more than how much data you send.
Scenario: 1,000 API calls per day on GPT-4o
<div style="flex: 1; min-width: 220px; background: #0d1117; border: 2px solid #ef4444; border-radius: 12px; padding: 18px;">
<p style="color: #ef4444; font-weight: bold; margin-bottom: 10px;">❌ "Write a detailed response" — 500 output tokens</p>
<div style="font-family: monospace; font-size: 0.85rem; color: #94a3b8;">
<p style="margin: 4px 0;">500 tok × 1,000 calls = 500K tokens</p>
<p style="margin: 4px 0;">500K ÷ 1M × $10.00</p>
</div>
<p style="color: #ef4444; font-size: 1.3rem; font-weight: bold; margin-top: 10px; margin-bottom: 0;">$5.00/day → $150/month</p>
</div>
<div style="flex: 1; min-width: 220px; background: #0d1117; border: 2px solid #22d3ee; border-radius: 12px; padding: 18px;">
<p style="color: #22d3ee; font-weight: bold; margin-bottom: 10px;">✅ "Be concise, 1-2 sentences" — 100 output tokens</p>
<div style="font-family: monospace; font-size: 0.85rem; color: #94a3b8;">
<p style="margin: 4px 0;">100 tok × 1,000 calls = 100K tokens</p>
<p style="margin: 4px 0;">100K ÷ 1M × $10.00</p>
</div>
<p style="color: #22d3ee; font-size: 1.3rem; font-weight: bold; margin-top: 10px; margin-bottom: 0;">$1.00/day → $30/month</p>
</div>
Streaming: The UX Secret
Since the model generates token-by-token anyway, streaming is free — you can show each token to the user as it's produced instead of waiting for the complete response.
Without Streaming
User stares at a blank screen for 5 seconds. Then the entire 400-word response appears at once. Perceived as "slow AI."
With Streaming
User sees the first word appear in 0.5 seconds, then watches the response build. Perceived as "fast, responsive AI" — even if total time is the same.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What are the best smart glasses in 2026?"}],
    stream=True,     # ← This is all you need — stream=True is free and transforms UX
    max_tokens=150,  # ← Control output length = control cost
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

# Output appears token by token, not all at once
This is literally how ChatGPT's web interface works — the streaming appearance is the natural behavior of the model surfaced directly to the user.
The One-Way Street: Generation Is Irreversible
Here's the most important practical implication of token-by-token generation:
The model cannot go back and correct a previous token.
Once "Ray" is generated and added to the context, the model is committed. Every subsequent token is conditioned on "Ray" appearing there. If the model had wanted to say "Apple" but statistical chance led it to generate "Ray" first, it now has to generate something coherent following "Ray" — it cannot reconsider.
If your prompt is ambiguous, the model might generate an early token that commits it to the wrong interpretation. It will then generate a coherent-but-wrong response. Better prompts → better first tokens → better entire responses.
If the model generates a confident-sounding but wrong fact early in a response, it doesn't "realize" the mistake — it just continues generating tokens that are consistent with the wrong fact. This is why early hallucinations are so hard to fix — by the time the model "knows" it's on the wrong track, it has already committed 50 tokens to a false premise. The next article covers hallucination in depth.
If you instruct the model to output JSON or markdown at the start, it will generate the opening `{` or `#` token first, which statistically primes all subsequent tokens to follow that format. Prompts like "respond in JSON" work because they shape the first-token probability distribution.
Developer Quick Reference
The Core Insight
ChatGPT doesn't think — it predicts
The illusion of intelligent, fluent text is produced by a model that makes one probabilistic choice at a time, each choice constrained by everything that came before. There is no thinking ahead. There is no revision. There is only: given all of this, what word comes next? Done 100, 500, 2,000 times — very fast, very convincingly.
Pro Tips for Builders
💡 What Knowing Token Generation Changes For You
Always enable streaming in user-facing apps. It costs nothing extra and makes responses feel 3-5x faster to users. The perceived latency drop is the biggest free UX win in AI development.
Output length is your biggest cost lever. The difference between "explain in detail" and "explain in 2 sentences" can be a 5-10x cost reduction with no quality loss for many tasks.
Put output format instructions first. "Respond in JSON:" as the first line of your prompt statistically primes the first token to be {, which propagates through every subsequent token. The model doesn't plan ahead — it just follows the path its first token started.
Enable prompt caching for repeated system prompts. Anthropic and OpenAI both offer prompt caching — if your system prompt is 5,000 tokens and you send 10,000 requests/day, caching can cut your input costs by 80-90%.
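The savings from that last tip are easy to estimate. A sketch of the arithmetic, assuming GPT-4o input pricing and a hypothetical 90% discount on cached tokens (actual discount rates vary by provider and model — check current pricing pages):

```python
# Estimated daily input cost with and without prompt caching.
# CACHE_DISCOUNT is an assumption for illustration; providers differ.
INPUT_PER_M = 2.50
CACHE_DISCOUNT = 0.90          # assumed; provider-dependent

system_tokens = 5_000          # reused system prompt, cacheable
user_tokens = 500              # unique per request, never cached
requests_per_day = 10_000

def daily_cost(cached: bool) -> float:
    cached_rate = INPUT_PER_M * (1 - CACHE_DISCOUNT) if cached else INPUT_PER_M
    system_cost = system_tokens * requests_per_day / 1_000_000 * cached_rate
    user_cost = user_tokens * requests_per_day / 1_000_000 * INPUT_PER_M
    return system_cost + user_cost

print(f"no cache:   ${daily_cost(False):.2f}/day")   # $137.50
print(f"with cache: ${daily_cost(True):.2f}/day")    # $25.00
```

With these assumed numbers, caching cuts input spend from $137.50 to $25.00 per day — roughly an 82% reduction, consistent with the 80-90% range above, because the large system prompt dominates the input bill.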
Try It Yourself
Experiment 1: Count the cost before building
import tiktoken

def estimate_cost(prompt: str, expected_output_words: int) -> dict:
    enc = tiktoken.encoding_for_model("gpt-4o")
    input_tokens = len(enc.encode(prompt))
    output_tokens = int(expected_output_words * 1.33)  # ~0.75 words per token
    input_cost = (input_tokens / 1_000_000) * 2.50
    output_cost = (output_tokens / 1_000_000) * 10.00
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "input_cost": f"${input_cost:.6f}",
        "output_cost": f"${output_cost:.6f}",
        "total_per_call": f"${input_cost + output_cost:.6f}",
        "daily_1000_calls": f"${(input_cost + output_cost) * 1000:.2f}",
    }

# Test it
result = estimate_cost(
    prompt="You are a helpful assistant. What are the best smart glasses?",
    expected_output_words=200,
)
print(result)
# → {'input_tokens': ..., 'output_tokens': 266, 'daily_1000_calls': ...}
Experiment 2: Visualize generation timing
Add timestamps to streaming output to see TTFT vs token rate:
import time
from openai import OpenAI

client = OpenAI()

start = time.time()
first_token = True
for chunk in client.chat.completions.create(model="gpt-4o",
                                            messages=[...], stream=True):
    if chunk.choices[0].delta.content:
        if first_token:
            print(f"\n⚡ TTFT: {time.time() - start:.2f}s")
            first_token = False
        print(chunk.choices[0].delta.content, end="", flush=True)
Experiment 3: stream=True vs stream=False — feel the UX difference
import time
from openai import OpenAI

client = OpenAI()
prompt = [{"role": "user", "content": "Write a 3-paragraph summary of how LLMs work."}]

# Without streaming — user waits for the full response
start = time.time()
response = client.chat.completions.create(model="gpt-4o", messages=prompt, stream=False)
print(f"Non-streaming total wait: {time.time() - start:.2f}s")
print(response.choices[0].message.content)

# With streaming — user sees first token almost immediately
print("\n--- Streaming version ---")
start = time.time()
first_token_time = None
for chunk in client.chat.completions.create(model="gpt-4o", messages=prompt, stream=True):
    if chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.time() - start
            print(f"⚡ TTFT: {first_token_time:.2f}s")
        print(chunk.choices[0].delta.content, end="", flush=True)

# Same total time — but perceived as much faster because content starts immediately
Now that you understand how the model actually writes, the next limit you'll hit is how much it can remember at once — the context window.
NEXT IN SERIES
Context Window: The Invisible Limit on Every AI Conversation
Every LLM has a context window — a limit on how many tokens it can "see" at once. In the next article, we'll explore what fills up that window (system prompt, conversation history, RAG results, your query), how models ranging from GPT-4o (128K) to Gemini 1.5 Pro (1M) handle it differently, and the "Lost in the Middle" phenomenon — why information in the center of your context gets ignored even when the model technically can read it.
Coming next: context-window-article.md