THE HIDDEN TAX OF AI
Output Is King
INPUT COST
$2.50
Per 1M Tokens (GPT-4o)
OUTPUT COST
$10.00
Per 1M Tokens (GPT-4o)
The reason? The AI writes very slowly on the inside — one token at a time.
Last article we saw the Transformer architecture. Today we watch it in action during live generation — and discover why the output side is 4x more expensive.
Here's something that surprises most developers when they first hear it:
ChatGPT doesn't compose its answer in advance and then display it.
It predicts one token. Then another. Then another. Each prediction uses the previous ones as context. It's not writing — it's recursively predicting.
Remember how the Transformer reads everything in parallel (previous article)? Generation flips that on its head — now it's forced to be sequential because each new token depends on the last.
And understanding this one fact changes how you design prompts, control API costs, build streaming UIs, and debug unexpected AI behavior.
The One Thing an LLM Actually Does
Strip away all the complexity and a large language model does exactly one thing:
Given all the tokens it has seen so far, predict the single most likely next token.
Think of it like predictive text on your phone — except instead of suggesting 3 words, it's choosing from 100,000+ possible tokens, and it does this thousands of times to build a complete response.
Step-by-Step: How Generation Actually Works
Let's trace through a real example. Prompt: "What's the best smart glasses?"
<div style="display: flex; gap: 16px; align-items: center; flex-wrap: wrap;">
<div style="padding: 10px 16px; background: rgba(34,211,238,0.1); border: 1px solid #22d3ee; border-radius: 10px; font-size: 0.85rem; flex-shrink: 0;">
"What's the best smart glasses?" + [START]
</div>
<div style="color: #555; font-size: 1.3rem;">→</div>
<div style="padding: 10px 16px; background: rgba(168,85,247,0.2); border: 2px solid #a855f7; border-radius: 10px; font-weight: bold; color: #e9d5ff;">"Ray" — 35% ⭐</div>
<span style="color: #aaa; font-size: 0.8rem;">Step 1</span>
</div>
<div style="display: flex; gap: 16px; align-items: center; flex-wrap: wrap;">
<div style="padding: 10px 16px; background: #0d1117; border: 1px solid rgba(34,211,238,0.4); border-radius: 10px; font-size: 0.85rem; flex-shrink: 0;">
"What's...glasses?" + "Ray"
</div>
<div style="color: #555; font-size: 1.3rem;">→</div>
<div style="padding: 10px 16px; background: rgba(168,85,247,0.2); border: 2px solid #a855f7; border-radius: 10px; font-weight: bold; color: #e9d5ff;">"-" — 85% ⭐</div>
<span style="color: #aaa; font-size: 0.8rem;">Step 2</span>
</div>
<div style="display: flex; gap: 16px; align-items: center; flex-wrap: wrap;">
<div style="padding: 10px 16px; background: #0d1117; border: 1px solid rgba(34,211,238,0.3); border-radius: 10px; font-size: 0.85rem; flex-shrink: 0;">
"What's...glasses?" + "Ray-"
</div>
<div style="color: #555; font-size: 1.3rem;">→</div>
<div style="padding: 10px 16px; background: rgba(168,85,247,0.2); border: 2px solid #a855f7; border-radius: 10px; font-weight: bold; color: #e9d5ff;">"Ban" — 95% ⭐</div>
<span style="color: #aaa; font-size: 0.8rem;">Step 3</span>
</div>
<div style="display: flex; gap: 16px; align-items: center; flex-wrap: wrap;">
<div style="padding: 10px 16px; background: #0d1117; border: 1px solid rgba(255,255,255,0.1); border-radius: 10px; font-size: 0.85rem; flex-shrink: 0;">
"...Ray-Ban" + all prev
</div>
<div style="color: #555; font-size: 1.3rem;">→</div>
<div style="padding: 10px 16px; background: rgba(168,85,247,0.1); border: 1px dashed #a855f7; border-radius: 10px; font-size: 0.85rem; color: #c4b5fd;">"Meta" → "Ultra" → "because" → ... → [END]</div>
</div>
✅ Final response (assembled from sequential predictions):
"Ray-Ban Meta Ultra — lightweight, 48MP camera, translates 40 languages, full-day battery."
Generated token-by-token — never computed all at once.
Autoregressive Generation: The Mathematical Reality
The formal name for this process is autoregressive generation — each output token becomes part of the input for the next prediction. (This is the same "next-token prediction" that the training loop from Article 4 taught the model to do — except now it's happening live during inference.)
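The loop itself is simple enough to sketch. Here `predict_next` is a hypothetical stand-in for a real model's forward pass (a hard-coded lookup, not an actual network) — the point is the structure: each predicted token is appended to the context before the next prediction runs.

```python
# Sketch of autoregressive generation. `predict_next` stands in for the
# full Transformer forward pass; the loop structure is what matters.
END = "[END]"

def predict_next(context: list[str]) -> str:
    # Toy "model": a hard-coded continuation table keyed on the last token.
    table = {
        "[START]": "Ray", "Ray": "-", "-": "Ban",
        "Ban": "Meta", "Meta": END,
    }
    return table[context[-1]]

def generate(prompt_tokens: list[str]) -> list[str]:
    context = prompt_tokens + ["[START]"]
    output = []
    while True:
        token = predict_next(context)   # one full forward pass per token
        if token == END:
            break
        context.append(token)           # the new token becomes part of the input
        output.append(token)
    return output

print(generate(["What's", "the", "best", "smart", "glasses", "?"]))
# → ['Ray', '-', 'Ban', 'Meta']
```

Notice there is no "plan" anywhere — the response exists only as the accumulated side effect of repeated single-token predictions.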
This creates a critical asymmetry: reading your prompt (the "prefill" phase) can be heavily parallelized, but generating the response (the "decode" phase) cannot.
This is why output tokens cost 4x more than input tokens. Reading your 10,000-token prompt is largely one parallel pass. But generating each output token requires a sequential forward pass through the full model — there's no way to batch or parallelize this without changing the output.
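A quick back-of-envelope check of what that asymmetry means in dollars, using the GPT-4o prices quoted at the top of this article:

```python
# The same 10,000 tokens, priced as input vs. output (GPT-4o rates).
INPUT_PER_M, OUTPUT_PER_M = 2.50, 10.00

tokens = 10_000
as_input = tokens / 1_000_000 * INPUT_PER_M    # parallel prefill
as_output = tokens / 1_000_000 * OUTPUT_PER_M  # sequential decode

print(f"as input:  ${as_input:.3f}")   # $0.025
print(f"as output: ${as_output:.3f}")  # $0.100 — 4x more
```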
The Probability Distribution: Every Token Is a Vote
At each generation step, the model doesn't just know the one "right" answer. It produces a probability distribution over its entire vocabulary — every possible next token, each with a likelihood score.
Probability Distribution After "What's the best smart..."
⚠️ The model doesn't always pick the highest-probability token — that's controlled by Temperature (a topic for another article).
This is the same softmax activation we saw inside the neuron (Article 3) and Transformer block (Article 6) — here it converts raw logits over the full vocabulary into a probability distribution over what to say next.
The model selects one token, appends it to the context, and runs the entire prediction process again. This continues until it generates an [END] token or hits a maximum length.
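A minimal sketch of that selection step, assuming made-up logits for a handful of candidate tokens (a real model scores the full 100K+ vocabulary at every step):

```python
import math
import random

# Hypothetical raw logits for a few candidate next tokens.
logits = {"Ray": 2.0, "Apple": 1.2, "Meta": 0.8, "the": -1.0}

# Softmax: convert raw logits into a probability distribution.
exps = {tok: math.exp(v) for tok, v in logits.items()}
total = sum(exps.values())
probs = {tok: v / total for tok, v in exps.items()}
print({tok: round(p, 3) for tok, p in probs.items()})

# Greedy decoding picks the argmax; sampling draws from the distribution
# (this is where temperature would reshape the weights).
greedy = max(probs, key=probs.get)   # always "Ray" for these logits
sampled = random.choices(list(probs), weights=probs.values())[0]
print(greedy, sampled)
```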
KV Cache: The Hidden Optimization That Makes This Workable
Here's the obvious problem with autoregressive generation: if each new token required the model to re-read the entire context (your prompt + all previous outputs) from scratch, total computation would grow at least quadratically with response length. A 1,000-token response from a 10,000-token prompt would be impossibly slow.
The solution is the KV Cache (Key-Value Cache):
Without KV Cache
Every new token requires reprocessing the entire context from scratch
Token 2: read all 10K input again
Token 3: read all 10K input again
... (10,000x overhead per token)
With KV Cache
Previously computed Key/Value vectors are stored and reused
Token 2: compute K/V for 1 new token, reuse the rest
Token 3: compute K/V for 1 new token, reuse the rest
... (near-constant new work per token)
How it works technically: During the Transformer's attention computation, every token produces a Key (K) and Value (V) vector. These don't change for tokens already processed. The KV Cache stores them in GPU memory, so each new generation step only needs to compute the K and V for the one new token. This reuse is only possible because of the self-attention mechanism from the previous article — without Q/K/V, there would be nothing to cache.
This is also why reading input is cheaper than generating output — the entire input can be processed in one forward pass with full parallelization, while output tokens must be generated one at a time even with the cache.
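The reuse is easy to see in a toy attention computation. This sketch (pure Python, 1-dimensional "vectors" for readability) shows that appending one cached K/V pair per step gives exactly the same attention output as recomputing every K and V from scratch:

```python
import math

def attention(q, keys, values):
    # Scaled-down sketch of attention: one query scored against all keys,
    # softmax weights applied to the values.
    scores = [q * k for k in keys]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * v for w, v in zip(weights, values))

tokens = [0.5, -1.0, 2.0, 0.3]   # toy per-token features
k_cache, v_cache = [], []

for step, t in enumerate(tokens, start=1):
    q, k, v = t * 1.1, t * 0.9, t * 0.7   # toy Q/K/V "projections"
    k_cache.append(k)                      # cache grows by ONE entry per step
    v_cache.append(v)
    cached = attention(q, k_cache, v_cache)

    # Recompute every K and V from scratch — far more work, same answer.
    full_k = [x * 0.9 for x in tokens[:step]]
    full_v = [x * 0.7 for x in tokens[:step]]
    scratch = attention(q, full_k, full_v)
    assert abs(cached - scratch) < 1e-12

print("cached attention matches full recomputation at every step")
```

The cache trades GPU memory for compute — which is also why very long contexts are memory-hungry to serve.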
How This All Fits Inside the Transformer We Just Learned
Every generation step is a partial forward pass through the full Transformer stack:
The new token passes through Positional Encoding (Article 6) — it gets a position vector so the model knows it's token #347, not #1.
Multi-Head Self-Attention runs — but with KV Cache, only the new token's Q is computed fresh; all previous K/V pairs are retrieved from cache.
The result flows through the Feed-Forward layers (where the neurons from Article 3 live) — in every one of the model's stacked Transformer blocks (96 layers in GPT-3-class models, for example).
The final layer outputs a probability distribution via softmax over the 100K+ vocabulary — one token is selected, appended, and the loop repeats.
TTFT and Throughput: The Two Metrics That Matter
When building AI applications, two performance metrics dominate:
⚡ TTFT (Time to First Token)
How long before the user sees the first word of the response.
Dominated by: Input processing time. Bigger prompts = longer TTFT.
Why it matters: Users perceive TTFT as "responsiveness." A 3-second TTFT feels laggy even if generation speed is fast.
📊 Throughput (Tokens/Second)
How fast the model generates tokens after the first one appears.
Dominated by: Model size, hardware, and batch efficiency.
Why it matters: For long responses, throughput determines total completion time. GPT-4o: ~100-150 tok/s. Gemini Flash: ~300+ tok/s.
⚡ Developer tip: TTFT is your user experience problem
If your system prompt is 5,000 tokens and your users are sending 2,000-token prompts, your TTFT could be 2-4 seconds before the response even begins. Consider prompt caching, smaller system prompts, or showing a loading indicator that accounts for TTFT specifically.
The Real Cost Impact: A Developer's Calculator
Since output generation is 4x more expensive than input processing, how you instruct the model affects your bill more than how much data you send.
Scenario: 1,000 API calls per day on GPT-4o
<div style="flex: 1; min-width: 220px; background: #0d1117; border: 2px solid #ef4444; border-radius: 12px; padding: 18px;">
<p style="color: #ef4444; font-weight: bold; margin-bottom: 10px;">❌ "Write a detailed response" — 500 output tokens</p>
<div style="font-family: monospace; font-size: 0.85rem; color: #94a3b8;">
<p style="margin: 4px 0;">500 tok × 1,000 calls = 500K tokens</p>
<p style="margin: 4px 0;">500K ÷ 1M × $10.00</p>
</div>
<p style="color: #ef4444; font-size: 1.3rem; font-weight: bold; margin-top: 10px; margin-bottom: 0;">$5.00/day → $150/month</p>
</div>
<div style="flex: 1; min-width: 220px; background: #0d1117; border: 2px solid #22d3ee; border-radius: 12px; padding: 18px;">
<p style="color: #22d3ee; font-weight: bold; margin-bottom: 10px;">✅ "Be concise, 1-2 sentences" — 100 output tokens</p>
<div style="font-family: monospace; font-size: 0.85rem; color: #94a3b8;">
<p style="margin: 4px 0;">100 tok × 1,000 calls = 100K tokens</p>
<p style="margin: 4px 0;">100K ÷ 1M × $10.00</p>
</div>
<p style="color: #22d3ee; font-size: 1.3rem; font-weight: bold; margin-top: 10px; margin-bottom: 0;">$1.00/day → $30/month</p>
</div>
Streaming: The UX Secret
Since the model generates token-by-token anyway, streaming is free — you can show each token to the user as it's produced instead of waiting for the complete response.
Without Streaming
User stares at a blank screen for 5 seconds. Then the entire 400-word response appears at once. Perceived as "slow AI."
With Streaming
User sees the first word appear in 0.5 seconds, then watches the response build. Perceived as "fast, responsive AI" — even if total time is the same.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What are the best smart glasses in 2026?"}],
    stream=True,     # ← This is all you need — stream=True is free and transforms UX
    max_tokens=150,  # ← Control output length = control cost
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

# Output appears token by token, not all at once
This is literally how ChatGPT's web interface works — the streaming appearance is the natural behavior of the model surfaced directly to the user.
The One-Way Street: Generation Is Irreversible
Here's the most important practical implication of token-by-token generation:
The model cannot go back and correct a previous token.
Once "Ray" is generated and added to the context, the model is committed. Every subsequent token is conditioned on "Ray" appearing there. If the model had wanted to say "Apple" but statistical chance led it to generate "Ray" first, it now has to generate something coherent following "Ray" — it cannot reconsider.
If your prompt is ambiguous, the model might generate an early token that commits it to the wrong interpretation. It will then generate a coherent-but-wrong response. Better prompts → better first tokens → better entire responses.
If the model generates a confident-sounding but wrong fact early in a response, it doesn't "realize" the mistake — it just continues generating tokens that are consistent with the wrong fact. This is why early hallucinations are so hard to fix — by the time the model "knows" it's on the wrong track, it has already committed 50 tokens to a false premise. The next article covers hallucination in depth.
If you instruct the model to output JSON or markdown at the start, it will generate the opening `{` or `#` token first, which statistically primes all subsequent tokens to follow that format. Prompts like "respond in JSON" work because they shape the first-token probability distribution.
Developer Quick Reference
The Core Insight
ChatGPT doesn't think — it predicts
The illusion of intelligent, fluent text is produced by a model that makes one probabilistic choice at a time, each choice constrained by everything that came before. There is no thinking ahead. There is no revision. There is only: given all of this, what word comes next? Done 100, 500, 2,000 times — very fast, very convincingly.
Pro Tips for Builders
💡 What Knowing Token Generation Changes For You
Always enable streaming in user-facing apps. It costs nothing extra and makes responses feel 3-5x faster to users. The perceived latency drop is the biggest free UX win in AI development.
Output length is your biggest cost lever. The difference between "explain in detail" and "explain in 2 sentences" can be a 5-10x cost reduction with no quality loss for many tasks.
Put output format instructions first. "Respond in JSON:" as the first line of your prompt statistically primes the first token to be {, which propagates through every subsequent token. The model doesn't plan ahead — it just follows the path its first token started.
Enable prompt caching for repeated system prompts. Anthropic and OpenAI both offer prompt caching — if your system prompt is 5,000 tokens and you send 10,000 requests/day, caching can cut your input costs by 80-90%.
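The savings from that last tip are easy to estimate. A sketch of the arithmetic, assuming GPT-4o input pricing and a hypothetical 90% discount on cached tokens (actual discount rates vary by provider and model — check current pricing pages):

```python
# Estimated daily input cost with and without prompt caching.
# CACHE_DISCOUNT is an assumption for illustration; providers differ.
INPUT_PER_M = 2.50
CACHE_DISCOUNT = 0.90          # assumed; provider-dependent

system_tokens = 5_000          # reused system prompt, cacheable
user_tokens = 500              # unique per request, never cached
requests_per_day = 10_000

def daily_cost(cached: bool) -> float:
    cached_rate = INPUT_PER_M * (1 - CACHE_DISCOUNT) if cached else INPUT_PER_M
    system_cost = system_tokens * requests_per_day / 1_000_000 * cached_rate
    user_cost = user_tokens * requests_per_day / 1_000_000 * INPUT_PER_M
    return system_cost + user_cost

print(f"no cache:   ${daily_cost(False):.2f}/day")   # $137.50
print(f"with cache: ${daily_cost(True):.2f}/day")    # $25.00
```

With these assumed numbers, caching cuts input spend from $137.50 to $25.00 per day — roughly an 82% reduction, consistent with the 80-90% range above, because the large system prompt dominates the input bill.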
Try It Yourself
Experiment 1: Count the cost before building
import tiktoken

def estimate_cost(prompt: str, expected_output_words: int) -> dict:
    enc = tiktoken.encoding_for_model("gpt-4o")
    input_tokens = len(enc.encode(prompt))
    output_tokens = int(expected_output_words * 1.33)  # ~0.75 words per token
    input_cost = (input_tokens / 1_000_000) * 2.50
    output_cost = (output_tokens / 1_000_000) * 10.00
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "input_cost": f"${input_cost:.6f}",
        "output_cost": f"${output_cost:.6f}",
        "total_per_call": f"${input_cost + output_cost:.6f}",
        "daily_1000_calls": f"${(input_cost + output_cost) * 1000:.2f}",
    }

# Test it
result = estimate_cost(
    prompt="You are a helpful assistant. What are the best smart glasses?",
    expected_output_words=200,
)
print(result)
# → {'input_tokens': ..., 'output_tokens': 266, 'daily_1000_calls': ...}
Experiment 2: Visualize generation timing
Add timestamps to streaming output to see TTFT vs token rate:
import time
from openai import OpenAI

client = OpenAI()

start = time.time()
first_token = True
for chunk in client.chat.completions.create(model="gpt-4o",
                                            messages=[...], stream=True):
    if chunk.choices[0].delta.content:
        if first_token:
            print(f"\n⚡ TTFT: {time.time() - start:.2f}s")
            first_token = False
        print(chunk.choices[0].delta.content, end="", flush=True)
Experiment 3: stream=True vs stream=False — feel the UX difference
import time
from openai import OpenAI

client = OpenAI()
prompt = [{"role": "user", "content": "Write a 3-paragraph summary of how LLMs work."}]

# Without streaming — user waits for the full response
start = time.time()
response = client.chat.completions.create(model="gpt-4o", messages=prompt, stream=False)
print(f"Non-streaming total wait: {time.time() - start:.2f}s")
print(response.choices[0].message.content)

# With streaming — user sees first token almost immediately
print("\n--- Streaming version ---")
start = time.time()
first_token_time = None
for chunk in client.chat.completions.create(model="gpt-4o", messages=prompt, stream=True):
    if chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.time() - start
            print(f"⚡ TTFT: {first_token_time:.2f}s")
        print(chunk.choices[0].delta.content, end="", flush=True)

# Same total time — but perceived as much faster because content starts immediately
Now that you understand how the model actually writes, the next limit you'll hit is how much it can remember at once — the context window.
NEXT IN SERIES
Context Window: The Invisible Limit on Every AI Conversation
Every LLM has a context window — a limit on how many tokens it can "see" at once. In the next article, we'll explore what fills up that window (system prompt, conversation history, RAG results, your query), how models ranging from GPT-4o (128K) to Gemini 1.5 Pro (1M) handle it differently, and the "Lost in the Middle" phenomenon — why information in the center of your context gets ignored even when the model technically can read it.
Coming next: context-window-article.md