Token-by-Token: The AI Recursive Loop
ChatGPT doesn't think up its whole answer and then display it. It predicts one token, then another, then another, using each prediction as context for the next. This is autoregressive generation.
Output tokens cost about 4x as much as input tokens: reading 1M tokens costs ~$2.50, but writing 1M tokens costs ~$10.00 (exact prices vary by model and provider). This is because generation is sequential: each output token needs its own pass through the model, while the input tokens are processed together.
The One Thing an LLM Does
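Strip away the magic, and a Large Language Model does exactly one thing: predict the next token. Given everything written so far, it assigns a probability to each possible next token, and one of the likely candidates is picked.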
- 1. Context: The model reads your prompt and all tokens generated so far.
- 2. Prediction: It runs a forward pass to score the candidate next tokens and picks a likely one.
- 3. Append: The new token is fed back into the input for the next cycle.
- Result: Intelligence emerges from this recursive loop.
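The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any production model is implemented: `toy_next_token_probs` is a hypothetical lookup table standing in for the forward pass, it only looks at the last token (a real model conditions on the whole context), and every probability not quoted in this article is made up.

```python
from typing import Callable, Dict, List

Distribution = Dict[str, float]


def toy_next_token_probs(context: List[str]) -> Distribution:
    """Hypothetical stand-in for a forward pass: P(next token | context)."""
    table = {
        "[START]": {"Ray": 0.35, "The": 0.30, "Meta": 0.20, "Smart": 0.15},
        "Ray": {"-": 0.85, "s": 0.15},
        "-": {"Ban": 0.95, "Tracing": 0.05},
        "Ban": {"Meta": 0.90, "[END]": 0.10},
        "Meta": {"Ultra": 0.80, "[END]": 0.20},
        "Ultra": {"[END]": 1.0},
    }
    # Toy simplification: only the most recent token decides the distribution.
    return table[context[-1]]


def generate(
    next_token_probs: Callable[[List[str]], Distribution],
    prompt: List[str],
    max_tokens: int = 16,
) -> List[str]:
    context = list(prompt)
    output: List[str] = []
    for _ in range(max_tokens):
        probs = next_token_probs(context)   # Steps 1-2: read context, score candidates
        token = max(probs, key=probs.get)   # Greedy choice: take the most likely token
        if token == "[END]":
            break
        output.append(token)
        context.append(token)               # Step 3: feed the new token back in
    return output


tokens = generate(toy_next_token_probs, ["[START]"])
print(tokens)  # ['Ray', '-', 'Ban', 'Meta', 'Ultra']
```

Greedy selection is used here only to keep the run deterministic; real systems usually sample from the distribution, which is why the same prompt can produce different answers.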
Step-by-Step Generation
Trace how a simple sentence is built token-by-token.
- Step 1: [START] + Question → "Ray" (35% likely)
- Step 2: ... + "Ray" → "-" (85% likely)
- Step 3: ... + "-" → "Ban" (95% likely)
- Step 4: ... + "Ban" → "Meta" → "Ultra" → [END]
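Each percentage in the trace is conditional on the tokens before it, so the probability of a whole prefix is the product of its steps. A small sketch using only the three probabilities given above (Step 4 is left out because the trace does not state its probabilities):

```python
import math

# Per-step probabilities taken from Steps 1-3 of the trace above.
step_probs = [("Ray", 0.35), ("-", 0.85), ("Ban", 0.95)]

# Chain rule: P(Ray, -, Ban) = P(Ray) * P(- | Ray) * P(Ban | Ray, -)
joint = math.prod(p for _, p in step_probs)

print(f"P('Ray-Ban' as the opening) = {joint:.4f}")  # 0.2826
```

Even though two of the three steps are highly confident, the sequence as a whole is picked only about 28% of the time, because the uncertain first choice dominates.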
The Sequential Penalty: Why Output Costs More
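All input tokens are known up front, so the model can process them together in one parallel pass. Output tokens cannot be parallelized this way, because each one depends on the token generated before it.
Parallel vs Sequential
Input (reading the prompt):
- Mode: Parallel.
- Cost: ~$2.50 / 1M tokens.
- Optimization: KV cache stores the results computed for earlier tokens so they are not recomputed.
Output (generating the answer):
- Mode: Sequential (Token-by-token).
- Cost: ~$10.00 / 1M tokens.
- Constraint: Each token requires a full model pass.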
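To see what the price gap means for a single request, here is a small cost calculator. It is a sketch that hard-codes the approximate prices quoted above; the constant names and the example token counts are illustrative assumptions, and real prices depend on the model and provider.

```python
# Approximate prices from the comparison above, in USD per 1M tokens.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for one request."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Illustrative request: a 2,000-token prompt and a 500-token answer.
cost = request_cost(input_tokens=2_000, output_tokens=500)
print(f"${cost:.4f}")  # $0.0100
```

In this example the 500 output tokens cost exactly as much as the 2,000 input tokens, which is the 4x ratio at work.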
Performance Metrics that Matter
When building AI apps, "speed" is defined by two different numbers.
TTFT vs Throughput
TTFT:
- Meaning: Time to First Token.
- Impact: Perceived responsiveness.
- Bottleneck: Input size (Prompt processing).
Throughput:
- Meaning: Tokens per Second.
- Impact: Total completion time.
- Bottleneck: Model size & GPU hardware.
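The two numbers combine into a rough estimate of how long a full response takes. The sketch below assumes a constant generation speed after the first token, which is a simplification; the function name and the example figures are hypothetical, not measurements.

```python
def estimated_completion_seconds(
    ttft_s: float, output_tokens: int, tokens_per_second: float
) -> float:
    """Rough total latency: wait for the first token, then stream the rest."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_s + output_tokens / tokens_per_second


# Hypothetical figures: 0.5 s TTFT, 400 output tokens, 50 tokens/s.
total = estimated_completion_seconds(ttft_s=0.5, output_tokens=400, tokens_per_second=50)
print(f"{total:.1f} s")  # 8.5 s
```

With these figures, TTFT is a small share of the total, so a long answer is dominated by throughput, while a short answer to a very long prompt would be dominated by TTFT.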
The One-Way Street: Generation is Irreversible
The model cannot go back and correct a previous token.
The Cost of Early Mistakes
Once "Ray" is picked, every future token must be consistent with it.
If an early token is wrong, the model does not retract it; it tends to build a plausible-sounding but false continuation on top of it.
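Format instructions (JSON, tables) work largely by making the right first token ({ for JSON, | for a Markdown table) overwhelmingly likely. Once that token is emitted, the rest of the output tends to stay in the same format.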
Key Takeaways
Fluency is an illusion produced by a model making one probabilistic choice at a time. There is no revision; there is only "given all of this, what word comes next?"
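Since the model generates token-by-token, streaming is the natural behavior. Always enable it in user-facing apps: the total generation time is unchanged, but users start reading after the time to first token instead of waiting for the whole response, which makes the app feel far faster.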
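The streaming pattern can be sketched with a plain Python generator. `fake_token_stream` is a hypothetical stand-in for a provider's streaming response, and the delay simulates per-token generation time; a real client would iterate over the chunks its API returns instead.

```python
import time
from typing import Iterator, List


def fake_token_stream(tokens: List[str], delay_s: float = 0.0) -> Iterator[str]:
    """Hypothetical stand-in for a streaming API: yields tokens as they are 'generated'."""
    for token in tokens:
        time.sleep(delay_s)  # Simulated time to produce one token
        yield token


def render_streaming(stream: Iterator[str]) -> str:
    """Show each token the moment it arrives instead of waiting for the full text."""
    shown: List[str] = []
    for token in stream:
        print(token, end="", flush=True)  # flush so the user sees it immediately
        shown.append(token)
    print()
    return "".join(shown)


text = render_streaming(fake_token_stream(["Ray", "-", "Ban", " Meta"], delay_s=0.01))
```

The final string is identical to what a non-streaming call would return; the only difference is when the user first sees something on screen.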
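You pay for every output token, because each one is a sequential pass through the model. Moving from "Detailed" to "Concise" responses can cut output-token costs by as much as 80% (for example, 100 tokens instead of 500) without losing quality for many tasks.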