THE LANGUAGE OF AI
Tokens
"The AI doesn't read words. It reads numbers."
Did you know that AI doesn't actually read your words at all? To be more precise: it doesn't see the letters or words you type. It sees something else entirely.
That something is called a Token.
The Journey of Your Words Inside an AI
Before we dive into what a token is, let's trace what actually happens the moment you send a message to an AI.
The moment you send any text to an AI, it runs a process called Tokenization — splitting your text into small units called tokens. Those tokens are converted into numbers. The model processes those numbers, understands the relationships between them, figures out what response fits, and converts its output back into words.
That's the complete pipeline. Every single time.
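Here is that round trip in miniature. This toy sketch uses a made-up five-entry vocabulary and invented IDs (real models have vocabularies of 100K+ entries), but the shape of the pipeline is the same:

```python
# Toy pipeline: text -> tokens -> numeric IDs -> back to text.
# Note the leading spaces: in real tokenizers the space before a word
# is usually part of the token itself.
vocab = {"smart": 101, " glasses": 102, " for": 103, " $": 104, "549": 105}
id_to_token = {i: t for t, i in vocab.items()}

def encode(tokens):
    """Map each token string to its numeric ID."""
    return [vocab[t] for t in tokens]

def decode(ids):
    """Map numeric IDs back to token strings and join them."""
    return "".join(id_to_token[i] for i in ids)

tokens = ["smart", " glasses", " for", " $", "549"]
ids = encode(tokens)
print(ids)          # [101, 102, 103, 104, 105]  <- what the model sees
print(decode(ids))  # smart glasses for $549     <- what you see
```

The model only ever operates on the middle representation, the list of numbers.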
What Is a Token?
When you read a sentence, you read it word by word. The AI doesn't work that way — it cuts the text into these small token pieces.
A Real Example: 6 Words, 10 Tokens
Take this sentence: "Ray-Ban Meta Ultra smart glasses for $549"
You might count 6 words. But here's what the AI actually sees:
Even spaces before words, hyphens, and currency symbols become their own tokens.
| Token | What it is |
|---|---|
| Ray | Part of a compound word |
| - | The hyphen, treated separately |
| Ban | The second part |
| Meta | A word (note the leading space) |
| Ultra | A word |
| smart | A word |
| glasses | A word |
| for | A word |
| $ | The currency symbol, its own token |
| 549 | The number |
Why Does It Cut Text This Way?
The technique is called Subword Tokenization. The model learned which sequences appear most frequently in its training data, and common ones get their own single token. Rare or complex ones get split.
✅ Simple words → 1 token
⚡ Complex words → many tokens
Arabic words fall into the complex category because of the lack of Arabic training data compared to English. Less data → fewer complete Arabic words in the vocabulary → more aggressive splitting.
The exact splits depend on the specific tokenizer's learned vocabulary. The splits shown here are illustrations — the actual tokenizer may differ by model.
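The splitting itself can be sketched as a greedy longest-match against a vocabulary. The vocabulary below is made up for illustration, not any real model's, but it shows why common sequences survive whole while rare ones get shredded:

```python
# Toy subword tokenizer: greedy longest-match against a tiny hypothetical
# vocabulary. Common sequences stay whole; anything the vocabulary doesn't
# cover falls back to single characters.
VOCAB = {"token", "ization", "iz", "ation"}

def tokenize(text):
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at i that is in the vocabulary
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                pieces.append(text[i:j])
                i = j
                break
        else:
            pieces.append(text[i])  # unknown: fall back to one character
            i += 1
    return pieces

print(tokenize("tokenization"))  # ['token', 'ization']      <- 2 tokens
print(tokenize("ionization"))    # ['i', 'o', 'n', 'ization'] <- shredded
```

A word well covered by the vocabulary costs 2 tokens; a rarer word of similar length costs 4. That is exactly the English-vs-Arabic effect in miniature.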
How the Vocabulary Was Actually Built: Byte-Pair Encoding
The vocabulary of tens of thousands of tokens didn't appear by magic. It was learned from data using an algorithm called Byte-Pair Encoding (BPE) — and understanding it explains why some words are a single token and others get shredded into pieces.
The BPE Algorithm — 3 Steps
BPE in action — building "the" from characters
This is why English has a richer vocabulary than Arabic in any given tokenizer — the BPE algorithm found and encoded more frequent English patterns because there was more English text to learn from.
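The three steps can be sketched in a few lines of Python. This is a minimal illustration on a six-word toy corpus, not a production implementation:

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Minimal BPE sketch: (1) start from single characters,
    (2) count adjacent pairs across the corpus,
    (3) merge the most frequent pair, then repeat."""
    corpus = [tuple(w) for w in words]   # each word as a tuple of chars
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word in corpus:
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the winning pair with the merged symbol
        new_corpus = []
        for word in corpus:
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus.append(tuple(out))
        corpus = new_corpus
    return merges, corpus

# "the" dominates this toy corpus, so its characters get merged first
words = ["the", "the", "the", "then", "they", "than"]
merges, corpus = bpe_train(words, 2)
print(merges)   # [('t', 'h'), ('th', 'e')]  -> "the" becomes one token
print(corpus)
```

Run the same loop over billions of words instead of six, and the merges that win are exactly the frequent patterns of the dominant language in the training data.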
Impact #1: Cost
Every AI model charges you per token — not per message, not per word. Per token.
Here's what the major models cost as of early 2026:
Pricing as of March 2026. Gemini 2.5 Pro costs double for prompts exceeding 200K tokens. Always check official provider pages for latest rates.
Notice how every model charges separately for Input (what you send) and Output (what the AI replies). You're paying for both directions. And notice the range — GPT-4o mini at $0.15 input vs Claude Opus at $5.00. Understanding tokens lets you pick the right model for the right job.
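To make that range concrete, here is a quick sketch pricing the same input workload at the two input rates quoted above. Output pricing is omitted for simplicity, and you should always check the providers' pages for current rates:

```python
# Input prices per 1M tokens, taken from the comparison above.
PRICE_PER_1M_INPUT = {
    "gpt-4o-mini": 0.15,
    "claude-opus": 5.00,
}

def input_cost(tokens, model):
    """Dollar cost of sending `tokens` input tokens to `model`."""
    return tokens / 1_000_000 * PRICE_PER_1M_INPUT[model]

# Same 1M-token workload, two very different bills:
for model in PRICE_PER_1M_INPUT:
    print(f"{model}: ${input_cost(1_000_000, model):.2f}")
# gpt-4o-mini: $0.15
# claude-opus: $5.00
```

A 33x price gap on identical input is why model choice is a token question, not just a quality question.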
The English vs. Arabic Gap
Here's where it gets real for Arabic speakers: the same content typically consumes roughly 50% more tokens in Arabic than in English.
Not because Arabic is "worse", but purely because the tokenizer has seen less Arabic text and hasn't built as rich a vocabulary for it.
This gap has been improving with newer models. GPT-4o's tokenizer significantly improved Arabic efficiency vs. older models. The 50% figure is a reasonable average — actual overhead depends on the model and text type.
What This Means in Real Money
Say you send 100 messages a day for a month on a mid-range model:
🇺🇸 English
100 msg × 30 days × ~1,500 tokens
= 4.5 million tokens / month
🇪🇬 Arabic
100 msg × 30 days × ~2,300 tokens
= 6.9 million tokens / month
Scale that to 500–1000 messages/day and it becomes hundreds of dollars.
Calculation assumes GPT-4o pricing with ~70% input / 30% output token split. Actual costs vary by message structure and model.
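The arithmetic in that note can be written out directly. The per-million rates below are illustrative assumptions for a mid-range model, not official pricing:

```python
# Reproduce the monthly-volume math above, then price it.
MESSAGES_PER_DAY, DAYS = 100, 30
PRICE_IN, PRICE_OUT = 2.50, 10.00        # $ per 1M tokens (assumed rates)
INPUT_SHARE, OUTPUT_SHARE = 0.70, 0.30   # ~70% input / 30% output split

def monthly_cost(tokens_per_msg):
    total = MESSAGES_PER_DAY * DAYS * tokens_per_msg
    cost = (total * INPUT_SHARE * PRICE_IN +
            total * OUTPUT_SHARE * PRICE_OUT) / 1_000_000
    return total, cost

for lang, per_msg in [("English", 1_500), ("Arabic", 2_300)]:
    total, cost = monthly_cost(per_msg)
    print(f"{lang}: {total / 1e6:.1f}M tokens -> ${cost:.2f}/month")
```

Same 100 messages a day, but the Arabic bill runs roughly 50% higher, purely from the token overhead.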
Impact #2: Context Window
Every AI model has a maximum number of tokens it can see at one time. This is called the Context Window — think of it like the model's working RAM.
As of 2026, most frontier models (GPT-4o, Claude Sonnet, Gemini 2.5) have 128K–1M context windows by default.
When the context window fills up, the AI can't process anything beyond it — it forgets the older parts of your conversation.
A bigger context window = the AI remembers more of your conversation. But the more you send, the more tokens you consume, the more you pay.
Impact #3: Speed
Here's something most people don't think about:
The AI doesn't generate its entire response at once. It produces one token at a time, and each token feeds into the next.
This is why you see text streaming word by word in chat interfaces. The model literally can't generate the full response in one shot — it's always token by token.
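A toy version of that loop, using a hypothetical lookup table instead of a real model's probability distribution over the whole vocabulary:

```python
# Toy autoregressive generation: a fake "model" that only knows which
# token tends to follow which. Real models compute probabilities over
# 100K+ tokens at every step, but the one-at-a-time loop is the same.
NEXT = {"smart": "glasses", "glasses": "for", "for": "$549"}

def generate(prompt_token, max_tokens=10):
    output = [prompt_token]
    while output[-1] in NEXT and len(output) < max_tokens:
        output.append(NEXT[output[-1]])  # each token feeds into the next
    return output

print(" ".join(generate("smart")))  # smart glasses for $549
```

Because step N+1 cannot start until token N exists, response time grows with output length, which is also why capping output tokens speeds things up.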
Every Model Has Its Own Vocabulary
Different AI models don't share the same tokenizer. Each has its own vocabulary — a dictionary of tens of thousands to hundreds of thousands of word-pieces, each mapped to a unique numeric ID.
GPT-4o uses ~200,000 vocabulary entries. Older GPT models used ~100,000.
So the word "smart" might map to completely different numbers depending on the model:
Model A
"smart" = #10,119
Model B
"smart" = #28,644
This means token counts aren't directly comparable across models. A "1M token" context in one model doesn't hold the same amount of real text as a "1M token" context in another.
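A small illustration with two invented vocabularies (the IDs echo the examples above but are otherwise made up):

```python
# Two hypothetical model vocabularies. Same word, different ID; and a word
# that is one token in model A must be split into two in model B.
VOCAB_A = {"smart": 10_119, "glasses": 7_021}
VOCAB_B = {"smart": 28_644, "glass": 3_002, "es": 415}

print(VOCAB_A["smart"], VOCAB_B["smart"])   # different IDs for "smart"
print([VOCAB_A["glasses"]])                 # 1 token in model A
print([VOCAB_B["glass"], VOCAB_B["es"]])    # 2 tokens in model B
```

So "glasses" costs one token on one model and two on the other; multiplied over a long document, that is why the same text yields different token bills on different models.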
Special Tokens: The Hidden Language
Beyond regular text tokens, every model uses special tokens — reserved entries that signal structure, boundaries, and roles. You don't type them, but they're always there, and they count against your token budget.
Common Special Tokens
<|endoftext|> : marks the end of a document or sequence (GPT-family models)
[CLS] / [SEP] : classification and separator markers (BERT-family models)
<|im_start|> : opens a chat message in the ChatML conversation format
[PAD] : filler used to pad sequences to a uniform length
When you send a message with role: "user" or write a system prompt, boundary tokens like these are automatically inserted around your text, and they count.
See It in Code (Python)
%pip install -q tiktoken
import tiktoken
tokenizer = tiktoken.encoding_for_model("gpt-4o")
text_en = "Ray-Ban Meta Ultra smart glasses with 48MP camera"
tokens_en = tokenizer.encode(text_en)
print(f"English token count: {len(tokens_en)}") # ~12
text_ar = "نظارة ذكية خفيفة بكاميرا وترجمة فورية"
tokens_ar = tokenizer.encode(text_ar)
print(f"Arabic token count: {len(tokens_ar)}") # ~15
And to see exactly what the AI sees — the numeric IDs behind each token:
for t in tokens_en:
    print(f" ID {t:>6} → '{tokenizer.decode([t])}'")
Output:
ID 27438 → ' Meta' ID 38414 → ' Ultra' ID 8847 → ' smart'
ID 40081 → ' glasses' ID 483 → ' with' ID 3519 → '48'
ID 9125 → 'MP' ID 9427 → ' camera'
Compare the same concept across languages:
concept = {
"English": "smart glasses with camera",
"Arabic": "نظارة ذكية بكاميرا",
"Chinese": "带摄像头的智能眼镜",
"Japanese": "カメラ付きスマートグラス",
}
print(f"{'Language':<10} {'Tokens':>7}")
print("-" * 20)
for lang, text in concept.items():
    n = len(tokenizer.encode(text))
    print(f"{lang:<10} {n:>7}")
# English 5 ← most efficient (largest training corpus)
# Arabic 9 ← ~80% overhead
# Chinese 7 ← moderate (improved in recent models)
# Japanese 10 ← kanji + katakana = complex splitting
3 Ways to Save Tokens (and Money)
1. Write your prompts in English, ask for a reply in Arabic
You save tokens on the input side — usually the larger part of a conversation. Your answer still comes back in Arabic, but your question used far fewer tokens to get there.
2. Be direct and concise
A short, specific prompt outperforms a long, rambling one — and uses fewer tokens. Every extra sentence costs money. A well-crafted prompt can save you tens of dollars a month at scale.
3. Specify the response length
Tell the AI exactly how long you want the answer: "respond in 3 bullet points" or "summarize in 2 sentences." This directly controls how many output tokens the model generates.
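As a rough sanity check before sending, the common heuristic of about 4 characters per token for English can flag a bloated prompt. This is an approximation only; use the model's real tokenizer for exact counts:

```python
# Rough token estimate via the ~4-characters-per-token heuristic (English).
def estimate_tokens(text):
    return max(1, len(text) // 4)

rambling = ("I was wondering if you could maybe help me out, when you have "
            "a moment, with putting together some kind of summary of the "
            "article that I am going to paste below, thanks so much!")
concise = "Summarize the article below in 3 bullet points."

print(estimate_tokens(rambling), "vs", estimate_tokens(concise))
```

Both prompts ask for the same thing; the concise one costs a fraction of the tokens and, as a bonus, usually gets a more focused answer.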
Pro Tips for Builders
💡 What Knowing Tokens Changes For You as a Developer
Count tokens before you send, not after you pay. Use tiktoken (OpenAI) or the model provider's tokenizer to pre-count your prompts in code. Build token budgeting into your app's logic from day one — it's far harder to retrofit later.
Use max_tokens to cap your output costs. The API parameter max_tokens (or max_completion_tokens) hard-limits the response length. Without it, an open-ended question can return a 2,000-token essay when you needed 50 words.
System prompts are not free. A detailed system prompt of 500 words costs 600–800 tokens — on every single request. Distill it to its most essential constraints, or use prompt caching where the provider supports it.
Match model to task, not to default. A summarization task that costs $0.50/day on Claude Opus costs $0.015/day on GPT-4o mini — with similar quality for simple tasks. Token pricing is the key variable in that decision.
Conversation history multiplies your input token cost. In a chat app, every message in the history is re-sent with each new request. A 20-turn conversation means the 20th message costs 20× more input tokens than the 1st. Implement a rolling window or summarization strategy early.
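The rolling-window idea from the last tip can be sketched like this. In production the per-message token counts would come from the model's real tokenizer rather than being hardcoded:

```python
# Rolling context window: keep only the newest messages whose combined
# token count fits within the budget.
def fit_to_budget(history, budget):
    """history: list of (message, token_count) pairs, oldest first.
    Returns the most recent messages that fit within `budget` tokens."""
    kept, used = [], 0
    for msg, n in reversed(history):     # walk newest -> oldest
        if used + n > budget:
            break
        kept.append((msg, n))
        used += n
    return list(reversed(kept))          # restore chronological order

history = [("turn 1", 400), ("turn 2", 300), ("turn 3", 500), ("turn 4", 200)]
print(fit_to_budget(history, 800))  # [('turn 3', 500), ('turn 4', 200)]
```

A common refinement is to pin the system prompt outside the window and summarize the dropped turns into a single short message instead of discarding them entirely.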
The Core Insight
The AI doesn't understand language.
It understands numbers.
Every word you type gets converted to numbers. The model finds the right numbers to respond with. Those numbers become words. That's the entire process — a very sophisticated number-matching machine. Once you understand that, the cost, speed, and memory limits all make intuitive sense: they're all just constraints on how many numbers the model can process at once.
Try It Yourself
The best way to make this click is to see it live. Use OpenAI's free tokenizer tool at platform.openai.com/tokenizer — paste any text and watch it get split token by token, with each token highlighted in a different color.
Try this experiment: enter your full name first in Arabic, then in English. See how many tokens each version produces. If your name is on the longer side, the difference might surprise you.
Feel the cost difference across languages:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
messages = {
"English": "Please summarize this article in three bullet points.",
"Arabic": "من فضلك لخص هذا المقال في ثلاث نقاط.",
"French": "Veuillez résumer cet article en trois points.",
"Spanish": "Por favor, resume este artículo en tres puntos.",
}
print(f"{'Language':<10} {'Tokens':>8} {'Relative cost':>14}")
print("-" * 34)
base = len(enc.encode(messages["English"]))
for lang, text in messages.items():
    n = len(enc.encode(text))
    print(f"{lang:<10} {n:>8} {n/base*100:>13.0f}%")
# Language Tokens Relative cost
# English 10 100%
# Arabic 11 110%
# French 13 130%
# Spanish 12 120%
Next in AI Fundamentals
Embeddings
The magic that lets the AI understand meaning from those numbers — and why "نظارة" and "glasses" end up representing the same concept even though they look nothing alike.