Part 2 — Embeddings: How AI Understands the World Without Knowing a Single Word

Embeddings: The Mathematical Fingerprint of Meaning

How does a computer understand that 'Apple' the fruit is different from 'Apple' the tech giant? It uses high-dimensional GPS coordinates called Embeddings.

Primary Objective

Understand how AI models encode meaning, compare similarity mathematically, and leverage semantic space for developer applications like RAG and Search.

Type "apple" into a grocery app → you get fruit. Type "apple" into a stock app → you get a $3 trillion company. Same four letters. Completely different meanings. How does the AI tell them apart — without actually "reading" the word?

The answer is a mathematical trick called Embeddings, and it's the silent engine behind most of the AI tools you use today.

💡

The $1 Billion Math Trick

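Netflix's own 2016 research estimated that its personalization and recommendation system, which relies on learned vector representations of users and titles (close cousins of embeddings), saves the company more than $1 billion per year in reduced churn by showing users content they actually love.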

In Part 1 (Tokens), we saw how AI breaks words into numbers. But Token #27438 (" Meta") is just a label: on its own, it carries no meaning. So how does the AI actually understand meaning? Welcome to the concept that makes modern AI possible: Embeddings and Vectors.

Phase 1 of 5

To understand how AI encodes meaning, we first need to understand how numbers can represent similarity.

The Coordinate System

To understand how AI thinks, we have to look at how humans classify the world. Imagine describing a person using only three numbers: Height (cm), Weight (kg), and Age (years).

If we represent three people using this exact format:

Alex: [180, 75, 25]
Marcus: [178, 74, 26]
Sarah: [165, 60, 30]

[Figure: Height vs. Weight, a 2D vector slice. Points: Alex [180, 75, 25], Marcus [178, 74, 26], Sarah [165, 60, 30]. Axes: Height (cm), Weight (kg).]

This is only a 2D slice of our three-number vectors; real embedding models work in hundreds or thousands of dimensions (384 and up). But even in this slice, Alex and Marcus land right next to each other, while Sarah sits far away.

The computer didn't need to look at their pictures. It didn't need to interview them. By seeing that their numbers are mathematically close, it deduced that they share similar physical characteristics. This flat list of numbers [180, 75, 25] is called a Vector.
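To make "mathematically close" concrete, here is a minimal sketch that measures the straight-line (Euclidean) distance between the three vectors above. It assumes NumPy is installed; Euclidean distance is just one simple choice of closeness measure.

```python
import numpy as np

# The three hand-picked vectors from above: [height_cm, weight_kg, age_years]
people = {
    "Alex": np.array([180, 75, 25]),
    "Marcus": np.array([178, 74, 26]),
    "Sarah": np.array([165, 60, 30]),
}

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(a - b))

# Compare every pair of people exactly once
names = list(people)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"{a:>6} <-> {b:<6} {euclidean(people[a], people[b]):6.2f}")
```

Alex and Marcus come out about 2.45 apart, while Sarah is roughly 20 away from both of them.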

Mental Model

GPS for Meaning

The Analogy

"Cairo sits at (30.04°N, 31.23°E). Paris sits at (48.85°N, 2.35°E). Nearby cities have similar coordinates, while Cairo and Tokyo are far apart."

The Reality

Words with similar meaning have similar coordinates (vectors) — just mapped in a 384-dimensional space instead of 2. The closer the coordinates, the closer the meaning.

Phase 2 of 5

Vectors work — but three dimensions aren't nearly enough to capture the complexity of real-world meaning.

The 3D Limit

Vectors are great, but three numbers can't describe the complexity of the real world. Suppose we represent smart devices in a 3D vector space using [Price ($), Weight (g), Features Count]:

Ray-Ban Smart Glasses

Vector: [549, 48, 5]
Price: $549, Weight: 48g, Features: 5

Xreal Air 3 Glasses

Vector: [449, 72, 4]
Price: $449, Weight: 72g, Features: 4

Galaxy Smart Ring

Vector: [349, 3, 4]
Price: $349, Weight: 3g, Features: 4

The AI looks at the math: the Ray-Bans and the Xreal glasses are each other's nearest neighbors, while the Galaxy Ring sits farther from both. It has grouped the eyewear together without ever being told what "glasses" are.
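The same distance check works for the device vectors from the cards above. A minimal sketch, assuming NumPy, that finds each device's nearest neighbor by Euclidean distance:

```python
import numpy as np

# [price_usd, weight_g, feature_count] from the device cards above
devices = {
    "Ray-Ban Smart Glasses": np.array([549, 48, 5]),
    "Xreal Air 3 Glasses": np.array([449, 72, 4]),
    "Galaxy Smart Ring": np.array([349, 3, 4]),
}

def nearest_neighbor(name):
    """Return the other device with the smallest Euclidean distance to `name`."""
    others = {k: v for k, v in devices.items() if k != name}
    return min(others, key=lambda k: np.linalg.norm(devices[name] - others[k]))

for name in devices:
    print(f"{name} -> nearest: {nearest_neighbor(name)}")
```

The two glasses pick each other. Notice, though, that price (hundreds of dollars) dwarfs the feature count (single digits), so price dominates the distance. With hand-picked features you would usually rescale each column first, which is one more reason hand-made vectors are limited.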

But here is the fatal flaw: What if we need to encode battery life? Camera quality? Multilingual translation capability? We can't capture all of that in just three dimensions. To truly capture the meaning and context of a word or an object, we need more numbers. A lot more.

Phase 3 of 5

This is where it gets interesting — instead of humans picking the dimensions, we let the AI discover them automatically.

From Vector to Embedding: The Masterpiece

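When humans pick the properties (Height, Weight, Price), the result is a hand-crafted Vector: subjective and limited. When a model learns the properties itself, by finding patterns across an enormous amount of text, the result is an Embedding. Both are lists of numbers; the difference is who decided what the numbers mean.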

Instead of 3 properties, the model assigns a dense list of 384 to 3,072 numbers to form a digital fingerprint of meaning:

Small Models: 384 dimensions
Medium Models: 768 dimensions
Large Models: 3,072 dimensions

So "Smart Glasses" might become [0.012, -0.443, 0.891, 0.112 ... 380 more dimensions]. Each of those numbers captures some aspect of meaning the model learned from its training text. We humans don't even know exactly what dimension #247 represents: it might encode "electronic-ness", while #89 might encode "wearable-ness".
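Where do those 384 numbers come from? Here is a heavily simplified sketch, assuming NumPy: the model keeps a table with one row per token ID, looks up the rows for a text's tokens, and averages them into one fixed-size vector (mean pooling). The vocabulary size and token IDs are hypothetical, the table is random rather than trained, and real sentence-embedding models pass the token vectors through many Transformer layers before pooling.

```python
import numpy as np

VOCAB_SIZE = 30_000  # hypothetical vocabulary size
DIM = 384            # embedding size of a "small" model

rng = np.random.default_rng(42)
# One row per token ID. In a real model these values are learned; here they are random.
embedding_table = 0.1 * rng.standard_normal((VOCAB_SIZE, DIM), dtype=np.float32)

def embed(token_ids):
    """Look up each token's row and average them into a single DIM-sized vector."""
    token_vectors = embedding_table[token_ids]   # shape: (num_tokens, DIM)
    return token_vectors.mean(axis=0)            # shape: (DIM,)

smart_glasses_ids = [8451, 1042, 9981]  # hypothetical token IDs for "Smart Glasses"
vector = embed(smart_glasses_ids)
print(vector.shape)
print(np.round(vector[:4], 3))
```

However many tokens go in, one 384-number vector comes out, which is what lets texts of different lengths be compared in the same space.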

[Figure: What the Model Learned About Ray-Ban Smart Glasses. Illustrative dimensions: Wearable-ness (#89), Electronic-ness (#247), Visual-ness (#312), Edible-ness (#33).]

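These labels are purely illustrative. The AI invents its own internal concepts, and in practice a single dimension rarely maps to one clean, human-readable idea; meaning is spread across many dimensions at once. We only see the numbers. But by analyzing patterns across billions of pages, the model organizes words so that related concepts end up close together.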

How AI Learns Meaning: Contrastive Learning

You might wonder: who decides what dimension #89 means? Nobody. The model learns these dimensions automatically through a process called Contrastive Learning. During training, it is shown millions of sentence pairs, some that mean the same thing and some that don't:

The Contrastive Training Loop

🧲PULLING (Similar Meanings)

Shown paired examples: "I need coffee" + "Necesito café" or "Smart glasses" + "AR eyewear".
Action: The AI pulls their vector coordinates closer together in space.

🚫PUSHING (Different Meanings)

Shown mismatched examples: "I need coffee" + "Je veux dormir" (I want to sleep) or "Smart glasses" + "Running shoes".
Action: The AI pushes their vector coordinates far apart in space.

After billions of such updates, the model develops internal dimensions that capture meaning. Nobody defined them; they emerge because organizing the space this way is what best satisfies all those pull-and-push constraints at once. The coordinates stabilize, creating a highly organized map of human concepts.
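The pull-and-push loop is usually written as a single loss function. Below is a simplified NumPy sketch of one common form, an in-batch contrastive (InfoNCE-style) loss: each anchor should score its own partner higher than every other sentence in the batch. The random vectors stand in for real sentence embeddings and the temperature value is an assumption; this illustrates the idea rather than the exact training recipe of any particular model.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce_loss(anchors, positives, temperature=0.05):
    """In-batch contrastive loss: row i of `positives` is the partner of row i of
    `anchors`; every other row in the batch acts as a negative to push away."""
    a, p = l2_normalize(anchors), l2_normalize(positives)
    logits = (a @ p.T) / temperature               # (batch, batch) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # subtract row max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # low loss when the diagonal wins each row

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))                  # stand-ins for 4 sentence embeddings
aligned = anchors + 0.1 * rng.normal(size=(4, 8))  # partners pointing the same way ("pulled")
shuffled = aligned[[1, 2, 3, 0]]                   # every anchor paired with the wrong partner

print(f"aligned pairs:  {info_nce_loss(anchors, aligned):.4f}")
print(f"shuffled pairs: {info_nce_loss(anchors, shuffled):.4f}")
```

Training nudges the model's weights to drive this loss down, and the only way to do that is to pull true partners together while pushing everything else in the batch apart.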

Phase 4 of 5

Now we can see what that learned geometry actually looks like across languages.

The Cross-Language Magic: Cosine Similarity

To compare vectors in high-dimensional space, we typically use a formula called Cosine Similarity. It measures the cosine of the angle between two vectors, ignoring their lengths:

1.0: Pointing in the exact same direction (identical meaning).
0.0: Orthogonal/unrelated.
-1.0: Pointing in opposite directions.
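In formula form, cosine similarity is the dot product of two vectors divided by the product of their lengths: cos(a, b) = (a · b) / (‖a‖ ‖b‖). A minimal NumPy sketch with toy 2D vectors:

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between a and b: dot product over the product of lengths."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine([1, 0], [1, 0]))    # same direction
print(cosine([1, 0], [0, 1]))    # orthogonal
print(cosine([1, 0], [-1, 0]))   # opposite directions
print(cosine([1, 2], [2, 4]))    # same direction, different length: length is ignored

# Once vectors are normalized to length 1, cosine similarity is just a dot product.
a_unit = np.array([3.0, 4.0]) / 5.0
b_unit = np.array([4.0, 3.0]) / 5.0
print(float(a_unit @ b_unit), cosine([3, 4], [4, 3]))
```

That last point is why many systems store normalized embeddings: ranking by plain dot product then gives the same order as ranking by cosine similarity.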

Here are real similarity scores calculated by a multilingual embedding model:

Necesito café ↔ I need coffee: 0.97

I need coffee ↔ Je veux dormir (I want to sleep): 0.12

reset password ↔ forgot login credentials: 0.91

Notice the last pair: they share zero keywords, yet the similarity is 0.91 because the semantic intent is identical. This is the superpower of semantic search.

Same meaning, nearly the same numbers, in any language. The AI isn't translating "café" to "coffee." During training it saw "café" and "coffee" playing the same role in equivalent sentences, so their digital fingerprints ended up nearly identical. This works because multilingual models are trained on millions of sentence pairs where the same idea appears in two languages side by side, and training of the kind we saw in Phase 3 pulls those equivalents toward the same point in embedding space.

💡 Quick Cosine Similarity Reference

Score	Meaning	Example
0.90+	Same meaning in different words/languages	"I need coffee" vs "Necesito café"
0.70 – 0.90	Highly related concepts	"Smart glasses" vs "AR eyewear"
0.30 – 0.70	Loosely related concepts	"Smart glasses" vs "Wearable tech"
< 0.30	Unrelated concepts	"I need coffee" vs "Je veux dormir"

These bands are a rough guide; exact thresholds vary from model to model.

Phase 5 of 5

Now that we understand how meaning is encoded in numbers, let's see why this transformed traditional search.

Why This Changed the Internet: Semantic Search

Having a map of meaning changes how we search, and it exposed the hard limits of traditional keyword search. Think about the old internet: if you searched an online store for "device for my eyes":

Search Evolution

🔍KEYWORD SEARCH

Query: "device for my eyes"
Logic: Find exact letter match.
Result: ❌ 0 Matches (the listing "Ray-Ban Glasses" contains none of the words "device", "for", "my", or "eyes").

🤖SEMANTIC SEARCH

Query: "device for my eyes"
Logic: Find nearest vector neighbors.
Result: ✅ Ray-Ban Glasses (Similarity: 0.95).
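The two strategies can be contrasted in a few lines. In this sketch, keyword_search is a naive word-overlap matcher, and the three-number vectors are hypothetical stand-ins for real embeddings (imagine dimensions like "wearable", "for the eyes", "electronic"); in practice they would come from an embedding model.

```python
import numpy as np

catalog = ["Ray-Ban Glasses", "Running Shoes", "Coffee Maker"]
query = "device for my eyes"

def keyword_search(query, docs):
    """Return docs sharing at least one lowercase word with the query."""
    q_words = set(query.lower().split())
    return [d for d in docs if q_words & set(d.lower().split())]

# Hypothetical toy vectors standing in for real embeddings
catalog_vectors = np.array([
    [0.9, 0.95, 0.8],   # Ray-Ban Glasses
    [0.7, 0.0, 0.1],    # Running Shoes
    [0.0, 0.0, 0.9],    # Coffee Maker
])
query_vector = np.array([0.6, 0.9, 0.5])  # hypothetical vector for "device for my eyes"

def semantic_search(q_vec, doc_vecs, docs):
    """Rank docs by cosine similarity to the query vector, best first."""
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    order = np.argsort(sims)[::-1]
    return [(docs[i], round(float(sims[i]), 2)) for i in order]

print(keyword_search(query, catalog))   # no shared words, so nothing matches
print(semantic_search(query_vector, catalog_vectors, catalog))
```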

The same nearest-neighbor idea sits at the heart of modern recommendation systems, such as the ones TikTok and Netflix use to pick what to show you next. Your watch history is converted into a vector, and the system finds the content whose vectors are nearest to yours in the coordinate space.
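A toy version of that idea, under loudly stated assumptions: the item vectors are hypothetical, and the user's taste vector is simply the average of the normalized vectors of titles they watched. Production recommenders are far more elaborate; this only shows the nearest-neighbor core.

```python
import numpy as np

# Hypothetical item vectors, dims loosely meaning [sci-fi, romance, documentary]
items = {
    "Space Saga": np.array([0.9, 0.1, 0.0]),
    "Robot Uprising": np.array([0.8, 0.0, 0.2]),
    "Paris Love Story": np.array([0.0, 0.9, 0.1]),
    "Ocean Deep": np.array([0.1, 0.0, 0.9]),
    "Mars Colony": np.array([0.85, 0.05, 0.3]),
}
watched = ["Space Saga", "Robot Uprising"]

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

# User vector: mean of the normalized vectors of everything they watched
user_vector = unit(np.mean([unit(items[t]) for t in watched], axis=0))

# Score every unwatched title by cosine similarity to the user vector
candidates = [t for t in items if t not in watched]
scores = {t: float(unit(items[t]) @ user_vector) for t in candidates}
for title in sorted(scores, key=scores.get, reverse=True):
    print(f"{scores[title]:.2f}  {title}")
```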

Code Example: Running Semantic Search in Python

Are you a developer? You can run a full multilingual embedding model and query it locally on your machine for free. We'll use an open-source model called paraphrase-multilingual-MiniLM-L12-v2: it was trained on more than 50 languages and outputs a 384-dimensional vector.

python


# Install dependencies first:
# %pip install sentence-transformers scikit-learn

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load a lightweight, multilingual embedding model
encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Our catalog of devices
devices = [
    "Ray-Ban Meta Ultra - Smart glasses with camera and translation",
    "Sony LinkBuds Open - Lightweight earbuds with translation",
    "Garmin Fenix 9 - Rugged sports watch for running",
]

# Step 1: Convert catalog sentences into vectors (embeddings)
device_embeddings = encoder.encode(devices)

# Step 2: Encode the user search query
query = "I want something very light for translating that isn't too expensive"
query_embedding = encoder.encode(query)

# Step 3: Compute similarity scores between query and catalog
scores = cosine_similarity([query_embedding], device_embeddings)[0]

# Step 4: Output the ranked results
print(f"Query: \"{query}\"\n")
ranked = scores.argsort()[::-1]
for i in ranked:
    print(f"  {scores[i]:.3f}  {devices[i]}")

Output:

text


Query: "I want something very light for translating that isn't too expensive"

  0.421  Sony LinkBuds Open - Lightweight earbuds with translation
  0.287  Ray-Ban Meta Ultra - Smart glasses with camera and translation
  0.031  Garmin Fenix 9 - Rugged sports watch for running

The AI ranked the Sony LinkBuds first even though the user never typed "Sony", "earbuds", or "LinkBuds". The model aligned the query to the correct product because of the semantic relationship between light/translating and lightweight/translation — and the Garmin sports watch scored almost zero.

The Heart of RAG

If you are building AI applications today, you are likely building something called RAG (Retrieval-Augmented Generation) — the process of giving an LLM access to your private data (PDFs, databases) so it can answer questions grounded in your facts. Embeddings are the silent engine behind it.

📄 1. Data Source: Company documents, PDFs, or database records.
🧬 2. Vectorization: Text is chunked and converted into numerical embeddings.
🗄️ 3. Vector DB: Stores the vectors and retrieves the closest matches to the query using cosine similarity.
🤖 4. Generation: The LLM writes an answer grounded in the retrieved text.
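The retrieval half of that pipeline fits in a short sketch. Everything here is a stand-in: toy_embed is a hypothetical word-count vectorizer used only so the example runs without downloading a model (a real pipeline would call an embedding model, e.g. encoder.encode from the Python example above), the "vector DB" is a plain NumPy matrix, and the final LLM call in step 4 is omitted; the sketch stops at building the grounded prompt.

```python
import re
import numpy as np

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

# 1. Data source: a tiny hypothetical "company handbook", chunked by paragraph
handbook = """Employees get 25 vacation days per year.

The office is closed on public holidays.

Expense reports are due by the 5th of each month."""
chunks = [c.strip() for c in handbook.split("\n\n") if c.strip()]

# 2. Vectorization: toy stand-in embedding (word counts over the chunk vocabulary)
vocab = {w: i for i, w in enumerate(sorted({w for c in chunks for w in tokenize(c)}))}

def toy_embed(text):
    vec = np.zeros(len(vocab))
    for w in tokenize(text):
        if w in vocab:
            vec[vocab[w]] += 1.0
    return vec

# 3. "Vector DB": a matrix of unit-length chunk vectors, searched by cosine similarity
index = np.array([toy_embed(c) for c in chunks])
index /= np.linalg.norm(index, axis=1, keepdims=True)

def retrieve(question, k=1):
    """Return the k chunks most similar to the question."""
    q = toy_embed(question)
    norm = np.linalg.norm(q)
    if norm == 0:
        return []  # the question shares no vocabulary with the index
    sims = index @ (q / norm)
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

# 4. Generation: build a grounded prompt (the actual LLM call is left out)
question = "How many vacation days do employees get?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```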

In practice, most RAG systems and modern search engines are built on embeddings. The AI does not understand words; it understands the distance between numbers in a high-dimensional space. And that turns out to be a far more powerful way to understand the universe.

Try It in Your Head

If we pass the word "Apple" to an embedding model inside two different sentences:

"Apple" in a sentence about the fruit
"Apple" in a sentence about the tech company

Will their vectors be close or far apart in the embedding space? Think about the context.


Answer

They will be far apart.

In modern Transformer models, embeddings are contextual. The word "Apple" (fruit) appears near words like orchard, eat, tree, pie, juice, red. The word "Apple" (company) appears near MacBook, stock, iPhone, App Store, CEO, Tim Cook.

Because the surrounding contexts are completely different, the two uses map to distant points in the coordinate system, despite being spelled exactly the same. This is a big reason semantic search handles ambiguous words far better than keyword search does.

Pro Tips for Builders


✅ Use the same model for indexing and querying. If you embed your documents with model A and your queries with model B, the scores are meaningless — they live in different spaces.
✅ Match the model to your domain. paraphrase-multilingual-MiniLM-L12-v2 is great for general text. For legal, medical, or code, use domain-specific models.
✅ Embed at the chunk level, not the document level. A 50-page PDF embedded as one vector loses all its detail. Split into paragraphs first.
⚠️ Cosine similarity ≠ probability. A score of 0.85 doesn't mean "85% match." Calibrate thresholds on your own data.
⚠️ Bigger isn't always better. A 3,072-dimensional model costs ~8× more to store and query than a 384-dim one, often with only marginal accuracy gains for simple retrieval.
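The ~8× figure in the last tip is plain arithmetic on raw vector storage. A quick sketch, assuming float32 values (4 bytes each) and a hypothetical corpus of one million chunks; real vector databases add index overhead on top, and many support compression.

```python
BYTES_PER_FLOAT32 = 4
NUM_CHUNKS = 1_000_000  # hypothetical corpus size

def raw_storage_gb(dims, n=NUM_CHUNKS):
    """Raw size of n float32 vectors with the given dimensionality, in GB (10**9 bytes)."""
    return n * dims * BYTES_PER_FLOAT32 / 1e9

small, large = raw_storage_gb(384), raw_storage_gb(3072)
print(f"384 dims:   {small:.3f} GB")
print(f"3,072 dims: {large:.3f} GB")
print(f"ratio:      {large / small:.0f}x")
```

Brute-force search cost grows the same way, since every similarity computation touches every dimension.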

Common Misconceptions

❌ The Myth

The AI model understands what words actually mean.

✅ The Reality

AI models do not possess conscious conceptual understanding. They map statistical relationships of how words appear in context. It is advanced pattern matching, not true comprehension.

❌ The Myth

A word always maps to the same vector coordinate.

✅ The Reality

Context changes coordinates. Modern Transformer models generate context-aware embeddings, so 'Apple' in a fruit context and 'Apple' in a tech stock context produce very different vectors.

❌ The Myth

You need to translate texts to English before comparing them.

✅ The Reality

Multilingual embedding models project many languages into one shared semantic coordinate space, so sentences with the same meaning land close together regardless of language.

Key Takeaways

✓What You Learned

✓ Embeddings convert text, images, and other data into lists of numbers called vectors.
✓ Words and phrases with similar semantic meaning sit close together in the vector space.
✓ AI learns these dimensions automatically via Contrastive Learning on billions of text examples.
✓ Cosine Similarity measures the angle between vectors (1.0 = same direction, same meaning; 0.0 = unrelated).
✓ Embeddings power Semantic Search, Recommendation Systems, and Retrieval-Augmented Generation (RAG).
✓ You can download and run multilingual embedding models locally on your computer for free.

Try It Yourself

Run the code above and try these three experiments to build your intuition:

Spanish query test: Change the query to "Quiero algo ligero para traducir que no sea muy caro". Does the Sony LinkBuds still rank first? It should — same meaning, different language.
The Apple test: Add two new items to the catalog, "Apple fruit - fresh red apple from the orchard" and "Apple MacBook - laptop computer". Then query "I want something to eat": which Apple lands closer?
Swap the model: Replace paraphrase-multilingual-MiniLM-L12-v2 with all-MiniLM-L6-v2 (English only, faster). Run the Spanish query and watch the scores drop, evidence that the multilingual model is doing real cross-language work, not keyword matching.

💡

Full Code on GitHub

The complete Jupyter notebook for this article — all examples, the cosine similarity visualiser, and the semantic search engine — lives in the AI Fundamentals repository. Open Vectors.ipynb →

Up Next in the Series

💡

Series Roadmap

Part 3 — Vector Databases: Now that you know what embeddings are, discover how Vector Databases index and query millions of high-dimensional vectors in under a millisecond. Read Part 3 →

Coming later — RAG: Give Your AI a Memory: How to connect any LLM to your private data using the embedding pipeline you just learned.