AI Fundamentals · Part 3 of 4

Part 3 — How AI Finds Your Answer in 30 Million Documents in Under a Second

The old internet searched for words. The new internet searches for meaning. Here's the math that killed keyword search — and powers everything from TikTok to RAG.

March 12, 2026
10 min read
#AI · #Similarity Search · #Vectors · #Cosine Similarity · #Vector Database · #RAG · #Semantic Search

Imagine a catalog with 30 million products. A customer types: "I want something lightweight for translation that isn't too expensive."

How does AI find the right product — without scanning every description one by one?

In our last article on Embeddings, we learned that AI converts words into lists of numbers called vectors — digital fingerprints of meaning. Now we go one level further: once you have millions of those fingerprints, how do you find the one that matches your question?

That's Similarity Search — and it's the engine behind Netflix, TikTok, ChatGPT's memory, and every RAG application in production today.


The Problem with the Old Internet

Before 2015 or so, search meant one thing: find pages that contain the exact words you typed.

The Old Way: Keyword Search

Matches exact letters. Nothing more.

🔍 "device for my eyes" → ❌ 0 results
("glasses" doesn't contain the letters d-e-v-i-c-e)

The AI Way: Semantic Search

Converts the query to math. Finds the nearest meaning.

🔍 "device for my eyes" → [0.88, -0.42, 0.91 ...] → ✅ Smart Glasses
(the vectors are 95% similar in meaning-space)

The difference isn't better engineering. It's a fundamentally different question being asked:

  • Keyword search asks: Does this page contain these letters?
  • Semantic search asks: Does this page mean the same thing as my query?
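The contrast fits in a few lines of Python. The vectors below are toy 3-dimensional values picked by hand for illustration; in a real system, an embedding model produces them:

```python
from sklearn.metrics.pairwise import cosine_similarity

doc_text = "smart glasses"

# Keyword search: does the document contain the letters "device"? No.
print("device" in doc_text)  # False, zero results

# Semantic search: compare meaning vectors instead of letters.
# Toy 3-d vectors, hand-picked for illustration; a real model produces them.
query_vec = [[0.88, -0.42, 0.91]]  # "device for my eyes"
doc_vec   = [[0.90, -0.40, 0.88]]  # "smart glasses"
print(cosine_similarity(query_vec, doc_vec)[0][0])  # close to 1.0
```

The substring test fails because no letters match; the vector test succeeds because the two phrases land in nearly the same spot in meaning-space.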

How Similarity Search Works in 3 Steps

Every semantic search system — from e-commerce to RAG — follows the same three-step process:

1️⃣ Embed the catalog: convert every item description into a vector. Done once, stored forever.

2️⃣ Embed the query: convert the user's question into a vector at query time. Done on every request.

3️⃣ Find the top-K matches: measure the mathematical distance between the query vector and every catalog vector. Return the closest K.

The catalog embeddings are computed offline, once. Only the query embedding runs on every request; that's why it's fast.

The Three Ways to Measure Similarity

Now comes the math. Given two vectors, how do you measure how "close" they are? There are three standard methods.

Method 1: Cosine Similarity — The Flashlight

Think of every sentence as an arrow pointing in some direction in high-dimensional space. Cosine Similarity measures the angle between two arrows: if two sentences point in nearly the same direction, they mean nearly the same thing. Picture two people both facing north: they point the same direction even if one is two meters tall and the other is a child. The direction tells you the meaning, not the length.

Your query compared against three products:

  • RingConverse: 0.95 ✅ near-identical direction (small angle): same topic, same intent
  • Ray-Ban: 0.30 ⚠️ different direction: related, but not a close match
  • Garmin: -0.03 ❌ opposite direction: completely unrelated topic

Scale: +1 = identical  |  0 = unrelated  |  -1 = opposite

Why Cosine for text? A short tweet and a long article about the same topic will point in the same direction even though their lengths are completely different. Cosine only cares about direction — which is exactly what we want for meaning.
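The formula is short enough to write from scratch. Here is a minimal sketch using hand-picked 2-dimensional "meaning" vectors (real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(angle) = (a · b) / (|a| · |b|): direction only, length cancels out."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-picked 2-d "meaning" vectors for illustration
translation_ring = np.array([0.9, 0.1])
translation_buds = np.array([0.8, 0.2])   # small angle to the ring
sports_watch     = np.array([-0.7, 0.1])  # points the opposite way

print(cosine_similarity(translation_ring, translation_buds))  # near +1
print(cosine_similarity(translation_ring, sports_watch))      # negative
```

The two translation devices score near +1 because their arrows nearly overlap; the sports watch scores negative because its arrow points the other way.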

Method 2: Euclidean Distance — The Ruler

Instead of measuring the angle, measure the straight-line distance between two points in space. If two items are close together geometrically, they're similar.

🔦

Cosine (The Flashlight)

"Are you pointing the same direction as me?"
Best for: text, documents, queries

📏

Euclidean (The Ruler)

"How far apart are you from me?"
Best for: images, coordinates, numbers
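A quick sketch of where the ruler and the flashlight disagree: a short text and a long text about the same topic sit far apart geometrically, yet point in exactly the same direction. The vectors are toy values, hand-picked so the long one is a scaled-up copy of the short one:

```python
import numpy as np

short_doc = np.array([0.2, 0.5, 0.1])  # a tweet about a topic
long_doc  = np.array([2.0, 5.0, 1.0])  # a long article: same direction, 10x the length

# The ruler: the straight-line distance is large, because the magnitudes differ
euclidean = np.linalg.norm(short_doc - long_doc)
print(f"Euclidean distance: {euclidean:.2f}")  # large: "far apart"

# The flashlight: the angle is zero, so the similarity is perfect
cosine = short_doc @ long_doc / (np.linalg.norm(short_doc) * np.linalg.norm(long_doc))
print(f"Cosine similarity:  {cosine:.2f}")     # 1.00: "same meaning"
```

This is exactly why cosine is the default for text: length differences between documents are noise, not signal.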

Method 3: Dot Product — The Shortcut

Dot product multiplies matching elements of two vectors and sums the results. It combines direction and magnitude in one operation.

Example:

A = [1, 2, 3], B = [2, 1, 1]

A · B = (1×2) + (2×1) + (3×1) = 2 + 2 + 3 = 7

The catch: if vectors are very long (large magnitude), the dot product inflates even for loosely related pairs. This is why it works best with normalized vectors (scaled to length = 1, so magnitude no longer affects the score). When you normalize, dot product becomes mathematically equivalent to cosine similarity — but faster to compute at scale.
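That equivalence is easy to verify in numpy, using the same A and B as the example above:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 1.0])

print(a @ b)  # 7.0, the dot product from the example

# Normalize both vectors to unit length, then take the dot product again
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_normalized = a_unit @ b_unit

# Compare with cosine similarity computed directly
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(dot_normalized, cosine))  # True: identical scores
```

This is why production systems normalize vectors once at write time: every later query gets cosine-quality ranking at dot-product speed.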

| Method | Measures | Best for | Speed | Real app example |
|---|---|---|---|---|
| Cosine Similarity | Angle between arrows | Text, documents | Medium | Google Search, RAG retrieval |
| Euclidean Distance | Straight-line distance | Images, coordinates | Medium | Face recognition, map routing |
| Dot Product | Direction × magnitude | Normalized vectors, production scale | Fast ⚡ | OpenAI Embeddings API, Pinecone |

The Three Similarity Methods in Practice

Take this real query: "I want something lightweight for translation that is affordable"

Run it against 7 devices with our embedding model and the scores look like this:

Query: "I want something lightweight for translation that is affordable"

| Score | Device | Price |
|---|---|---|
| 0.49 | RingConverse Translate | ✅ $199 |
| 0.39 | Apple AirPods Pro 3 | ✅ $299 |
| 0.38 | Sony LinkBuds Open 2 | ✅ $199 |
| 0.30 | Ray-Ban Meta Ultra | ⚠️ $549 |
| 0.27 | Xreal Air 3 Ultra | ⚠️ $449 |
| 0.11 | Samsung Galaxy Ring v2 | ❌ $349 |
| -0.03 | Garmin Fenix 9 Solar | ❌ $799 |

The AI ranked the cheapest translation devices first — without a single filter rule written.

Notice: The Garmin sports watch scores negative — the AI correctly identified it as pointing in the opposite direction from "lightweight translation." We didn't write any rules. The math did it.


Where Do You Store 30 Million Vectors?

A regular SQL database can't efficiently search millions of vectors. You need a vector database (a database purpose-built for storing and searching high-dimensional vectors) — designed to find the closest matches using ANN (Approximate Nearest Neighbor) algorithms.

| Database | Type | Free? | Best for |
|---|---|---|---|
| ChromaDB | Local | ✅ Yes | Learning, prototypes — easiest to start |
| FAISS (Meta) | Local | ✅ Yes | Massive datasets — billions of vectors |
| Qdrant | Cloud / Local | ✅ Yes | High performance + advanced filtering |
| Weaviate | Cloud / Local | ✅ Yes | RAG pipelines, flexible schema |
| Pinecone | Cloud | Free tier | Production without server management |
| pgvector | PostgreSQL ext. | ✅ Yes | If you already use PostgreSQL |
Start with ChromaDB. It runs locally, requires zero infrastructure, and the API is almost identical to production-grade options. When you need to scale, migrating to Qdrant or Pinecone is straightforward.

The Complete Python Implementation

Here's the full working pipeline — from raw text to ranked results:


# pip install sentence-transformers scikit-learn numpy

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Step 1: Load the embedding model
# 50+ languages, 384-dimensional output, runs on CPU
encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Step 2: Define your catalog
devices = [
    {"name": "Ray-Ban Meta Ultra",     "desc": "Smart glasses — 48MP camera, 40-language translation", "price": 549},
    {"name": "Xreal Air 3 Ultra",      "desc": "AR glasses with 4K display and AR apps",               "price": 449},
    {"name": "Samsung Galaxy Ring v2", "desc": "Smart ring for health and sleep tracking",              "price": 349},
    {"name": "RingConverse Translate", "desc": "Translation ring — 30 languages, instant",             "price": 199},
    {"name": "Apple AirPods Pro 3",    "desc": "Smart earbuds with real-time translation",             "price": 299},
    {"name": "Sony LinkBuds Open 2",   "desc": "Open earbuds with lightweight translation",            "price": 199},
    {"name": "Garmin Fenix 9 Solar",   "desc": "Sports watch for adventure and running",               "price": 799},
]

# Step 3: Embed the catalog (OFFLINE — done once, stored)
device_texts = [f"{d['name']} - {d['desc']}" for d in devices]
device_embeddings = encoder.encode(device_texts)
print(f"Stored {len(devices)} devices × {device_embeddings[0].shape[0]} dimensions")
# → Stored 7 devices × 384 dimensions

# Step 4: Embed the user query (ONLINE — every request)
query = "I want something lightweight for translation that is affordable"
query_embedding = encoder.encode(query)

# Step 5: Score and rank
scores = cosine_similarity([query_embedding], device_embeddings)[0]
ranked = scores.argsort()[::-1]  # highest first

print(f'\nQuery: "{query}"\n')
print(f"{'#':<4} {'Score':>7}  {'Device':<28}  {'Price':>6}")
print("─" * 55)
for rank, i in enumerate(ranked, 1):
    flag = "✅" if scores[i] > 0.4 else "⚠️ " if scores[i] > 0.2 else "❌"
    print(f"  {rank}.  {scores[i]:>6.3f}  {flag}  {devices[i]['name']:<25}  ${devices[i]['price']}")

Output:

Stored 7 devices × 384 dimensions

Query: "I want something lightweight for translation that is affordable"

#     Score  Device                         Price
───────────────────────────────────────────────────────
  1.   0.486  ✅  RingConverse Translate        $199
  2.   0.394  ✅  Apple AirPods Pro 3           $299
  3.   0.382  ✅  Sony LinkBuds Open 2          $199
  4.   0.301  ⚠️  Ray-Ban Meta Ultra             $549
  5.   0.271  ⚠️  Xreal Air 3 Ultra             $449
  6.   0.108  ❌  Samsung Galaxy Ring v2        $349
  7.  -0.027  ❌  Garmin Fenix 9 Solar          $799

The AI found the two cheapest translation devices as top matches, without any price filter. It understood that "affordable" and "$199" are semantically linked.

Scaling to production: In real production with 30M+ items, swap the final cosine_similarity + argsort lines for a vector DB query — e.g., collection.query(query_embeddings=[query_embedding], n_results=10) in ChromaDB. The math is identical; the vector DB handles the speed.

How It Finds 30 Million Answers in Under 1 Second

You might be wondering: computing cosine similarity between your query and every stored vector sounds slow. For 7 devices it's instant. For 30 million vectors at 384 dimensions each, a brute-force scan takes whole seconds, far too slow for a real app.

This is solved by ANN (Approximate Nearest Neighbor) algorithms. Instead of checking every vector, they build smart index structures that let the database skip most of the search space.

| Approach | How it works | Speed | Accuracy |
|---|---|---|---|
| Brute Force | Compare the query against every single vector | 🐢 Slow at scale | 100% exact |
| HNSW (Hierarchical Navigable Small World) | Builds a graph of nearby vectors; hops between clusters | ⚡ Fast | ~95–99% |
| IVF (Inverted File Index) | Groups vectors into clusters; searches only relevant clusters | ⚡ Fast | ~90–98% |
| PQ (Product Quantization) | Compresses vectors to reduce memory; approximates distances | 🚀 Very fast | ~85–95% |
The trade-off: ANN algorithms return approximate nearest neighbors — not guaranteed exact matches. In practice, a 95% accurate result returned in 10ms beats a 100% accurate result returned in 3 seconds. All major vector databases (Pinecone, Qdrant, Weaviate, FAISS) use ANN under the hood.
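Here is a minimal IVF-style sketch using scikit-learn's KMeans on random stand-in vectors: cluster the catalog once offline, then at query time scan only the cluster whose centroid is nearest the query. Real libraries like FAISS probe several clusters and are far more optimized; this only illustrates the skipping:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
vectors = rng.normal(size=(5_000, 64)).astype(np.float32)  # stand-in catalog

# Index build (offline, once): partition the vectors into clusters
kmeans = KMeans(n_clusters=50, n_init=1, random_state=0).fit(vectors)
labels = kmeans.labels_

# Query (online): find the nearest centroid, then brute-force only that cluster
query = rng.normal(size=(64,)).astype(np.float32)
nearest = np.argmin(np.linalg.norm(kmeans.cluster_centers_ - query, axis=1))
candidates = np.where(labels == nearest)[0]

# Exact distances, but only within the chosen cluster (a small slice of the catalog)
dists = np.linalg.norm(vectors[candidates] - query, axis=1)
top5 = candidates[np.argsort(dists)[:5]]
print(f"Scanned {len(candidates)} of {len(vectors)} vectors")
```

The approximation comes from the shortcut itself: the true nearest neighbor might live just across a cluster boundary, which is why production indexes probe multiple clusters to recover accuracy.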

Real-World Case Studies

Netflix — ~$1B in Reduced Churn Per Year

Netflix doesn't recommend movies by matching genre tags. It converts your watch history into a vector — a mathematical fingerprint of your taste. Then it finds movie vectors that are closest to that fingerprint.

The result: "Because you watched X, you'll like Y" — even when X and Y share zero keywords or genres. Netflix estimates this recommendation engine saves them approximately $1 billion annually in reduced churn — users who find content they love don't cancel.

TikTok's For You Page

Each video is embedded from its transcript, audio patterns, and visual content. Each user has a preference vector that is updated in real time. To decide what appears next, the system runs cosine-similarity comparisons across enormous numbers of candidate vectors, continuously and at global scale.

Spotify's Discover Weekly

Spotify embeds both songs (audio features + lyrics) and users (listening patterns). Your Monday morning playlist is the result of finding the 30 song vectors closest to your personal taste vector.


Why This Is the Heart of RAG

If you've heard of RAG (Retrieval-Augmented Generation) — the technique that lets AI answer questions about your private documents — Similarity Search is its core engine.

📄 Your PDFs → 🧬 Embeddings → 🗄️ Vector DB → 📐 Similarity Search → 🤖 LLM → Answer

Similarity Search powers the retrieval step — finding the relevant paragraphs before the LLM generates an answer.

When you ask an AI "What's our refund policy?" about a company's documentation:

  1. Your question is converted to a vector
  2. Similarity Search finds the most relevant paragraphs in the knowledge base
  3. Those paragraphs are handed to the LLM as context
  4. The LLM answers using only that retrieved information

Without Similarity Search, there is no RAG.
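The retrieval step above can be sketched with toy pre-computed paragraph embeddings (hand-picked 4-dimensional vectors; a real pipeline would produce them with the same encoder used for the query). Handing the result to an LLM is left out:

```python
import numpy as np

# Assume the paragraphs were embedded offline (toy 4-d vectors for illustration)
paragraphs = [
    "Refunds are issued within 14 days of purchase.",
    "Shipping takes 3-5 business days.",
    "Our office is closed on public holidays.",
]
paragraph_vecs = np.array([
    [0.9, 0.1, 0.0, 0.2],
    [0.1, 0.8, 0.3, 0.0],
    [0.0, 0.2, 0.9, 0.1],
])

def retrieve(query_vec, k=2):
    """RAG's retrieval step: rank stored paragraphs by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    p = paragraph_vecs / np.linalg.norm(paragraph_vecs, axis=1, keepdims=True)
    scores = p @ q                      # normalized dot product == cosine
    top = np.argsort(scores)[::-1][:k]
    return [paragraphs[i] for i in top]

# "What's our refund policy?" as a toy vector, close to the refund paragraph
query_vec = np.array([0.85, 0.15, 0.05, 0.1])
context = retrieve(query_vec)
print(context[0])  # the refund paragraph, handed to the LLM as context
```

Everything downstream of `retrieve()` is prompt assembly; the quality of the final answer depends almost entirely on whether this step surfaced the right paragraphs.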


The Core Insight

Similarity Search doesn't find matching words.

It finds matching meaning.

The query and the result can share zero words — and still be a perfect match.

That's why searching "something for my eyes" returns smart glasses. Why TikTok shows you videos you didn't know you wanted. Why your company's AI chatbot finds the right policy clause even when you phrase the question in ten different ways.

The math doesn't care about letters. It only cares about where two things land in meaning-space.


PRO TIPS & COMMON MISTAKES

  • Always use the same embedding model for both catalog and queries. Mixing models (e.g., encoding the catalog with model A and queries with model B) produces meaningless scores.
  • Normalize your vectors before storing them in production. Most vector DBs do this automatically, but if you're using raw cosine_similarity, normalizing removes a silent source of bias.
  • Set a score threshold (e.g., only return results above 0.3). Without one, every query returns K results even when none are relevant, leading to garbage answers in RAG.
  • ⚠️ Don't mix domains. A model trained on English news will produce poor embeddings for legal Arabic text. Match your embedding model to your language and domain.
  • ⚠️ Cosine scores are not probabilities. A score of 0.49 doesn't mean "49% match"; it's the cosine of an angle between vectors. Always calibrate thresholds empirically on your own data.
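The threshold tip as code: a minimal sketch using scores like the ones from the device demo. The 0.3 cutoff here is illustrative; calibrate it on your own data:

```python
import numpy as np

SCORE_THRESHOLD = 0.3  # illustrative value; calibrate empirically

def top_k_with_threshold(scores, items, k=3, threshold=SCORE_THRESHOLD):
    """Return up to k results, keeping only those at or above the threshold."""
    order = np.argsort(scores)[::-1][:k]
    return [(items[i], float(scores[i])) for i in order if scores[i] >= threshold]

# Scores resembling the earlier device ranking
scores = np.array([0.49, 0.39, 0.11, -0.03])
items  = ["RingConverse", "AirPods Pro 3", "Galaxy Ring", "Garmin Fenix"]

print(top_k_with_threshold(scores, items))
# Only the relevant devices survive; the weak matches are filtered out
```

Without the threshold, a query with no good matches would still return K items, and a RAG pipeline would confidently answer from irrelevant context.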

Try It Yourself

Run the Python code above on your local machine — it downloads the model automatically on first run. Then try these experiments:

  1. Cross-language search: Change the query to Arabic — "أريد شيئاً للترجمة وسعره معقول". The multilingual model should return nearly the same ranking as the English query.
  2. Synonym test: Try "inexpensive" vs "affordable" vs "cheap" as the query. Watch how stable the top-3 ranking stays — the model understands they mean the same thing.
  3. Opposite test: Try "I want the most expensive flagship device with no budget limit" — watch the Garmin Fenix jump from last place to first.

All three experiments reveal the same truth: the model understands meaning, not just words.


Next in AI Fundamentals

The Artificial Neuron

The embedding model that powered everything in this article is built from billions of tiny decision-making units. We'll open the black box and build one from scratch in Python.

AI Fundamentals
MH

Mohamed Hamed

20 years building production systems — the last several deep in AI integration, LLMs, and full-stack architecture. I write what I've actually built and broken. If this was useful, the next one goes to LinkedIn first.

Follow on LinkedIn →