Similarity Search: Finding Meaning in the Haystack
How does Netflix know what you'll like? How does a chatbot find the right policy in a 50,000-page library? It's not magic. It's high-dimensional geometry, and the technique is called similarity search.
In the last article we saw how embeddings turn text into vectors: coordinates in meaning-space. This article answers the next question: once everything is a vector, how do you find the right one among millions? The answer has displaced pure keyword search in many products, and it powers the recommendation feeds of TikTok and Netflix as well as the retrieval step of RAG systems.
Keyword search asks: 'Does this page contain these letters?' Semantic search asks: 'Does this page mean the same thing as my query?' The difference is intelligence.
Keyword vs. Semantic Search: The Conceptual Leap
The Search Evolution
Keyword search
- Exact: Matches specific letters and strings.
- Context: Zero.
- Result: Searching 'eye device' won't find 'spectacles'.
Semantic search
- Math: Matches vector 'fingerprints' in high-dimensional space.
- Context: High.
- Result: Finds 'spectacles' for 'eye device' because they're semantic neighbors.
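To make the contrast concrete, here is a minimal sketch in plain Python. The three-number vectors are invented for illustration (a real embedding model produces hundreds of dimensions), and `keyword_match` and `cosine` are hypothetical helper names, not part of any library.

```python
import math

def keyword_match(query: str, document: str) -> bool:
    """Keyword search: true only if every query word appears in the document."""
    doc_words = set(document.lower().split())
    return all(word in doc_words for word in query.lower().split())

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hand-made toy vectors (assumed, not produced by a real model).
vectors = {
    "eye device": [0.9, 0.1, 0.2],
    "spectacles": [0.8, 0.2, 0.1],
    "sports watch": [0.1, 0.9, 0.3],
}

keyword_hit = keyword_match("eye device", "spectacles")
sim_spectacles = cosine(vectors["eye device"], vectors["spectacles"])
sim_watch = cosine(vectors["eye device"], vectors["sports watch"])

print(keyword_hit)                 # False: the strings share no words
print(sim_spectacles > sim_watch)  # True: "spectacles" is the closer neighbor
```

The keyword check fails because the strings share no words, while the vector comparison still places "spectacles" next to "eye device".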
The 3-Step Production Workflow
The Similarity Pipeline
1. Convert your documents into vectors with a model like text-embedding-3-small, then store them in a vector database. (Done once, offline.)
2. When a user asks a question, convert their query into a vector using the exact same model.
3. The database finds the "nearest neighbors": the documents whose vectors are mathematically closest to the query vector.
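The three steps above can be sketched without any external service. This is a toy version under loud assumptions: the vectors are made up by hand instead of coming from an embedding model, a plain dictionary stands in for the vector database, and `nearest_neighbors` is a hypothetical helper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Step 1 (offline): documents already converted to vectors and stored.
# These 3-dimensional vectors are made up; a real model would produce them.
index = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "privacy notice": [0.0, 0.2, 0.9],
}

def nearest_neighbors(query_vector, index, k=2):
    """Step 3: rank stored documents by cosine similarity to the query."""
    scored = [(cosine(query_vector, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Step 2 (online): the query vector would come from the same embedding model.
query_vector = [0.8, 0.3, 0.1]  # stand-in for "can I get my money back?"
top = nearest_neighbors(query_vector, index, k=2)
print(top)  # ['refund policy', 'shipping times']
```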
The Core Math: How We Measure 'Meaning'
To a computer, meaning is just a vector. We use distance metrics to see how "close" two vectors are.
1. Cosine Similarity (the gold standard for text)
Measures the angle between two vectors — ignoring document length, focusing only on the direction of meaning.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
a = np.array([[0.1, 0.9, 0.3]]) # "I love cats"
b = np.array([[0.15, 0.85, 0.25]]) # "Felines are great"
print(f"Similarity: {cosine_similarity(a, b)[0][0]:.4f}") # ~0.99
2. Euclidean Distance (L2)
The straight-line distance between two points. Best for image search or when magnitude (how much of a concept is present) matters.
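A small sketch of why magnitude matters here, using only the standard library (`math.dist` needs Python 3.8+). The two vectors are invented: they point in exactly the same direction but differ tenfold in length.

```python
import math

small = [1.0, 2.0]
large = [10.0, 20.0]  # same direction as `small`, ten times the magnitude

euclidean = math.dist(small, large)
dot = sum(x * y for x, y in zip(small, large))
cosine = dot / (math.hypot(*small) * math.hypot(*large))

print(f"Euclidean distance: {euclidean:.3f}")  # 20.125
print(f"Cosine similarity:  {cosine:.3f}")     # 1.000
```

Cosine calls these two vectors identical, while Euclidean distance reports them as far apart. That difference is the reason to pick L2 when "how much" matters and cosine when only "which direction" matters.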
3. Dot Product
Fastest to calculate. If vectors are normalized (length 1), the dot product is mathematically identical to cosine similarity — which is why production systems normalize and use it.
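The equivalence is easy to check by hand. The sketch below reuses the two vectors from the cosine example and compares the cosine formula with a plain dot product on length-1 copies; `normalize` and `dot` are hypothetical helper names.

```python
import math

def normalize(v):
    """Scale a vector to length 1."""
    length = math.hypot(*v)
    return [x / length for x in v]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

a = [0.1, 0.9, 0.3]     # "I love cats" (toy vector from the cosine example)
b = [0.15, 0.85, 0.25]  # "Felines are great"

cosine = dot(a, b) / (math.hypot(*a) * math.hypot(*b))
dot_normalized = dot(normalize(a), normalize(b))

print(math.isclose(cosine, dot_normalized))  # True
```

Normalizing once at indexing time means every later comparison is a single dot product, with no square roots or divisions per query.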
Similarity Ranking in Practice
Run the query "I want something lightweight for translation that is affordable" against 7 real devices, and the cosine scores rank them automatically.
The AI ranked the cheapest translation devices first — without a single price filter. The Garmin sports watch even scores negative — the math correctly identified it as pointing in the opposite direction from "lightweight translation." We didn't write any rules; the geometry did it.
Where Do You Store 30 Million Vectors?
A regular SQL database can't efficiently search millions of vectors on its own: without a vector index, every query is a full scan. You need a vector database (or an extension such as pgvector) built for high-dimensional nearest-neighbor search:
| Database | Type | Free? | Best For |
|---|---|---|---|
| ChromaDB | Local | ✅ | Learning, prototypes — easiest to start |
| FAISS (Meta) | Local | ✅ | Massive datasets — billions of vectors |
| Qdrant | Cloud / Local | ✅ | High performance + advanced filtering |
| Weaviate | Cloud / Local | ✅ | RAG pipelines, flexible schema |
| Pinecone | Cloud | Free tier | Production without server management |
| pgvector | PostgreSQL ext. | ✅ | If you already use PostgreSQL |
Start with ChromaDB: it runs locally and needs zero infrastructure, and its core concepts (collections, adding embeddings, querying by vector) carry over to production options. Migrating to Qdrant or Pinecone later is usually straightforward.
The Complete Python Implementation
# pip install sentence-transformers scikit-learn numpy
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2') # 50+ langs, 384-dim
devices = [
{"name": "Ray-Ban Meta Ultra", "desc": "Smart glasses — 48MP camera, 40-language translation", "price": 549},
{"name": "RingConverse Translate", "desc": "Translation ring — 30 languages, instant", "price": 199},
{"name": "Apple AirPods Pro 3", "desc": "Smart earbuds with real-time translation", "price": 299},
{"name": "Sony LinkBuds Open 2", "desc": "Open earbuds with lightweight translation", "price": 199},
{"name": "Garmin Fenix 9 Solar", "desc": "Sports watch for adventure and running", "price": 799},
]
# Embed the catalog once (offline), then embed the query (online, every request)
catalog = encoder.encode([f"{d['name']} - {d['desc']}" for d in devices])
query = encoder.encode("I want something lightweight for translation that is affordable")
scores = cosine_similarity([query], catalog)[0]
for i in scores.argsort()[::-1]:
    print(f"{scores[i]:>6.3f} {devices[i]['name']:<25} ${devices[i]['price']}")
# 0.486 RingConverse Translate $199 ← cheapest translation, found without a price filter
Scaling to production: with 30M+ items, swap the cosine_similarity + argsort lines for a vector-DB query like collection.query(query_embeddings=[query], n_results=10). The scoring math is the same; the DB's index handles the speed.
How It Finds 30 Million Answers in Under a Second
Cosine similarity against every vector (brute force) is O(N): fine for 7 devices, far too slow for 30 million. Production uses ANN (Approximate Nearest Neighbor) algorithms that skip most of the search space. One of the most widely used, HNSW (Hierarchical Navigable Small World), works like this:
- The concept: a skip-list for geometry.
- Level 0: all vectors · Level 1: ~10% · Level 2: ~1%.
- Process: start at the top, jump to the nearest neighbor, drop down a level, repeat.
- Result: turns O(N) into roughly O(log N).
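The "jump to the nearest neighbor, repeat" move can be sketched on a single layer. This is not an HNSW implementation: there are no layers and no index-building logic, and the graph is hand-wired. The long-range links (plus or minus 10) play the role that upper levels play in the real structure, and `greedy_search` is a hypothetical name.

```python
import math

def greedy_search(points, neighbors, query, entry):
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    current_dist = math.dist(points[current], query)
    hops = 0
    while True:
        best = min(neighbors[current], key=lambda n: math.dist(points[n], query))
        best_dist = math.dist(points[best], query)
        if best_dist >= current_dist:
            return current, hops  # no neighbor is closer, so stop here
        current, current_dist = best, best_dist
        hops += 1

# 100 points on a line; each links to its adjacent points plus one
# long-range link in each direction (10 positions away).
points = [(float(i), 0.0) for i in range(100)]
neighbors = {
    i: [j for j in (i - 10, i - 1, i + 1, i + 10) if 0 <= j < 100]
    for i in range(100)
}

found, hops = greedy_search(points, neighbors, query=(73.2, 0.0), entry=0)
print(found, hops)  # 73 10
```

The walk reaches the true nearest point in 10 hops instead of checking all 100 candidates. On real data a greedy walk can stop at a point that is close but not the closest, which is where the "approximate" in ANN comes from.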
| Approach | How It Works | Speed | Accuracy |
|---|---|---|---|
| Brute Force | Compare against every vector | 🐢 Slow at scale | 100% exact |
| HNSW | Layered graph of nearby vectors; hop from neighbor to neighbor toward the query | ⚡ Fast | ~95–99% |
| IVF | Group into clusters; search only relevant ones | ⚡ Fast | ~90–98% |
| PQ | Compress vectors; approximate distance | 🚀 Very fast | ~85–95% |
The trade-off: ANN returns approximate neighbors. In practice, a 95% result in 10ms beats a 100% result in 3 seconds. Every major vector DB (Pinecone, Qdrant, Weaviate, FAISS) uses ANN under the hood.
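The IVF row in the table is the easiest of the three to sketch. The version below is a toy under stated assumptions: the data and the two centroids are hand-picked (in practice centroids usually come from k-means), only the single closest bucket is probed, and `ivf_search` is a hypothetical name.

```python
import math

def nearest(vectors, query):
    """Index of the vector closest to the query (brute force over `vectors`)."""
    return min(range(len(vectors)), key=lambda i: math.dist(vectors[i], query))

# Toy 2-D data in two obvious groups (made up for illustration).
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids = [(1.0, 1.0), (8.0, 8.0)]  # assumed; normally learned with k-means

# Build the inverted lists: each centroid keeps the ids of its vectors.
buckets = {c: [] for c in range(len(centroids))}
for i, vec in enumerate(data):
    buckets[nearest(centroids, vec)].append(i)

def ivf_search(query):
    """Search only the bucket whose centroid is closest to the query."""
    bucket = buckets[nearest(centroids, query)]
    best = min(bucket, key=lambda i: math.dist(data[i], query))
    return best, len(bucket)

best_id, compared = ivf_search((8.1, 8.0))
print(best_id, compared)  # 3 3
```

Only 3 of the 6 vectors are compared against the query. The accuracy cost shows up when the true nearest neighbor sits just across a bucket boundary, which is why real systems usually probe several buckets per query.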
Real-World Case Studies
- Netflix — ~$1B/year in reduced churn. It converts your watch history into a taste vector and finds movie vectors nearest to it — "because you watched X, you'll like Y," even when X and Y share zero keywords.
- TikTok's For You page. Each video is embedded from transcript, audio, and visuals; your preference vector updates in real time. The algorithm runs cosine similarity billions of times per second.
- Spotify's Discover Weekly. Songs (audio + lyrics) and users (listening patterns) are embedded; your playlist is the 30 song vectors closest to your taste vector.
Why This Is the Heart of RAG
If you've heard of RAG (Retrieval-Augmented Generation) — the technique that lets AI answer questions about your private documents — Similarity Search is its core engine. Ask "What's our refund policy?" and: (1) your question becomes a vector, (2) similarity search finds the most relevant paragraphs, (3) those are handed to the LLM as context, (4) the LLM answers using only that retrieved information. Without similarity search, there is no RAG.
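The four steps can be sketched end to end, stopping just before the LLM call. Everything here is illustrative: the paragraphs and their vectors are made up, a list stands in for the vector database, and `retrieve` and `build_prompt` are hypothetical helpers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# (paragraph, made-up embedding) pairs standing in for a vector database.
knowledge_base = [
    ("Refunds are available within 30 days of purchase.", [0.9, 0.1, 0.1]),
    ("Standard shipping takes 3 to 5 business days.", [0.1, 0.9, 0.2]),
    ("We never sell customer data to third parties.", [0.1, 0.2, 0.9]),
]

def retrieve(query_vector, k=1):
    """Step 2: return the k paragraphs most similar to the query."""
    ranked = sorted(
        knowledge_base,
        key=lambda item: cosine(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

def build_prompt(question, context):
    """Step 3: hand the retrieved paragraphs to the LLM as context."""
    joined = "\n".join(f"- {paragraph}" for paragraph in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"

question = "What's our refund policy?"
question_vector = [0.85, 0.15, 0.05]  # Step 1: would come from the embedding model
context = retrieve(question_vector, k=1)
prompt = build_prompt(question, context)
print(prompt)  # Step 4 would send this prompt to an LLM (not shown)
```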
The Core Insight
That's why searching "something for my eyes" returns smart glasses, why TikTok shows you videos you didn't know you wanted, and why a company chatbot finds the right clause however you phrase the question. The math only cares about where two things land in meaning-space.
Try It Yourself
Run the code above (it downloads the model on first run), then experiment:
- Cross-language search: change the query to Arabic, "أريد شيئاً للترجمة وسعره معقول" ("I want something for translation at a reasonable price"). The multilingual model returns nearly the same ranking.
- Synonym test: try "inexpensive" vs "affordable" vs "cheap". The top-3 stays stable because the model knows they mean the same thing.
- Opposite test: try "the most expensive flagship device with no budget limit" and watch the Garmin Fenix jump from last to first.
Key Takeaways
Similarity search doesn't find matching words. It finds matching meaning. The query and result can share zero words and still be the top match.
Always use the SAME embedding model for the catalog and the query. If you embed your data with OpenAI and your query with Google, the two sets of vectors live in different spaces (often with different dimensions), so the similarity scores are meaningless.
In production, 95% accuracy in 10ms (via HNSW) is significantly better than 100% accuracy in 3 seconds. Scalability is a feature.
Up Next in the Series
You've seen what embeddings do and how to search them. But what actually computed those vectors? Next we isolate a single artificial neuron: the tiny decision machine that, scaled a trillion times, becomes ChatGPT.