You don't need to retrain anything. You don't need to spend $100M. You need RAG — and you can build a production-quality version in about 100 lines of Python.
What RAG Actually Is
RAG stands for Retrieval Augmented Generation. The name describes exactly what it does: retrieve relevant documents from your data, augment the prompt with them, and let the model generate an answer grounded in that context.
The simplest analogy: it's the difference between a closed-book exam (model answers from memory, may be wrong or outdated) and an open-book exam (model reads from your documents, answers are grounded in real data).
Without RAG (closed book):
→ Recommends products that may no longer exist
→ Prices are outdated or fabricated
→ Your custom inventory: unknown

With RAG (open book):
→ Finds matching products with current specs
→ Generates answer from real inventory data
→ Cites the source it's using
The Two Phases of RAG
Every RAG system runs in two distinct phases: Phase A (Prepare) builds the knowledge base once by chunking, embedding, and storing your data; Phase B (Query) runs for every user request by encoding the question, retrieving similar chunks, and generating an answer. Confusing them is the most common architectural mistake.
Phase A Deep Dive: Building the Knowledge Base
Step 1: Chunking — The Most Important Decision
Chunking determines the granularity of your retrieval. Too large = irrelevant context included. Too small = you lose surrounding context needed for understanding.
| Strategy | How It Works | Best For | Typical Size |
|---|---|---|---|
| Fixed Size | Split every N tokens | Long-form text, books, articles | 300-500 tokens |
| Paragraph-based | Split on paragraph breaks | Documentation, articles, blogs | 100-300 tokens |
| Semantic Chunking | Split when topic changes (embedding distance) | Mixed-topic documents | Variable |
| Row-based (CSV/DB) | One row = one chunk | Product catalogs, FAQs, tables ✅ | 1 record |
The Overlap Pattern — For fixed-size and paragraph chunking, always include 50-100 tokens of overlap between adjacent chunks. This prevents critical information from being split across chunks and lost during retrieval.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
"""Split text into overlapping chunks by word count."""
words = text.split()
chunks = []
for i in range(0, len(words), chunk_size - overlap):
chunk = ' '.join(words[i:i + chunk_size])
if chunk:
chunks.append(chunk)
return chunks
# Example output:
# Chunk 0: "words 0-499"
# Chunk 1: "words 450-949" ← 50-word overlap with chunk 0
# Chunk 2: "words 900-1399" ← 50-word overlap with chunk 1
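The paragraph-based strategy from the table splits on natural boundaries instead of fixed counts. Here is a minimal sketch, assuming plain text where paragraphs are separated by blank lines (the max_words cap and the merging behavior are illustrative choices, not a standard):

def chunk_paragraphs(text: str, max_words: int = 300) -> list[str]:
    """Split on blank lines, merging short paragraphs up to max_words."""
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    chunks: list[str] = []
    current = ""
    for p in paragraphs:
        candidate = f"{current}\n\n{p}".strip()
        if not current or len(candidate.split()) <= max_words:
            current = candidate  # still under the cap: keep merging
        else:
            chunks.append(current)  # cap reached: close this chunk
            current = p
    if current:
        chunks.append(current)
    return chunks

Because chunks follow the author's own paragraph breaks, each one tends to contain a single coherent idea, which is exactly what you want retrieval to match against.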
Step 2: Embedding Models
An embedding model converts text into a vector — a list of numbers that encodes the semantic meaning of the text. Similar meanings produce similar vectors. This is what enables the AI to find "lightweight glasses for translation" when the database contains "Ray-Ban Meta Ultra: 48g, 40-language real-time translation."
| Model | Dimensions | Languages | Cost | Best For |
|---|---|---|---|---|
| multilingual-MiniLM-L12-v2 | 384 | 50+ languages | Free | Arabic/multilingual apps ✅ |
| all-MiniLM-L6-v2 | 384 | English only | Free | English-only prototyping |
| text-embedding-3-small | 1536 | Multilingual | $0.02/1M tokens | Production English apps |
| text-embedding-3-large | 3072 | Multilingual | $0.13/1M tokens | Highest accuracy needed |
Why multilingual-MiniLM for Arabic-language apps? If your users query in Arabic but your data is in English or mixed, an English-only model will fail — the query vector and document vectors will be in completely different semantic spaces. The multilingual model maps both to the same space.
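You can verify the shared-space claim in a few lines. A minimal standalone sketch using the same model as the implementation below (the Arabic query is simply a translation of the English one, added here for illustration):

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

doc = "Ray-Ban Meta Ultra: 48g, 40-language real-time translation"
query_en = "lightweight glasses for translation"
query_ar = "نظارات خفيفة للترجمة"  # Arabic for "lightweight glasses for translation"

vectors = encoder.encode([doc, query_en, query_ar])
print(util.cos_sim(vectors[1], vectors[0]))  # English query vs. document
print(util.cos_sim(vectors[2], vectors[0]))  # Arabic query vs. the same document

Both similarity scores should land close together, because the model maps both languages into one semantic space; with an English-only model, the Arabic query's score would collapse toward noise.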
Step 3: Vector Databases
A vector database stores your embeddings alongside the original data (called the "payload") and provides extremely fast similarity search — finding the top-K most similar vectors in milliseconds, even across millions of records.
Complete Implementation: The Gadget Guru
Let's build a complete RAG system — a product advisor that knows every AI wearable device released in 2026. This demonstrates every concept in a working, runnable system.
pip install pandas qdrant-client sentence-transformers openai python-dotenv
"""
The Gadget Guru — AI Wearable Advisor (Complete RAG Implementation)
Demonstrates: Chunking → Embedding → Vector Store → Retrieval → Generation
"""
import os
import pandas as pd
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
# ═══════════════════════════════════════════════════════════════
# INITIALIZATION
# ═══════════════════════════════════════════════════════════════
encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
qdrant = QdrantClient(":memory:") # Use QdrantClient("localhost", port=6333) for persistence
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
COLLECTION = "gadgets_2026"
EMBEDDING_DIM = encoder.get_sentence_embedding_dimension() # 384
# ═══════════════════════════════════════════════════════════════
# PHASE A: PREPARE (Run once — builds the knowledge base)
# ═══════════════════════════════════════════════════════════════
def load_and_index_data(data: list[dict]) -> None:
"""
Phase A complete:
1. Build text profile for each item (the "chunk")
2. Embed each profile into a vector
3. Store vectors + original data in Qdrant
"""
# Step 1: Chunking — for structured data, one row = one chunk
# We create a rich text profile that captures all searchable attributes
profiles = [
f"{g['name']} ({g['category']}). "
f"Specs: {g['specs']}. "
f"Pros: {g['pros']}. "
f"Cons: {g['cons']}. "
f"Price: ${g['price']}"
for g in data
]
# Step 2: Embedding — convert each profile to a vector
embeddings = encoder.encode(profiles)
print(f"Embedding shape: {embeddings.shape}")
    # Output: (4, 384) = 4 products, 384-dimensional vectors
# Step 3: Store in Vector Database
qdrant.create_collection(
collection_name=COLLECTION,
vectors_config=models.VectorParams(
size=EMBEDDING_DIM,
distance=models.Distance.COSINE # Best for text similarity
)
)
points = [
models.PointStruct(
id=idx,
vector=embeddings[idx].tolist(),
payload=data[idx] # Store original data as payload
)
for idx in range(len(data))
]
qdrant.upload_points(collection_name=COLLECTION, points=points)
print(f"Indexed {len(points)} products across {len(set(g['category'] for g in data))} categories")
# ═══════════════════════════════════════════════════════════════
# PHASE B: QUERY (Runs for every user request)
# ═══════════════════════════════════════════════════════════════
def retrieve(query: str, top_k: int = 3) -> list[dict]:
"""
Phase B, Steps 1-2: Encode question → Similarity Search → Return top matches
"""
# Step 1: Encode the user's question using the SAME embedding model
query_vector = encoder.encode(query).tolist()
# Step 2: Find the most similar product profiles
results = qdrant.query_points(
collection_name=COLLECTION,
query=query_vector,
limit=top_k
)
# Return products with their similarity scores
return [
{**hit.payload, "similarity_score": round(hit.score, 3)}
for hit in results.points
]
def generate(query: str, context: list[dict]) -> str:
"""
Phase B, Step 3: Send retrieved context + question to LLM for generation
"""
# Format the retrieved products for the LLM
context_text = "\n".join([
f"- {g['name']} ({g['category']}): "
f"{g['specs']}. "
f"Pros: {g['pros']}. "
f"Cons: {g['cons']}. "
f"Price: ${g['price']}"
for g in context
])
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": f"""You are a technical advisor specializing in 2026 AI wearable devices.
Your rules:
- Recommend ONLY based on the product data provided below
- If no product matches the requirements, say so clearly
- Always mention price and one key limitation for every recommendation
- If multiple options fit, rank them from best to least suitable
- Be direct and specific — no fluff
Available products from the database:
{context_text}"""
},
{
"role": "user",
"content": query
}
],
temperature=0.1, # Low temperature = consistent, factual responses
max_tokens=400
)
return response.choices[0].message.content
def ask(query: str) -> str:
"""
The complete RAG pipeline: Retrieve → Augment → Generate
"""
print(f"\nQuery: {query}")
print("-" * 50)
# Retrieval
context = retrieve(query, top_k=3)
print(f"Retrieved {len(context)} products:")
for g in context:
print(f" [{g['similarity_score']}] {g['name']} — ${g['price']}")
# Generation (with retrieved context)
answer = generate(query, context)
print(f"\nAdvisor Response:\n{answer}")
return answer
# Sample dataset (in production, load from CSV/database)
gadgets_2026 = [
{
"name": "Ray-Ban Meta Ultra",
"category": "Smart Glasses",
"specs": "Weight: 48g, Camera: 48MP, Translation: 40 languages real-time, Battery: 4h active",
"pros": "Lightest smart glasses, best translation, stylish everyday design",
"cons": "No AR display, $549 is premium pricing",
"price": 549
},
{
"name": "Xreal Air 3 Ultra",
"category": "AR Glasses",
"specs": "Weight: 67g, Display: 4K equivalent AR, Battery: 3h active, Audio: spatial",
"pros": "Best AR display quality, ideal for productivity and gaming",
"cons": "Heavier than Ray-Ban, no translation feature",
"price": 449
},
{
"name": "Samsung Galaxy Ring v2",
"category": "Smart Ring",
"specs": "Weight: 2.8g, Battery: 10 days, Sensors: heart rate, SpO2, sleep, temperature",
"pros": "Invisible health tracking, longest battery life, waterproof",
"cons": "No display, limited notification support",
"price": 349
},
{
"name": "Oppo Air Glass 3",
"category": "AR Glasses",
"specs": "Weight: 38g, Translation: offline in 12 languages, Battery: 6h, Display: monochrome",
"pros": "Lightest AR glasses, offline translation, longest AR battery",
"cons": "Monochrome display only, no color AR",
"price": 399
},
]
# Run the complete pipeline
load_and_index_data(gadgets_2026)
# Test queries
ask("I want the lightest glasses for real-time translation under $500")
ask("Best option for AR gaming and productivity")
ask("Cheapest health tracking device")
Expected output:
Indexed 4 products across 3 categories
Query: I want the lightest glasses for real-time translation under $500
--------------------------------------------------
Retrieved 3 products:
[0.942] Ray-Ban Meta Ultra — $549
[0.871] Oppo Air Glass 3 — $399
[0.734] Xreal Air 3 Ultra — $449
Advisor Response:
For real-time translation under $500, the Oppo Air Glass 3 ($399) is your best bet:
- Lightest AR option at 38g
- Offline translation in 12 languages (no internet needed)
- 6-hour battery — longest in this category
Limitation: Monochrome display only, which may bother you if you want full-color AR.
If you primarily need style over AR and can stretch to $549, Ray-Ban Meta Ultra offers
translation in 40 languages — but it exceeds your $500 budget.
Why the Context Goes in System (Not User)
This is a critical RAG best practice that dramatically affects output quality. Most chat models treat the system message as authoritative instructions, so grounding rules and retrieved data placed there are followed more reliably, and the user turn stays a clean question across multi-turn conversations.

❌ Weaker: context crammed into the user message

messages = [
    {"role": "user",
     "content": f"Data: {context}\n\nQ: {query}"}
]

✅ Better: context and grounding rules in the system message

messages = [
    {"role": "system",
     "content": f"Use only this data:\n{context}"},
    {"role": "user",
     "content": query}
]
Advanced RAG Techniques
Once your basic RAG pipeline works, these techniques improve accuracy for production systems:
Hybrid Search: Combining Vector + Keyword
Pure semantic search misses exact matches. "Model XR-7" might not be found semantically if the query says "XR7 glasses." Hybrid search combines vector search (meaning) with keyword search (exact terms):
# Qdrant supports filtering (keyword-style) alongside vector search
results = qdrant.query_points(
collection_name=COLLECTION,
query=query_vector,
query_filter=models.Filter(
must=[
models.FieldCondition(
key="category",
match=models.MatchValue(value="AR Glasses")
),
models.FieldCondition(
key="price",
range=models.Range(lte=500) # Under $500
)
]
),
limit=5
)
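One caveat: payload filtering handles exact constraints on structured fields, not free-text keyword matching. For full hybrid retrieval you fuse two rankings, dense vectors plus a keyword scorer such as BM25, with rank fusion. Below is a minimal sketch using Reciprocal Rank Fusion (RRF), reusing the retrieve() function from the implementation above; the keyword side is a naive term-count stand-in for a real BM25 index:

def keyword_search(query: str, data: list[dict], top_k: int = 5) -> list[dict]:
    """Rank products by how many query terms appear in their name/specs."""
    terms = query.lower().split()
    scored = [
        (sum(t in f"{g['name']} {g['specs']}".lower() for t in terms), g)
        for g in data
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [g for score, g in scored[:top_k] if score > 0]

def hybrid_search(query: str, data: list[dict], top_k: int = 3) -> list[dict]:
    """Fuse vector and keyword rankings: RRF score = sum of 1/(60 + rank)."""
    fused: dict[str, float] = {}
    by_name: dict[str, dict] = {}
    for ranking in (retrieve(query, top_k=10), keyword_search(query, data)):
        for rank, g in enumerate(ranking):
            fused[g["name"]] = fused.get(g["name"], 0.0) + 1.0 / (60 + rank)
            by_name[g["name"]] = g
    best = sorted(fused, key=fused.get, reverse=True)[:top_k]
    return [by_name[name] for name in best]

# Usage: hybrid_search("Oppo translation glasses", gadgets_2026)
# The exact brand term and the semantic intent each contribute to the ranking.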
Re-Ranking: Two-Stage Retrieval
Retrieve more candidates (top-20), then re-rank them with a more accurate cross-encoder model:
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# Stage 1: Fast vector search (top 20 candidates)
candidates = retrieve(query, top_k=20)
# Stage 2: Accurate re-ranking (pick top 3)
pairs = [(query, f"{g['name']}: {g['specs']} {g['pros']}") for g in candidates]
scores = reranker.predict(pairs)
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
top_3 = [item for item, score in ranked[:3]]
Query Expansion: Cover More Angles
Generate alternative phrasings of the user's question before searching:
def expand_query(query: str) -> list[str]:
"""Generate alternative versions of the query for broader retrieval."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": f"Generate 3 alternative phrasings of this search query, one per line:\n{query}"
}],
max_tokens=100
)
alternatives = response.choices[0].message.content.strip().split('\n')
return [query] + alternatives # Original + alternatives
# Search with all phrasings, deduplicate results
all_queries = expand_query("light glasses for translation")
# → ["light glasses for translation",
# "lightweight smart glasses with translation feature",
# "AR glasses that support language translation",
# "translation glasses under 50 grams"]
Evaluating RAG Quality
RAG is only as good as its retrieval. Here are the three metrics that matter: faithfulness (is the answer actually supported by the retrieved context?), context relevancy (are the retrieved passages relevant to the question?), and answer relevancy (does the answer address the question that was asked?). The RAGAS library measures all three:
# RAGAS — the standard library for RAG evaluation
pip install ragas
from ragas import evaluate
from ragas.metrics import faithfulness, context_relevancy, answer_relevancy
from datasets import Dataset
# Build evaluation dataset
eval_data = {
"question": ["Best lightweight translation glasses?"],
"answer": ["Oppo Air Glass 3 at 38g with 12-language offline translation..."],
"contexts": [["Ray-Ban Meta Ultra: 48g...", "Oppo Air Glass 3: 38g..."]],
"ground_truth": ["The Oppo Air Glass 3 is the lightest with offline translation."]
}
dataset = Dataset.from_dict(eval_data)
results = evaluate(dataset, metrics=[faithfulness, context_relevancy, answer_relevancy])
print(results)
# Output: {'faithfulness': 0.95, 'context_relevancy': 0.88, 'answer_relevancy': 0.92}
Real-World Case Studies
Internal HR assistant
RAG Solution: Chunk each policy section → embed → store. Employees ask in natural language and get precise answers with the exact policy page cited.
Result: HR inquiry time reduced from 2 days to 3 minutes.
E-commerce product discovery
RAG Solution: Embed product descriptions → semantic search → LLM ranks and explains matches.
Result: 34% increase in "search to add-to-cart" conversion rate.
Customer support automation
RAG Solution: Embed all support docs and past tickets → AI resolves routine inquiries automatically.
Result: 78% deflection rate; the support team focuses on complex cases only.
Legal research
RAG Solution: Embed case law database → semantic search finds relevant precedents → LLM summarizes with citations.
Result: Research time cut by 70%. All citations are real (solving the hallucination problem).
RAG vs Fine-Tuning: When to Use Which
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Purpose | Inject knowledge | Change behavior/style |
| Data updates | Real-time ✅ | Requires re-training ❌ |
| Cost | Low — storage + API calls | High — GPU training cost |
| Hallucination risk | Low with citations ✅ | Medium (still hallucinates) |
| Best use case | Product Q&A, document search, support bots | Brand voice, domain jargon, specialized format |
The Power Combo: Use RAG for factual knowledge + Fine-tuning for tone and style. This is how the best enterprise AI systems are built in 2026.