Part 14 — RAG: Give Your AI a Memory | Mohamed Hamed

THE REAL PROBLEM

You asked an AI assistant for the best AR glasses in 2026. It confidently recommended a product discontinued in 2024.

This isn't a hallucination problem. It's a knowledge problem. The model's training data has a cutoff. It doesn't know what happened after that date. And it has absolutely no access to your company's private data, current inventory, or internal documentation.

You don't need to retrain anything. You don't need to spend $100M. You need RAG — and you can build a production-quality version in about 100 lines of Python.

What RAG Actually Is

RAG stands for Retrieval-Augmented Generation. The name describes exactly what it does:

Retrieval

Find the relevant information from your knowledge base before generating an answer

Augmented

Add that retrieved information to the context window before sending to the LLM

Generation

The LLM generates its answer based on the retrieved facts, not its training memory

The simplest analogy: it's the difference between a closed-book exam (model answers from memory, may be wrong or outdated) and an open-book exam (model reads from your documents, answers are grounded in real data).

Without RAG — Closed Book

"What's the best AR glasses under $400?"

→ AI recalls training data from 2024 cutoff
→ Recommends products that may no longer exist
→ Prices are outdated or fabricated
→ Your custom inventory: unknown

With RAG — Open Book

"What's the best AR glasses under $400?"

→ Searches your 2026 product catalog
→ Finds matching products with current specs
→ Generates answer from real inventory data
→ Cites the source it's using

The Two Phases of RAG

Every RAG system runs in two distinct phases. Confusing them is the most common architectural mistake.

PHASE A: Prepare (Run Once)

Step 1: CHUNKING ✂️

Break your documents into appropriately-sized pieces

Step 2: EMBEDDING 📐

Convert each chunk to a vector (list of numbers)

Step 3: STORE 🗄️

Save vectors + original data in a Vector Database

PHASE B: Query (Every Request)

Step 1: ENCODE QUESTION 🔍

Convert user's question to a vector using the same model

Step 2: SIMILARITY SEARCH 🔎

Find the most similar chunks in the database

Step 3: GENERATE 🤖

LLM answers using retrieved chunks as context

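Phase A runs when you set up the system and again whenever the source data changes. Phase B runs for every user query. The retrieval steps typically finish in well under 500ms; the LLM call in Step 3 usually accounts for most of the response time.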

Phase A Deep Dive: Building the Knowledge Base

Step 1: Chunking — The Most Important Decision

Chunking determines the granularity of your retrieval. Too large = irrelevant context included. Too small = you lose surrounding context needed for understanding.

Strategy	How It Works	Best For	Typical Size
Fixed Size	Split every N tokens	Long-form text, books, articles	300-500 tokens
Paragraph-based	Split on paragraph breaks	Documentation, articles, blogs	100-300 tokens
Semantic Chunking	Split when topic changes (embedding distance)	Mixed-topic documents	Variable
Row-based (CSV/DB)	One row = one chunk	Product catalogs, FAQs, tables ✅	1 record

The Overlap Pattern — For fixed-size and paragraph chunking, always include 50-100 tokens of overlap between adjacent chunks. This prevents critical information from being split across chunks and lost during retrieval.

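def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by word count."""
    # A step of zero or less would raise an error or never advance
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # how far each window advances
    for i in range(0, len(words), step):
        chunks.append(' '.join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # this window reached the end; skip a redundant tail chunk
    return chunks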

# Example output:
# Chunk 0: "words 0-499"
# Chunk 1: "words 450-949"  ← 50-word overlap with chunk 0
# Chunk 2: "words 900-1399" ← 50-word overlap with chunk 1
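
The paragraph-based strategy from the table can be sketched in a similar way. This is a minimal illustration, not a production splitter: it assumes paragraphs are separated by blank lines, counts words rather than tokens, and the function name `chunk_by_paragraph` and the `max_words` parameter are made up for this example.

```python
import re

def chunk_by_paragraph(text: str, max_words: int = 300) -> list[str]:
    """Group consecutive paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept whole rather than split.
    """
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and current_len + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

sample = "First paragraph here.\n\nSecond one.\n\nA third, longer paragraph follows."
for i, chunk in enumerate(chunk_by_paragraph(sample, max_words=5)):
    print(i, repr(chunk))
```

Because chunks always end on a paragraph boundary, a sentence is never cut in half, at the cost of less uniform chunk sizes.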

Step 2: Embedding Models

An embedding model converts text into a vector — a list of numbers that encodes the semantic meaning of the text. Similar meanings produce similar vectors. This is what enables the AI to find "lightweight glasses for translation" when the database contains "Ray-Ban Meta Ultra: 48g, 40-language real-time translation."
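
To make "similar vectors" concrete, here is a small sketch of cosine similarity, the measure used later in this article. The three-dimensional vectors are invented for illustration only; a real embedding model such as the ones below produces hundreds of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Made-up toy vectors, NOT real model output.
query_vec = [0.9, 0.1, 0.2]            # "lightweight glasses for translation"
translation_glasses = [0.8, 0.2, 0.1]  # a translation-glasses product profile
smart_ring = [0.1, 0.9, 0.3]           # a health-tracking ring profile

print("glasses:", round(cosine_similarity(query_vec, translation_glasses), 3))
print("ring:   ", round(cosine_similarity(query_vec, smart_ring), 3))
```

The glasses vector points in nearly the same direction as the query, so its score is close to 1, while the ring scores much lower. Retrieval is this comparison repeated across the whole knowledge base.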

Embedding Model Comparison (2026)

Model	Dimensions	Languages	Cost	Best For
paraphrase-multilingual-MiniLM-L12-v2	384	50+ languages	Free	Arabic/multilingual apps ✅
all-MiniLM-L6-v2	384	English only	Free	English-only prototyping
text-embedding-3-small	1536	Multilingual	$0.02/1M tokens	Production apps using a hosted API
text-embedding-3-large	3072	Multilingual	$0.13/1M tokens	Highest accuracy needed

Why paraphrase-multilingual-MiniLM for Arabic-language apps? If your users query in Arabic but your data is in English or mixed, an English-only model will fail: it was not trained to represent Arabic text, so the query vectors will not land near the vectors of documents with the same meaning. A multilingual model is trained to map equivalent text in different languages to nearby points in one shared space.

Step 3: Vector Databases

A vector database stores your embeddings alongside the original data (called the "payload") and provides extremely fast similarity search — finding the top-K most similar vectors in milliseconds, even across millions of records.

Qdrant

Runs locally in memory or as a persistent server. Best for learning and self-hosted production.

Free · Open source · Fast

ChromaDB

Simplest setup. Great for prototyping and local development. Built-in embedding support.

Free · Easiest to start

Pinecone

Fully managed cloud vector database. Enterprise-grade reliability and scale.

Paid · Best for production scale
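
Conceptually, all three databases answer the same question: given a query vector, which stored vectors are closest? The sketch below shows that idea with a brute-force scan. `TinyVectorStore` is a hypothetical teaching class, not the API of any of the products above, and the vectors are invented. Real vector databases avoid scanning every record by using an approximate nearest-neighbor index (Qdrant, for example, uses HNSW).

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class TinyVectorStore:
    """Toy in-memory store: keeps (vector, payload) pairs and scans all of them."""

    def __init__(self) -> None:
        self._points: list[tuple[list[float], dict]] = []

    def add(self, vector: list[float], payload: dict) -> None:
        self._points.append((vector, payload))

    def search(self, query: list[float], top_k: int = 3) -> list[tuple[float, dict]]:
        # Score every stored vector, then keep the top_k highest scores.
        scored = [(cosine(query, vec), payload) for vec, payload in self._points]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]

store = TinyVectorStore()
store.add([0.8, 0.2, 0.1], {"name": "Ray-Ban Meta Ultra"})
store.add([0.1, 0.9, 0.3], {"name": "Samsung Galaxy Ring v2"})
store.add([0.7, 0.1, 0.6], {"name": "Xreal Air 3 Ultra"})

for score, payload in store.search([0.9, 0.1, 0.2], top_k=2):
    print(f"{score:.3f}  {payload['name']}")
```

The payload travels with the vector, which is why a search result can be handed straight to the LLM without a second lookup.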

Complete Implementation: The Gadget Guru

Let's build a complete RAG system — a product advisor that knows every AI wearable device released in 2026. This demonstrates every concept in a working, runnable system.

pip install pandas qdrant-client sentence-transformers openai python-dotenv

"""
The Gadget Guru — AI Wearable Advisor (Complete RAG Implementation)
Demonstrates: Chunking → Embedding → Vector Store → Retrieval → Generation
"""

import os
import pandas as pd
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

# ═══════════════════════════════════════════════════════════════
# INITIALIZATION
# ═══════════════════════════════════════════════════════════════

encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
qdrant = QdrantClient(":memory:")   # Use QdrantClient("localhost", port=6333) for persistence
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

COLLECTION = "gadgets_2026"
EMBEDDING_DIM = encoder.get_sentence_embedding_dimension()  # 384

# ═══════════════════════════════════════════════════════════════
# PHASE A: PREPARE (Run once — builds the knowledge base)
# ═══════════════════════════════════════════════════════════════

def load_and_index_data(data: list[dict]) -> None:
    """
    Phase A complete:
    1. Build text profile for each item (the "chunk")
    2. Embed each profile into a vector
    3. Store vectors + original data in Qdrant
    """

    # Step 1: Chunking — for structured data, one row = one chunk
    # We create a rich text profile that captures all searchable attributes
    profiles = [
        f"{g['name']} ({g['category']}). "
        f"Specs: {g['specs']}. "
        f"Pros: {g['pros']}. "
        f"Cons: {g['cons']}. "
        f"Price: ${g['price']}"
        for g in data
    ]

    # Step 2: Embedding — convert each profile to a vector
    embeddings = encoder.encode(profiles)
    print(f"Embedding shape: {embeddings.shape}")
    # Output for the 4-item sample below: (4, 384) = 4 products, 384-dimensional vectors

    # Step 3: Store in Vector Database
    qdrant.create_collection(
        collection_name=COLLECTION,
        vectors_config=models.VectorParams(
            size=EMBEDDING_DIM,
            distance=models.Distance.COSINE  # Best for text similarity
        )
    )

    points = [
        models.PointStruct(
            id=idx,
            vector=embeddings[idx].tolist(),
            payload=data[idx]  # Store original data as payload
        )
        for idx in range(len(data))
    ]

    qdrant.upload_points(collection_name=COLLECTION, points=points)
    print(f"Indexed {len(points)} products across {len(set(g['category'] for g in data))} categories")

# ═══════════════════════════════════════════════════════════════
# PHASE B: QUERY (Runs for every user request)
# ═══════════════════════════════════════════════════════════════

def retrieve(query: str, top_k: int = 3) -> list[dict]:
    """
    Phase B, Steps 1-2: Encode question → Similarity Search → Return top matches
    """
    # Step 1: Encode the user's question using the SAME embedding model
    query_vector = encoder.encode(query).tolist()

    # Step 2: Find the most similar product profiles
    results = qdrant.query_points(
        collection_name=COLLECTION,
        query=query_vector,
        limit=top_k
    )

    # Return products with their similarity scores
    return [
        {**hit.payload, "similarity_score": round(hit.score, 3)}
        for hit in results.points
    ]

def generate(query: str, context: list[dict]) -> str:
    """
    Phase B, Step 3: Send retrieved context + question to LLM for generation
    """
    # Format the retrieved products for the LLM
    context_text = "\n".join([
        f"- {g['name']} ({g['category']}): "
        f"{g['specs']}. "
        f"Pros: {g['pros']}. "
        f"Cons: {g['cons']}. "
        f"Price: ${g['price']}"
        for g in context
    ])

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"""You are a technical advisor specializing in 2026 AI wearable devices.

Your rules:
- Recommend ONLY based on the product data provided below
- If no product matches the requirements, say so clearly
- Always mention price and one key limitation for every recommendation
- If multiple options fit, rank them from best to least suitable
- Be direct and specific — no fluff

Available products from the database:
{context_text}"""
            },
            {
                "role": "user",
                "content": query
            }
        ],
        temperature=0.1,    # Low temperature = consistent, factual responses
        max_tokens=400
    )

    return response.choices[0].message.content

def ask(query: str) -> str:
    """
    The complete RAG pipeline: Retrieve → Augment → Generate
    """
    print(f"\nQuery: {query}")
    print("-" * 50)

    # Retrieval
    context = retrieve(query, top_k=3)
    print(f"Retrieved {len(context)} products:")
    for g in context:
        print(f"  [{g['similarity_score']}] {g['name']} — ${g['price']}")

    # Generation (with retrieved context)
    answer = generate(query, context)
    print(f"\nAdvisor Response:\n{answer}")

    return answer

# Sample dataset (in production, load from CSV/database)
gadgets_2026 = [
    {
        "name": "Ray-Ban Meta Ultra",
        "category": "Smart Glasses",
        "specs": "Weight: 48g, Camera: 48MP, Translation: 40 languages real-time, Battery: 4h active",
        "pros": "Lightest smart glasses, best translation, stylish everyday design",
        "cons": "No AR display, $549 is premium pricing",
        "price": 549
    },
    {
        "name": "Xreal Air 3 Ultra",
        "category": "AR Glasses",
        "specs": "Weight: 67g, Display: 4K equivalent AR, Battery: 3h active, Audio: spatial",
        "pros": "Best AR display quality, ideal for productivity and gaming",
        "cons": "Heavier than Ray-Ban, no translation feature",
        "price": 449
    },
    {
        "name": "Samsung Galaxy Ring v2",
        "category": "Smart Ring",
        "specs": "Weight: 2.8g, Battery: 10 days, Sensors: heart rate, SpO2, sleep, temperature",
        "pros": "Invisible health tracking, longest battery life, waterproof",
        "cons": "No display, limited notification support",
        "price": 349
    },
    {
        "name": "Oppo Air Glass 3",
        "category": "AR Glasses",
        "specs": "Weight: 38g, Translation: offline in 12 languages, Battery: 6h, Display: monochrome",
        "pros": "Lightest AR glasses, offline translation, longest AR battery",
        "cons": "Monochrome display only, no color AR",
        "price": 399
    },
]

# Run the complete pipeline
load_and_index_data(gadgets_2026)

# Test queries
ask("I want the lightest glasses for real-time translation under $500")
ask("Best option for AR gaming and productivity")
ask("Cheapest health tracking device")

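Expected output (similarity scores and the advisor's wording are illustrative and will differ between runs; only the first query is shown):

Embedding shape: (4, 384)
Indexed 4 products across 3 categories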

Query: I want the lightest glasses for real-time translation under $500
--------------------------------------------------
Retrieved 3 products:
  [0.942] Ray-Ban Meta Ultra — $549
  [0.871] Oppo Air Glass 3 — $399
  [0.734] Xreal Air 3 Ultra — $449

Advisor Response:
For real-time translation under $500, the Oppo Air Glass 3 ($399) is your best bet:
- Lightest AR option at 38g
- Offline translation in 12 languages (no internet needed)
- 6-hour battery — longest in this category

Limitation: Monochrome display only, which may bother you if you want full-color AR.

If you primarily need style over AR and can stretch to $549, Ray-Ban Meta Ultra offers
translation in 40 languages — but it exceeds your $500 budget.
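
The sample dataset in the listing is hard-coded, and its comment notes that production data would come from a CSV file or database. A minimal sketch of that loading step with the standard library is shown here. The function name `load_catalog` and the column layout are assumptions that mirror the dictionary keys used in the listing.

```python
import csv
import io

REQUIRED_FIELDS = ("name", "category", "specs", "pros", "cons", "price")

def load_catalog(csv_file) -> list[dict]:
    """Read product rows from an open CSV file into the list-of-dicts shape used above."""
    rows = []
    # The header is line 1, so the first data row is reported as line 2.
    for line_no, row in enumerate(csv.DictReader(csv_file), start=2):
        missing = [f for f in REQUIRED_FIELDS if not (row.get(f) or "").strip()]
        if missing:
            raise ValueError(f"Row {line_no} is missing: {', '.join(missing)}")
        product = {f: row[f].strip() for f in REQUIRED_FIELDS}
        price = float(product["price"])
        # Keep whole-dollar prices as int so they print as $399, not $399.0.
        product["price"] = int(price) if price.is_integer() else price
        rows.append(product)
    return rows

# In-memory stand-in for a real file; fields containing commas must be quoted.
sample_csv = io.StringIO(
    "name,category,specs,pros,cons,price\n"
    '"Oppo Air Glass 3",AR Glasses,"Weight: 38g, Battery: 6h",'
    '"Lightest AR glasses","Monochrome display only",399\n'
)
catalog = load_catalog(sample_csv)
print(catalog[0]["name"], catalog[0]["price"])
```

With a real file, the call would look like `load_catalog(open("gadgets.csv", newline="", encoding="utf-8"))`, where `gadgets.csv` is a placeholder name. Since pandas is already installed for this project, `pd.read_csv(path).to_dict("records")` produces the same shape without the validation.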

Why the Context Goes in System (Not User)

This is a RAG best practice that can noticeably affect output quality:

Production Best Practice: Context Placement

❌ Naive approach — context in user message

messages=[
  {"role": "user",
   "content": f"Data: {context}\n\nQ: {query}"}
]

LLM treats data as "user opinion" — less authoritative

✅ Production approach — context in system

messages=[
  {"role": "system",
   "content": f"Use only this data:\n{context}"},
  {"role": "user",
   "content": query}
]

LLM is more likely to treat the data as "ground truth", which tends to reduce hallucinations
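
The production approach can be wrapped in a small helper so every request builds its messages the same way. This is a sketch under assumptions: `build_messages` is a hypothetical name, the prompt wording is only one possible phrasing, and the bracketed source numbers are an added convention that lets the answer cite which product it used.

```python
def build_messages(query: str, products: list[dict]) -> list[dict]:
    """Place retrieved product data in the system message and the question in the user message."""
    if products:
        # Number each product so the model can cite it as [1], [2], ...
        lines = [
            f"[{i}] {p['name']} ({p['category']}): {p['specs']}. Price: ${p['price']}"
            for i, p in enumerate(products, start=1)
        ]
        data = "\n".join(lines)
    else:
        data = "(no matching products were retrieved)"

    system = (
        "Answer using only the product data below. "
        "Cite the bracketed number of each product you rely on. "
        "If the data does not answer the question, say so.\n\n"
        f"Product data:\n{data}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": query},
    ]

messages = build_messages(
    "Which glasses work offline?",
    [{"name": "Oppo Air Glass 3", "category": "AR Glasses",
      "specs": "Translation: offline in 12 languages", "price": 399}],
)
print(messages[0]["content"])
```

The returned list can be passed directly as the `messages` argument of `client.chat.completions.create`. Handling the empty case explicitly gives the model a clear signal to decline instead of answering from memory.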

Advanced RAG Techniques

Once your basic RAG pipeline works, these techniques improve accuracy for production systems:

Hybrid Search: Combining Vector + Keyword

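Pure semantic search can miss exact matches. "Model XR-7" might not be found semantically if the query says "XR7 glasses." Hybrid search combines vector search (meaning) with keyword search (exact terms). Full hybrid search fuses a keyword or sparse-vector ranking with the dense-vector ranking. A simpler first step, shown here, is payload filtering, where Qdrant applies exact-match and range conditions alongside the vector search:

# Payload filtering: exact-match and range conditions applied alongside vector search
query_vector = encoder.encode(query).tolist()  # query is the user's question, encoded as in retrieve()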
results = qdrant.query_points(
    collection_name=COLLECTION,
    query=query_vector,
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="category",
                match=models.MatchValue(value="AR Glasses")
            ),
            models.FieldCondition(
                key="price",
                range=models.Range(lte=500)  # $500 or less (use lt=500 for strictly under)
            )
        ]
    ),
    limit=5
)
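
To combine a keyword ranking with a vector ranking, one common option is reciprocal rank fusion (RRF), which scores each document by summing 1 / (k + rank) over every ranking it appears in. The sketch below is an illustration, not Qdrant's built-in mechanism: the keyword matcher is a deliberately simple term-overlap count, the document IDs and texts are invented, and `vector_ranking` stands in for the order a vector search might return.

```python
import re

def normalize(text: str) -> list[str]:
    """Lowercase and drop punctuation inside tokens, so 'XR-7' and 'XR7' both become 'xr7'."""
    tokens = (re.sub(r"[^a-z0-9]", "", word) for word in text.lower().split())
    return [t for t in tokens if t]

def keyword_rank(query: str, docs: dict[str, str]) -> list[str]:
    """Rank documents by how many distinct query terms they contain; skip non-matches."""
    terms = set(normalize(query))
    counts = {doc_id: len(terms & set(normalize(text))) for doc_id, text in docs.items()}
    matched = [doc_id for doc_id in docs if counts[doc_id] > 0]
    return sorted(matched, key=lambda d: counts[d], reverse=True)

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several rankings: each document scores sum(1 / (k + rank)) across rankings."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

docs = {  # invented example documents
    "p1": "Model XR-7 AR glasses",
    "p2": "Lightweight translation eyewear",
    "p3": "Smart ring with sleep tracking",
}
keyword_ranking = keyword_rank("XR7 glasses", docs)
vector_ranking = ["p2", "p1", "p3"]  # hypothetical order from a vector search

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

In this toy case the exact-match document `p1` is second in the vector ranking but first after fusion, because it is the only document the keyword ranking supports. The constant `k=60` is a commonly used default and can be tuned.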

Re-Ranking: Two-Stage Retrieval

Retrieve more candidates (top-20), then re-rank them with a more accurate cross-encoder model:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Stage 1: Fast vector search (top 20 candidates)
candidates = retrieve(query, top_k=20)

# Stage 2: Accurate re-ranking (pick top 3)
pairs = [(query, f"{g['name']}: {g['specs']} {g['pros']}") for g in candidates]
scores = reranker.predict(pairs)
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
top_3 = [item for item, score in ranked[:3]]

Query Expansion: Cover More Angles

Generate alternative phrasings of the user's question before searching:

def expand_query(query: str) -> list[str]:
    """Generate alternative versions of the query for broader retrieval."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Generate 3 alternative phrasings of this search query, one per line:\n{query}"
        }],
        max_tokens=100
    )
    lines = response.choices[0].message.content.strip().split('\n')
    alternatives = [line.strip() for line in lines if line.strip()]  # drop blank lines
    return [query] + alternatives  # Original + alternatives

# Search with all phrasings, deduplicate results
all_queries = expand_query("light glasses for translation")
# → ["light glasses for translation",
#    "lightweight smart glasses with translation feature",
#    "AR glasses that support language translation",
#    "translation glasses under 50 grams"]
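
The expanded queries still need to be searched and merged. A minimal sketch of that step follows. It assumes each hit is a dict with `name` and `similarity_score` keys, as returned by `retrieve` in the main listing, and that product names are unique. The helper name `retrieve_multi` is made up for this example.

```python
from typing import Callable

def retrieve_multi(
    queries: list[str],
    retrieve_fn: Callable[[str], list[dict]],
    top_k: int = 3,
) -> list[dict]:
    """Run every query, keep each product's best-scoring hit, return the top_k overall."""
    best: dict[str, dict] = {}
    for q in queries:
        for hit in retrieve_fn(q):
            name = hit["name"]  # assumption: names uniquely identify products
            if name not in best or hit["similarity_score"] > best[name]["similarity_score"]:
                best[name] = hit
    ranked = sorted(best.values(), key=lambda h: h["similarity_score"], reverse=True)
    return ranked[:top_k]

# Stand-in for the real retrieve() so the example runs without a vector database.
def fake_retrieve(query: str) -> list[dict]:
    canned = {
        "light glasses for translation": [
            {"name": "Ray-Ban Meta Ultra", "similarity_score": 0.81},
            {"name": "Oppo Air Glass 3", "similarity_score": 0.78},
        ],
        "AR glasses that support language translation": [
            {"name": "Oppo Air Glass 3", "similarity_score": 0.88},
            {"name": "Xreal Air 3 Ultra", "similarity_score": 0.70},
        ],
    }
    return canned.get(query, [])

merged = retrieve_multi(
    ["light glasses for translation", "AR glasses that support language translation"],
    fake_retrieve,
)
for hit in merged:
    print(hit["similarity_score"], hit["name"])
```

In the real pipeline the call would be `retrieve_multi(expand_query(user_query), retrieve)`. Keeping the maximum score per product is one simple merge rule; summing scores or fusing ranks are alternatives.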

Evaluating RAG Quality

RAG is only as good as its retrieval. Here are the three metrics that matter:

🎯

Faithfulness

Does the answer actually come from the retrieved context? Or is the LLM still hallucinating?

Goal: >90%

🔍

Context Relevance

Are the retrieved chunks actually relevant to the question? Or is retrieval pulling noise?

Goal: >85%

✅

Answer Relevance

Does the final answer actually address what the user asked? Good context, bad answer?

Goal: >90%

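# RAGAS: a widely used open-source library for RAG evaluation
pip install ragas datasets

# Metric names differ between ragas releases. `context_relevancy` comes from older
# versions and may be missing in newer ones, which provide `context_precision`
# instead. Check the documentation for the version you install.
# The metrics are scored by an LLM, so an API key (OPENAI_API_KEY by default) is required.
from ragas import evaluate
from ragas.metrics import faithfulness, context_relevancy, answer_relevancy
from datasets import Dataset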

# Build evaluation dataset
eval_data = {
    "question": ["Best lightweight translation glasses?"],
    "answer": ["Oppo Air Glass 3 at 38g with 12-language offline translation..."],
    "contexts": [["Ray-Ban Meta Ultra: 48g...", "Oppo Air Glass 3: 38g..."]],
    "ground_truth": ["The Oppo Air Glass 3 is the lightest with offline translation."]
}

dataset = Dataset.from_dict(eval_data)
results = evaluate(dataset, metrics=[faithfulness, context_relevancy, answer_relevancy])

print(results)
# Example output (values are illustrative): {'faithfulness': 0.95, 'context_relevancy': 0.88, 'answer_relevancy': 0.92}
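
Retrieval can also be checked without an LLM, using a handful of hand-labeled queries. The sketch below computes precision@k and hit rate, two standard retrieval measures that serve as a rough proxy for context relevance. The function names, the test cases, and the stand-in retriever are all invented for illustration.

```python
from typing import Callable

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for name in retrieved[:k] if name in relevant) / k

def evaluate_retrieval(
    cases: list[tuple[str, set[str]]],
    retrieve_names: Callable[[str], list[str]],
    k: int = 3,
) -> dict[str, float]:
    """Average precision@k and hit rate (share of queries with at least one relevant hit)."""
    if not cases:
        return {"precision_at_k": 0.0, "hit_rate": 0.0}
    precisions, hits = [], 0
    for query, relevant in cases:
        names = retrieve_names(query)
        precisions.append(precision_at_k(names, relevant, k))
        if any(name in relevant for name in names[:k]):
            hits += 1
    return {
        "precision_at_k": sum(precisions) / len(cases),
        "hit_rate": hits / len(cases),
    }

# Hand-labeled cases: (query, set of product names a human judged relevant).
cases = [
    ("lightest translation glasses", {"Oppo Air Glass 3", "Ray-Ban Meta Ultra"}),
    ("cheapest health tracker", {"Samsung Galaxy Ring v2"}),
]

# Stand-in retriever returning fixed names; swap in the real pipeline when available.
def fake_retrieve_names(query: str) -> list[str]:
    if "translation" in query:
        return ["Oppo Air Glass 3", "Ray-Ban Meta Ultra", "Xreal Air 3 Ultra"]
    return ["Xreal Air 3 Ultra", "Oppo Air Glass 3", "Ray-Ban Meta Ultra"]

print(evaluate_retrieval(cases, fake_retrieve_names, k=3))
```

With the article's pipeline, `retrieve_names` could be `lambda q: [hit["name"] for hit in retrieve(q)]`. A low hit rate points at chunking or embedding problems before any prompt tuning is worth attempting.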

Real-World Case Studies

📄 Enterprise Document Q&A

Challenge: 50,000-page policy library that employees need to navigate daily.

RAG Solution: Chunk each policy section → embed → store. Employees ask in natural language, get precise answers with the exact policy page cited.

Result: HR inquiry time reduced from 2 days to 3 minutes.

🛒 E-Commerce Product Discovery

Challenge: 500,000+ SKUs. Keyword search fails for "lightweight eco-friendly running shoes for wide feet."

RAG Solution: Embed product descriptions → semantic search → LLM ranks and explains matches.

Result: 34% increase in "search to add-to-cart" conversion rate.

🎧 Customer Support Automation

Challenge: Support team answering 2,000 repetitive tickets/day about product specs, return policies, compatibility.

RAG Solution: Embed all support docs and past tickets → AI resolves 78% automatically.

Result: A 78% deflection rate, leaving the support team free to focus on complex cases.

⚖️ Legal Research Assistant

Challenge: Lawyers spending 60% of time searching case law across millions of documents.

RAG Solution: Embed case law database → semantic search finds relevant precedents → LLM summarizes with citations.

Result: Research time cut by 70%. Citations point to retrieved documents, so they can be verified instead of invented.

RAG vs Fine-Tuning: When to Use Which

Dimension	RAG	Fine-Tuning
Purpose	Inject knowledge	Change behavior/style
Data updates	Real-time ✅	Requires re-training ❌
Cost	Low — storage + API calls	High — GPU training cost
Hallucination risk	Low with citations ✅	Medium (still hallucinates)
Best use case	Product Q&A, document search, support bots	Brand voice, domain jargon, specialized format

The Power Combo: Use RAG for factual knowledge + Fine-tuning for tone and style. This is how the best enterprise AI systems are built in 2026.

Key Takeaways

①

RAG solves the knowledge problem, not the capability problem. If your AI is hallucinating about your data, RAG fixes it. If your AI writes in the wrong tone, you need fine-tuning.

②

Chunking quality determines RAG quality. Poor chunking means poor retrieval means poor answers — no matter how good your LLM is. Invest time here.

③

Use multilingual embedding models for non-English apps. A mismatch between the languages your embedding model supports and the languages of your queries and documents is a silent killer that will give you terrible retrieval results.

④

Put retrieved context in the system prompt. The LLM treats system-level content as authoritative ground truth. This single change significantly reduces hallucination rates.

⑤

RAG is the most important skill in applied AI right now. You don't need $100M to give an LLM your company's knowledge. You need Qdrant, a sentence transformer, and about 100 lines of Python.

What's Next in the Series

Fine-Tuning: Teaching AI Your Brand's Voice

RAG gives your AI factual knowledge. Fine-tuning gives it your voice, your style, your domain-specific vocabulary. With LoRA, you can customize a massive model on a laptop — no GPU cluster needed. Learn when to use fine-tuning vs RAG, and how to combine both for maximum impact.

✦ RAG vs Fine-Tuning: the clear decision framework

✦ How LoRA works (the small adapter that changes everything)

✦ Complete fine-tuning pipeline with code

✦ The ultimate hybrid: RAG + Fine-Tuning