Part 13 of 14

Part 14 — RAG: Give Your AI a Memory

ChatGPT confidently recommended a smart glasses model discontinued two years ago. The fix doesn't require retraining anything. RAG — Retrieval Augmented Generation — lets any LLM answer from YOUR data, in real time, without a single GPU.

March 12, 2026
15 min read
#RAG #Retrieval Augmented Generation #Vector Database #Embeddings #LLM #Qdrant #ChromaDB #AI Applications

RAG: Give Your AI a Memory

When an AI hallucinates, the cause is often a lack of context rather than a lack of intelligence. Retrieval Augmented Generation (RAG) turns 'closed-book' models into 'open-book' experts that can read your private data in real time.

Primary Objective
Retrieval | Vector Search | Context Augmentation | Hallucination Reduction
💡
The Knowledge Problem

AI training data has a cutoff date. RAG solves this by letting the model search your 2026 catalog or internal docs before it speaks.


What RAG Actually Is: The R-A-G Framework

The Three Pillars

🔍RETRIEVAL (R)

Find relevant chunks from your database before generating. The 'Search' phase.

AUGMENTED (A)

Inject those chunks into the context window as ground truth. The 'Context' phase.

🤖GENERATION (G)

The LLM writes the response based only on the facts provided. The 'Answer' phase.

RAG vs. Fine-Tuning: When to Use What

A common mistake is thinking you must fine-tune a model to teach it about your company. You almost never do.

Implementation Choice

🧠RAG (Retrieval)
  • Use for: New knowledge, inventory, real-time facts.
  • Updates: Instant (just add a vector). Cost: $0–$10/mo. Trust: High (citable).
🎭FINE-TUNING
  • Use for: Tone, style, specialized format, niche terminology.
  • Updates: Requires retraining ($$$). Cost: hundreds–thousands. Trust: Low (can still hallucinate).

The Production Pipeline

The RAG Lifecycle
  • Phase A (Prepare): Chunk documents → Embed to vectors → Store in Vector DB. (Run once, then incrementally whenever the data changes.)
  • Phase B (Query): Encode user question → Similarity search → Context Augmentation → Generate. (Every request.)

Step 1: Chunking — The Most Important Decision

Chunking sets the granularity of retrieval. Too large = irrelevant context bloats the prompt. Too small = you lose the surrounding context needed to understand a fact.

Strategy | How It Works | Best For | Typical Size
Fixed Size | Split every N tokens | Long-form text, books | 300–500 tokens
Paragraph-based | Split on paragraph breaks | Docs, articles, blogs | 100–300 tokens
Semantic Chunking | Split when the topic shifts (embedding distance) | Mixed-topic documents | Variable
Row-based (CSV/DB) | One row = one chunk | Product catalogs, FAQs, tables ✅ | 1 record
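
Semantic chunking is the least obvious row in the table, so here is a minimal sketch of the idea. It assumes the text is already split into sentences, and it uses a hypothetical `toy_embed` (plain word counts) as a stand-in for a real embedding model; the `threshold` value is also an assumption you would tune on your own documents.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, embed=toy_embed, threshold=0.3):
    """Start a new chunk whenever a sentence is dissimilar to the one before it."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        if cosine(prev_vec, vec) < threshold:  # similarity dropped: treat as a topic shift
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks
```

Swapping `toy_embed` for a real model's encode function keeps the same loop; only the threshold needs retuning, because similarity scores are distributed differently per model.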

Always add 50–100 tokens of overlap between fixed/paragraph chunks so a fact isn't split across a boundary and lost:

python
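def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of `chunk_size` words; neighbors share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap          # how far each new chunk advances
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), step)]
# Sizes are counted in words here, as a rough proxy for tokens.
# Chunk 1 starts 50 words before chunk 0 ends, so a short fact on the boundary appears whole in one of them.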
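
Row-based chunking, the strategy the table recommends for catalogs, needs no overlap at all. A minimal sketch using only the standard library, assuming the catalog arrives as CSV text; `rows_to_chunks` and the sample products are hypothetical names made up for illustration.

```python
import csv
import io

def rows_to_chunks(csv_text):
    """Turn each CSV row into one (text_to_embed, payload) pair."""
    chunks = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The text is what gets embedded; the payload is what gets stored next to the vector.
        text = ". ".join(f"{column}: {value}" for column, value in row.items())
        chunks.append((text, dict(row)))
    return chunks

# Hypothetical two-product catalog.
sample = "name,category,price\nLumen One,AR Glasses,399\nPulse Band,Fitness,129\n"
chunks = rows_to_chunks(sample)
```

CSV values come back as strings, so cast numeric fields such as `price` before storing them if you plan to filter on them later.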

Step 2: Pick the Right Embedding Model

Model | Dims | Languages | Cost | Best For
paraphrase-multilingual-MiniLM-L12-v2 | 384 | 50+ | Free | Arabic/multilingual apps ✅
all-MiniLM-L6-v2 | 384 | English | Free | English-only prototyping
text-embedding-3-small | 1536 | Multilingual | $0.02/1M tokens | Production English apps
text-embedding-3-large | 3072 | Multilingual | $0.13/1M tokens | Highest accuracy needed

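If users query in Arabic but your data is English, an English-only model fails: it has no meaningful representation for Arabic text, so the query vector lands far from the documents it should match. A multilingual model maps both languages into the same space, so a question and its answer end up close together. (And remember: use the same model for indexing and querying, since vectors from different models are not comparable.)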

Step 3: Store in a Vector Database

A vector DB stores embeddings alongside the original data (the "payload") and finds the top-K most similar vectors in milliseconds. Qdrant (fast, open-source, local or server), ChromaDB (simplest to start), and Pinecone (managed cloud) are the common choices.
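
To make "vector plus payload, top-K by similarity" concrete, here is a minimal brute-force sketch in plain Python. `TinyVectorStore` is a hypothetical class for illustration only; it is not how Qdrant, ChromaDB, or Pinecone are implemented, since real vector databases use approximate nearest-neighbor indexes to stay fast at scale.

```python
import heapq
import math

class TinyVectorStore:
    """Illustrative in-memory store: keeps (vector, payload) pairs and scans all of them."""

    def __init__(self):
        self._points = []

    def add(self, vector, payload):
        self._points.append((list(vector), payload))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def search(self, query_vector, top_k=3):
        """Return the top_k (score, payload) pairs, highest cosine similarity first."""
        scored = ((self._cosine(query_vector, vec), payload) for vec, payload in self._points)
        return heapq.nlargest(top_k, scored, key=lambda pair: pair[0])

store = TinyVectorStore()
store.add([1.0, 0.0], {"name": "A"})
store.add([0.0, 1.0], {"name": "B"})
store.add([1.0, 1.0], {"name": "C"})
hits = store.search([1.0, 0.0], top_k=2)
```

The payload travels with the vector, which is why a search hit can be turned straight into prompt context without a second lookup.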


Complete Implementation: The Gadget Guru

A product advisor that knows every 2026 AI wearable — every concept in one runnable system:

python
# pip install qdrant-client sentence-transformers openai
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer
from openai import OpenAI

encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
qdrant  = QdrantClient(":memory:")          # in-memory for demo
client  = OpenAI()
COLLECTION, DIM = "gadgets_2026", encoder.get_sentence_embedding_dimension()  # 384

# ── PHASE A: PREPARE (run once) ──────────────────────────────
def index_data(data):
    profiles = [f"{g['name']} ({g['category']}). Specs: {g['specs']}. "
                f"Pros: {g['pros']}. Cons: {g['cons']}. Price: ${g['price']}" for g in data]
    embeddings = encoder.encode(profiles)
    qdrant.create_collection(COLLECTION,
        vectors_config=models.VectorParams(size=DIM, distance=models.Distance.COSINE))
    qdrant.upload_points(COLLECTION, [
        models.PointStruct(id=i, vector=embeddings[i].tolist(), payload=data[i])
        for i in range(len(data))])

# ── PHASE B: QUERY (every request) ───────────────────────────
def retrieve(query, top_k=3):
    hits = qdrant.query_points(COLLECTION, query=encoder.encode(query).tolist(), limit=top_k)
    return [{**h.payload, "score": round(h.score, 3)} for h in hits.points]

def generate(query, context):
    ctx = "\n".join(f"- {g['name']} ({g['category']}): {g['specs']}. "
                    f"Pros: {g['pros']}. Cons: {g['cons']}. Price: ${g['price']}" for g in context)
    return client.chat.completions.create(
        model="gpt-4o-mini", temperature=0.1, max_tokens=400,
        messages=[
            {"role": "system", "content":
                "You are a 2026 AI-wearable advisor. Recommend ONLY from the products below. "
                "If none match, say so. Always mention price and one limitation. Rank multiple fits.\n\n"
                f"Available products:\n{ctx}"},
            {"role": "user", "content": query},
        ]).choices[0].message.content

def ask(query):                              # Retrieve → Augment → Generate
    return generate(query, retrieve(query, top_k=3))

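For the query "lightest glasses for real-time translation under $500", the system retrieves the three closest products by meaning, and the LLM ranks them using only your data, so every recommendation can be traced back to a catalog record. Note that vector similarity does not enforce the $500 limit by itself: here the LLM applies it from the prices in the context, and a metadata filter (see Hybrid search under Advanced RAG Techniques) turns it into a hard constraint.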


Prompt Engineering for RAG

The most critical piece is the system prompt — strictly instruct the model to use only the provided context:

markdown
You are a helpful assistant. Answer the user's question using ONLY the
provided context. If the answer is not in the context, say
"I don't have enough information to answer that." Do not use your
internal knowledge to fill in gaps.

<CONTEXT>
{{ retrieved_chunks }}
</CONTEXT>

QUESTION: {{ user_query }}
🚫
Production Tip: The 'I Don't Know' Clause

Forcing the AI to say "I don't know" when context is missing is the #1 way to build trust. Never let a RAG bot guess. Put the context in the system prompt (not the user turn) so it reads as authoritative ground truth.
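
One way to enforce the clause in code as well as in the prompt is to skip the LLM call entirely when retrieval finds nothing relevant. A minimal sketch, assuming each retrieved chunk is a dict with `text` and `score` keys; `build_messages` and the `min_score` cutoff of 0.3 are hypothetical and should be tuned against your own similarity scores.

```python
FALLBACK = "I don't have enough information to answer that."

def build_messages(query, chunks, min_score=0.3):
    """Return chat messages, or None when no chunk is relevant enough to answer from."""
    relevant = [c for c in chunks if c["score"] >= min_score]
    if not relevant:
        return None  # caller replies with FALLBACK and never calls the LLM
    # Number the chunks so the model can cite them as [1], [2], ...
    context = "\n".join(f"[{i}] {c['text']}" for i, c in enumerate(relevant, start=1))
    system = (
        "Answer the user's question using ONLY the provided context. "
        f'If the answer is not in the context, say "{FALLBACK}" '
        "Do not use your internal knowledge to fill in gaps.\n\n"
        f"<CONTEXT>\n{context}\n</CONTEXT>"
    )
    return [
        {"role": "system", "content": system},  # context lives in the system prompt
        {"role": "user", "content": query},
    ]
```

The returned list has the same shape as the `messages` argument used in the Gadget Guru's `generate` function, so it can be passed to the chat call directly.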


Advanced RAG Techniques

Once the basic pipeline works, three techniques sharpen production accuracy:

  • Hybrid search — pure semantic search misses exact tokens ("XR-7" vs "XR7"). Combine vector search with keyword/metadata filters (category = "AR Glasses", price <= 500) in the same query.
  • Re-ranking (two-stage retrieval) — retrieve top-20 cheaply with vector search, then re-score with a slower, more accurate cross-encoder (cross-encoder/ms-marco-MiniLM-L-6-v2) and keep the top 3.
  • Query expansion — generate 3 alternative phrasings of the question, search with all of them, and dedupe — covering angles a single phrasing would miss.
python
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
candidates = retrieve(query, top_k=20)                       # stage 1: fast & broad
pairs = [(query, f"{g['name']}: {g['specs']} {g['pros']}") for g in candidates]
top_3 = [c for c, _ in sorted(zip(candidates, reranker.predict(pairs)),
                              key=lambda x: x[1], reverse=True)[:3]]  # stage 2: accurate
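
The other two techniques, hybrid search and query expansion, can be sketched in plain Python as post-processing over retrieved hits. Everything here is an assumption for illustration: the function names, the `keyword_boost` value, and the hit shape (dicts with `name`, `category`, `price`, and `score`, like the output of `retrieve` above).

```python
import re

def normalize(token):
    """Lowercase and drop punctuation so 'XR-7' and 'xr7' compare equal."""
    return re.sub(r"[^a-z0-9]", "", token.lower())

def keyword_hit(query, product):
    """True if a query token containing a digit (likely a model number) is in the product name."""
    name_tokens = {normalize(t) for t in product["name"].split()}
    return any(normalize(t) in name_tokens
               for t in query.split() if any(ch.isdigit() for ch in t))

def hybrid_rank(query, candidates, category=None, max_price=None, keyword_boost=0.2):
    """Apply hard metadata filters, then boost exact model-number matches."""
    kept = [c for c in candidates
            if (category is None or c["category"] == category)
            and (max_price is None or c["price"] <= max_price)]
    return sorted(kept,
                  key=lambda c: c["score"] + (keyword_boost if keyword_hit(query, c) else 0.0),
                  reverse=True)

def merge_expanded(result_lists):
    """Query expansion: merge hits from several phrasings, keeping each product's best score."""
    best = {}
    for hits in result_lists:
        for hit in hits:
            if hit["name"] not in best or hit["score"] > best[hit["name"]]["score"]:
                best[hit["name"]] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)
```

Filtering after retrieval can leave fewer than K results, so in production it is usually better to push the metadata conditions into the vector database query itself and keep only the keyword boost on the application side.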

Evaluating RAG Quality

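RAG is only as good as its retrieval. RAGAS, a widely used open-source evaluation library, measures the three metrics that matter:

python
# pip install ragas
from ragas import evaluate
from ragas.metrics import faithfulness, context_precision, answer_relevancy
# Note: older RAGAS releases exposed `context_relevancy`; it was deprecated in
# favor of `context_precision`, which also needs a reference answer per sample.

# `dataset` holds one row per test question: the question, the retrieved contexts,
# the generated answer, and a reference answer (column names vary by RAGAS version).
results = evaluate(dataset, metrics=[faithfulness, context_precision, answer_relevancy])
# Illustrative output: {'faithfulness': 0.95, 'context_precision': 0.88, 'answer_relevancy': 0.92}

Faithfulness: is the answer grounded in the retrieved context (no invention)? Context precision: did retrieval rank the right chunks at the top? Answer relevancy: does the answer actually address the question? Low faithfulness → strengthen the prompt; low context precision → fix chunking/embeddings.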
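
Retrieval can also be checked on its own, with no LLM judge, if you label which document each test question should fetch. A minimal sketch of two common retrieval metrics; the function names and the data shape (query → ranked list of retrieved ids, query → expected id) are assumptions for illustration and are not part of RAGAS.

```python
def hit_rate_at_k(results, expected, k=3):
    """Fraction of queries whose expected document id appears in the top-k retrieved ids."""
    if not results:
        return 0.0
    hits = sum(1 for query, ids in results.items() if expected[query] in ids[:k])
    return hits / len(results)

def mean_reciprocal_rank(results, expected):
    """Average of 1/rank of the expected id; a query that misses it contributes 0."""
    if not results:
        return 0.0
    total = 0.0
    for query, ids in results.items():
        if expected[query] in ids:
            total += 1.0 / (ids.index(expected[query]) + 1)
    return total / len(results)

# Hypothetical labeled test set: q1's answer is ranked second, q2's is never retrieved.
results = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
expected = {"q1": "b", "q2": "w"}
```

A falling hit rate after a chunking or embedding change points at retrieval before you spend anything on generation-side evaluation.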


Troubleshooting Common RAG Failures

Why RAG Fails

RETRIEVAL FAILURE

The right chunk was never fetched. Fix: improve chunking (add overlap) or use a stronger embedding model (text-embedding-3-large).

AUGMENTATION FAILURE

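The context was correct but too long, and the fact got buried. Fix: pass fewer chunks and re-order them to counter the "Lost in the Middle" effect: models tend to use the start and end of a long context better than the middle, so place the most relevant chunks there.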

GENERATION FAILURE

The LLM ignored the context. Fix: strengthen the system prompt and use a more capable model (GPT-4o / Claude 3.5 Sonnet).
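
The re-ordering fix for augmentation failures is small enough to sketch. This assumes the chunks arrive sorted from most to least relevant; `reorder_for_long_context` is a hypothetical helper name.

```python
def reorder_for_long_context(chunks_by_relevance):
    """Place the most relevant chunks at the edges and the least relevant in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate: ranks 1, 3, 5, ... fill the front; ranks 2, 4, ... fill the back.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Ranks 1 (best) to 5 (worst): the best lands first, the second best lands last.
ordered = reorder_for_long_context([1, 2, 3, 4, 5])
```

With only three chunks, as in the Gadget Guru, the effect is minor; it matters more once you pass ten or more chunks into a single prompt.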

Real-world RAG is everywhere now: customer-support bots grounded in help docs, internal "ask your wiki" assistants, legal/medical research tools that cite sources, and e-commerce advisors like the Gadget Guru above.


Key Takeaways

01
Knowledge is External

In 2026, we don't 'teach' models facts; we give them tools to find facts. RAG is the bridge between frozen model intelligence and live data reality.

02
Chunking is Your Database Schema

Poor chunking = poor retrieval. Invest in semantic chunking and 100-token overlaps for complex documents.

03
Grounding is a Choice

By putting context in the System Prompt and requiring citations, you transform an AI from a 'creative writer' into a 'precise librarian.'


Mohamed Hamed

20 years building production systems — the last several deep in AI integration, LLMs, and full-stack architecture. I write what I've actually built and broken. If this was useful, the next one goes to LinkedIn first.
