RAG: Give Your AI a Memory
When an AI hallucinates, it's often not a lack of intelligence, but a lack of context. Retrieval Augmented Generation (RAG) turns 'closed-book' models into 'open-book' experts that can read your private data in real-time.
AI training data has a cutoff date. RAG solves this by letting the model search your 2026 catalog or internal docs before it speaks.
What RAG Actually Is: The R-A-G Framework
The Three Pillars
Retrieval: Find relevant chunks from your database before generating. The 'Search' phase.
Augmentation: Inject those chunks into the context window as ground truth. The 'Context' phase.
Generation: The LLM writes the response based only on the facts provided. The 'Answer' phase.
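The three phases can be sketched in a few lines of plain Python. This is a toy illustration under loud assumptions: `retrieve` ranks chunks by shared words rather than embeddings, `llm` is a hypothetical callable standing in for any model client, and the sample documents are made up for the example.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, chunks, top_k=2):
    """R: rank chunks by how many words they share with the query (toy scoring)."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:top_k]

def augment(query, context):
    """A: place the retrieved chunks in the prompt as ground truth."""
    facts = "\n".join(f"- {c}" for c in context)
    return f"Answer using ONLY these facts:\n{facts}\n\nQuestion: {query}"

def generate(prompt, llm):
    """G: hand the augmented prompt to the model (hypothetical callable)."""
    return llm(prompt)

# Made-up sample documents, for illustration only.
docs = [
    "Returns are accepted within 30 days of delivery.",
    "Standard shipping takes 3 to 5 business days.",
    "Gift cards never expire.",
]
question = "How many days do I have for returns?"
prompt = augment(question, retrieve(question, docs, top_k=1))
print(prompt)
```

Swapping the word-overlap scorer for an embedding search changes only `retrieve`; the augment and generate steps stay the same.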
RAG vs. Fine-Tuning: When to Use What
A common mistake is thinking you must fine-tune a model to teach it about your company. You almost never do.
Implementation Choice
RAG
- Use for: New knowledge, inventory, real-time facts.
- Updates: Instant (just add a vector). Cost: $0–$10/mo. Trust: High (citable).
Fine-Tuning
- Use for: Tone, style, specialized format, niche terminology.
- Updates: Requires retraining ($$$). Cost: hundreds to thousands of dollars. Trust: Low (can still hallucinate).
The Production Pipeline
- Phase A (Prepare): Chunk documents → Embed to vectors → Store in Vector DB. (Run once, then again whenever the documents change.)
- Phase B (Query): Encode user question → Similarity search → Context Augmentation → Generate. (Every request.)
Step 1: Chunking — The Most Important Decision
Chunking sets the granularity of retrieval. Too large = irrelevant context bloats the prompt. Too small = you lose the surrounding context needed to understand a fact.
| Strategy | How It Works | Best For | Typical Size |
|---|---|---|---|
| Fixed Size | Split every N tokens | Long-form text, books | 300–500 tokens |
| Paragraph-based | Split on paragraph breaks | Docs, articles, blogs | 100–300 tokens |
| Semantic Chunking | Split when the topic shifts (embedding distance) | Mixed-topic documents | Variable |
| Row-based (CSV/DB) | One row = one chunk | Product catalogs, FAQs, tables ✅ | 1 record |
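Two of the strategies in the table need nothing beyond the standard library. The sketch below is one possible take, not a reference implementation: the function names, the `key: value` chunk format, and the sample text are all assumptions made for illustration.

```python
import csv
import io
import re

def chunk_by_paragraph(text):
    """Paragraph-based: split on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def chunk_by_row(csv_text):
    """Row-based: turn each CSV record into one self-describing chunk."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["; ".join(f"{k}: {v}" for k, v in row.items()) for row in reader]

# Made-up sample inputs, for illustration only.
article = "RAG retrieves first.\n\nThen it generates.\n\n"
catalog = "name,price\nDemo Glasses,299\nDemo Ring,149\n"

print(chunk_by_paragraph(article))
print(chunk_by_row(catalog))
```

Keeping the column names inside each row chunk means the embedded text still says what each value is, which tends to help retrieval on tabular data.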
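Always add 50–100 tokens of overlap between fixed-size or paragraph chunks so a fact isn't split across a boundary and lost. The function below counts whitespace-separated words as a rough stand-in for tokens:
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into word-based chunks that overlap by `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # each new chunk starts this many words later
    return [' '.join(words[i:i + chunk_size])
            for i in range(0, len(words), step)]
# Chunk 1 starts 50 words before chunk 0 ends, so a short fact near the boundary lands whole in one of them.
Step 2: Pick the Right Embedding Model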
| Model | Dims | Languages | Cost (per 1M tokens) | Best For |
|---|---|---|---|---|
| paraphrase-multilingual-MiniLM-L12-v2 | 384 | 50+ | Free | Arabic/multilingual apps ✅ |
| all-MiniLM-L6-v2 | 384 | English | Free | English-only prototyping |
| text-embedding-3-small | 1536 | Multilingual | $0.02 | Production English apps |
| text-embedding-3-large | 3072 | Multilingual | $0.13 | Highest accuracy needed |
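If users query in Arabic but your data is English, an English-only model fails: it has no meaningful representation for Arabic text, so the query vector lands far from the English documents that answer it. A multilingual model maps both languages into the same space, so text with the same meaning lands close together. (And remember: use the same model for indexing and querying.)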
Step 3: Store in a Vector Database
A vector DB stores embeddings alongside the original data (the "payload") and finds the top-K most similar vectors in milliseconds. Qdrant (fast, open-source, local or server), ChromaDB (simplest to start), and Pinecone (managed cloud) are the common choices.
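To make "embeddings plus payload, top-K by similarity" concrete, here is a brute-force stand-in for a vector DB. It is a teaching sketch only: `TinyVectorStore` is a hypothetical class, the two-dimensional vectors are made up, and a full scan like this would not scale the way a real index does.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class TinyVectorStore:
    """Hypothetical in-memory store: keeps (vector, payload) pairs in a list."""

    def __init__(self):
        self.points = []

    def add(self, vector, payload):
        self.points.append((vector, payload))

    def search(self, query_vector, top_k=3):
        """Return the top_k payloads with their similarity score, best first."""
        scored = [(cosine(query_vector, v), p) for v, p in self.points]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [{**p, "score": round(s, 3)} for s, p in scored[:top_k]]

# Made-up 2-D vectors; real embeddings have hundreds of dimensions.
store = TinyVectorStore()
store.add([1.0, 0.0], {"name": "doc-a"})
store.add([0.0, 1.0], {"name": "doc-b"})
store.add([0.7, 0.7], {"name": "doc-c"})
hits = store.search([1.0, 0.1], top_k=2)
print(hits)
```

Production databases get their millisecond lookups from approximate nearest-neighbor indexes rather than scoring every stored vector, but the interface (add a vector with its payload, ask for the top-K) has the same shape.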
Complete Implementation: The Gadget Guru
A product advisor that knows every 2026 AI wearable — every concept in one runnable system:
# pip install qdrant-client sentence-transformers openai
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer
from openai import OpenAI
encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
qdrant = QdrantClient(":memory:") # in-memory for demo
client = OpenAI()
COLLECTION, DIM = "gadgets_2026", encoder.get_sentence_embedding_dimension() # 384
# ── PHASE A: PREPARE (run once) ──────────────────────────────
def index_data(data):
    # One text "profile" per product: this string is what gets embedded.
    profiles = [f"{g['name']} ({g['category']}). Specs: {g['specs']}. "
                f"Pros: {g['pros']}. Cons: {g['cons']}. Price: ${g['price']}" for g in data]
    embeddings = encoder.encode(profiles)
    qdrant.create_collection(COLLECTION,
        vectors_config=models.VectorParams(size=DIM, distance=models.Distance.COSINE))
    # Store each vector with the original record as its payload.
    qdrant.upload_points(COLLECTION, [
        models.PointStruct(id=i, vector=embeddings[i].tolist(), payload=data[i])
        for i in range(len(data))])
# ── PHASE B: QUERY (every request) ───────────────────────────
def retrieve(query, top_k=3):
    # Embed the question with the same encoder used for indexing, then search.
    hits = qdrant.query_points(COLLECTION, query=encoder.encode(query).tolist(), limit=top_k)
    return [{**h.payload, "score": round(h.score, 3)} for h in hits.points]
def generate(query, context):
    ctx = "\n".join(f"- {g['name']} ({g['category']}): {g['specs']}. "
                    f"Pros: {g['pros']}. Cons: {g['cons']}. Price: ${g['price']}" for g in context)
    return client.chat.completions.create(
        model="gpt-4o-mini", temperature=0.1, max_tokens=400,
        messages=[
            {"role": "system", "content":
                "You are a 2026 AI-wearable advisor. Recommend ONLY from the products below. "
                "If none match, say so. Always mention price and one limitation. Rank multiple fits.\n\n"
                f"Available products:\n{ctx}"},
            {"role": "user", "content": query},
        ]).choices[0].message.content
def ask(query):  # Retrieve → Augment → Generate
    return generate(query, retrieve(query, top_k=3))
Once index_data(...) has loaded your catalog, a question like "lightest glasses for real-time translation under $500" is matched on meaning rather than exact keywords. The candidates come back ranked, and the answer is grounded in your own data, so every claim can be traced to a product record.
Prompt Engineering for RAG
The most critical piece is the system prompt — strictly instruct the model to use only the provided context:
You are a helpful assistant. Answer the user's question using ONLY the
provided context. If the answer is not in the context, say
"I don't have enough information to answer that." Do not use your
internal knowledge to fill in gaps.
<CONTEXT>
{{ retrieved_chunks }}
</CONTEXT>
QUESTION: {{ user_query }}
Forcing the AI to say "I don't know" when context is missing is the #1 way to build trust. Never let a RAG bot guess. Put the context in the system prompt (not the user turn) so it reads as authoritative ground truth.
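The template and the placement advice above can be wired together as follows. This is a sketch, not the only way to do it: `build_messages` is a hypothetical helper, and numbering the chunks as `[1]`, `[2]` is an added assumption meant to make citations easier.

```python
NO_ANSWER = "I don't have enough information to answer that."

def build_messages(retrieved_chunks, user_query):
    """Put the context in the system prompt and the question in the user turn."""
    # Numbered chunks are an assumption of this sketch, to make answers citable.
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, 1))
    system = (
        "You are a helpful assistant. Answer the user's question using ONLY the "
        f'provided context. If the answer is not in the context, say "{NO_ANSWER}" '
        "Do not use your internal knowledge to fill in gaps.\n"
        f"<CONTEXT>\n{context}\n</CONTEXT>"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_query},
    ]

# Made-up chunk and question, for illustration only.
messages = build_messages(["Gift cards never expire."], "Do gift cards expire?")
print(messages[0]["content"])
```

The returned list follows the common chat format of role/content dictionaries, so it can be passed as the `messages` argument of a chat-completion call like the one in the Gadget Guru example.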
Advanced RAG Techniques
Once the basic pipeline works, three techniques sharpen production accuracy:
- Hybrid search: pure semantic search misses exact tokens ("XR-7" vs "XR7"). Combine vector search with keyword/metadata filters (category = "AR Glasses", price <= 500) in the same query.
- Re-ranking (two-stage retrieval): retrieve top-20 cheaply with vector search, then re-score with a slower, more accurate cross-encoder (cross-encoder/ms-marco-MiniLM-L-6-v2) and keep the top 3.
- Query expansion: generate 3 alternative phrasings of the question, search with all of them, and dedupe, covering angles a single phrasing would miss.
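Hybrid filtering and the merge step of query expansion can be illustrated without any vector DB. The sketch below is a simplification under stated assumptions: `hybrid_search` and `merge_expanded` are hypothetical helpers, the similarity scores are supplied by hand, and the filter runs in Python, whereas a real vector DB would apply it inside the search call.

```python
def hybrid_search(products, scores, category=None, max_price=None, top_k=3):
    """Apply exact metadata filters first, then rank survivors by vector score.

    `scores` maps product id -> similarity score from any vector search.
    """
    keep = [
        p for p in products
        if (category is None or p["category"] == category)
        and (max_price is None or p["price"] <= max_price)
    ]
    keep.sort(key=lambda p: scores.get(p["id"], 0.0), reverse=True)
    return keep[:top_k]

def merge_expanded(result_lists, top_k=3):
    """Query expansion: merge hits from several phrasings, keeping each id's best score."""
    best = {}
    for hits in result_lists:
        for hit in hits:
            if hit["id"] not in best or hit["score"] > best[hit["id"]]["score"]:
                best[hit["id"]] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:top_k]

# Made-up records and scores, for illustration only.
products = [
    {"id": 1, "category": "AR Glasses", "price": 450},
    {"id": 2, "category": "AR Glasses", "price": 650},
    {"id": 3, "category": "Smart Ring", "price": 200},
]
scores = {1: 0.71, 2: 0.93, 3: 0.80}
filtered = hybrid_search(products, scores, category="AR Glasses", max_price=500)

# One hit list per phrasing of the same question.
merged = merge_expanded([
    [{"id": 1, "score": 0.71}, {"id": 3, "score": 0.40}],
    [{"id": 3, "score": 0.82}, {"id": 2, "score": 0.55}],
])
print(filtered, merged)
```

Note that product 2 has the highest similarity score but is excluded by the price filter, which is exactly the kind of hard constraint that semantic similarity alone does not enforce.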
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
candidates = retrieve(query, top_k=20) # stage 1: fast & broad
pairs = [(query, f"{g['name']}: {g['specs']} {g['pros']}") for g in candidates]
top_3 = [c for c, _ in sorted(zip(candidates, reranker.predict(pairs)),
                              key=lambda x: x[1], reverse=True)[:3]]  # stage 2: accurate
Evaluating RAG Quality
RAG is only as good as its retrieval. RAGAS, a widely used open-source evaluation library, measures the three metrics that matter:
# pip install ragas
# Metric names have changed across ragas versions (context_relevancy in particular
# was deprecated in favor of context_precision), so match the imports to your installed version.
from ragas import evaluate
from ragas.metrics import faithfulness, context_relevancy, answer_relevancy
results = evaluate(dataset, metrics=[faithfulness, context_relevancy, answer_relevancy])
# → illustrative output: {'faithfulness': 0.95, 'context_relevancy': 0.88, 'answer_relevancy': 0.92}
Faithfulness: is the answer grounded in the retrieved context (no invention)?
Context relevancy: did retrieval fetch the right chunks?
Answer relevancy: does the answer actually address the question?
Low faithfulness → strengthen the prompt; low context relevancy → fix chunking/embeddings.
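Before reaching for an LLM-based evaluator, retrieval can be checked on its own with a small hand-labeled set. The sketch below is not a RAGAS metric; it is a simpler hit-rate measure swapped in for illustration, and the function name, question ids, and chunk ids are all made up.

```python
def hit_rate_at_k(results, expected, k=3):
    """Fraction of questions whose expected chunk id appears in the top-k retrieved ids."""
    if not expected:
        return 0.0
    hits = sum(1 for q, want in expected.items() if want in results.get(q, [])[:k])
    return hits / len(expected)

# Made-up labels: question id -> id of the chunk that contains the answer.
expected = {"q1": "c7", "q2": "c2", "q3": "c9", "q4": "c4"}
# Made-up retrieval output: question id -> ranked chunk ids, best first.
results = {
    "q1": ["c7", "c1", "c3"],
    "q2": ["c5", "c2", "c8"],
    "q3": ["c1", "c6", "c3"],
    "q4": ["c0", "c3", "c8", "c4"],
}
print(hit_rate_at_k(results, expected, k=3))
```

A low hit rate points at chunking or embeddings rather than the prompt, since the model never saw the right chunk in the first place.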
Troubleshooting Common RAG Failures
Why RAG Fails
The right chunk was never fetched. Fix: improve chunking (add overlap) or use a stronger embedding model (text-embedding-3-large).
The context was correct but too long and the fact got buried. Fix: pass fewer chunks and reorder them so the most relevant sit at the start and end of the context, which counters the "Lost in the Middle" effect.
The LLM ignored the context. Fix: strengthen the system prompt and use a more capable model (GPT-4o / Claude 3.5 Sonnet).
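One way to apply the reordering fix for buried facts is to alternate ranked chunks between the front and the back of the context. This is a sketch of that idea under the assumption that the input list is already sorted best first; `reorder_for_edges` is a hypothetical helper name.

```python
def reorder_for_edges(chunks_best_first):
    """Place the strongest chunks at the start and end, the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        # Even ranks fill the front in order; odd ranks fill the back.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ranked = ["A", "B", "C", "D", "E"]  # "A" is the most relevant
print(reorder_for_edges(ranked))
```

With five chunks the best one opens the context, the second best closes it, and the least relevant sits in the middle, where long-context models are reported to pay the least attention.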
Real-world RAG is everywhere now: customer-support bots grounded in help docs, internal "ask your wiki" assistants, legal/medical research tools that cite sources, and e-commerce advisors like the Gadget Guru above.
Key Takeaways
In 2026, we don't 'teach' models facts; we give them tools to find facts. RAG is the bridge between frozen model intelligence and live data reality.
Poor chunking = poor retrieval. Invest in semantic chunking and 50–100 token overlaps for complex documents.
By putting context in the System Prompt and requiring citations, you transform an AI from a 'creative writer' into a 'precise librarian.'