
Part 2 of 4 — Build a Real RAG System with LlamaIndex 0.12: PDFs, Chat, and a Live API

You understand what LlamaIndex does. Now let's build something real. In this article you'll build three complete RAG applications: a document Q&A system, a PDF querying tool, and a full Express.js API — all in TypeScript with LlamaIndex 0.12.

March 24, 2026
18 min read
#LlamaIndex #RAG #TypeScript #Express.js #PDF #QueryEngine #ChatEngine #AI Engineering #OpenAI

Your company has 3,000 support tickets from last year, a 200-page product manual, and a database of FAQ articles.

Right now, support agents answer the same questions every single day. Every answer is in those 3,000 tickets — but no one has time to read them all.

By the end of this article, you will have built the system that changes that. In TypeScript. With real data. With a real API.

📋 WHAT YOU'LL BUILD IN THIS ARTICLE

🔷 Project 1: A custom Q&A system over your own text documents
🔷 Project 2: A PDF querying tool (query a real 200-page report)
🔷 Project 3: A production-grade Express.js API for your RAG system

This is Part 2 of a 4-part series. If you haven't read Part 1, start there — it covers installation, the 3-phase architecture, and your first working example.

Your environment from Part 1 is all you need. Same package.json, same .env file, same project structure. Let's build.

STEP 1 OF 6

How LlamaIndex Actually Loads Your Data

Before we build anything, you need to understand what happens when you call reader.loadData(). This is the foundation everything else is built on.

The Document Object

Every piece of data in LlamaIndex starts as a Document. A Document has three things:

// This is what a Document looks like internally
{
  id_: "doc-123-abc",          // Unique identifier
  text: "The full text...",    // The actual content
  metadata: {
    file_name: "report.pdf",   // Where it came from
    file_type: "application/pdf",
    page_label: "1",           // PDF page number (if applicable)
    creation_date: "2026-03-01",
    // ... any custom metadata you add
  }
}

This metadata travels with every chunk when LlamaIndex splits the document. That means when you get an answer, you can always trace which file and even which page it came from.
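To make that propagation concrete, here is a toy sketch. It is not LlamaIndex's actual internals; the `Doc`/`Chunk` types and `splitIntoChunks` function are illustrative stand-ins showing how metadata gets copied onto every chunk:

```typescript
// Illustrative only — mimics how LlamaIndex carries document metadata
// into each chunk. The real splitter is token- and sentence-aware.
interface Doc {
  text: string;
  metadata: Record<string, string>;
}

interface Chunk {
  text: string;
  metadata: Record<string, string>; // copied from the parent document
}

function splitIntoChunks(doc: Doc, size: number): Chunk[] {
  const chunks: Chunk[] = [];
  for (let i = 0; i < doc.text.length; i += size) {
    chunks.push({
      text: doc.text.slice(i, i + size),
      metadata: { ...doc.metadata }, // every chunk keeps the source info
    });
  }
  return chunks;
}

const report: Doc = {
  text: "A".repeat(1000),
  metadata: { file_name: "report.pdf", page_label: "1" },
};

const chunks = splitIntoChunks(report, 400);
console.log(chunks.length); // 3
console.log(chunks[2].metadata.file_name); // "report.pdf"
```

Because every chunk carries `file_name` (and `page_label` for PDFs), any chunk that ends up in an answer can be traced back to its source.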

SimpleDirectoryReader — Your Swiss Army Knife

SimpleDirectoryReader is the easiest way to load documents. Pass it a folder, and it handles everything:

import { SimpleDirectoryReader } from "llamaindex/ingestion";

const reader = new SimpleDirectoryReader();

// Load all files in a folder
const docs = await reader.loadData("./data");

// Load only specific file types
const pdfDocs = await reader.loadData("./data", {
  recursive: true, // Include subdirectories
  // excludedExtensions: [".png", ".jpg"],  // Skip these file types
});

console.log(`Loaded ${docs.length} documents`);
docs.forEach((doc) => {
  console.log(`  • ${doc.metadata.file_name} (${doc.text.length} chars)`);
});

Supported file types out of the box: .txt, .md, .pdf, .docx, .csv, .json, .html, .epub, and more.

Custom Metadata — Track Where Every Answer Comes From

// Add custom metadata when creating documents manually
import { Document } from "llamaindex";

const doc = new Document({
  text: "Your custom text here...",
  metadata: {
    source: "internal-wiki",
    author: "engineering-team",
    lastUpdated: "2026-03-24",
    department: "product",
  },
});

STEP 2 OF 6

Chunking: The Most Important Tuning Decision You'll Make

When LlamaIndex indexes your documents, it splits them into chunks called Nodes. How you split determines everything: answer quality, retrieval accuracy, and token costs.

Document → Chunks → Nodes (Visualized)

📄 Original Document (~2,000 tokens)
The full text of your document: an intro paragraph, several body sections, code examples, a conclusion. It's too long to fit in a single retrieval step.
↓ SentenceSplitter (chunkSize=512, overlap=50)
Node 1 (~512 tokens): Intro text
Node 2 (~512 tokens): Section 1
Node 3 (~512 tokens): Section 2
Node 4 (~424 tokens): Conclusion
The 50-token overlap: The last 50 tokens of Node 1 are repeated as the first 50 tokens of Node 2. This prevents important context from being cut at a boundary — e.g., a sentence that starts at the end of one chunk and finishes in the next.
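A toy version of overlapping chunking makes the mechanics visible. It operates on words instead of tokens, and `chunkWithOverlap` is purely illustrative; the real SentenceSplitter is token-aware and prefers sentence boundaries:

```typescript
// Toy sliding-window chunker — NOT LlamaIndex's SentenceSplitter.
// Shows how the window advance (chunkSize - overlap) creates repeated tokens.
function chunkWithOverlap(
  words: string[],
  chunkSize: number,
  overlap: number,
): string[][] {
  const chunks: string[][] = [];
  const step = chunkSize - overlap; // how far the window advances each time
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize));
    if (start + chunkSize >= words.length) break; // final chunk reached
  }
  return chunks;
}

const words = Array.from({ length: 20 }, (_, i) => `w${i}`);
const result = chunkWithOverlap(words, 8, 2);

// Each chunk repeats the last 2 words of the previous one:
console.log(result[0].slice(-2)); // ["w6", "w7"]
console.log(result[1].slice(0, 2)); // ["w6", "w7"]
```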

Configuring the SentenceSplitter

import { Settings } from "llamaindex";
import { SentenceSplitter } from "llamaindex/node-parser";

// The SentenceSplitter is smart — it tries to split at sentence boundaries
// rather than cutting words in the middle of a sentence.
Settings.nodeParser = new SentenceSplitter({
  chunkSize: 512, // tokens per chunk (default: 1024)
  chunkOverlap: 50, // overlap between adjacent chunks (default: 200)
});

Chunk Size: The Golden Rule

Chunk Size | Best For | Trade-off | Use Case
128–256 | Precise fact retrieval | Loses surrounding context | FAQs, product specs
512 ✓ | General purpose (sweet spot) | Good balance | Most RAG applications
1024 | Reasoning over long context | Slower, more expensive | Research papers, books
2048+ | Full-section understanding | Very expensive, less precise | Legal contracts, transcripts

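Chunk size also drives cost: each chunk becomes one embedding call, and smaller chunks mean more of them. Here is a rough back-of-envelope; the formula and `estimateChunkCount` are my own approximation, not a LlamaIndex utility:

```typescript
// Rough estimate of how many chunks a document produces.
// Each chunk advances by (chunkSize - overlap) tokens, so smaller
// chunks mean more chunks, more embedding calls, and more stored vectors.
function estimateChunkCount(
  totalTokens: number,
  chunkSize: number,
  overlap: number,
): number {
  const step = chunkSize - overlap;
  return Math.max(1, Math.ceil((totalTokens - overlap) / step));
}

// A 200-page manual at ~500 tokens/page is ~100,000 tokens:
console.log(estimateChunkCount(100_000, 256, 50)); // 486
console.log(estimateChunkCount(100_000, 512, 50)); // 217
console.log(estimateChunkCount(100_000, 1024, 50)); // 103
```

Halving the chunk size roughly doubles the number of embeddings, which is part of why 512 is a pragmatic default.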
STEP 3 OF 6

QueryEngine vs ChatEngine — Choosing the Right Tool

This is the most important architectural decision in every LlamaIndex application. They look similar but serve completely different purposes.

🔍 QUERYENGINE
Single-Turn Q&A
Each query is independent. The engine has no memory of previous questions.
✓ Stateless — scales infinitely
✓ Best for document search
✓ Best for batch processing
✓ Returns source nodes
✗ No conversation memory
✗ Can't follow up on answers
Use when: Search bars, document lookup, batch analysis, API endpoints without sessions.
💬 CHATENGINE
Multi-Turn Conversation
Maintains full conversation history. Each reply considers everything said before.
✓ Full conversation memory
✓ "What did you just say?"
✓ "Tell me more about that"
✓ Best for chatbots
✗ Stateful — must manage sessions
✗ Grows token usage over time
Use when: Customer support chatbots, interactive document exploration, tutoring systems, any multi-turn UI.

How Both Engines Work — The Retrieval + Synthesis Flow

User Query → Embed query → Find top-K chunks → Build prompt → LLM answers
ChatEngine adds: Before retrieving, it condenses the conversation history into a standalone question. "What about the second one?" becomes "What are the details of the second pricing tier?" — then it retrieves on that full question.
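You can picture the condense step as prompt construction. The sketch below only builds the rewrite prompt; the actual rewriting is done by the LLM, and `buildCondensePrompt` is a hypothetical stand-in for what the engine does internally:

```typescript
// Sketch of the "condense question" step a ChatEngine runs before retrieval.
// The LLM receives this prompt and returns a standalone question.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

function buildCondensePrompt(history: ChatMessage[], followUp: string): string {
  const transcript = history
    .map((m) => `${m.role === "user" ? "Human" : "Assistant"}: ${m.content}`)
    .join("\n");
  return [
    "Given the conversation below, rewrite the follow-up message",
    "as a standalone question that contains all needed context.",
    "",
    transcript,
    "",
    `Follow-up: ${followUp}`,
    "Standalone question:",
  ].join("\n");
}

const prompt = buildCondensePrompt(
  [
    { role: "user", content: "What are the pricing tiers?" },
    { role: "assistant", content: "There are three tiers: Basic, Pro, Enterprise." },
  ],
  "What about the second one?",
);
console.log(prompt.includes("Follow-up: What about the second one?")); // true
```

Retrieval then runs on the rewritten standalone question, which is why follow-ups like "tell me more about that" still pull the right chunks.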

STEP 4 OF 6

Project 1 — Custom Document Q&A System

Let's build a document Q&A system that uses ChatEngine to allow follow-up questions. This is the pattern you'll use for customer support bots, internal knowledge assistants, and document explorers.

First, create your data folder with some sample content:

mkdir -p src/data/project1


cat > src/data/project1/remote-work-policy.txt << 'EOF'
Remote Work Policy — Effective March 2026

OVERVIEW
Our company supports flexible remote work arrangements for all full-time employees.
Employees may work remotely up to 4 days per week, with one mandatory in-office day on Wednesdays.

EQUIPMENT POLICY
The company provides a $1,500 equipment budget for remote work setup, renewed every 3 years.
Approved items include monitors, keyboards, ergonomic chairs, and high-speed internet equipment.

WORKING HOURS
Core hours are 10:00 AM – 3:00 PM in the employee's local timezone.
Meetings must be scheduled within core hours unless mutually agreed otherwise.

TIME OFF
Remote work does not change time-off policies. PTO requests must be submitted 2 weeks in advance.
All national holidays remain paid days off regardless of remote status.
EOF

Now write the Q&A system:

// src/project1-chat.ts
import "dotenv/config";
import * as readline from "readline";
import { Settings } from "llamaindex";
import { OpenAI } from "llamaindex/llms";
import { OpenAIEmbedding } from "llamaindex/embeddings";
import { SimpleDirectoryReader } from "llamaindex/ingestion";
import { VectorStoreIndex } from "llamaindex/indices";
import { SentenceSplitter } from "llamaindex/node-parser";

async function main() {
  // ── Configure LlamaIndex ────────────────────────────────────────────
  Settings.llm = new OpenAI({
    model: "gpt-4o",
    apiKey: process.env.OPENAI_API_KEY,
    temperature: 0.1,
  });

  Settings.embedModel = new OpenAIEmbedding({
    model: "text-embedding-3-large",
    apiKey: process.env.OPENAI_API_KEY,
  });

  // Use SentenceSplitter for policy docs — 512 tokens works well for
  // documents with clearly separated policy sections
  Settings.nodeParser = new SentenceSplitter({
    chunkSize: 512,
    chunkOverlap: 64,
  });

  // ── Load and index documents ────────────────────────────────────────
  console.log("Loading documents...");
  const reader = new SimpleDirectoryReader();
  const documents = await reader.loadData("./src/data/project1");

  console.log(`Indexing ${documents.length} document(s)...`);
  const index = await VectorStoreIndex.fromDocuments(documents);
  console.log("Ready! Type your question (or 'exit' to quit)\n");

  // ── Create a ChatEngine with custom system prompt ───────────────────
  const chatEngine = index.asChatEngine({
    // The system prompt shapes how the AI responds
    systemPrompt: `You are an HR assistant with deep knowledge of the company's
    policies. Answer questions accurately based ONLY on the provided policy documents.
    If the information is not in the documents, say so clearly.
    Always be helpful, professional, and concise.`,
  });

  // ── Interactive chat loop ───────────────────────────────────────────
  const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
  });

  const askQuestion = () => {
    rl.question("\n💬 You: ", async (input) => {
      const message = input.trim();

      if (message.toLowerCase() === "exit") {
        console.log("\nGoodbye!");
        rl.close();
        return;
      }

      if (!message) {
        askQuestion();
        return;
      }

      try {
        const response = await chatEngine.chat({ message });
        console.log(`\n🤖 Assistant: ${response.toString()}`);
      } catch (error) {
        console.error("Error:", error);
      }

      askQuestion(); // Continue the conversation loop
    });
  };

  askQuestion();
}

main().catch(console.error);

Run it and have a multi-turn conversation:

npx ts-node src/project1-chat.ts
Loading documents...
Indexing 1 document(s)...
Ready! Type your question (or 'exit' to quit)

💬 You: How many days can I work from home each week?

🤖 Assistant: You can work remotely up to 4 days per week. One day per week
(Wednesday) is mandatory in-office attendance.

💬 You: What about the equipment budget?

🤖 Assistant: The company provides a $1,500 equipment budget for your remote
work setup, which is renewed every 3 years. You can use it for monitors,
keyboards, ergonomic chairs, and high-speed internet equipment.

💬 You: When do I need to submit PTO requests?

🤖 Assistant: PTO requests must be submitted at least 2 weeks in advance.
National holidays remain paid days off regardless of your remote work status.

Notice how the second question ("What about the equipment budget?") has no explicit context — the ChatEngine figured out you were still talking about remote work policy.

STEP 5 OF 6

Project 2 — Query a PDF File

PDFs are the most common document format in the real world. LlamaIndex handles them natively. No extra libraries needed for basic PDFs — just point SimpleDirectoryReader at a folder with PDFs.

For this project, we'll query a PDF and also expose source nodes — showing exactly which page and section the answer came from.

# To test with a real PDF (e.g. an annual report), drop it into src/data/project2/.
mkdir -p src/data/project2
# For a reproducible walkthrough, we use a text file that stands in for a PDF:
# the loading and querying code below is identical for both formats.

cat > src/data/project2/investment-principles.txt << 'EOF'
Investment Principles and Strategy Guide — 2026 Edition

Chapter 1: The Foundation of Long-Term Investing
The most important quality for an investor is temperament, not intellect.
Markets are driven by fear and greed in the short term, but by fundamentals in the long term.
The best investment you can make is in yourself — your skills compound faster than any stock.

Chapter 2: Portfolio Construction
Diversification is protection against ignorance. If you know what you're doing, diversify less.
The ideal portfolio for most investors contains 5–15 positions across different sectors.
Never invest borrowed money. Leverage amplifies both gains and losses.

Chapter 3: Valuation Principles
Price is what you pay. Value is what you get.
Buy great businesses at fair prices rather than fair businesses at great prices.
A margin of safety of at least 25% below intrinsic value is prudent for any investment.

Chapter 4: Risk Management
The first rule of investing is never lose money. The second rule is never forget rule one.
Position sizing is more important than stock selection for most investors.
Cash is not trash — it gives you the ability to act when others cannot.
EOF

Now write the PDF query tool with source attribution:

// src/project2-pdf.ts
import "dotenv/config";
import { Settings } from "llamaindex";
import { OpenAI } from "llamaindex/llms";
import { OpenAIEmbedding } from "llamaindex/embeddings";
import { SimpleDirectoryReader } from "llamaindex/ingestion";
import { VectorStoreIndex } from "llamaindex/indices";
import { SentenceSplitter } from "llamaindex/node-parser";

async function queryDocument(question: string) {
  Settings.llm = new OpenAI({
    model: "gpt-4o",
    apiKey: process.env.OPENAI_API_KEY,
    temperature: 0, // Zero temp for factual document queries
  });

  Settings.embedModel = new OpenAIEmbedding({
    model: "text-embedding-3-large",
    apiKey: process.env.OPENAI_API_KEY,
  });

  // For longer documents like annual reports, use larger chunks
  // so each retrieved chunk contains enough context
  Settings.nodeParser = new SentenceSplitter({
    chunkSize: 1024,
    chunkOverlap: 100,
  });

  const reader = new SimpleDirectoryReader();
  const documents = await reader.loadData("./src/data/project2");

  console.log(
    `Loaded ${documents.length} document(s) with ${documents.reduce((acc, d) => acc + d.text.length, 0).toLocaleString()} characters`,
  );

  const index = await VectorStoreIndex.fromDocuments(documents);

  // Configure the QueryEngine to return the source nodes (where the answer came from)
  const queryEngine = index.asQueryEngine({
    similarityTopK: 3, // Retrieve top 3 most relevant chunks
  });

  console.log(`\n📌 Question: ${question}\n`);

  const response = await queryEngine.query({ query: question });

  // Display the answer
  console.log("✅ Answer:");
  console.log(response.toString());

  // Display sources — this is the killer feature for trust and debugging
  console.log("\n📚 Sources Used:");
  const sourceNodes = response.sourceNodes || [];
  sourceNodes.forEach((node, i) => {
    const source = node.node.metadata?.file_name ?? "Unknown";
    const score = node.score?.toFixed(3) ?? "N/A";
    const preview = node.node
      .getContent()
      .substring(0, 100)
      .replace(/\n/g, " ");
    console.log(`  [${i + 1}] ${source} (similarity: ${score})`);
    console.log(`      "${preview}..."`);
  });
}

// Run three different types of queries to show versatility
async function main() {
  await queryDocument(
    "What is the recommended margin of safety for investments?",
  );
  console.log("\n" + "─".repeat(60) + "\n");
  await queryDocument("How should I think about portfolio size?");
  console.log("\n" + "─".repeat(60) + "\n");
  await queryDocument("What does the guide say about using leverage?");
}

main().catch(console.error);

Run it:

npx ts-node src/project2-pdf.ts
Loaded 1 document(s) with 1,847 characters

📌 Question: What is the recommended margin of safety for investments?

✅ Answer:
The guide recommends a margin of safety of at least 25% below intrinsic
value as prudent for any investment. This concept comes from Chapter 3
on valuation principles.

📚 Sources Used:
  [1] investment-principles.txt (similarity: 0.892)
      "Chapter 3: Valuation Principles Price is what you pay. Value is what..."
  [2] investment-principles.txt (similarity: 0.743)
      "Chapter 2: Portfolio Construction Diversification is protection against..."

WHY SOURCE ATTRIBUTION MATTERS

In production, source attribution is what separates a trusted RAG system from a hallucination machine. Always expose sourceNodes in your UI. Users need to verify where answers come from — especially in legal, medical, or financial contexts.

STEP 6 OF 6

Project 3 — A Production Express.js RAG API

This is where everything comes together. We'll build a proper REST API that your frontend can call — complete with CORS support, error handling, and response streaming.

First, install Express:

npm install express cors
npm install -D @types/express @types/cors

// src/project3-api.ts
import "dotenv/config";
import express, { Request, Response } from "express";
import cors from "cors";
import { Settings } from "llamaindex";
import { OpenAI } from "llamaindex/llms";
import { OpenAIEmbedding } from "llamaindex/embeddings";
import { SimpleDirectoryReader } from "llamaindex/ingestion";
import { VectorStoreIndex } from "llamaindex/indices";
import { SentenceSplitter } from "llamaindex/node-parser";
import type { ContextChatEngine, BaseQueryEngine } from "llamaindex";

const app = express();
app.use(cors());
app.use(express.json());

// ── Global index (initialized once at startup) ──────────────────────
// This is the key architectural decision: build the index ONCE when
// the server starts, then reuse it for all requests. No re-indexing!
let queryEngine: BaseQueryEngine;
let chatEngine: ContextChatEngine;

async function initializeRAG() {
  console.log("Initializing RAG system...");

  Settings.llm = new OpenAI({
    model: "gpt-4o",
    apiKey: process.env.OPENAI_API_KEY,
    temperature: 0.1,
  });

  Settings.embedModel = new OpenAIEmbedding({
    model: "text-embedding-3-large",
    apiKey: process.env.OPENAI_API_KEY,
  });

  Settings.nodeParser = new SentenceSplitter({
    chunkSize: 512,
    chunkOverlap: 64,
  });

  const reader = new SimpleDirectoryReader();
  // Load all documents from the data folder — add files, restart server
  const documents = await reader.loadData("./src/data");

  const index = await VectorStoreIndex.fromDocuments(documents);

  queryEngine = index.asQueryEngine({ similarityTopK: 4 });
  chatEngine = index.asChatEngine({
    systemPrompt:
      "You are a helpful assistant. Answer questions based only on the provided documents. Be concise and accurate.",
  });

  console.log(`RAG initialized with documents from ./src/data`);
}

// ── POST /api/query — Single-turn question answering ─────────────────
// Use this for search bars, document lookup, batch processing
app.post("/api/query", async (req: Request, res: Response) => {
  const { query } = req.body;

  if (!query || typeof query !== "string") {
    res
      .status(400)
      .json({ error: "query field is required and must be a string" });
    return;
  }

  try {
    const response = await queryEngine.query({ query });

    const sources = (response.sourceNodes || []).map((node) => ({
      text: node.node.getContent().substring(0, 200),
      fileName: node.node.metadata?.file_name,
      score: node.score,
    }));

    res.json({
      answer: response.toString(),
      sources,
      query,
    });
  } catch (error) {
    console.error("Query error:", error);
    res.status(500).json({ error: "Failed to process query" });
  }
});

// ── POST /api/chat — Multi-turn conversation ─────────────────────────
// Use this for chatbots — NOTE: this is stateful per instance.
// For production with multiple users, implement session-based engines.
app.post("/api/chat", async (req: Request, res: Response) => {
  const { message } = req.body;

  if (!message || typeof message !== "string") {
    res
      .status(400)
      .json({ error: "message field is required and must be a string" });
    return;
  }

  try {
    const response = await chatEngine.chat({ message });

    res.json({
      response: response.toString(),
      message,
    });
  } catch (error) {
    console.error("Chat error:", error);
    res.status(500).json({ error: "Failed to process message" });
  }
});

// ── GET /api/health — Health check ───────────────────────────────────
app.get("/api/health", (_req: Request, res: Response) => {
  res.json({ status: "ok", timestamp: new Date().toISOString() });
});

// ── Start the server ──────────────────────────────────────────────────
const PORT = process.env.PORT || 3001;

initializeRAG()
  .then(() => {
    app.listen(PORT, () => {
      console.log(`\n✅ RAG API running on http://localhost:${PORT}`);
      console.log(`   POST /api/query  — Single-turn Q&A`);
      console.log(`   POST /api/chat   — Multi-turn chat`);
      console.log(`   GET  /api/health — Health check`);
    });
  })
  .catch((err) => {
    console.error("Failed to initialize RAG:", err);
    process.exit(1);
  });

Run your API server:

npx ts-node src/project3-api.ts

Test it with curl:

# Single-turn query
curl -X POST http://localhost:3001/api/query \
  -H "Content-Type: application/json" \
  -d '{"query": "How many days can I work from home?"}'

# Multi-turn chat
curl -X POST http://localhost:3001/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is the equipment budget?"}'

# Health check
curl http://localhost:3001/api/health

Example response:

{
  "answer": "Employees may work remotely up to 4 days per week, with Wednesday as the mandatory in-office day.",
  "sources": [
    {
      "text": "Remote Work Policy — Effective March 2026\n\nOVERVIEW\nOur company supports flexible remote work...",
      "fileName": "remote-work-policy.txt",
      "score": 0.891
    }
  ],
  "query": "How many days can I work from home?"
}
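One way to make /api/chat safe for multiple users is a per-session engine registry, sketched below with a plain Map. The `SessionRegistry` class and its factory are hypothetical; in the real server the factory would call `index.asChatEngine(...)` so each session keeps its own history:

```typescript
// Hypothetical per-session registry for the /api/chat route.
// The factory creates one engine (with its own history) per session ID;
// idle sessions are evicted after a TTL so memory doesn't grow forever.
interface SessionEntry<T> {
  engine: T;
  lastUsed: number;
}

class SessionRegistry<T> {
  private sessions = new Map<string, SessionEntry<T>>();

  constructor(
    private createEngine: () => T,
    private ttlMs: number = 30 * 60 * 1000, // evict after 30 min idle
  ) {}

  get(sessionId: string): T {
    this.evictExpired();
    let entry = this.sessions.get(sessionId);
    if (!entry) {
      entry = { engine: this.createEngine(), lastUsed: Date.now() };
      this.sessions.set(sessionId, entry);
    }
    entry.lastUsed = Date.now();
    return entry.engine;
  }

  private evictExpired() {
    const now = Date.now();
    for (const [id, entry] of this.sessions) {
      if (now - entry.lastUsed > this.ttlMs) this.sessions.delete(id);
    }
  }

  get size() {
    return this.sessions.size;
  }
}

// Usage: the factory below is a placeholder for index.asChatEngine(...).
const registry = new SessionRegistry(() => ({ history: [] as string[] }));
const a = registry.get("session-a");
const b = registry.get("session-b");
console.log(a === registry.get("session-a")); // true — same engine reused
console.log(a === b); // false — isolated per session
```

The route would then read a session ID from the request (header or body) and call `registry.get(sessionId).chat({ message })`.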

Your Express.js RAG API — Request Flow

📱 Client (frontend / curl / Postman) → 🚀 Express (POST /api/query) → 🦙 LlamaIndex (QueryEngine) → 🤖 OpenAI (GPT-4o)
Key architecture decision: The VectorStoreIndex is built once when the server starts (not on every request). This is why your API can respond in milliseconds instead of minutes — the heavy embedding work has already been done.

Key Takeaways

SimpleDirectoryReader handles all common file types automatically. Drop any mix of .txt, .pdf, .csv, .md, .docx into a folder and call loadData() — LlamaIndex figures out the rest.
512 tokens is the sweet spot for chunk size. Start here, then tune up (for complex reasoning) or down (for precise fact retrieval) based on your use case and test results.
QueryEngine for search, ChatEngine for conversation. The choice is architectural — it determines your session management strategy, API design, and UI behavior.
Always expose sourceNodes in production. Showing users where answers come from is the difference between a system people trust and one they don't. It also makes debugging dramatically easier.
Build the index once at server startup, reuse on every request. This is the single most important performance optimization — re-indexing on every request is 100–1000x slower and completely unnecessary.
metadata is your audit trail. LlamaIndex preserves metadata through every stage — from Document to Node to source citation. Use it to track file names, page numbers, authors, and timestamps.

Try It Yourself

3 Experiments to Go Deeper

LAB 1
Add a CSV file to your data folder
Create a CSV with product names and prices. Run the API and query it. Does LlamaIndex handle tabular data correctly? Can you ask "What's the price of product X?" and get a correct answer?
LAB 2
Change similarityTopK and measure accuracy
Try similarityTopK: 1 (fastest, least context) vs similarityTopK: 8 (more context but slower and more expensive). Ask the same question both ways. Which gives better answers for your use case?
LAB 3
Add a custom metadata field and filter on it
Create two documents: one with metadata: { department: "HR" } and one with metadata: { department: "Legal" }. Then explore how to filter queries to only return results from one department. This is the foundation of permission-based RAG.
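A starting point for Lab 3: the simplest approach filters documents before indexing, shown below in plain TypeScript. LlamaIndex also supports retriever-level metadata filters; check the docs for the exact 0.12 API, since this standalone sketch only does the pre-filtering:

```typescript
// Lab 3 starting point: select documents by metadata before building an index.
// DocLike mirrors the shape of the Document objects from Step 1.
interface DocLike {
  text: string;
  metadata: Record<string, string>;
}

function filterByDepartment(docs: DocLike[], department: string): DocLike[] {
  return docs.filter((d) => d.metadata.department === department);
}

const docs: DocLike[] = [
  { text: "PTO policy...", metadata: { department: "HR" } },
  { text: "NDA template...", metadata: { department: "Legal" } },
];

const hrDocs = filterByDepartment(docs, "HR");
console.log(hrDocs.length); // 1
console.log(hrDocs[0].metadata.department); // "HR"
// Then VectorStoreIndex.fromDocuments(hrDocs) builds an HR-only index —
// the foundation of permission-based RAG.
```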

What's Next in the Series

PART 3 OF 4 — UP NEXT
LlamaIndex Agents: RouterQueryEngine & Multi-Source RAG
What if your RAG system could decide WHICH knowledge base to search — without you writing a single if-statement? We build AI agents that use tools, route queries intelligently across multiple data sources, and reason through complex multi-step questions.
✦ OpenAIAgent deep dive
✦ FunctionTool creation
✦ RouterQueryEngine
✦ Multi-source RAG

Mohamed Hamed

20 years building production systems — the last several deep in AI integration, LLMs, and full-stack architecture. I write what I've actually built and broken. If this was useful, the next one goes to LinkedIn first.

Follow on LinkedIn →