Part 1 of 16

Part 2 of 4 — Build a Real RAG System with LlamaIndex 0.12: PDFs, Chat, and a Live API

You understand what LlamaIndex does. Now let's build something real. In this article you'll build three complete RAG applications: a document Q&A system, a PDF querying tool, and a full Express.js API — all in TypeScript with LlamaIndex 0.12.

March 24, 2026
18 min read
#LlamaIndex #RAG #TypeScript #Express.js #PDF #QueryEngine #ChatEngine #AI Engineering #OpenAI

From 3,000 Support Tickets to a Searchable Brain.

Your company has thousands of tickets and 200-page manuals. No one has time to read them—until now. Build a TypeScript RAG system that answers questions with page-level citations.

Primary Objective
LlamaIndex 0.12 | Express.js | PDF Support | Source Attribution
💡
Series Navigation

This is Part 2 of 4. If you missed the foundation, check out Part 1: The LlamaIndex 3-Phase Architecture.


What We're Building

The Project Roadmap

📄 PROJECT 1
  • Goal: Internal Knowledge Q&A.
  • Tech: ChatEngine (Multi-turn).
  • Data: Markdown/Text policies.
📚 PROJECT 2
  • Goal: PDF Deep Search.
  • Tech: QueryEngine + Source Attribution.
  • Data: Real 200-page reports.
🌐 PROJECT 3
  • Goal: Production RAG API.
  • Tech: Express.js + TS.
  • Data: Live endpoints for your frontend.

The Core Abstraction: The Document Object

Before building, you must understand how data is represented internally.

Anatomy of a LlamaIndex Document
  • ID: Unique identifier (doc-123-abc).
  • Text: The raw content string.
  • Metadata: Key-value pairs like file_name, page_label, or department.
  • Impact: Metadata is copied onto every chunk (node) split from the document, so each retrieved chunk can be traced back to its source file and page.
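
To make the metadata point concrete, here is a minimal sketch of a document being split into chunks that each inherit the parent's metadata. The `Doc` and `Chunk` types and the `toChunks` helper are hypothetical stand-ins written for this illustration, not the LlamaIndex classes, and the blank-line split is far simpler than a real node parser.

```typescript
// Simplified, hypothetical model of a document and its chunks.
// These are NOT the LlamaIndex classes; they only illustrate metadata inheritance.
type Metadata = Record<string, string | number>;

interface Doc {
  id: string;
  text: string;
  metadata: Metadata;
}

interface Chunk {
  id: string;
  sourceDocId: string;
  text: string;
  metadata: Metadata;
}

// Split on blank lines and copy the parent's metadata onto every chunk.
function toChunks(doc: Doc): Chunk[] {
  return doc.text
    .split(/\n\s*\n/)
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .map((part, i) => ({
      id: `${doc.id}-chunk-${i}`,
      sourceDocId: doc.id,
      text: part,
      metadata: { ...doc.metadata }, // each chunk gets its own copy
    }));
}

// Sample data invented for this sketch.
const policy: Doc = {
  id: "doc-123-abc",
  text: "Laptops are provided on day one.\n\nMonitors can be requested after 30 days.",
  metadata: { file_name: "remote-work.md", department: "HR" },
};

const chunks = toChunks(policy);
for (const chunk of chunks) {
  console.log(chunk.id, chunk.metadata["file_name"], chunk.metadata["department"]);
}
```

Because every chunk carries `file_name` and `department`, a retrieved chunk can be cited or filtered without looking up the parent document again.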

Chunking & Tuning

The SentenceSplitter Workflow
  • Input: 2,000 Token Document.
  • Process: Split into nodes using chunkSize=512 and chunkOverlap=50.
  • Output: About 5 overlapping nodes (each starts roughly 462 tokens after the previous one), which preserves context at chunk boundaries.
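
The arithmetic behind that workflow can be checked with a plain sliding window. This is only a sketch: `slidingWindows` is a hypothetical helper that counts fixed-size token windows, whereas SentenceSplitter also respects sentence boundaries, so real node counts may differ slightly.

```typescript
interface TokenWindow {
  start: number; // inclusive token offset
  end: number;   // exclusive token offset
}

// Fixed-size windows where each new window starts (size - overlap) tokens
// after the previous one. This ignores sentence boundaries entirely.
function slidingWindows(totalTokens: number, size: number, overlap: number): TokenWindow[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("size must be positive and overlap must be in [0, size)");
  }
  const stride = size - overlap;
  const windows: TokenWindow[] = [];
  for (let start = 0; start < totalTokens; start += stride) {
    const end = Math.min(start + size, totalTokens);
    windows.push({ start, end });
    if (end === totalTokens) break; // the document is fully covered
  }
  return windows;
}

const windows = slidingWindows(2000, 512, 50);
console.log(windows.length); // 5
console.log(windows.map((w) => `${w.start}-${w.end}`).join(", "));
// 0-512, 462-974, 924-1436, 1386-1898, 1848-2000
```

With a stride of 462 tokens, the fourth window stops at token 1898, so a short fifth window is needed to cover the tail of a 2,000-token document.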

Chunk Size Sweet Spots

PRECISE (128-256)

Best for: FAQ lookup, specific data points. Trade-off: Minimal context.

GENERAL (512)

Best for: Most RAG apps (The Golden Rule). Trade-off: Balanced speed & context.

REASONING (1024+)

Best for: Legal contracts, research papers. Trade-off: Slower, higher token cost.
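
The token-cost side of this trade-off is simple arithmetic: the retrieved context sent to the LLM is roughly the chunk size multiplied by the number of chunks retrieved. The sketch below assumes `similarityTopK = 3` (the value used in the projects in this article) and ignores the prompt template, the question, and the generated answer.

```typescript
// Rough estimate of retrieved context per query: chunkSize * topK.
// Ignores the prompt template, the user question, and the answer tokens.
function contextTokens(chunkSize: number, topK: number): number {
  return chunkSize * topK;
}

const topK = 3;
for (const chunkSize of [256, 512, 1024]) {
  console.log(`chunkSize=${chunkSize} -> ~${contextTokens(chunkSize, topK)} context tokens`);
}
// chunkSize=256 -> ~768 context tokens
// chunkSize=512 -> ~1536 context tokens
// chunkSize=1024 -> ~3072 context tokens
```

Doubling the chunk size doubles the context you pay for on every query, which is why larger chunks are best kept for documents that really need the extra surrounding text.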


Choosing Your Engine

QueryEngine vs. ChatEngine

🔍 QUERYENGINE
  • Mode: Single-Turn Q&A.
  • Memory: None (Stateless).
  • Use Case: Search bars, batch processing.
💬 CHATENGINE
  • Mode: Multi-Turn Conversation.
  • Memory: Full (Stateful).
  • Use Case: Support bots, interactive tutors.
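
The stateless versus stateful distinction can be shown with two toy classes. `ToyQueryEngine`, `ToyChatEngine`, and `fakeAnswer` are hypothetical names invented for this sketch, not LlamaIndex APIs, and no LLM is called; the fake answer simply reports how much history it was given.

```typescript
type Turn = { role: "user" | "assistant"; content: string };

// Stand-in for an LLM call: reports how many earlier turns it was shown.
function fakeAnswer(question: string, history: Turn[]): string {
  return `answer to "${question}" (saw ${history.length} earlier turns)`;
}

// Stateless: every call starts from scratch.
class ToyQueryEngine {
  query(question: string): string {
    return fakeAnswer(question, []);
  }
}

// Stateful: each call sees, and then extends, the conversation history.
class ToyChatEngine {
  private history: Turn[] = [];

  chat(message: string): string {
    const reply = fakeAnswer(message, this.history);
    this.history.push(
      { role: "user", content: message },
      { role: "assistant", content: reply },
    );
    return reply;
  }

  reset(): void {
    this.history = [];
  }
}

const toyQuery = new ToyQueryEngine();
console.log(toyQuery.query("What is the equipment policy?"));
console.log(toyQuery.query("Does that cover contractors?"));

const toyChat = new ToyChatEngine();
console.log(toyChat.chat("What is the equipment policy?"));
console.log(toyChat.chat("Does that cover contractors?"));
```

Only the chat engine can resolve "that" in the follow-up question, because only it has the first exchange in its history. That memory also means you need one chat engine instance per conversation.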

Project 1: Stateful Internal Q&A (ChatEngine)

The key to a support assistant is memory. Here is how to configure a multi-turn chat experience that remembers context using LlamaIndex 0.12.

typescript
import { SimpleDirectoryReader, VectorStoreIndex, Settings, OpenAI } from "llamaindex";

// 1. Configure the LLM globally
Settings.llm = new OpenAI({ model: "gpt-4o", temperature: 0.2 });

async function runChatEngine() {
  // 2. Load markdown and text documents from your knowledge directory
  const reader = new SimpleDirectoryReader();
  const documents = await reader.loadData({ directoryPath: "./src/data" });

  // 3. Chunk, embed, and index the documents
  const index = await VectorStoreIndex.fromDocuments(documents);

  // 4. Create ChatEngine (maintains conversational history automatically)
  const chatEngine = index.asChatEngine({
    chatModel: Settings.llm,
    systemPrompt: "You are a customer support agent. Answer questions using the provided context."
  });

  // 5. Start multi-turn conversation
  const response1 = await chatEngine.chat({ message: "What is our remote work equipment policy?" });
  console.log("User: What is our remote work equipment policy?");
  console.log("AI:", response1.toString());

  const response2 = await chatEngine.chat({ message: "Does that cover the Starter plan users?" });
  console.log("User: Does that cover the Starter plan users?");
  console.log("AI:", response2.toString());
}

runChatEngine().catch(console.error);

Project 2: PDF Deep Search with Citations

In enterprise applications, users don't trust answers without proof. Using the metadata attached to LlamaIndex document chunks, we can return precise page-level citations.

typescript
import { SimpleDirectoryReader, VectorStoreIndex, Settings, OpenAI, MetadataMode } from "llamaindex";

async function runPDFSearch() {
  Settings.llm = new OpenAI({ model: "gpt-4o", temperature: 0.1 });

  // 1. Load PDFs from a target directory
  const reader = new SimpleDirectoryReader();
  const documents = await reader.loadData({ directoryPath: "./src/data/pdfs" });

  // 2. Create the vector index; retrieve the 3 most similar chunks per query
  const index = await VectorStoreIndex.fromDocuments(documents);
  const queryEngine = index.asQueryEngine({ similarityTopK: 3 });

  // 3. Run the query
  const response = await queryEngine.query({ query: "Summarize the Q4 security audit results." });
  console.log("Answer:", response.toString());

  // 4. Print citations from the retrieved source nodes
  if (response.sourceNodes) {
    console.log("\n--- CITATIONS ---");
    response.sourceNodes.forEach((source, i) => {
      const { metadata } = source.node;
      // The page key can vary with the PDF reader and version, so check both names
      const page = metadata["page_label"] ?? metadata["page_number"] ?? "N/A";
      // getContent() works on any node type; MetadataMode.NONE returns the text only
      const snippet = source.node.getContent(MetadataMode.NONE).slice(0, 150);
      console.log(`Source ${i + 1}:`);
      console.log(`- File Name: ${metadata["file_name"]}`);
      console.log(`- Page: ${page}`);
      console.log(`- Score: ${source.score?.toFixed(4) ?? "N/A"}`);
      console.log(`- Text Snippet: ${snippet}...\n`);
    });
  }
}

runPDFSearch().catch(console.error);
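
When several retrieved chunks come from the same page, printing each one separately produces duplicate citations. The sketch below shows one way to collapse them. `SourceLike` and `formatCitations` are hypothetical names for this illustration, the input is a plain object you would map from the source nodes yourself, and the sample file name and scores are invented.

```typescript
// Hypothetical structural type: only the fields this sketch needs.
interface SourceLike {
  fileName: string;
  page?: string | number;
  score?: number;
}

// Collapse chunks from the same file and page into one citation,
// keeping the highest similarity score, then sort best-first.
function formatCitations(sources: SourceLike[]): string[] {
  const best = new Map<string, SourceLike>();
  for (const s of sources) {
    const key = `${s.fileName}#${s.page ?? "N/A"}`;
    const prev = best.get(key);
    if (!prev || (s.score ?? 0) > (prev.score ?? 0)) {
      best.set(key, s);
    }
  }
  return Array.from(best.values())
    .sort((a, b) => (b.score ?? 0) - (a.score ?? 0))
    .map(
      (s, i) =>
        `[${i + 1}] ${s.fileName}, p. ${s.page ?? "N/A"} (score ${s.score?.toFixed(2) ?? "n/a"})`,
    );
}

// Sample data invented for this sketch.
const citations = formatCitations([
  { fileName: "audit.pdf", page: 12, score: 0.81 },
  { fileName: "audit.pdf", page: 12, score: 0.77 },
  { fileName: "audit.pdf", page: 47, score: 0.74 },
]);
console.log(citations.join("\n"));
// [1] audit.pdf, p. 12 (score 0.81)
// [2] audit.pdf, p. 47 (score 0.74)
```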

Project 3: Production RAG API (Express.js + TypeScript)

In a real deployment, RAG sits behind a backend API. We build the index once at server startup and reuse it for every request, so no request pays the cost of re-reading and re-embedding the documents.

typescript
import express from "express";
import { SimpleDirectoryReader, VectorStoreIndex, MetadataMode } from "llamaindex";

const app = express();
app.use(express.json());

let index: VectorStoreIndex | null = null;

// Build the index once at startup; every request reuses it
async function initIndex() {
  console.log("Initializing Vector Index...");
  const reader = new SimpleDirectoryReader();
  const documents = await reader.loadData({ directoryPath: "./src/data" });
  index = await VectorStoreIndex.fromDocuments(documents);
  console.log("Index initialized successfully!");
}

app.post("/api/query", async (req, res) => {
  // Requests that arrive before the index is ready get a 503
  if (!index) {
    res.status(503).json({ error: "Index is still initializing." });
    return;
  }

  const { query } = req.body ?? {};
  if (typeof query !== "string" || !query.trim()) {
    res.status(400).json({ error: "Query is required." });
    return;
  }

  try {
    const queryEngine = index.asQueryEngine({ similarityTopK: 3 });
    const response = await queryEngine.query({ query });

    // Return a short snippet of each retrieved chunk alongside the answer
    const sources = response.sourceNodes?.map((source) => ({
      fileName: source.node.metadata["file_name"],
      score: source.score,
      text: source.node.getContent(MetadataMode.NONE).slice(0, 200) + "..."
    })) ?? [];

    res.json({
      answer: response.toString(),
      sources
    });
  } catch (err) {
    const message = err instanceof Error ? err.message : "Unknown error";
    res.status(500).json({ error: message });
  }
});

const PORT = 3000;
app.listen(PORT, () => {
  console.log(`RAG API server listening on http://localhost:${PORT}`);
  // Start indexing once the port is open; exit if the index cannot be built
  initIndex().catch((err) => {
    console.error("Failed to initialize index:", err);
    process.exit(1);
  });
});
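
Since the endpoint accepts arbitrary JSON, it can help to pull request validation into a small pure function that is easy to unit test. This is a sketch under stated assumptions: `parseQueryBody` and `ParsedQuery` are hypothetical names, and the 1,000-character cap is an arbitrary limit chosen for illustration, not something the API above requires.

```typescript
type ParsedQuery =
  | { ok: true; query: string }
  | { ok: false; error: string };

// Assumed limit for this sketch; tune it to your own use case.
const MAX_QUERY_LENGTH = 1000;

// Validate an untrusted request body and return either a clean query or an error.
function parseQueryBody(body: unknown): ParsedQuery {
  if (typeof body !== "object" || body === null) {
    return { ok: false, error: "Request body must be a JSON object." };
  }
  const query = (body as Record<string, unknown>)["query"];
  if (typeof query !== "string" || query.trim().length === 0) {
    return { ok: false, error: "Query must be a non-empty string." };
  }
  const trimmed = query.trim();
  if (trimmed.length > MAX_QUERY_LENGTH) {
    return { ok: false, error: `Query must be at most ${MAX_QUERY_LENGTH} characters.` };
  }
  return { ok: true, query: trimmed };
}

console.log(parseQueryBody({ query: "  What is the refund policy?  " }));
console.log(parseQueryBody({ query: 42 }));
console.log(parseQueryBody(null));
```

Inside the route handler you would call `parseQueryBody(req.body)` and respond with a 400 and the returned `error` whenever `ok` is false.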

Implementation Roadmap

6 Steps to Production

📂
LOAD DATA

Use SimpleDirectoryReader to ingest PDF, Markdown, and CSV files.

✂️
NODE PARSING

Configure SentenceSplitter for the optimal chunk size.

INDEXING

Create a VectorStoreIndex from your documents.

⚙️
ENGINE CONFIG

Choose between ChatEngine and QueryEngine.

🌐
API LAYER

Wrap the engine in an Express.js server for frontend access.

📚
CITATIONS

Expose sourceNodes so users can verify every answer.


Key Takeaways

01
Metadata is Leverage

Don't just load text. Add custom metadata like department or last_updated to your documents—it makes filtering much more powerful.

02
Source Attribution is Trust

In production, never show an answer without a 'Source' link. LlamaIndex returns the retrieved chunks as sourceNodes on every response, so adding citations takes only a few lines.

03
Initialize Once

Build your index when the server starts. Rebuilding it on every request re-reads and re-embeds every document, which wastes tokens and time.
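
One way to guarantee a single build, even when several callers ask for the index at the same moment, is to cache the build promise rather than the finished result. The sketch below is a generic pattern, not part of LlamaIndex: `once` and `buildFakeIndex` are hypothetical names, and the fake builder stands in for loading documents and creating the vector index.

```typescript
// Wrap an async builder so it runs at most once; all callers share one promise.
// If the build fails, the cache is cleared so a later call can retry.
function once<T>(build: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | null = null;
  return () => {
    if (cached === null) {
      cached = build().catch((err) => {
        cached = null;
        throw err;
      });
    }
    return cached;
  };
}

let buildCount = 0;

// Stand-in for loading documents and building the vector index.
async function buildFakeIndex(): Promise<{ builtAt: number }> {
  buildCount += 1;
  return { builtAt: buildCount };
}

const getIndex = once(buildFakeIndex);
const first = getIndex();
const second = getIndex();
console.log(first === second, buildCount); // true 1
```

Both calls receive the same promise, so the expensive build runs once no matter how many requests arrive while it is still in progress.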

AI Engineering

Mohamed Hamed

20 years building production systems — the last several deep in AI integration, LLMs, and full-stack architecture. I write what I've actually built and broken. If this was useful, the next one goes to LinkedIn first.
