Your dev RAG works perfectly. You push to production. The server starts, begins re-indexing 5,000 documents, takes 4 minutes to be ready, and users get a timeout error in the meantime.
When it finally loads, users ask questions and wait 8–12 seconds for an answer — staring at a blank screen with no feedback.
These are the two production RAG killers. This article eliminates both — then deploys your finished app to Vercel.
WHAT YOU'LL BUILD IN THIS ARTICLE
This is Part 4 of 4 — the final chapter. You need Part 1, Part 2, and Part 3 before reading this.
The Re-Indexing Problem — And How Persistence Kills It
In Parts 1–3, every time your script ran, it re-indexed all your documents from scratch. For 5 documents this is fine. For 5,000 documents, that's:
- ~10,000 API calls to OpenAI's embedding endpoint
- ~$2–15 in API costs (per restart!)
- 3–10 minutes of startup time before your server accepts requests
How Persistence Works
LlamaIndex's storageContextFromDefaults saves your entire index to disk — vectors, nodes, document metadata, and all. On subsequent runs, it loads from disk instead of re-embedding.
Persistence — First Run vs Every Run After
Here's the complete persistence pattern:
// src/persist.ts
import "dotenv/config";
import * as fs from "fs";
import { Settings } from "llamaindex";
import { OpenAI } from "llamaindex/llms";
import { OpenAIEmbedding } from "llamaindex/embeddings";
import { SimpleDirectoryReader } from "llamaindex/ingestion";
import { VectorStoreIndex } from "llamaindex/indices";
import { storageContextFromDefaults } from "llamaindex/storage";
import { SentenceSplitter } from "llamaindex/node-parser";
const PERSIST_DIR = "./storage";
async function getOrCreateIndex() {
Settings.llm = new OpenAI({
model: "gpt-4o",
apiKey: process.env.OPENAI_API_KEY,
temperature: 0.1,
});
Settings.embedModel = new OpenAIEmbedding({
model: "text-embedding-3-large",
apiKey: process.env.OPENAI_API_KEY,
});
Settings.nodeParser = new SentenceSplitter({
chunkSize: 512,
chunkOverlap: 64,
});
// Create a storage context pointing to our persist directory
const storageContext = await storageContextFromDefaults({
persistDir: PERSIST_DIR,
});
// If the storage directory already has data, load from disk
if (fs.existsSync(PERSIST_DIR) && fs.readdirSync(PERSIST_DIR).length > 0) {
console.log("📂 Loading index from disk (no API calls needed)...");
const startTime = Date.now();
// VectorStoreIndex.init() loads from the storage context without re-embedding
const index = await VectorStoreIndex.init({ storageContext });
console.log(`✅ Index loaded in ${Date.now() - startTime}ms`);
return index;
}
// First time: build the index from your documents
console.log("📄 First run: building index from documents...");
console.log(" (This will take a minute but only happens once)");
const reader = new SimpleDirectoryReader();
const documents = await reader.loadData("./src/data");
console.log(` Loaded ${documents.length} documents`);
const startTime = Date.now();
// fromDocuments() with storageContext automatically persists when done
const index = await VectorStoreIndex.fromDocuments(documents, {
storageContext,
});
console.log(
`✅ Index built and saved to ${PERSIST_DIR} in ${Date.now() - startTime}ms`,
);
console.log(" Next restart will load in under 2 seconds.");
return index;
}
async function main() {
const index = await getOrCreateIndex();
const queryEngine = index.asQueryEngine({ similarityTopK: 4 });
const response = await queryEngine.query({
query: "What is the remote work policy?",
});
console.log("\n💬 Answer:", response.toString());
}
main().catch(console.error);
First run:
📄 First run: building index from documents...
(This will take a minute but only happens once)
Loaded 3 documents
✅ Index built and saved to ./storage in 12,847ms
Next restart will load in under 2 seconds.
Second run (and every run after):
📂 Loading index from disk (no API calls needed)...
✅ Index loaded in 142ms
142 milliseconds instead of 13 seconds. Zero API cost instead of dollars.
WHEN TO RE-INDEX
Delete the ./storage folder and restart your server whenever you: add new documents, update existing documents, or change your chunk size or embedding model. LlamaIndex will rebuild the index automatically on the next start.
Streaming Responses — From 10-Second Wait to Real-Time Feel
Without streaming, your user sees this: a blank screen for 8–12 seconds, then a full response appears at once. It feels broken, even if it works perfectly.
With streaming, the response appears token-by-token — exactly like ChatGPT. The user sees the AI "thinking" in real time. Perceived wait time drops from 10 seconds to under 1 second.
Streaming in Your TypeScript RAG
// src/stream-demo.ts
import "dotenv/config";
import * as fs from "fs";
import { Settings } from "llamaindex";
import { OpenAI } from "llamaindex/llms";
import { OpenAIEmbedding } from "llamaindex/embeddings";
import { VectorStoreIndex } from "llamaindex/indices";
import { storageContextFromDefaults } from "llamaindex/storage";
async function main() {
Settings.llm = new OpenAI({
model: "gpt-4o",
apiKey: process.env.OPENAI_API_KEY,
temperature: 0.1,
});
Settings.embedModel = new OpenAIEmbedding({
model: "text-embedding-3-large",
apiKey: process.env.OPENAI_API_KEY,
});
// Load from disk (assumes you already ran persist.ts)
const storageContext = await storageContextFromDefaults({
persistDir: "./storage",
});
const index = await VectorStoreIndex.init({ storageContext });
const queryEngine = index.asQueryEngine({ similarityTopK: 4 });
const query = "Explain all the remote work equipment policies in detail.";
console.log(`\n📌 Query: ${query}\n`);
console.log("💬 Answer (streaming):\n");
// ── The key: pass stream: true to get an async iterable ───────────
const stream = await queryEngine.query({
query,
stream: true, // Enable streaming
});
// Process each token as it arrives
for await (const chunk of stream) {
// chunk.response contains the new text fragment (1–5 tokens typically)
process.stdout.write(chunk.response); // Write without newline for continuous flow
}
console.log("\n\n✅ Stream complete");
}
main().catch(console.error);
For ChatEngine streaming (the pattern you'll use in the Next.js API):
// Chat engine streaming — exact same pattern
const chatEngine = index.asChatEngine();
const stream = await chatEngine.chat({
message: "What are the core hours for remote workers?",
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.delta); // ChatEngine uses .delta, QueryEngine uses .response
}
Streaming Property Reference
- chunk.response — the new text fragment (QueryEngine streams)
- chunk.delta — the new text fragment (ChatEngine streams)
- chunk.sourceNodes — source nodes (populated once the stream completes)
- chunk.done — whether the stream has finished
- chunk.message — the full message accumulated so far
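The consumption pattern itself is independent of the engine: accumulate each fragment as it arrives, then read sources once the loop ends. A sketch against a generic async iterable (the ChatChunk interface below is my own simplification of the chunk shape, not the LlamaIndex type):

```typescript
// Simplified chunk shape for illustration (assumed, not the library's type).
interface ChatChunk {
  delta: string;            // new text fragment (ChatEngine naming)
  sourceNodes?: unknown[];  // populated once the stream is done
}

// Drains a streaming response: forwards each fragment to onToken and
// returns the assembled text plus whatever sources the final chunk carried.
async function drainStream(
  stream: AsyncIterable<ChatChunk>,
  onToken: (t: string) => void = () => {},
): Promise<{ text: string; sources: unknown[] }> {
  let text = "";
  let sources: unknown[] = [];
  for await (const chunk of stream) {
    text += chunk.delta;
    onToken(chunk.delta);
    if (chunk.sourceNodes) sources = chunk.sourceNodes;
  }
  return { text, sources };
}
```

Because the helper only needs an AsyncIterable, you can unit-test your rendering logic with a stubbed stream and no API key.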
The create-llama CLI — From Zero to Full-Stack in 60 Seconds
Before we build the Next.js app manually, meet the tool that generates the boilerplate for you: create-llama.
npx create-llama@latest
# You'll be asked:
# ? What is your project named? → my-rag-chatbot
# ? Which framework? → Next.js
# ? Which model provider? → OpenAI
# ? Which model? → gpt-4o
# ? Which data source? → Files
# ? Which vector store? → SimpleVectorStore (in-memory)
# ? Add LlamaCloud? → No
# ? Would you like to use ESLint? → Yes
This generates a complete production-grade app with:
- Streaming chat UI (React components)
- LlamaIndex backend API routes
- File upload support
- Dark/light theme
- Proper error handling and loading states
Run it:
cd my-rag-chatbot
cp .env.example .env.local
# Add your OPENAI_API_KEY to .env.local
npm install
npm run dev
Open http://localhost:3000 — you have a working RAG chatbot.
USE CREATE-LLAMA AS YOUR STARTING POINT
create-llama is the fastest path to a working full-stack app. Use it as your scaffold, then customize: add persistence, wire in your own data sources, add authentication. Don't build from scratch what create-llama gives you for free.
Building the Next.js 16 RAG Chatbot From Scratch
Let's build our own version — so you understand every piece. This gives you full control and the ability to customize anything.
Setup
npx create-next-app@latest rag-chatbot \
--typescript \
--tailwind \
--eslint \
--app \
--no-src-dir \
--import-alias "@/*"
cd rag-chatbot
# Install LlamaIndex 0.12 and AI SDK for streaming
npm install llamaindex ai @ai-sdk/openai
Create your .env.local:
OPENAI_API_KEY=sk-proj-...your-key...
Add your data to the project:
mkdir -p data
# Copy your documents from Part 2-3 (or create new ones)
echo "CloudSync Pro is the best cloud collaboration tool." > data/overview.txt
The API Route — Streaming RAG Endpoint
This is the heart of the application. A Next.js 16 App Router route that streams LlamaIndex responses using the Vercel AI SDK:
// app/api/chat/route.ts
import { NextRequest } from "next/server";
import { Settings } from "llamaindex";
import { OpenAI } from "llamaindex/llms";
import { OpenAIEmbedding } from "llamaindex/embeddings";
import { VectorStoreIndex } from "llamaindex/indices";
import { storageContextFromDefaults } from "llamaindex/storage";
import { SimpleDirectoryReader } from "llamaindex/ingestion";
import { SentenceSplitter } from "llamaindex/node-parser";
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";
import * as fs from "fs";
import * as path from "path";
// ── Module-level cache: initialize once, reuse across requests ───────
// The cache survives warm invocations of the same serverless instance;
// a cold start rebuilds it once, then every later request reuses it.
let cachedIndex: VectorStoreIndex | null = null;
async function getIndex(): Promise<VectorStoreIndex> {
if (cachedIndex) return cachedIndex; // Return cached index
Settings.llm = new OpenAI({
model: "gpt-4o",
apiKey: process.env.OPENAI_API_KEY,
temperature: 0.1,
});
Settings.embedModel = new OpenAIEmbedding({
model: "text-embedding-3-large",
apiKey: process.env.OPENAI_API_KEY,
});
Settings.nodeParser = new SentenceSplitter({
chunkSize: 512,
chunkOverlap: 64,
});
const persistDir = path.join(process.cwd(), "storage");
const storageContext = await storageContextFromDefaults({ persistDir });
if (fs.existsSync(persistDir) && fs.readdirSync(persistDir).length > 0) {
cachedIndex = await VectorStoreIndex.init({ storageContext });
} else {
const dataDir = path.join(process.cwd(), "data");
const reader = new SimpleDirectoryReader();
const documents = await reader.loadData(dataDir);
cachedIndex = await VectorStoreIndex.fromDocuments(documents, {
storageContext,
});
}
return cachedIndex!;
}
// ── POST /api/chat — Streaming chat endpoint ─────────────────────────
export async function POST(request: NextRequest) {
try {
const { messages } = await request.json();
// Get the latest user message
const userMessage = messages[messages.length - 1]?.content as string;
if (!userMessage) {
return new Response("No message provided", { status: 400 });
}
// Get the index (from cache or build it)
const index = await getIndex();
// Retrieve relevant context from your documents
const queryEngine = index.asQueryEngine({ similarityTopK: 4 });
const context = await queryEngine.query({ query: userMessage });
const contextText = context.toString();
// Use Vercel AI SDK for clean streaming with history support
// We combine the RAG context with the conversation history
const result = await streamText({
model: openai("gpt-4o"),
system: `You are a helpful assistant with access to company documentation.
IMPORTANT: Answer ONLY based on the context provided. If the answer is not in the context, say so clearly.
CONTEXT FROM DOCUMENTS:
${contextText}
Source files used: ${context.sourceNodes?.map((n) => n.node.metadata?.file_name).join(", ") || "unknown"}`,
messages: messages.map((m: { role: string; content: string }) => ({
role: m.role as "user" | "assistant",
content: m.content,
})),
});
// Return the streaming response (Vercel AI SDK handles the SSE format)
return result.toDataStreamResponse();
} catch (error) {
console.error("Chat API error:", error);
return new Response("Internal server error", { status: 500 });
}
}
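One hardening step worth adding: the route trusts request.json() blindly. Validating the body before touching the index keeps malformed requests cheap. A hand-rolled helper as a sketch (zod is the more idiomatic choice if you already depend on it; the ChatMessage shape mirrors what useChat() sends):

```typescript
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Returns the validated message array, or null when the body doesn't
// match the { messages: [...] } shape the chat UI sends.
function parseChatBody(body: unknown): ChatMessage[] | null {
  if (typeof body !== "object" || body === null) return null;
  const messages = (body as { messages?: unknown }).messages;
  if (!Array.isArray(messages) || messages.length === 0) return null;
  const valid = messages.every(
    (m) =>
      typeof m === "object" &&
      m !== null &&
      (m.role === "user" || m.role === "assistant") &&
      typeof m.content === "string",
  );
  if (!valid) return null;
  // The last message must come from the user for a RAG query to make sense.
  const last = messages[messages.length - 1] as ChatMessage;
  return last.role === "user" ? (messages as ChatMessage[]) : null;
}
```

In the route, replace the manual messages[messages.length - 1] extraction with a parseChatBody call and return the 400 response when it yields null.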
The Chat UI Component
// app/page.tsx
"use client";
import { useChat } from "ai/react";
import { useRef, useEffect } from "react";
export default function ChatPage() {
const { messages, input, handleInputChange, handleSubmit, isLoading } =
useChat({ api: "/api/chat" });
const messagesEndRef = useRef<HTMLDivElement>(null);
// Auto-scroll to latest message
useEffect(() => {
messagesEndRef.current?.scrollIntoView({ behavior: "smooth" });
}, [messages]);
return (
<div className="flex flex-col h-screen bg-gray-950">
{/* Header */}
<div className="border-b border-gray-800 p-4">
<h1 className="text-xl font-bold text-white">
🦙 RAG Chatbot
</h1>
<p className="text-sm text-gray-400">
Powered by LlamaIndex 0.12 + GPT-4o + Next.js 16
</p>
</div>
{/* Messages */}
<div className="flex-1 overflow-y-auto p-4 space-y-4">
{messages.length === 0 && (
<div className="text-center text-gray-500 mt-16">
<p className="text-lg">Ask me anything about the documentation.</p>
<p className="text-sm mt-2">I'll answer using your own documents.</p>
</div>
)}
{messages.map((message) => (
<div
key={message.id}
className={`flex ${
message.role === "user" ? "justify-end" : "justify-start"
}`}
>
<div
className={`max-w-[80%] rounded-2xl px-4 py-3 ${
message.role === "user"
? "bg-blue-600 text-white"
: "bg-gray-800 text-gray-100"
}`}
>
<p className="text-sm leading-relaxed whitespace-pre-wrap">
{message.content}
</p>
</div>
</div>
))}
{/* Streaming indicator */}
{isLoading && (
<div className="flex justify-start">
<div className="bg-gray-800 rounded-2xl px-4 py-3">
<div className="flex gap-1">
<span className="w-2 h-2 bg-gray-400 rounded-full animate-bounce [animation-delay:0ms]"></span>
<span className="w-2 h-2 bg-gray-400 rounded-full animate-bounce [animation-delay:150ms]"></span>
<span className="w-2 h-2 bg-gray-400 rounded-full animate-bounce [animation-delay:300ms]"></span>
</div>
</div>
</div>
)}
<div ref={messagesEndRef} />
</div>
{/* Input */}
<div className="border-t border-gray-800 p-4">
<form onSubmit={handleSubmit} className="flex gap-3">
<input
value={input}
onChange={handleInputChange}
placeholder="Ask a question about your documents..."
className="flex-1 bg-gray-800 text-white rounded-xl px-4 py-3 text-sm
border border-gray-700 focus:border-blue-500 focus:outline-none
placeholder-gray-500"
disabled={isLoading}
/>
<button
type="submit"
disabled={isLoading || !input.trim()}
className="bg-blue-600 hover:bg-blue-500 disabled:bg-gray-700
text-white rounded-xl px-6 py-3 text-sm font-medium
transition-colors"
>
Send
</button>
</form>
</div>
</div>
);
}
Full-Stack Architecture Diagram
Your Full-Stack RAG App — Complete Architecture
app/page.tsx — Chat UI → useChat() — Vercel AI SDK hook → app/api/chat/route.ts → streamText()
./storage/ — persisted index 📂 (built from ./data/ — your documents; ⚡ loaded once at startup)
Embeddings: text-embedding-3-large · 🤖 Generation: gpt-4o · 📡 Streaming via SSE
Deploying to Vercel — Production in 5 Minutes
Prepare for Deployment
Before deploying, handle two important production considerations:
1. The Storage Problem on Vercel
Vercel's serverless functions are stateless — only /tmp is writable at runtime, and anything written there vanishes between invocations. Your ./storage folder won't persist between deployments. You have two options:
// Option A (simplest): In-memory index, re-build from documents on cold start
// ⚠️ Works for small document sets; slow cold starts for large ones
// Option B (recommended for production): Use a cloud vector store
// LlamaIndex supports: Pinecone, Qdrant, Weaviate, MongoDB Atlas, Supabase
// Example with Pinecone:
import { PineconeVectorStore } from "llamaindex/vector-store";
// Replace SimpleVectorStore with PineconeVectorStore in your storageContext
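For Option A, a small environment switch keeps local development on ./storage while pointing Vercel at its only writable path. This is a sketch; VERCEL is the environment variable Vercel sets on deployed functions, and the /tmp directory name is my own choice:

```typescript
import * as path from "path";

// Pick a writable persist directory: /tmp on Vercel (the only writable
// path at runtime), ./storage locally. /tmp survives warm invocations of
// the same instance only, so treat it as an optimization, not durable storage.
function resolvePersistDir(
  env: Record<string, string | undefined> = process.env,
): string {
  return env.VERCEL
    ? "/tmp/llamaindex-storage"
    : path.join(process.cwd(), "storage");
}
```

Pass the result as persistDir to storageContextFromDefaults; cold starts still rebuild the index, which is exactly why Option B is the production recommendation.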
2. Environment Variables on Vercel
# Install Vercel CLI
npm install -g vercel@latest
# Set your environment variables
vercel env add OPENAI_API_KEY
# When prompted, paste your key and select all environments (Production, Preview, Development)
Deploy
# From your project root (the Next.js app directory)
vercel
# For production deployment
vercel --prod
The CLI will:
- Detect it's a Next.js project
- Build your app
- Deploy to a global CDN
- Give you a live URL like
https://rag-chatbot-xyz.vercel.app
Your Deployment Checklist
Pre-deployment Checklist
OPENAI_API_KEY set in Vercel environment variables
./storage is in .gitignore (don't commit the index)
./data is committed with your production documents
package.json: "engines": {"node": ">=20.9.0"}
vercel.json (default is 10s; RAG needs more):
{
"functions": {
"app/api/chat/route.ts": {
"maxDuration": 60
}
}
}
The Complete Series: Your Learning Path
You've Completed the Entire Series
Key Takeaways
- Use storageContextFromDefaults({ persistDir }) and check for existing data. Index once, load forever. This is non-negotiable for production.
- Pass stream: true to query() or chat() and iterate with for await. Use .response for QueryEngine and .delta for ChatEngine.
- Cache your VectorStoreIndex once at module scope. Never re-initialize on every API request — that's 10x slower and costs real money on every API call.
- Use create-llama to scaffold production apps instantly. It gives you a complete, tested, production-grade setup. Customize from there instead of building from scratch.
- Set a maxDuration of 60 seconds for RAG routes. The default 10-second timeout will kill RAG requests on cold starts. Set it in vercel.json and use a cloud vector store for persistence in production.
- The useChat() + streamText() combo is the fastest path to a streaming RAG UI. It handles SSE, conversation history, error states, and loading states — all out of the box with LlamaIndex 0.12.
Try It Yourself — Final Challenge
Your Capstone Project — Build This in 1 Weekend
Add an escalate_to_human() tool that triggers when confidence is low. Add a Pinecone or Qdrant vector store for production persistence. This is a production-grade support bot.