Your dev RAG works perfectly. You push to production. Everything breaks.
The server starts, begins re-indexing 5,000 documents, and takes 4 minutes to become ready, so users hit timeout errors. Once it finally loads, each answer still takes 12 seconds. This part fixes both problems:
- Part A: Persistent index — index once, load in milliseconds forever.
- Part B: Streaming responses — real-time token-by-token output.
- Part C: Full-stack Next.js 16 chatbot with streaming UI.
- Part D: One-command deployment to Vercel.
This is Part 4 of 4 — the final chapter. You need Part 1, Part 2, and Part 3 before reading this.
The Re-Indexing Problem
In Parts 1–3, every time your script ran, it re-indexed all your documents from scratch. For 5,000 documents, that's thousands of API calls, high costs, and minutes of startup time.
The Impact of Persistence
Without persistence (re-index on every start):
- Time to ready: 3–10 minutes
- API calls: ~10,000 (per restart)
- Cost: $2–15
- User Experience: Timeout errors
With persistence (load from disk):
- Time to ready: < 2 seconds
- API calls: 0
- Cost: $0.00
- User Experience: Instant load
How Persistence Works
LlamaIndex's storageContextFromDefaults creates a storage context backed by a directory on disk. An index built with that context is written to the directory, and subsequent runs load it from there instead of re-embedding.
- First Run: Load docs → Chunk into Nodes → Call OpenAI API ($$) → Build Index → Save to ./storage/
- Subsequent Runs: Check ./storage/ → Load from disk (instant, $0) → Ready to serve queries.
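// src/persist.ts (simplified)
import fs from "node:fs";
import { VectorStoreIndex, storageContextFromDefaults } from "llamaindex";

// `documents` is loaded the same way as in Parts 1–3.
// Check first, in case creating the storage context also creates the directory.
const hasSavedIndex = fs.existsSync("./storage");
const storageContext = await storageContextFromDefaults({ persistDir: "./storage" });
// Declare `index` once so it stays in scope after the load-or-build decision.
const index = hasSavedIndex
  ? await VectorStoreIndex.init({ storageContext }) // Load from disk
  : await VectorStoreIndex.fromDocuments(documents, { storageContext }); // Build & save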
Delete the ./storage folder whenever you add new documents, update existing ones, or change your embedding model. With the folder gone, the next run takes the build path and re-creates the index.
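Remembering to delete the folder by hand is easy to forget. One possible way to automate the check, sketched here under assumptions, is to store a fingerprint of the documents and the embedding model name next to the index and compare it on startup. The helper names (fingerprint, isIndexStale, recordFingerprint) and the fingerprint.txt file are hypothetical and not part of LlamaIndex; the sketch uses only Node's standard library.

```typescript
import { createHash } from "node:crypto";
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper: hash every top-level file in dataDir plus the embedding model name.
function fingerprint(dataDir: string, embedModel: string): string {
  const hash = createHash("sha256").update(embedModel);
  // Sort so the result does not depend on directory listing order.
  for (const name of fs.readdirSync(dataDir).sort()) {
    const file = path.join(dataDir, name);
    if (!fs.statSync(file).isFile()) continue; // sketch: subdirectories are ignored
    hash.update(name).update(fs.readFileSync(file));
  }
  return hash.digest("hex");
}

// Returns true when no fingerprint is stored or it differs from the current one.
function isIndexStale(storageDir: string, current: string): boolean {
  const manifest = path.join(storageDir, "fingerprint.txt"); // hypothetical file name
  if (!fs.existsSync(manifest)) return true;
  return fs.readFileSync(manifest, "utf8").trim() !== current;
}

// Call after a successful rebuild so the next start can skip re-indexing.
function recordFingerprint(storageDir: string, current: string): void {
  fs.mkdirSync(storageDir, { recursive: true });
  fs.writeFileSync(path.join(storageDir, "fingerprint.txt"), current);
}
```

On startup you would compute the fingerprint, and if isIndexStale returns true, remove ./storage, rebuild the index, and then call recordFingerprint. Hashing file contents rather than modification times keeps the check stable across fresh checkouts.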
Streaming Responses
Without streaming, users wait 10 seconds for a full response. With streaming, tokens appear immediately. Perceived wait time drops to under 1 second.
Streaming vs. Non-Streaming UX
- Without streaming: The user stares at a blank screen for 10 seconds. They assume it's broken and close the tab.
- With streaming: Tokens appear immediately. The user sees the AI "thinking", and the app feels alive and responsive.
Implementation in TypeScript
const queryEngine = index.asQueryEngine({ similarityTopK: 4 });
const stream = await queryEngine.query({ query, stream: true });
for await (const chunk of stream) {
  process.stdout.write(chunk.response); // Real-time tokens
}
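To see why streaming changes perceived latency, it can help to measure time to first token separately from total time. The sketch below does that against a simulated stream. fakeTokenStream and consumeStream are hypothetical names, and the generator only imitates latency with timers; it calls no LLM API.

```typescript
// Stand-in for a streaming LLM response: yields tokens with a delay between them.
async function* fakeTokenStream(tokens: string[], delayMs: number): AsyncGenerator<string> {
  for (const token of tokens) {
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    yield token;
  }
}

interface StreamStats {
  text: string;
  firstTokenMs: number; // roughly what the user perceives as the wait
  totalMs: number; // what a blocking call would make them wait
}

// Consume any async iterable of tokens, forwarding each one as soon as it arrives.
async function consumeStream(
  stream: AsyncIterable<string>,
  onToken: (token: string) => void,
): Promise<StreamStats> {
  const start = Date.now();
  let firstTokenMs = -1;
  let text = "";
  for await (const token of stream) {
    if (firstTokenMs < 0) firstTokenMs = Date.now() - start;
    onToken(token);
    text += token;
  }
  return { text, firstTokenMs, totalMs: Date.now() - start };
}
```

Passing `(t) => process.stdout.write(t)` as onToken reproduces the terminal behavior of the loop above, while the returned stats show how much earlier the first token lands than the full answer.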
The create-llama CLI
Before we build manually, meet the tool that generates the boilerplate for you.
npx create-llama@latest is the fastest path to a working full-stack app. Use it as your scaffold, then customize your data sources and persistence logic.
Building the Next.js 16 Chatbot
We'll use the Next.js 16 App Router to build a streaming RAG endpoint.
- Browser (React UI): app/page.tsx with Vercel AI SDK useChat() hook.
- Next.js 16 Server: app/api/chat/route.ts with LlamaIndex QueryEngine.
- Storage: Local ./storage/ for persistent index, ./data/ for documents.
- APIs: OpenAI gpt-4o for generation and text-embedding-3-large for vectors.
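// app/api/chat/route.ts
import { NextRequest } from "next/server";
import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";
// getIndex is the persistent loader from Part A; import it from wherever you defined it.

export async function POST(request: NextRequest) {
  const { messages } = await request.json();
  const index = await getIndex(); // Persistent loader
  const queryEngine = index.asQueryEngine({ similarityTopK: 4 });
  // Query the index with the latest user message to get grounding context.
  const context = await queryEngine.query({ query: messages.at(-1).content });
  // Stream the final answer, constrained to that context.
  const result = await streamText({
    model: openai("gpt-4o"),
    system: "Answer based ONLY on the context: " + context.toString(),
    messages,
  });
  return result.toDataStreamResponse();
}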
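The route calls getIndex() on every request, so the loader should do the expensive work only once per server instance. The article does not show getIndex itself; one common pattern, sketched here as an assumption, is to cache the loading promise. createCachedLoader is a hypothetical helper, and the load function you pass in stands for whatever actually reads the index.

```typescript
// Wraps an async loader so it runs at most once per process (until it fails).
function createCachedLoader<T>(load: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | null = null;
  return () => {
    if (cached) return cached;
    // Cache the promise, not the value, so concurrent requests share one load.
    const pending = load().catch((err) => {
      cached = null; // allow a retry after a failed load
      throw err;
    });
    cached = pending;
    return pending;
  };
}
```

With this in place, getIndex could be defined as `createCachedLoader(() => loadIndexFromStorage())`, where loadIndexFromStorage is whatever function wraps your persistence code. Requests that arrive while the first load is still in flight wait on the same promise instead of starting their own.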
Deploying to Vercel
Vercel functions are stateless: files written at runtime do not persist between invocations or deployments, so local persistence to ./storage won't work.
Production Storage Strategies
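- Rebuild on cold start: Re-build the index from documents each time a function instance starts. Only works for small datasets.
- Managed vector database: Connect to Pinecone, Qdrant, or Weaviate. The index lives outside the function and stays persistent globally.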
Deployment Checklist
Launch Sequence
- Set environment variables: vercel env add OPENAI_API_KEY.
- Run npm run build locally to confirm the production build succeeds.
- Run vercel --prod to deploy to production.
- Check the Vercel logs for cold start or timeout issues.
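A missing API key is a common reason the first production request fails. As a small, optional safeguard, a preflight check can fail fast with a readable message. The function names below (missingEnvVars, assertEnv) are hypothetical, and the sketch assumes OPENAI_API_KEY is the only variable your app requires.

```typescript
type Env = Record<string, string | undefined>;

// Returns the names of required variables that are unset or blank.
function missingEnvVars(required: string[], env: Env = process.env): string[] {
  return required.filter((name) => !env[name]?.trim());
}

// Throws one error listing everything that is missing.
function assertEnv(required: string[], env: Env = process.env): void {
  const missing = missingEnvVars(required, env);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
}
```

Calling `assertEnv(["OPENAI_API_KEY"])` at the top of a server-side module surfaces the problem in the logs as soon as that module loads.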
Key Takeaways
- Never re-index in production. Load from disk or a cloud vector DB for sub-second startup.
- Blocking responses feel broken. Use streaming to make your RAG app feel "live".
- For Vercel/Lambda deployments, move your vectors to a managed service like Pinecone.