
Part 4 of 4 — Production RAG with LlamaIndex: Persistence, Streaming & Next.js 16 Deployment

Your RAG system works perfectly in development. Then it hits production and everything breaks: re-indexing takes 3 minutes on restart, users wait 10 seconds for answers, and your Vercel deployment times out. This article fixes all three — and deploys a full-stack Next.js 16 chatbot.

March 24, 2026
22 min read
Tags: LlamaIndex, Next.js 16, Vercel, Streaming, Production, RAG, TypeScript, create-llama, Persistence

Your dev RAG works perfectly. You push to production. Everything breaks.

The server starts, begins re-indexing 5,000 documents, and takes 4 minutes to be ready, so early users get timeout errors. Once it finally loads, users still wait 12 seconds for each answer.

Primary Objective
These are the two production RAG killers. This article eliminates both — then deploys your finished app to Vercel.
What You'll Build in This Article
  • Part A: Persistent index — index once, load in milliseconds forever.
  • Part B: Streaming responses — real-time token-by-token output.
  • Part C: Full-stack Next.js 16 chatbot with streaming UI.
  • Part D: One-command deployment to Vercel.

This is Part 4 of 4 — the final chapter. You need Part 1, Part 2, and Part 3 before reading this.


The Re-Indexing Problem

In Parts 1–3, every time your script ran, it re-indexed all your documents from scratch. For 5,000 documents, that's thousands of API calls, high costs, and minutes of startup time.

The Impact of Persistence

🐢 WITHOUT PERSISTENCE
  • Time to ready: 3–10 minutes
  • API calls: ~10,000 (per restart)
  • Cost: $2–15
  • User Experience: Timeout errors
WITH PERSISTENCE
  • Time to ready: < 2 seconds
  • API calls: 0
  • Cost: $0.00
  • User Experience: Instant load

How Persistence Works

LlamaIndex's storageContextFromDefaults creates a storage context backed by a directory on disk. An index built with that context is saved there, and on subsequent runs the index is loaded from disk instead of being re-embedded.

Persistence Lifecycle
  • First Run: Load docs → Chunk into Nodes → Call OpenAI API ($$) → Build Index → Save to ./storage/
  • Subsequent Runs: Check ./storage/ → Load from disk (instant, $0) → Ready to serve queries.
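// src/persist.ts (simplified)
import * as fs from "node:fs";
import { storageContextFromDefaults, VectorStoreIndex } from "llamaindex";

// Check first: creating the storage context may itself create the directory.
const hasStoredIndex = fs.existsSync("./storage");
const storageContext = await storageContextFromDefaults({ persistDir: "./storage" });

// `documents` comes from the document loader built in the earlier parts.
const index = hasStoredIndex
  ? await VectorStoreIndex.init({ storageContext }) // Load
  : await VectorStoreIndex.fromDocuments(documents, { storageContext }); // Build & Save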
💡
When to Re-Index

Delete the ./storage folder whenever you add new documents, update existing ones, or change your embedding model. On the next start, the existence check above finds no saved index and rebuilds it from your documents.
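
Deleting ./storage by hand is easy to forget. One way to automate the decision is to fingerprint the inputs and compare the result against a marker saved next to the index. The sketch below is an illustration under stated assumptions, not part of LlamaIndex: the function names and the fingerprint.txt marker file are invented for this example, and it only looks at files directly inside the data directory.

```typescript
import { createHash } from "node:crypto";
import * as fs from "node:fs";
import * as path from "node:path";

// Hash the embedding model name plus the name, size, and modification time
// of every file directly inside dataDir (subdirectories are ignored).
function fingerprintDataDir(dataDir: string, embeddingModel: string): string {
  const hash = createHash("sha256").update(`${embeddingModel}\n`);
  const names = fs.readdirSync(dataDir).sort();
  for (const name of names) {
    const stat = fs.statSync(path.join(dataDir, name));
    if (!stat.isFile()) continue;
    hash.update(`${name}:${stat.size}:${stat.mtimeMs}\n`);
  }
  return hash.digest("hex");
}

// True when the marker is missing or was written for different inputs.
function needsReindex(storageDir: string, fingerprint: string): boolean {
  const marker = path.join(storageDir, "fingerprint.txt"); // hypothetical marker file
  if (!fs.existsSync(marker)) return true;
  return fs.readFileSync(marker, "utf8").trim() !== fingerprint;
}

// Call after a successful rebuild so the next start can skip re-indexing.
function saveFingerprint(storageDir: string, fingerprint: string): void {
  fs.mkdirSync(storageDir, { recursive: true });
  fs.writeFileSync(path.join(storageDir, "fingerprint.txt"), fingerprint);
}
```

On startup, compute the fingerprint for ./data, rebuild the index if needsReindex returns true, and call saveFingerprint once the rebuild has finished.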


Streaming Responses

Without streaming, users wait 10 seconds for a full response. With streaming, tokens appear immediately. Perceived wait time drops to under 1 second.

Streaming vs. Non-Streaming UX

TRADITIONAL (BLOCKING)

User stares at a blank screen for 10 seconds. They assume it's broken and close the tab.

STREAMING (REAL-TIME)

Tokens appear immediately. User sees the AI "thinking". Feels alive and responsive.

Implementation in TypeScript

// `index` is the persisted index loaded earlier; `query` is the user's question.
const queryEngine = index.asQueryEngine({ similarityTopK: 4 });
const stream = await queryEngine.query({ query, stream: true });

for await (const chunk of stream) {
  process.stdout.write(chunk.response); // Real-time tokens
}
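
The same async-iterable shape can be forwarded over HTTP instead of being written to stdout. Here is a minimal sketch, assuming each chunk exposes a partial `response` string as in the loop above. `ResponseChunk`, `toTextStream`, and `fakeStream` are names invented for this example, and the fake generator only stands in for a real query engine stream.

```typescript
// Shape assumed from the loop above: each chunk carries a partial `response` string.
interface ResponseChunk {
  response: string;
}

// Convert an async iterable of chunks into a web ReadableStream of UTF-8 bytes.
function toTextStream(chunks: AsyncIterable<ResponseChunk>): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  const iterator = chunks[Symbol.asyncIterator]();
  return new ReadableStream<Uint8Array>({
    // pull() runs only when the consumer asks for more data.
    async pull(controller) {
      const { value, done } = await iterator.next();
      if (done) {
        controller.close();
      } else {
        controller.enqueue(encoder.encode(value.response));
      }
    },
    // If the client disconnects, stop the upstream generator as well.
    async cancel() {
      await iterator.return?.();
    },
  });
}

// Stand-in for a real query engine stream, used only to exercise the helper.
async function* fakeStream(): AsyncGenerator<ResponseChunk> {
  for (const token of ["Persist", "ence ", "works."]) {
    yield { response: token };
  }
}
```

A route handler could then return `new Response(toTextStream(stream), { headers: { "Content-Type": "text/plain; charset=utf-8" } })`, and the browser would receive tokens as they are produced.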

The create-llama CLI

Before we build manually, meet the tool that generates the boilerplate for you.

💡
Expert Tip

npx create-llama@latest is the fastest path to a working full-stack app. Use it as your scaffold, then customize your data sources and persistence logic.


Building the Next.js 16 Chatbot

We'll use the Next.js 16 App Router to build a streaming RAG endpoint.

Full-Stack Architecture
  • Browser (React UI): app/page.tsx with Vercel AI SDK useChat() hook.
  • Next.js 16 Server: app/api/chat/route.ts with LlamaIndex QueryEngine.
  • Storage: Local ./storage/ for persistent index, ./data/ for documents.
  • APIs: OpenAI gpt-4o for generation and text-embedding-3-large for vectors.
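// app/api/chat/route.ts
import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";
import type { NextRequest } from "next/server";
import { getIndex } from "@/lib/get-index"; // hypothetical path to your persistent loader

export async function POST(request: NextRequest) {
  const { messages } = await request.json();
  const question = messages.at(-1)?.content ?? ""; // Latest user message

  const index = await getIndex(); // Persistent loader
  const queryEngine = index.asQueryEngine({ similarityTopK: 4 });
  // query() retrieves the top 4 chunks and synthesizes a draft answer from them.
  const context = await queryEngine.query({ query: question });

  const result = await streamText({
    model: openai("gpt-4o"),
    system: "Answer based ONLY on the context: " + context.toString(),
    messages,
  });

  return result.toDataStreamResponse();
}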
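
The route calls getIndex(), which is not shown. One plausible shape for it is a lazy loader that runs the expensive load once per server process and shares the result across requests. This is a sketch under that assumption: `createLazyLoader` is a name invented here, and `loadIndexFromStorage` in the closing comment is a hypothetical function wrapping the persistence code from earlier.

```typescript
// Wrap an expensive async loader so it runs at most once per server process.
function createLazyLoader<T>(load: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (!cached) {
      // Cache the promise itself so concurrent first requests share one load.
      cached = load().catch((error) => {
        cached = undefined; // Allow a retry after a failed load
        throw error;
      });
    }
    return cached;
  };
}

// Hypothetical wiring, with loadIndexFromStorage standing in for the persistence code:
// export const getIndex = createLazyLoader(loadIndexFromStorage);
```

Caching the promise rather than the resolved value matters when several requests arrive during startup: they all await the same load instead of each triggering their own.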

Deploying to Vercel

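Vercel functions are stateless: files written to ./storage at runtime are not shared between function instances and do not survive cold starts, so local persistence won't work the way it does on a long-lived server.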

Production Storage Strategies

IN-MEMORY (SIMPLE)

Re-build from documents on cold start. Only works for small datasets.

CLOUD VECTOR DB (SCALABLE)

Connect to Pinecone, Qdrant, or Weaviate. Index stays persistent globally.

Deployment Checklist

Launch Sequence

⚙️
CONFIG

Set environment variables: vercel env add OPENAI_API_KEY.

🏗️
BUILD

Run npm run build locally to confirm the production build succeeds before you deploy.

🚀
DEPLOY

Run vercel --prod to push to the global edge network.

📊
MONITOR

Check Vercel logs for any cold start or timeout issues.
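
Logs are easier to read when each invocation records whether it was a cold start and how long it took. The following is a small sketch of that idea, not a Vercel API: `withTiming` and `TimingInfo` are names invented for this example, and it relies on the assumption that module-level code runs once per function instance.

```typescript
// Module scope runs once per function instance, so this flag is true only
// for the first request that instance handles.
let isColdStart = true;

interface TimingInfo {
  coldStart: boolean;
  durationMs: number;
}

// Wrap a handler so every invocation reports whether it was cold and how long it took.
function withTiming<Args extends unknown[], R>(
  handler: (...args: Args) => Promise<R>,
  report: (info: TimingInfo) => void = (info) => console.log(JSON.stringify(info)),
): (...args: Args) => Promise<R> {
  return async (...args: Args) => {
    const coldStart = isColdStart;
    isColdStart = false;
    const startedAt = performance.now();
    try {
      return await handler(...args);
    } finally {
      // Runs on both success and failure, so slow errors are logged too.
      report({ coldStart, durationMs: Math.round(performance.now() - startedAt) });
    }
  };
}
```

Wrapping the chat route's POST handler this way would make slow cold starts show up as log lines with coldStart set to true and a large durationMs.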


Key Takeaways

01
Persistence is Mandatory

Never re-index in production. Load from disk or a cloud vector DB for sub-second startup.

02
Streaming is the UX Standard

Blocking responses feel broken. Use streaming to make your RAG app feel "live".

03
Stateless Functions Need Cloud DBs

For Vercel/Lambda deployments, move your vectors to a managed service like Pinecone.


Mohamed Hamed

20 years building production systems — the last several deep in AI integration, LLMs, and full-stack architecture. I write what I've actually built and broken. If this was useful, the next one goes to LinkedIn first.
