
Part 4 of 4 — Production RAG with LlamaIndex: Persistence, Streaming & Next.js 16 Deployment

Your RAG system works perfectly in development. Then it hits production and everything breaks: re-indexing takes 3 minutes on restart, users wait 10 seconds for answers, and your Vercel deployment times out. This article fixes all three — and deploys a full-stack Next.js 16 chatbot.

March 24, 2026
22 min read
#LlamaIndex · #Next.js 16 · #Vercel · #Streaming · #Production · #RAG · #TypeScript · #create-llama · #Persistence

Your dev RAG works perfectly. You push to production. The server starts, begins re-indexing 5,000 documents, takes 4 minutes to be ready, and users get a timeout error in the meantime.

When it finally loads, users ask questions and wait 8–12 seconds for an answer — staring at a blank screen with no feedback.

These are the two production RAG killers. This article eliminates both — then deploys your finished app to Vercel.

🚀

WHAT YOU'LL BUILD IN THIS ARTICLE

🔷 Part A: Persistent index — index once, load in milliseconds forever
🔷 Part B: Streaming responses — real-time token-by-token output
🔷 Part C: Full-stack Next.js 16 chatbot with streaming UI
🔷 Part D: One-command deployment to Vercel

This is Part 4 of 4 — the final chapter. It builds directly on Parts 1, 2, and 3, so read those first.

STEP 1 OF 5

The Re-Indexing Problem — And How Persistence Kills It

In Parts 1–3, every time your script ran, it re-indexed all your documents from scratch. For 5 documents this is fine. For 5,000 documents, that's:

  • ~10,000 API calls to OpenAI's embedding endpoint
  • ~$2–15 in API costs (per restart!)
  • 3–10 minutes of startup time before your server accepts requests
WITHOUT PERSISTENCE — Every Restart
  • Time to ready: 3–10 min
  • API calls made: ~10,000
  • Cost per restart: $2–15
  • User experience: timeout error

WITH PERSISTENCE — After First Index
  • Time to ready: < 2 seconds
  • API calls made: 0
  • Cost per restart: $0.00
  • User experience: instant
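The cost figures above are simple arithmetic: number of chunks, times average tokens per chunk, times the embedding price per million tokens. Here is that math as a tiny sketch; the function name and the example price are illustrative (embedding prices change, so plug in your provider's current rate):

```typescript
// Hypothetical helper (not a LlamaIndex API): rough embedding cost
// for a full re-index of your corpus.
function estimateEmbeddingCost(
  numChunks: number,
  avgTokensPerChunk: number,
  pricePerMillionTokens: number, // assumption: check your provider's pricing page
): number {
  const totalTokens = numChunks * avgTokensPerChunk;
  return (totalTokens / 1_000_000) * pricePerMillionTokens;
}

// e.g. 25,000 chunks of ~800 tokens at a hypothetical $0.10 per million tokens:
console.log(estimateEmbeddingCost(25_000, 800, 0.1));
```

Run the numbers for your own corpus before you decide persistence is optional.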

How Persistence Works

LlamaIndex's storageContextFromDefaults saves your entire index to disk — vectors, nodes, document metadata, and all. On subsequent runs, it loads from disk instead of re-embedding.

Persistence — First Run vs Every Run After

FIRST RUN ONLY
  1. Load documents
  2. Chunk into Nodes
  3. Call OpenAI embedding API ($$)
  4. Build VectorStoreIndex
  5. Save everything to ./storage/ ✓

EVERY RESTART AFTER
  1. Check if ./storage/ exists
  2. Load from disk (instant, $0)
  3. No embedding API calls
  4. No chunking
  5. Ready to serve queries immediately ✓

Here's the complete persistence pattern:

// src/persist.ts
import "dotenv/config";
import * as fs from "fs";
import { Settings } from "llamaindex";
import { OpenAI } from "llamaindex/llms";
import { OpenAIEmbedding } from "llamaindex/embeddings";
import { SimpleDirectoryReader } from "llamaindex/ingestion";
import { VectorStoreIndex } from "llamaindex/indices";
import { storageContextFromDefaults } from "llamaindex/storage";
import { SentenceSplitter } from "llamaindex/node-parser";

const PERSIST_DIR = "./storage";

async function getOrCreateIndex() {
  Settings.llm = new OpenAI({
    model: "gpt-4o",
    apiKey: process.env.OPENAI_API_KEY,
    temperature: 0.1,
  });
  Settings.embedModel = new OpenAIEmbedding({
    model: "text-embedding-3-large",
    apiKey: process.env.OPENAI_API_KEY,
  });
  Settings.nodeParser = new SentenceSplitter({
    chunkSize: 512,
    chunkOverlap: 64,
  });

  // Create a storage context pointing to our persist directory
  const storageContext = await storageContextFromDefaults({
    persistDir: PERSIST_DIR,
  });

  // If the storage directory already has data, load from disk
  if (fs.existsSync(PERSIST_DIR) && fs.readdirSync(PERSIST_DIR).length > 0) {
    console.log("📂 Loading index from disk (no API calls needed)...");
    const startTime = Date.now();

    // VectorStoreIndex.init() loads from the storage context without re-embedding
    const index = await VectorStoreIndex.init({ storageContext });

    console.log(`✅ Index loaded in ${Date.now() - startTime}ms`);
    return index;
  }

  // First time: build the index from your documents
  console.log("📄 First run: building index from documents...");
  console.log("   (This will take a minute but only happens once)");

  const reader = new SimpleDirectoryReader();
  const documents = await reader.loadData("./src/data");
  console.log(`   Loaded ${documents.length} documents`);

  const startTime = Date.now();

  // fromDocuments() with storageContext automatically persists when done
  const index = await VectorStoreIndex.fromDocuments(documents, {
    storageContext,
  });

  console.log(
    `✅ Index built and saved to ${PERSIST_DIR} in ${Date.now() - startTime}ms`,
  );
  console.log("   Next restart will load in under 2 seconds.");
  return index;
}

async function main() {
  const index = await getOrCreateIndex();
  const queryEngine = index.asQueryEngine({ similarityTopK: 4 });

  const response = await queryEngine.query({
    query: "What is the remote work policy?",
  });

  console.log("\n💬 Answer:", response.toString());
}

main().catch(console.error);

First run:

📄 First run: building index from documents...
   (This will take a minute but only happens once)
   Loaded 3 documents
✅ Index built and saved to ./storage in 12,847ms
   Next restart will load in under 2 seconds.

Second run (and every run after):

📂 Loading index from disk (no API calls needed)...
✅ Index loaded in 142ms

142 milliseconds instead of 13 seconds. Zero API cost instead of dollars.

WHEN TO RE-INDEX

Delete the ./storage folder and restart your server whenever you: add new documents, update existing documents, or change your chunk size or embedding model. LlamaIndex will rebuild the index automatically on the next start.

STEP 2 OF 5

Streaming Responses — From 10-Second Wait to Real-Time Feel

Without streaming, your user sees this: a blank screen for 8–12 seconds, then a full response appears at once. It feels broken, even if it works perfectly.

With streaming, the response appears token-by-token — exactly like ChatGPT. The user sees the AI "thinking" in real time. Perceived wait time drops from 10 seconds to under 1 second.
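Mechanically, "token-by-token" just means the model emits small text fragments and you flush each one as it arrives. A toy generator (no API involved; the names are mine) shows the shape of that loop:

```typescript
// Toy stand-in for a streaming response: yields the text a few
// characters at a time, like tokens arriving over SSE.
function* simulateStream(text: string, chunkSize: number = 4): Generator<string> {
  for (let i = 0; i < text.length; i += chunkSize) {
    yield text.slice(i, i + chunkSize);
  }
}

let out = "";
for (const token of simulateStream("The Enterprise plan includes SSO.")) {
  out += token; // in the real loop: process.stdout.write(token)
}
```

The real LlamaIndex stream below has exactly this shape — an async iterable you consume with `for await`.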

WITHOUT STREAMING
  ⏳ 8–12 seconds...
  The user stares at a blank/loading screen, assumes it's broken, and often closes the tab.

WITH STREAMING
  The Enterprise plan|
  Tokens appear immediately. The user sees the AI responding in real time. It feels instant and alive.

Streaming in Your TypeScript RAG

// src/stream-demo.ts
import "dotenv/config";
import * as fs from "fs";
import { Settings } from "llamaindex";
import { OpenAI } from "llamaindex/llms";
import { OpenAIEmbedding } from "llamaindex/embeddings";
import { VectorStoreIndex } from "llamaindex/indices";
import { storageContextFromDefaults } from "llamaindex/storage";

async function main() {
  Settings.llm = new OpenAI({
    model: "gpt-4o",
    apiKey: process.env.OPENAI_API_KEY,
    temperature: 0.1,
  });
  Settings.embedModel = new OpenAIEmbedding({
    model: "text-embedding-3-large",
    apiKey: process.env.OPENAI_API_KEY,
  });

  // Load from disk (assumes you already ran persist.ts)
  const storageContext = await storageContextFromDefaults({
    persistDir: "./storage",
  });
  const index = await VectorStoreIndex.init({ storageContext });

  const queryEngine = index.asQueryEngine({ similarityTopK: 4 });

  const query = "Explain all the remote work equipment policies in detail.";
  console.log(`\n📌 Query: ${query}\n`);
  console.log("💬 Answer (streaming):\n");

  // ── The key: pass stream: true to get an async iterable ───────────
  const stream = await queryEngine.query({
    query,
    stream: true, // Enable streaming
  });

  // Process each token as it arrives
  for await (const chunk of stream) {
    // chunk.response contains the new text fragment (1–5 tokens typically)
    process.stdout.write(chunk.response); // Write without newline for continuous flow
  }

  console.log("\n\n✅ Stream complete");
}

main().catch(console.error);

For ChatEngine streaming (the pattern you'll use in the Next.js API):

// Chat engine streaming — exact same pattern
const chatEngine = index.asChatEngine();

const stream = await chatEngine.chat({
  message: "What are the core hours for remote workers?",
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.delta); // ChatEngine uses .delta, QueryEngine uses .response
}

Streaming Property Reference

QueryEngine stream chunks:
  • chunk.response — new text
  • chunk.sourceNodes — sources
  • chunk.done — is it finished?

ChatEngine stream chunks:
  • chunk.delta — new text token
  • chunk.message — full message so far
  • (sources available after the stream completes)
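If you share streaming code between both engine types, a tiny adapter saves you from remembering which field carries the text. The chunk types below are illustrative shapes built from the fields listed above, not LlamaIndex's real type names:

```typescript
// Illustrative chunk shapes: just the text-bearing fields from the
// reference above (real LlamaIndex chunks carry more properties).
type QueryEngineChunk = { response: string };
type ChatEngineChunk = { delta: string };

// Pull the new text fragment out of either chunk shape.
function chunkText(chunk: QueryEngineChunk | ChatEngineChunk): string {
  return "delta" in chunk ? chunk.delta : chunk.response;
}
```

Then either loop becomes `for await (const chunk of stream) process.stdout.write(chunkText(chunk));`.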
STEP 3 OF 5

The create-llama CLI — From Zero to Full-Stack in 60 Seconds

Before we build the Next.js app manually, meet the tool that generates the boilerplate for you: create-llama.


npx create-llama@latest

# You'll be asked:
# ? What is your project named? → my-rag-chatbot
# ? Which framework? → Next.js
# ? Which model provider? → OpenAI
# ? Which model? → gpt-4o
# ? Which data source? → Files
# ? Which vector store? → SimpleVectorStore (in-memory)
# ? Add LlamaCloud? → No
# ? Would you like to use ESLint? → Yes

This generates a complete production-grade app with:

  • Streaming chat UI (React components)
  • LlamaIndex backend API routes
  • File upload support
  • Dark/light theme
  • Proper error handling and loading states

Run it:

cd my-rag-chatbot
cp .env.example .env.local
# Add your OPENAI_API_KEY to .env.local
npm install
npm run dev

Open http://localhost:3000 — you have a working RAG chatbot.

USE CREATE-LLAMA AS YOUR STARTING POINT

create-llama is the fastest path to a working full-stack app. Use it as your scaffold, then customize: add persistence, wire in your own data sources, add authentication. Don't build from scratch what create-llama gives you for free.

STEP 4 OF 5

Building the Next.js 16 RAG Chatbot From Scratch

Let's build our own version — so you understand every piece. This gives you full control and the ability to customize anything.

Setup

npx create-next-app@latest rag-chatbot \
  --typescript \
  --tailwind \
  --eslint \
  --app \
  --no-src-dir \
  --import-alias "@/*"

cd rag-chatbot

# Install LlamaIndex 0.12 and the Vercel AI SDK for streaming
npm install llamaindex ai @ai-sdk/openai

Create your .env.local:

OPENAI_API_KEY=sk-proj-...your-key...

Add your data to the project:

mkdir -p data
# Copy your documents from Part 2-3 (or create new ones)
echo "CloudSync Pro is the best cloud collaboration tool." > data/overview.txt

The API Route — Streaming RAG Endpoint

This is the heart of the application. A Next.js 16 App Router route that streams LlamaIndex responses using the Vercel AI SDK:

// app/api/chat/route.ts
import { NextRequest } from "next/server";
import { Settings } from "llamaindex";
import { OpenAI } from "llamaindex/llms";
import { OpenAIEmbedding } from "llamaindex/embeddings";
import { VectorStoreIndex } from "llamaindex/indices";
import { storageContextFromDefaults } from "llamaindex/storage";
import { SimpleDirectoryReader } from "llamaindex/ingestion";
import { SentenceSplitter } from "llamaindex/node-parser";
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";
import * as fs from "fs";
import * as path from "path";

// ── Module-level cache: initialize once, reuse across requests ───────
// Next.js 16 App Router supports module-level singletons in serverless.
let cachedIndex: VectorStoreIndex | null = null;

async function getIndex(): Promise<VectorStoreIndex> {
  if (cachedIndex) return cachedIndex; // Return cached index

  Settings.llm = new OpenAI({
    model: "gpt-4o",
    apiKey: process.env.OPENAI_API_KEY,
    temperature: 0.1,
  });
  Settings.embedModel = new OpenAIEmbedding({
    model: "text-embedding-3-large",
    apiKey: process.env.OPENAI_API_KEY,
  });
  Settings.nodeParser = new SentenceSplitter({
    chunkSize: 512,
    chunkOverlap: 64,
  });

  const persistDir = path.join(process.cwd(), "storage");
  const storageContext = await storageContextFromDefaults({ persistDir });

  if (fs.existsSync(persistDir) && fs.readdirSync(persistDir).length > 0) {
    cachedIndex = await VectorStoreIndex.init({ storageContext });
  } else {
    const dataDir = path.join(process.cwd(), "data");
    const reader = new SimpleDirectoryReader();
    const documents = await reader.loadData(dataDir);
    cachedIndex = await VectorStoreIndex.fromDocuments(documents, {
      storageContext,
    });
  }

  return cachedIndex!;
}

// ── POST /api/chat — Streaming chat endpoint ─────────────────────────
export async function POST(request: NextRequest) {
  try {
    const { messages } = await request.json();

    // Get the latest user message
    const userMessage = messages[messages.length - 1]?.content as string;
    if (!userMessage) {
      return new Response("No message provided", { status: 400 });
    }

    // Get the index (from cache or build it)
    const index = await getIndex();

    // Retrieve relevant context from your documents
    const queryEngine = index.asQueryEngine({ similarityTopK: 4 });
    const context = await queryEngine.query({ query: userMessage });
    const contextText = context.toString();

    // Use Vercel AI SDK for clean streaming with history support
    // We combine the RAG context with the conversation history
    const result = await streamText({
      model: openai("gpt-4o"),
      system: `You are a helpful assistant with access to company documentation.

IMPORTANT: Answer ONLY based on the context provided. If the answer is not in the context, say so clearly.

CONTEXT FROM DOCUMENTS:
${contextText}

Source files used: ${context.sourceNodes?.map((n) => n.node.metadata?.file_name).join(", ") || "unknown"}`,
      messages: messages.map((m: { role: string; content: string }) => ({
        role: m.role as "user" | "assistant",
        content: m.content,
      })),
    });

    // Return the streaming response (Vercel AI SDK handles the SSE format)
    return result.toDataStreamResponse();
  } catch (error) {
    console.error("Chat API error:", error);
    return new Response("Internal server error", { status: 500 });
  }
}

The Chat UI Component

// app/page.tsx
"use client";

import { useChat } from "ai/react";
import { useRef, useEffect } from "react";

export default function ChatPage() {
  const { messages, input, handleInputChange, handleSubmit, isLoading } =
    useChat({ api: "/api/chat" });

  const messagesEndRef = useRef<HTMLDivElement>(null);

  // Auto-scroll to latest message
  useEffect(() => {
    messagesEndRef.current?.scrollIntoView({ behavior: "smooth" });
  }, [messages]);

  return (
    <div className="flex flex-col h-screen bg-gray-950">
      {/* Header */}
      <div className="border-b border-gray-800 p-4">
        <h1 className="text-xl font-bold text-white">
          🦙 RAG Chatbot
        </h1>
        <p className="text-sm text-gray-400">
          Powered by LlamaIndex 0.12 + GPT-4o + Next.js 16
        </p>
      </div>

      {/* Messages */}
      <div className="flex-1 overflow-y-auto p-4 space-y-4">
        {messages.length === 0 && (
          <div className="text-center text-gray-500 mt-16">
            <p className="text-lg">Ask me anything about the documentation.</p>
            <p className="text-sm mt-2">I'll answer using your own documents.</p>
          </div>
        )}

        {messages.map((message) => (
          <div
            key={message.id}
            className={`flex ${
              message.role === "user" ? "justify-end" : "justify-start"
            }`}
          >
            <div
              className={`max-w-[80%] rounded-2xl px-4 py-3 ${
                message.role === "user"
                  ? "bg-blue-600 text-white"
                  : "bg-gray-800 text-gray-100"
              }`}
            >
              <p className="text-sm leading-relaxed whitespace-pre-wrap">
                {message.content}
              </p>
            </div>
          </div>
        ))}

        {/* Streaming indicator */}
        {isLoading && (
          <div className="flex justify-start">
            <div className="bg-gray-800 rounded-2xl px-4 py-3">
              <div className="flex gap-1">
                <span className="w-2 h-2 bg-gray-400 rounded-full animate-bounce [animation-delay:0ms]"></span>
                <span className="w-2 h-2 bg-gray-400 rounded-full animate-bounce [animation-delay:150ms]"></span>
                <span className="w-2 h-2 bg-gray-400 rounded-full animate-bounce [animation-delay:300ms]"></span>
              </div>
            </div>
          </div>
        )}

        <div ref={messagesEndRef} />
      </div>

      {/* Input */}
      <div className="border-t border-gray-800 p-4">
        <form onSubmit={handleSubmit} className="flex gap-3">
          <input
            value={input}
            onChange={handleInputChange}
            placeholder="Ask a question about your documents..."
            className="flex-1 bg-gray-800 text-white rounded-xl px-4 py-3 text-sm
                       border border-gray-700 focus:border-blue-500 focus:outline-none
                       placeholder-gray-500"
            disabled={isLoading}
          />
          <button
            type="submit"
            disabled={isLoading || !input.trim()}
            className="bg-blue-600 hover:bg-blue-500 disabled:bg-gray-700
                       text-white rounded-xl px-6 py-3 text-sm font-medium
                       transition-colors"
          >
            Send
          </button>
        </form>
      </div>
    </div>
  );
}

Full-Stack Architecture Diagram

Your Full-Stack RAG App — Complete Architecture

BROWSER — REACT UI
  • 📱 app/page.tsx — chat UI
  • useChat() — Vercel AI SDK hook
  • 🔄 Real-time streaming tokens
  • 📜 Full conversation history

        ↕ HTTP (SSE stream)

NEXT.JS 16 SERVER — API ROUTE
  • 🔧 app/api/chat/route.ts
  • 🗄️ Cached VectorStoreIndex (module-level)
  • 🔍 LlamaIndex QueryEngine retrieval
  • 📡 Vercel AI SDK streamText()

LOCAL STORAGE
  • 💾 ./storage/ — persisted index
  • 📂 ./data/ — your documents
  • ⚡ Loaded once at startup

OPENAI APIS
  • 🔢 Embeddings: text-embedding-3-large
  • 🤖 Generation: gpt-4o
  • 📡 Streaming via SSE

STEP 5 OF 5

Deploying to Vercel — Production in 5 Minutes

Prepare for Deployment

Before deploying, handle two important production considerations:

1. The Storage Problem on Vercel

Vercel's serverless functions are stateless — only /tmp is writable, and it's wiped on every cold start. Nothing written at runtime survives between invocations or deployments, so your ./storage folder won't persist. You have two options:

// Option A (simplest): In-memory index, re-build from documents on cold start
// ⚠️ Works for small document sets; slow cold starts for large ones

// Option B (recommended for production): Use a cloud vector store
// LlamaIndex supports: Pinecone, Qdrant, Weaviate, MongoDB Atlas, Supabase
// Example with Pinecone:
import { PineconeVectorStore } from "llamaindex/vector-store";
// Replace SimpleVectorStore with PineconeVectorStore in your storageContext
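One way to keep a single codebase working in both environments is to branch on the VERCEL environment variable, which Vercel sets in its build and runtime environments. The helper below is a sketch of that switch, not an official API:

```typescript
// Hypothetical helper: pick a persist directory based on where we're running.
// Vercel sets VERCEL=1 at build and runtime; the function filesystem is
// read-only apart from /tmp, which is wiped between cold starts.
function persistDirFor(
  env: Record<string, string | undefined> = process.env,
): string | undefined {
  if (env.VERCEL) return undefined; // in-memory (or swap in a cloud vector store)
  return "./storage";               // local dev: persist to disk as in Part A
}
```

Pass the result to storageContextFromDefaults only when it's defined, and fall back to a cloud vector store in production.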

2. Environment Variables on Vercel

# Install Vercel CLI
npm install -g vercel@latest

# Set your environment variables
vercel env add OPENAI_API_KEY
# When prompted, paste your key and select all environments (Production, Preview, Development)

Deploy

# From your project root (the Next.js app directory)
vercel

# For production deployment
vercel --prod

The CLI will:

  1. Detect it's a Next.js project
  2. Build your app
  3. Deploy to a global CDN
  4. Give you a live URL like https://rag-chatbot-xyz.vercel.app

Your Deployment Checklist

Pre-deployment Checklist

  • OPENAI_API_KEY set in Vercel environment variables
  • ./storage is in .gitignore (don't commit the index)
  • ./data is committed with your production documents
  • API route uses module-level index caching (not re-indexing per request)
  • Error handling in the API route (try/catch returns proper HTTP status codes)
  • Node.js engine pinned in package.json: "engines": {"node": ">=20.9.0"}
  • Function timeout extended in vercel.json (default is 10s; RAG needs more):
// vercel.json
{
  "functions": {
    "app/api/chat/route.ts": {
      "maxDuration": 60
    }
  }
}

The Complete Series: Your Learning Path

You've Completed the Entire Series

Part 1: LlamaIndex Introduction
The 3-phase architecture, setup, first RAG query

Part 2: Build Real RAG Systems
Loaders, chunking, QueryEngine, ChatEngine, Express API

Part 3: LlamaIndex Agents
ReAct loop, FunctionTool, RouterQueryEngine, multi-source RAG

Part 4: Production — Persistence, Streaming & Deployment
Disk persistence, streaming, Next.js 16, Vercel — you are here

Key Takeaways

  • Persistence turns a 10-minute startup into a 2-second startup. Use storageContextFromDefaults({ persistDir }) and check for existing data. Index once, load forever. This is non-negotiable for production.
  • Streaming turns a 10-second blank screen into a real-time experience. Pass stream: true to query() or chat() and iterate with for await. Use .response for QueryEngine and .delta for ChatEngine.
  • Module-level index caching is critical in Next.js. Initialize VectorStoreIndex once at module scope. Never re-initialize on every API request — that's 10x slower and costs real money on every call.
  • Use create-llama to scaffold production apps instantly. It gives you a complete, tested, production-grade setup. Customize from there instead of building from scratch.
  • Vercel needs a maxDuration of 60 seconds for RAG routes. The default 10-second timeout will kill RAG requests on cold starts. Set it in vercel.json and use a cloud vector store for persistence in production.
  • The Vercel AI SDK's useChat() + streamText() combo is the fastest path to a streaming RAG UI. It handles SSE, conversation history, error states, and loading states — all out of the box with LlamaIndex 0.12.

Try It Yourself — Final Challenge

Your Capstone Project — Build This in 1 Weekend

LEVEL 1: Personal Knowledge Base Chatbot
Load all your personal notes, bookmarks, and Obsidian/Notion exports. Deploy it on Vercel. Now you have a personal AI that knows everything you've written and learned. Add more docs anytime — just re-index.

LEVEL 2: Company Documentation Q&A
Load your company's entire wiki, HR docs, and technical docs. Add a RouterQueryEngine to separate departments (HR Engine, Engineering Engine, Product Engine). Deploy internally. Watch it eliminate 100 Slack messages a day.

LEVEL 3: Customer Support Agent with Escalation
Build an OpenAIAgent with three tools: (1) search product docs, (2) search FAQ, (3) an escalate_to_human() tool that triggers when confidence is low. Add a Pinecone or Qdrant vector store for production persistence. This is a production-grade support bot.

🎉
You've completed the full LlamaIndex series.
You went from "what is LlamaIndex?" to a deployed, streaming, persistent, production-grade RAG chatbot with agents and intelligent routing. That's the complete stack. Now go build something real.
✦ LlamaIndex 0.12 modular imports
✦ OpenAI GPT-4o + embeddings-3-large
✦ Persistent vector storage
✦ Real-time streaming
✦ Multi-source agents
✦ Next.js 16 + Vercel deployed

Mohamed Hamed

20 years building production systems — the last several deep in AI integration, LLMs, and full-stack architecture. I write what I've actually built and broken. If this was useful, the next one goes to LinkedIn first.

Follow on LinkedIn →