Most agentic RAG latency problems aren't in the retrieval logic. They're in the three network round-trips between your embedding API, your vector database, and your LLM API. Fix the hardware layer and the latency problem largely solves itself. The RAG pipeline bare metal case study shows exactly this: one company cut p99 latency from 1.8 seconds to 190ms by colocating all three components on the same GPU server. For the VRAM math behind agent workloads specifically, the GPU infrastructure for AI agents guide covers memory planning in detail.
For corpora where cross-document reasoning matters more than per-chunk retrieval, the GraphRAG deployment guide covers the full knowledge graph indexing and inference pipeline.
This guide focuses on the infrastructure layer: how to plan GPU memory, what components to colocate, and how to get the full embedding, vector search, and LLM stack running on one node.
What Is Agentic RAG and Why It Needs Dedicated GPU Infrastructure
Standard RAG does one thing: retrieve relevant documents, inject them into context, generate a response. One retrieval pass. One LLM call.
Agentic RAG is different. The model decides what to retrieve, evaluates the results, and may retrieve again if the context is insufficient. A typical agentic RAG loop runs 3-7 LLM calls per user turn: initial planning, retrieval, re-ranking, reflection, and possibly a follow-up retrieval pass before generating the final answer. For agents that also need to remember facts across sessions - not just retrieve from a document corpus - the agent memory infrastructure guide covers how to add Mem0 or Zep to the same GPU node.
At 300ms per LLM call via a managed API, three calls is 900ms before you've generated a single output token. Seven calls is 2.1 seconds. That's before network latency to your embedding API (200-400ms p99) and your vector database (50-250ms p99).
GPU colocation solves this. When your embedding model, vector index, and LLM all live on the same GPU server:
- Query encoding drops from 200-400ms to 2-5ms
- Vector search drops from 50-250ms to 1-3ms
- LLM TTFT drops from 600-1500ms to 30-80ms
The network round-trip overhead disappears because no request leaves the machine. Components communicate over the loopback interface and local memory, not the network.
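The latency arithmetic above can be sketched in a few lines. This is a rough model that assumes calls run strictly one after another; the function names are illustrative, and the inputs are the 300ms per-call figure and the colocated upper bounds quoted in the text.

```python
def llm_latency_per_turn_ms(llm_calls: int, ms_per_call: float) -> float:
    """Sequential LLM latency for one user turn (calls run back to back)."""
    return llm_calls * ms_per_call


def retrieval_latency_ms(passes: int, embed_ms: float, search_ms: float) -> float:
    """Each retrieval pass encodes the query, then searches the index."""
    return passes * (embed_ms + search_ms)


# Managed API: 300 ms per LLM call, as quoted in the text.
short_loop_ms = llm_latency_per_turn_ms(3, 300)  # three-call loop
long_loop_ms = llm_latency_per_turn_ms(7, 300)   # seven-call loop

# Colocated upper bounds: 5 ms embedding + 3 ms vector search per pass.
two_pass_retrieval_ms = retrieval_latency_ms(2, 5, 3)

print(short_loop_ms, long_loop_ms, two_pass_retrieval_ms)
```

Even with two retrieval passes, colocated retrieval adds only a few milliseconds, so the LLM calls dominate the per-turn budget.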
For multi-agent systems where many agents share the same retrieval infrastructure, see multi-agent AI system GPU infrastructure for orchestration patterns.
The GPU-Accelerated RAG Stack: Three Components, One Node
Embedding Model
Embedding throughput on GPU is not a minor improvement over CPU. stella_en_1.5B_v5 on an H100 encodes tens of thousands of sentences per second. On CPU, that number is closer to 500. If your agentic loop re-encodes the query at each retrieval step, CPU-based embedding becomes a hard bottleneck at any meaningful concurrency.
Model options by VRAM requirement:
| Model | Params | VRAM (FP16) | Notes |
|---|---|---|---|
| BGE-M3 | 0.5B | ~1 GB | Multilingual, good for hybrid search |
| stella_en_1.5B_v5 | 1.5B | ~3 GB | Strong benchmark scores, low VRAM |
| E5-mistral-7b-instruct | 7B | ~14 GB | Best quality, expensive in VRAM |
VRAM figures above are model weight storage estimates. Actual inference VRAM will be higher due to KV cache, activations, and framework overhead.
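The weight-storage column follows from parameter count times bytes per parameter. A minimal sketch, assuming decimal gigabytes and the usual byte widths per precision (the helper name is illustrative):

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1, "int8": 1}


def weight_vram_gb(params_billions: float, precision: str = "fp16") -> float:
    """Weight storage only; KV cache, activations, and overhead come on top."""
    return params_billions * BYTES_PER_PARAM[precision]


print(weight_vram_gb(0.5))  # BGE-M3
print(weight_vram_gb(1.5))  # stella_en_1.5B_v5
print(weight_vram_gb(7))    # E5-mistral-7b-instruct
```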
For document corpora that include PDFs, slides, or scanned images, see the ColPali and multimodal document RAG guide for a visual retrieval approach that skips OCR entirely.
Since vLLM v0.6.4, you can serve embedding models directly through vLLM using --task embed. This gives you a unified serving stack for both embeddings and generation, using the same OpenAI-compatible API surface.
GPU-Accelerated Vector Search
Two main options:
FAISS-GPU (faiss-gpu-cu12): in-memory exact or approximate nearest neighbor search on CUDA. For 10M vectors at 1536 dimensions, FAISS-GPU on an H100 returns results in single-digit milliseconds. The same search on CPU takes around 80ms. FAISS is purely in-memory, so it's the fastest option for real-time serving.
VRAM for a flat (uncompressed) FAISS index is deterministic: vectors × dimensions × 4 bytes per float32 value. For 1M vectors at 768 dimensions: 1,000,000 × 768 × 4 bytes ≈ 3.07 GB. Add roughly 20% for GPU runtime overhead when sizing your instance.
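The sizing rule can be checked with a small helper. This sketch assumes a flat float32 index and decimal gigabytes; the function name and the default 20% overhead parameter simply mirror the rule of thumb above.

```python
def flat_index_vram_gb(num_vectors: int, dim: int, overhead: float = 0.20) -> tuple[float, float]:
    """Return (raw_gb, padded_gb) for a flat float32 index.

    raw_gb is vectors x dimensions x 4 bytes; padded_gb adds runtime overhead.
    """
    raw_gb = num_vectors * dim * 4 / 1e9
    return raw_gb, raw_gb * (1 + overhead)


for n, d in [(1_000_000, 768), (5_000_000, 1536), (10_000_000, 1536)]:
    raw, padded = flat_index_vram_gb(n, d)
    print(f"{n:>10,} x {d}: {raw:.2f} GB raw, {padded:.2f} GB with overhead")
```

The three cases correspond to the ~3 GB, ~31 GB, and ~61 GB index sizes used in the memory planning tables.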
Milvus with GPU indexing: persistent vector database with optional GPU-accelerated index building. Index build on GPU is 10-50x faster than CPU. Good for multi-tenant workloads or corpora that update frequently, at the cost of higher operational complexity.
For most agentic RAG deployments with a single corpus under 50M vectors, FAISS-GPU in-memory is the right choice. For multi-tenant or frequently updated corpora, use Milvus with GPU indexing. For a full Milvus CAGRA configuration and production sharding guide, see self-hosting vector databases with GPU acceleration.
LLM Inference Server
vLLM is the default: OpenAI-compatible API, continuous batching, PagedAttention for KV cache management. For agentic RAG specifically, you need a model with reliable structured output and function-calling support. Llama 3.3 70B Instruct and Qwen3-8B both work well. Mistral 7B v0.3 is a solid choice if VRAM is tight.
For the serving setup details, see LLM serving optimization: continuous batching and PagedAttention. For VRAM sizing by model, see the GPU memory requirements for LLMs guide.
GPU Memory Planning: Fitting the Full Stack on One Node
This is where most agentic RAG deployments get into trouble. Each component looks manageable in isolation. Together, they compete for the same VRAM pool.
Full VRAM breakdown by component:
| Component | Model Example | VRAM (FP16) | VRAM (FP8/INT8) |
|---|---|---|---|
| Embedding | stella_en_1.5B_v5 | 3 GB | 1.5 GB |
| Embedding | E5-mistral-7b | 14 GB | 7 GB |
| Vector Index | FAISS-GPU, 1M vectors (768-dim, float32) | ~3 GB | N/A |
| Vector Index | FAISS-GPU, 5M vectors (1536-dim, float32) | ~31 GB | N/A |
| Vector Index | FAISS-GPU, 10M vectors (1536-dim, float32) | ~61 GB | N/A |
| LLM | Llama 3.1 8B | 16 GB | 8 GB |
| LLM | Mistral 7B v0.3 | 14 GB | 7 GB |
| LLM | Llama 3.3 70B | ~140 GB | ~70 GB |
| KV Cache buffer | 50 concurrent agentic sessions | 10-20 GB | varies |
VRAM figures for embedding and LLM rows are model weight storage estimates. Actual inference VRAM is higher due to KV cache, activations, and framework overhead. Vector index rows are flat float32 indexes, so the precision columns do not apply to them.
Three practical configurations:
Option 1: L40S 48GB, budget stack
- Embedding: stella_en_1.5B_v5 FP16 (3 GB)
- Vector index: FAISS-GPU, up to 1M vectors at 768-dim (~3 GB)
- LLM: Llama 3.1 8B FP8 (8 GB)
- Committed: ~14 GB. Remaining ~34 GB for KV cache.
- Best for: up to 20 concurrent agents, small-to-mid corpora (up to 1M vectors, one per indexed chunk)
Option 2: H100 80GB, single GPU standard stack
- Embedding: stella_en_1.5B_v5 FP16 (3 GB)
- Vector index: FAISS-GPU, up to 5M vectors at 1536-dim (~31 GB)
- LLM: Llama 3.1 8B FP8 (8 GB)
- Committed: ~42 GB. Remaining ~38 GB for KV cache.
- Best for: up to 50 concurrent agents, enterprise corpora (up to 5M vectors, one per indexed chunk)
Option 3: 4x H100 80GB, production stack with large corpus
- Embedding: E5-mistral-7b FP16 on GPU-0 (14 GB)
- Vector index: FAISS-GPU, 10M vectors at 1536-dim on GPU-1 (~61 GB, dedicated GPU)
- LLM: Llama 3.3 70B FP8, tensor-parallel across GPU-2 and GPU-3 (~35 GB per GPU)
- GPU-0 committed: ~14 GB. Remaining ~66 GB for activations and embedding KV cache.
- GPU-1 committed: ~61 GB. Remaining ~19 GB covers FAISS runtime overhead (~20%).
- GPU-2/GPU-3 committed: ~35 GB each. Remaining ~45 GB per GPU for LLM KV cache.
- Best for: high-concurrency production, large corpora (up to 10M vectors, one per indexed chunk)
FAISS is isolated to its own GPU because 10M vectors at 1536-dim (~61 GB) plus E5-mistral-7b FP16 (14 GB) would total ~75 GB before FAISS runtime overhead, which leaves insufficient headroom on a single 80 GB card.
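The per-option arithmetic can be reproduced with a small budgeting helper. This is a sketch, not a sizing tool: the function name is hypothetical, the component sizes are the weight and index estimates listed above, and the 85% fill ceiling is the rule of thumb from this guide's setup checklist.

```python
def plan_vram(gpu_gb: float, components_gb: dict[str, float], max_fill: float = 0.85) -> dict:
    """Sum committed VRAM and report what is left for KV cache."""
    committed = sum(components_gb.values())
    return {
        "committed_gb": committed,
        "remaining_gb": gpu_gb - committed,
        # True when weights plus index stay under the fill ceiling.
        "fits": committed <= gpu_gb * max_fill,
    }


option_1 = plan_vram(48, {"stella_fp16": 3, "faiss_1m_768": 3, "llama_8b_fp8": 8})
option_2 = plan_vram(80, {"stella_fp16": 3, "faiss_5m_1536": 31, "llama_8b_fp8": 8})
# The layout Option 3 avoids: 10M-vector index plus E5-mistral-7b on one 80 GB card.
crowded = plan_vram(80, {"e5_mistral_fp16": 14, "faiss_10m_1536": 61})

print(option_1, option_2, crowded, sep="\n")
```

The last case shows why the large index gets a dedicated GPU: 75 GB committed already exceeds the ceiling before any runtime overhead.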
NVIDIA Agentic RAG Toolkit: What It Is and How to Use It
NVIDIA released an official agentic RAG reference architecture as part of their AI Blueprint program in 2025-2026. The toolkit bundles NVIDIA NIM microservices for LLM and embedding inference, NVIDIA cuVS (a GPU-accelerated vector search library integrated into FAISS since v1.10) for vector search, and LangChain/LlamaIndex integration layers.
The architecture: NIM serves both the embedding model and LLM via OpenAI-compatible APIs. cuVS handles vector search on the GPU. The integration layer wires them together with agent orchestration.
To run this on Spheron, pull the NIM containers, provision an H100 instance, and configure cuVS as the vector backend. See the NVIDIA NIM self-hosted deployment guide for the step-by-step NIM setup.
One important note: NVIDIA NIM requires NVIDIA AI Enterprise licensing for production use. For teams running open models without enterprise licensing, the open-source stack below (vLLM + FAISS-GPU) avoids that overhead entirely and matches or exceeds NIM performance on most workloads.
Step-by-Step: Deploy an Agentic RAG Pipeline on Spheron
Step 1: Provision a GPU Instance
Choose your GPU based on the VRAM table above. H100 PCIe for a standard stack, L40S for budget. Provision via the Spheron console at app.spheron.ai or the CLI.
# H100 PCIe gives you 80GB VRAM and full root access
# L40S gives you 48GB VRAM at roughly 3x lower hourly cost
Step 2: Install the Stack
pip install vllm faiss-gpu-cu12 sentence-transformers langchain-community langchain-openai requests
Note: faiss-gpu-cu12 is the correct package for CUDA 12.x. The generic faiss-gpu package may silently install CPU-only FAISS depending on your environment. Check your CUDA version with nvcc --version before installing. langchain-openai and requests are needed by the Python example in Step 6.
Step 3: Start the Embedding Server
For a standalone guide to deploying TEI embedding and reranker servers on GPU cloud (including BGE-M3, Qwen3-Embedding, Jina v4, and cross-encoder rerankers), see the TEI production deployment guide.
vllm serve dunzhang/stella_en_1.5B_v5 \
  --task embed \
  --port 8001 \
  --dtype bfloat16
This exposes an OpenAI-compatible embeddings endpoint at http://localhost:8001/v1/embeddings. vLLM reserves 90% of GPU memory by default, so if this server shares a GPU with the LLM server, lower its share with --gpu-memory-utilization. You can test it immediately:
curl http://localhost:8001/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "dunzhang/stella_en_1.5B_v5", "input": "test query"}'
Step 4: Build the FAISS-GPU Index
import faiss
import numpy as np

# Load document vectors (replace with your actual embeddings)
vectors = np.ascontiguousarray(np.load("document_embeddings.npy"), dtype="float32")
d = vectors.shape[1]  # embedding dimension, read from the data so it matches the embedding model

res = faiss.StandardGpuResources()
# IndexFlatIP on L2-normalized vectors gives cosine similarity
index_flat = faiss.IndexFlatIP(d)
gpu_index = faiss.index_cpu_to_gpu(res, 0, index_flat)  # 0 = GPU device id

faiss.normalize_L2(vectors)  # normalizes in place
gpu_index.add(vectors)
print(f"Index ready: {gpu_index.ntotal} vectors on GPU")
For persistence between restarts, serialize the index to CPU, save to disk, and reload:
# Save
cpu_index = faiss.index_gpu_to_cpu(gpu_index)
faiss.write_index(cpu_index, "vectors.index")
# Load (on next start)
cpu_index = faiss.read_index("vectors.index")
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)
Step 5: Start the LLM Server
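vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 2 \
  --port 8000 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.90
--enable-auto-tool-choice and --tool-call-parser llama3_json enable structured tool calls, which agentic multi-hop retrieval depends on. --enable-prefix-caching reuses KV cache for shared system prompts across agent turns, cutting prefill cost significantly for multi-hop queries. --quantization fp8 quantizes the weights to FP8 at load time, and --tensor-parallel-size 2 splits the 70B model across two H100s, matching the 2x H100 sizing for 70B-class models. On a single GPU, serve an 8B model and drop that flag.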
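To make the tool-call dependency concrete, here is a sketch of a retrieval tool definition in the OpenAI-compatible function-calling format, plus a parser for the call the model sends back. The tool name search_corpus, its parameters, and the helper are hypothetical and only illustrate the shape of the exchange.

```python
import json

# Tool schema passed in the "tools" field of a chat completion request.
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "search_corpus",  # hypothetical tool name
        "description": "Retrieve the top-k chunks for a refined search query.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Refined search query"},
                "k": {"type": "integer", "description": "Number of chunks to return"},
            },
            "required": ["query"],
        },
    },
}


def parse_search_call(tool_call: dict) -> tuple[str, int]:
    """Extract (query, k) from one entry of message.tool_calls."""
    function = tool_call["function"]
    if function["name"] != "search_corpus":
        raise ValueError(f"unexpected tool: {function['name']}")
    args = json.loads(function["arguments"])  # arguments arrive as a JSON string
    return args["query"], int(args.get("k", 5))
```

When the model decides the context is insufficient, it returns a tool call like this instead of a final answer, and the agent loop runs another retrieval pass with the refined query.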
Step 6: Wire the Agentic Layer
A minimal Python example using LangChain:
import faiss
import numpy as np
import requests
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI

EMBED_URL = "http://localhost:8001/v1/embeddings"
LLM_URL = "http://localhost:8000/v1"
EMBED_MODEL = "dunzhang/stella_en_1.5B_v5"
LLM_MODEL = "meta-llama/Llama-3.3-70B-Instruct"

def encode_query(text: str) -> np.ndarray:
    """Embed one query and return a normalized (1, d) float32 array."""
    resp = requests.post(
        EMBED_URL,
        json={"model": EMBED_MODEL, "input": text},
        timeout=10,
    )
    resp.raise_for_status()
    vec = np.array(resp.json()["data"][0]["embedding"], dtype="float32").reshape(1, -1)
    faiss.normalize_L2(vec)  # in place, so inner product equals cosine similarity
    return vec

def retrieve(query: str, gpu_index, doc_texts: list[str], k: int = 5) -> list[str]:
    """Return the text of the top-k chunks for a query."""
    vec = encode_query(query)
    _, indices = gpu_index.search(vec, k)
    # FAISS pads with -1 when fewer than k results exist
    return [doc_texts[i] for i in indices[0] if i != -1]

llm = ChatOpenAI(
    base_url=LLM_URL,
    model=LLM_MODEL,
    api_key="ignored",  # vLLM does not require auth by default
)

def agentic_rag(user_query: str, gpu_index, doc_texts: list[str]) -> str:
    # First retrieval pass
    chunks = retrieve(user_query, gpu_index, doc_texts)
    context = "\n\n".join(chunks)
    messages = [
        SystemMessage(content="You are a helpful assistant. Use the provided context to answer questions. If the context is insufficient, say so."),
        HumanMessage(content=f"Context:\n{context}\n\nQuestion: {user_query}"),
    ]
    return llm.invoke(messages).content
For multi-hop retrieval, extend this to loop: after the first LLM call, check if the model signals insufficient context (via a tool call or a structured flag), then retrieve again with a refined query.
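One way that loop could look is sketched below. It is deliberately decoupled from FAISS and vLLM: retrieve_fn and ask_fn are hypothetical callables you would wrap around the retrieval function and the LLM call, and the reply format (a dict with either "answer" or "refined_query") is an assumption chosen for illustration.

```python
from typing import Callable


def multi_hop_rag(
    question: str,
    retrieve_fn: Callable[[str], list[str]],
    ask_fn: Callable[[str, list[str]], dict],
    max_hops: int = 3,
) -> str:
    """Retrieve, ask, and retrieve again while the model reports missing context."""
    query = question
    context: list[str] = []
    for _ in range(max_hops):
        for chunk in retrieve_fn(query):
            if chunk not in context:  # skip chunks seen in earlier hops
                context.append(chunk)
        reply = ask_fn(question, context)
        if "answer" in reply:
            return reply["answer"]
        query = reply["refined_query"]  # model asked for another pass
    return f"Insufficient context after {max_hops} retrieval passes."
```

Capping the hop count bounds worst-case latency: each extra hop adds one retrieval pass and one LLM call.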
Step 7: Verify Latency Baseline
# Time a single LLM request end-to-end (full 50-token completion, not just TTFT)
time curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 50,
    "stream": false
  }'
Target benchmarks for a healthy colocated stack:
- Embedding latency: under 5ms per query
- FAISS search (1M vectors): under 3ms
- LLM TTFT (70B FP8, short prompt): under 80ms
- Full single-hop pipeline: under 120ms
- Full two-hop agentic pipeline: under 200ms
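A single timed request says little about tail latency, so it can help to time each stage repeatedly and compare percentiles against the targets. The sketch below uses only the standard library; the helper names are illustrative, and the stage you pass in (an embedding call, a FAISS search, a full pipeline call) is up to you.

```python
import math
import time
from typing import Callable


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a list of measurements."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_stage_ms(stage: Callable[[], object], runs: int = 100) -> list[float]:
    """Call a stage repeatedly and return per-call wall-clock times in ms."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        stage()
        timings.append((time.perf_counter() - start) * 1000)
    return timings
```

For example, `percentile(time_stage_ms(lambda: retrieve(query, gpu_index, doc_texts)), 99)` would give a p99 for the retrieval stage, assuming the `retrieve` function from Step 6 is in scope.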
Latency Optimization: Hitting Sub-200ms Time to First Token
Here's why colocated GPU beats managed APIs on latency at every percentile:
| Component | Managed API (p99) | Colocated GPU (p99) |
|---|---|---|
| Query embedding | 200-400ms | 2-5ms |
| Vector search | 50-250ms | 1-3ms |
| LLM TTFT (8B FP8) | 600-1500ms | 30-80ms |
| Total (sequential) | 850-2150ms | 33-88ms |
Four optimizations that push you further:
1. Async embedding and retrieval. Start FAISS search as soon as the embedding result is ready. Don't wait on an unrelated pipeline step. Python asyncio with httpx handles this cleanly.
2. Prefill batching. vLLM's --max-num-batched-tokens controls how many tokens are processed per scheduler step. Set to 8192 or 16384 for high-throughput workloads where multiple agent turns arrive simultaneously.
3. KV cache prefix reuse. For agentic workloads where many requests share the same system prompt, --enable-prefix-caching in vLLM eliminates repeated prefill computation. On a typical 70B model with a 2K-token system prompt, this cuts TTFT by 30-50% for every request after the first.
4. Speculative decoding. For agentic structured output patterns (JSON tool calls, fixed-format responses), a smaller draft model accelerates generation. vLLM supports this natively.
For a deep dive on these vLLM parameters, see continuous batching and PagedAttention optimization.
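The first optimization, async embedding and retrieval, can be sketched with asyncio alone. The callables here are placeholders: in a real deployment embed_fn would likely wrap an httpx.AsyncClient request to the embedding server and search_fn would wrap gpu_index.search, but those bindings are assumptions and are not shown.

```python
import asyncio
from typing import Awaitable, Callable


async def retrieve_async(
    query: str,
    embed_fn: Callable[[str], Awaitable[list[float]]],
    search_fn: Callable[[list[float]], list[str]],
) -> list[str]:
    """Embed the query, then search as soon as the vector is ready."""
    vector = await embed_fn(query)
    # The vector search is a blocking call, so run it in a worker thread.
    return await asyncio.to_thread(search_fn, vector)


async def prepare_turn(
    query: str,
    embed_fn: Callable[[str], Awaitable[list[float]]],
    search_fn: Callable[[list[float]], list[str]],
    other_step: Callable[[], Awaitable[object]],
) -> tuple[list[str], object]:
    """Run retrieval concurrently with an unrelated pipeline step."""
    chunks, other = await asyncio.gather(
        retrieve_async(query, embed_fn, search_fn),
        other_step(),
    )
    return chunks, other
```

Here other_step stands for any work that does not depend on retrieval, such as loading conversation history; gather lets it overlap with the embed-then-search chain instead of running before or after it.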
Scaling Patterns: Multi-Agent RAG with Concurrent GPU Sessions
VRAM limits concurrent sessions through KV cache saturation, not compute capacity. The formula:
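max_sessions = available_VRAM / (2 × n_layers × n_kv_heads × head_dim × context_length × precision_bytes)
The leading 2 accounts for the separate key and value tensors. For Llama 3.3 70B FP8 on 2x H100 with ~53GB KV cache headroom: roughly 80-120 concurrent sessions before KV cache eviction kicks in. That range corresponds to about 2.7K-4K tokens of context per session with a 1-byte (FP8) KV cache, or half that context with a 2-byte (FP16) cache; longer contexts reduce the session count proportionally. Llama 3.3 70B uses grouped-query attention (GQA) with n_kv_heads=8 (not the 64 query heads), so use n_kv_heads in this formula.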
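A direct translation of the formula is shown below as a sketch. The Llama 3.3 70B values of 80 layers and a head dimension of 128 come from the model's published configuration rather than from this guide, and the context lengths are illustrative assumptions.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, precision_bytes: int) -> int:
    """KV cache bytes for one token: keys and values (the factor 2) across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * precision_bytes


def max_sessions(available_vram_bytes: int, context_length: int, per_token_bytes: int) -> int:
    """Whole sessions that fit when every session holds context_length tokens."""
    return available_vram_bytes // (per_token_bytes * context_length)


# Llama 3.3 70B with a 1-byte (FP8) KV cache.
llama70b_fp8 = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, precision_bytes=1)
headroom = 53 * 10**9  # ~53 GB of KV cache headroom

print(llama70b_fp8)                              # bytes per token
print(max_sessions(headroom, 4096, llama70b_fp8))  # 4K-token sessions
print(max_sessions(headroom, 2700, llama70b_fp8))  # 2.7K-token sessions
```

Using the 64 query heads instead of the 8 KV heads would overestimate per-token memory by 8x and understate capacity by the same factor.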
Three scaling patterns:
Single large GPU. One H100 serves all components. Simple to operate, one failure point. Right choice for under 50 concurrent agents.
GPU per component. Dedicated GPU for embedding, dedicated GPU for FAISS, dedicated GPU(s) for the LLM. Each component scales independently. Embedding and FAISS have small, predictable memory footprints and short per-request work, while the LLM tier is limited by compute and KV cache capacity, so each tier benefits from its own scaling lever.
Horizontal LLM scaling. Multiple vLLM instances behind a load balancer, each serving the same model. Embedding and FAISS remain on a single shared GPU since their VRAM footprint is modest and they're fast. The LLM tier scales horizontally to match agent concurrency.
For the orchestration layer on top of any of these patterns, see multi-agent AI system GPU infrastructure.
Cost Analysis: Colocated GPU Stack vs Separate Managed APIs
Managed API baseline per 1 million RAG queries (assuming a 256-token query for embedding, about 1.5K LLM input tokens per query made up of 1K tokens of retrieved context plus system prompt and question, and a 500-token LLM output):
| Service | Basis | Cost per 1M queries |
|---|---|---|
| Embedding (OpenAI text-embedding-3-small) | $0.02/1M tokens, 256 tokens/query | $5 |
| Vector DB (Pinecone serverless) | ~$0.08/1K reads | $80 |
| LLM (GPT-4o-mini, 1.5K input + 500 output) | $0.15/$0.60 per M tokens | ~$525 |
| Total | | ~$610 |
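The token-priced rows can be reproduced with simple arithmetic. In this sketch the helper name is illustrative and the per-token prices are the ones quoted in the table. With exactly 1,000 input tokens the LLM row works out to $450; the ~$525 figure corresponds to roughly 1,500 input tokens per query.

```python
def token_cost_usd(tokens_per_query: int, usd_per_million_tokens: float, queries: int = 1_000_000) -> float:
    """Cost of a token-priced service for a given number of queries."""
    return tokens_per_query * queries * usd_per_million_tokens / 1_000_000


embedding_cost = token_cost_usd(256, 0.02)
llm_cost_1k_input = token_cost_usd(1000, 0.15) + token_cost_usd(500, 0.60)
llm_cost_1_5k_input = token_cost_usd(1500, 0.15) + token_cost_usd(500, 0.60)

print(round(embedding_cost, 2), round(llm_cost_1k_input, 2), round(llm_cost_1_5k_input, 2))
```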
Spheron colocated GPU stack:
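Throughput assumptions: 10 queries/second sustained for a single H100 running an 8B model (a conservative number), 5 queries/second for the L40S, and 3 queries/second for an H100 running the 70B FP8 model: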
| GPU | Hourly rate | Queries/hr | Cost per 1M queries |
|---|---|---|---|
| L40S PCIe (on-demand) | $0.72 | ~18,000 | ~$40 |
| H100 PCIe (on-demand) | $2.11 | ~36,000 | ~$59 |
| H100 PCIe (on-demand, 70B FP8) | $2.11 | ~10,800 | ~$195 |
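The 70B row is the closest apples-to-apples comparison against GPT-4o-mini-class quality. Even at that throughput, colocated GPU is roughly 3x cheaper per query than managed APIs. That row prices a single H100; with the 2x H100 tensor-parallel layout recommended for 70B-class models, the hourly rate doubles and the cost rises to about $390 per 1M queries at the same throughput. At higher concurrency (50+ agents), the gap widens because managed APIs price per query while your GPU cost stays fixed.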
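The GPU rows follow from hourly rate divided by hourly throughput. A minimal sketch, using the rates and queries-per-hour figures from the table (the function name is illustrative):

```python
def gpu_cost_per_million(hourly_usd: float, queries_per_second: float) -> float:
    """Cost per 1M queries for a GPU billed by the hour at sustained throughput."""
    queries_per_hour = queries_per_second * 3600
    return hourly_usd / queries_per_hour * 1_000_000


l40s_8b = gpu_cost_per_million(0.72, 5)    # ~18,000 queries/hr
h100_8b = gpu_cost_per_million(2.11, 10)   # ~36,000 queries/hr
h100_70b = gpu_cost_per_million(2.11, 3)   # ~10,800 queries/hr

print(round(l40s_8b), round(h100_8b), round(h100_70b))
```

Because the hourly rate is fixed, doubling sustained throughput halves the per-query cost, which is why utilization matters more than the sticker price.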
Pricing fluctuates based on GPU availability. The prices above are as of 10 Apr 2026 and may have changed. Check current GPU pricing for live rates.
For a broader cost analysis framework, see AI inference cost economics 2026 and serverless GPU vs on-demand vs reserved.
Running a RAG pipeline at scale means paying per query to three separate APIs, or owning the full stack on dedicated GPU hardware. Spheron's bare-metal H100 and L40S instances let you colocate your embedding model, vector index, and LLM on one node, cutting round-trip latency by up to 20x and cost by 60-70% compared to managed services.
Quick Setup Guide
1. List each component (embedding model, vector index, LLM) and its VRAM requirement. Add a KV cache buffer sized by max concurrent sessions times context length. Pick a GPU where the total fits within 85% of VRAM capacity.
2. Provision an H100 PCIe (80GB) or L40S (48GB) instance through the Spheron console or CLI. Run: pip install vllm faiss-gpu-cu12 sentence-transformers langchain-community langchain-openai requests
3. Start the embedding server: vllm serve dunzhang/stella_en_1.5B_v5 --task embed --port 8001 --dtype bfloat16. This exposes an OpenAI-compatible embeddings endpoint on port 8001.
4. Load FAISS-GPU, create a StandardGpuResources object, and build an IndexFlatIP index on the GPU. Add your document vectors. For persistence, serialize the CPU index to disk and reload to GPU on startup.
5. Start vLLM with tool-call support: vllm serve meta-llama/Llama-3.3-70B-Instruct --quantization fp8 --tensor-parallel-size 2 --port 8000 --max-model-len 32768 --enable-auto-tool-choice --tool-call-parser llama3_json
6. Use LangChain or LlamaIndex to connect the embedding server (port 8001), FAISS index, and LLM server (port 8000). Encode the user query, search FAISS for top-k chunks, inject them into the LLM context, then call the LLM with tool-use enabled for multi-hop retrieval.
7. Measure time to first token (TTFT) for single-hop queries. Enable prefix caching (--enable-prefix-caching) for multi-hop queries sharing a system prompt. Target under 200ms total latency: embedding under 5ms, FAISS search under 3ms, LLM TTFT under 80ms.
Frequently Asked Questions
What GPU do I need to run an agentic RAG pipeline?
At minimum, a single GPU with 48GB+ VRAM covers a full stack for small-to-mid corpora: embedding model (1-14GB), FAISS-GPU vector index (3-61GB depending on corpus size and dimension), and an 8B FP8 LLM (8GB) with room for KV cache. An L40S (48GB) works for up to ~20 concurrent agents. An H100 80GB handles enterprise workloads with 50+ concurrent sessions using an 8B FP8 LLM. For 70B-class models, 2x H100 with tensor parallelism is the right starting point.
How much VRAM does a full agentic RAG stack need?
Budget at least 14GB for a minimal stack (Stella 1.5B embedding + FAISS 1M vectors + Llama 3.1 8B FP8). A production stack with 5M-vector FAISS index and Llama 3.1 8B FP8 needs ~42GB, leaving ~38GB on a single H100 80GB for KV cache. For 70B-class models, plan for 2x H100: Llama 3.3 70B FP8 alone requires ~70GB of model weights.
How is agentic RAG different from standard RAG?
Standard RAG does one retrieval pass per user query. Agentic RAG chains 3-7 LLM calls per turn, with the model deciding what to retrieve next, re-ranking results, calling tools, and reflecting on outputs. This multiplies GPU memory pressure (longer KV caches), increases network sensitivity (more round trips), and raises throughput requirements substantially.
Can the embedding model, vector index, and LLM run on a single GPU?
Yes, with careful VRAM planning. On an H100 80GB, you can colocate Stella 1.5B (3GB), a FAISS-GPU index for up to 5M 1536-dim vectors (~31GB), and Llama 3.1 8B FP8 (8GB), leaving about 38GB for KV cache. For 70B-class models, use 2x H100 with tensor parallelism. The key is sizing each component before provisioning.
How does a colocated GPU stack compare to managed APIs on cost and latency?
Managed APIs (OpenAI embeddings, Pinecone, GPT-4o-mini) cost roughly $610 per million RAG queries. A single H100 PCIe on Spheron at $2.11/hr running a colocated stack costs $50-100 per million queries at moderate throughput, depending on query complexity and concurrency. Beyond cost, colocated GPU removes network round trips between components, cutting p99 latency from 850-2150ms to under 90ms.
