Retrieval-Augmented Generation is the most common production AI pattern today. Nearly every company deploying LLMs uses some form of RAG — retrieving relevant documents, injecting them into context, and generating grounded responses. The pattern is well-understood. The infrastructure to run it at scale is not.
This case study documents how a B2B SaaS company serving enterprise knowledge management moved their RAG pipeline from managed API services to self-hosted bare metal GPUs. They serve 2 million queries per day across 340 enterprise customers, each with isolated document collections ranging from 10K to 5M documents.
The migration took 6 weeks and the results were significant: p99 latency dropped from 1.8 seconds to 190 milliseconds, monthly infrastructure cost fell from $47,000 to $15,200, and they eliminated their dependency on three separate managed services.
The Problem with Managed RAG
Before the migration, their RAG stack looked like most production setups: a managed vector database for retrieval, a managed embedding API for document ingestion and query encoding, and a managed LLM API for generation.
The architecture worked at small scale. At 2M queries/day, the problems compounded.
Latency was unpredictable. Each query required three sequential network round-trips: query embedding (managed embedding API, ~80ms average but ~400ms p99), vector search (managed vector DB, ~30ms average but ~250ms p99), and LLM generation (managed LLM API, ~800ms average but ~2.5s p99). The p99 for the full pipeline exceeded 3 seconds during peak hours. Their enterprise customers required sub-500ms p95 latency in their SLA.
Cost scaled linearly with queries. Managed APIs charge per-token or per-query. At 2M queries/day with an average context window of 8K tokens (retrieved documents + query + system prompt), their monthly LLM API bill alone was $31,000. The embedding API added $6,000/month and the vector database $10,000/month.
Multi-tenancy was fragile. Each enterprise customer's documents were isolated in separate vector database collections. Managing 340 collections across a managed service meant dealing with collection-level rate limits, inconsistent index performance as collections grew, and no control over how the managed service prioritized their queries versus other customers.
The Target Architecture
They designed a self-hosted pipeline that co-locates all three components — embedding, retrieval, and generation — on the same bare metal servers. The key insight: by eliminating network round-trips between components, each query touches GPU memory once and flows through the entire pipeline without leaving the machine.
Hardware Configuration
| Component | Configuration |
|---|---|
| Generation servers | 2x bare metal, each 4x H100 80GB SXM |
| Embedding + retrieval server | 1x bare metal, 2x H100 80GB SXM |
| Total GPUs | 10x H100 80GB |
| Storage | 4 TB NVMe per server (local) + 20 TB NAS (shared) |
| Network | 100 Gbps between servers |
Software Stack
| Layer | Technology | Role |
|---|---|---|
| Embedding model | Stella_en_1.5B_v5 (1.5B params) | Document + query encoding |
| Vector index | FAISS (GPU-accelerated) | Similarity search |
| LLM | Qwen 3.5 32B (INT4 quantized) | Response generation |
| Serving framework | vLLM | LLM inference with continuous batching |
| Orchestration | Custom Python service | Pipeline coordination |
| Load balancer | Nginx | Request distribution |
The generation model choice was deliberate. Qwen 3.5 32B with INT4 quantization uses approximately 16 GB of VRAM per instance. On a 4x H100 server, they run 4 independent vLLM instances — one per GPU — each handling a portion of incoming queries. This "scale out" approach (multiple small instances) provides better p99 latency than a single large instance using tensor parallelism, because each instance serves fewer concurrent requests and maintains a smaller KV cache.
Implementation Details
Embedding Pipeline
Document ingestion and query encoding run on the dedicated embedding server. The Stella model at 1.5B parameters uses approximately 6 GB VRAM per instance (FP16), allowing them to run multiple encoding instances on the 2x H100 setup.
The embedding pipeline handles two workloads. Batch ingestion runs during off-peak hours (midnight to 6 AM), processing new and updated documents from customer uploads. Real-time query encoding runs continuously, converting incoming search queries to embedding vectors.
```python
# Embedding service configuration
from sentence_transformers import SentenceTransformer
import torch


class EmbeddingService:
    def __init__(self, model_name="dunzhang/stella_en_1.5B_v5", gpu_id=0):
        self.model = SentenceTransformer(model_name)
        self.model = self.model.to(f"cuda:{gpu_id}")
        self.model.eval()

    def encode_queries(self, queries: list[str]) -> torch.Tensor:
        """Encode search queries — optimized for low latency"""
        with torch.no_grad():
            embeddings = self.model.encode(
                queries,
                batch_size=32,
                normalize_embeddings=True,
                convert_to_tensor=True,
                show_progress_bar=False,
            )
        return embeddings

    def encode_documents(self, documents: list[str]) -> torch.Tensor:
        """Encode documents — optimized for throughput"""
        with torch.no_grad():
            embeddings = self.model.encode(
                documents,
                batch_size=256,
                normalize_embeddings=True,
                convert_to_tensor=True,
                show_progress_bar=False,
            )
        return embeddings
```

Query encoding latency: 3-5ms for a single query, 8-12ms for a batch of 32 queries.
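For reference, a minimal usage sketch of the service above — the query text is illustrative and the GPU assignment follows the partitioning described later:

```python
# Hypothetical usage: encode a single query on the query-encoding GPU
embedder = EmbeddingService(gpu_id=0)
query_vec = embedder.encode_queries(["How do I rotate my API keys?"])
print(query_vec.shape)  # (1, embedding_dim); the dimension depends on the Stella model configuration
```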
GPU-Accelerated Vector Search
They replaced their managed vector database with FAISS running on GPU. Each customer's document collection is stored as a separate FAISS index, loaded into GPU memory on demand with an LRU cache for frequently accessed collections.
```python
import faiss
import numpy as np


class TenantVectorStore:
    def __init__(self, gpu_id=1, cache_size_gb=60, max_cached=50):
        self.gpu_id = gpu_id
        self.res = faiss.StandardGpuResources()
        self.res.setTempMemory(cache_size_gb * 1024 * 1024 * 1024)
        self.loaded_indices = {}   # tenant_id -> GPU index
        self.access_order = []     # LRU tracking (most recently used last)
        self.max_cached = max_cached

    def load_index(self, tenant_id: str, index_path: str):
        """Load a tenant's FAISS index to GPU, evicting the least recently used index if needed"""
        if len(self.loaded_indices) >= self.max_cached:
            evicted = self.access_order.pop(0)
            del self.loaded_indices[evicted]
        cpu_index = faiss.read_index(index_path)
        gpu_index = faiss.index_cpu_to_gpu(
            self.res, self.gpu_id, cpu_index
        )
        self.loaded_indices[tenant_id] = gpu_index
        self.access_order.append(tenant_id)

    def search(self, tenant_id: str, query_vector: np.ndarray, k: int = 10):
        """Search a tenant's index — returns top-k document IDs and scores"""
        if tenant_id not in self.loaded_indices:
            self.load_index(tenant_id, f"/nas/indices/{tenant_id}.faiss")
        else:
            # Refresh the tenant's position in the LRU order on a cache hit
            self.access_order.remove(tenant_id)
            self.access_order.append(tenant_id)
        index = self.loaded_indices[tenant_id]
        distances, doc_ids = index.search(
            query_vector.reshape(1, -1).astype("float32"), k
        )
        return doc_ids[0], distances[0]
```

Vector search latency: 1-3ms for collections under 1M documents, 5-8ms for collections up to 5M documents. Compare this to their managed vector database's 30ms average (250ms p99) — a 10-30x improvement.
The LRU cache holds the 50 most recently accessed tenant indices in GPU memory. Indices for the remaining tenants live on NVMe and are loaded on demand (50-200ms cold load). Since query patterns follow a power law (the top 50 tenants generate 85% of queries), the cache hit rate stays above 95%.
Generation with vLLM
Four vLLM instances run on each generation server — one per GPU, for eight instances across the two servers. Each instance serves Qwen 3.5 32B with INT4 quantization.
```bash
# Launch 4 vLLM instances across the 4 GPUs (run on each generation server)
for GPU_ID in 0 1 2 3; do
  PORT=$((8000 + GPU_ID))
  CUDA_VISIBLE_DEVICES=$GPU_ID vllm serve Qwen/Qwen3.5-32B \
    --quantization awq \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.85 \
    --port $PORT \
    --disable-log-requests \
    &
done
```

Each instance handles approximately 60 concurrent requests with a p99 generation latency of 140ms for responses up to 512 tokens. The `--max-model-len 16384` flag limits the context window to 16K tokens — enough for the retrieved documents (typically 3-5 passages at 500-1000 tokens each) plus the query and system prompt.
End-to-End Pipeline
The orchestration service coordinates the full pipeline for each query:
```python
import aiohttp
import asyncio
import time


class RAGPipeline:
    def __init__(self, embedding_service, vector_store, vllm_endpoints):
        self.embedder = embedding_service
        self.vectors = vector_store
        self.vllm_endpoints = vllm_endpoints
        self.endpoint_idx = 0  # Round-robin counter

    async def query(self, tenant_id: str, user_query: str) -> dict:
        start = time.perf_counter()
        metrics = {}

        # Step 1: Encode query (3-5ms)
        t0 = time.perf_counter()
        query_vector = self.embedder.encode_queries([user_query])
        metrics["embed_ms"] = (time.perf_counter() - t0) * 1000

        # Step 2: Retrieve documents (1-8ms)
        t0 = time.perf_counter()
        doc_ids, scores = self.vectors.search(
            tenant_id, query_vector.cpu().numpy(), k=5
        )
        # fetch_documents (defined elsewhere) resolves document IDs to their stored text
        documents = self.fetch_documents(tenant_id, doc_ids)
        metrics["retrieve_ms"] = (time.perf_counter() - t0) * 1000

        # Step 3: Build prompt with retrieved context
        context = "\n\n---\n\n".join([
            f"[Document {i+1}]\n{doc['text']}"
            for i, doc in enumerate(documents)
        ])
        prompt = [
            {"role": "system", "content": f"Answer based on the provided documents.\n\nContext:\n{context}"},
            {"role": "user", "content": user_query},
        ]

        # Step 4: Generate response (100-180ms)
        t0 = time.perf_counter()
        endpoint = self.vllm_endpoints[self.endpoint_idx % len(self.vllm_endpoints)]
        self.endpoint_idx += 1
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"{endpoint}/v1/chat/completions",
                json={
                    "model": "Qwen/Qwen3.5-32B",
                    "messages": prompt,
                    "max_tokens": 512,
                    "temperature": 0.3,
                },
            ) as resp:
                result = await resp.json()
        metrics["generate_ms"] = (time.perf_counter() - t0) * 1000

        metrics["total_ms"] = (time.perf_counter() - start) * 1000
        return {
            "answer": result["choices"][0]["message"]["content"],
            "sources": [{"id": did, "score": float(s)} for did, s in zip(doc_ids, scores)],
            "metrics": metrics,
        }
```

Performance Results
After 4 weeks of production traffic on the new infrastructure, the latency and cost numbers stabilized.
Latency Comparison
| Metric | Managed Stack | Self-Hosted Bare Metal | Improvement |
|---|---|---|---|
| p50 latency | 920ms | 95ms | 9.7x faster |
| p95 latency | 1,400ms | 155ms | 9.0x faster |
| p99 latency | 1,800ms | 190ms | 9.5x faster |
| Max latency (daily) | 4,200ms | 380ms | 11x faster |
The latency improvement is dominated by two factors: eliminating network round-trips between pipeline components (saving roughly 200-500ms) and replacing a CPU-based managed vector search with GPU-accelerated FAISS (saving roughly 100-300ms). Raw generation speed is comparable — vLLM on H100 is roughly as fast as top-tier managed LLM APIs — so most of the remaining gap on the generation step comes from the network hop and provider-side queueing rather than the model itself.
Throughput
| Metric | Value |
|---|---|
| Daily query volume | 2.0M queries |
| Peak queries/second | 85 QPS |
| Sustained queries/second | 45 QPS |
| Concurrent users at peak | ~340 |
The eight vLLM instances (four per generation server) collectively handle 85 QPS at peak, with each instance processing roughly 10-11 QPS. At this load, GPU utilization on the generation servers averages 78% during peak hours and 35% during off-peak.
Cost Comparison
| Component | Managed Stack (Monthly) | Self-Hosted (Monthly) |
|---|---|---|
| LLM generation | $31,000 | $11,680 (2x 4-GPU servers) |
| Embedding API | $6,000 | $2,920 (1x 2-GPU server) |
| Vector database | $10,000 | $0 (FAISS, self-hosted) |
| Storage | Included | $320 (NAS) |
| Monitoring + logging | Included | $280 |
| Total | $47,000 | $15,200 |
| Cost per query | $0.00078 | $0.00025 |
Monthly savings: $31,800 (68% reduction). Annual savings: $381,600.
The GPU server costs assume Dedicated H100 instances at $2.00/hr per GPU (roughly 730 hours per month). With Spot instances for the embedding server (which can tolerate brief interruptions), the total drops further to approximately $13,400/month.
What Went Wrong
Not everything was smooth. Three issues emerged during the migration.
Cold start for tenant indices. When a tenant that hasn't been queried recently sends a request, their FAISS index must be loaded from NVMe to GPU — a 50-200ms operation depending on index size. For the first query, this adds noticeable latency. They mitigated this by pre-loading indices for all tenants expected to be active during the next business day (based on historical query patterns), running the preload job at 5 AM daily.
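A minimal sketch of what such a preload job could look like, reusing the `TenantVectorStore` defined earlier — how the list of expected-active tenants is produced (query logs, analytics) is not specified in the case study and is left as an input here:

```python
# Hypothetical 5 AM preload job: warm the GPU cache with indices for tenants
# expected to be active during the coming business day.
def preload_active_tenants(vector_store: "TenantVectorStore", expected_active_tenants: list[str]):
    for tenant_id in expected_active_tenants:
        if tenant_id not in vector_store.loaded_indices:
            # Pay the 50-200ms cold load here instead of on the tenant's first query
            vector_store.load_index(tenant_id, f"/nas/indices/{tenant_id}.faiss")
```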
Memory pressure during batch ingestion. The embedding server handles both real-time query encoding and batch document ingestion. During nightly batch runs, document encoding consumed GPU memory that conflicted with the FAISS index cache. They solved this by dedicating GPU 0 exclusively to query encoding and FAISS search, and GPU 1 to batch document encoding — a static partition that eliminated contention.
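Using the classes defined earlier, the static partition could be wired up roughly as follows — a sketch, not their exact configuration:

```python
# GPU 0: latency-sensitive path — real-time query encoding and FAISS search
query_embedder = EmbeddingService(gpu_id=0)
vector_store = TenantVectorStore(gpu_id=0)

# GPU 1: throughput path — nightly batch document encoding only
batch_embedder = EmbeddingService(gpu_id=1)
```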
vLLM memory leaks under sustained load. During the first week, one vLLM instance experienced a gradual memory leak that eventually triggered an OOM after 72 hours of continuous serving. They added a health check that monitors VRAM usage and automatically restarts instances that exceed 95% GPU memory utilization. Since adding the restart policy, they've had zero unplanned downtime.
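A sketch of the kind of watchdog they describe, reading GPU memory via NVML; the restart command and unit names are placeholders for however the vLLM instances are actually supervised (systemd, Docker, etc.):

```python
import subprocess
import time
import pynvml  # provided by the nvidia-ml-py package

VRAM_THRESHOLD = 0.95  # restart an instance above 95% GPU memory utilization


def check_and_restart(gpu_to_service: dict[int, str]):
    pynvml.nvmlInit()
    try:
        for gpu_id, service_name in gpu_to_service.items():
            handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            if mem.used / mem.total > VRAM_THRESHOLD:
                # Placeholder: restart whatever unit supervises this vLLM instance
                subprocess.run(["systemctl", "restart", service_name], check=False)
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    # Hypothetical mapping of GPU index -> service unit for each vLLM instance
    services = {i: f"vllm-gpu{i}.service" for i in range(4)}
    while True:
        check_and_restart(services)
        time.sleep(60)  # poll once a minute
```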
When This Architecture Makes Sense
This approach works when three conditions are met. First, you're running enough query volume that managed API costs dominate your infrastructure budget — at under 100K queries/day, managed services are usually cheaper when you factor in engineering time. Second, your latency requirements are strict enough that network round-trips to managed services are a bottleneck — if p99 under 1 second is acceptable, managed stacks work fine. Third, you have the engineering capacity to maintain GPU infrastructure — bare metal servers require monitoring, updates, and occasional debugging that managed services handle for you.
For teams that meet these criteria, the combination of co-located embedding, retrieval, and generation on bare metal GPUs delivers latency and cost improvements that managed services structurally cannot match. The network round-trips in a distributed managed stack are a fundamental bottleneck, not a solvable optimization problem.
On Spheron AI, you can provision bare metal GPU servers with the exact configurations described in this case study — multi-GPU H100, H200, and B300 servers available as both Spot and Dedicated instances. The bare metal access gives you full control over GPU allocation, memory management, and process isolation that makes production RAG pipelines predictable.