Retrieval-Augmented Generation is the most common production AI pattern today. Nearly every company deploying LLMs uses some form of RAG — retrieving relevant documents, injecting them into context, and generating grounded responses. The pattern is well-understood. The infrastructure to run it at scale is not.
This case study documents how a B2B SaaS company serving enterprise knowledge management moved their RAG pipeline from managed API services to self-hosted bare metal GPUs. They serve 2 million queries per day across 340 enterprise customers, each with isolated document collections ranging from 10K to 5M documents.
The migration took 6 weeks and the results were significant: p99 latency dropped from 1.8 seconds to 190 milliseconds, monthly infrastructure cost fell from $47,000 to $15,200, and they eliminated their dependency on three separate managed services.
The Problem with Managed RAG
Before the migration, their RAG stack looked like most production setups: a managed vector database for retrieval, a managed embedding API for document ingestion and query encoding, and a managed LLM API for generation.
The architecture worked at small scale. At 2M queries/day, the problems compounded.
Latency was unpredictable. Each query required three sequential network round-trips: query embedding (managed embedding API, ~80ms average but ~400ms p99), vector search (managed vector DB, ~30ms average but ~250ms p99), and LLM generation (managed LLM API, ~800ms average but ~2.5s p99). The p99 for the full pipeline exceeded 3 seconds during peak hours. Their enterprise customers required sub-500ms p95 latency in their SLA.
Cost scaled linearly with queries. Managed APIs charge per-token or per-query. At 2M queries/day with an average context window of 8K tokens (retrieved documents + query + system prompt), their monthly LLM API bill alone was $31,000. The embedding API added $6,000/month and the vector database $10,000/month.
Multi-tenancy was fragile. Each enterprise customer's documents were isolated in separate vector database collections. Managing 340 collections across a managed service meant dealing with collection-level rate limits, inconsistent index performance as collections grew, and no control over how the managed service prioritized their queries versus other customers.
The Target Architecture
They designed a self-hosted pipeline that co-locates all three components — embedding, retrieval, and generation — on the same bare metal servers. The key insight: by eliminating network round-trips between components, each query touches GPU memory once and flows through the entire pipeline without leaving the machine.
Hardware Configuration
| Component | Configuration |
|---|---|
| Generation servers | 2x bare metal, each 4x H100 80GB SXM |
| Embedding + retrieval server | 1x bare metal, 2x H100 80GB SXM |
| Total GPUs | 10x H100 80GB |
| Storage | 4 TB NVMe per server (local) + 20 TB NAS (shared) |
| Network | 100 Gbps between servers |
Software Stack
| Layer | Technology | Role |
|---|---|---|
| Embedding model | Stella_en_1.5B_v5 (1.5B params) | Document + query encoding |
| Vector index | FAISS (GPU-accelerated) | Similarity search |
| LLM | Qwen 3.5 32B (INT4 quantized) | Response generation |
| Serving framework | vLLM | LLM inference with continuous batching |
| Orchestration | Custom Python service | Pipeline coordination |
| Load balancer | Nginx | Request distribution |
The generation model choice was deliberate. Qwen 3.5 32B with INT4 quantization uses approximately 16 GB of VRAM per instance. On a 4x H100 server, they run 4 independent vLLM instances — one per GPU — each handling a portion of incoming queries. This "scale out" approach (multiple small instances) provides better p99 latency than a single large instance using tensor parallelism, because each instance serves fewer concurrent requests and maintains a smaller KV cache.
Implementation Details
Embedding Pipeline
Document ingestion and query encoding run on the dedicated embedding server. The Stella model at 1.5B parameters uses approximately 6 GB VRAM per instance (FP16), allowing them to run multiple encoding instances on the 2x H100 setup.
The embedding pipeline handles two workloads. Batch ingestion runs during off-peak hours (midnight to 6 AM), processing new and updated documents from customer uploads. Real-time query encoding runs continuously, converting incoming search queries to embedding vectors.
```python
# Embedding service configuration
from sentence_transformers import SentenceTransformer
import torch


class EmbeddingService:
    def __init__(self, model_name="dunzhang/stella_en_1.5B_v5", gpu_id=0):
        self.model = SentenceTransformer(model_name)
        self.model = self.model.to(f"cuda:{gpu_id}")
        self.model.eval()

    def encode_queries(self, queries: list[str]) -> torch.Tensor:
        """Encode search queries — optimized for low latency"""
        with torch.no_grad():
            embeddings = self.model.encode(
                queries,
                batch_size=32,
                normalize_embeddings=True,
                convert_to_tensor=True,
                show_progress_bar=False,
            )
        return embeddings

    def encode_documents(self, documents: list[str]) -> torch.Tensor:
        """Encode documents — optimized for throughput"""
        with torch.no_grad():
            embeddings = self.model.encode(
                documents,
                batch_size=256,
                normalize_embeddings=True,
                convert_to_tensor=True,
                show_progress_bar=False,
            )
        return embeddings
```

Query encoding latency: 3-5ms for a single query, 8-12ms for a batch of 32 queries.
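For reference, a minimal usage sketch of the service above — the query text is illustrative and the GPU assignment follows the partitioning described later:

```python
# Hypothetical usage: encode a single query on the query-encoding GPU
embedder = EmbeddingService(gpu_id=0)
query_vec = embedder.encode_queries(["How do I rotate my API keys?"])
print(query_vec.shape)  # (1, embedding_dim); the dimension depends on the Stella model configuration
```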
GPU-Accelerated Vector Search
They replaced their managed vector database with FAISS running on GPU. Each customer's document collection is stored as a separate FAISS index, loaded into GPU memory on demand with an LRU cache for frequently accessed collections.
```python
import faiss
import numpy as np


class TenantVectorStore:
    def __init__(self, gpu_id=1, cache_size_gb=60, max_cached=50):
        self.gpu_id = gpu_id
        self.res = faiss.StandardGpuResources()
        self.res.setTempMemory(cache_size_gb * 1024 * 1024 * 1024)
        self.loaded_indices = {}   # tenant_id -> GPU index
        self.access_order = []     # LRU tracking (most recently used last)
        self.max_cached = max_cached

    def load_index(self, tenant_id: str, index_path: str):
        """Load a tenant's FAISS index to GPU, evicting the least recently used index if needed"""
        if len(self.loaded_indices) >= self.max_cached:
            evicted = self.access_order.pop(0)
            del self.loaded_indices[evicted]
        cpu_index = faiss.read_index(index_path)
        gpu_index = faiss.index_cpu_to_gpu(
            self.res, self.gpu_id, cpu_index
        )
        self.loaded_indices[tenant_id] = gpu_index
        self.access_order.append(tenant_id)

    def search(self, tenant_id: str, query_vector: np.ndarray, k: int = 10):
        """Search a tenant's index — returns top-k document IDs and scores"""
        if tenant_id not in self.loaded_indices:
            self.load_index(tenant_id, f"/nas/indices/{tenant_id}.faiss")
        else:
            # Refresh the tenant's position in the LRU order on a cache hit
            self.access_order.remove(tenant_id)
            self.access_order.append(tenant_id)
        index = self.loaded_indices[tenant_id]
        distances, doc_ids = index.search(
            query_vector.reshape(1, -1).astype("float32"), k
        )
        return doc_ids[0], distances[0]
```

Vector search latency: 1-3ms for collections under 1M documents, 5-8ms for collections up to 5M documents. Compare this to their managed vector database's 30ms average (250ms p99) — a 10-30x improvement.
The LRU cache holds the 50 most recently accessed tenant indices in GPU memory. Indices for the remaining tenants live on NVMe and are loaded on demand (50-200ms cold load). Since query patterns follow a power law (the top 50 tenants generate 85% of queries), the cache hit rate stays above 95%.
Generation with vLLM
Four vLLM instances run on each generation server — one per GPU, for eight instances across the two servers. Each instance serves Qwen 3.5 32B with INT4 quantization.
```bash
# Launch 4 vLLM instances across the 4 GPUs (run on each generation server)
for GPU_ID in 0 1 2 3; do
  PORT=$((8000 + GPU_ID))
  CUDA_VISIBLE_DEVICES=$GPU_ID vllm serve Qwen/Qwen3.5-32B \
    --quantization awq \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.85 \
    --port $PORT \
    --disable-log-requests \
    &
done
```

Each instance handles approximately 60 concurrent requests with a p99 generation latency of 140ms for responses up to 512 tokens. The `--max-model-len 16384` flag limits the context window to 16K tokens — enough for the retrieved documents (typically 3-5 passages at 500-1000 tokens each) plus the query and system prompt.
End-to-End Pipeline
The orchestration service coordinates the full pipeline for each query:
```python
import aiohttp
import asyncio
import time


class RAGPipeline:
    def __init__(self, embedding_service, vector_store, vllm_endpoints):
        self.embedder = embedding_service
        self.vectors = vector_store
        self.vllm_endpoints = vllm_endpoints
        self.endpoint_idx = 0  # Round-robin counter

    async def query(self, tenant_id: str, user_query: str) -> dict:
        start = time.perf_counter()
        metrics = {}

        # Step 1: Encode query (3-5ms)
        t0 = time.perf_counter()
        query_vector = self.embedder.encode_queries([user_query])
        metrics["embed_ms"] = (time.perf_counter() - t0) * 1000

        # Step 2: Retrieve documents (1-8ms)
        t0 = time.perf_counter()
        doc_ids, scores = self.vectors.search(
            tenant_id, query_vector.cpu().numpy(), k=5
        )
        # fetch_documents (defined elsewhere) resolves document IDs to their stored text
        documents = self.fetch_documents(tenant_id, doc_ids)
        metrics["retrieve_ms"] = (time.perf_counter() - t0) * 1000

        # Step 3: Build prompt with retrieved context
        context = "\n\n---\n\n".join([
            f"[Document {i+1}]\n{doc['text']}"
            for i, doc in enumerate(documents)
        ])
        prompt = [
            {"role": "system", "content": f"Answer based on the provided documents.\n\nContext:\n{context}"},
            {"role": "user", "content": user_query},
        ]

        # Step 4: Generate response (100-180ms)
        t0 = time.perf_counter()
        endpoint = self.vllm_endpoints[self.endpoint_idx % len(self.vllm_endpoints)]
        self.endpoint_idx += 1
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"{endpoint}/v1/chat/completions",
                json={
                    "model": "Qwen/Qwen3.5-32B",
                    "messages": prompt,
                    "max_tokens": 512,
                    "temperature": 0.3,
                },
            ) as resp:
                result = await resp.json()
        metrics["generate_ms"] = (time.perf_counter() - t0) * 1000

        metrics["total_ms"] = (time.perf_counter() - start) * 1000
        return {
            "answer": result["choices"][0]["message"]["content"],
            "sources": [{"id": did, "score": float(s)} for did, s in zip(doc_ids, scores)],
            "metrics": metrics,
        }
```

Performance Results
After 4 weeks of production traffic on the new infrastructure, the latency and cost numbers stabilized.
Latency Comparison
| Metric | Managed Stack | Self-Hosted Bare Metal | Improvement |
|---|---|---|---|
| p50 latency | 920ms | 95ms | 9.7x faster |
| p95 latency | 1,400ms | 155ms | 9.0x faster |
| p99 latency | 1,800ms | 190ms | 9.5x faster |
| Max latency (daily) | 4,200ms | 380ms | 11x faster |
The latency improvement is dominated by two factors: eliminating network round-trips between pipeline components (saving roughly 200-500ms) and replacing a CPU-based managed vector search with GPU-accelerated FAISS (saving roughly 100-300ms). Raw generation speed is comparable — vLLM on H100 is roughly as fast as top-tier managed LLM APIs — so most of the remaining gap on the generation step comes from the network hop and provider-side queueing rather than the model itself.
Throughput
| Metric | Value |
|---|---|
| Daily query volume | 2.0M queries |
| Peak queries/second | 85 QPS |
| Sustained queries/second | 45 QPS |
| Concurrent users at peak | ~340 |
The eight vLLM instances (four per generation server) collectively handle 85 QPS at peak, with each instance processing roughly 10-11 QPS. At this load, GPU utilization on the generation servers averages 78% during peak hours and 35% during off-peak.
Cost Comparison
| Component | Managed Stack (Monthly) | Self-Hosted (Monthly) |
|---|---|---|
| LLM generation | $31,000 | $11,680 (2x 4-GPU servers) |
| Embedding API | $6,000 | $2,920 (1x 2-GPU server) |
| Vector database | $10,000 | $0 (FAISS, self-hosted) |
| Storage | Included | $320 (NAS) |
| Monitoring + logging | Included | $280 |
| Total | $47,000 | $15,200 |
| Cost per query | $0.00078 | $0.00025 |
Monthly savings: $31,800 (68% reduction). Annual savings: $381,600.
The GPU server costs assume Dedicated H100 instances at $2.00/hr per GPU (roughly 730 hours per month). With Spot instances for the embedding server (which can tolerate brief interruptions), the total drops further to approximately $13,400/month.
What Went Wrong
Not everything was smooth. Three issues emerged during the migration.
Cold start for tenant indices. When a tenant that hasn't been queried recently sends a request, their FAISS index must be loaded from NVMe to GPU — a 50-200ms operation depending on index size. For the first query, this adds noticeable latency. They mitigated this by pre-loading indices for all tenants expected to be active during the next business day (based on historical query patterns), running the preload job at 5 AM daily.
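A minimal sketch of what such a preload job could look like, reusing the `TenantVectorStore` defined earlier — how the list of expected-active tenants is produced (query logs, analytics) is not specified in the case study and is left as an input here:

```python
# Hypothetical 5 AM preload job: warm the GPU cache with indices for tenants
# expected to be active during the coming business day.
def preload_active_tenants(vector_store: "TenantVectorStore", expected_active_tenants: list[str]):
    for tenant_id in expected_active_tenants:
        if tenant_id not in vector_store.loaded_indices:
            # Pay the 50-200ms cold load here instead of on the tenant's first query
            vector_store.load_index(tenant_id, f"/nas/indices/{tenant_id}.faiss")
```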
Memory pressure during batch ingestion. The embedding server handles both real-time query encoding and batch document ingestion. During nightly batch runs, document encoding consumed GPU memory that conflicted with the FAISS index cache. They solved this by dedicating GPU 0 exclusively to query encoding and FAISS search, and GPU 1 to batch document encoding — a static partition that eliminated contention.
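Using the classes defined earlier, the static partition could be wired up roughly as follows — a sketch, not their exact configuration:

```python
# GPU 0: latency-sensitive path — real-time query encoding and FAISS search
query_embedder = EmbeddingService(gpu_id=0)
vector_store = TenantVectorStore(gpu_id=0)

# GPU 1: throughput path — nightly batch document encoding only
batch_embedder = EmbeddingService(gpu_id=1)
```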
vLLM memory leaks under sustained load. During the first week, one vLLM instance experienced a gradual memory leak that eventually triggered an OOM after 72 hours of continuous serving. They added a health check that monitors VRAM usage and automatically restarts instances that exceed 95% GPU memory utilization. Since adding the restart policy, they've had zero unplanned downtime.
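A sketch of the kind of watchdog they describe, reading GPU memory via NVML; the restart command and unit names are placeholders for however the vLLM instances are actually supervised (systemd, Docker, etc.):

```python
import subprocess
import time
import pynvml  # provided by the nvidia-ml-py package

VRAM_THRESHOLD = 0.95  # restart an instance above 95% GPU memory utilization


def check_and_restart(gpu_to_service: dict[int, str]):
    pynvml.nvmlInit()
    try:
        for gpu_id, service_name in gpu_to_service.items():
            handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            if mem.used / mem.total > VRAM_THRESHOLD:
                # Placeholder: restart whatever unit supervises this vLLM instance
                subprocess.run(["systemctl", "restart", service_name], check=False)
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    # Hypothetical mapping of GPU index -> service unit for each vLLM instance
    services = {i: f"vllm-gpu{i}.service" for i in range(4)}
    while True:
        check_and_restart(services)
        time.sleep(60)  # poll once a minute
```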
When This Architecture Makes Sense
This approach works when three conditions are met. First, you're running enough query volume that managed API costs dominate your infrastructure budget — at under 100K queries/day, managed services are usually cheaper when you factor in engineering time. Second, your latency requirements are strict enough that network round-trips to managed services are a bottleneck — if p99 under 1 second is acceptable, managed stacks work fine. Third, you have the engineering capacity to maintain GPU infrastructure — bare metal servers require monitoring, updates, and occasional debugging that managed services handle for you.
For teams that meet these criteria, the combination of co-located embedding, retrieval, and generation on bare metal GPUs delivers latency and cost improvements that managed services structurally cannot match. The network round-trips in a distributed managed stack are a fundamental bottleneck, not a solvable optimization problem.
On Spheron AI, you can provision bare metal GPU servers with the exact configurations described in this case study — multi-GPU H100, H200, and B300 servers available as both Spot and Dedicated instances. The bare metal access gives you full control over GPU allocation, memory management, and process isolation that makes production RAG pipelines predictable.