TL;DR: Copy-Paste vLLM Setup on H100
GPU: H100 SXM5 80GB on Spheron, $2.50/hr on-demand, $1.03/hr spot
Single-GPU (70B at FP8):
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--max-model-len 16384 \
--gpu-memory-utilization 0.92 \
--max-num-seqs 128
Multi-GPU tensor-parallel (70B at FP16, 2× H100):
docker run --gpus all --ipc=host -p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.3-70B-Instruct \
--dtype float16 --tensor-parallel-size 2 \
--max-model-len 16384 --gpu-memory-utilization 0.92
Monitor:
curl http://localhost:8000/metrics | grep vllm:num_requests_waiting
The rest of this guide covers every flag, OOM fix, and monitoring setup in detail.
You've already gotten vLLM working locally. Now you need to serve a model in production: maybe 70B parameters across multiple GPUs, maybe handling thousands of concurrent users, maybe with specific latency requirements. This guide covers the gap between "vLLM works on my laptop" and "vLLM is running reliably in production on bare metal."
Prerequisites: Docker, basic Linux CLI, an account on Spheron, and a model you want to serve (Hugging Face model name or local checkpoint). We'll go from instance setup through single-GPU deployment, multi-GPU tensor parallelism, FP8 quantization, load balancing, and production monitoring, with working code for every step.
Related Guides
- AI inference GPU guide: cost-per-token comparison at each model size
- Inference framework benchmark: vLLM vs TensorRT-LLM vs SGLang throughput and latency
- Ollama vs vLLM: if you haven't chosen yet
- LLM Serving Optimization: deep dive on batching mechanics
- KV Cache Optimization Guide: FP8/NVFP4 quantization, prefix caching, and CPU offloading
vLLM in 2026: Key Features to Know
vLLM has become the default serving engine for production LLM inference. The current stable release (v0.20.2, May 2026) includes several features you need to understand before deploying:
- Model Runner V2 (MRV2): available in v0.20.0+, delivering up to 56% higher throughput on GB200 (results vary by hardware) via GPU-native Triton kernels and async scheduling. MRV2 must be explicitly enabled with `VLLM_USE_V2_MODEL_RUNNER=1`; see the vLLM Model Runner V2 deployment guide for the performance gains and deployment steps.
- FP8 inference support: significant throughput improvement on H100 and Blackwell GPUs; enable with a single flag
- Continuous batching: the default now; dynamically groups incoming requests for maximum GPU utilization without your intervention
- Streaming support: built-in server-sent events for real-time token streaming; important for chat applications and voice AI
- Multi-modal support: images + text for models like LLaVA, Qwen2.5-VL, Llama 4 (Scout 17B-16E, Maverick 17B-128E, both are natively multi-modal MoE models), and Gemma 3 (4B/12B/27B-it)
- Expert parallelism for MoE models: the `--enable-expert-parallel` flag (boolean) distributes MoE expert layers across GPUs, enabling efficient deployment of large sparse models. The EP degree is auto-calculated from TP and DP sizes when this flag is enabled. For DeepSeek V4 (1T total parameters, 37B active MoE), see the DeepSeek V4 vLLM deployment guide for expert parallelism configuration.
- Structured outputs: JSON schema enforcement via guided decoding; critical for agent/tool-calling workloads. If your workload involves AI agents making structured output calls, there are additional tuning steps specific to JSON schema enforcement and function calling. See our structured output and function calling inference guide.
- Speculative decoding: faster generation using a small draft model; effective for specific model pairs where latency matters. For low-concurrency serving where latency matters more than throughput, speculative decoding can add another 2-5x improvement on top of any of these configurations. See the speculative decoding production guide for setup steps.
- FlashAttention 4 backend: new in v0.20.0+; the default attention backend on Blackwell SM100/SM103 GPUs (B200, B300); improves prefill throughput and speculative decode performance with no configuration required. On Hopper GPUs (H100, H200), FlashAttention 3 remains the default. For a deep dive into FA4's SM100 tile architecture, benchmark comparisons with FA3, and step-by-step setup on Blackwell instances, see the FlashAttention-4 Blackwell inference guide.
- `--performance-mode` flag: new in v0.17.0+; sets pre-tuned defaults for `balanced`, `interactivity`, or `throughput` workloads. See the Production Configuration section below for details.
For the full changelog, check vLLM's releases page. What's in this guide is what you actually need to configure for production: not every feature, just the ones that matter.
Choosing Your GPU Configuration
Your GPU choice depends on model size and whether you prioritize throughput or cost. Here's the practical decision table:
| Model Size | Recommended GPU | Parallelism Strategy | On-Demand Price | Spot Price |
|---|---|---|---|---|
| 7B–13B | 1x RTX 4090 (24GB) | None: single GPU | $0.53/hr | N/A |
| 7B–13B | 1x RTX 5090 (32GB) | None: single GPU | $0.92/hr | N/A |
| 13B–30B | 1x L40S (48GB) | None: single GPU | $0.72/hr | N/A |
| 30B–70B (FP8/Q4) | 1x RTX PRO 6000 (96GB) | None: single GPU | $1.70/hr | $1.24/hr |
| 30B–40B (FP16) | 1x A100 80GB (SXM4) | None: single GPU | $1.64/hr | N/A |
| 70B (FP8) | 1x H100 SXM5 80GB | None: just fits (tight ~70GB) | $2.50/hr | $1.03/hr |
| 70B (FP16) | 2x H100 SXM5 80GB | Tensor parallel (2) | $5.00/hr | N/A |
| 70B (FP16) | 4x H100 SXM5 80GB | Tensor parallel (4) | $10.00/hr | N/A |
| 100B+ | 8x H100 SXM5 | Tensor + pipeline parallel | $20.00/hr+ | varies |
Note on A100 and FP8: The A100 does not have hardware FP8 Tensor Cores. Enabling FP8 with `--quantization fp8` on A100 will either error or fall back to FP16 compute without telling you. In FP16, each parameter takes 2 bytes, so an 80GB A100 can hold at most ~40B parameters of weights (80GB ÷ 2 bytes per parameter), and in practice fewer because the KV cache needs headroom. For 70B models, use H100 with FP8 (fits on one 80GB card at ~70GB) or 2× H100 with FP16.
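The sizing arithmetic in this note generalizes to other model sizes and precisions. The following is a minimal sketch, assuming weights dominate VRAM and ignoring KV cache and activation overhead; `weight_gb` and `max_params_billion` are illustrative helper names, not part of vLLM.

```python
# Back-of-envelope weight sizing: parameters x bytes per parameter.
# Illustrative helpers (not a vLLM API); KV cache and activations are ignored.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def max_params_billion(vram_gb: float, precision: str) -> float:
    """Upper bound on parameters (billions) whose weights alone fill vram_gb."""
    return vram_gb / BYTES_PER_PARAM[precision]


print(weight_gb(70, "fp16"))           # 70B model in FP16
print(weight_gb(70, "fp8"))            # 70B model in FP8
print(max_params_billion(80, "fp16"))  # weight-only ceiling for one 80GB card
```

Treat the results as ceilings: a real deployment needs room for the KV cache on top of the weights, which is why the table above calls the single-H100 FP8 fit tight.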
GPU pricing fluctuates over time based on availability and demand. Prices above are live on-demand (dedicated) and spot rates fetched from Spheron's GPU catalog as of 14 May 2026. Spot instances offer significant savings where available, but may not always be in stock. Your actual cost may differ - always check current GPU pricing before committing to a configuration.
vLLM also supports AMD GPUs via ROCm. See our ROCm vs CUDA GPU Cloud guide for setup instructions and performance comparisons.
For the RTX PRO 6000 as a single-GPU 30B–70B option (96GB GDDR7), see our RTX PRO 6000 guide. For Qwen 3 deployment, see the Qwen 3 GPU deployment guide.
Setting Up Your Spheron Instance
Step 1: Launch the Instance
Select your GPU, region, and provider from Spheron's GPU catalog. For a 70B FP16 model with tensor parallelism, select a 2x H100 configuration. For FP8, a single H100 is sufficient.
Step 2: Verify GPU Access
nvidia-smi
# Should show all GPUs with expected VRAM
# For a 2x H100 instance, you'll see two rows with 80GB each
Step 3: Install Docker with NVIDIA Support
Most Spheron GPU instances come with the NVIDIA Container Toolkit pre-installed. If not:
# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Step 4: Verify Docker GPU Access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
If this outputs the same nvidia-smi table as the host, you're ready. If it errors, the container toolkit configuration didn't take: restart Docker and try again.
Single-GPU Deployment: The Starting Point
Start here even if you plan to run multi-GPU. Validate that the model loads, the API works, and your system is healthy before scaling.
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
vllm/vllm-openai:latest \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--dtype float16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 256 \
--host 0.0.0.0 \
--port 8000
Every flag here is deliberate:
- `--gpus all`: expose all GPUs to the container; required for CUDA to work
- `--ipc=host`: use the host's shared memory namespace. Do not skip this: vLLM uses shared memory for inter-process communication, and without this flag you'll hit cryptic CUDA errors under load
- `--dtype float16`: FP16 precision; add `--quantization fp8` on H100/Blackwell for ~2x throughput (covered below)
- `--max-model-len 8192`: maximum context length; lower values mean less KV cache VRAM, which means more room for concurrent requests. For reasoning models (DeepSeek-R1, o3-equivalent), set this to 32,768 or higher to accommodate thinking tokens. See Inference-Time Compute Scaling on GPU Cloud for how to right-size this value for your reasoning depth.
- `--gpu-memory-utilization 0.90`: let vLLM use 90% of GPU VRAM for the model and KV cache; keep 10% headroom to avoid OOM on unexpected spikes
- `--max-num-seqs 256`: maximum concurrent sequences; increase for high-throughput workloads if VRAM allows. For reasoning workloads, lower this significantly: a 32K-context reasoning chain uses ~10 GB of KV cache at FP16 on a 70B-class model, so concurrency is memory-limited.
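The ~10 GB figure quoted for a 32K-context sequence can be reproduced from the model architecture. This sketch assumes the published Llama 3 configurations (70B: 80 layers, 8 KV heads, head dimension 128; 8B: 32 layers, 8 KV heads, head dimension 128); those numbers are not from this guide, and the helper names are hypothetical.

```python
# KV cache size per sequence: 2 tensors (K and V) per layer, per KV head.
# Architecture numbers below are assumptions taken from the public Llama 3 configs.


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies (bytes_per_elem=2 for FP16, 1 for FP8)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(context_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache for one sequence of context_tokens, in GiB."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem)
    return context_tokens * per_token / 2**30


LLAMA_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)
LLAMA_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

print(kv_cache_gib(32768, **LLAMA_70B))  # one 32K sequence on a 70B model
print(kv_cache_gib(8192, **LLAMA_8B))    # one 8K sequence on an 8B model
```

Multiplying the per-sequence figure by `--max-num-seqs` gives a worst-case KV demand, which is a quick way to sanity-check a concurrency setting before launching.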
Test the deployment immediately after it starts:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Hello, are you online?"}],
"max_tokens": 50
}'
You should get a JSON response with a choices array. If you get a connection error, the model is still loading: vLLM downloads and loads weights before accepting requests, which can take 5–15 minutes for large models on first run.
For a full walkthrough using the vLLM OpenAI-compatible endpoint as a drop-in replacement for the OpenAI API, see our self-hosted OpenAI API guide.
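The same smoke test can be scripted. Below is a minimal standard-library sketch that builds the request and extracts the reply; `build_chat_request` and `first_reply` are hypothetical helper names, and the base URL assumes the port mapping used in the commands above.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=50):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response_body):
    """Extract the assistant text from a chat completion response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


req = build_chat_request(
    "http://localhost:8000",
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "Hello, are you online?",
)
```

To send it, wrap the call as `with urllib.request.urlopen(req, timeout=60) as resp: print(first_reply(resp.read()))`. While the model is still loading, expect `urllib.error.URLError` rather than a response.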
Multi-GPU Tensor Parallelism
When a model doesn't fit on one GPU, or when you need lower time-to-first-token (TTFT) by distributing compute, use tensor parallelism.
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.3-70B-Instruct \
--dtype float16 \
--tensor-parallel-size 2 \
--max-model-len 16384 \
--gpu-memory-utilization 0.92 \
--max-num-seqs 64
What --tensor-parallel-size 2 does: splits every transformer layer across 2 GPUs. Each GPU processes half the attention heads and half the MLP feed-forward computation. The results are synchronized at the end of each layer via an all-reduce operation across the GPUs.
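To make the split-then-all-reduce idea concrete, here is a toy sketch of a two-layer MLP sharded across ranks in plain Python. It assumes a Megatron-style layout (first weight matrix split by columns, second by rows) and simulates the all-reduce as an element-wise sum; it illustrates the math only and is not vLLM's implementation.

```python
# Toy tensor parallelism for an MLP: y = relu(x @ w1) @ w2.
# Each "rank" holds a column slice of w1 and the matching row slice of w2.


def matmul(a, b):
    """Plain-Python matrix product for small matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def mlp_single_gpu(x, w1, w2):
    return matmul(relu(matmul(x, w1)), w2)


def mlp_tensor_parallel(x, w1, w2, tp_size):
    """Compute the same MLP as tp_size partial results, then sum them."""
    shard = len(w2) // tp_size  # hidden units owned by each rank
    partials = []
    for rank in range(tp_size):
        cols = slice(rank * shard, (rank + 1) * shard)
        w1_shard = [row[cols] for row in w1]  # this rank's columns of w1
        w2_shard = w2[cols]                   # matching rows of w2
        partials.append(matmul(relu(matmul(x, w1_shard)), w2_shard))
    # "All-reduce": element-wise sum of every rank's partial output.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1, 2]]
w1 = [[1, -1, 2, 0], [0, 3, -2, 1]]
w2 = [[1, 0], [2, 1], [0, -1], [1, 1]]
print(mlp_single_gpu(x, w1, w2))
print(mlp_tensor_parallel(x, w1, w2, tp_size=2))
```

Both calls produce the same output. The only cross-rank step is the final sum, which is the communication that the interconnect has to carry once per sharded block.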
The performance of tensor parallelism depends heavily on the interconnect between GPUs:
- NVLink (SXM H100/H200): ~900 GB/s bidirectional bandwidth; all-reduce is fast; tensor parallelism scales well
- PCIe (A100 PCIe, non-NVLink setups): ~64 GB/s bidirectional via PCIe 4.0 x16; all-reduce is slower; each layer boundary adds latency
If you're running SXM H100s with NVLink, tensor parallelism at 2x or 4x is very efficient. On PCIe-only multi-GPU setups, expect more overhead between layers: still worth it for models that don't fit, but don't expect linear throughput scaling.
When tensor parallelism helps:
- Model is too large for a single GPU (70B FP16 needs ~140GB, more than a single 80GB H100 can hold)
- TTFT is too high: spreading prefill across more GPUs reduces time to first token
- You have NVLink-connected GPUs and want to push throughput higher
When it doesn't help:
- Running a 7B model across 4 GPUs: the communication overhead outweighs the benefit; just run 4 separate single-GPU instances instead
- PCIe-only systems with a model that barely fits on one GPU: the communication cost may not be worth the marginal VRAM headroom
MoE models require a different parallelism strategy - see MoE inference optimization for expert parallelism configuration. For workloads where you need to separate prefill and decode across nodes, see NVIDIA Dynamo 1.0, which adds a disaggregated routing layer on top of vLLM that can deliver up to 7x higher throughput.
FP8 Quantization: 2x Throughput on H100 and Blackwell
FP8 is the single most impactful configuration change available on H100 and Blackwell GPUs. It requires no quantization scripts, no model modifications, and no additional setup, just one flag change:
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--max-model-len 16384 \
--gpu-memory-utilization 0.92 \
--max-num-seqs 128
What you gain with FP8:
- ~1.5–2x throughput improvement vs FP16 on H100: the H100's FP8 Tensor Cores run at twice the FLOP rate of FP16 Tensor Cores
- ~50% VRAM reduction: a 70B model in FP8 uses roughly 70GB vs 140GB in FP16, so it fits on a single H100 80GB, though it is tight; tune `--gpu-memory-utilization` (0.92+) and `--max-model-len` carefully to avoid OOM
- Marginal quality loss: typically less than 1-2% on standard benchmarks; acceptable for production inference on most tasks
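How tight the single-GPU fit is can be estimated with simple arithmetic. This is a rough sketch under stated assumptions: 70 GB of weights, an 80 GB card at 0.92 utilization, and a per-token KV cost derived from the public Llama 3 70B config (80 layers, 8 KV heads, head dimension 128). Activations, CUDA graphs, and framework overhead are ignored, so real headroom will be smaller. `kv_budget_tokens` is a hypothetical helper.

```python
# Rough KV cache budget left over after the weights are loaded.
# All inputs are estimates; the architecture constants are assumptions.

KV_BYTES_PER_TOKEN_FP16 = 2 * 80 * 8 * 128 * 2  # K+V, 80 layers, 8 KV heads, dim 128


def kv_budget_tokens(vram_gb, gpu_memory_utilization, weights_gb, kv_bytes_per_token):
    """Total tokens of KV cache that fit in the VRAM remaining after weights."""
    free_bytes = (vram_gb * gpu_memory_utilization - weights_gb) * 1e9
    return max(0, int(free_bytes // kv_bytes_per_token))


fp16_kv = kv_budget_tokens(80, 0.92, 70, KV_BYTES_PER_TOKEN_FP16)
fp8_kv = kv_budget_tokens(80, 0.92, 70, KV_BYTES_PER_TOKEN_FP16 // 2)
print(f"FP16 KV cache: ~{fp16_kv} tokens; FP8 KV cache: ~{fp8_kv} tokens")
```

Under these assumptions the whole server shares a budget of roughly 11K tokens with an FP16 KV cache and about twice that with `--kv-cache-dtype fp8`. That budget is split across all concurrent sequences, so it is worth comparing against your `--max-model-len` before committing to a single card.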
Model compatibility: Most major open-source models (Llama 3.x, Mistral, Qwen 2.5, Phi-4) have been validated with vLLM FP8. For models without pre-quantized FP8 weights available, vLLM performs dynamic quantization on the fly using the original weights. Verify FP8 compatibility in vLLM's supported models documentation.
FP8 requires hardware support: H100, H200, NVIDIA Ada Lovelace GPUs (RTX 4090, L40S), and NVIDIA Blackwell GPUs (B200, B100, and consumer RTX 50 series including the RTX 5090) have dedicated FP8 Tensor Cores. On A100 or older Ampere hardware, `--quantization fp8` will either fail or fall back to FP16 compute automatically: check your vLLM logs to confirm which mode is active.
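The hardware rule reduces to a compute-capability check: FP8 Tensor Cores first appear at SM 8.9. The sketch below encodes that rule; the capability table is an assumption based on NVIDIA's published compute capabilities, and `has_fp8_tensor_cores` is a hypothetical helper. On a live machine, `torch.cuda.get_device_capability()` returns the same `(major, minor)` pair.

```python
# FP8 Tensor Cores are available from compute capability 8.9 (Ada Lovelace) upward.
# The table below is illustrative, not exhaustive.

COMPUTE_CAPABILITY = {
    "A100": (8, 0),
    "RTX 3090": (8, 6),
    "RTX 4090": (8, 9),
    "L40S": (8, 9),
    "H100": (9, 0),
    "H200": (9, 0),
    "B200": (10, 0),
    "RTX 5090": (12, 0),
}


def has_fp8_tensor_cores(major, minor):
    """True for compute capability 8.9 and newer."""
    return (major, minor) >= (8, 9)


for gpu, capability in COMPUTE_CAPABILITY.items():
    verdict = "FP8 OK" if has_fp8_tensor_cores(*capability) else "use FP16"
    print(f"{gpu}: {verdict}")
```

A check like this is useful in launch scripts that target mixed fleets, so the FP8 flag is only added on hardware that can accelerate it.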
For healthcare, finance, or defense workloads that require encrypted VRAM and hardware attestation in addition to production throughput, the confidential GPU computing guide covers running vLLM inside an NVIDIA CC mode environment with remote attestation and KMS integration.
Production Configuration for High Throughput
Once the model is running, tune these parameters for production workloads handling hundreds of concurrent requests:
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--tensor-parallel-size 2 \
--max-model-len 16384 \
--gpu-memory-utilization 0.92 \
--max-num-seqs 512 \
--max-num-batched-tokens 65536 \
--enable-chunked-prefill \
--kv-cache-dtype fp8
What each production setting does:
- `--max-num-seqs 512`: increase from the default 256 when you have many concurrent users and your VRAM can support it; monitor `vllm:kv_cache_usage_perc` to see if you're running out of KV cache space
- `--max-num-batched-tokens 65536`: maximum tokens processed per forward pass iteration; increase for throughput-optimized workloads, decrease if you see out-of-memory errors under bursty load
- `--enable-chunked-prefill`: breaks long prefill sequences into smaller chunks and interleaves them with ongoing decode steps; reduces latency spikes when your traffic mix includes both long prompts and short responses
- `--kv-cache-dtype fp8`: stores the KV cache in FP8 format; saves ~50% VRAM on cached keys and values (FP8 uses 1 byte vs 2 bytes per element in FP16), allowing more concurrent requests with the same GPU; quality impact is minimal for most workloads
These settings are not universal: tune --max-num-seqs and --max-num-batched-tokens based on your actual traffic patterns. A workload with many short requests benefits from higher --max-num-seqs. A batch inference workload with long documents needs higher --max-num-batched-tokens. Start with these values and adjust based on the metrics in the monitoring section.
Shortcut: --performance-mode (new in v0.17.0): If you don't want to tune individual flags, use --performance-mode throughput for batch workloads or --performance-mode interactivity for chat/real-time applications. The balanced mode (default) is a reasonable starting point for mixed traffic. This flag configures a curated set of defaults for each scenario; you can still override individual flags on top of it.
vLLM pre-captures CUDA graphs for the decode phase during the warm-up pass at server start. For a deep dive into how CUDA graph capture and replay work at the PyTorch level, including how to implement it yourself in a raw PyTorch serving stack, see the CUDA graph capture guide for LLM inference.
Load Balancing Multiple vLLM Instances
Tensor parallelism makes one vLLM instance use more GPUs. Load balancing makes multiple vLLM instances handle more total traffic. Use both when your traffic exceeds what one instance can handle.
The simplest horizontal scaling approach: run separate vLLM instances on separate GPU devices, then load balance across them with NGINX.
# Instance 1: pinned to GPU 0, listening on port 8000
docker run --gpus '"device=0"' --ipc=host -p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
vllm/vllm-openai:latest \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--dtype float16 \
--max-model-len 8192 \
--max-num-seqs 256
# Instance 2: pinned to GPU 1, listening on port 8001
docker run --gpus '"device=1"' --ipc=host -p 8001:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
vllm/vllm-openai:latest \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--dtype float16 \
--max-model-len 8192 \
--max-num-seqs 256
Note: This example uses `--dtype float16` for broad hardware compatibility: it works on RTX 4090, RTX 5090, A100, and other CUDA GPUs. If you are running on H100, H200, NVIDIA Blackwell (B200), RTX 5090, or RTX 4090 (Ada Lovelace), you can add `--quantization fp8` to leverage hardware FP8 Tensor Cores for ~2x throughput. FP8 Tensor Core support requires Ada Lovelace (SM89, RTX 40 series) or newer. Do not enable FP8 on A100, RTX 3090, or other Ampere (SM80/SM86) and older hardware: those GPUs lack FP8 Tensor Cores and will fall back to FP16 silently or error. The RTX 5090 (Blackwell GB202) includes FP8 Tensor Cores and is fully supported in vLLM v0.17.0+, which ships a dedicated SM120 FP8 GEMM optimization for higher FP8 throughput on Blackwell consumer GPUs.
NGINX configuration for load balancing across both instances:
upstream vllm_backend {
least_conn;
server localhost:8000;
server localhost:8001;
}
server {
listen 80;
location /v1/ {
proxy_pass http://vllm_backend;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_read_timeout 300s;
proxy_connect_timeout 10s;
proxy_buffering off;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
}
Use least_conn (least connections), not round-robin. vLLM requests vary dramatically in duration: a short request completes in 100ms, a long generation can take 30 seconds. Least-connections routing sends new requests to whichever backend currently has fewer active connections, naturally load-balancing based on actual utilization rather than request count.
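The difference between the two policies shows up clearly in a small simulation. This is an illustrative sketch, not NGINX's algorithm: it assumes one request arrives per tick, durations are known in ticks, and ties go to the first backend. The `simulate` function is hypothetical.

```python
import itertools


def simulate(durations, num_backends, policy):
    """Send one request per tick; return the peak in-flight count on any backend."""
    finish_times = [[] for _ in range(num_backends)]  # per-backend finish ticks
    round_robin = itertools.cycle(range(num_backends))
    peak = 0
    for tick, duration in enumerate(durations):
        # Drop requests that have completed by this tick.
        finish_times = [[t for t in backend if t > tick] for backend in finish_times]
        if policy == "round_robin":
            target = next(round_robin)
        else:  # "least_conn": fewest in-flight requests wins
            target = min(range(num_backends), key=lambda i: len(finish_times[i]))
        finish_times[target].append(tick + duration)
        peak = max(peak, max(len(backend) for backend in finish_times))
    return peak


# Alternating long generations (30 ticks) and short requests (1 tick).
traffic = [30, 1, 30, 1, 30, 1]
print("round_robin peak:", simulate(traffic, 2, "round_robin"))
print("least_conn peak:", simulate(traffic, 2, "least_conn"))
```

With this traffic pattern, round-robin keeps sending every long generation to the same backend, while least-connections spreads them out, so the worst-loaded backend carries fewer simultaneous requests.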
Install NGINX and load the configuration:
sudo apt-get install -y nginx
sudo cp vllm-nginx.conf /etc/nginx/sites-available/vllm
sudo ln -s /etc/nginx/sites-available/vllm /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
For multi-node deployments (separate physical servers), replace localhost:8000 and localhost:8001 with the actual IP addresses of each node. The rest of the NGINX configuration remains the same.
Monitoring in Production
Running vLLM in production without monitoring is running blind. Two systems to set up from day one:
GPU-Level Monitoring
# Real-time GPU stats: VRAM usage, utilization, temperature
watch -n 2 nvidia-smi
# Persistent logging to file
nvidia-smi dmon -s pum -d 10 >> gpu-metrics.log &
For production, see our GPU monitoring guide for setting up DCGM with Prometheus and Grafana dashboards: nvidia-smi polling works for development but doesn't scale to multi-GPU production systems.
vLLM Metrics Endpoint
vLLM exposes a Prometheus-compatible metrics endpoint at /metrics:
curl http://localhost:8000/metrics | grep vllm
The metrics you need to watch:
| Metric | What it Tells You | Action if High |
|---|---|---|
| `vllm:num_requests_running` | Active requests being processed | Normal: reflects traffic |
| `vllm:num_requests_waiting` | Requests queued, waiting for a free slot | Scale out or increase `--max-num-seqs` |
| `vllm:kv_cache_usage_perc` | KV cache fill percentage | If >95%, reduce `--max-num-seqs` or `--max-model-len` |
| `vllm:time_to_first_token_seconds` | Latency before first token is generated | Your primary user-facing latency metric |
| `vllm:e2e_request_latency_seconds` | Total request duration | Track p95 and p99, not just mean |
A rising num_requests_waiting with high kv_cache_usage_perc means you're KV-cache-bound: the bottleneck is memory, not compute. Reduce --max-model-len or add --kv-cache-dtype fp8 to free up space. A rising num_requests_waiting with low kv_cache_usage_perc means you're compute-bound: add more GPUs or instances.
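The decision rule above can be automated against the `/metrics` output. This is a minimal sketch with several assumptions: the endpoint emits plain `name{labels} value` lines without timestamps, `vllm:kv_cache_usage_perc` is reported as a fraction between 0 and 1, and a single snapshot stands in for the trend that a real alert should use. `parse_metrics` and `diagnose` are hypothetical helpers.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_name: value}, ignoring labels.

    If a metric appears with several label sets, the last one wins.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        try:
            values[name] = float(value)
        except ValueError:
            continue
    return values


def diagnose(metrics, kv_high=0.95):
    """Classify the bottleneck from queue depth and KV cache usage (assumed 0..1)."""
    waiting = metrics.get("vllm:num_requests_waiting", 0.0)
    kv_usage = metrics.get("vllm:kv_cache_usage_perc", 0.0)
    if waiting == 0:
        return "healthy"
    return "kv-cache-bound" if kv_usage >= kv_high else "compute-bound"


sample = """
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 7.0
vllm:kv_cache_usage_perc{model_name="demo"} 0.97
"""
print(diagnose(parse_metrics(sample)))
```

In production the same logic usually lives in a Prometheus alert rule over a time window; the script form is mainly useful for quick checks from a shell.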
For teams adding runtime safety enforcement to their vLLM deployment, covering jailbreak blocking, PII masking, and topic containment, see the NeMo Guardrails production deployment guide for a co-located classifier setup that adds under 80ms of overhead. Teams serving internal users often pair this vLLM backend with Open WebUI or LibreChat as a ChatGPT-style interface.
Common Issues and Fixes
OOM (Out of Memory) Error on Startup
CUDA out of memory. Tried to allocate X GiB
Diagnosis steps in order:
- Reduce `--gpu-memory-utilization` from 0.90 to 0.85: give the GPU more headroom
- Reduce `--max-model-len`: every 1024 tokens of context length requires additional KV cache VRAM
- Enable FP8 weights with `--quantization fp8` instead of serving plain FP16: cuts model VRAM by ~50% on H100
- Increase `--tensor-parallel-size` to spread the model across more GPUs
Slow TTFT (Time to First Token)
If your time-to-first-token is consistently above 2-3 seconds for normal-length prompts:
- Enable tensor parallelism: spreading prefill computation across 2-4 GPUs cuts TTFT roughly in proportion on NVLink systems, less on PCIe
- Enable chunked prefill (`--enable-chunked-prefill`): interleaves prefill with ongoing decode, preventing long prefill jobs from blocking short ones
- Reduce `--max-model-len`: if you don't need 128K context, set it to 8K or 16K; shorter configured max length = smaller KV cache allocation per sequence = more room for batching
CUDA Error: Device-Side Assert Triggered
RuntimeError: CUDA error: device-side assert triggered
This is almost always a tokenizer mismatch. Causes:
- Input text that tokenizes to more tokens than `--max-model-len`
- Using the wrong model name in the API request (model field must match the model vLLM loaded)
- Special tokens or unicode sequences the tokenizer doesn't handle cleanly
Fix: verify that your input length (in tokens, not characters) is below --max-model-len. Use the tokenizer's encode method to check before sending.
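A pre-flight length check might look like the sketch below. It assumes a tokenizer object with an `encode(text)` method that returns a list of token IDs, which is what Hugging Face's `AutoTokenizer.from_pretrained(...)` provides. Counting the requested output tokens against the limit as well is a conservative assumption. `fits_context` is a hypothetical helper, and `WhitespaceTokenizer` is a stand-in so the example runs without downloading a model.

```python
def fits_context(tokenizer, text, max_model_len, max_new_tokens=0):
    """True if prompt tokens plus requested output tokens fit in max_model_len."""
    prompt_tokens = len(tokenizer.encode(text))
    return prompt_tokens + max_new_tokens <= max_model_len


class WhitespaceTokenizer:
    """Stand-in tokenizer for demonstration: one token per whitespace-separated word."""

    def encode(self, text):
        return list(range(len(text.split())))


tokenizer = WhitespaceTokenizer()
prompt = "Summarize the following report in three bullet points"
print(fits_context(tokenizer, prompt, max_model_len=8192, max_new_tokens=512))
print(fits_context(tokenizer, prompt, max_model_len=8, max_new_tokens=1))
```

In a real client, swap the stand-in for the tokenizer that matches the served model; character counts are not a reliable proxy for token counts.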
Low GPU Utilization (Consistently Below 70%)
If nvidia-smi shows GPU utilization below 70% but requests are queuing:
- Increase `--max-num-seqs`: you're not saturating the GPU with enough concurrent sequences
- Increase `--max-num-batched-tokens`: each forward pass is processing too few tokens
- Check your client: if you're sending requests one at a time and waiting for each to complete, you're not actually generating concurrent load; vLLM's continuous batching requires concurrent requests to be effective
Model Download Fails at Startup
OSError: Hugging Face Hub is not reachable
- Verify your `HUGGING_FACE_HUB_TOKEN` environment variable is set correctly
- Check that the model name matches exactly (case-sensitive) the Hugging Face repo path
- Pre-download the model with `huggingface-cli download model-name` and mount the local directory instead, to decouple model download from container startup
For more context on GPU infrastructure for production ML, see the production GPU cloud architecture guide. For LoRA-specific deployments serving multiple fine-tuned adapters on one GPU, see the LoRA multi-adapter serving guide.
Deploy vLLM on Spheron's bare-metal H100s, RTX 5090s, or B200s: full CUDA access, no virtualization overhead. Your models run at native GPU performance.
Quick Setup Guide
Select your GPU based on model size and budget. For 7B–13B models, an RTX 5090 (32GB, $0.92/hr on-demand) is sufficient. For 70B models in FP8, a single H100 SXM5 80GB ($2.50/hr on-demand, $1.03/hr spot) works on one GPU - though 70B FP8 weights are ~70GB, so tune --gpu-memory-utilization carefully. For 70B FP16, use 2x H100 SXM5 with tensor parallelism ($5.00/hr on-demand). A100 SXM4 runs from $1.64/hr on-demand. Prices as of May 2026. For a full inference GPU comparison covering cost-per-token at each model size, see the AI inference GPU guide at /blog/best-gpu-for-ai-inference-2026/.
Log into Spheron and select your GPU, region, and provider from the GPU catalog. SSH into the instance and run `nvidia-smi` to verify all GPUs appear with the expected VRAM. For multi-GPU tensor parallelism, confirm the number of GPUs shown matches what you provisioned.
Most Spheron GPU instances include the NVIDIA Container Toolkit pre-installed. If not, install it using the curl commands in the setup section, then run `sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker`. Validate GPU access inside Docker with `docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi`.
Run the vLLM OpenAI-compatible server: `docker run --gpus all --ipc=host -p 8000:8000 vllm/vllm-openai:latest --model <your-model> --dtype float16 --gpu-memory-utilization 0.90 --max-model-len 8192 --max-num-seqs 256`. The `--ipc=host` flag is required - skipping it causes CUDA errors under load. Test with a curl request to `http://localhost:8000/v1/chat/completions`.
For models that exceed single-GPU VRAM, add `--tensor-parallel-size 2` (or 4, 8) to split each transformer layer across multiple GPUs. This also reduces time-to-first-token by distributing prefill computation. NVLink-connected SXM GPUs scale most efficiently; PCIe setups have higher communication overhead per layer boundary.
Add `--quantization fp8` (instead of serving plain `--dtype float16` weights) to use hardware FP8 Tensor Cores on H100 and Blackwell. This provides ~1.5–2x throughput improvement and ~50% VRAM reduction (70B weights: ~70GB in FP8 vs ~140GB in FP16), allowing a 70B model to fit on a single H100 80GB instead of two, though it is tight; set `--gpu-memory-utilization` to 0.92+ and tune `--max-model-len` accordingly. Add `--kv-cache-dtype fp8` to also store the KV cache in FP8, saving additional VRAM. See the [KV Cache Optimization Guide](/blog/kv-cache-optimization-guide/) for a complete walkthrough of FP8/NVFP4 quantization, prefix caching, and CPU offloading. For long-context workloads where GPU KV cache still fills after FP8 quantization, [NVMe KV cache offloading with LMCache](/blog/nvme-kv-cache-offloading-llm-inference/) adds a third storage tier.
Access vLLM's built-in Prometheus metrics at `/metrics`. Watch `vllm:num_requests_waiting` for queue depth, `vllm:kv_cache_usage_perc` for KV cache pressure, and `vllm:time_to_first_token_seconds` for latency. For GPU-level monitoring, use `nvidia-smi dmon -s pum -d 10` or integrate DCGM with a Prometheus/Grafana stack for multi-GPU production systems.
Frequently Asked Questions
Use the --tensor-parallel-size flag in the vLLM Docker command. For example, add --tensor-parallel-size 2 to split a model across 2 GPUs. vLLM handles all inter-GPU communication automatically. Make sure to also set --ipc=host in Docker to enable shared memory between GPU processes.
With FP8 quantization (--quantization fp8), a 70B model requires approximately 70GB of VRAM for weights alone (1 byte × 70B parameters), which barely fits on a single H100 80GB GPU. Set --gpu-memory-utilization to 0.92 or higher and limit --max-model-len, because only minimal headroom is left for KV cache. In FP16, you need at least 2x H100 80GB (160GB total) with --tensor-parallel-size 2. On Spheron, H100 SXM5 80GB instances start from $2.50/hr on-demand ($1.03/hr spot) as of May 2026. For INT4 quantization, which cuts weight VRAM by a further ~50% relative to FP8 and enables 70B serving on a single A100 80GB, see the AWQ quantization guide at /blog/awq-quantization-guide-llm-deployment/.
Tensor parallelism (--tensor-parallel-size) splits each transformer layer across multiple GPUs: all GPUs work on every layer simultaneously. This reduces latency (time to first token). Pipeline parallelism (--pipeline-parallel-size) assigns different layers to different GPUs: each GPU processes a different stage. Pipeline parallelism is better for maximizing throughput on very large models that don't fit in tensor-parallel configurations.
In practice, FP8 causes marginal quality loss: typically less than 1-2% on standard benchmarks for major models like Llama and Mistral. For most production inference workloads, this tradeoff is acceptable given the ~1.5-2x throughput improvement and ~50% VRAM reduction on H100 and Blackwell GPUs.
vLLM exposes a Prometheus-compatible /metrics endpoint. The key metrics to watch are vllm:num_requests_waiting (queue depth), vllm:kv_cache_usage_perc (KV cache fill rate), and vllm:time_to_first_token_seconds (TTFT latency). Pair this with nvidia-smi or DCGM for GPU-level monitoring.
