Rent NVIDIA H200 GPUs on Demand from $1.19/hr
141GB HBM3e, 4.8 TB/s bandwidth, NVLink, per-minute billing. Live in under 2 minutes.
Renting an NVIDIA H200 on Spheron starts at $1.19 per GPU per hour on dedicated instances (99.99% SLA), with interruptible spot instances cheaper still. Billing is per minute, there is no minimum commit, and most instances are live inside two minutes. The H200 shares the H100's Hopper compute (4th-gen Tensor Cores, FP8 via Transformer Engine, 989 TFLOPS TF32 and 3,958 TFLOPS FP8, both with sparsity) and bumps memory to 141GB HBM3e at 4.8 TB/s. That makes it the better pick when the H100 is memory-bound: long-context inference, 70B to 100B serving at large batch sizes, multi-model colocation, and RAG. Specialist clouds price the H200 at roughly $3.80 to $4.00 per GPU per hour (Lambda, Jarvislabs, RunPod), while larger clouds charge $4.98/hr on AWS p5e, ~$6.31/hr on CoreWeave, and ~$10.60 to $10.87/hr on Azure ND H200 v5 and GCP a3-ultragpu on-demand.
Technical specifications
Pricing comparison
| Provider | Price/hr | Compared with Spheron |
|---|---|---|
| Spheron (your price) | $1.19/hr | - |
| Lambda | $3.79/hr | 3.2x more expensive |
| Jarvislabs | $3.80/hr | 3.2x more expensive |
| RunPod | $3.99/hr | 3.4x more expensive |
| AWS p5e | $4.98/hr | 4.2x more expensive |
| CoreWeave | $6.31/hr | 5.3x more expensive |
| Azure ND H200 v5 | $10.60/hr | 8.9x more expensive |
| Google Cloud a3-ultragpu | $10.87/hr | 9.1x more expensive |
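The multiples in the table follow directly from the hourly prices, and per-minute billing makes short-job costs easy to estimate. The sketch below reproduces both. The function names are illustrative, and applying per-minute billing to the other providers is an assumption made only to keep the comparison uniform; their actual billing granularity may differ.

```python
# Hourly on-demand prices per GPU in USD, copied from the table above.
PRICES_PER_HOUR = {
    "Spheron": 1.19,
    "Lambda": 3.79,
    "Jarvislabs": 3.80,
    "RunPod": 3.99,
    "AWS p5e": 4.98,
    "CoreWeave": 6.31,
    "Azure ND H200 v5": 10.60,
    "Google Cloud a3-ultragpu": 10.87,
}


def price_multiple(provider: str, baseline: str = "Spheron") -> float:
    """Return how many times more expensive `provider` is than `baseline`."""
    return PRICES_PER_HOUR[provider] / PRICES_PER_HOUR[baseline]


def job_cost(provider: str, minutes: int, gpus: int = 1) -> float:
    """Estimate job cost in USD, assuming billing per whole minute.

    Per-minute billing is stated on this page for Spheron only; using it
    for every provider is a simplifying assumption.
    """
    return PRICES_PER_HOUR[provider] / 60 * minutes * gpus


if __name__ == "__main__":
    for name in PRICES_PER_HOUR:
        print(
            f"{name:26s} {price_multiple(name):.1f}x   "
            f"90 min on 8 GPUs: ${job_cost(name, 90, 8):.2f}"
        )
```

For example, `job_cost("Spheron", 90, 8)` prices a 90-minute run on an 8-GPU node at the $1.19/hr rate.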
Need More H200 Than What's Listed?
Reserved Capacity
Commit to a duration, lock in availability and better rates
Custom Clusters
8 to 512+ GPUs, specific hardware, InfiniBand configs on request
Supplier Matchmaking
Spheron sources from its certified data center network, negotiates pricing, handles setup
Need more H200 capacity? Tell us your requirements and we'll source it from our certified data center network.
Typical turnaround: 24–48 hours
When to pick the H200
Pick the H200 if
Your workload is memory-bound on H100. That means long-context LLM inference (32K+ tokens), 70B to 100B serving at production batch sizes, multi-model colocation on a single GPU, or RAG stacks where embedding stores and the LLM need to live in VRAM together. H200 gives you 1.76x the memory and 1.43x the bandwidth of H100, same Hopper compute.
Pick the H100 instead if
You are training models up to 70B or running inference that fits comfortably in 80GB. H100 has identical Tensor Core math and Transformer Engine, at a lower hourly rate. Move to H200 only when memory capacity or KV-cache headroom is the bottleneck.
Pick the B200 instead if
You need maximum throughput on the largest models. B200 delivers ~2.3x H100's FP8 dense TFLOPS and ships 192GB HBM3e at 8 TB/s. For trillion-parameter training or FP4 inference, B200 is the right call. For H100-class compute with bigger memory, stay on H200.
Pick the A100 instead if
You are doing classic training up to 30B parameters, quantized inference, or cost-sensitive fine-tuning. A100 80GB costs roughly a third of what H200 does, and its mature software stack still delivers. Move up to H200 when FP8 or 141GB matter.
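The 1.76x memory and 1.43x bandwidth figures quoted for H200 versus H100 can be checked with a few lines of arithmetic. The H200 numbers come from this page; the H100 SXM numbers (80 GB, 3.35 TB/s) are an assumption taken from NVIDIA's published datasheet and are not stated here.

```python
# Per-GPU specs. H200 values are from this page; the H100 SXM values
# (80 GB, 3.35 TB/s) are an assumption based on NVIDIA's datasheet.
H100 = {"memory_gb": 80, "bandwidth_tb_per_s": 3.35}
H200 = {"memory_gb": 141, "bandwidth_tb_per_s": 4.8}

memory_ratio = H200["memory_gb"] / H100["memory_gb"]
bandwidth_ratio = H200["bandwidth_tb_per_s"] / H100["bandwidth_tb_per_s"]

print(f"memory:    {memory_ratio:.2f}x")     # 1.76x
print(f"bandwidth: {bandwidth_ratio:.2f}x")  # 1.43x
```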
Ideal use cases
Long-context LLM inference
141GB HBM3e lets you serve 70B to 100B models at large batch sizes with room left for KV cache on 32K+ context windows. Transformer Engine and FP8 keep latency low; the extra memory keeps throughput high.
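To see where the KV-cache headroom comes from, here is a rough sizing sketch. It assumes the Llama 3.1 70B shape from the model's published config (80 layers, 8 KV heads, head dimension 128, about 70.6B parameters), FP8 weights at 1 byte per parameter, and a 16-bit KV cache. It ignores activations, CUDA graphs, and framework overhead, so treat the result as an upper bound rather than a measured figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    params_billion: float  # total parameters, in billions
    layers: int            # number of transformer layers
    kv_heads: int          # key/value heads (grouped-query attention)
    head_dim: int          # dimension per attention head


# Assumed shape for Llama 3.1 70B (from its public config, not from this page).
LLAMA_31_70B = ModelShape(params_billion=70.6, layers=80, kv_heads=8, head_dim=128)


def kv_bytes_per_token(model: ModelShape, kv_dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * model.layers * model.kv_heads * model.head_dim * kv_dtype_bytes


def max_cached_tokens(
    model: ModelShape,
    gpu_gb: float = 141,
    utilization: float = 0.92,
    weight_bytes_per_param: float = 1.0,  # 1.0 for FP8 weights
    kv_dtype_bytes: int = 2,
) -> int:
    """Upper bound on tokens whose KV cache fits next to the weights."""
    budget = gpu_gb * 1e9 * utilization
    weights = model.params_billion * 1e9 * weight_bytes_per_param
    free = budget - weights
    return max(0, int(free // kv_bytes_per_token(model, kv_dtype_bytes)))


if __name__ == "__main__":
    per_token = kv_bytes_per_token(LLAMA_31_70B)
    tokens = max_cached_tokens(LLAMA_31_70B)
    print(f"KV cache per token: {per_token / 1024:.0f} KiB")
    print(f"Tokens that fit:    {tokens:,}")
    print(f"32K-token sequences: {tokens // 32768}")
```

Setting `gpu_gb=80` in the same call shows why the H100 runs out of room first: after FP8 weights, only a small slice of memory is left for cache.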
Multi-model and RAG serving
Colocate a 30B chat model, a 7B code model, and an embedding model on one card. Keep vector indices and reranker weights resident in VRAM alongside the LLM for sub-10 ms retrieval.
LLM fine-tuning and RLHF
Fine-tune 70B models with larger per-GPU batches, or run full SFT on 30B models without sharding. LoRA and QLoRA on 100B-class models become single-node jobs.
High-throughput inference at scale
For production serving where tokens per dollar matters, H200 widens the batch without running out of memory. Pair with TensorRT-LLM or vLLM for best throughput.
Performance benchmarks
Serve Llama 3.1 70B FP8 on one H200 in under 3 minutes
H200's 141GB fits Llama 3.1 70B in FP8 with plenty of KV-cache headroom for long contexts. This snippet pulls the vLLM image, serves the model with an OpenAI-compatible API, and enables FP8 for best throughput.
```bash
# 1. Provision an H200 from the Spheron CLI (or use the dashboard)
spheron deploy --gpu h200 --image vllm/vllm-openai:latest

# 2. Inside the instance, serve Llama 3.1 70B Instruct in FP8
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --port 8000

# 3. Hit the endpoint from any OpenAI-compatible client
curl http://<instance-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Summarize why H200 is memory-bound workloads first."}]
  }'
```

For models above 141GB or extreme concurrency, add --tensor-parallel-size N and rent a multi-GPU H200 cluster with NVLink. For multi-node InfiniBand clusters, contact us.
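The same chat completion request can be sent from Python. This is a minimal sketch using only the standard library; `build_chat_request` and `chat` are illustrative helper names, and the response parsing assumes the standard OpenAI-compatible chat completions shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, model: str, prompt: str, max_tokens: int = 256
) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (replace the host with your instance address):
# chat("http://<instance-ip>:8000", "meta-llama/Llama-3.1-70B-Instruct",
#      "Explain KV cache in one paragraph.")
```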
Multi-GPU H200 with NVLink and InfiniBand
H200 SXM5 nodes on Spheron connect 8 GPUs with 900 GB/s NVLink inside a node and 400 Gb/s NDR InfiniBand across nodes. That fabric matches NVIDIA's HGX H200 reference design, so tensor-parallel inference with vLLM or TensorRT-LLM, and pipeline-parallel training with Megatron-LM, scale close to linearly.
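A rough way to decide how many H200s a tensor-parallel deployment needs is to divide the weight footprint by the memory each GPU can spare for weights. The sketch below is a back-of-the-envelope estimate, not a vLLM or TensorRT-LLM feature. The 30% KV-cache reserve is an arbitrary assumption, and rounding up to a power of two is a common convention, since the tensor-parallel size has to divide the model's attention head count.

```python
import math


def min_tensor_parallel_size(
    params_billion: float,
    bytes_per_param: float,
    gpu_gb: float = 141,
    utilization: float = 0.92,
    kv_reserve_fraction: float = 0.30,  # assumed share kept free for KV cache
) -> int:
    """Estimate the smallest power-of-two GPU count that holds the weights."""
    usable_gb_per_gpu = gpu_gb * utilization * (1 - kv_reserve_fraction)
    weights_gb = params_billion * bytes_per_param
    needed = math.ceil(weights_gb / usable_gb_per_gpu)

    # Round up to the next power of two (1, 2, 4, 8, ...).
    size = 1
    while size < needed:
        size *= 2
    return size


if __name__ == "__main__":
    print(min_tensor_parallel_size(70.6, 1))  # 70B-class model, FP8 weights
    print(min_tensor_parallel_size(405, 1))   # 405B-class model, FP8 weights
    print(min_tensor_parallel_size(405, 2))   # 405B-class model, 16-bit weights
```

A result above 8 means the weights do not fit in a single 8-GPU node under these assumptions, which is where the InfiniBand fabric across nodes comes in.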
Need a custom multi-node cluster or reserved capacity? Talk to us about topology, regions, and committed pricing.
H200 vs alternatives
H200 vs H100: Same Hopper compute, 1.76x the memory and 1.43x the bandwidth. Pick H200 when H100 is memory-bound: long context, 70B+ inference at scale, or multi-model serving.
H200 vs B200: B200 Blackwell delivers ~2.3x H100's FP8 dense TFLOPS and 192GB HBM3e at 8 TB/s. Jump to B200 for trillion-parameter training or FP4 inference. H200 is Hopper compute with bigger memory.
H200 vs A100: A100 is ~60 to 70 percent cheaper per hour but lacks FP8 and Transformer Engine. Use A100 for classic training up to 30B. H200 is the right call when FP8 throughput or 141GB matter.
H200 vs MI300X: MI300X has more VRAM (192GB) but its software stack trails NVIDIA's CUDA / TensorRT-LLM / vLLM ecosystem. For production inference today, H200 is the faster-to-deploy path.
Related resources
NVIDIA H100 vs H200: Benchmarks, Specs, and When to Upgrade
Side-by-side comparison for LLM inference, memory bandwidth, batch sizing, and when the H200 premium pays off.
H200 Deployment Guide: Long Context, Multi-Model, and NVLink Clusters
Practitioner deep dive on H200 configurations, KV-cache sizing, multi-model colocation, and cluster patterns on Spheron.
H200 vs B200 vs GB200: Which Blackwell-Class GPU Fits Your Workload?
How H200 compares to Blackwell B200 and GB200 for training, inference, and memory-bound workloads.
AMD MI300X vs NVIDIA H200: Memory, Performance, and Cost
How the H200 stacks up against AMD's MI300X across memory capacity, software stack maturity, and total cost.
Best NVIDIA GPUs for LLMs
Framework for matching GPU choice to model size and workload, from 7B on A100 to 671B on B200.
GPU Memory Requirements for Large Language Models
Calculate VRAM needs across precision levels and KV-cache pressure for every major model class.