Spheron GPU Catalog

Rent NVIDIA H200 GPUs on Demand from $1.19/hr

141GB HBM3e, 4.8 TB/s bandwidth, NVLink, per-minute billing. Live in under 2 minutes.

At a glance

Renting an NVIDIA H200 on Spheron starts at $1.19 per GPU-hour on dedicated instances (99.99% SLA), with interruptible spot instances cheaper still. Billing is per minute, there is no minimum commitment, and most instances are live inside two minutes. The H200 shares the H100's Hopper compute (4th-gen Tensor Cores, FP8 via the Transformer Engine, 989 TFLOPS TF32 and 3,958 TFLOPS FP8, both with sparsity) and raises memory to 141GB of HBM3e at 4.8 TB/s. That makes it the better pick when the H100 is memory-bound: long-context inference, 70B to 100B serving at large batch sizes, multi-model colocation, and RAG. Specialist clouds price the H200 around $3.80 to $4.00 per GPU-hour (Lambda, Jarvislabs, RunPod), while hyperscalers run $4.98/hr on AWS p5e, ~$6.31/hr on CoreWeave, and ~$10.60 to $10.87/hr on Azure ND H200 v5 and GCP a3-ultragpu on demand.

GPU Architecture: NVIDIA Hopper
VRAM: 141 GB HBM3e
Memory Bandwidth: 4.8 TB/s

Technical specifications

GPU Architecture: NVIDIA Hopper
VRAM: 141 GB HBM3e
Memory Bandwidth: 4.8 TB/s
Tensor Cores: 528 (4th Gen)
CUDA Cores: 16,896
FP64 Performance: 34 TFLOPS
FP32 Performance: 67 TFLOPS
TF32 Tensor: 989 TFLOPS (with sparsity)
FP16 Tensor: 1,979 TFLOPS (with sparsity)
FP8 Tensor: 3,958 TFLOPS (with sparsity)
INT8 Tensor: 3,958 TOPS (with sparsity)
NVLink Bandwidth: 900 GB/s
System RAM: 200 GB DDR5
vCPUs: 16
Storage: 465 GB NVMe Gen4
Form Factor: SXM5
TDP: 700W

Pricing comparison

Provider: price per hour (vs Spheron)
Spheron (your price): $1.19/hr
Lambda: $3.79/hr (3.2x more expensive)
Jarvislabs: $3.80/hr (3.2x more expensive)
RunPod: $3.99/hr (3.4x more expensive)
AWS p5e: $4.98/hr (4.2x more expensive)
CoreWeave: $6.31/hr (5.3x more expensive)
Azure ND H200 v5: $10.60/hr (8.9x more expensive)
Google Cloud a3-ultragpu: $10.87/hr (9.1x more expensive)
Custom & Reserved

Need More H200 Than What's Listed?

Reserved Capacity

Commit to a duration, lock in availability and better rates

Custom Clusters

8 to 512+ GPUs, specific hardware, InfiniBand configs on request

Supplier Matchmaking

Spheron sources from its certified data center network, negotiates pricing, handles setup

Need more H200 capacity? Tell us your requirements and we'll source it from our certified data center network.

Typical turnaround: 24–48 hours

When to pick the H200

Scenario 01

Pick the H200 if

Your workload is memory-bound on H100. That means long-context LLM inference (32K+ tokens), 70B to 100B serving at production batch sizes, multi-model colocation on a single GPU, or RAG stacks where embedding stores and the LLM need to live in VRAM together. H200 gives you 1.76x the memory and 1.43x the bandwidth of H100, same Hopper compute.
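
A quick way to see the memory-bound case concretely is a back-of-envelope KV-cache estimate. The sketch below assumes Llama 3.1 70B served in FP8 (80 layers, 8 KV heads with GQA, head dim 128, roughly 70 GB of weights); real headroom depends on the serving stack, so treat the numbers as rough, not exact.

bash
# Rough KV-cache math for Llama 3.1 70B in FP8.
# Assumptions: 80 layers, 8 KV heads (GQA), head dim 128, 1 byte per FP8 value, ~70 GB weights.
python3 - <<'EOF'
layers, kv_heads, head_dim, bytes_per_val = 80, 8, 128, 1
per_token = 2 * layers * kv_heads * head_dim * bytes_per_val     # K and V, all layers
per_seq_gb = per_token * 32_768 / 1e9                            # one 32K-token sequence
weights_gb = 70                                                  # FP8 weights, approximate
for name, vram in [("H100 80GB", 80), ("H200 141GB", 141)]:
    free = vram - weights_gb
    print(f"{name}: ~{per_seq_gb:.1f} GB KV per 32K sequence, "
          f"~{int(free // per_seq_gb)} full-length sequences fit")
EOF

The ratio, not the exact count, is the point: the same weights leave roughly seven times the KV-cache headroom on H200.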

Scenario 02

Pick the H100 instead if

You are training on models up to 70B or running inference that comfortably fits in 80GB. H100 has identical Tensor Core math and Transformer Engine, at a lower hourly rate. Move to H200 only when memory capacity or KV-cache headroom is the bottleneck.

Scenario 03

Pick the B200 instead if

You need maximum throughput on the largest models. B200 delivers ~2.3x H100's FP8 dense TFLOPS and ships 192GB HBM3e at 8 TB/s. For trillion-parameter training or FP4 inference, B200 is the right call. For H100-class compute with bigger memory, stay on H200.

Scenario 04

Pick the A100 instead if

You are doing classic training up to 30B parameters, quantized inference, or cost-sensitive fine-tuning. A100 80GB costs roughly a third of H200 and the mature stack still delivers. Skip to H200 when FP8 or 141GB matter.


Ideal use cases

Use case / 01
💬

Long-context LLM inference

141GB HBM3e lets you serve 70B to 100B models at large batch sizes with room left for KV cache on 32K+ context windows. Transformer Engine and FP8 keep latency low; the extra memory keeps throughput high.

Llama 3.1 70B FP8 at batch 128+ with 32K context
Mixtral 8x22B and DeepSeek V3 serving on fewer GPUs
Long-document chat with 100K+ token windows
High-concurrency agent and copilot backends
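
A minimal long-context launch sketch for the serving pattern above, assuming vLLM's OpenAI-compatible server (the full provisioning walkthrough appears later on this page); the 128K window and FP8 KV cache are illustrative settings, not requirements.

bash
# Long-context serving sketch: Llama 3.1 70B with a 128K window and an FP8 KV cache
# to stretch the 141 GB further. Tune the values for your own model and traffic.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.92 \
  --port 8000
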
Use case / 02
📚

Multi-model and RAG serving

Colocate a 30B chat model, a 7B code model, and an embedding model on one card. Keep vector indices and reranker weights resident in VRAM alongside the LLM for sub-10 ms retrieval.

Enterprise RAG with embeddings + LLM in-memory
Legal / medical AI stacks with specialized models
A/B serving of multiple model versions on one GPU
Reranker + LLM pipelines without cross-GPU hops
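
One hedged way to colocate models is to run separate vLLM servers on the same card, each capped to a slice of VRAM with --gpu-memory-utilization; the models, fractions, and ports below are illustrative, and clean sharing of one GPU between processes depends on your vLLM version.

bash
# Illustrative colocation on one 141 GB H200: a ~30B chat model and a 7B code model
# run as independent servers, each limited to a fraction of total GPU memory.
vllm serve Qwen/Qwen2.5-32B-Instruct \
  --gpu-memory-utilization 0.55 --port 8000 &
vllm serve Qwen/Qwen2.5-Coder-7B-Instruct \
  --gpu-memory-utilization 0.25 --port 8001 &
wait
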
Use case / 03
🎯

LLM fine-tuning and RLHF

Fine-tune 70B models with larger per-GPU batches, or run full SFT on 30B models without sharding. LoRA and QLoRA on 100B-class models become single-node jobs.

Llama 3.1 70B supervised fine-tuning
RLHF / DPO on 30B to 70B models
LoRA / QLoRA on 100B+ checkpoints
Domain tuning for code, legal, medical
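
As a sketch of the single-card case, the snippet below assumes a QLoRA run on one H200: the 4-bit base model (~35 GB for a 70B) plus adapters, optimizer state, and activations fit in 141 GB. finetune_qlora.py is a hypothetical stand-in for whatever trainer you use (PEFT + bitsandbytes, torchtune, axolotl), so its arguments are assumptions.

bash
# Hypothetical single-H200 QLoRA fine-tune; only the libraries installed here are real,
# finetune_qlora.py and its flags stand in for your own training script.
pip install transformers peft bitsandbytes datasets accelerate
python finetune_qlora.py \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --load-in-4bit \
  --lora-rank 16 \
  --dataset my-domain-sft
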
Use case / 04

High-throughput inference at scale

For production serving where tokens per dollar matters, H200 widens the batch without running out of memory. Pair with TensorRT-LLM or vLLM for best throughput.

Chat backends at millions of tokens per minute
Code-generation services with large KV cache
Multi-tenant inference with dynamic batching
Speculative decoding with draft + target models resident
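
A throughput-oriented launch sketch: --max-num-seqs and prefix caching are standard vLLM knobs for continuous batching and shared system prompts, but the values here are assumptions to tune against your own traffic.

bash
# Throughput-first vLLM launch: raise the concurrent-sequence cap and reuse cached
# prefixes so shared system prompts are not recomputed on every request.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --max-num-seqs 256 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.95 \
  --port 8000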

Performance benchmarks

Llama 2 70B inference (MLPerf v5.0): ~33,000 tok/s (~1.4x H100 offline)
Peak single-GPU throughput: up to 1.9x H100 on memory-bound decode
Memory bandwidth: 4.8 TB/s vs 3.35 TB/s on H100 (1.43x)
VRAM capacity: 141 GB HBM3e, 1.76x H100 (80 GB HBM3)
FP8 Tensor (with sparsity): 3,958 TFLOPS, same Hopper compute as H100
Concurrent model serving: 3 to 5 models (20B to 70B class) resident on one card

Serve Llama 3.1 70B FP8 on one H200 in under 3 minutes

H200's 141GB fits Llama 3.1 70B in FP8 with plenty of KV-cache headroom for long contexts. This snippet pulls the vLLM image, serves the model with an OpenAI-compatible API, and enables FP8 for best throughput.

bash
# 1. Provision an H200 from the Spheron CLI (or use the dashboard)
spheron deploy --gpu h200 --image vllm/vllm-openai:latest

# 2. Inside the instance, serve Llama 3.1 70B Instruct in FP8
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --port 8000

# 3. Hit the endpoint from any OpenAI-compatible client
curl http://<instance-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Summarize why the H200 suits memory-bound workloads."}]
  }'

For models above 141GB or extreme concurrency, add --tensor-parallel-size N and rent a multi-GPU H200 cluster with NVLink. For multi-node InfiniBand clusters, contact us.
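
For example, a hedged sketch of serving Llama 3.1 405B in FP8 across an 8x H200 node; the checkpoint name and memory fraction are illustrative.

bash
# Tensor-parallel serving across all 8 GPUs in one node; NVLink carries the per-layer
# all-reduce traffic. ~405 GB of FP8 weights spread to roughly 51 GB per GPU.
vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --port 8000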

Interconnect fabric

Multi-GPU H200 with NVLink and InfiniBand

H200 SXM5 nodes on Spheron connect 8 GPUs with 900 GB/s NVLink inside a node and 400 Gb/s NDR InfiniBand across nodes. That fabric matches NVIDIA's HGX H200 reference design, so tensor-parallel inference with vLLM or TensorRT-LLM, and pipeline-parallel training with Megatron-LM, scale close to linearly.

01. 900 GB/s NVLink between GPUs inside a node
02. 400 Gb/s NDR InfiniBand across nodes
03. GPUDirect RDMA for zero-copy GPU-to-GPU transfers
04. 1.1 TB unified GPU memory in an 8x H200 node
05. NCCL pre-tuned for H200 topology
06. Tested with vLLM, TensorRT-LLM, SGLang, Megatron-LM, DeepSpeed ZeRO-3
07. Tensor-parallel and pipeline-parallel serving for 200B+ models
08. Sub-microsecond latency for GPU-to-GPU communication
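
To sanity-check the fabric on a freshly provisioned node, two standard checks help: nvidia-smi's topology matrix (NV-prefixed entries indicate NVLink paths) and NVIDIA's nccl-tests all-reduce benchmark. The build steps below assume CUDA and a compiler are already on the image.

bash
# Show how the 8 GPUs are wired; NV* entries in the matrix mean NVLink.
nvidia-smi topo -m

# Measure intra-node all-reduce bandwidth with NVIDIA's nccl-tests.
git clone https://github.com/NVIDIA/nccl-tests.git && cd nccl-tests
make -j
./build/all_reduce_perf -b 8 -e 8G -f 2 -g 8   # sweep 8 B to 8 GB across all 8 GPUs
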
Scale

Need a custom multi-node cluster or reserved capacity?

H200 vs alternatives

Related resources

FAQ / 14

Frequently asked questions

Also consider