Spheron GPU Catalog

Rent NVIDIA H100 GPUs on Demand from $2.51/hr

80GB HBM3, 400 Gb/s InfiniBand, per-minute billing. Live in under 2 minutes.

At a glance

Renting an NVIDIA H100 on Spheron starts at $2.51 per GPU per hour on demand, with spot instances cheaper still. There is no minimum commit, billing is per minute, and an instance is usually live in under 2 minutes. The H100 has 80GB of HBM3 and 3.35 TB/s of memory bandwidth, which is enough headroom to fine-tune Llama 3 70B in 4-bit on a single card or serve a quantized 70B model at production latency. For multi-node training, every H100 node ships with 400 Gb/s InfiniBand and GPUDirect RDMA. For comparison, on-demand H100 pricing on AWS, GCP, and Azure currently sits between about $3 and $7 per GPU per hour.
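
As a rough check on the single-card claim, weight memory can be approximated as parameter count times bytes per parameter. The sketch below is a back-of-the-envelope estimate only: the function name and the per-precision byte sizes are illustrative assumptions, and it ignores activations, KV cache, optimizer state, and framework overhead, so real usage will be higher.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: 1 GB = 1e9 bytes; everything except the weights is ignored.

H100_MEMORY_GB = 80  # VRAM figure from the spec list on this page

# Illustrative byte sizes per stored parameter (assumed, not vendor data).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate size of the model weights in GB."""
    return params_billion * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for precision in ("fp16", "fp8", "int4"):
        size = weight_memory_gb(70, precision)
        verdict = "fits" if size < H100_MEMORY_GB else "does not fit"
        print(f"70B @ {precision}: ~{size:.0f} GB of weights, {verdict} in {H100_MEMORY_GB} GB")
```

Under these assumptions a 70B model needs about 35 GB in 4-bit and about 70 GB in FP8, while FP16 weights alone (about 140 GB) exceed a single 80 GB card.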

GPU Architecture: NVIDIA Hopper
VRAM: 80 GB HBM3
Memory Bandwidth: 3.35 TB/s

Technical specifications

GPU Architecture: NVIDIA Hopper
VRAM: 80 GB HBM3
Memory Bandwidth: 3.35 TB/s
Tensor Cores: 4th Generation
CUDA Cores: 16,896
FP64 Performance: 34 TFLOPS
FP32 Performance: 67 TFLOPS
TF32 Performance: 989 TFLOPS (Tensor Core, with sparsity)
FP16 Performance: 1,979 TFLOPS (Tensor Core, with sparsity)
INT8 Performance: 3,958 TOPS (Tensor Core, with sparsity)
System RAM: 116 GB DDR4
vCPUs: 26
Storage: 2.4 TB NVMe SSD
Network: 400 Gb/s InfiniBand
TDP: 700W

Pricing comparison

Provider               Price/hr    Savings
Spheron (your price)   $2.51/hr    -
RunPod                 $1.99/hr    -
Lambda Labs            $2.99/hr    1.2x more expensive
Google Cloud           $3.00/hr    1.2x more expensive
Nebius                 $3.08/hr    1.2x more expensive
AWS                    $3.90/hr    1.6x more expensive
CoreWeave              $6.16/hr    2.5x more expensive
Azure                  $6.98/hr    2.8x more expensive
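
The multipliers in the Savings column can be reproduced by dividing each provider's hourly price by the Spheron price and rounding to one decimal. This is a minimal sketch using only the prices listed above; the function and variable names are hypothetical.

```python
# Reproduce the "x more expensive" column from the listed hourly prices.

SPHERON_PRICE = 2.51  # $/GPU/hr, from the table above

LISTED_PRICES = {
    "RunPod": 1.99,
    "Lambda Labs": 2.99,
    "Google Cloud": 3.00,
    "Nebius": 3.08,
    "AWS": 3.90,
    "CoreWeave": 6.16,
    "Azure": 6.98,
}


def price_ratio(other_price: float, baseline: float = SPHERON_PRICE) -> float:
    """Ratio of another provider's price to the baseline, to one decimal."""
    return round(other_price / baseline, 1)


if __name__ == "__main__":
    for provider, price in LISTED_PRICES.items():
        print(f"{provider}: {price_ratio(price)}x the Spheron price")
```

A ratio below 1.0 means the listed price is lower than Spheron's, which is the case for the RunPod row.
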
Custom & Reserved

Need More H100 Than What's Listed?

Reserved Capacity

Commit to a duration, lock in availability and better rates

Custom Clusters

8 to 512+ GPUs, specific hardware, InfiniBand configs on request

Supplier Matchmaking

Spheron sources from its certified data center network, negotiates pricing, and handles setup

Need more H100 capacity? Tell us your requirements and we'll source it from our certified data center network.

Typical turnaround: 24–48 hours

When to pick the H100

Scenario 01

Pick the H100 if

You are training or fine-tuning a 30B+ parameter model, serving a 70B model in production, or running multi-node distributed training that depends on InfiniBand and GPUDirect RDMA. The H100 is also the right call for FP8 inference workloads where the Transformer Engine pays back the price difference vs older silicon.

Scenario 02

Pick the A100 instead if

Your model fits in 80GB and you do not need FP8. The A100 runs roughly half the price of the H100 and handles BERT-class training, sub-30B fine-tunes, and most production inference without breaking a sweat. Spend the savings on more GPUs.

Scenario 03

Pick the H200 instead if

You are bottlenecked by memory, not compute. The H200 keeps the Hopper architecture of the H100 but raises memory from 80GB to 141GB of HBM3e (about 1.8x) and bandwidth to 4.8 TB/s. Long-context inference, KV-cache-heavy serving, and fitting 70B models without sharding are where the H200 wins.
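
To see why long-context serving becomes memory-bound, it helps to estimate KV-cache size per token. The sketch below assumes the published Llama 3 70B shape (80 layers, 8 key/value heads, head dimension 128) and a 16-bit cache; the function names are hypothetical and the result excludes weights and runtime overhead.

```python
# KV-cache size estimate for a decoder-only transformer with grouped-query attention.

def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of cache stored per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_value: int = 2) -> float:
    """Total cache size in GiB for `tokens` cached tokens across all sequences."""
    per_token = kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)
    return tokens * per_token / 2**30


if __name__ == "__main__":
    llama3_70b = dict(layers=80, kv_heads=8, head_dim=128)  # assumed model shape
    for tokens in (8_192, 131_072):
        print(f"{tokens} tokens: {kv_cache_gib(tokens, **llama3_70b):.1f} GiB of KV cache")
```

With these assumptions each cached token costs 320 KiB, so 8,192 tokens take 2.5 GiB and 131,072 tokens take 40 GiB, on top of the model weights.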

Scenario 04

Pick the B200 instead if

You are training trillion-parameter models or pushing the absolute frontier of throughput per node. The B200 has roughly 2.3x the FP8 dense TFLOPS of H100, 192GB HBM3e at 8 TB/s, and FP4 support. Real-world inference speedups vary by model and stack. The catch is availability and price. Most teams do not need it yet.


Ideal use cases

Use case / 01
🤖

LLM training and fine-tuning

Train transformer architectures from scratch or fine-tune frontier-class models with FP8 mixed precision. The H100's Transformer Engine cuts memory and time per step versus FP16.

- Pre-training 7B to 70B parameter models
- Full fine-tunes of Llama 3, Qwen, Mistral, and DeepSeek
- LoRA / QLoRA at scale across multiple adapters
- Multi-modal models combining text, vision, and audio
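
For sizing a full fine-tune or pre-training run, a common rule of thumb for mixed-precision training with Adam is about 16 bytes of model state per parameter (16-bit weights and gradients plus 32-bit master weights and two optimizer moments). The sketch below applies that rule; it is an assumption-laden estimate with hypothetical function names, it ignores activations and communication buffers, and FP8 or memory-saving optimizers would change the numbers.

```python
import math

# Assumed per-parameter cost for mixed-precision Adam:
# 2 (fp16 weights) + 2 (fp16 grads) + 4 (fp32 master) + 4 (momentum) + 4 (variance)
BYTES_PER_PARAM_MIXED_ADAM = 2 + 2 + 4 + 4 + 4


def training_state_gb(params_billion: float) -> float:
    """Approximate model-state memory in GB (1 GB = 1e9 bytes), activations excluded."""
    return params_billion * BYTES_PER_PARAM_MIXED_ADAM


def min_gpus_for_state(params_billion: float, gpu_memory_gb: float = 80) -> int:
    """Fewest GPUs whose combined memory holds the fully sharded model state."""
    return math.ceil(training_state_gb(params_billion) / gpu_memory_gb)


if __name__ == "__main__":
    for size in (7, 70):
        print(f"{size}B: ~{training_state_gb(size):.0f} GB of state, "
              f"at least {min_gpus_for_state(size)} x 80 GB GPUs")
```

Under this rule even a 7B full fine-tune needs its state sharded across at least two 80 GB cards, which is why parameter-efficient methods such as LoRA are attractive on a single GPU.
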
Use case / 02

Production LLM inference

Serve large models at low latency with vLLM, TensorRT-LLM, or SGLang. With FP8 quantization, a 70B-class model fits on a single H100, so it can serve production traffic without sharding overhead.

- Llama 3 70B serving at sub-second TTFT
- DeepSeek V3 and R1 deployments
- Real-time chat and agent backends
- Speculative decoding with paired draft models
Use case / 03
🎨

Diffusion and video generation

Run image and video diffusion pipelines that need high VRAM and bandwidth. The H100 chews through batch sizes that the RTX 4090 cannot touch.

- Stable Diffusion XL and Flux at high batch
- Wan 2.1 and other video diffusion models
- ControlNet and IP-Adapter pipelines at scale
- Voice cloning and TTS production workloads
Use case / 04
🔬

HPC and scientific computing

FP64 throughput on the H100 is 34 TFLOPS, roughly 3.5 times the A100's 9.7 TFLOPS. That matters for simulation work where double precision is non-negotiable.

- Molecular dynamics and protein folding
- Computational fluid dynamics
- Climate and weather modeling
- Quantum chemistry and DFT codes

Performance benchmarks

Llama 3.1 8B inference (vLLM, BF16): ~12,500 tok/s (single H100 SXM)
Llama 3.1 70B inference (FP8, BS=64): ~460 tok/s (single H100 SXM)
GPT-3 175B training (MLPerf v3.0): up to 3.1x faster (vs A100 80GB)
LLM training (NVIDIA headline): up to 4x faster (vs A100 80GB)
Mixed-precision training throughput: ~2.4x faster (vs A100 80GB)
FP8 vs FP16 throughput (TE): up to 1.6x faster (Transformer Engine on)
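
The throughput figures above can be turned into an approximate cost per million generated tokens by combining them with the $2.51/hr on-demand price. The sketch assumes the quoted tok/s numbers are sustained aggregate throughput at full utilization, which real traffic rarely achieves; the function name is hypothetical.

```python
# Convert hourly GPU price and sustained throughput into $ per million tokens.

def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per one million tokens at a constant throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    price = 2.51  # $/GPU/hr, on-demand price from this page
    for label, tps in (("Llama 3.1 8B (BF16)", 12_500), ("Llama 3.1 70B (FP8)", 460)):
        print(f"{label}: ~${cost_per_million_tokens(price, tps):.3f} per 1M tokens")
```

At these rates the 8B figure works out to roughly $0.06 per million tokens and the 70B figure to roughly $1.52, before accounting for idle time or input-token processing.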

Launch vLLM on an H100 in under 2 minutes

Spin up a Spheron H100 instance, pull the vLLM image, and serve Llama 3 70B with an OpenAI-compatible API. Drop the same client into your existing OpenAI SDK call and point the base URL at the new endpoint.

bash
# 1. Provision an H100 from the Spheron CLI (or use the dashboard)
spheron deploy --gpu h100 --image vllm/vllm-openai:latest

# 2. Inside the instance, serve Llama 3 70B with FP8
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --quantization fp8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --port 8000

# 3. Hit the endpoint from any OpenAI-compatible client
curl http://<instance-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "messages": [{"role": "user", "content": "Explain FP8 quantization."}]
  }'

Need multi-GPU or multi-node? Add --tensor-parallel-size 2 for 2x H100, or contact us for InfiniBand-connected clusters.
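
The same request can be issued from Python. The following is a minimal sketch using only the standard library, mirroring the path and JSON body of the curl call in step 3; the helper names `build_chat_request` and `chat` are hypothetical, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (replace <instance-ip> with the address of your instance):
# print(chat("http://<instance-ip>:8000",
#            "meta-llama/Meta-Llama-3-70B-Instruct",
#            "Explain FP8 quantization."))
```

If you already use the OpenAI SDK, the equivalent change is to point its base URL at `http://<instance-ip>:8000/v1` rather than hand-building requests.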

Interconnect fabric

InfiniBand for multi-node H100 training

Every H100 node on Spheron ships with 400 Gb/s InfiniBand. Multi-node training jobs synchronize gradients over RDMA at sub-microsecond latency, so distributed training scales close to linearly with node count instead of stalling on the network.

01 400 Gb/s InfiniBand connectivity per GPU
02 NVIDIA ConnectX-7 network adapters
03 GPUDirect RDMA for zero-copy GPU-to-GPU transfers
04 Optimized for NCCL collective operations
05 Sub-microsecond GPU-to-GPU latency
06 Bare-metal clusters up to 80x H100 (10 nodes)
07 Tested with PyTorch DDP, DeepSpeed ZeRO, and Megatron-LM
08 NVLink and NVSwitch within each node
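
To get a feel for how link bandwidth bounds gradient synchronization, the sketch below applies the standard ring all-reduce cost, in which each participant transfers 2(N-1)/N times the buffer size. It is a bandwidth-only lower bound under stated assumptions: each node is treated as one participant with a single 400 Gb/s link, latency and compute overlap are ignored, and NCCL may choose a different algorithm in practice. The function name is hypothetical.

```python
# Bandwidth-only lower bound for a ring all-reduce across nodes.

def ring_allreduce_seconds(buffer_bytes: float, participants: int,
                           link_gbit_per_s: float) -> float:
    """Time to all-reduce `buffer_bytes`, ignoring latency and overlap."""
    if participants < 2:
        return 0.0  # nothing to synchronize with a single participant
    link_bytes_per_s = link_gbit_per_s * 1e9 / 8
    traffic_per_participant = 2 * (participants - 1) / participants * buffer_bytes
    return traffic_per_participant / link_bytes_per_s


if __name__ == "__main__":
    grad_bytes = 7e9 * 2  # assumed example: 7B parameters with 16-bit gradients
    for nodes in (2, 4, 10):
        seconds = ring_allreduce_seconds(grad_bytes, nodes, 400)
        print(f"{nodes} nodes: at least {seconds:.3f} s per full gradient all-reduce")
```

For the assumed 14 GB gradient buffer on 10 nodes the bound is about 0.5 s per synchronization, which frameworks typically hide by overlapping communication with the backward pass.
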
Scale

Need a custom multi-node cluster or reserved capacity?
