Rent NVIDIA H100 GPUs on Demand from $2.51/hr
80GB HBM3, 400 Gb/s InfiniBand, per-minute billing. Live in under 2 minutes.
Renting an NVIDIA H100 on Spheron starts at $2.51 per GPU per hour on demand, with spot instances cheaper still. There is no minimum commit, billing is per minute, and an instance is usually live in under 2 minutes. The H100 has 80GB of HBM3 and 3.35 TB/s of memory bandwidth, which is enough headroom to fine-tune Llama 3 70B in 4-bit on a single card or serve a 70B model at production latency. For multi-node training, every H100 node ships with 400 Gb/s InfiniBand and GPUDirect RDMA. By comparison, on-demand H100 pricing on AWS, GCP, and Azure currently sits between about $3 and $7 per GPU per hour.
Technical specifications
Pricing comparison
| Provider | Price/hr | vs. Spheron |
|---|---|---|
| Spheron (your price) | $2.51/hr | - |
| RunPod | $1.99/hr | - |
| Lambda Labs | $2.99/hr | 1.2x more expensive |
| Google Cloud | $3.00/hr | 1.2x more expensive |
| Nebius | $3.08/hr | 1.2x more expensive |
| AWS | $3.90/hr | 1.6x more expensive |
| CoreWeave | $6.16/hr | 2.5x more expensive |
| Azure | $6.98/hr | 2.8x more expensive |
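The multipliers in the table are each provider's hourly price divided by the Spheron price. The sketch below reproduces that arithmetic and estimates the cost of a short job under per-minute billing. The function names are illustrative, and the cost helper assumes simple pro-rata billing with no rounding, which is an assumption rather than a documented billing rule.

```python
# On-demand H100 prices per GPU-hour, copied from the pricing table.
PRICES_PER_HOUR = {
    "Spheron": 2.51,
    "RunPod": 1.99,
    "Lambda Labs": 2.99,
    "Google Cloud": 3.00,
    "Nebius": 3.08,
    "AWS": 3.90,
    "CoreWeave": 6.16,
    "Azure": 6.98,
}


def price_ratio(provider: str, baseline: str = "Spheron") -> float:
    """Return the provider's hourly price as a multiple of the baseline price."""
    return PRICES_PER_HOUR[provider] / PRICES_PER_HOUR[baseline]


def per_minute_cost(provider: str, minutes: float, gpus: int = 1) -> float:
    """Dollar cost of `gpus` GPUs for `minutes`, assuming pro-rata per-minute billing."""
    return PRICES_PER_HOUR[provider] / 60.0 * minutes * gpus


if __name__ == "__main__":
    for name in PRICES_PER_HOUR:
        print(f"{name:<13} {price_ratio(name):.1f}x the Spheron price")
    cost = per_minute_cost("Spheron", minutes=90, gpus=8)
    print(f"8 GPUs for 90 minutes on Spheron: ${cost:.2f}")
```

A ratio below 1.0 means the provider lists a lower hourly price than Spheron.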
Need More H100 Than What's Listed?
Reserved Capacity
Commit to a duration, lock in availability and better rates
Custom Clusters
8 to 512+ GPUs, specific hardware, InfiniBand configs on request
Supplier Matchmaking
Spheron sources from its certified data center network, negotiates pricing, handles setup
Need more H100 capacity? Tell us your requirements and we'll source it from our certified data center network.
Typical turnaround: 24–48 hours
When to pick the H100
Pick the H100 if
You are training or fine-tuning a 30B+ parameter model, serving a 70B model in production, or running multi-node distributed training that depends on InfiniBand and GPUDirect RDMA. The H100 is also the right call for FP8 inference workloads where the Transformer Engine pays back the price difference vs older silicon.
Pick the A100 instead if
Your model fits in 80GB and you do not need FP8. The A100 runs roughly half the price of the H100 and handles BERT-class training, sub-30B fine-tunes, and most production inference without breaking a sweat. Spend the savings on more GPUs.
Pick the H200 instead if
You are bottlenecked by memory, not compute. The H200 keeps the H100 architecture but raises memory to 141GB of HBM3e at 4.8 TB/s, about 1.76x the capacity of the H100. Long-context inference, KV-cache-heavy serving, and fitting 70B models without sharding are where the H200 wins.
Pick the B200 instead if
You are training trillion-parameter models or pushing the absolute frontier of throughput per node. The B200 has roughly 2.3x the FP8 dense TFLOPS of H100, 192GB HBM3e at 8 TB/s, and FP4 support. Real-world inference speedups vary by model and stack. The catch is availability and price. Most teams do not need it yet.
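Much of the choice above comes down to whether the model weights fit in GPU memory at a given precision. The following is a rough sizing sketch, not a capacity guarantee: it counts weights only, and the `headroom` fraction reserved for KV cache and activations is an assumed value you should tune for your workload. The function names are illustrative.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

# Memory capacities in GB, as quoted on this page.
GPU_MEMORY_GB = {"H100": 80, "H200": 141, "B200": 192}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits(params_billions: float, precision: str, gpu: str, headroom: float = 0.9) -> bool:
    """True if the weights fit within `headroom` of one GPU's memory.

    `headroom` is an assumed fraction; the remainder is left for KV cache,
    activations, and runtime overhead.
    """
    return weight_memory_gb(params_billions, precision) <= GPU_MEMORY_GB[gpu] * headroom


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        size = weight_memory_gb(70, precision)
        print(f"70B at {precision}: {size:.0f} GB of weights, "
              f"fits on one H100: {fits(70, precision, 'H100')}")
```

Weights are only part of the footprint. Long contexts and large batches grow the KV cache, which is why a model that fits on paper can still run out of memory in practice.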
Ideal use cases
LLM training and fine-tuning
Train transformer architectures from scratch or fine-tune frontier-class models with FP8 mixed precision. The H100's Transformer Engine cuts memory and time per step versus FP16.
Production LLM inference
Serve large models at low latency with vLLM, TensorRT-LLM, or SGLang. With FP8 quantization, the H100 handles 70B-class models at production traffic on a single card, without sharding overhead.
Diffusion and video generation
Run image and video diffusion pipelines that need high VRAM and bandwidth. The H100 chews through batch sizes that the RTX 4090 cannot touch.
HPC and scientific computing
FP64 throughput on the H100 is 34 TFLOPS, roughly 3.5 times the A100's 9.7 TFLOPS. That matters for simulation work where double precision is non-negotiable.
Performance benchmarks
Launch vLLM on an H100 in under 2 minutes
Spin up a Spheron H100 instance, pull the vLLM image, and serve Llama 3 70B with an OpenAI-compatible API. Drop the same client into your existing OpenAI SDK call and point the base URL at the new endpoint.
```bash
# 1. Provision an H100 from the Spheron CLI (or use the dashboard)
spheron deploy --gpu h100 --image vllm/vllm-openai:latest

# 2. Inside the instance, serve Llama 3 70B with FP8
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --quantization fp8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --port 8000

# 3. Hit the endpoint from any OpenAI-compatible client
curl http://<instance-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "messages": [{"role": "user", "content": "Explain FP8 quantization."}]
  }'
```

Need multi-GPU or multi-node? Add --tensor-parallel-size 2 for 2x H100, or contact us for InfiniBand-connected clusters.
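The same chat request can be sent from Python. This is a minimal sketch using only the standard library, assuming the server exposes the usual OpenAI-style `/v1/chat/completions` route and response shape. The helper names `build_chat_request` and `chat` are hypothetical, and the base URL is a placeholder you replace with your instance address.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first choice."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (replace the placeholder host with your instance address):
# reply = chat(
#     "http://<instance-ip>:8000/v1",
#     "meta-llama/Meta-Llama-3-70B-Instruct",
#     "Explain FP8 quantization.",
# )
```

If you already use the OpenAI SDK, the equivalent change is to point its base URL at `http://<instance-ip>:8000/v1` and keep the rest of your client code as it is.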
InfiniBand for multi-node H100 training
Every H100 node on Spheron ships with 400 Gb/s InfiniBand. Multi-node training jobs synchronize gradients over RDMA at microsecond-scale latency, so distributed training scales close to linearly with node count instead of stalling on the network.
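To get a feel for why link speed matters, here is a back-of-envelope estimate of gradient synchronization time. It is a sketch under stated assumptions: a ring all-reduce, one 400 Gb/s link per node, no latency term, and no overlap with compute. Real clusters may have several links per node and different collective algorithms, so treat the result as an order-of-magnitude figure. The function name and the `efficiency` parameter are illustrative.

```python
def ring_allreduce_seconds(
    payload_bytes: float,
    nodes: int,
    link_gbps: float = 400.0,
    efficiency: float = 1.0,
) -> float:
    """Bandwidth-only time estimate for a ring all-reduce across `nodes` nodes.

    In a ring all-reduce each node sends 2 * (N - 1) / N times the payload.
    `efficiency` is an assumed fraction of line rate actually achieved.
    """
    if nodes < 2:
        return 0.0  # nothing to synchronize on a single node
    bytes_per_second = link_gbps * 1e9 / 8 * efficiency
    traffic_per_node = 2 * (nodes - 1) / nodes * payload_bytes
    return traffic_per_node / bytes_per_second


if __name__ == "__main__":
    # Example: 70B parameters with 2-byte gradients, synchronized across 8 nodes.
    gradient_bytes = 70e9 * 2
    seconds = ring_allreduce_seconds(gradient_bytes, nodes=8)
    print(f"Estimated all-reduce time: {seconds:.1f} s per full gradient sync")
```

Because the per-node traffic approaches twice the payload as the node count grows, the estimate stays nearly flat when you add nodes, which is the property that lets well-overlapped training scale.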
Need a custom multi-node cluster or reserved capacity? Talk to us about topology, regions, and committed pricing.
H100 vs alternatives
H100 vs H200
The H200 offers the same compute with 1.76x the memory and 1.4x the bandwidth. If you keep hitting OOM on long context or large KV caches, the H200 fixes that.
H100 vs B200
B200 is the next generation: ~2.3x H100's FP8 dense TFLOPS, 192GB HBM3e at 8 TB/s, and FP4 support. Real-world inference speedups depend on model and stack maturity. Worth the jump if you can secure capacity.
H100 vs RTX 4090
The RTX 4090 has 24GB and no NVLink. Fine for sub-13B inference and hobbyist training, not for serving 70B models or any serious distributed work.
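Memory bandwidth matters because single-stream token generation is usually limited by how fast the weights can be read from memory. The sketch below computes a rough upper bound on decode speed as bandwidth divided by weight size. It assumes batch size 1 and ignores KV-cache reads, kernel overhead, and compute limits, so real throughput will be lower. The bandwidth figures are the ones quoted on this page; the function name is illustrative.

```python
# Memory bandwidth in GB/s, as quoted on this page.
BANDWIDTH_GB_PER_S = {"H100": 3350.0, "H200": 4800.0, "B200": 8000.0}


def decode_tokens_per_second_bound(gpu: str, weight_gb: float) -> float:
    """Upper bound on single-stream decode speed, assuming every token reads all weights once."""
    return BANDWIDTH_GB_PER_S[gpu] / weight_gb


if __name__ == "__main__":
    weight_gb = 70.0  # a 70B model at FP8, weights only
    for gpu in BANDWIDTH_GB_PER_S:
        bound = decode_tokens_per_second_bound(gpu, weight_gb)
        print(f"{gpu}: at most about {bound:.0f} tokens/s per stream")
    ratio = BANDWIDTH_GB_PER_S["H200"] / BANDWIDTH_GB_PER_S["H100"]
    print(f"H200 vs H100 bandwidth: {ratio:.1f}x")
```

Batching raises aggregate throughput well beyond this per-stream bound, since many requests share each pass over the weights.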
Related resources
NVIDIA H100 vs H200: benchmarks and when to upgrade
Side-by-side specs, inference throughput, and the workloads where the H200's extra memory pays for itself.
Running 10 concurrent fine-tuning jobs on bare-metal H100s
Architecture, scheduling, and cost breakdown for parallelizing fine-tunes across a bare-metal H100 cluster.
Building a sub-200ms RAG pipeline on bare-metal H100s
How a production RAG stack hits 2M queries per day with H100 inference and aggressive KV-cache reuse.
vLLM production deployment in 2026
Serving large models with vLLM on Spheron H100s: configuration, tuning, and the gotchas to avoid.
Best NVIDIA GPUs for LLMs
Decision framework for picking H100, H200, B200, A100, or RTX-class GPUs based on model size and budget.
GPU cost optimization playbook
Practical tactics to cut H100 spend: spot scheduling, FP8, batching, and right-sizing.