Spheron GPU Catalog

Rent NVIDIA GH200 GPUs on Demand from $1.88/hr

Grace Hopper Superchip with 96GB HBM3 + 432GB LPDDR5X unified memory over NVLink-C2C.

At a glance

You can rent an NVIDIA GH200 Grace Hopper Superchip on Spheron from $1.88 per GPU-hour on dedicated capacity (99.99% SLA, non-interruptible), with spot pricing cheaper still. Billing is per-minute with no long-term contracts, and instances deploy in under 2 minutes across data center partners in multiple regions. Each module ships with 96GB HBM3 on the Hopper GPU plus 432GB LPDDR5X on the Grace ARM CPU, connected by 900 GB/s NVLink-C2C. That gives you ~528GB of cache-coherent unified memory in a single socket, eliminating the PCIe bottleneck for inference workloads with large KV caches, graph workloads with billion-edge datasets, and genomics pipelines that spill beyond GPU VRAM.
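As a rough sizing sketch of why the unified pool matters for KV-heavy inference, the numbers below assume a Llama-3.1-70B-style geometry (80 layers, 8 grouped KV heads, head dim 128, FP8 weights and KV cache); these are illustrative assumptions, not Spheron-published figures:

```python
# Back-of-envelope KV-cache sizing on a GH200 (illustrative assumptions).
HBM_GB = 96          # Hopper GPU HBM3
LPDDR_GB = 432       # Grace CPU LPDDR5X, coherent over NVLink-C2C
WEIGHTS_GB = 70      # ~70B params at FP8 (1 byte/param)

# Assumed Llama-3.1-70B-style geometry with grouped-query attention.
layers, kv_heads, head_dim, kv_bytes = 80, 8, 128, 1   # FP8 KV cache
bytes_per_token = 2 * layers * kv_heads * head_dim * kv_bytes  # K and V

hbm_free_gb = HBM_GB - WEIGHTS_GB
unified_free_gb = HBM_GB + LPDDR_GB - WEIGHTS_GB

tokens_hbm = hbm_free_gb * 1024**3 // bytes_per_token
tokens_unified = unified_free_gb * 1024**3 // bytes_per_token

print(f"KV cache: {bytes_per_token / 1024:.0f} KiB/token")
print(f"Tokens in HBM alone:      {tokens_hbm:,}")
print(f"Tokens in unified memory: {tokens_unified:,}")
```

Under these assumptions the KV cache costs ~160 KiB per token, so the coherent LPDDR5X pool raises the token budget by more than an order of magnitude over HBM alone.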

GPU Architecture: NVIDIA Grace Hopper
VRAM: 96 GB HBM3
Memory Bandwidth: 4.0 TB/s

Technical specifications

GPU Architecture: NVIDIA Grace Hopper
VRAM: 96 GB HBM3
Memory Bandwidth: 4.0 TB/s
Tensor Cores: 4th Generation
CUDA Cores: 16,896
FP64 Performance: 34 TFLOPS
FP32 Performance: 67 TFLOPS
TF32 Performance: 989 TFLOPS
FP16 Performance: 1,979 TFLOPS
INT8 Performance: 3,958 TOPS
System RAM: 432 GB LPDDR5X
vCPUs: 64
Storage: 4,096 GB NVMe Gen4
Network: NVLink-C2C
TDP: 900W

Pricing comparison

Provider       Price/hr    Savings
Spheron        $1.88/hr    (your price)
Lambda Labs    $1.99/hr    1.1x more expensive
CoreWeave      $6.50/hr    3.5x more expensive
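Per-minute billing makes short jobs cheap to price out. A quick sketch using the rates above (plain arithmetic, not an official cost calculator):

```python
# Cost of a short GH200 job under per-minute billing, plus the
# price multiples behind the comparison table (illustrative arithmetic).
SPHERON = 1.88    # $/hr, dedicated
LAMBDA = 1.99     # $/hr
COREWEAVE = 6.50  # $/hr

def cost(rate_per_hr: float, minutes: int) -> float:
    """Per-minute billing: pay only for the minutes you use."""
    return rate_per_hr * minutes / 60

print(f"45-minute job on Spheron: ${cost(SPHERON, 45):.2f}")
print(f"Lambda Labs multiple: {LAMBDA / SPHERON:.2f}x")
print(f"CoreWeave multiple:   {COREWEAVE / SPHERON:.2f}x")
```

A 45-minute run costs $1.41 rather than a full billed hour; the table's multiples are these ratios rounded to one decimal.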
Custom & Reserved

Need More GH200 Than What's Listed?

Reserved Capacity

Commit to a duration, lock in availability and better rates

Custom Clusters

8 to 512+ GPUs, specific hardware, InfiniBand configs on request

Supplier Matchmaking

Spheron sources from its certified data center network, negotiates pricing, handles setup

Need more GH200 capacity? Tell us your requirements and we'll source it from our certified data center network.

Typical turnaround: 24–48 hours

When to pick the GH200

Scenario 01

Pick GH200 if

Your workload needs more memory than 96GB of HBM provides but doesn't justify B200/H200 rates, or your model's KV cache spills into system memory and you need coherent access to it. The GH200 is also the sweet spot for graph neural networks, genomics pipelines, and recommendation models with huge embedding tables.

Scenario 02

Pick H100 80GB instead if

Your model fits in 80GB HBM3 and you want maximum multi-GPU training throughput with NVLink + InfiniBand. H100 SXM5 is the standard for 8-way tensor parallelism and pre-training runs where CPU memory isn't in the critical path.

Scenario 03

Pick H200 141GB instead if

You need more GPU-side HBM than 96GB, but don't need the unified memory architecture. H200 gives you 141GB HBM3e at 4.8 TB/s, a cleaner fit for 70B+ inference without going ARM.

Scenario 04

Pick B200 192GB instead if

You need Blackwell FP4 Transformer Engine, 8 TB/s bandwidth, and the latest NVLink 5. B200 is the choice for 200B+ model training, and its dedicated HBM3e beats GH200's unified memory for bandwidth-bound workloads.


Ideal use cases

Use case / 01

AI Inference & Serving

Leverage the massive 432GB unified memory pool to serve large AI models with enormous KV caches, enabling high-throughput inference without CPU-GPU data transfer overhead.

LLM inference with massive KV cache
Multi-model serving
Real-time recommendation engines
Edge AI inference at scale
Use case / 02

Large Dataset Processing

Utilize the 432GB unified memory architecture to process datasets that don't fit in GPU VRAM alone, eliminating costly data transfers between CPU and GPU memory.

Genomics and bioinformatics pipelines
Financial risk modeling
Graph neural networks on large graphs
Geospatial analytics
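These pipelines share one pattern: the dataset lives outside device memory and is streamed through in chunks. A minimal sketch of that out-of-core pattern in plain NumPy (it runs anywhere; on a GH200 the coherent LPDDR5X pool is what makes the spilled portion fast to reach):

```python
import os
import tempfile
import numpy as np

# The dataset lives in a memory-mapped file, standing in for data
# larger than GPU VRAM, and is reduced chunk by chunk instead of
# being loaded wholesale.
path = os.path.join(tempfile.mkdtemp(), "features.dat")
n_rows, n_cols = 100_000, 64

# Write a synthetic dataset to disk.
data = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_rows, n_cols))
data[:] = np.random.default_rng(0).random((n_rows, n_cols), dtype=np.float32)
data.flush()

# Stream it back in chunks; each chunk is what you'd hand to the GPU.
view = np.memmap(path, dtype=np.float32, mode="r", shape=(n_rows, n_cols))
chunk = 8_192
col_sums = np.zeros(n_cols, dtype=np.float64)
for start in range(0, n_rows, chunk):
    col_sums += view[start:start + chunk].sum(axis=0, dtype=np.float64)

print("first column mean:", col_sums[0] / n_rows)
```

The chunked reduction produces the same result as loading everything at once, without ever holding the full array in working memory.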
Use case / 03

Scientific Computing & HPC

Combine the energy-efficient ARM Grace CPU with the powerful Hopper GPU for high-performance computing workloads.

Molecular dynamics simulations
Weather and climate simulation
Computational chemistry
Quantum computing simulation
Use case / 04

Edge AI & Autonomous Systems

Deploy the compact superchip form factor for edge AI applications requiring powerful inference in a single integrated module.

Autonomous vehicle inference
Robotics AI
Smart city analytics
Real-time video processing

Performance benchmarks

LLaMA 2 70B Inference: 1.6x faster vs H100 80GB (unified memory)
GPT-J 6B Inference: 14,500 tokens/s (FP16, batch 128)
ResNet-50 Inference: 42,000 img/sec (INT8 precision)
Genomics Processing: 2.1x faster vs CPU-only pipeline
Graph Neural Network: 1.8x faster vs H100 (large graph datasets)
Memory Bandwidth: 4.0 TB/s HBM3, plus 900 GB/s coherent CPU-GPU access over NVLink-C2C

Serve Llama 3.1 70B with a massive KV cache on GH200

The GH200's 96GB HBM3 holds Llama 3.1 70B at FP8 (~70GB), and the 432GB LPDDR5X CPU memory over NVLink-C2C lets you extend the effective working set far beyond what a pure HBM card can hold.

```bash
# SSH into your GH200 instance (ARM64 / aarch64)
ssh ubuntu@<instance-ip>

# Install vLLM for ARM with CUDA 12.4+
pip install vllm

# Launch Llama 3.1 70B with FP8, long context
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9 \
  --enforce-eager

# Sanity check
curl http://localhost:8000/v1/models
```

Most major ML frameworks (PyTorch, JAX, vLLM) have native ARM64 wheels. If you hit a package without an ARM build, NVIDIA's NGC containers cover the common cases.
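FP8 is what makes a 70B model fit in 96GB in the first place. The weight-footprint arithmetic (parameter count approximate; vLLM's real allocation also reserves activation and CUDA-graph memory on top of weights):

```python
# Approximate weight footprints for a ~70B-parameter model.
params = 70.6e9  # ~70.6B parameters (approximate)
hbm_budget_gib = 96 * 0.9  # --gpu-memory-utilization 0.9 on 96GB HBM

for name, bytes_per_param in [("FP16", 2), ("FP8", 1)]:
    gib = params * bytes_per_param / 1024**3
    fits = gib <= hbm_budget_gib
    print(f"{name}: {gib:.1f} GiB of weights, fits in budget: {fits}")
```

At FP16 the weights alone (~131 GiB) blow past the HBM budget; at FP8 (~66 GiB) they fit with room left for the KV cache.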

Interconnect fabric

NVLink-C2C Configuration

The GH200 Grace Hopper Superchip features NVLink-C2C (Chip-to-Chip) interconnect providing 900 GB/s bidirectional coherent bandwidth between the Grace CPU and Hopper GPU, eliminating the traditional PCIe bottleneck and enabling seamless unified memory access across the entire module.

01. 900 GB/s bidirectional NVLink-C2C bandwidth
02. Cache-coherent unified memory across CPU and GPU
03. 432 GB LPDDR5X CPU memory accessible by GPU at full bandwidth
04. Zero-copy data sharing between Grace CPU and Hopper GPU
05. Eliminates the PCIe Gen5 bottleneck entirely
06. Hardware-managed cache coherency protocol
07. Transparent memory migration between CPU and GPU
08. Optimized for workloads exceeding GPU VRAM capacity
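To put the 900 GB/s figure in context, compare the time to sweep the full 432GB CPU pool once over NVLink-C2C against a PCIe Gen5 x16 link (~64 GB/s per direction, theoretical). Rough arithmetic assuming link-rate transfers with no protocol overhead:

```python
# Time to stream 432 GB of CPU-resident data to the GPU, one direction.
data_gb = 432
links = [
    ("PCIe Gen5 x16", 64),   # GB/s per direction, theoretical
    ("NVLink-C2C", 450),     # GB/s per direction (900 GB/s bidirectional)
]

for name, bw in links:
    print(f"{name}: {data_gb / bw:.2f} s per full sweep")
```

About 6.75 s over PCIe versus under a second over NVLink-C2C, a roughly 7x gap before counting the coherency benefits that remove copies altogether.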
Scale

Need a custom multi-node cluster or reserved capacity?

Related resources

FAQ / 11

Frequently asked questions

Also consider