
NVIDIA GH200 Grace Hopper Superchip: Architecture and Performance Guide

Written by Spheron · Jan 25, 2026

NVIDIA's GH200 Grace Hopper Superchip is not just another GPU; it's a fundamentally different approach to accelerated computing. By fusing a 72-core ARM-based Grace CPU with an H100 Hopper GPU on a single module connected by NVLink Chip-to-Chip (C2C), the GH200 eliminates the PCIe bottleneck that has constrained CPU-GPU communication for over a decade.

The result is 576 GB of unified, coherent memory: 96 GB of HBM3 on the GPU side and 480 GB of LPDDR5X on the CPU side, accessible through a single address space. For workloads that are bottlenecked by data movement between CPU and GPU, this architecture delivers up to 36x faster data processing compared to traditional PCIe-connected systems.

This guide covers the GH200's full architecture, detailed specifications, real-world benchmark results, how it compares to the standard H100 SXM, cloud availability and pricing, and the workloads where it makes the most sense.

Architecture Deep Dive

The Superchip Concept

Traditional GPU servers connect CPUs and GPUs over PCIe Gen5, which provides roughly 128 GB/s of bidirectional bandwidth. This creates a fundamental bottleneck: the GPU can process data far faster than the CPU can feed it. Every time data needs to move between CPU memory and GPU memory, the application stalls.

The GH200 solves this by replacing PCIe with NVLink-C2C, a high-bandwidth, memory-coherent interconnect that delivers 900 GB/s of bidirectional bandwidth: 7x faster than PCIe Gen5. Just as importantly, NVLink-C2C consumes over 5x less power per byte transferred.

The CPU and GPU share a single per-process page table, meaning all CPU and GPU threads can access all system-allocated memory regardless of whether it physically resides on the CPU or GPU. Applications no longer need to explicitly copy data between CPU and GPU memory. The hardware handles it transparently.
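To put the bandwidth gap in concrete terms, here is a back-of-the-envelope transfer-time calculation using the peak figures quoted above (a sketch: real transfers also pay latency and protocol overhead, so achieved bandwidth is lower):

```python
# Back-of-the-envelope: time to move a 40 GB working set between
# CPU and GPU memory over PCIe Gen5 vs. NVLink-C2C.
# Bandwidth figures are the peak numbers quoted in the text.

PCIE_GEN5_GBPS = 128   # ~128 GB/s bidirectional, PCIe Gen5 x16
NVLINK_C2C_GBPS = 900  # 900 GB/s bidirectional, NVLink-C2C

def transfer_seconds(gigabytes: float, bandwidth_gbps: float) -> float:
    """Idealized transfer time: size divided by peak bandwidth."""
    return gigabytes / bandwidth_gbps

working_set_gb = 40  # e.g., a large activation or optimizer-state spill

pcie_s = transfer_seconds(working_set_gb, PCIE_GEN5_GBPS)
c2c_s = transfer_seconds(working_set_gb, NVLINK_C2C_GBPS)

print(f"PCIe Gen5:  {pcie_s:.3f} s")          # 0.312 s
print(f"NVLink-C2C: {c2c_s:.3f} s")           # 0.044 s
print(f"Speedup:    {pcie_s / c2c_s:.1f}x")   # 7.0x
```

The 7x ratio falls directly out of the two peak-bandwidth figures; for workloads that shuttle data every iteration, that difference compounds across the whole run.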

NVIDIA Grace CPU

The Grace CPU is a 72-core processor built on ARM's Neoverse V2 (Armv9) architecture, specifically designed for data center workloads. Unlike repurposed desktop or server CPUs, Grace was engineered from the ground up to pair with NVIDIA GPUs.

Key Grace CPU specifications:

  • 72 Neoverse V2 cores running up to 3.5 GHz
  • Up to 480 GB of LPDDR5X memory with ECC
  • 512 GB/s CPU memory bandwidth
  • Double the performance per watt compared to standard x86-64 server platforms
  • 53% more memory bandwidth at one-eighth the power per GB/s compared to eight-channel DDR5 designs
  • PCIe Gen5 x16 lanes for additional I/O
  • Full ARM server ecosystem compatibility (Linux, containers, Kubernetes)

The choice of LPDDR5X over traditional DDR5 is deliberate. LPDDR5X offers significantly better energy efficiency, critical for data center deployments where power and cooling are major cost drivers. The tradeoff is lower maximum capacity compared to DDR5, but at 480 GB, there's more than enough CPU-side memory for most AI and HPC workloads.

H100 Hopper GPU

The GPU half of the GH200 is a full H100 Hopper GPU: the same silicon found in the standalone H100 SXM, with all the same compute capabilities.

  • 16,896 CUDA cores
  • 528 fourth-generation Tensor Cores
  • 96 GB HBM3 (4 TB/s bandwidth) or 144 GB HBM3e variant
  • Transformer Engine with dynamic FP8/FP16 precision switching
  • Secure Multi-Instance GPU (MIG) partitioning: up to 7 isolated instances
  • Fourth-generation NVLink for multi-GPU connectivity (900 GB/s)
  • Second-generation MIG with confidential computing support

The Transformer Engine is particularly important for LLM workloads. It automatically selects between FP8 and FP16 precision on a per-layer basis during training and inference, delivering up to 9x faster AI training and up to 30x faster AI inference compared to the previous-generation A100.
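Beyond raw tensor-core throughput, FP8 matters because token generation on large models is often bound by how fast weights stream from HBM: halving bytes per parameter halves both the weight footprint and the bytes read per decoded token. A rough sketch of that bandwidth floor, using an illustrative 70B-parameter model (the parameter count and the one-full-weight-read-per-token assumption are simplifications, not measured GH200 results):

```python
# Bandwidth-bound lower bound on decode latency: each generated token
# requires (at least) one full read of the model weights from HBM.
# Illustrative 70B-parameter model; not a measured benchmark.

PARAMS = 70e9
HBM_BANDWIDTH_GBPS = 4000  # GH200 HBM3: 4 TB/s

for name, bytes_per_param in [("FP16", 2), ("FP8", 1)]:
    weights_gb = PARAMS * bytes_per_param / 1e9
    ms_per_token = weights_gb / HBM_BANDWIDTH_GBPS * 1000
    print(f"{name}: {weights_gb:.0f} GB weights, "
          f">= {ms_per_token:.1f} ms/token floor")
# FP16: 140 GB weights, >= 35.0 ms/token floor
# FP8:  70 GB weights, >= 17.5 ms/token floor
```

Real serving stacks batch many sequences per weight read, so achieved per-token latency is far better; the point is that FP8 doubles the ceiling the memory system allows.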

Full Specifications

| Specification | NVIDIA GH200 (HBM3) | NVIDIA GH200 (HBM3e) |
| --- | --- | --- |
| CPU | Grace (72 Neoverse V2 cores) | Grace (72 Neoverse V2 cores) |
| GPU | H100 Hopper | H100 Hopper |
| CPU Memory | 480 GB LPDDR5X | 480 GB LPDDR5X |
| GPU Memory | 96 GB HBM3 | 144 GB HBM3e |
| Total Unified Memory | 576 GB | 624 GB |
| CPU Memory Bandwidth | 512 GB/s | 512 GB/s |
| GPU Memory Bandwidth | 4 TB/s | 4.9 TB/s |
| CPU-GPU Interconnect | NVLink-C2C (900 GB/s) | NVLink-C2C (900 GB/s) |
| CUDA Cores | 16,896 | 16,896 |
| Tensor Cores | 528 (4th gen) | 528 (4th gen) |
| Peak FP32 | 67 TFLOPS | 67 TFLOPS |
| Peak FP16 / BF16 | 990 TFLOPS | 990 TFLOPS |
| Peak FP8 | 1,979 TFLOPS | 1,979 TFLOPS |
| Peak INT8 | 1,979 TOPS | 1,979 TOPS |
| TDP | ~1,000 W (combined) | ~1,000 W (combined) |
| MIG Support | Up to 7 instances | Up to 7 instances |
| Multi-GPU NVLink | 900 GB/s (4th gen) | 900 GB/s (4th gen) |
| PCIe | Gen5 x16 | Gen5 x16 |

GH200 vs H100 SXM: What's the Difference?

The GH200 uses the same H100 GPU silicon as the standalone H100 SXM. The key differences are in system architecture, not GPU compute:

Memory architecture: The H100 SXM in a standard DGX or HGX system connects to a separate x86 CPU (typically dual Intel Xeon or AMD EPYC) over PCIe Gen5. The GPU only has access to its own 80 GB HBM3. The GH200 gives the GPU transparent access to an additional 480 GB of CPU memory via NVLink-C2C, totaling 576 GB of unified addressable memory.

CPU-GPU bandwidth: The H100 SXM gets approximately 128 GB/s over PCIe Gen5. The GH200 gets 900 GB/s over NVLink-C2C, a 7x improvement that eliminates the data transfer bottleneck.

Data movement: With a standard H100 SXM, applications must explicitly manage data transfers between CPU and GPU memory (cudaMemcpy). The GH200's unified memory model eliminates this overhead, with the hardware handling page migration automatically.

Power efficiency: The GH200 consumes less total system power than an equivalent H100 SXM + x86 host CPU setup, thanks to the Grace CPU's LPDDR5X memory and ARM-based power efficiency.

Benchmark Performance

MLPerf Inference v4.1

In the industry-standard MLPerf Inference v4.1 benchmarks, the GH200 delivered outstanding results across every generative AI workload tested. Compared to the H100 SXM:

  • Up to 1.4x more inference performance per accelerator across the most demanding LLM benchmarks
  • Mixtral 8x7B: The GH200 achieved significantly higher throughput, leveraging its larger memory pool to run bigger batch sizes
  • Llama 2 70B: The GH200's unified memory enabled batch sizes that would overflow the H100 SXM's 80 GB HBM3
  • DLRMv2 (recommender systems): Up to double the batch sizes in Server scenarios and 50% greater batch sizes in Offline scenarios compared to H100

The performance uplift comes primarily from memory: the GH200's 96 GB HBM3 with 4 TB/s bandwidth allows larger batch sizes, and the NVLink-C2C interconnect lets the GPU overflow into the 480 GB of CPU memory without the PCIe bottleneck.
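The batch-size effect can be approximated with simple arithmetic: for a decoder model, KV-cache grows linearly with the number of concurrent sequences, so HBM capacity sets the ceiling. The sketch below uses illustrative, Llama-2-70B-like shape parameters (grouped-query attention) and an FP8-quantized weight footprint; these are assumptions for illustration, not MLPerf submission configurations:

```python
# Rough KV-cache sizing for a decoder LLM with grouped-query attention.
# Shape parameters loosely modeled on a Llama-2-70B-class config;
# treat them as assumptions, not published measurements.

def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """K and V caches across all layers for one sequence, in GB."""
    bytes_total = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
    return bytes_total / 1e9

per_seq_gb = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=4096)

hbm_gb = 96        # GH200 HBM3 capacity
weights_gb = 70    # e.g., a 70B model quantized to FP8 (1 byte/param)
headroom = hbm_gb - weights_gb

max_batch = int(headroom // per_seq_gb)
print(f"KV-cache per 4k-token sequence: {per_seq_gb:.2f} GB")  # 1.34 GB
print(f"Max concurrent sequences in HBM: {max_batch}")         # 19
```

Repeating the same arithmetic with the H100 SXM's 80 GB shows why the extra 16 GB of HBM (plus the CPU-memory overflow path) translates directly into larger batches.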

LLM Inference Throughput

Lambda AI's benchmarks demonstrated that a single GH200 instance delivers 7.6x better throughput compared to a single H100 SXM instance for Llama 3.1 70B inference. This dramatic improvement comes from the GH200's ability to keep the entire model plus KV-cache in unified memory, avoiding the memory capacity wall that forces the H100 into tensor parallelism across multiple GPUs.

For teams serving large language models in production, this means a single GH200 can often replace a multi-GPU H100 setup, reducing system complexity, inter-GPU communication overhead, and total cost.
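The capacity argument behind that result reduces to simple arithmetic (a sketch assuming 2 bytes per FP16 parameter and ignoring KV-cache and runtime overhead, which make the H100 case worse in practice):

```python
import math

def fits(model_gb: float, capacity_gb: float) -> bool:
    return model_gb <= capacity_gb

llama70b_fp16_gb = 70 * 2  # 70B params x 2 bytes = 140 GB of weights

h100_hbm_gb = 80
gh200_unified_gb = 576

print(fits(llama70b_fp16_gb, h100_hbm_gb))       # False: must shard
print(fits(llama70b_fp16_gb, gh200_unified_gb))  # True: one superchip

h100s_needed = math.ceil(llama70b_fp16_gb / h100_hbm_gb)
print(h100s_needed)  # 2 (more in practice once KV-cache is included)
```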

HPC Benchmarks

On the CPU side, the Grace processor achieved 41.7 GFLOPS in HPCG, a benchmark dominated by memory bandwidth, demonstrating competitive performance against x86 server processors while consuming significantly less power. In the NWChem computational chemistry benchmark, a GH200 workstation finished second overall at 1,403.5 seconds.

Multi-GPU Configurations

GH200 NVL2

The GH200 NVL2 connects two GH200 Superchips with NVLink in a single server form factor, delivering:

  • Up to 288 GB of combined HBM3e GPU memory (2 × 144 GB)
  • 10 TB/s aggregate memory bandwidth
  • 1.2 TB of total fast memory (GPU + CPU)
  • 3.5x more GPU memory capacity and 3x more bandwidth than a single H100 SXM

This configuration is designed for serving the largest open-weight models. A single NVL2 server can host models with 200B+ parameters entirely in GPU memory without sharding across nodes.
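A quick capacity check shows why (a sketch that counts only weight memory; the precision choices are assumptions, and KV-cache reduces the usable headroom):

```python
# Can an NVL2 pair (288 GB combined HBM) host a 200B-parameter model?
# Counts weight memory only; KV-cache and activations are ignored.

NVL2_HBM_GB = 288

for name, bytes_per_param in [("FP8", 1), ("FP16", 2)]:
    weights_gb = 200e9 * bytes_per_param / 1e9
    verdict = "fits" if weights_gb <= NVL2_HBM_GB else "does not fit"
    print(f"200B @ {name}: {weights_gb:.0f} GB -> {verdict}")
# 200B @ FP8: 200 GB -> fits
# 200B @ FP16: 400 GB -> does not fit
```

So "200B+ in GPU memory" implicitly assumes reduced precision; at FP16 the same model would need the CPU-memory overflow path or a larger NVLink domain.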

DGX GH200

The DGX GH200 scales to 256 GH200 Superchips connected via NVLink Switch, creating a single massive GPU cluster with:

  • Up to 144 TB of shared GPU memory
  • NVLink Switch providing all-to-all GPU communication at full bandwidth
  • Designed for the largest training runs and enterprise AI workloads

The DGX GH200 is targeted at organizations training frontier-scale models or running massive recommendation system inference.

Cloud Availability and Pricing

The GH200 is broadly available across cloud providers as of 2025. Typical rental pricing ranges from approximately $1.50 to $6.50 per GPU-hour, depending on provider, commitment length, and region.

Major providers offering GH200 instances include Oracle Cloud (OCI), Lambda, Hyperstack, CoreWeave, and several specialized GPU cloud providers. Availability has steadily improved since the GH200's initial release in mid-2023, and lead times are significantly shorter than for H200 GPUs.

For teams evaluating GH200 vs H100 cloud instances, the key consideration is workload fit. If your model benefits from unified memory (large models, long contexts, heavy CPU-GPU data exchange), the GH200 offers better price-performance even at a slightly higher per-hour rate. If your workload is purely GPU-compute-bound with minimal data movement, the standard H100 SXM may offer equivalent performance at lower cost.
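One way to frame that decision is total fleet cost rather than per-GPU-hour rate. The sketch below uses hypothetical rates drawn from the ranges above and a hypothetical scenario where one GH200 replaces a 4-GPU H100 serving setup; both numbers are illustrative assumptions, not quotes:

```python
# Monthly cost comparison for a memory-bound serving job.
# Rates are illustrative values within the ranges cited in the text.

def monthly_cost(gpus: int, rate_per_hour: float, hours: int = 730) -> float:
    return gpus * rate_per_hour * hours

gh200 = monthly_cost(gpus=1, rate_per_hour=3.50)   # single GH200
h100x4 = monthly_cost(gpus=4, rate_per_hour=2.50)  # 4x H100 SXM

print(f"1x GH200: ${gh200:,.0f}/mo")   # $2,555/mo
print(f"4x H100:  ${h100x4:,.0f}/mo")  # $7,300/mo
```

The per-hour premium disappears whenever unified memory lets one superchip do the work of several PCIe-attached GPUs; for compute-bound jobs that comparison flips.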

Ideal Workloads for GH200

The GH200's unified memory architecture creates distinct advantages for specific workload categories:

Large Language Model Inference

The GH200 is particularly strong for serving LLMs with 70B+ parameters. Its 576 GB unified memory allows hosting models that would require multi-GPU setups on H100, and the NVLink-C2C bandwidth ensures the GPU can access CPU-side memory without significant performance degradation. For production LLM serving, this translates to lower latency, higher throughput, and simpler deployment.

Recommender Systems

Modern recommendation models like DLRM operate on massive embedding tables that can exceed hundreds of gigabytes. On traditional GPU servers, these tables must be split between CPU and GPU memory with expensive PCIe transfers. The GH200's unified memory and 900 GB/s NVLink-C2C bandwidth make it ideal for these memory-hungry workloads, enabling the GPU to process embedding lookups from CPU memory at near-native speed.
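To see why these workloads overflow GPU memory, it helps to size a few embedding tables. The table shapes below are illustrative assumptions, not a published model configuration:

```python
# Embedding-table footprint for a DLRM-style recommender.
# Table shapes are illustrative -- not a real production config.

def table_gb(rows: int, dim: int, bytes_per_elem: int = 4) -> float:
    """Memory for one FP32 embedding table, in GB."""
    return rows * dim * bytes_per_elem / 1e9

# A few large categorical features (e.g., user ID, item ID, context).
tables = [
    (1_000_000_000, 64),   # 1B rows, 64-dim embeddings
    (200_000_000, 128),
    (100_000_000, 64),
]

total_gb = sum(table_gb(rows, dim) for rows, dim in tables)
print(f"Total embedding memory: {total_gb:.0f} GB")  # 384 GB

print(total_gb <= 96)   # False: overflows GH200 HBM alone
print(total_gb <= 576)  # True: fits GH200 unified memory
```

On a PCIe system this working set would be split across host memory with every miss paying the 128 GB/s toll; on the GH200 the spilled tables remain reachable at 900 GB/s.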

Retrieval-Augmented Generation (RAG)

RAG pipelines combine vector database lookups with LLM inference, requiring frequent data movement between CPU-side retrieval and GPU-side generation. The GH200 eliminates the PCIe bottleneck in this pipeline, allowing seamless data flow between the retrieval and generation stages.

Graph Neural Networks

GNNs operate on large, irregular graph structures that don't fit neatly into GPU memory. The GH200's unified memory model allows GNN frameworks to work with graphs that span CPU and GPU memory without explicit data management, significantly simplifying development and improving performance.

Scientific Computing and HPC

For HPC workloads that combine CPU-intensive preprocessing with GPU-accelerated computation (molecular dynamics, climate modeling, computational fluid dynamics), the Grace CPU's strong single-thread performance and the NVLink-C2C bandwidth create a well-balanced system that avoids the common bottleneck of underutilized GPUs waiting for CPU data.

What's Next: Grace Blackwell GB200

NVIDIA has announced the GH200's successor: the GB200 Grace Blackwell Superchip. The GB200 pairs the same Grace CPU with two Blackwell B200 GPUs, delivering:

  • 2.2x simulation performance, 1.8x training performance, and 1.8x inference performance compared to the GH200
  • Fifth-generation Tensor Cores with native FP4/FP6 support
  • Up to 384 GB of HBM3e across two B200 GPUs
  • The GB200 NVL72 rack-scale system connects 72 GPUs with NVLink, delivering up to 30x the inference performance of an equivalent number of H100 GPUs

The Grace Blackwell platform represents the natural evolution of the superchip concept, and organizations investing in GH200 infrastructure today can expect a clear upgrade path as GB200 systems become available.

Deploy GPU Infrastructure on Spheron

Whether you need GH200 for memory-intensive inference, H100 and H200 for training, or A100 GPUs for cost-effective compute, Spheron provides bare-metal GPU access with transparent pricing, instant provisioning, and no long-term contracts.

Scale your AI infrastructure without the overhead of managing physical hardware. Deploy the right GPU for your workload with flexible hourly billing.

Explore GPU options on Spheron →

Frequently Asked Questions

Is the GH200 faster than the H100 SXM?

The GH200 contains the same H100 GPU silicon, so raw GPU compute performance is identical. However, the GH200 delivers up to 1.4x better inference throughput in MLPerf benchmarks and up to 7.6x better throughput for large LLM inference. The improvement comes from the unified memory architecture and NVLink-C2C bandwidth, which eliminate data transfer bottlenecks and enable larger batch sizes.

How much memory does the GH200 have?

The GH200 provides 576 GB of total unified memory in the HBM3 variant (96 GB HBM3 GPU memory + 480 GB LPDDR5X CPU memory) or 624 GB in the HBM3e variant (144 GB HBM3e + 480 GB LPDDR5X). All memory is accessible by both CPU and GPU through NVLink-C2C.

Can I run Llama 3.1 70B on a single GH200?

Yes. The GH200's unified memory allows a single GPU to host Llama 3.1 70B with the full KV-cache, which would typically require multiple H100 GPUs. Lambda AI's benchmarks showed 7.6x better throughput on GH200 compared to a single H100 SXM for this specific model.

How does GH200 pricing compare to H100?

GH200 cloud instances typically range from $1.50 to $6.50 per GPU-hour, while H100 SXM instances range from $2.00 to $4.00 per hour. The GH200 may cost slightly more per hour, but for memory-bound workloads, fewer GH200 GPUs can replace multiple H100s, resulting in lower total cost.

What replaced the GH200?

The GB200 Grace Blackwell Superchip is the GH200's successor, pairing the Grace CPU with two Blackwell B200 GPUs. It delivers 1.8x the training and inference performance of the GH200, with fifth-generation Tensor Cores and support for FP4/FP6 precision.

Is the GH200 good for training or just inference?

The GH200 performs well for both, but its unified memory architecture provides the largest advantage for inference and memory-bound workloads. For pure training throughput on standard model sizes, the H100 SXM in DGX/HGX configurations with optimized NVLink topology may offer similar or better multi-GPU scaling. The GH200's training advantage appears most when working with very large models or datasets that benefit from the expanded memory pool.

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.