NVIDIA H100 vs H200: Benchmarks, Specs, and Performance Comparison for AI Inference

Written by Spheron, Jan 4, 2026

The NVIDIA H100 set the performance standard for AI training and inference when it launched in 2022. Two years later, the H200 arrived with the same Hopper architecture but one critical upgrade: 141 GB of HBM3e memory delivering 4.8 TB/s of bandwidth. This is nearly double the H100's capacity and 1.4x its bandwidth.

That memory upgrade translates to up to 45% faster inference on large language models, 1.9x higher throughput on Llama 70B, and the ability to serve models that simply don't fit on an H100 without tensor parallelism.

But both GPUs share identical compute silicon. Same CUDA cores, same Tensor Cores, same FP8 performance. So when does the H200's memory advantage justify its premium? And when is the H100 the smarter deployment choice?

This guide compares the H100 and H200 across architecture, full specifications, MLPerf benchmarks, inference throughput, cloud pricing, and workload fit, with concrete numbers to help you decide.

Architecture: Same Hopper Silicon, Different Memory

Both the H100 and H200 are built on NVIDIA's Hopper architecture (GH100 die). They share the same compute engine: same streaming multiprocessors (SMs), same fourth-generation Tensor Cores, same Transformer Engine. The difference is entirely in the memory subsystem.

Hopper Compute Features (Shared)

Fourth-Generation Tensor Cores: Both GPUs deliver 3,958 TFLOPS in FP8 and INT8. The Tensor Cores support FP64, TF32, BF16, FP16, FP8, and INT8, covering every precision format used in modern AI training and inference.

Transformer Engine: The Hopper-exclusive Transformer Engine dynamically switches between FP8 and FP16 precision within each layer of a transformer model. This automatic mixed-precision approach delivers up to 9x faster training and 30x faster inference compared to the A100, with minimal accuracy loss.

Multi-Instance GPU (MIG): Both GPUs support MIG partitioning into up to 7 isolated instances. The H200's MIG instances are larger (16.5 GB each versus 10 GB on the H100) because of the expanded memory pool.

NVLink and NVSwitch: Both connect via fourth-generation NVLink at 900 GB/s bidirectional bandwidth and support NVSwitch for all-to-all GPU communication in multi-GPU configurations.

Tensor Memory Accelerator (TMA): Both feature the TMA for asynchronous bulk data transfers, reducing the CUDA threads needed for memory management and freeing compute capacity for actual workloads.

Where They Differ: HBM3 vs HBM3e

The H100 SXM ships with 80 GB of HBM3 memory at 3.35 TB/s bandwidth. The H200 SXM ships with 141 GB of HBM3e at 4.8 TB/s: a 76% increase in capacity and 43% increase in bandwidth.

This matters because modern LLM inference is memory-bound, not compute-bound. The GPU's Tensor Cores can process tokens faster than the memory subsystem can feed them data. By widening the memory bottleneck, the H200 unlocks throughput that the H100's compute engine is already capable of but cannot achieve due to memory constraints.
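A quick roofline-style check makes the memory-bound claim concrete. The sketch below compares the arithmetic intensity of single-stream FP16 decode (a common simplification of roughly 2 FLOPs per weight read, not a measured figure) against the H100's compute-to-bandwidth ratio, using the spec-sheet peak numbers:

```python
# Back-of-envelope roofline check: is single-stream FP16 decode
# memory-bound on an H100? Spec-sheet peaks; the ~2 FLOPs-per-parameter
# decode estimate is a common simplification, not a measured figure.

H100_FP16_TFLOPS = 1979        # FP16 Tensor Core peak (sparsity figure)
H100_BANDWIDTH_TBS = 3.35      # HBM3 bandwidth, TB/s

# Machine balance: FLOPs the GPU can execute per byte it can read.
machine_balance = (H100_FP16_TFLOPS * 1e12) / (H100_BANDWIDTH_TBS * 1e12)

# Decode at batch size 1: roughly 2 FLOPs per weight (multiply + add)
# while reading 2 bytes per weight (FP16).
decode_intensity = 2 / 2   # ~1 FLOP per byte

print(f"machine balance ~= {machine_balance:.0f} FLOPs/byte")
print(f"decode intensity ~= {decode_intensity:.0f} FLOP/byte")
print("memory-bound" if decode_intensity < machine_balance else "compute-bound")
```

Decode sits hundreds of times below the machine balance point, which is why extra bandwidth, not extra compute, moves the needle.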

Full Specifications Comparison

| Specification | H100 SXM | H100 PCIe | H200 SXM | H200 NVL |
|---|---|---|---|---|
| Architecture | Hopper (GH100) | Hopper (GH100) | Hopper (GH100) | Hopper (GH100) |
| Process Node | TSMC 4N | TSMC 4N | TSMC 4N | TSMC 4N |
| Transistors | 80B | 80B | 80B | 80B |
| CUDA Cores | 16,896 | 14,592 | 16,896 | 16,896 |
| Tensor Cores | 528 (4th Gen) | 456 (4th Gen) | 528 (4th Gen) | 528 (4th Gen) |
| VRAM | 80 GB HBM3 | 80 GB HBM2e | 141 GB HBM3e | 141 GB HBM3e |
| Memory Bandwidth | 3,350 GB/s | 2,000 GB/s | 4,800 GB/s | 4,800 GB/s |
| L2 Cache | 50 MB | 50 MB | 50 MB | 50 MB |
| FP64 (TFLOPS) | 34 | 26 | 34 | 34 |
| FP32 (TFLOPS) | 67 | 51 | 67 | 67 |
| TF32 Tensor (TFLOPS) | 989 | 756 | 989 | 989 |
| BF16 Tensor (TFLOPS) | 1,979 | 1,513 | 1,979 | 1,979 |
| FP16 Tensor (TFLOPS) | 1,979 | 1,513 | 1,979 | 1,979 |
| FP8 Tensor (TFLOPS) | 3,958 | 3,026 | 3,958 | 3,958 |
| INT8 Tensor (TOPS) | 3,958 | 3,026 | 3,958 | 3,958 |
| NVLink Bandwidth | 900 GB/s | N/A | 900 GB/s | 900 GB/s |
| PCIe | Gen5 (128 GB/s) | Gen5 (128 GB/s) | Gen5 (128 GB/s) | Gen5 (128 GB/s) |
| MIG Instances | 7x 10 GB | 7x 10 GB | 7x 16.5 GB | 7x 16.5 GB |
| TDP | 700W | 350W | 700W | 600W |

The key takeaway: every compute metric is identical between the H100 SXM and H200 SXM. Same 3,958 TFLOPS FP8, same 1,979 TFLOPS BF16, same 34 TFLOPS FP64. The H200's advantage is purely in memory capacity (141 vs 80 GB) and bandwidth (4,800 vs 3,350 GB/s).

MLPerf Inference Benchmarks

MLPerf is the industry-standard benchmark suite for measuring AI hardware performance. The H200 has been tested across multiple MLPerf rounds, consistently showing significant inference throughput improvements over the H100.

Llama 2 70B Inference (MLPerf v4.0)

| Scenario | H100 SXM (tok/s) | H200 SXM (tok/s) | H200 Improvement |
|---|---|---|---|
| Offline | 22,290 | 31,712 | +42.3% |
| Server | 21,504 | 29,526 | +37.3% |

The offline scenario measures maximum throughput with no latency constraints. The GPU processes as many tokens as possible. The server scenario adds realistic latency requirements that mirror production serving. In both cases, the H200 delivers a 37-42% throughput increase over the H100 on the same 70B parameter model.

Model-Specific Throughput Comparisons

| Workload | H100 SXM | H200 SXM | H200 Advantage |
|---|---|---|---|
| Llama 2 70B (offline) | 22,290 tok/s | 31,712 tok/s | 1.42x |
| Llama 2 70B (max throughput) | Baseline | 1.9x | 1.9x |
| GPT-3 175B (online, 8 GPU) | Baseline | 1.6x | 1.6x |
| Llama 2 13B (single GPU) | ~8,000 tok/s | ~12,000 tok/s | 1.5x |
| Llama 3.1 405B (8 GPU, pipeline) | Baseline | 1.5x | 1.5x |

The pattern is clear: the larger the model, the greater the H200's advantage. For Llama 2 70B at maximum throughput, the H200 delivers 1.9x the H100's performance. For the massive Llama 3.1 405B on an 8-GPU HGX system, the H200 still maintains a 1.5x edge.

This happens because larger models are more memory-bound. The model weights consume more of the available VRAM, leaving less room for KV cache and activations. The H200's 141 GB and 4.8 TB/s bandwidth directly alleviates this bottleneck.

MLPerf v4.1 Software Improvements

Between MLPerf v4.0 and v4.1, NVIDIA achieved an additional 27% performance improvement on the H200 through software optimizations alone: the same hardware, running improved TensorRT-LLM kernels. This demonstrates that the H200's memory headroom enables software optimizations that are impossible on the more constrained H100.

Why Memory Bandwidth Matters More Than Compute

For LLM inference, the performance bottleneck is almost always memory bandwidth, not compute throughput. Here's why.

The Arithmetic Intensity Problem

During the autoregressive token generation phase (the decode step), each new token requires reading the entire model weight tensor from VRAM. For a 70B parameter model in FP16, that's 140 GB of data read per token. The compute required is relatively small (a few matrix-vector multiplications), but the data movement is massive.

| GPU | Memory Bandwidth | Time to Read 140 GB | Theoretical Decode Limit |
|---|---|---|---|
| H100 SXM | 3,350 GB/s | 41.8 ms | ~24 tok/s (single stream) |
| H200 SXM | 4,800 GB/s | 29.2 ms | ~34 tok/s (single stream) |

The H200's 43% bandwidth advantage translates directly into 43% faster per-stream decode latency for bandwidth-bound models. With batching, the advantage compounds because the H200's larger VRAM holds more concurrent KV caches.
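The theoretical decode limits above follow from a one-line calculation: divide the weight footprint by the memory bandwidth. A minimal sketch, assuming every token must stream all 140 GB of FP16 weights from VRAM:

```python
# Theoretical single-stream decode rate for a bandwidth-bound model,
# assuming each token requires reading all weights from VRAM.

WEIGHTS_GB = 140  # Llama 2 70B in FP16

def decode_limit(bandwidth_gbs: float) -> tuple[float, float]:
    """Return (ms per token, tokens per second) for bandwidth-bound decode."""
    seconds_per_token = WEIGHTS_GB / bandwidth_gbs
    return seconds_per_token * 1000, 1 / seconds_per_token

for name, bw in [("H100 SXM", 3350), ("H200 SXM", 4800)]:
    ms, tps = decode_limit(bw)
    print(f"{name}: {ms:.1f} ms/token, ~{tps:.0f} tok/s")
```

This ignores KV-cache reads and kernel overhead, so real single-stream numbers land somewhat below these ceilings.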

KV Cache Capacity

The KV (key-value) cache stores attention state for each active sequence. For Llama 2 70B at 4K context length, each sequence's KV cache consumes approximately 2.5 GB.

| GPU | Total VRAM | Model Weights (FP16) | Available for KV Cache | Max Concurrent Sequences |
|---|---|---|---|---|
| H100 SXM | 80 GB | ~140 GB (TP2) | ~5 GB per GPU | ~2 per GPU |
| H200 SXM | 141 GB | ~140 GB (TP2) | ~36 GB per GPU | ~14 per GPU |

With tensor parallelism across 2 GPUs, the H200 can serve roughly 7x more concurrent sequences than the H100 for Llama 2 70B, dramatically improving throughput for batch inference workloads.
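The concurrency figures reduce to simple division: free VRAM after weights over per-sequence KV-cache size. A sketch using the article's ~2.5 GB per-sequence figure (Llama 2 70B at 4K context):

```python
# How many concurrent sequences fit in the VRAM left after weights?
# Uses the article's ~2.5 GB KV-cache-per-sequence estimate
# (Llama 2 70B, 4K context); real values vary with context length.

KV_PER_SEQ_GB = 2.5

def max_concurrent(free_vram_gb: float) -> int:
    return int(free_vram_gb // KV_PER_SEQ_GB)

print("H100 (~5 GB free): ", max_concurrent(5))   # 2 sequences
print("H200 (~36 GB free):", max_concurrent(36))  # 14 sequences
```

Because throughput from batching scales with concurrent sequences, this 7x capacity gap is where much of the H200's real-world inference advantage comes from.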

Training Performance

Since both GPUs share identical compute silicon, training performance on compute-bound workloads is nearly identical. The H200's advantage appears in memory-bound training scenarios.

Where the H200 Matches the H100

For small-to-medium models where the model weights, optimizer states, and activations fit comfortably in 80 GB, training throughput is effectively the same. ResNet-50, BERT-Large, GPT-2, and similar models see no meaningful difference.

Where the H200 Pulls Ahead

Large model training without tensor parallelism: A 13B parameter model in FP16 requires approximately 26 GB for weights plus approximately 52 GB for Adam optimizer states; 78 GB total, which barely fits on an H100. The H200's 141 GB accommodates this with room to spare, enabling larger batch sizes and eliminating the need for gradient checkpointing.

Long-context training: Training with 32K+ context lengths balloons the activation memory. The H200's extra 61 GB of VRAM means you can train with longer sequences before needing activation recomputation or model parallelism.

Mixed workloads: For teams that train and serve from the same GPU pool, the H200's memory headroom eliminates the need to carefully partition workloads by VRAM consumption.
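The 13B example above can be sketched numerically. This uses the article's accounting (2 bytes/param for FP16 weights, 4 bytes/param for Adam optimizer state) and deliberately omits gradients and activations, so real usage is higher:

```python
# Training-footprint arithmetic for the 13B example, using the
# article's simplified accounting: 2 bytes/param for FP16 weights,
# 4 bytes/param for Adam optimizer state. Gradients and activations
# are excluded, so actual memory use is higher.

def training_footprint_gb(params_b: float) -> float:
    weights = params_b * 2   # FP16 weights, GB
    adam = params_b * 4      # optimizer states, GB
    return weights + adam

print(f"13B model: {training_footprint_gb(13):.0f} GB")  # 78 GB
```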

Model Capacity Comparison

Which models fit on a single GPU without parallelism?

| Model | Parameters | VRAM (FP16) | VRAM (INT8/INT4) | H100 80GB | H200 141GB |
|---|---|---|---|---|---|
| Llama 3.1 8B | 8B | 16 GB | 8 GB | Yes | Yes |
| Mistral 7B | 7B | 14 GB | 7 GB | Yes | Yes |
| Llama 2 13B | 13B | 26 GB | 13 GB | Yes | Yes |
| Llama 2 70B (FP16) | 70B | 140 GB | 70 GB | No | Yes (tight) |
| Llama 2 70B (INT8) | 70B | N/A | 70 GB | Yes (tight) | Yes |
| Mixtral 8x7B | 47B | 94 GB | 47 GB | No | Yes |
| Llama 3.1 70B (FP16) | 70B | 140 GB | 70 GB | No | Yes (tight) |
| Falcon 180B (INT4) | 180B | N/A | 90 GB | No | Yes |
| Llama 3.1 405B | 405B | 810 GB | 405 GB | No (8 GPU) | No (4-6 GPU) |

The H200's defining advantage is that it can serve Llama 2 70B and Mixtral 8x7B on a single GPU in FP16, workloads that require 2 H100s with tensor parallelism. This halves the GPU count and eliminates inter-GPU communication overhead, directly reducing both cost and latency.
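The fit checks in the table follow from bytes-per-parameter arithmetic. A minimal sketch, where "fits" means weights alone (real serving also needs headroom for KV cache and activations):

```python
# Does a model's weight footprint fit on one GPU?
# FP16 ~2 bytes/param, INT8 ~1 byte/param. "Fits" here means
# weights only; production serving needs extra headroom.

def weights_gb(params_b: float, bytes_per_param: float) -> float:
    return params_b * bytes_per_param

def fits(params_b: float, bytes_per_param: float, vram_gb: float) -> bool:
    return weights_gb(params_b, bytes_per_param) <= vram_gb

print("Llama 2 70B FP16 on H100 80GB: ", fits(70, 2, 80))   # False
print("Llama 2 70B FP16 on H200 141GB:", fits(70, 2, 141))  # True
print("Mixtral 8x7B FP16 on H200:     ", fits(47, 2, 141))  # True
```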

MIG: Larger Partitions on the H200

Both GPUs support Multi-Instance GPU partitioning into up to 7 isolated instances. The H200's larger memory pool means each MIG instance gets 16.5 GB instead of 10 GB.

| MIG Profile | H100 Memory per Instance | H200 Memory per Instance |
|---|---|---|
| 1g (1/7 GPU) | 10 GB | 16.5 GB |
| 2g (2/7 GPU) | 20 GB | 33 GB |
| 3g (3/7 GPU) | 40 GB | ~50 GB |
| 7g (full GPU) | 80 GB | 141 GB |

The H200's 16.5 GB MIG instances can serve 7B–13B parameter models in INT8, while the H100's 10 GB instances are limited to models under 5B parameters. For multi-tenant inference platforms, this means the H200 can serve larger models per partition without compromising isolation.

Cloud Pricing Comparison

The H200 commands a premium over the H100, but the price-per-token is often lower due to the throughput advantage.

| GPU | Typical Cloud Price | VRAM | Inference Throughput (Llama 70B) | Price per 1M Tokens |
|---|---|---|---|---|
| H100 SXM (hyperscaler) | $3.00 to $6.98/hr | 80 GB | 22,290 tok/s | ~$0.037 to $0.087 |
| H100 SXM (GPU cloud) | $1.49 to $2.99/hr | 80 GB | 22,290 tok/s | ~$0.019 to $0.037 |
| H200 SXM (hyperscaler) | $4.50 to $10.60/hr | 141 GB | 31,712 tok/s | ~$0.039 to $0.093 |
| H200 SXM (GPU cloud) | $2.50 to $4.31/hr | 141 GB | 31,712 tok/s | ~$0.022 to $0.038 |

On specialized GPU cloud providers like Spheron, the price-per-token is comparable between H100 and H200. The H200's higher hourly rate is offset by its higher throughput, making the cost-per-inference roughly equivalent. However, the H200 serves more users from fewer GPUs, reducing operational complexity.
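The price-per-token figures above come from straightforward unit conversion: hourly rate divided by millions of tokens produced per hour. A sketch using the MLPerf offline throughput numbers:

```python
# Price-per-1M-tokens arithmetic: hourly GPU rate divided by
# millions of tokens generated per hour at a given throughput.

def price_per_1m_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / (tokens_per_hour / 1e6)

# H100 at a $1.49/hr GPU-cloud rate, MLPerf offline throughput:
print(f"H100: ${price_per_1m_tokens(1.49, 22290):.3f} per 1M tokens")  # ~$0.019
# H200 at $2.50/hr:
print(f"H200: ${price_per_1m_tokens(2.50, 31712):.3f} per 1M tokens")  # ~$0.022
```

Run the same function with your actual negotiated rate and measured throughput; the ranking between GPUs can flip once utilization drops below 100%.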

TCO Analysis: Serving Llama 2 70B

Consider serving Llama 2 70B at 10,000 requests per hour (average 500 tokens per request):

| Configuration | GPUs Needed | Cost per Hour | Monthly Cost |
|---|---|---|---|
| H100 SXM (GPU cloud) | 2 (TP2 for FP16) | $4.00–$6.00 | $2,880–$4,320 |
| H200 SXM (GPU cloud) | 1 (single GPU) | $2.50–$4.31 | $1,800–$3,103 |

The H200 can serve Llama 70B on a single GPU (the full 140 GB of FP16 weights fit in 141 GB), while the H100 requires 2-GPU tensor parallelism. This single-GPU advantage cuts the H200's TCO by 25-40% for this specific workload.

Workload Recommendations

Choose the H200 When

Serving 70B+ models: The H200's 141 GB fits Llama 70B in FP16 on a single GPU. This eliminates tensor parallelism overhead and halves GPU count compared to the H100.

Maximum inference throughput: For any memory-bound inference workload, the H200's 4.8 TB/s bandwidth delivers 37-90% more throughput than the H100 depending on the model size and batching strategy.

Long-context inference: Serving requests with 8K to 128K context lengths generates large KV caches. The H200's extra 61 GB of VRAM accommodates more concurrent long-context sessions.

Multi-tenant MIG serving: The H200's 16.5 GB MIG instances can serve 7B–13B models, while the H100's 10 GB instances max out at ~5B. For platforms serving multiple smaller models, the H200 enables denser packing.

Choose the H100 When

Compute-bound training: For models that fit in 80 GB (anything under ~30B parameters in FP16 with optimizer states), training throughput is identical. The H100's lower price makes it the better value.

Small model inference: For models under 30 GB (7B–13B parameter range in FP16), the H100's 80 GB provides ample VRAM and the memory bandwidth isn't the bottleneck. The H100 delivers equivalent per-token latency at a lower hourly rate.

Budget-sensitive deployments: At $1.49-$2.99/hr on GPU clouds, the H100 is 30-40% cheaper per hour. For workloads that don't benefit from the H200's extra memory, this translates directly to cost savings.

Existing infrastructure: If your cluster already uses H100s with NVLink and NVSwitch, adding more H100s is simpler than mixing GPU types. Both GPUs use the same interconnect, but homogeneous clusters are easier to manage.

The Blackwell Successor: B200 and B100

NVIDIA's next-generation Blackwell architecture (B200, B100) launched in late 2024 and early 2025, offering significant improvements over both Hopper GPUs.

| Specification | H100 SXM | H200 SXM | B200 |
|---|---|---|---|
| VRAM | 80 GB HBM3 | 141 GB HBM3e | 192 GB HBM3e |
| Memory Bandwidth | 3,350 GB/s | 4,800 GB/s | 8,000 GB/s |
| FP8 Performance | 3,958 TFLOPS | 3,958 TFLOPS | 9,000 TFLOPS |
| FP4 Performance | N/A | N/A | 18,000 TFLOPS |
| TDP | 700W | 700W | 1,000W |

The B200 delivers 2.3x the compute and 1.7x the memory bandwidth of the H200. However, Blackwell availability remains constrained and pricing is significantly higher. For teams deploying today, the H100 and H200 remain the most available and cost-effective Hopper-class GPUs.

Deploy on Spheron

Looking for GPU infrastructure to power your AI workloads? Spheron offers bare-metal GPU access with transparent pricing, instant provisioning, and no long-term contracts. Deploy on H100, H200, A100, and RTX 4090 GPUs with full root access, NVLink support, and pay-per-second billing.

Whether you're serving Llama 70B on a single H200 or training across H100 clusters, Spheron provides the flexibility to scale without hardware commitments.

Explore GPU options on Spheron →

Frequently Asked Questions

Is the H200 just an H100 with more memory?

Essentially, yes. Both GPUs use the same Hopper GH100 die with identical CUDA cores, Tensor Cores, and Transformer Engine. The H200 upgrades the memory subsystem from 80 GB HBM3 (3,350 GB/s) to 141 GB HBM3e (4,800 GB/s). This makes the H200 significantly faster for memory-bound workloads like LLM inference, while compute-bound workloads see little to no improvement.

How much faster is the H200 for LLM inference?

The H200 is 37-90% faster than the H100 for LLM inference depending on the model size and serving configuration. MLPerf v4.0 benchmarks show 42% higher throughput on Llama 2 70B in offline mode and 37% in server mode. For maximum throughput on a single GPU, the H200 delivers up to 1.9x the H100's performance on Llama 70B.

Does the H200 train models faster than the H100?

For compute-bound training on models that fit in 80 GB, performance is identical. The compute engines are the same. The H200 pulls ahead when training is memory-bound: large batch sizes, long context lengths, or models between 30B-70B parameters where the H100 runs out of VRAM and requires gradient checkpointing or model parallelism.

Can the H200 serve Llama 70B on a single GPU?

Yes. Llama 2 70B in FP16 requires approximately 140 GB of VRAM for model weights alone. The H200's 141 GB HBM3e can hold the full model on a single GPU, while the H100's 80 GB requires at least 2-way tensor parallelism. This single-GPU serving halves the hardware cost and eliminates inter-GPU communication latency.

Should I wait for Blackwell (B200) instead?

It depends on your timeline. The B200 offers 2.3x the compute and 1.7x the bandwidth of the H200, with 192 GB HBM3e. However, Blackwell availability is limited and pricing is substantially higher. If you need GPU capacity now, the H100 and H200 are the most available and cost-effective options. If you can wait 6+ months and have the budget, Blackwell delivers a generational improvement.

Is the H200 worth the price premium over the H100?

For memory-bound workloads (inference on 70B+ models, long-context serving, multi-tenant MIG), yes. The H200's 37-90% higher throughput more than justifies its 30-50% higher hourly cloud rate, often delivering lower cost-per-token. For compute-bound workloads (training small-to-medium models, inference on sub-30B models), the H100 provides equivalent performance at a lower price.
