Your training job crashes at 3 AM. The error says "CUDA out of memory," but system monitors show plenty of free RAM. CPU usage looks normal. Disk is fine. You restart with a smaller batch size, and a few hours later it fails the same way. The real issue: nobody was watching GPU memory usage. The KV cache or activation memory grew gradually until it hit the VRAM limit.
This scenario is painfully common. Industry surveys show that more than 75% of organizations run GPUs below 70% utilization even at peak load. Only 7% achieve over 85% utilization. That means most teams simultaneously waste expensive GPU capacity and deal with preventable crashes because they lack visibility into what their GPUs are actually doing.
This guide covers GPU monitoring at every level: quick command-line checks with nvidia-smi, framework-level memory profiling in PyTorch, production-grade metrics with DCGM Exporter and Prometheus, and cluster-scale observability with Grafana dashboards. Each section includes concrete commands and configurations you can use immediately.
Why GPU Monitoring Is a Business-Critical Investment
GPUs are the most expensive component in the AI stack. On-demand H100 pricing ranges from $1.21/hr on cost-efficient providers to $6.98/hr on major clouds, a 5.7x variance. For a 100-GPU training run over 200 hours, that's the difference between $24,200 and $139,600. Without monitoring, teams cannot tell whether those GPUs are delivering value or sitting idle.
The operational impact of poor monitoring compounds at scale. During Meta's 54-day training run on their Grand Teton platform, the team experienced 419 job interruptions, roughly one failure every 3 hours. When projected to a 128,000-GPU cluster (the scale needed for frontier models), this translates to a job interruption every 23 minutes. Without fault detection and automated recovery, these interruptions cascade through training pipelines, turning weeks of computation into waste.
Research shows that 54.5% of teams cite cost as their biggest GPU challenge, not hardware scarcity. Another 16% explicitly identify monitoring and visibility gaps as a primary blocker. Proper GPU monitoring turns these cost and reliability problems into engineering problems with measurable solutions.
What "GPU Usage" Actually Means
GPU usage is not a single number. It is a collection of metrics that tell different stories about system health. Understanding what each metric reveals and what it hides is essential for effective monitoring.
Compute Utilization (GPU%)
This is what nvidia-smi reports as "GPU-Util", the percentage of time over the last sample period during which one or more GPU kernels were executing. A GPU showing 100% utilization is not necessarily running efficiently; it may be executing poorly parallelized kernels that underutilize the hardware. Conversely, a GPU at 50% utilization with bursty workloads may be perfectly healthy.
Memory Utilization
Memory utilization has two dimensions: how much VRAM is allocated (capacity) and how fast data moves between memory and compute units (bandwidth). Allocated memory shows whether you are approaching the VRAM limit. Bandwidth utilization shows whether the GPU is starving for data, a common bottleneck in LLM inference where memory bandwidth determines tokens-per-second throughput.
Streaming Multiprocessor (SM) Efficiency
SM efficiency measures how well GPU kernels map to the hardware's streaming multiprocessors. Low SM efficiency with high compute utilization often means kernels are poorly parallelized or memory-bound. The GPU is busy but not productive.
Power Draw and Temperature
Power draw is an underrated diagnostic signal. GPUs doing real computational work typically draw power near their TDP (thermal design power). An H100 at 700W TDP that consistently draws only 300W is likely bottlenecked by something other than compute. Temperature matters because sustained heat above 80–85°C triggers thermal throttling, where the GPU automatically reduces clock speeds by 25–30% to protect the hardware.
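If you want these signals programmatically rather than from a terminal, the NVML bindings (the pynvml module, installable as nvidia-ml-py) expose the same counters nvidia-smi reads. A minimal sketch, assuming the bindings and an NVIDIA driver are installed:

```python
import pynvml  # pip install nvidia-ml-py

# Read the signals discussed above (utilization, VRAM, temperature,
# power vs. limit, SM clock) for every GPU via NVML.
pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    util = pynvml.nvmlDeviceGetUtilizationRates(h)
    mem = pynvml.nvmlDeviceGetMemoryInfo(h)
    temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
    power = pynvml.nvmlDeviceGetPowerUsage(h) / 1000            # W
    limit = pynvml.nvmlDeviceGetPowerManagementLimit(h) / 1000  # W, roughly TDP
    sm_clock = pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM)
    print(f"GPU {i}: util {util.gpu}% | mem {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB "
          f"| {temp}C | {power:.0f}/{limit:.0f} W | SM {sm_clock} MHz")
pynvml.nvmlShutdown()
```

Comparing power draw against the power limit and the SM clock against the expected clock is often the quickest way to spot throttling or a starved GPU.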
nvidia-smi: Essential Commands Every ML Engineer Should Know
nvidia-smi (NVIDIA System Management Interface) ships with every NVIDIA driver installation and is the fastest way to check GPU state. Most engineers know the basic command, but nvidia-smi offers much more.
Basic Status Check
The simplest command gives an immediate snapshot:
```bash
nvidia-smi
```

This displays GPU name, driver version, CUDA version, temperature, power draw, memory usage, GPU utilization, and running processes. It's the first thing to run when diagnosing any GPU issue.
Continuous Monitoring
For real-time monitoring during training:
```bash
nvidia-smi -l 1
```

This refreshes every 1 second, showing how metrics evolve over time. Watch for memory steadily climbing toward the limit (impending OOM), GPU utilization dropping to 0% between batches (data pipeline bottleneck), or power draw well below TDP (underutilization).
Device Monitoring (dmon)
The dmon mode provides compact, continuous output ideal for logging:
```bash
nvidia-smi dmon -s pucvmet -d 1
```

This prints one line per second per GPU covering power draw and temperature, SM/memory/encoder/decoder utilization, clock speeds, framebuffer memory usage, ECC errors, throttling violations, and PCIe throughput. The compact format makes it easy to pipe into log files or send to monitoring tools.
Process Monitoring (pmon)
To see which processes consume GPU resources:
```bash
nvidia-smi pmon -d 1
```

This shows per-process GPU utilization, memory usage, and command name. This is essential for multi-tenant systems where multiple training jobs share GPUs, or for identifying rogue processes consuming VRAM.
CSV Export for Analysis
For automated monitoring and historical analysis, nvidia-smi can output structured CSV data:
```bash
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used,power.draw --format=csv -l 5
```

This queries specific metrics every 5 seconds in CSV format. Write the output to a file for post-training analysis:

```bash
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,power.draw --format=csv -l 1 -f gpu_metrics.csv
```

Run nvidia-smi --help-query-gpu to see the complete list of queryable fields. There are over 100 available metrics.
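Once the log exists, a few lines of pandas are enough to summarize a run. A sketch, assuming the gpu_metrics.csv file produced by the command above; nvidia-smi includes units in the headers and values (for example "utilization.gpu [%]" and "45 %"), so the numbers are extracted before aggregating:

```python
import pandas as pd

# Load the nvidia-smi CSV log and strip unit suffixes from headers and values.
df = pd.read_csv("gpu_metrics.csv", skipinitialspace=True)
df.columns = [c.split(" [")[0] for c in df.columns]

for col in ["utilization.gpu", "memory.used", "power.draw"]:
    df[col] = pd.to_numeric(df[col].astype(str).str.extract(r"([\d.]+)", expand=False))

print(f"Mean GPU utilization: {df['utilization.gpu'].mean():.1f}%")
print(f"Peak VRAM used: {df['memory.used'].max():.0f} MiB")
print(f"Share of samples below 50% utilization: {(df['utilization.gpu'] < 50).mean():.1%}")
```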
Compact Alternative: gpustat
For a cleaner view during development:
```bash
pip install gpustat
gpustat -cp -i 1
```

gpustat provides a one-line-per-GPU summary with utilization, memory, temperature, and process info in a color-coded format that's easier to scan than nvidia-smi's default output.
Framework-Level Memory Profiling
nvidia-smi shows total GPU memory usage but can't tell you which tensors, layers, or operations consume that memory. Framework-level profiling fills this gap.
PyTorch Memory Tracking
PyTorch provides built-in APIs for querying GPU memory allocation within training scripts:
```python
import torch

# Check current memory state
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
print(f"Max alloc: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```

The distinction between allocated and reserved memory matters. PyTorch's caching allocator reserves memory blocks for reuse. memory_allocated() shows what your model actually uses, while memory_reserved() shows what PyTorch has claimed from the GPU. A large gap between the two may indicate memory fragmentation.
Memory Snapshot Recording
For deep memory debugging, PyTorch can record a full memory history:
```python
import torch

# Start recording memory events
torch.cuda.memory._record_memory_history(max_entries=100000)

# Your training loop
for epoch in range(num_epochs):
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # Log per-iteration memory
    print(f"Epoch {epoch}: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

# Save snapshot for visualization
torch.cuda.memory._dump_snapshot("memory_profile.pkl")
torch.cuda.memory._record_memory_history(enabled=None)
```

The saved snapshot can be loaded into PyTorch's memory visualizer to see exactly which tensors occupy GPU memory and where allocations happen over time. This is invaluable for diagnosing memory leaks, where tensors that should have been freed persist across iterations.
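Because _record_memory_history and _dump_snapshot are underscore-prefixed internal APIs that may change between PyTorch releases, it can help to wrap them in a small context manager so recording always stops and the snapshot is always written, even if the profiled code raises. A sketch:

```python
from contextlib import contextmanager
import torch

@contextmanager
def memory_history(path: str = "memory_profile.pkl", max_entries: int = 100000):
    # Start recording on entry; dump the snapshot and stop recording on exit.
    torch.cuda.memory._record_memory_history(max_entries=max_entries)
    try:
        yield
    finally:
        torch.cuda.memory._dump_snapshot(path)
        torch.cuda.memory._record_memory_history(enabled=None)

# Usage:
# with memory_history("epoch_0.pkl"):
#     train_one_epoch(model, loader, optimizer)
```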
Memory Leak Detection Pattern
A common pattern for detecting memory leaks during training:
```python
import torch

for step in range(num_steps):
    # Track memory before and after each step
    mem_before = torch.cuda.memory_allocated()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    mem_after = torch.cuda.memory_allocated()
    mem_delta = (mem_after - mem_before) / 1e6  # MB
    if mem_delta > 1.0:  # More than 1 MB growth per step
        print(f"WARNING Step {step}: memory grew by {mem_delta:.1f} MB")
```

If memory consistently grows by more than a trivial amount per step, you have a leak. This is likely from tensors being retained by the computation graph, logging operations that accumulate GPU tensors, or third-party libraries that do not release CUDA memory properly.
Production Monitoring with DCGM and Prometheus
For production clusters, nvidia-smi's polling approach doesn't scale. NVIDIA's Data Center GPU Manager (DCGM) provides a proper metrics pipeline designed for continuous monitoring at scale.
DCGM Exporter Overview
The DCGM Exporter is a container that continuously queries GPU metrics via DCGM and exposes them as a Prometheus-compatible HTTP endpoint (typically on port 9400). It runs with approximately 5% overhead and supports all NVIDIA data center GPUs.
Quick Start with Docker
To start collecting GPU metrics immediately:
```bash
docker run -d --gpus all --cap-add SYS_ADMIN \
  -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04
```

Verify it is working:

```bash
curl localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```

Key DCGM Prometheus Metrics
The DCGM Exporter exposes dozens of GPU metrics. The most important for ML workloads:
| Metric | Description | Alert Threshold |
|---|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU compute utilization (%) | Below 50% for 10+ min |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory bandwidth utilization (%) | Below 30% with high GPU% |
| DCGM_FI_DEV_FB_USED | Framebuffer (VRAM) used (MB) | Above 90% of total |
| DCGM_FI_DEV_FB_FREE | Framebuffer (VRAM) free (MB) | Below 2 GB |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C) | Above 83°C sustained |
| DCGM_FI_DEV_POWER_USAGE | Current power draw (W) | Below 50% of TDP |
| DCGM_FI_DEV_SM_CLOCK | SM clock speed (MHz) | Dropping below base clock |
| DCGM_FI_DEV_PCIE_TX_THROUGHPUT | PCIe TX throughput (KB/s) | Sustained high = offloading |
| DCGM_FI_DEV_XID_ERRORS | XID error count | Any non-zero value |
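Prometheus normally scrapes and evaluates these series for you, but it can be useful to sanity-check the exporter directly. The sketch below pulls the /metrics endpoint and flags any GPU whose framebuffer usage crosses 90%, using the DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE metrics from the table above; the URL assumes the Docker quick-start shown earlier, and the parsing assumes both metrics carry identical label sets per GPU:

```python
import urllib.request

EXPORTER_URL = "http://localhost:9400/metrics"  # assumes the Docker quick-start above

def parse_metric(text: str, name: str) -> dict:
    # Prometheus text format: NAME{label="value",...} <number>
    values = {}
    for line in text.splitlines():
        if line.startswith(name + "{"):
            labels, value = line.rsplit(" ", 1)
            values[labels[len(name):]] = float(value)  # key by the label set only
    return values

body = urllib.request.urlopen(EXPORTER_URL).read().decode()
used = parse_metric(body, "DCGM_FI_DEV_FB_USED")
free = parse_metric(body, "DCGM_FI_DEV_FB_FREE")

for labels, used_mb in used.items():
    total_mb = used_mb + free.get(labels, 0.0)
    if total_mb and used_mb / total_mb > 0.90:
        print(f"WARNING {labels}: VRAM {used_mb:.0f}/{total_mb:.0f} MB (>90% used)")
```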
Kubernetes Deployment
For Kubernetes clusters, DCGM Exporter deploys via Helm or as part of the NVIDIA GPU Operator:
```bash
# Assumes NVIDIA's dcgm-exporter chart repository has already been added as "gpu-helm-charts"
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --set serviceMonitor.enabled=true
```

When running the NVIDIA GPU Operator in your cluster, DCGM Exporter is installed automatically and is ready for Prometheus scraping out of the box.
Grafana Dashboard Setup
Pair DCGM Exporter with Grafana for visual dashboards. NVIDIA provides a pre-built Grafana dashboard (ID: 12239) that visualizes all key GPU metrics. Key panels to include:
- GPU utilization over time (per GPU and cluster average)
- VRAM usage with capacity lines showing danger zones
- Temperature heatmap across all GPUs
- Power draw normalized to TDP percentage
- Memory bandwidth utilization alongside compute utilization
- XID error events overlaid on performance metrics
Example Prometheus Alert Rules
Set up alerts for common GPU failure patterns:
```yaml
groups:
- name: gpu-alerts
  rules:
  - alert: GPUMemoryNearLimit
    expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.92
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GPU {{ $labels.gpu }} memory above 92%"
  - alert: GPUThermalThrottling
    expr: DCGM_FI_DEV_GPU_TEMP > 83
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "GPU {{ $labels.gpu }} sustained temperature above 83°C"
  - alert: GPUUnderutilized
    expr: DCGM_FI_DEV_GPU_UTIL < 30
    for: 15m
    labels:
      severity: info
    annotations:
      summary: "GPU {{ $labels.gpu }} below 30% utilization for 15 minutes"
```

Monitoring Tools Comparison
Different tools serve different stages of GPU monitoring maturity:
| Tool | Best For | Overhead | Scope | Setup Complexity |
|---|---|---|---|---|
| nvidia-smi | Quick debugging, development | None | Single node | None (pre-installed) |
| gpustat | Clean dev-time monitoring | Negligible | Single node | pip install |
| PyTorch Profiler | Memory leak detection, per-tensor analysis | Low | Per-process | Code changes |
| DCGM Exporter | Production cluster monitoring | ~5% | Multi-node | Container + Prometheus |
| Nsight Systems | Deep kernel-level profiling | 20–200x | Single node (lab) | NVIDIA toolkit |
| Prometheus + Grafana | Historical trends, alerting, dashboards | ~5% | Cluster-wide | Helm / docker-compose |
For most ML teams, the progression is: nvidia-smi for development, PyTorch Profiler for debugging, then DCGM + Prometheus + Grafana for production.
How Different ML Workloads Behave
Understanding normal GPU behavior for your workload type helps distinguish real problems from expected patterns.
Training Workloads
Well-optimized training shows steady 85–95% GPU utilization during forward and backward passes, with brief dips between batches for data loading. Key warning signs include sustained utilization below 80% (likely a data pipeline bottleneck), memory climbing steadily across epochs (memory leak), or large utilization variance across GPUs in a multi-GPU setup (load imbalance).
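One way to confirm a data pipeline bottleneck is to time the wait on the DataLoader separately from the GPU work. A minimal sketch, assuming model, loader, criterion, and optimizer already exist:

```python
import time
import torch

# If data_wait dominates, the GPU is starving; if gpu_work dominates,
# the input pipeline is keeping up.
data_wait, gpu_work = 0.0, 0.0
step_start = time.perf_counter()

for inputs, targets in loader:
    loaded = time.perf_counter()
    data_wait += loaded - step_start                 # time blocked on the DataLoader

    outputs = model(inputs.cuda(non_blocking=True))
    loss = criterion(outputs, targets.cuda(non_blocking=True))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    torch.cuda.synchronize()                         # include queued GPU kernels in the timing
    step_start = time.perf_counter()
    gpu_work += step_start - loaded

print(f"Data loading share of step time: {100 * data_wait / (data_wait + gpu_work):.1f}%")
```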
LLM Inference
LLM inference is memory-bandwidth-bound at low batch sizes, where the GPU reads model weights for every token generated. Expect moderate compute utilization (40–70%) but high memory bandwidth utilization. Warning signs include VRAM usage climbing with concurrent requests (KV cache growth without proper limits), latency spikes correlating with memory pressure, or power draw well below TDP despite active serving.
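KV cache growth is easy to estimate up front. The sketch below uses illustrative Llama-2-7B-style dimensions (32 layers, 32 KV heads, head dimension 128, 16-bit cache); substitute your model's configuration:

```python
# Back-of-the-envelope KV cache sizing with illustrative model dimensions.
num_layers = 32
num_kv_heads = 32
head_dim = 128
bytes_per_elem = 2  # fp16 / bf16 cache

def kv_cache_gb(batch_size: int, seq_len: int) -> float:
    # 2x covers keys and values; one entry per layer, head, and token.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem) / 1e9

print(f"{kv_cache_gb(1, 4096):.1f} GB for one request at a 4k context")
print(f"{kv_cache_gb(64, 4096):.1f} GB for 64 concurrent requests at a 4k context")
```

At these example numbers a single 4k-token request holds roughly 2 GB of cache, so a few dozen concurrent long-context requests can exhaust an 80 GB GPU even though the weights fit comfortably.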
Multi-GPU Distributed Training
All GPUs should show similar utilization patterns. Significant differences between GPUs indicate load imbalance, slow interconnects, or uneven data distribution. Monitor NCCL communication time relative to compute time. If communication exceeds 20–30% of total step time, the interconnect (NVLink versus PCIe) or parallelism strategy needs optimization.
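Per-GPU step times can also be compared inside the training loop itself. A hedged sketch for data-parallel jobs, assuming torch.distributed is already initialized (for example via torchrun) and each rank measures its own step duration; the 20% spread threshold is an arbitrary example:

```python
import torch
import torch.distributed as dist

def report_step_time(step: int, step_seconds: float) -> None:
    # Gather every rank's wall-clock step time and flag a large spread.
    local = torch.tensor([step_seconds], device="cuda")
    gathered = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local)
    if dist.get_rank() == 0:
        times = [t.item() for t in gathered]
        slowest, fastest = max(times), min(times)
        if slowest > 1.2 * fastest:
            print(f"step {step}: possible straggler "
                  f"(fastest {fastest:.3f}s, slowest {slowest:.3f}s)")
```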
Common GPU Failure Patterns
Understanding how GPUs fail helps teams set up the right alerts and prevention strategies.
Out-of-Memory (OOM) Crashes
The most frequent GPU failure mode. Memory usage climbs gradually, often from KV cache growth, activation accumulation, or memory fragmentation, until it hits the VRAM limit. Prevention: alert when VRAM exceeds 90%, track memory deltas per iteration, and set explicit KV cache limits in serving frameworks.
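For an in-process guard alongside the DCGM alert, torch.cuda.mem_get_info() reports free and total device memory as the driver sees it. A sketch; the 90% threshold mirrors the alerting guidance above:

```python
import torch

# free/total as reported by the CUDA driver, not just PyTorch's allocator,
# so memory used by other processes on the same GPU is included.
free_bytes, total_bytes = torch.cuda.mem_get_info()
used_fraction = 1 - free_bytes / total_bytes
if used_fraction > 0.90:
    print(f"WARNING: device memory {used_fraction:.0%} full; "
          "reduce batch size or cap the KV cache before continuing")
```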
Memory Leaks
Tensors that should be freed persist on the GPU across iterations. Common causes include retaining computation graphs (forgetting loss.detach()), logging GPU tensors without moving them to CPU, custom CUDA kernels that don't release memory, and third-party library bugs. Prevention: log memory_allocated() per step and alert on consistent growth.
Data Pipeline Starvation
GPUs show low utilization despite active jobs because the data pipeline can't feed data fast enough. This manifests as the GPU alternating between high utilization (processing a batch) and near-zero utilization (waiting for the next batch). Fix: increase DataLoader workers, use faster storage, implement prefetching, or cache frequently accessed data.
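On the PyTorch side, the usual knobs live on the DataLoader itself. A sketch with commonly used settings; the exact values depend on CPU core count, storage speed, and per-sample preprocessing cost, and dataset is assumed to be defined elsewhere:

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                  # assumed to be defined elsewhere
    batch_size=64,
    num_workers=8,            # parallel CPU workers for decoding/augmentation
    pin_memory=True,          # page-locked host memory speeds up host-to-GPU copies
    prefetch_factor=4,        # batches each worker keeps ready ahead of the GPU
    persistent_workers=True,  # keep workers alive between epochs
)
```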
Thermal Throttling
Sustained temperatures above 83–85°C cause automatic clock speed reduction. The GPU appears to run but delivers 25–30% less throughput. This is invisible without temperature monitoring. Utilization may still show high percentages because the GPU is active, just slower. Prevention: monitor temperature and SM clock frequency together.
NCCL Timeout in Distributed Training
In multi-GPU training, gradient synchronization failures (NCCL timeouts) crash the entire job. Often caused by one GPU running slower than others (creating a straggler), network issues between nodes, or memory pressure on one GPU causing it to stall. Prevention: monitor per-GPU step times and NCCL communication overhead for each training run.
Turning Monitoring Data into Cost Savings
GPU monitoring is ultimately a financial tool. Without it, teams can't answer the most basic infrastructure questions: Are our GPUs delivering value? Should we use fewer, more powerful GPUs or more, cheaper ones? Which workloads justify H100s and which should run on A100s?
Right-Sizing GPU Selection
Monitoring data reveals whether your workloads are compute-bound or memory-bound. A training job that shows 95% compute utilization and moderate memory usage is well-matched to its GPU. A job showing 40% compute utilization with memory near capacity needs a GPU with more VRAM, not more compute. Consider an H200 instead of an H100.
Identifying Idle Resources
The organizational data is stark: 15% of organizations use 50% or less of available GPU resources, 40% operate in the 50–70% utilization range, and only 7% achieve over 85% utilization at peak. Monitoring exposes idle GPUs that can be reclaimed, jobs that finish but don't release resources, and development instances left running overnight.
Scheduling Optimization
Historical utilization data enables smarter job scheduling: packing smaller inference jobs onto GPUs during training job idle periods, preempting low-priority jobs when high-priority training needs capacity, and identifying time-of-day patterns that allow spot instance usage during off-peak hours.
Even a 10% improvement in GPU utilization across a 100-GPU cluster at $2/hr/GPU saves $175,000 per year. For larger clusters, the savings are proportionally larger.
Building a GPU Monitoring Strategy
The path to production-grade GPU observability follows a clear progression:
Stage 1: Development - Use nvidia-smi and gpustat for immediate feedback. Add PyTorch memory logging to training scripts. Zero infrastructure cost, zero overhead.
Stage 2: Team Scale - Deploy DCGM Exporter with Prometheus on your GPU nodes. Create basic Grafana dashboards for utilization, memory, and temperature. Set up Slack/PagerDuty alerts for OOM risk and underutilization. Approximately 5% overhead.
Stage 3: Production - Add per-workload GPU metric attribution. Implement automated job scheduling based on GPU availability. Build chargeback or showback reporting for GPU cost allocation. Integrate GPU metrics with application-level metrics (loss curves, throughput, latency).
Stage 4: Cluster Scale - For 100+ GPU environments, add cluster-wide profiling tools that correlate GPU performance with CPU, network, and storage metrics. Implement automated anomaly detection for GPU failures and performance degradation. Build capacity planning dashboards that project future GPU needs.
Each stage builds on the previous one. Start where you are and progress as your GPU footprint grows.
Deploy Monitored GPU Infrastructure on Spheron
Spheron provides bare-metal GPU access with full hardware visibility. There are no hidden virtualization layers or overcommitted memory. Deploy on H100, H200, A100, and RTX 4090 GPUs with transparent pricing, instant provisioning, and predictable performance characteristics.
When you monitor GPUs on Spheron, the metrics you see reflect actual hardware behavior. There is no hypervisor abstraction between your monitoring tools and the physical GPU.
Explore GPU options on Spheron →
Frequently Asked Questions
What is the most important GPU metric to monitor for ML training?
Memory usage (VRAM) is the most critical metric because memory exhaustion causes immediate job crashes, while other issues only degrade performance. Monitor memory.used via nvidia-smi or DCGM_FI_DEV_FB_USED via DCGM Exporter, and alert when usage exceeds 90% of capacity. After memory, GPU utilization and temperature are the next most important. Low utilization indicates waste, and high temperature causes throttling.
How do I check GPU memory usage from the command line?
Run nvidia-smi for a quick snapshot, or nvidia-smi --query-gpu=memory.used,memory.total,memory.free --format=csv -l 1 for continuous CSV output every second. For per-process memory breakdown, use nvidia-smi pmon -d 1. Inside Python, torch.cuda.memory_allocated() shows what your model actually uses, while torch.cuda.memory_reserved() shows what PyTorch has claimed from the GPU.
How do I set up GPU monitoring with Prometheus and Grafana?
Deploy the DCGM Exporter container alongside your GPU nodes (docker run -d --gpus all -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter), configure Prometheus to scrape the :9400/metrics endpoint, and import NVIDIA's pre-built Grafana dashboard (ID: 12239). For Kubernetes, install the NVIDIA GPU Operator which includes DCGM Exporter automatically. The entire setup takes under an hour and adds approximately 5% overhead.
Why does nvidia-smi show 100% GPU utilization but training is still slow?
High GPU utilization means kernels are executing, but it does not mean they are executing efficiently. Common causes include: the GPU is memory-bandwidth-bound (check utilization.memory), kernels are poorly parallelized (low SM efficiency), thermal throttling has reduced clock speeds (check temperature and clocks.current.sm), or the GPU is waiting on synchronization in multi-GPU setups. Use nvidia-smi dmon -s pucvmet to see all metrics simultaneously.
How do I detect GPU memory leaks in PyTorch?
Log torch.cuda.memory_allocated() at the start and end of each training step. If memory consistently grows by more than a trivial amount per step, you have a leak. Common causes include forgetting loss.detach() before logging, accumulating GPU tensors in lists, or third-party libraries that do not release CUDA memory. For deep analysis, use torch.cuda.memory._record_memory_history() to capture a full memory snapshot and identify which tensors are persisting.
What GPU utilization percentage should I target?
For training workloads, target 85–95% compute utilization during active training phases. Below 80% typically indicates a data pipeline bottleneck. For LLM inference, lower compute utilization (40–70%) is normal because inference is memory-bandwidth-bound. Focus on memory bandwidth utilization and throughput (tokens/second) instead. The key is matching the right metric to the workload type rather than targeting a universal utilization number.