Engineering

GPU Monitoring for ML: nvidia-smi, DCGM, and Production Observability Guide

Written by Spheron · Jan 15, 2026

Your training job crashes at 3 AM. The error says "CUDA out of memory," but system monitors show plenty of free RAM. CPU usage looks normal. Disk is fine. You restart with a smaller batch size, and a few hours later it fails the same way. The real issue: nobody was watching GPU memory usage. The KV cache or activation memory grew gradually until it hit the VRAM limit.

This scenario is painfully common. Industry surveys show that more than 75% of organizations run GPUs below 70% utilization even at peak load. Only 7% achieve over 85% utilization. That means most teams simultaneously waste expensive GPU capacity and deal with preventable crashes because they lack visibility into what their GPUs are actually doing.

This guide covers GPU monitoring at every level: quick command-line checks with nvidia-smi, framework-level memory profiling in PyTorch, production-grade metrics with DCGM Exporter and Prometheus, and cluster-scale observability with Grafana dashboards. Each section includes concrete commands and configurations you can use immediately.

Why GPU Monitoring Is a Business-Critical Investment

GPUs are the most expensive component in the AI stack. On-demand H100 pricing ranges from $1.21/hr on cost-efficient providers to $6.98/hr on major clouds, a 5.7x variance. For a 100-GPU training run over 200 hours, that's the difference between $24,200 and $139,600. Without monitoring, teams cannot tell whether those GPUs are delivering value or sitting idle.

The operational impact of poor monitoring compounds at scale. During Meta's 54-day training run on their Grand Teton platform, the team experienced 419 job interruptions, roughly one failure every 3 hours. When projected to a 128,000-GPU cluster (the scale needed for frontier models), this translates to a job interruption every 23 minutes. Without fault detection and automated recovery, these interruptions cascade through training pipelines, turning weeks of computation into waste.

Research shows that 54.5% of teams cite cost as their biggest GPU challenge, not hardware scarcity. Another 16% explicitly identify monitoring and visibility gaps as a primary blocker. Proper GPU monitoring turns these cost and reliability problems into engineering problems with measurable solutions.

What "GPU Usage" Actually Means

GPU usage is not a single number. It is a collection of metrics that tell different stories about system health. Understanding what each metric reveals and what it hides is essential for effective monitoring.

Compute Utilization (GPU%)

This is what nvidia-smi reports as "GPU-Util", the percentage of time over the last sample period during which one or more GPU kernels were executing. A GPU showing 100% utilization is not necessarily running efficiently; it may be executing poorly parallelized kernels that underutilize the hardware. Conversely, a GPU at 50% utilization with bursty workloads may be perfectly healthy.

Memory Utilization

Memory utilization has two dimensions: how much VRAM is allocated (capacity) and how fast data moves between memory and compute units (bandwidth). Allocated memory shows whether you are approaching the VRAM limit. Bandwidth utilization shows whether the GPU is starving for data, a common bottleneck in LLM inference where memory bandwidth determines tokens-per-second throughput.
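
As a back-of-envelope check on why bandwidth dominates, the decode throughput bound can be computed directly: every generated token must stream all model weights from VRAM. The figures below (H100-class HBM3 bandwidth, a 70B-parameter fp16 model) are illustrative round numbers, not measurements:

```python
# Rough upper bound on single-stream LLM decode throughput:
# tokens/sec <= memory_bandwidth / bytes_of_weights_read_per_token.
# All numbers here are illustrative assumptions, not benchmarks.

def decode_tokens_per_sec(bandwidth_gb_s: float, params_b: float,
                          bytes_per_param: int = 2) -> float:
    weight_gb = params_b * bytes_per_param  # 70B params * 2 bytes (fp16) = 140 GB
    return bandwidth_gb_s / weight_gb

# H100 SXM HBM3 bandwidth is roughly 3350 GB/s (illustrative)
print(round(decode_tokens_per_sec(3350, 70), 1))  # → 23.9 tokens/sec upper bound
```

Real serving stacks batch requests to amortize each weight read across many tokens, which is why batching raises throughput far above this single-stream bound.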

Streaming Multiprocessor (SM) Efficiency

SM efficiency measures how well GPU kernels map to the hardware's streaming multiprocessors. Low SM efficiency with high compute utilization often means kernels are poorly parallelized or memory-bound. The GPU is busy but not productive.

Power Draw and Temperature

Power draw is an underrated diagnostic signal. GPUs doing real computational work typically draw power near their TDP (thermal design power). An H100 at 700W TDP that consistently draws only 300W is likely bottlenecked by something other than compute. Temperature matters because sustained heat above 80–85°C triggers thermal throttling, where the GPU automatically reduces clock speeds by 25–30% to protect the hardware.

nvidia-smi: Essential Commands Every ML Engineer Should Know

nvidia-smi (NVIDIA System Management Interface) ships with every NVIDIA driver installation and is the fastest way to check GPU state. Most engineers know the basic command, but nvidia-smi offers much more.

Basic Status Check

The simplest command gives an immediate snapshot:

```bash
nvidia-smi
```

This displays GPU name, driver version, CUDA version, temperature, power draw, memory usage, GPU utilization, and running processes. It's the first thing to run when diagnosing any GPU issue.

Continuous Monitoring

For real-time monitoring during training:

```bash
nvidia-smi -l 1
```

This refreshes every 1 second, showing how metrics evolve over time. Watch for memory steadily climbing toward the limit (impending OOM), GPU utilization dropping to 0% between batches (data pipeline bottleneck), or power draw well below TDP (underutilization).

Device Monitoring (dmon)

The dmon mode provides compact, continuous output ideal for logging:

```bash
nvidia-smi dmon -s pucvmet -d 1
```

This outputs one line per second per GPU covering power draw and temperature, utilization (SM, memory, encoder/decoder), clock speeds, power and thermal violations, framebuffer usage, ECC errors, and PCIe throughput. The compact format makes it easy to pipe into log files or send to monitoring tools.

Process Monitoring (pmon)

To see which processes consume GPU resources:

```bash
nvidia-smi pmon -d 1
```

This shows per-process GPU utilization, memory usage, and command name. This is essential for multi-tenant systems where multiple training jobs share GPUs, or for identifying rogue processes consuming VRAM.

CSV Export for Analysis

For automated monitoring and historical analysis, nvidia-smi can output structured CSV data:

```bash
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used,power.draw --format=csv -l 5
```

This queries specific metrics every 5 seconds in CSV format. Redirect to a file for post-training analysis:

```bash
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,power.draw --format=csv -l 1 -f gpu_metrics.csv
```

Run nvidia-smi --help-query-gpu to see the complete list of queryable fields. There are over 100 available metrics.
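
One way to consume this output programmatically is a small parser that turns each CSV row into a dict for threshold checks. This is a minimal sketch; the field names are whatever you passed to `--query-gpu`, and the sample rows below are made up:

```python
import csv
import io

def parse_smi_csv(text: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu=... --format=csv` output into row dicts."""
    rows = list(csv.reader(io.StringIO(text.strip())))
    header = [h.strip() for h in rows[0]]
    return [dict(zip(header, (v.strip() for v in row))) for row in rows[1:]]

# Hypothetical sample of two polling intervals
sample = """timestamp, utilization.gpu [%], memory.used [MiB]
2026/01/15 03:00:01.000, 97 %, 71234 MiB
2026/01/15 03:00:02.000, 12 %, 79890 MiB"""

rows = parse_smi_csv(sample)
print(rows[1]["memory.used [MiB]"])  # → 79890 MiB
```

Note that values keep their units ("97 %", "79890 MiB"); strip them before doing arithmetic, or query with `--format=csv,noheader,nounits` for bare numbers.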

Compact Alternative: gpustat

For a cleaner view during development:

```bash
pip install gpustat
gpustat -cp -i 1
```

gpustat provides a one-line-per-GPU summary with utilization, memory, temperature, and process info in a color-coded format that's easier to scan than nvidia-smi's default output.

Framework-Level Memory Profiling

nvidia-smi shows total GPU memory usage but can't tell you which tensors, layers, or operations consume that memory. Framework-level profiling fills this gap.

PyTorch Memory Tracking

PyTorch provides built-in APIs for querying GPU memory allocation within training scripts:

```python
import torch

# Check current memory state
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")
print(f"Max alloc: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```

The distinction between allocated and reserved memory matters. PyTorch's caching allocator reserves memory blocks for reuse. memory_allocated() shows what your model actually uses, while memory_reserved() shows what PyTorch has claimed from the GPU. A large gap between the two may indicate memory fragmentation.
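
That gap check can be expressed as a tiny helper. The example numbers are hypothetical, and there is no official cutoff for "too much" fragmentation; treat the result as a trend to watch, not a hard rule:

```python
# Sketch: how much of PyTorch's reserved pool is NOT backing live tensors.
# Feed it torch.cuda.memory_allocated() / torch.cuda.memory_reserved().

def fragmentation_gap(allocated_bytes: int, reserved_bytes: int) -> float:
    """Fraction of reserved memory that is cache or fragmentation."""
    if reserved_bytes == 0:
        return 0.0
    return (reserved_bytes - allocated_bytes) / reserved_bytes

# Hypothetical reading: 30 GB live tensors inside a 42 GB reserved pool
gap = fragmentation_gap(allocated_bytes=30 * 10**9, reserved_bytes=42 * 10**9)
print(f"{gap:.0%}")  # → 29%
```

A gap that grows steadily across iterations, rather than stabilizing after warmup, is the pattern worth investigating.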

Memory Snapshot Recording

For deep memory debugging, PyTorch can record a full memory history:

```python
import torch

# Start recording memory events
torch.cuda.memory._record_memory_history(max_entries=100000)

# Your training loop
for epoch in range(num_epochs):
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Log per-iteration memory
    print(f"Epoch {epoch}: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

# Save snapshot for visualization
torch.cuda.memory._dump_snapshot("memory_profile.pkl")
torch.cuda.memory._record_memory_history(enabled=None)
```

The saved snapshot can be loaded into PyTorch's memory visualizer (pytorch.org/memory_viz) to see exactly which tensors occupy GPU memory and where allocations happen over time. This is invaluable for diagnosing memory leaks: tensors that should be freed but persist across iterations.

Memory Leak Detection Pattern

A common pattern for detecting memory leaks during training:

```python
import torch

for step in range(num_steps):
    # Track memory before and after each step
    mem_before = torch.cuda.memory_allocated()

    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    mem_after = torch.cuda.memory_allocated()
    mem_delta = (mem_after - mem_before) / 1e6  # MB

    if mem_delta > 1.0:  # More than 1 MB growth per step
        print(f"WARNING Step {step}: memory grew by {mem_delta:.1f} MB")
```

If memory consistently grows by more than a trivial amount per step, you have a leak. The usual culprits are tensors retained by the computation graph, logging operations that accumulate GPU tensors, and third-party libraries that do not release CUDA memory properly.
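
The same heuristic can be factored out of the training loop into a reusable detector that works on any sequence of memory readings. The growth threshold and streak length below are illustrative choices, not established defaults:

```python
# Sketch: flag a leak only after several consecutive steps of growth,
# to avoid alerting on normal allocator warmup noise.

class LeakDetector:
    def __init__(self, min_growth_mb: float = 1.0, consecutive: int = 5):
        self.min_growth = min_growth_mb * 1e6  # bytes
        self.consecutive = consecutive
        self.prev = None
        self.streak = 0

    def observe(self, allocated_bytes: int) -> bool:
        """Feed one per-step reading; True once growth is sustained."""
        growing = (self.prev is not None
                   and allocated_bytes - self.prev > self.min_growth)
        self.streak = self.streak + 1 if growing else 0
        self.prev = allocated_bytes
        return self.streak >= self.consecutive

# Simulated readings: +5 MB every step on a 10 GB baseline
det = LeakDetector()
readings = [10_000_000_000 + i * 5_000_000 for i in range(8)]
flags = [det.observe(r) for r in readings]
print(flags.index(True))  # → 5 (first step the leak is flagged)
```

In a real loop you would call `det.observe(torch.cuda.memory_allocated())` once per step and raise an alert when it returns True.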

Production Monitoring with DCGM and Prometheus

For production clusters, nvidia-smi's polling approach doesn't scale. NVIDIA's Data Center GPU Manager (DCGM) provides a proper metrics pipeline designed for continuous monitoring at scale.

DCGM Exporter Overview

The DCGM Exporter is a container that continuously queries GPU metrics via DCGM and exposes them as a Prometheus-compatible HTTP endpoint (typically on port 9400). It runs with approximately 5% overhead and supports all NVIDIA data center GPUs.

Quick Start with Docker

To start collecting GPU metrics immediately:

```bash
docker run -d --gpus all --cap-add SYS_ADMIN \
  -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04
```

Verify it is working:

```bash
curl localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```
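
Each line of that endpoint follows the Prometheus exposition format. Here is a minimal parser sketch; the label names in the sample line mirror what dcgm-exporter typically emits, but check your own `/metrics` output before relying on them:

```python
import re

# metric_name{label="value",...} numeric_value
LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')

def parse_metric_line(line: str):
    """Split one exposition line into (name, labels dict, float value)."""
    m = LINE.match(line.strip())
    name, raw_labels, value = m.group(1), m.group(2), float(m.group(3))
    labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels))
    return name, labels, value

name, labels, value = parse_metric_line(
    'DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-abc",modelName="NVIDIA H100"} 97'
)
print(name, labels["gpu"], value)  # → DCGM_FI_DEV_GPU_UTIL 0 97.0
```

In practice Prometheus does this parsing for you; a hand-rolled parser like this is mainly useful for quick health scripts that hit `:9400/metrics` directly.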

Key DCGM Prometheus Metrics

The DCGM Exporter exposes dozens of GPU metrics. The most important for ML workloads:

| Metric | Description | Alert Threshold |
| --- | --- | --- |
| DCGM_FI_DEV_GPU_UTIL | GPU compute utilization (%) | Below 50% for 10+ min |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory bandwidth utilization (%) | Below 30% with high GPU% |
| DCGM_FI_DEV_FB_USED | Framebuffer (VRAM) used (MB) | Above 90% of total |
| DCGM_FI_DEV_FB_FREE | Framebuffer (VRAM) free (MB) | Below 2 GB |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C) | Above 83°C sustained |
| DCGM_FI_DEV_POWER_USAGE | Current power draw (W) | Below 50% of TDP |
| DCGM_FI_DEV_SM_CLOCK | SM clock speed (MHz) | Dropping below base clock |
| DCGM_FI_DEV_PCIE_TX_THROUGHPUT | PCIe TX throughput (KB/s) | Sustained high = offloading |
| DCGM_FI_DEV_XID_ERRORS | XID error count | Any non-zero value |
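
To actually collect these metrics, Prometheus needs a scrape job pointed at port 9400 on each GPU node. A minimal sketch for prometheus.yml; the job name and target hostnames are placeholders:

```yaml
scrape_configs:
  - job_name: dcgm-exporter       # placeholder name
    scrape_interval: 15s
    static_configs:
      - targets:
          - gpu-node-1:9400       # your GPU node(s) running DCGM Exporter
          - gpu-node-2:9400
```

On Kubernetes this static list is usually replaced by a ServiceMonitor, as shown in the Helm install below.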

Kubernetes Deployment

For Kubernetes clusters, DCGM Exporter deploys via Helm or as part of the NVIDIA GPU Operator:

```bash
# Add NVIDIA's DCGM Exporter chart repo first (from the dcgm-exporter docs)
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --set serviceMonitor.enabled=true
```

When running the NVIDIA GPU Operator in your cluster, DCGM Exporter is installed automatically and is ready for Prometheus scraping out of the box.

Grafana Dashboard Setup

Pair DCGM Exporter with Grafana for visual dashboards. NVIDIA provides a pre-built Grafana dashboard (ID: 12239) that visualizes all key GPU metrics. Key panels to include:

  • GPU utilization over time (per GPU and cluster average)
  • VRAM usage with capacity lines showing danger zones
  • Temperature heatmap across all GPUs
  • Power draw normalized to TDP percentage
  • Memory bandwidth utilization alongside compute utilization
  • XID error events overlaid on performance metrics

Example Prometheus Alert Rules

Set up alerts for common GPU failure patterns:

```yaml
groups:
  - name: gpu-alerts
    rules:
      - alert: GPUMemoryNearLimit
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.92
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} memory above 92%"

      - alert: GPUThermalThrottling
        expr: DCGM_FI_DEV_GPU_TEMP > 83
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} sustained temperature above 83°C"

      - alert: GPUUnderutilized
        expr: DCGM_FI_DEV_GPU_UTIL < 30
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "GPU {{ $labels.gpu }} below 30% utilization for 15 minutes"
```

Monitoring Tools Comparison

Different tools serve different stages of GPU monitoring maturity:

| Tool | Best For | Overhead | Scope | Setup Complexity |
| --- | --- | --- | --- | --- |
| nvidia-smi | Quick debugging, development | None | Single node | None (pre-installed) |
| gpustat | Clean dev-time monitoring | Negligible | Single node | pip install |
| PyTorch Profiler | Memory leak detection, per-tensor analysis | Low | Per-process | Code changes |
| DCGM Exporter | Production cluster monitoring | ~5% | Multi-node | Container + Prometheus |
| Nsight Systems | Deep kernel-level profiling | 20–200x | Single node (lab) | NVIDIA toolkit |
| Prometheus + Grafana | Historical trends, alerting, dashboards | ~5% | Cluster-wide | Helm / docker-compose |

For most ML teams, the progression is: nvidia-smi for development, PyTorch Profiler for debugging, then DCGM + Prometheus + Grafana for production.

How Different ML Workloads Behave

Understanding normal GPU behavior for your workload type helps distinguish real problems from expected patterns.

Training Workloads

Well-optimized training shows steady 85–95% GPU utilization during forward and backward passes, with brief dips between batches for data loading. Key warning signs include sustained utilization below 80% (likely a data pipeline bottleneck), memory climbing steadily across epochs (memory leak), or large utilization variance across GPUs in a multi-GPU setup (load imbalance).

LLM Inference

LLM inference is memory-bandwidth-bound at low batch sizes, where the GPU reads model weights for every token generated. Expect moderate compute utilization (40–70%) but high memory bandwidth utilization. Warning signs include VRAM usage climbing with concurrent requests (KV cache growth without proper limits), latency spikes correlating with memory pressure, or power draw well below TDP despite active serving.

Multi-GPU Distributed Training

All GPUs should show similar utilization patterns. Significant differences between GPUs indicate load imbalance, slow interconnects, or uneven data distribution. Monitor NCCL communication time relative to compute time. If communication exceeds 20–30% of total step time, the interconnect (NVLink versus PCIe) or parallelism strategy needs optimization.
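
Per-rank step times can be checked for stragglers with a simple median-based rule. This is a sketch; the 10% tolerance is an illustrative cutoff, not a standard, and in practice you would gather the timings from your training framework's per-rank logs:

```python
from statistics import median

def find_stragglers(step_times_s: dict[int, float],
                    tolerance: float = 0.10) -> list[int]:
    """Return ranks whose step time exceeds the median by `tolerance`."""
    med = median(step_times_s.values())
    return [rank for rank, t in sorted(step_times_s.items())
            if t > med * (1 + tolerance)]

# Hypothetical per-rank step times: rank 3 stalls every step
times = {0: 0.51, 1: 0.50, 2: 0.52, 3: 0.71}
print(find_stragglers(times))  # → [3]
```

A persistent straggler on the same rank points at that GPU (thermal, memory pressure, failing hardware); stragglers that rotate across ranks point at data loading or network jitter instead.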

Common GPU Failure Patterns

Understanding how GPUs fail helps teams set up the right alerts and prevention strategies.

Out-of-Memory (OOM) Crashes

The most frequent GPU failure mode. Memory usage climbs gradually, often from KV cache growth, activation accumulation, or memory fragmentation, until it hits the VRAM limit. Prevention: alert when VRAM exceeds 90%, track memory deltas per iteration, and set explicit KV cache limits in serving frameworks.

Memory Leaks

Tensors that should be freed persist on the GPU across iterations. Common causes include retaining computation graphs (forgetting loss.detach()), logging GPU tensors without moving them to CPU, custom CUDA kernels that don't release memory, and third-party library bugs. Prevention: log memory_allocated() per step and alert on consistent growth.

Data Pipeline Starvation

GPUs show low utilization despite active jobs because the data pipeline can't feed data fast enough. This manifests as the GPU alternating between high utilization (processing a batch) and near-zero utilization (waiting for the next batch). Fix: increase DataLoader workers, use faster storage, implement prefetching, or cache frequently accessed data.

Thermal Throttling

Sustained temperatures above 83–85°C cause automatic clock speed reduction. The GPU appears to run but delivers 25–30% less throughput. This is invisible without temperature monitoring. Utilization may still show high percentages because the GPU is active, just slower. Prevention: monitor temperature and SM clock frequency together.
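
Combining the two signals is straightforward. A sketch of the check; the base clock and temperature limit below are illustrative, so substitute your GPU's actual values (from `nvidia-smi -q -d CLOCK`):

```python
# Sketch: throttling = hot AND running below base clock at the same time.
# High temperature alone, or a low clock alone (e.g. idle), is not enough.

def is_throttling(temp_c: float, sm_clock_mhz: float, base_clock_mhz: float,
                  temp_limit_c: float = 83.0) -> bool:
    return temp_c >= temp_limit_c and sm_clock_mhz < base_clock_mhz

print(is_throttling(86, 1250, base_clock_mhz=1590))  # → True: hot and down-clocked
print(is_throttling(86, 1590, base_clock_mhz=1590))  # → False: hot but full clock
```

The same rule works as a Prometheus expression by combining DCGM_FI_DEV_GPU_TEMP and DCGM_FI_DEV_SM_CLOCK in one alert.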

NCCL Timeout in Distributed Training

In multi-GPU training, gradient synchronization failures (NCCL timeouts) crash the entire job. Often caused by one GPU running slower than others (creating a straggler), network issues between nodes, or memory pressure on one GPU causing it to stall. Prevention: monitor per-GPU step times and NCCL communication overhead for each training run.

Turning Monitoring Data into Cost Savings

GPU monitoring is ultimately a financial tool. Without it, teams can't answer the most basic infrastructure questions: Are our GPUs delivering value? Should we use fewer, more powerful GPUs or more, cheaper ones? Which workloads justify H100s and which should run on A100s?

Right-Sizing GPU Selection

Monitoring data reveals whether your workloads are compute-bound or memory-bound. A training job that shows 95% compute utilization and moderate memory usage is well-matched to its GPU. A job showing 40% compute utilization with memory near capacity needs a GPU with more VRAM, not more compute. Consider an H200 instead of an H100.

Identifying Idle Resources

The organizational data is stark: 15% of organizations use 50% or less of available GPU resources, 40% operate in the 50–70% utilization range, and only 7% achieve over 85% utilization at peak. Monitoring exposes idle GPUs that can be reclaimed, jobs that finish but don't release resources, and development instances left running overnight.

Scheduling Optimization

Historical utilization data enables smarter job scheduling: packing smaller inference jobs onto GPUs during training job idle periods, preempting low-priority jobs when high-priority training needs capacity, and identifying time-of-day patterns that allow spot instance usage during off-peak hours.

Even a 10% improvement in GPU utilization across a 100-GPU cluster at $2/hr/GPU saves $175,000 per year. For larger clusters, the savings are proportionally larger.
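
The arithmetic behind that figure, as a reusable sketch (the exact product is $175,200; the text rounds down):

```python
HOURS_PER_YEAR = 24 * 365  # 8760

def annual_savings(num_gpus: int, hourly_rate: float,
                   utilization_gain: float) -> float:
    """Yearly value of reclaimed GPU-hours from a utilization improvement."""
    return num_gpus * hourly_rate * HOURS_PER_YEAR * utilization_gain

print(f"${annual_savings(100, 2.0, 0.10):,.0f}")  # → $175,200
```

The same function makes it easy to model other scenarios, such as a 500-GPU cluster or a 5% gain at H100 on-demand rates.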

Building a GPU Monitoring Strategy

The path to production-grade GPU observability follows a clear progression:

Stage 1: Development - Use nvidia-smi and gpustat for immediate feedback. Add PyTorch memory logging to training scripts. Zero infrastructure cost, zero overhead.

Stage 2: Team Scale - Deploy DCGM Exporter with Prometheus on your GPU nodes. Create basic Grafana dashboards for utilization, memory, and temperature. Set up Slack/PagerDuty alerts for OOM risk and underutilization. Approximately 5% overhead.

Stage 3: Production - Add per-workload GPU metric attribution. Implement automated job scheduling based on GPU availability. Build chargeback or showback reporting for GPU cost allocation. Integrate GPU metrics with application-level metrics (loss curves, throughput, latency).

Stage 4: Cluster Scale - For 100+ GPU environments, add cluster-wide profiling tools that correlate GPU performance with CPU, network, and storage metrics. Implement automated anomaly detection for GPU failures and performance degradation. Build capacity planning dashboards that project future GPU needs.

Each stage builds on the previous one. Start where you are and progress as your GPU footprint grows.

Deploy Monitored GPU Infrastructure on Spheron

Spheron provides bare-metal GPU access with full hardware visibility. There are no hidden virtualization layers or overcommitted memory. Deploy on H100, H200, A100, and RTX 4090 GPUs with transparent pricing, instant provisioning, and predictable performance characteristics.

When you monitor GPUs on Spheron, the metrics you see reflect actual hardware behavior. There is no hypervisor abstraction between your monitoring tools and the physical GPU.

Explore GPU options on Spheron →

Frequently Asked Questions

What is the most important GPU metric to monitor for ML training?

Memory usage (VRAM) is the most critical metric because memory exhaustion causes immediate job crashes, while other issues only degrade performance. Monitor memory.used via nvidia-smi or DCGM_FI_DEV_FB_USED via DCGM Exporter, and alert when usage exceeds 90% of capacity. After memory, GPU utilization and temperature are the next most important. Low utilization indicates waste, and high temperature causes throttling.

How do I check GPU memory usage from the command line?

Run nvidia-smi for a quick snapshot, or nvidia-smi --query-gpu=memory.used,memory.total,memory.free --format=csv -l 1 for continuous CSV output every second. For per-process memory breakdown, use nvidia-smi pmon -d 1. Inside Python, torch.cuda.memory_allocated() shows what your model actually uses, while torch.cuda.memory_reserved() shows what PyTorch has claimed from the GPU.

How do I set up GPU monitoring with Prometheus and Grafana?

Deploy the DCGM Exporter container alongside your GPU nodes (docker run -d --gpus all -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter), configure Prometheus to scrape the :9400/metrics endpoint, and import NVIDIA's pre-built Grafana dashboard (ID: 12239). For Kubernetes, install the NVIDIA GPU Operator which includes DCGM Exporter automatically. The entire setup takes under an hour and adds approximately 5% overhead.

Why does nvidia-smi show 100% GPU utilization but training is still slow?

High GPU utilization means kernels are executing, but it does not mean they are executing efficiently. Common causes include: the GPU is memory-bandwidth-bound (check utilization.memory), kernels are poorly parallelized (low SM efficiency), thermal throttling has reduced clock speeds (check temperature and clocks.current.sm), or the GPU is waiting on synchronization in multi-GPU setups. Use nvidia-smi dmon -s pucvmet to see all metrics simultaneously.

How do I detect GPU memory leaks in PyTorch?

Log torch.cuda.memory_allocated() at the start and end of each training step. If memory consistently grows by more than a trivial amount per step, you have a leak. Common causes include forgetting loss.detach() before logging, accumulating GPU tensors in lists, or third-party libraries that do not release CUDA memory. For deep analysis, use torch.cuda.memory._record_memory_history() to capture a full memory snapshot and identify which tensors are persisting.

What GPU utilization percentage should I target?

For training workloads, target 85–95% compute utilization during active training phases. Below 80% typically indicates a data pipeline bottleneck. For LLM inference, lower compute utilization (40–70%) is normal because inference is memory-bandwidth-bound. Focus on memory bandwidth utilization and throughput (tokens/second) instead. The key is matching the right metric to the workload type rather than targeting a universal utilization number.

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.