Tools: A Complete Guide to Real-Time GPU Usage Monitoring (2026)

Key Takeaways

What GPU Utilization Metrics Actually Mean

GPU Core Utilization vs. Memory Utilization

SM Utilization, Memory Bandwidth, and Power Draw

Why These Metrics Matter for Deep Learning Workloads

GPU Bottlenecks and Out of Memory Errors

CPU Preprocessing Bottlenecks

What Causes OOM Errors and How to Resolve Them

Monitoring GPU Utilization with nvidia-smi

Basic nvidia-smi Output and What Each Field Shows

Running nvidia-smi in Continuous Loop Mode

Logging nvidia-smi Output to a File

Querying Specific Metrics with nvidia-smi --query-gpu

Per-Process GPU Monitoring

Using nvidia-smi pmon for Process-Level Metrics

Correlating Process IDs to Application Names

Interactive GPU Monitoring with nvtop and gpustat

Installing and Running nvtop

Installing and Running gpustat

When to Use nvtop vs. gpustat vs. nvidia-smi

GPU Monitoring with Glances

GPU Monitoring Inside Docker Containers and Kubernetes

Exposing GPU Metrics in Docker with the NVIDIA Container Toolkit

Monitoring GPU Utilization in Kubernetes with DCGM Exporter

Viewing GPU Metrics in a DigitalOcean Managed Kubernetes Cluster

Setting Up Persistent GPU Monitoring with Datadog

Installing the Datadog Agent with NVIDIA GPU Support

Configuring the GPU Integration and Tag Strategy

Building a Real-Time GPU Dashboard and Setting Alerts

Setting Up GPU Monitoring with Zabbix

Enabling the NVIDIA GPU Template in Zabbix

Configuring Triggers for Utilization Thresholds

Enabling Unified GPU Usage Monitoring on Windows

What Unified GPU Usage Monitoring Is

How to Enable It via NVIDIA Control Panel and Registry

Reading Unified GPU Data via Task Manager and WMI

Comparing GPU Monitoring Tools

Feature and Trade-off Comparison Table

Choosing the Right Tool for Your Use Case

Conclusion

The fastest way to monitor GPU utilization in real time on Linux is to run nvidia-smi --loop=1, which refreshes GPU stats every second, including core utilization, VRAM usage, temperature, and power draw. Monitoring GPU utilization in real time starts with nvidia-smi, then expands to per-process views, container metrics, and alerts for long-running jobs. This guide shows command-level workflows you can run on Ubuntu, GPU Droplets, Docker hosts, and Kubernetes clusters. If you are building or operating deep learning systems, pair this guide with How To Set Up a Deep Learning Environment on Ubuntu and DigitalOcean GPU Droplets.

GPU utilization metrics tell you whether your job is compute-bound, memory-bound, input-bound, or idle between batches. Start by tracking core utilization, memory usage, memory controller load, temperature, and power draw together instead of looking at one metric in isolation.

GPU core utilization is the percentage of time kernels are actively executing on SMs during the sampling window. GPU memory utilization in nvidia-smi usually refers to memory controller activity, while memory usage is allocated VRAM in MiB. Low core utilization with high allocated VRAM often means the model is resident but waiting on data or synchronization. High core utilization with low memory controller activity is more common in compute-heavy kernels.

SM utilization tells you whether CUDA cores are busy, memory bandwidth indicates how hard memory channels are being driven, and power draw shows electrical load relative to the card limit. These three together explain why two workloads with similar utilization percentages can perform differently. Use power.draw, power.limit, and utilization metrics in the same sample window when tuning batch size and dataloader workers. If power is capped while utilization is high, clock throttling can be the next bottleneck to investigate.
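As an illustration of reading these signals together, here is a small Python sketch that parses one CSV sample in the nvidia-smi --query-gpu format and applies the heuristics above. The function names, field order, and thresholds are illustrative, not part of the original guide.

```python
# Fields assumed to be requested, in this order, via:
#   nvidia-smi --query-gpu=utilization.gpu,utilization.memory,\
#       memory.used,memory.total,power.draw,power.limit \
#       --format=csv,noheader,nounits
def parse_sample(csv_line):
    """Parse one CSV line of nvidia-smi output into named fields."""
    gpu_util, mem_util, mem_used, mem_total, p_draw, p_limit = (
        float(x) for x in csv_line.split(", ")
    )
    return {
        "gpu_util": gpu_util,
        "mem_util": mem_util,
        "vram_frac": mem_used / mem_total,    # fraction of VRAM allocated
        "power_frac": p_draw / p_limit,       # fraction of the power cap
    }

def classify(s):
    """Heuristic bottleneck hint from one sample (thresholds illustrative)."""
    if s["gpu_util"] < 30 and s["vram_frac"] > 0.5:
        return "likely input-bound: model resident but SMs idle"
    if s["gpu_util"] > 80 and s["power_frac"] > 0.95:
        return "compute-bound near the power cap: check for clock throttling"
    return "no obvious single bottleneck from this sample"
```

A single sample is only a hint; sample repeatedly during a real training step before acting on the classification.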
These metrics matter because training throughput is gated by the slowest stage in the pipeline. If GPU cores are idle while CPU or storage is saturated, adding another GPU will not fix throughput.

<$>[note]
For a practical environment baseline before tuning, follow How To Set Up a Deep Learning Environment on Ubuntu.
<$>

Most GPU incidents in ML pipelines come from input bottlenecks or VRAM pressure. Diagnose both at the same time by sampling GPU, CPU, and process-level memory while a real training job is running.

If CPU preprocessing is the bottleneck, GPU utilization drops between mini-batches even when VRAM remains allocated. This pattern appears when image decode, augmentation, or tokenization is slower than kernel execution. Check host pressure with vmstat while your training loop runs, watching r, wa, bi, and us plus sy together:

- r is the number of runnable processes; if it stays above your CPU core count, the CPU is saturated.
- wa is CPU time waiting on I/O; sustained values above 10 to 15 during training often mean dataloader workers are blocked on disk reads.
- bi is blocks received from storage; high bi with high wa points to storage bottlenecks instead of compute.
- us + sy is total active CPU time; if it is high while GPU-Util is low, preprocessing is outrunning the GPU.

If wa is high, increase dataloader workers or switch to faster storage. If us + sy is high with low GPU-Util, move transforms to the GPU with a library such as Kornia.

OOM errors happen when requested allocations exceed available VRAM, often due to large batch sizes, long sequence lengths, or concurrent GPU processes. Resolve OOM by lowering memory pressure first, then increasing the workload cautiously. If a stale process is still holding VRAM after a failed run, list active compute processes, verify ownership, terminate the stale PID, then confirm the memory was released.

<$>[warning]
Do not kill unknown PIDs on shared hosts. Verify process ownership and job context first.
<$>

nvidia-smi is the fastest built-in tool for real-time GPU telemetry on Linux servers. It ships with the NVIDIA drivers and documents the fields used by most higher-level integrations.
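The vmstat heuristics above can be sketched as a small Python helper. The column layout is assumed from the default procps vmstat output, and the thresholds are illustrative, not canonical.

```python
# Column order in default `vmstat` output (assumed from procps):
#   r b swpd free buff cache si so bi bo in cs us sy id wa st
COLS = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()

def diagnose_host(vmstat_line, cpu_cores, gpu_util):
    """Apply the heuristics above to one vmstat data row plus GPU-Util (%)."""
    v = dict(zip(COLS, (int(x) for x in vmstat_line.split())))
    hints = []
    if v["r"] > cpu_cores:
        hints.append("CPU saturated: runnable queue exceeds core count")
    if v["wa"] > 15:
        hints.append("I/O wait high: dataloader workers likely blocked on disk")
    if v["us"] + v["sy"] > 80 and gpu_util < 30:
        hints.append("CPU busy, GPU idle: preprocessing cannot keep up")
    return hints or ["no host-side bottleneck evident in this sample"]
```

Feed it a single vmstat data row (not the header lines) together with the core count and the GPU-Util value from the same sampling window.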
Run nvidia-smi with no flags for a full snapshot of GPU and process state. Focus first on GPU-Util, Memory-Usage, Temp, and Pwr:Usage/Cap. If GPU-Util shows 0% while a job appears to be running, check three common causes: the job may still be in a CPU-bound preprocessing stage and has not submitted work to the GPU yet; the process may have errored and stayed alive but idle; or the job may be running on a different GPU index, so list all devices with nvidia-smi --list-gpus and check each one.

Use loop mode when you need live updates without writing scripts; --loop=1 refreshes once per second. Write sampled output to a file for post-run inspection, redirecting stdout so each sample is timestamped in your shell history and log stream. Use --query-gpu with --format=csv when you need parseable output for scripts; this is the preferred pattern for cron jobs and custom exporters.

Per-process monitoring answers which application is consuming GPU time right now. Use nvidia-smi pmon to inspect utilization by PID instead of by device only. Run pmon in loop mode to monitor active compute processes; -s um displays utilization and memory-throughput activity by process:

- gpu is the GPU index the process is running on.
- pid is the process ID.
- type is the workload class, where C is compute, G is graphics, and M is mixed.
- sm is the percentage of time spent executing kernels on streaming multiprocessors.
- mem is the percentage of time the memory interface was active for that process.
- enc and dec are encoder and decoder utilization percentages.
- command is the truncated process name.

Map PIDs to full command lines to identify notebook kernels, training scripts, and inference workers. This is required when multiple Python jobs are running under one user.

Use nvtop when you want interactive process control and gpustat when you want compact snapshots in scripts. Both tools complement nvidia-smi rather than replace it. Install nvtop from the Ubuntu repositories, then start it in the terminal.
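For scripted collection, a minimal Python sampler along these lines can append one CSV row per GPU at a fixed interval. The field list mirrors the --query-gpu example in this guide; the sampler itself is a sketch, not an official tool.

```python
import csv
import subprocess
import time

# Fields match the --query-gpu example used elsewhere in this guide.
QUERY_FIELDS = [
    "timestamp", "index", "name", "utilization.gpu", "utilization.memory",
    "memory.used", "memory.total", "temperature.gpu", "power.draw",
]

def build_query_cmd(fields=QUERY_FIELDS):
    """Assemble the nvidia-smi invocation for scripted CSV sampling."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(fields),
        "--format=csv,noheader,nounits",
    ]

def sample_to_csv(path, interval=5, run=subprocess.check_output):
    """Append one CSV row per GPU every `interval` seconds (Ctrl-C to stop)."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            out = run(build_query_cmd(), text=True)
            for line in out.strip().splitlines():
                writer.writerow([x.strip() for x in line.split(",")])
            f.flush()  # keep the file readable while sampling continues
            time.sleep(interval)
```

Run sample_to_csv("gpu.csv") on the GPU host; the resulting file loads directly into pandas or a spreadsheet for post-run analysis.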
It provides live bars and per-process views similar to htop.

Install gpustat with pip, then use watch mode for one-second updates. This is useful in SSH sessions where minimal output matters.

Use nvidia-smi for canonical driver-level data and scripted queries. Use gpustat for low-noise terminal snapshots, and use nvtop for interactive process monitoring during active debugging.

Use Glances when you need one terminal dashboard for GPU, CPU, memory, disk, and network at once. Install it with the GPU extra so NVIDIA metrics are available. In the Glances GPU line, util maps to GPU core activity, and mem shows allocated versus total VRAM. temp and power indicate thermal and electrical load during the sample window. Use these values together to identify whether workload pressure is compute, memory, or thermal related. Glances is a better choice than nvidia-smi when you want CPU, memory, disk, and GPU in one non-scrolling view during interactive debugging on a single node.

<$>[note]
If Glances shows no GPU section, verify that NVIDIA drivers are installed on the host and that the Python environment running Glances can access NVML.
<$>

Containerized GPU monitoring requires host runtime support first, then workload-level metric collection. Start with the NVIDIA Container Toolkit for Docker and DCGM Exporter for Kubernetes clusters. Install the NVIDIA Container Toolkit on the host, then run containers with --gpus all. Inside the container, nvidia-smi should show host GPU telemetry. Use this after setting up Docker by following How To Install and Use Docker on Ubuntu.

<$>[note]
The NVIDIA runtime is only active after the Docker daemon restarts. Already-running containers are not affected, but any new container launched after the restart will have GPU access. For full installation details, see the NVIDIA Container Toolkit guide.
<$>

Deploy DCGM Exporter as a DaemonSet on GPU nodes to expose Prometheus metrics. This creates scrape targets with per-GPU and per-pod metric labels.
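DCGM Exporter serves metrics in the Prometheus text exposition format, which is easy to read ad hoc from a script. The following simplified Python parser (a sketch, not a replacement for a real Prometheus client) extracts per-GPU values for one metric such as DCGM_FI_DEV_GPU_UTIL:

```python
import re

# One exposition line looks like:
#   DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-..."} 78
METRIC_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+([0-9.eE+-]+)\s*$')

def parse_exposition(text, metric="DCGM_FI_DEV_GPU_UTIL"):
    """Extract {gpu_index: value} for one metric from Prometheus text output.

    Simplified: skips HELP/TYPE comment lines and any other metric, and
    assumes label values contain no embedded commas.
    """
    values = {}
    for line in text.splitlines():
        m = METRIC_RE.match(line.strip())
        if not m or m.group(1) != metric:
            continue
        labels = dict(
            part.split("=", 1) for part in m.group(2).split(",") if "=" in part
        )
        gpu = labels.get("gpu", "").strip('"')
        values[gpu] = float(m.group(3))
    return values
```

Fetch http://<pod-ip>:9400/metrics and pass the response body to parse_exposition for a quick per-GPU utilization snapshot without standing up Prometheus.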
To collect GPU metrics in a DOKS cluster, configure Prometheus to scrape the DCGM Exporter DaemonSet, then visualize the data in Grafana or forward it to a hosted monitoring backend. Separate GPU dashboards by node pool and workload labels to avoid mixed-tenancy confusion. Before deployment, review An Introduction to Kubernetes if your team is new to cluster primitives. In a DOKS cluster, use DaemonSet pod IPs or a Kubernetes Service DNS name instead of static node IP targets. For Grafana dashboard import details, see the NVIDIA DCGM Exporter documentation.

Use Datadog when you need long-term retention, tag-based slicing, and alert routing to on-call systems. Install the Agent on each GPU node and enable the NVIDIA integration: install Agent 7 on the GPU host, then enable the nvidia_gpu integration, keeping host drivers and NVML available to the Agent process.

<$>[note]
The NVML integration is not bundled with Agent 7 by default. Install it separately, then configure nvml.d/conf.yaml. Verify the latest available version of the NVML integration before installing.
<$>

Define tags at the host and integration level so you can group by cluster, environment, and workload type. This keeps alert routing and dashboard filters usable at scale. Save the configuration as /etc/datadog-agent/conf.d/nvml.d/conf.yaml, then restart the Agent.

Create timeseries panels for nvidia.gpu.utilization, nvidia.gpu.memory.used, and nvidia.gpu.temperature, then alert on sustained saturation. A practical first alert is GPU utilization above 95% for 10 minutes on production training nodes. Use How To Monitor Your Infrastructure with Datadog for dashboard and monitor fundamentals.

To monitor GPU hosts with Zabbix, install the Zabbix agent on each GPU host, import the NVIDIA GPU template, and configure trigger thresholds for utilization and temperature. Zabbix is the right choice when you need self-hosted monitoring with custom alerting and existing enterprise integrations.
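The sustained-saturation rule behind such alerts (utilization above 95% for 10 minutes) can also be checked offline against samples you have scraped yourself. This Python sketch uses illustrative window and interval parameters:

```python
def sustained_above(samples, threshold=95.0, window=600, interval=15):
    """True if every sample in the trailing `window` seconds exceeds threshold.

    `samples` is a list of (epoch_seconds, value) pairs in ascending time
    order, e.g. GPU utilization scraped every `interval` seconds.
    """
    if not samples:
        return False
    needed = window // interval          # samples required to cover the window
    recent = samples[-needed:]
    # Require full window coverage so a short history cannot trip the alert.
    return len(recent) >= needed and all(v > threshold for _, v in recent)
```

Requiring every sample in the window to exceed the threshold mirrors the "sustained for 10 minutes" semantics and avoids paging on short utilization spikes.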
Import or attach an NVIDIA GPU template in Zabbix, then bind it to hosts that have NVIDIA drivers installed. Template items should poll utilization, memory, temperature, and power.

Create triggers for sustained high utilization, high temperature, and unexpected drops to zero utilization during scheduled training windows. Use trigger expressions with time windows to avoid noise from short spikes. {#GPUINDEX} is a low-level discovery macro populated automatically by the template; you do not need to set it manually.

Unified GPU Usage Monitoring aggregates activity from multiple GPU engines into a single usage view that operators can read quickly. Enable it through NVIDIA Control Panel first, then verify the registry policy where required by your driver profile. Unified monitoring combines graphics, compute, copy, and video engine activity into one normalized utilization metric. This improves cross-process visibility when mixed workloads run on the same adapter.

In NVIDIA Control Panel, enable the GPU activity monitoring feature and apply the settings system-wide. If your environment uses managed policy, set the registry value used by your NVIDIA driver branch to turn on unified usage reporting.

<$>[warning]
Registry value names for unified usage reporting vary by driver branch and policy tooling. Validate the exact key and value against your NVIDIA enterprise driver documentation before changing production systems.
<$>

After enabling unified monitoring, Task Manager can display GPU engine and aggregate usage per process. WMI queries can then be used for scripted collection in Windows-based monitoring workflows.

Use this table to pick a tool based on data depth, operational overhead, and alerting needs. Start with CLI tools for diagnostics, then add Datadog, Zabbix, or DCGM pipelines for persistent monitoring. For single-node debugging, start with nvidia-smi and nvtop. For fleet-level visibility across GPU Droplets and Kubernetes nodes, use DCGM Exporter plus your monitoring backend, or deploy Datadog or Zabbix for retention and alerting.
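Returning to the Windows counters described earlier: the Get-Counter instance names encode the owning PID and engine type, so per-process totals can be aggregated with a short script. This Python sketch assumes the instance-name format shown in the sample Get-Counter output:

```python
import re
from collections import defaultdict

# Instance names look like (format assumed from the sample output):
#   pid_1820_luid_0x00000000_0x0000_engtype_Compute_0
INSTANCE_RE = re.compile(r"pid_(\d+)_.*_engtype_(.+)")

def aggregate_per_pid(counter_samples):
    """Sum engine utilization per process across all GPU engines.

    `counter_samples` is a list of (instance_name, cooked_value) pairs as
    produced by Get-Counter; summing across engines approximates the
    unified per-process usage view.
    """
    totals = defaultdict(float)
    for name, value in counter_samples:
        m = INSTANCE_RE.match(name)
        if m:
            totals[int(m.group(1))] += value
    return dict(totals)
```

Pipe the Get-Counter output into a file or a Python subprocess, split each row into (InstanceName, CookedValue), and the per-PID totals fall out directly.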

If you need a historical record of GPU activity alongside CPU, memory, and disk in a single log, atop captures all of these at configurable intervals and is worth adding to long-running training hosts alongside nvidia-smi.

Real-time GPU utilization monitoring is essential for optimizing deep learning performance, troubleshooting bottlenecks, and achieving efficient resource usage, whether you are running on single nodes, inside containers, or scaling across clustered environments. The right monitoring tool depends on your specific use case: quick one-off checks, interactive debugging, continuous fleet-wide visibility, or long-term metric retention and alerting. Start with simple tools like nvidia-smi for instant visibility, and progress to dashboarding, custom alerting, and enterprise-grade solutions as your needs grow. With the strategies and tools outlined in this guide, you can proactively monitor, troubleshoot, and maximize the performance of your GPU workloads, ensuring smoother operation for development, training, and deployment pipelines.

Commands and Sample Output

vmstat output while a training loop runs:

```
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 824320  74384 901212    0    0     6    10  420  980 18  4 76  2  0
```

List active compute processes holding VRAM:

```
nvidia-smi --query-compute-apps=pid,used_memory,process_name --format=csv,noheader
```

```
18211, 17664 MiB, python
18304, 512 MiB, python
```

Verify ownership, terminate a stale process, and confirm the memory was released:

```
ps -p <PID> -o pid,user,etime,cmd
kill -9 <PID>
nvidia-smi  # Confirm VRAM is now released
```

A basic nvidia-smi snapshot:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.xx       Driver Version: 550.xx       CUDA Version: 12.x     |
| GPU  Name   Temp  Pwr:Usage/Cap       Memory-Usage  GPU-Util  Compute M.    |
|   0  H100    53C   215W / 350W   18240MiB/81920MiB       78%     Default    |
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU    PID  Type  Process name                                 GPU Memory  |
|    0  18211     C  python train.py                                17664MiB  |
+-----------------------------------------------------------------------------+
```

Loop mode, refreshing once per second:

```
nvidia-smi --loop=1
```

```
Wed Mar 26 12:00:01 2026 ... snapshot ...
Wed Mar 26 12:00:02 2026 ... snapshot ...
```

Logging snapshots to a file:

```
nvidia-smi --loop=5 > gpu.log
# gpu.log now contains one snapshot every 5 seconds
```

Parseable per-GPU query for scripts:

```
nvidia-smi --query-gpu=timestamp,index,name,utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu,power.draw --format=csv,noheader,nounits
```

```
2026/03/26 12:10:02.123, 0, NVIDIA H100 80GB HBM3, 82, 54, 18420, 81920, 55, 228.31
```

Per-process monitoring with pmon:

```
nvidia-smi pmon -s um -d 1
```

```
# gpu   pid  type   sm  mem  enc  dec  command
    0 18211     C   76   41    0    0  python
    0 18304     C   12    8    0    0  python
```

Mapping a PID to its full command line:

```
ps -p 18211 -o pid,user,etime,cmd
```

```
  PID USER     ELAPSED CMD
18211 mlops   01:22:11 python train.py --model llama --batch-size 8
```

Installing nvtop and a sample of its display:

```
sudo apt update && sudo apt install -y nvtop
```

```
GPU0 78% MEM 18240/81920 MiB TEMP 54C PWR 221W
PID 18211 python train.py GPU 72% MEM 17664MiB
```

Installing and running gpustat in watch mode:

```
python3 -m pip install --user gpustat
gpustat --watch 1
```

```
hostname  Thu Mar 26 12:25:44 2026
[0] NVIDIA H100 | 54C, 79 % | 18420 / 81920 MB | python/18211(17664M)
```

Installing Glances with GPU support and a sample of its output:

```
python3 -m pip install 'glances[gpu]'
```

```
GPU NVIDIA H100: util 77% | mem 18240/81920MiB | temp 54C | power 220W
CPU: 21.4%  MEM: 62.1%  LOAD: 2.13 1.87 1.66
```

Installing the NVIDIA Container Toolkit and verifying GPU access from a container:

```
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.xx       Driver Version: 550.xx       CUDA Version: 12.x     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
+-----------------------------------------------------------------------------+
```

DCGM Exporter DaemonSet:

```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: gpu-monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04
          ports:
            - containerPort: 9400
```

Exported metric sample:

```
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-..."} 78
```

Prometheus scrape configuration:

```
scrape_configs:
  - job_name: dcgm-exporter
    static_configs:
      - targets: ['<node-ip>:9400']
```

Installing the Datadog Agent and the NVML integration:

```
DD_API_KEY="<YOUR_DATADOG_API_KEY>" DD_SITE="datadoghq.com" bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh)"
sudo datadog-agent integration install -t datadog-nvml==1.0.9
```

/etc/datadog-agent/conf.d/nvml.d/conf.yaml:

```
init_config:
instances:
  - min_collection_interval: 15
    tags:
      - env:prod
      - role:training
      - gpu_vendor:nvidia
```

```
sudo systemctl restart datadog-agent
```

Example monitor query:

```
avg(last_10m):avg:nvidia.gpu.utilization{env:prod,role:training} by {host,gpu_index} > 95
```

Zabbix template import:

```
Path: Data collection -> Templates -> Import
Template: Nvidia by Zabbix agent 2
For some versions, the active mode variant is: Nvidia by Zabbix agent 2 active
Official template source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/nvidia_agent2
```

Example trigger logic using Zabbix agent 2 template item keys:

```
avg(/GPU Host/nvidia.smi[{#GPUINDEX},utilization.gpu],10m)>95 and last(/GPU Host/nvidia.smi[{#GPUINDEX},temperature.gpu])>85
```

Windows registry setting for GPU performance counter visibility:

```
HKEY_LOCAL_MACHINE\SOFTWARE\NVIDIA Corporation\Global\NVTweak
Value name: RmProfilingAdminOnly (DWORD)
Set to 0 to allow non-admin access to GPU performance counters, set to 1 for admin-only.
Reference: https://developer.nvidia.com/ERR_NVGPUCTRPERM
```

Inspecting the registry and reading per-process engine counters:

```
reg query "HKLM\SOFTWARE\NVIDIA Corporation\Global" /s
powershell -Command "Get-Counter '\GPU Engine(*)\Utilization Percentage' | Select-Object -ExpandProperty CounterSamples | Select-Object InstanceName,CookedValue"
```

```
InstanceName                                        CookedValue
pid_1204_luid_0x00000000_0x0000_engtype_3D                27.31
pid_1820_luid_0x00000000_0x0000_engtype_Compute_0         74.02
```

Key Takeaways

- Use nvidia-smi --loop=1 for the fastest host-level real-time GPU check on Linux.
- Use nvidia-smi pmon -s um to identify which PID is using GPU cores and GPU memory bandwidth.
- For terminal dashboards, use nvtop for interactive drill-down and gpustat for lightweight snapshots.
- In containers and Kubernetes, expose metrics through NVIDIA runtime support and DCGM Exporter.
- Persistent alerting belongs in monitoring platforms such as the Datadog Agent or Zabbix templates.
- GPU memory utilization and GPU core utilization are separate signals; high memory with low cores is common in input-stalled jobs.
- On Windows, Unified GPU Usage Monitoring aggregates engine activity and surfaces it in Task Manager and WMI.

Steps to resolve OOM errors:

- Reduce batch size or sequence length.
- Use gradient accumulation to keep the effective batch size.
- Enable mixed precision where supported.
- Terminate stale GPU processes before restarting.
- Move expensive transforms to more efficient pipeline stages.

References:

- NVIDIA System Management Interface (nvidia-smi)
- NVIDIA DCGM User Guide