I Couldn’t Debug My AI/ML GPU Incident, So I Built My Own Tool
Several weeks ago, I ran into a problem with ML jobs running on my GPU server. Alerts fired at midnight, and one of the jobs failed due to GPU memory exhaustion. The next morning, I performed a root-cause analysis to understand what had happened overnight. However, I couldn’t identify the issue, because I only had access to aggregate GPU usage metrics at the current point in time. I used nvidia-smi and nvtop to inspect the current state, but there was no trace of the previous night’s problem. I realized I needed a solution to prevent similar blind spots in the future.

I first tried the DCGM exporter to expose GPU metrics, but it couldn’t provide PID-level metrics. I also tested it in a Kubernetes environment to get pod-level metrics, but that didn’t work because our GPUs only support time-slicing mode. So I developed an open-source tool called gpuxray to monitor GPUs at the process level. It has helped our team significantly in observing and investigating bottlenecks in AI/ML processes running on Linux servers. It exposes metrics in Prometheus format, which we use to build Grafana dashboards visualizing resource usage at the process level. We deployed the tool in a Kubernetes cluster as a DaemonSet on every GPU node that needs to be monitored. With this setup, we can easily enable per-process GPU observability.

Design and Architecture

The tool achieves high performance while consuming minimal resources, because it is built on eBPF to trace GPU memory-related events. eBPF is powerful here because it lets us observe exactly the events we care about as they happen; in this case, gpuxray attaches probes to CUDA API calls.
The project is built on a solid codebase, making it easy to extend in the future. If you have ideas, feel free to open a discussion or a pull request. Now let me describe the architecture so you can understand how it works.

The userspace code handles the main logic and is written in Go. The eBPF program is attached to CUDA API calls; when these APIs are invoked, events are captured. The eBPF program performs lightweight processing at the kernel level, updates eBPF maps, and pushes events to a ring buffer. The userspace code then consumes events from the ring buffer, processes them, and produces the final metrics output.

With the mon option, the tool adds virtually no load on the GPU server. When tracing memory leaks with the memtrace option for a specific PID, I used a Python script to generate more than 2,000 malloc/free calls per second on the GPU and observed the resource usage: the tool consumed only about 8% of a single CPU core (on a server with 32 CPU cores and 125 GB of RAM). This is impressive, because roughly 2,000 malloc/free operations per second is well beyond a typical real-world workload. As a result, we don’t need to worry about performance or resource overhead when using this tool.

Feel free to explore the project, try it out, and contribute your ideas:
https://github.com/vuvietnguyenit/gpuxray