- NPU handles efficient inference at 50 TOPS / 2W — your always-on workhorse
- iGPU handles flexible parallel compute — batch processing, larger models
- CPU orchestrates, preprocesses, and fills gaps
- A scheduler (running on the NPU itself) learns which ops run best where on your chip

- NPU inference via FastFlowLM: Llama 3.2 1B at ~60 tok/s, under 2 watts
- XDNA kernel driver mainlined in Linux 6.14
- iGPU inference via Vulkan llama.cpp (60% faster than ROCm — more on that below)
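The throughput and power figures above suggest the shape of the placement decision the scheduler has to make. A minimal sketch of that trade-off, assuming a simple per-device cost table — the iGPU and CPU numbers here are placeholders, not measurements; only the NPU's ~60 tok/s at ~2 W comes from the text above:

```python
# Illustrative placement sketch: pick a device per workload from a
# throughput/power table. iGPU and CPU figures are invented placeholders.
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    tok_per_s: float  # sustained decode throughput
    watts: float      # power draw while active

    @property
    def tok_per_joule(self) -> float:
        return self.tok_per_s / self.watts

DEVICES = [
    Device("npu", tok_per_s=60.0, watts=2.0),     # FastFlowLM path (from above)
    Device("igpu", tok_per_s=150.0, watts=25.0),  # Vulkan llama.cpp (placeholder)
    Device("cpu", tok_per_s=20.0, watts=15.0),    # fallback (placeholder)
]

def place(on_battery: bool) -> Device:
    """Most efficient device on battery, fastest device on wall power."""
    key = (lambda d: d.tok_per_joule) if on_battery else (lambda d: d.tok_per_s)
    return max(DEVICES, key=key)

print(place(on_battery=True).name)   # npu: 30 tok/J beats 6 and ~1.3
print(place(on_battery=False).name)  # igpu: highest raw throughput
```

A learned scheduler would replace the static table with per-op measurements, but the decision structure — maximize tokens-per-joule or tokens-per-second depending on power state — stays the same.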
- All three processors sharing physical memory

- ONNX Runtime's Vitis AI EP is completely broken on Linux
- hipMallocManaged returns "not supported" on the 890M
- No DMA-BUF bridge between the GPU and NPU drivers
- Nobody has run all three processors simultaneously for inference

- It runs at <2W, always-on without thermal impact
- XDNA 2 supports dynamic spatial partitioning at column boundaries
- The remaining NPU columns still handle inference workloads
- It's literally a neural processor running a neural scheduling policy

- NPU-as-scheduling-agent for CPU+GPU workload orchestration
- Persistent hardware personality — an evolving model of your chip's specific behavior over weeks/months
- Three-processor dynamic operator placement on a single SoC (CPU+GPU is studied; all three is not)
- Cross-model transfer learning for on-device scheduling (learning from Model A improves scheduling of Model B)
- Vulkan+XRT memory bridge — combining Vulkan's superior unified memory access with XRT buffer objects via CPU-mediated sharing
- NPU-bookended assembly line — NPU dispatches at the start, assembles at the end; CPU and GPU are decoupled async producers. 1000:1 speed ratio makes scheduling overhead effectively zero
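The bookended assembly line above can be sketched with a dispatcher-and-producers model. Everything here is hypothetical scaffolding (the timings, function names, and worker count are assumptions); the point is the arithmetic behind the 1000:1 claim — if dispatch and assembly cost on the order of microseconds while CPU/GPU kernels cost milliseconds, the scheduler's share of per-item cost is a fraction of a percent:

```python
# Sketch of the NPU-bookended pipeline: a fast dispatcher hands work to slow
# async producers and reassembles results in order. All timings are invented.
import concurrent.futures as cf

DISPATCH_COST = 1e-6  # hypothetical NPU dispatch/assembly cost per item (1 us)
KERNEL_COST = 1e-3    # hypothetical CPU/GPU kernel cost per item (1 ms)

def producer(item: int) -> int:
    # Stand-in for a CPU or GPU kernel; here just a trivial transform.
    return item * item

def run_pipeline(items: list[int]) -> tuple[list[int], float]:
    """Dispatch items to async producers, assemble results in submission
    order, and return the scheduler's modeled share of per-item cost."""
    with cf.ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(producer, i) for i in items]  # NPU dispatches
        results = [f.result() for f in futures]              # NPU assembles
    overhead = 2 * DISPATCH_COST  # one bookend at each end, per item
    share = overhead / (overhead + KERNEL_COST)
    return results, share

results, share = run_pipeline(list(range(8)))
print(results)         # [0, 1, 4, 9, 16, 25, 36, 49]
print(f"{share:.4f}")  # 0.0020 — scheduling is ~0.2% of per-item cost
```

Because the producers are decoupled and asynchronous, the dispatcher never blocks on a slow kernel, which is what makes the overhead "effectively zero" rather than merely small.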