Tools: Windows vs Linux for Local AI: My Radeon 890M Has 96GB of RAM, but Windows Only Lets Me Use 3.5GB (2026)
The Setup
The Windows VRAM Wall
The Linux Promise (and Its Own Problem)
The Decision Matrix
Why This Matters
So What Does Work?
The Structural Problem I run a Ryzen AI 9 HX 370 mini PC as my daily AI workstation. 96GB of system RAM, a Radeon 890M integrated GPU (gfx1150, RDNA 3.5). On paper, this should be a capable local inference box — 96GB is enough to run Gemma 4, Llama 3 70B, even Mixtral 8x22B at reasonable quantization. There's just one problem: Windows refuses to let the iGPU use more than ~3.5GB of that memory. I spent the better part of a week trying every workaround I could find. Here's what I learned — and why Linux won this round. The Radeon 890M is an integrated GPU. It has no dedicated VRAM. On Windows, the GPU driver allocates a fixed shared GPU memory budget from system RAM. For the 890M, that budget caps at roughly 3.5GB. The root cause? Windows' WDDM driver model + Hypervisor-based security (VBS). When Memory Integrity is enabled (and on modern Windows 11 installs, it is by default), every GPU memory allocation goes through an extra hypervisor verification layer. The driver responds by clamping shared GPU memory to a conservative default. And there's no "unlock" switch — even in Group Policy or AMD's own Adrenalin control panel. For a 7B model at Q4_K_M (~4.5GB VRAM needed), that 3.5GB wall means the model doesn't fit. Period. The alternative is CPU-only inference — using all 96GB of RAM, but at maybe 2-3 tokens per second for anything larger than 7B. Windows gives you a choice: tiny models on GPU, or glacial models on CPU. There's no middle ground. Linux doesn't have this VRAM cap. The Mesa Vulkan driver for AMD GPUs (amdgpu, radv) lets the GPU use as much system memory as needed. You can allocate 64GB for a model and the driver won't blink. So I set up a dual-boot, installed ROCm, and ran a few tests. The performance difference was dramatic — llama.cpp with Vulkan on Linux could load a 14B Q4_K_M model entirely on the GPU, no VRAM wall. But there was a catch. ROCm does not support the Radeon 890M. AMD's official stance, confirmed by the Framework community (Framework Laptop 13 Ryzen AI 9 HX 370): "AMD ROCm does not support the Radeon 890M (gfx1150). PyTorch cannot run with ROCm on this GPU." The Vulkan backend in llama.cpp does work on Linux. It's faster than CPU and has no VRAM cap. But it's nowhere near as fast as native ROCm acceleration would be. You're getting maybe 40-60% of the performance you'd get with a supported GPU stack. The HX 370 is AMD's flagship mobile AI chip. It's built on a 4nm process, has an XDNA 2 NPU rated for 50+ TOPS, and pairs with a capable RDNA 3.5 iGPU. AMD clearly wants this chip in the "AI PC" category. AMD's GPU software stack is fragmented. ROCm works on their discrete GPUs (RX 7900 series, some W-series) and Instinct cards. It does not work on RDNA 3.5 integrated GPUs. Period. The Windows HIP SDK also doesn't support it. If you bought an HX 370 machine thinking ROCm would be available, you bought a paperweight for AI workloads. Windows' driver model is actively hostile to shared-memory GPUs. The 3.5GB hard cap isn't a bug. It's a deliberate safety boundary in WDDM. Apple Silicon Macs can share 64GB+ between CPU and GPU seamlessly. The HX 370 with 96GB of RAM should be able to do the same. It cannot. Not on Windows. Linux works, but not well enough. The Vulkan backend lifts the VRAM cap, which is the most important win. But without ROCm, you're leaving performance on the table — roughly 40-60% of what a properly supported GPU would deliver. For this exact machine (HX 370 + Radeon 890M), the realistic stack is: For production workloads / multi-model serving: Linux + CPU-only. 96GB RAM handles Gemma 4 27B, Llama 3 70B, Qwen 2.5 72B at Q3/Q4. Slow but reliable. For single-user inference / experimentation: Linux + llama.cpp Vulkan. Best balance of VRAM headroom and acceleration. No hard cap, no heavy tuning. For portability / Windows-only environments: Windows + llama.cpp Vulkan. Stuck at the 3.5GB cap. Fine for 3B models, frustrating for anything larger. This isn't about Windows vs Linux fanboyism. The structural issue is: The HX 370 is a great CPU with a mediocre AI story. If you're building a local inference box today, you're better off pairing it with a discrete GPU — or just skipping Windows entirely for AI workloads. Using an HX 370 for LLMs? A ⭐ or a one-word issue tells me what to build next — helps more than you'd think. Templates let you quickly answer FAQs or store snippets for re-use. as well , this person and/or - Registry hacks (GpuMemoryAllocation, HwSchMode) → no effect
- DirectML via PyTorch → OOM on anything above a 7B model- llama.cpp with Vulkan backend → same 3.5GB limit, enforced by the driver- Disabling Memory Integrity / VBS → freed up ~500MB, still hit the wall- BIOS UMA Frame Buffer tweaking → the 890M's firmware-based allocation is hard-coded - AMD's GPU software stack is fragmented. ROCm works on their discrete GPUs (RX 7900 series, some W-series) and Instinct cards. It does not work on RDNA 3.5 integrated GPUs. Period. The Windows HIP SDK also doesn't support it. If you bought an HX 370 machine thinking ROCm would be available, you bought a paperweight for AI workloads.- Windows' driver model is actively hostile to shared-memory GPUs. The 3.5GB hard cap isn't a bug. It's a deliberate safety boundary in WDDM. Apple Silicon Macs can share 64GB+ between CPU and GPU seamlessly. The HX 370 with 96GB of RAM should be able to do the same. It cannot. Not on Windows.- Linux works, but not well enough. The Vulkan backend lifts the VRAM cap, which is the most important win. But without ROCm, you're leaving performance on the table — roughly 40-60% of what a properly supported GPU would deliver. - AMD's software commitment stops at their discrete GPU line. Integrated GPUs are second-class citizens for AI — despite being in every Ryzen AI chip shipping today.- Windows' driver architecture assumes dGPUs with dedicated VRAM. Shared-memory GPUs are an afterthought, and the security hypervisor layer makes it worse.- Linux avoids the cap but can't fill the acceleration gap. Without ROCm, you're running on Vulkan — which is a translation layer, not a native compute stack.