Powerful Nvidia GB10's Memory Subsystem, From The CPU Side

GB10 is a collaboration between Nvidia and Mediatek that brings Nvidia’s Blackwell architecture into an integrated GPU. GB10’s GPU has 48 Blackwell SMs, matching the RTX 5070 in SM count. The CPU side is powerful too, with 10 Cortex X925 and 10 Cortex A725 cores. Feeding all of that compute power requires a beefy memory subsystem and can force difficult tradeoffs. This article analyzes GB10’s memory subsystem from the CPU side. To keep article length manageable, I’ll focus on Nvidia and Mediatek’s memory subsystem design decisions. Core architecture and GB10’s GPU will be an exercise for another time.

We’d like to thank Zach at ZeroOne Technology for allowing us SSH access to his DGX Spark unit for CPU testing.

CPU cores on GB10 are split into two clusters. Each cluster has five A725 cores and five X925 cores. Within each cluster, the A725 cores are numbered first, and the second cluster’s cores are numbered after the first cluster’s. All of the A725 cores run at 2.8 GHz. X925 cores clock up to 3.9 GHz on the first cluster, and up to 4 GHz on the second.
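As a rough illustration of that layout, the sketch below maps a logical core index to its cluster, core type, and peak clock. The index ranges (0-4 and 10-14 for A725, 5-9 and 15-19 for X925) are an assumed reading of the numbering described above, and `CoreInfo` and `describe_core` are hypothetical names used only for this example.

```python
from dataclasses import dataclass

# Layout and clocks as described in the text. The exact index ranges are an
# assumption: A725 cores first within each cluster, clusters back to back.
CORES_PER_TYPE = 5
CORES_PER_CLUSTER = 10
A725_GHZ = 2.8
X925_GHZ = {0: 3.9, 1: 4.0}  # peak X925 clock per cluster


@dataclass(frozen=True)
class CoreInfo:
    index: int
    cluster: int
    core_type: str
    max_ghz: float


def describe_core(index: int) -> CoreInfo:
    """Map a logical core index (0-19) to cluster, core type, and peak clock."""
    if not 0 <= index < 2 * CORES_PER_CLUSTER:
        raise ValueError(f"core index out of range: {index}")
    cluster, offset = divmod(index, CORES_PER_CLUSTER)
    if offset < CORES_PER_TYPE:
        return CoreInfo(index, cluster, "A725", A725_GHZ)
    return CoreInfo(index, cluster, "X925", X925_GHZ[cluster])


if __name__ == "__main__":
    for i in (0, 5, 10, 15):
        print(describe_core(i))
```

A helper like this is handy when pinning a test thread to a specific core, since the core index alone determines which L2 size and which cluster's L3 the test will see.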

Arm’s A725 and X925 have configurable cache capacities. GB10 opts for 64 KB L1 instruction and data caches on both core types. All A725 cores get 512 KB L2 caches, and all X925 cores get 2 MB of L2. A725’s L2 is 8-way set associative and has a latency of just 9 cycles. In actual time, that comes out to 3.2 nanoseconds, which is good considering the low 2.8 GHz clock speed. However, L3 latency is poor at over 21 ns, or more than 60 cycles.
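The cycle and nanosecond figures above are related by the core clock: latency in nanoseconds is the cycle count divided by the clock in GHz. A minimal sketch of that arithmetic follows, using the A725's 2.8 GHz clock from the text. The helper names are illustrative only.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in core cycles to nanoseconds."""
    return cycles / clock_ghz


def ns_to_cycles(ns: float, clock_ghz: float) -> float:
    """Convert a latency in nanoseconds to core cycles."""
    return ns * clock_ghz


A725_GHZ = 2.8

# A725 L2: 9 cycles at 2.8 GHz is about 3.2 ns.
print(f"L2: {cycles_to_ns(9, A725_GHZ):.1f} ns")

# A725 L3: 60 cycles at 2.8 GHz is about 21.4 ns, so "over 21 ns"
# and "more than 60 cycles" describe the same ballpark.
print(f"L3 at 60 cycles: {cycles_to_ns(60, A725_GHZ):.1f} ns")
```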

Testing cores across both clusters indicates that the first CPU cluster has 8 MB of L3, while the second has 16 MB. I’ll refer to these as Cluster 0 and Cluster 1 respectively. Both clusters have the same L3 latency from an A725 core, despite the capacity difference. 512 KB isn’t a lot of L2 capacity when L3 latency is this high. Likely, selecting the 512 KB L2 option reduces core area and lets GB10 implement more cores. Doing so makes sense considering that A725 cores aren’t meant to individually deliver high single threaded performance. That task is best left to the X925 cores.
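To see how a capacity difference like this shows up in a latency test, the sketch below gives a deliberately simplified model of which cache level a given test size should fit in. It uses only the capacities quoted in the text and ignores inclusivity, associativity conflicts, and whatever sits beyond L3. The function name and the "beyond L3" label are assumptions for illustration, not the article's test methodology.

```python
KB = 1024
MB = 1024 * KB

# Capacities as quoted in the text.
L1D_BYTES = 64 * KB
L2_BYTES = {"A725": 512 * KB, "X925": 2 * MB}
L3_BYTES = {0: 8 * MB, 1: 16 * MB}  # keyed by cluster


def expected_level(core_type: str, cluster: int, working_set_bytes: int) -> str:
    """Return the smallest cache level whose capacity covers the working set.

    Simplified model: treats each level as if it could hold the whole
    working set on its own, which real cache hierarchies only approximate.
    """
    if working_set_bytes <= L1D_BYTES:
        return "L1"
    if working_set_bytes <= L2_BYTES[core_type]:
        return "L2"
    if working_set_bytes <= L3_BYTES[cluster]:
        return "L3"
    return "beyond L3"


if __name__ == "__main__":
    # A 12 MB test size separates the two clusters.
    print(expected_level("A725", 0, 12 * MB))
    print(expected_level("A725", 1, 12 * MB))
```

Under this model, a 12 MB test size spills out of Cluster 0's L3 but still fits in Cluster 1's, which is the kind of divergence that reveals the two different L3 capacities.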

GB10’s X925 cores have 2 MB, 8-way set associative L2 caches with 12 cycle latency. L3 latency is surprisingly much better at ~56 cycles or ~14 ns, even though the A725 and X925 cores share the same L3. While it’s not a spectacular L3 latency result, it’s at least similar to Intel’s Arrow Lake L3 in nanosecond terms. Combined with the larger L2, that gives GB10’s X925 cores

Source: HackerNews