Tools: Profiling a CUDA Python Program with GPUFlight (2026)

The sample kernel

1. Is the GPU actually busy?

2. How long did each profiled launch take?

3. Is the problem occupancy?

4. Is the problem thread divergence?

5. What do the memory sectors say?

6. The fix: shared-memory tiling

7. What changed?

Summary

In the previous post, I used a C++ CUDA example to look at memory coalescing and how memory access patterns affect GPU performance. This time, I wanted to look at a similar performance problem from Python.

I usually write CUDA code in C++, but recently I have been spending more time with Python, especially PyTorch and Numba. Numba is interesting because it lets you write a real GPU kernel directly in Python. You can decorate a function with @cuda.jit, launch it with kernel[grid, block](...), and Numba compiles it down to GPU machine code that runs on the actual hardware.

The good news is that GPUFlight can profile Python GPU programs as well. In this post, I’ll profile a simple Numba matrix multiplication kernel with GPUFlight. Then I’ll read the report step by step and show how the report points to a real optimization: shared-memory tiling.

One important note before we start: this example uses GPUFlight’s deeper profiling mode with SASS-level metrics and PC sampling. So the duration numbers in the report should not be treated as clean baseline kernel timing. They include profiling overhead. The main goal here is not to benchmark Numba against an optimized library like cuBLAS. The goal is to show how GPUFlight helps explain what is happening inside the kernel.

Both GPUFlight and Numba can be installed from PyPI. On a fresh Linux machine, I create a Python 3.12 virtual environment, upgrade pip, and install gpufl together with "numba-cuda[cu13]". Printing gpufl.__version__ afterwards should show something like gpufl 1.x.x. At the time I am writing this, the version is 1.0.2.

Before using the profiler, it is a good idea to confirm that Numba can find your GPU, for example by calling cuda.is_available() and cuda.detect().

Now we are ready to run a Python CUDA application with GPUFlight.

The sample kernel

The sample code I am using is a very simple matrix multiplication kernel. Each thread computes one output element. For each element, the thread walks through one full row of A and one full column of B. This is intentionally not optimized. I want to start with a simple kernel, because it makes the profiling report easier to understand.
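To make the cost of "one full row of A and one full column of B" concrete, here is a small back-of-the-envelope sketch. It only reuses the values from the sample (N = 2048 and a 16 x 16 block); the variable names and the idea of counting element loads per thread are my own illustration, not something GPUFlight reports, and requested loads are not the same thing as DRAM traffic because caches can serve some of them.

```python
import math

N = 2048          # matrix dimension used in the sample
tpb = (16, 16)    # threads per block, as in the sample

# Same grid computation as the sample script.
bpg = (math.ceil(N / tpb[0]), math.ceil(N / tpb[1]))

threads_per_block = tpb[0] * tpb[1]
warps_per_block = threads_per_block // 32   # a warp is 32 threads
total_threads = bpg[0] * tpb[0] * bpg[1] * tpb[1]

# Each thread reads one element of A and one element of B per loop iteration.
loads_per_thread = 2 * N
loads_per_launch = total_threads * loads_per_thread

# A and B together hold 2 * N * N elements, so this is the average number
# of times each input element is requested during one launch.
requests_per_element = loads_per_launch // (2 * N * N)

print("grid:", bpg)
print("warps per block:", warps_per_block)
print("threads per launch:", total_threads)
print("element loads requested per launch:", loads_per_launch)
print("average requests per input element:", requests_per_element)
```

Every input element is requested N times on average, which is why data reuse becomes the interesting question later in the report.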
Let’s run it and see what GPUFlight tells us.

Now let’s read the report carefully. A profiling report is only useful if we can turn it into a decision. So instead of just looking at numbers, I usually ask a few questions.

1. Is the GPU actually busy?

The report shows GPU Busy at 97.2% and average GPU utilization at 96.6%. This means the GPU was working for almost the entire run. Out of 17.91 s of wall-clock time, 17.40 s were spent running GPU kernels. The SM clock averages 2631 MHz, and power averages 71.0 W, which is close to the laptop GPU’s power limit.

So this is not a case where the CPU is too slow, the input data is too small, or the GPU is waiting for work. The GPU is busy. That means if we want to improve performance, we need to look inside the kernel.

2. How long did each profiled launch take?

The average duration per launch is 1.74 s (median 1.74 s, min 1.71 s, max 1.78 s). However, this number needs to be read carefully. This run includes deeper profiling, including SASS-level metrics and sampling. That means the measured duration includes profiling overhead. So I should not treat 1.74 s as the clean baseline runtime of the kernel. I would not use this number alone to claim how fast or slow the raw Numba kernel is. But it is still useful as the runtime under this profiling configuration.

3. Is the problem occupancy?

The report shows 100.0% occupancy, with warps as the limiting resource. This tells us the GPU has enough active warps. The SMs are not sitting empty because we launched too few threads. Occupancy is not the same thing as performance, but in this case low occupancy does not look like the main problem.

4. Is the problem thread divergence?

The report shows an average of 32.0 out of 32 threads per warp and 100.0% warp efficiency. This means every warp is using all 32 threads. There is no meaningful branch divergence here. That makes sense because the kernel is simple. The 16 x 16 block and 128 x 128 grid map cleanly to the 2048 x 2048 output matrix.

So far, the report says the GPU is busy, occupancy is full, and the warps are not diverging. So now we need to look at memory behavior.

5. What do the memory sectors say?

This is the most useful part of the report. The two important numbers are smsp__sass_sectors_mem_global (45,654,999,040) and smsp__sass_sectors_mem_global_ideal (13,427,015,680). The kernel is accessing about 45.7B global memory sectors, while the ideal number is about 13.4B. So the kernel is moving about 3.4x more global memory traffic than the ideal case (45.7 / 13.4 ≈ 3.4x). Another way to read it: the memory access efficiency is only around 29% (13.4 / 45.7 ≈ 29%). This is the real story.
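One way to see where those sectors come from is to count them for a single warp by hand. The sketch below is my own simplified model, not GPUFlight output: it assumes 32-byte sectors, row-major float32 arrays whose base addresses are sector-aligned, warps formed from consecutive threads with threadIdx.x varying fastest, and that every warp and every loop iteration behaves like the first one. The names `sectors`, `lanes`, and `per_iteration` are hypothetical helpers introduced for illustration.

```python
N = 2048        # matrix dimension from the sample
TPB = 16        # block is TPB x TPB threads
ELEM = 4        # bytes per float32
SECTOR = 32     # assumed sector size in bytes
LAUNCHES = 10   # the sample launches the kernel ten times


def sectors(addresses):
    """Number of distinct sectors touched by a list of byte addresses."""
    return len({addr // SECTOR for addr in addresses})


# First warp of block (0, 0) as (tx, ty) pairs, threadIdx.x varying fastest.
lanes = [(lane % TPB, lane // TPB) for lane in range(32)]

# In the naive kernel, `row, col = cuda.grid(2)` takes row from the x
# dimension and col from the y dimension.
k = 0
a_addrs = [(tx * N + k) * ELEM for tx, ty in lanes]    # A[row, k]
b_addrs = [(k * N + ty) * ELEM for tx, ty in lanes]    # B[k, col]
c_addrs = [(tx * N + ty) * ELEM for tx, ty in lanes]   # C[row, col]

per_iteration = sectors(a_addrs) + sectors(b_addrs)
warps_per_launch = (N * N) // 32
total = LAUNCHES * warps_per_launch * (N * per_iteration + sectors(c_addrs))

print("sectors for A per warp and iteration:", sectors(a_addrs))
print("sectors for B per warp and iteration:", sectors(b_addrs))
print("modelled total sectors:", total)
```

Under these assumptions the 16 distinct rows in a warp each land in their own sector of A, while the reads from B collapse into a single sector, and the modelled total comes out at 45,654,999,040, the same value the report shows for smsp__sass_sectors_mem_global. That suggests the strided reads of A account for most of the gap to the ideal count.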
The naive kernel makes each thread re-read values from global memory. Many threads need overlapping data from A and B, but the kernel does not reuse that data efficiently. So the same data crosses the memory system again and again. The access pattern also maps poorly onto warps: row comes from the x dimension of cuda.grid(2), so neighboring threads in a warp read elements of A that sit a full row apart. The GPU is busy, the warps are full, and the lanes are active. But the memory access pattern is wasteful.

6. The fix: shared-memory tiling

For this kind of matrix multiplication kernel, the classic fix is shared-memory tiling. Instead of letting each thread repeatedly read everything from global memory, each block cooperatively loads a tile of A and a tile of B into shared memory. Then the threads reuse those values many times before loading the next tile.

The improved kernel, matmul_kernel_perf, uses TPB = 16 and two TPB x TPB shared arrays, sA and sB. Now let’s run the same profiling mode again.

7. What changed?

The result is much better under the same profiling configuration. The full session duration goes down from 17.91 s to 2.90 s. Total GPU time goes down from 17.40 s to 2.22 s. The average profiled kernel duration goes down from 1.74 s to 221.64 ms.

Again, these are still profiled durations, not clean baseline timings. But because both runs use the same deep profiling mode, this comparison is still useful. It tells us the tiled version behaves much better under the same measurement setup.

The most important change is in the memory-sector metrics. In the naive kernel, actual global memory sectors were about 3.4x higher than ideal. In the tiled kernel, actual and ideal global memory sectors are the same: both are 1,347,420,160, which is also about 34x fewer sectors than the 45,654,999,040 of the naive kernel. That is exactly what we wanted to see.

The optimized kernel also uses shared memory: the report shows 0 B dynamic plus 2.0 KB static. That means each block is now reusing data through shared memory instead of repeatedly pulling the same values from global memory.

Instruction count also drops a lot: smsp__sass_inst_executed goes from 69,869,240,320 to 9,312,665,600, about 7.5x fewer warp instructions. So the optimized kernel is not only reducing memory traffic. It is also doing much less total instruction work.

Summary

This example is not a full benchmark. I am not comparing Numba against cuBLAS, and I am not claiming these numbers are the raw kernel runtimes.
The run uses SASS-level profiling and sampling, so there is overhead. But the report is still useful because both versions were measured with the same profiling mode.

More importantly, the report explains why the naive kernel is slow. The first version had a busy GPU (97.2%), 100% occupancy, and 100% warp efficiency, but about 3.4x more global memory sectors than ideal. That means the problem was not lack of work or branch divergence. The problem was the memory access pattern.

After changing the kernel to use shared-memory tiling, actual global memory sectors match the ideal count, the executed instruction count drops by about 7.5x, and the average profiled kernel duration drops from 1.74 s to 221.64 ms.

So the main takeaway is not just “the optimized kernel is faster.” The more important takeaway is that GPUFlight helped point to the right fix. The report showed that the naive kernel was wasting memory bandwidth, and the optimized version confirmed that shared-memory tiling reduced that waste.

That is the workflow I want GPUFlight to support: run your program normally, collect useful GPU metrics, and turn the report into a concrete optimization decision.
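The before and after numbers can be put side by side with a few lines of arithmetic. This sketch only copies values from the two reports; the dictionary layout and the helper name `sector_efficiency` are my own, and the 1.74 s average is entered as 1740 ms, so the duration ratio is approximate and still includes profiling overhead.

```python
# Values copied from the two GPUFlight reports in this post.
naive = {
    "sectors": 45_654_999_040,        # smsp__sass_sectors_mem_global
    "sectors_ideal": 13_427_015_680,  # smsp__sass_sectors_mem_global_ideal
    "warp_inst": 69_869_240_320,      # smsp__sass_inst_executed
    "avg_ms": 1740.0,                 # average profiled duration (1.74 s)
}
tiled = {
    "sectors": 1_347_420_160,
    "sectors_ideal": 1_347_420_160,
    "warp_inst": 9_312_665_600,
    "avg_ms": 221.64,
}


def sector_efficiency(run):
    """Ideal sectors divided by actual sectors (1.0 means nothing wasted)."""
    return run["sectors_ideal"] / run["sectors"]


naive_eff = sector_efficiency(naive)
tiled_eff = sector_efficiency(tiled)
traffic_ratio = naive["sectors"] / tiled["sectors"]
inst_ratio = naive["warp_inst"] / tiled["warp_inst"]
time_ratio = naive["avg_ms"] / tiled["avg_ms"]

print(f"sector efficiency: naive {naive_eff:.1%}, tiled {tiled_eff:.1%}")
print(f"global sectors reduced by {traffic_ratio:.1f}x")
print(f"warp instructions reduced by {inst_ratio:.1f}x")
print(f"profiled duration reduced by {time_ratio:.1f}x")
```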


Installation on a fresh Linux machine:

    sudo apt-get install -y python3.12-venv
    python3 --version   # expect Python 3.12.x
    python3 -m venv ~/gpufl-venv
    source ~/gpufl-venv/bin/activate
    pip install --upgrade pip
    pip install gpufl "numba-cuda[cu13]"
    python -c "import gpufl; print('gpufl', gpufl.__version__)"

Expected output:

    gpufl 1.x.x

Checking that Numba can find the GPU:

    python -c "from numba import cuda; print('cuda available:', cuda.is_available()); cuda.detect()"

Sample code (naive kernel):

    import gpufl as gfl
    from gpufl.report import generate_report
    from numba import cuda
    import numpy as np
    import math
    import os


    @cuda.jit
    def matmul_kernel(A, B, C):
        row, col = cuda.grid(2)
        if row < C.shape[0] and col < C.shape[1]:
            tmp = 0.0
            for k in range(A.shape[1]):
                tmp += A[row, k] * B[k, col]
            C[row, col] = tmp


    LOG_PATH = "./gfl_logs"

    gfl.init(
        app_name="matmul_sample",
        log_path=LOG_PATH,
        sampling_auto_start=True,
        system_sample_rate_ms=100,
        profiling_engine=gfl.ProfilingEngine.PcSamplingWithSass,
    )

    try:
        N = 2048
        A = cuda.to_device(np.random.rand(N, N).astype(np.float32))
        B = cuda.to_device(np.random.rand(N, N).astype(np.float32))
        C = cuda.to_device(np.zeros((N, N), dtype=np.float32))

        tpb = (16, 16)
        bpg = (math.ceil(N / tpb[0]), math.ceil(N / tpb[1]))

        with gfl.Scope("matrix_mul_compute", "math"):
            for _ in range(10):
                matmul_kernel[bpg, tpb](A, B, C)
        _ = C.copy_to_host()
        print("[OK] compute finished")
    finally:
        gfl.shutdown()

    print(
        generate_report(
            os.path.dirname(LOG_PATH) or ".",
            log_prefix=os.path.basename(LOG_PATH),
            top_n=10,
        )
    )

GPUFlight report for the naive kernel:

    ===============================================================================
    GPU Flight Session Report
    Generated: 2026-05-22 05:05:33 UTC
    ===============================================================================

    ===============================================================================
    Session Summary
    ===============================================================================
    Application:      matmul_sample
    Session ID:       565d3c32-86cc-415d-8642-9c140f856f2b
    Duration:         17.91 s
    GPU Device:       NVIDIA GeForce RTX 5060 Laptop GPU
    SMs:              26
    Registers/Block:  65536

    ===============================================================================
    Kernel Execution Summary
    ===============================================================================
    Total Kernels:    10
    Unique Kernels:   1
    Total GPU Time:   17.40 s
    GPU Busy:         97.2%
    Avg Duration:     1.74 s
    Median Duration:  1.74 s
    Min Duration:     1.71 s
    Max Duration:     1.78 s

    ===============================================================================
    Top 10 Kernels by Total GPU Time
    ===============================================================================
    #  Kernel                     Calls  Total    Avg     Max
    --------------------------------------------------------------------------------------
    1  __main__::matmul_kernel    10     17.40 s  1.74 s  1.78 s

    ===============================================================================
    Kernel Details (Top 10)
    ===============================================================================
    __main__::matmul_kernel
    =======================
    Grid:               (128,128,1)
    Block:              (16,16,1)
    Occupancy:          100.0%
    Reg Occupancy:      100.0%
    SMem Occupancy:     100.0%
    Warp Occupancy:     100.0%
    Block Occupancy:    100.0%
    Limiting Resource:  warps
    Registers/Thread:   40
    Shared Memory:      0 B dyn + 0 B static

    ===============================================================================
    Memory Transfer Summary
    ===============================================================================
    Total Transfers:  4
    Total Bytes:      64.0 MB

    Direction  Count  Total Bytes  Avg Throughput
    ------------------------------------------------------
    HtoD       3      48.0 MB      11.68 GB/s
    DtoH       1      16.0 MB      4.40 GB/s

    ===============================================================================
    System Metrics
    ===============================================================================
    GPU Metrics:
      Utilization:  avg 96.6%     peak 100%    min 0%
      Temperature:  avg 53.4 C    peak 58 C
      Power:        avg 71.0 W    peak 75.6 W
      VRAM Usage:   peak 1105 MiB
      SM Clock:     avg 2631 MHz  peak 2790 MHz
    Host Metrics:
      CPU Utilization:  avg 8.6%  peak 29.1%
      RAM Usage:        peak 27593 / 32189 MiB (85.7%)

    ===============================================================================
    Scope Summary
    ===============================================================================
    Scope Timing:
    Scope               Calls  Total      Avg        Max
    ------------------------------------------------------------------------
    matrix_mul_compute  1      195.21 ms  195.21 ms  195.21 ms

    GPU Time by Scope:
    Scope               Kernels  GPU Time  Avg
    ----------------------------------------------------------------
    matrix_mul_compute  10       17.40 s   1.74 s

    ===============================================================================
    Profile / SASS Analysis
    ===============================================================================
    SASS Metrics Summary:
    Metric                               Total
    --------------------------------------------------------------
    smsp__sass_thread_inst_executed      2235815690240
    smsp__sass_inst_executed             69869240320
    smsp__sass_sectors_mem_global        45654999040
    smsp__sass_sectors_mem_global_ideal  13427015680

    Thread Divergence Analysis:
      Warp Instructions:    69869240320
      Thread Instructions:  2235815690240
      Avg Threads/Warp:     32.0 / 32
      Warp Efficiency:      100.0%

Report excerpts discussed in sections 1 to 5:

    GPU Busy:        97.2%
    GPU Util avg:    96.6%
    Total GPU Time:  17.40 s
    Duration:        17.91 s

    Avg Duration:     1.74 s
    Median Duration:  1.74 s
    Min Duration:     1.71 s
    Max Duration:     1.78 s

    Occupancy:          100.0%
    Reg Occupancy:      100.0%
    SMem Occupancy:     100.0%
    Warp Occupancy:     100.0%
    Block Occupancy:    100.0%
    Limiting Resource:  warps

    Avg Threads/Warp:  32.0 / 32
    Warp Efficiency:   100.0%

    SASS Metrics Summary:
    Metric                               Total
    --------------------------------------------------------------
    smsp__sass_thread_inst_executed      2235815690240
    smsp__sass_inst_executed             69869240320
    smsp__sass_sectors_mem_global        45654999040
    smsp__sass_sectors_mem_global_ideal  13427015680

    smsp__sass_sectors_mem_global        45,654,999,040
    smsp__sass_sectors_mem_global_ideal  13,427,015,680

    45.7 / 13.4 ≈ 3.4x

    13.4 / 45.7 ≈ 29%

Tiled kernel:

    from numba import cuda, float32

    TPB = 16


    @cuda.jit
    def matmul_kernel_perf(A, B, C):
        sA = cuda.shared.array((TPB, TPB), dtype=float32)
        sB = cuda.shared.array((TPB, TPB), dtype=float32)

        x, y = cuda.grid(2)
        tx = cuda.threadIdx.x
        ty = cuda.threadIdx.y

        tmp = float32(0.0)
        n_tiles = (A.shape[1] + TPB - 1) // TPB
        for i in range(n_tiles):
            sA[ty, tx] = 0.0
            sB[ty, tx] = 0.0
            if y < A.shape[0] and (tx + i * TPB) < A.shape[1]:
                sA[ty, tx] = A[y, tx + i * TPB]
            if x < B.shape[1] and (ty + i * TPB) < B.shape[0]:
                sB[ty, tx] = B[ty + i * TPB, x]
            cuda.syncthreads()
            for j in range(TPB):
                tmp += sA[ty, j] * sB[j, tx]
            cuda.syncthreads()
        if y < C.shape[0] and x < C.shape[1]:
            C[y, x] = tmp

GPUFlight report for the tiled kernel:

    ===============================================================================
    GPU Flight Session Report
    Generated: 2026-05-22 05:20:40 UTC
    ===============================================================================

    ===============================================================================
    Session Summary
    ===============================================================================
    Application:      matmul_sample_perf
    Session ID:       d44e5478-ba19-4cd1-b3cf-f6d31ab8b0ca
    Duration:         2.90 s
    GPU Device:       NVIDIA GeForce RTX 5060 Laptop GPU
    SMs:              26
    Registers/Block:  65536

    ===============================================================================
    Kernel Execution Summary
    ===============================================================================
    Total Kernels:    10
    Unique Kernels:   1
    Total GPU Time:   2.22 s
    GPU Busy:         76.4%
    Avg Duration:     221.64 ms
    Median Duration:  216.89 ms
    Min Duration:     215.38 ms
    Max Duration:     250.06 ms

    ===============================================================================
    Top 10 Kernels by Total GPU Time
    ===============================================================================
    #  Kernel                          Calls  Total   Avg        Max
    --------------------------------------------------------------------------------------
    1  __main__::matmul_kernel_perf    10     2.22 s  221.64 ms  250.06 ms

    ===============================================================================
    Kernel Details (Top 10)
    ===============================================================================
    __main__::matmul_kernel_perf
    ============================
    Grid:               (128,128,1)
    Block:              (16,16,1)
    Occupancy:          100.0%
    Reg Occupancy:      100.0%
    SMem Occupancy:     100.0%
    Warp Occupancy:     100.0%
    Block Occupancy:    100.0%
    Limiting Resource:  warps
    Registers/Thread:   37
    Shared Memory:      0 B dyn + 2.0 KB static

    ===============================================================================
    Memory Transfer Summary
    ===============================================================================
    Total Transfers:  4
    Total Bytes:      64.0 MB

    Direction  Count  Total Bytes  Avg Throughput
    ------------------------------------------------------
    HtoD       3      48.0 MB      9.87 GB/s
    DtoH       1      16.0 MB      4.45 GB/s

    ===============================================================================
    System Metrics
    ===============================================================================
    GPU Metrics:
      Utilization:  avg 74.9%     peak 100%    min 0%
      Temperature:  avg 43.0 C    peak 48 C
      Power:        avg 51.0 W    peak 76.1 W
      VRAM Usage:   peak 958 MiB
      SM Clock:     avg 2180 MHz  peak 2812 MHz
    Host Metrics:
      CPU Utilization:  avg 16.0%  peak 46.0%
      RAM Usage:        peak 27019 / 32189 MiB (83.9%)

    ===============================================================================
    Scope Summary
    ===============================================================================
    Scope Timing:
    Scope                    Calls  Total      Avg        Max
    ------------------------------------------------------------------------
    matrix_mul_compute_perf  1      330.58 ms  330.58 ms  330.58 ms

    GPU Time by Scope:
    Scope                    Kernels  GPU Time  Avg
    ----------------------------------------------------------------
    matrix_mul_compute_perf  10       2.22 s    221.64 ms

    ===============================================================================
    Profile / SASS Analysis
    ===============================================================================
    SASS Metrics Summary:
    Metric                               Total
    --------------------------------------------------------------
    smsp__sass_thread_inst_executed      298005299200
    smsp__sass_inst_executed             9312665600
    smsp__sass_sectors_mem_global        1347420160
    smsp__sass_sectors_mem_global_ideal  1347420160

    Thread Divergence Analysis:
      Warp Instructions:    9312665600
      Thread Instructions:  298005299200
      Avg Threads/Warp:     32.0 / 32
      Warp Efficiency:      100.0%
46.0% RAM Usage: peak 27019 / 32189 MiB (83.9%) =============================================================================== Scope Summary =============================================================================== Scope Timing: Scope Calls Total Avg Max ------------------------------------------------------------------------ matrix_mul_compute_perf 1 330.58 ms 330.58 ms 330.58 ms GPU Time by Scope: Scope Kernels GPU Time Avg ---------------------------------------------------------------- matrix_mul_compute_perf 10 2.22 s 221.64 ms =============================================================================== Profile / SASS Analysis =============================================================================== SASS Metrics Summary: Metric Total -------------------------------------------------------------- smsp__sass_thread_inst_executed 298005299200 smsp__sass_inst_executed 9312665600 smsp__sass_sectors_mem_global 1347420160 smsp__sass_sectors_mem_global_ideal 1347420160 Thread Divergence Analysis: Warp Instructions: 9312665600 Thread Instructions: 298005299200 Avg Threads/Warp: 32.0 / 32 Warp Efficiency: 100.0% smsp__sass_sectors_mem_global 45,654,999,040 smsp__sass_sectors_mem_global_ideal 13,427,015,680 smsp__sass_sectors_mem_global 45,654,999,040 smsp__sass_sectors_mem_global_ideal 13,427,015,680 smsp__sass_sectors_mem_global 45,654,999,040 smsp__sass_sectors_mem_global_ideal 13,427,015,680 smsp__sass_sectors_mem_global 1,347,420,160 smsp__sass_sectors_mem_global_ideal 1,347,420,160 smsp__sass_sectors_mem_global 1,347,420,160 smsp__sass_sectors_mem_global_ideal 1,347,420,160 smsp__sass_sectors_mem_global 1,347,420,160 smsp__sass_sectors_mem_global_ideal 1,347,420,160 Shared Memory: 0 B dyn + 2.0 KB static Shared Memory: 0 B dyn + 2.0 KB static Shared Memory: 0 B dyn + 2.0 KB static Naive thread instructions: 2,235,815,690,240 Tiled thread instructions: 298,005,299,200 Naive thread instructions: 2,235,815,690,240 Tiled thread instructions: 
298,005,299,200 Naive thread instructions: 2,235,815,690,240 Tiled thread instructions: 298,005,299,200 - The GPU is busy. - Occupancy is high. - Warp efficiency is perfect. - high GPU utilization, - 100% occupancy, - 100% warp efficiency, - but very inefficient global memory access. - total profiled GPU time dropped from 17.40 s to 2.22 s, - average profiled kernel time dropped from 1.74 s to 221.64 ms, - global memory sectors dropped from 45.65B to 1.35B, - and actual global memory sectors matched the ideal number.
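If you want to double-check the ratios behind these bullet points, here is a small sketch in plain Python that recomputes them from the counters in the two reports. The dictionary layout and the helper names (`sector_efficiency`, `warp_efficiency`, `ratio`) are my own and are not part of the GPUFlight API. I am also assuming that warp efficiency is thread instructions divided by (warp instructions x 32), which is consistent with the numbers in the report, and that the 2.0 KB of static shared memory corresponds to two 16x16 float32 tiles.

```python
# Counters copied from the naive and tiled reports in this post.
# The layout and helper names are illustrative, not a GPUFlight API.
NAIVE = {
    "sectors": 45_654_999_040,        # smsp__sass_sectors_mem_global
    "sectors_ideal": 13_427_015_680,  # smsp__sass_sectors_mem_global_ideal
    "thread_inst": 2_235_815_690_240,
    "gpu_time_s": 17.40,              # total profiled GPU time
}
TILED = {
    "sectors": 1_347_420_160,
    "sectors_ideal": 1_347_420_160,
    "thread_inst": 298_005_299_200,   # smsp__sass_thread_inst_executed
    "warp_inst": 9_312_665_600,       # smsp__sass_inst_executed
    "gpu_time_s": 2.22,
}
WARP_SIZE = 32


def sector_efficiency(m):
    """Ideal sectors divided by actual sectors (1.0 means no wasted sectors)."""
    return m["sectors_ideal"] / m["sectors"]


def warp_efficiency(m):
    """Assumed definition: active threads per warp instruction / warp size."""
    return m["thread_inst"] / (m["warp_inst"] * WARP_SIZE)


def ratio(before, after, key):
    """How many times larger a counter was before the change."""
    return before[key] / after[key]


# Assumption: two 16x16 float32 tiles (one for A, one for B).
TILE = 16
tile_bytes = 2 * TILE * TILE * 4

print(f"naive sector efficiency: {sector_efficiency(NAIVE):.1%}")
print(f"tiled sector efficiency: {sector_efficiency(TILED):.1%}")
print(f"tiled warp efficiency:   {warp_efficiency(TILED):.1%}")
print(f"sector reduction:        {ratio(NAIVE, TILED, 'sectors'):.1f}x")
print(f"instruction reduction:   {ratio(NAIVE, TILED, 'thread_inst'):.1f}x")
print(f"profiled time reduction: {ratio(NAIVE, TILED, 'gpu_time_s'):.1f}x")
print(f"static shared memory:    {tile_bytes} B")
```

With these inputs, the naive kernel uses only about 29% of the sectors it fetches, while the tiled kernel fetches about 34 times fewer sectors and executes about 7.5 times fewer thread instructions. The time ratio of roughly 7.8x comes from profiled durations, so it should be read as a trend and not as a clean benchmark.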