# When I Took Numba to the Dojo: A Battle Royale Against Rust and CUDA
2025-12-26
## A Thank You Note

Before we dive in, I want to acknowledge Shreyan Ghosh (@zenoguy) and his wonderful article "When Time Became a Variable – Notes From My Journey With Numba". His piece captured something beautiful about computing: the joy of experimentation, the thrill of watching code go fast, and the curiosity to ask "what if?"

This line stuck with me:

> "Somewhere between algorithms and hardware, Numba didn't just make my code faster. It made exploration lighter."

Reading his benchmarks, I couldn't help but wonder: what happens when we throw Rust into the mix? What about raw CUDA? Where does the hardware actually give up?

So I built a dojo. Let's spar.

## 🎯 The Challenge

Same challenge as Shreyan's original experiment: compute this for 20 million elements.
```
f(x) = sqrt(x² + 1) × sin(x) + cos(x/2)
```

Simple math. Maximum optimization. Who wins?

## 🥊 The Contenders

I assembled fighters from different worlds:

### Team Python 🐍

- Pure Python – The baseline. Interpreter overhead. GIL-bound.
- NumPy Vectorized – The standard approach.
- Numba JIT – Single-threaded compiled.
- Numba Parallel – Multi-threaded with prange.
- Numba @vectorize – Parallel ufunc magic.

### Team Rust 🦀

- Single-threaded – Idiomatic iterators.
- Parallel (Rayon) – Work-stealing parallelism.
- Parallel Chunks – Cache-optimized chunking.

### Team GPU 🎮

- Numba CUDA – Python on the GPU.
- CUDA C++ FP64 – Double precision native.
- CUDA C++ FP32 – Single precision native.
- CUDA C++ Intrinsics – Hardware-optimized math.

## 🏗️ The Setup

I wanted this to be reproducible and fair (the sketch after this list shows how):

- Same computation across all implementations
- Same array size (20 million float64 elements)
- Same random seed (42, obviously)
- Multiple warmup runs to eliminate JIT/cache effects
- Take the minimum of multiple runs (least noise)
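In code, that methodology boils down to a few lines. Here's a minimal sketch of the timing harness; the `bench` helper is my illustration, not necessarily the repo's exact code:

```python
import time
import numpy as np

def bench(fn, *args, warmup=3, runs=10):
    # Warmup runs absorb JIT compilation and cache effects.
    for _ in range(warmup):
        fn(*args)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return min(times)  # report the minimum: the least-noise estimate

rng = np.random.default_rng(42)   # same seed everywhere
arr = rng.random(20_000_000)      # 20 million float64 elements
out = np.empty_like(arr)
```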
The full benchmark suite is open source: github.com/copyleftdev/numba-dojo

```bash
# Run everything yourself
git clone https://github.com/copyleftdev/numba-dojo.git
cd numba-dojo
make all
```
## 📊 The Results

Let's see who survived the dojo.

## 🔬 What I Learned

### 1. GPU Demolishes CPU (When It Fits)

The RTX 3080 Ti at 0.21 ms is 3,255x faster than NumPy. That's not a typo. For embarrassingly parallel workloads like element-wise computation, GPUs are in a different league. The massive parallelism (80 streaming multiprocessors, thousands of cores) absolutely crushes sequential execution.
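For reference, the NumPy side of that comparison is just the vectorized expression. A sketch (the repo's code may differ in detail):

```python
import numpy as np

def compute_numpy(arr):
    # Each operation allocates a full 20M-element temporary and walks
    # memory again, which is part of why even "fast" NumPy loses here.
    return np.sqrt(arr * arr + 1.0) * np.sin(arr) + np.cos(arr * 0.5)
```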
### 2. FP32 is 20x Faster Than FP64 on Consumer GPUs

```
CUDA FP64: 4.11 ms
CUDA FP32: 0.21 ms  ← 20x faster!
```

Consumer GPUs (GeForce series) have limited FP64 units – typically 1/32 the throughput of FP32. If your computation can tolerate single precision, use it.
### 3. Rust ≈ Numba JIT (Single-Threaded)

```
Rust Single: 555.62 ms
Numba JIT:   558.30 ms
```

Both compile to LLVM IR. Both get similar codegen. The difference is noise. This validates Numba's claim: "Feels like Python, behaves like C."
### 4. Rust Beats Numba in Parallel (~20%)

```
Rust Parallel (Rayon): 12.39 ms
Numba Parallel:        15.55 ms
```

Rayon's work-stealing scheduler has lower overhead than Numba's threading. For CPU-parallel workloads in production, Rust has an edge.
### 5. We Hit the Memory Bandwidth Wall

This was the most interesting discovery. When I profiled the FP32 CUDA kernel:

```
Time:        0.21 ms
Bandwidth:   ~777 GB/s achieved
Theoretical: 912 GB/s (RTX 3080 Ti)
Efficiency:  85%
```
We're running at 85% of peak memory bandwidth. The GPU cores are actually waiting for data. No algorithm can beat physics. This is the Roofline Model in action:

```
Performance
    |              ____________ Peak Compute
    |             /
    |            /
    |           / <- We're here (memory-bound)
    |          /
    |         /  Memory Bandwidth (slope)
    +-----------------------------> Arithmetic Intensity
```

For this workload with low arithmetic intensity (few ops per byte), we've hit the ceiling.
## 🧪 The Code

Here's what each implementation looks like:

### Numba (The Hero of The Original Article)

```python
from numba import njit, prange
import numpy as np

@njit(parallel=True, fastmath=True, cache=True)
def compute_numba_parallel(arr, out):
    n = len(arr)
    for i in prange(n):
        val = arr[i]
        out[i] = np.sqrt(val * val + 1.0) * np.sin(val) + np.cos(val * 0.5)
```

Just add @njit. That's it. Shreyan was right – this is magical.
### Rust (The Challenger)

```rust
use rayon::prelude::*;

fn compute_parallel(arr: &[f64], out: &mut [f64]) {
    out.par_iter_mut()
        .zip(arr.par_iter())
        .for_each(|(o, &v)| {
            *o = (v * v + 1.0).sqrt() * v.sin() + (v * 0.5).cos();
        });
}
```

Rayon makes parallelism feel as natural as iterators.
### CUDA C++ (The Champion)

```cpp
__global__ void compute_fp32(const float* arr, float* out, size_t n) {
    for (size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < n;
         idx += blockDim.x * gridDim.x) {
        float val = arr[idx];
        out[idx] = sqrtf(val * val + 1.0f) * sinf(val) + cosf(val * 0.5f);
    }
}
```

Grid-stride loops for maximum occupancy.
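The Numba CUDA contender uses the same grid-stride idea from Python. A sketch built on the standard `numba.cuda` API (not necessarily the repo's exact kernel):

```python
import math
import numpy as np
from numba import cuda

@cuda.jit
def compute_cuda(arr, out):
    start = cuda.grid(1)       # this thread's global index
    stride = cuda.gridsize(1)  # total threads in the grid
    for i in range(start, arr.size, stride):
        v = arr[i]
        out[i] = math.sqrt(v * v + 1.0) * math.sin(v) + math.cos(v * 0.5)

arr = np.random.rand(20_000_000).astype(np.float32)
out = np.empty_like(arr)
compute_cuda[512, 256](arr, out)  # any grid size covers all 20M elements
```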
## 🎯 When to Use What

Distilling the lessons above:

- Numba – when you live in Python and want compiled speed one decorator away.
- Rust + Rayon – when you need production CPU-parallel performance; it ran ~20% ahead of Numba here.
- CUDA (FP32 if you can tolerate it) – when the workload is embarrassingly parallel and fits on the GPU.

## 🎬 Final Thoughts

Shreyan's original article reminded me why I love computing: we get to ask "what if?" and then actually find out.

- What if we compile this loop? 43x faster.
- What if we use all CPU cores? 54x faster.
- What if we throw a GPU at it? 3,255x faster.
- What if we hit the memory bandwidth wall? Physics wins.

The journey from Pure Python (6.6 seconds) to CUDA FP32 (0.2 milliseconds) is a 33,000x improvement. That's not optimization – that's transformation.

Keep experimenting. Keep playing. That's what computing is for. ✨

What's your favorite performance optimization story? Drop it in the comments!

## 📚 Resources

- Full source code: github.com/copyleftdev/numba-dojo
- Original inspiration: @zenoguy's Numba article
- Numba docs: numba.pydata.org
- Rayon (Rust): docs.rs/rayon
- Roofline Model: Wikipedia