# Porting Vello's GPU Tile Rasterizer to Pure Go
2026-02-28
When you call `dc.DrawCircle(400, 300, 100)` in gogpu/gg, what happens under the hood? A tile-based rasterization pipeline — a direct port of Vello's GPU compute pipeline from the linebender team — converts your vector paths into per-pixel coverage values. And it's written entirely in Pure Go.

This article walks through the internals of tilecompute — a 6,700-line port of Vello's original GPU compute rasterizer (the vello_shaders/src/cpu/ pipeline, circa 2020–2023) into Pure Go.

## Why Not Just Use Scanline?

Traditional scanline rasterizers process one row of pixels at a time. They work, but they have a fundamental scalability problem: for a 4K canvas (3840x2160), you iterate over 8.3 million pixels regardless of how simple your shapes are.

Tile-based rasterizers flip this around. They divide the canvas into small tiles (16x16 pixels in our case) and process only the tiles that the vector path actually touches. A circle in the corner of a 4K canvas? You process maybe 40 tiles instead of 8 million pixels.

Vello — created by Raph Levien and the linebender team — pioneered this approach using GPU compute shaders (originally as piet-gpu in 2020, renamed to Vello in 2022). We ported its GPU compute pipeline to Pure Go for use as the rasterization core of gogpu/gg.

Note: This article covers the original GPU compute pipeline (16x16 tiles, 2020–2023).
We've also ported Vello's newer Sparse Strips algorithm (4x4 tiles, August 2024) — see the Two Rasterizers, Two Targets section below.

## The GPU Compute Pipeline

Vello's original rasterizer (2020–2023) uses a tile-based GPU compute approach with 16x16 pixel tiles. The key insight: instead of asking "which pixels does this path cover?", ask "which tiles does each line segment cross, and what's the winding contribution?"

The full pipeline has 9 GPU compute stages (matching our 9 WGSL shaders), plus scene encoding and curve flattening on the CPU. The first four stages use monoid prefix sums — Vello's core parallelism primitive. A monoid is an associative operation with an identity element; prefix sums over monoids can be computed in O(log n) parallel steps on a GPU, turning what would be sequential parsing into massively parallel work. Let's walk through the key stages.

## Stages 1–4: Monoid Prefix Sums

The first four stages parse the encoded scene in parallel using monoid prefix sums. Each path tag and draw tag is reduced into a monoid (an associative structure), then scanned to produce cumulative values. On the GPU, reduce + scan runs in O(log n) parallel steps. On the CPU, it's a sequential scan — but the data structures are identical, making the CPU code a direct reference implementation for the GPU shaders.

## Stage 5: Path Count — DDA Tile Walk

The path count stage answers: "which tiles does each line segment cross?" We use a Digital Differential Analyzer (DDA) — an algorithm that traces a line through a grid, visiting every cell the line passes through. For each tile visited, we record two things:

- Segment count — how many line segments cross this tile
- Backdrop — the signed winding contribution at the tile's left edge

The backdrop is the critical concept. When a line segment crosses a tile's left edge, it contributes a signed delta to the winding number: -1 for downward segments, +1 for upward. This is how we know whether a pixel is "inside" or "outside" the path without checking every segment.

## Stage 6: Backdrop Prefix Sum

This is where tile-based rasterization really shines.
Instead of checking every segment for every pixel, we accumulate winding numbers left-to-right across each row of tiles. After this step, each tile knows the accumulated winding number from all segments to its left. A tile with winding number 1 and no segments crossing it is fully solid — no per-pixel computation needed.

## Stage 7: Coarse — Allocation + PTCL Generation

The coarse stage allocates space in a global segment buffer and generates Per-Tile Command Lists. Segment counts are converted into indices using Vello's inverted index trick: the ^ (bitwise NOT) serves double duty. It marks the tile as "has segments" (the NOT of a valid index is always negative as int32) while encoding the starting index into the global segment array.

## Stage 8: Path Tiling

Now we clip each line segment to its tile boundaries and compute the crucial yEdge value. YEdge tells the fine rasterizer where the segment enters or exits the tile at x=0. If the segment doesn't cross the left edge, YEdge is set to the sentinel value 1e9. This single float32 captures the geometric relationship needed for coverage computation.

## Stage 9: Fine Rasterization

The final stage computes per-pixel anti-aliased coverage within each 16x16 tile. This is analytic anti-aliasing — we compute exact sub-pixel coverage by integrating line segment contributions, not by supersampling. The result is mathematically precise edges with smooth alpha gradients.

## Euler Spiral Flattening

But wait — DrawCircle produces cubic Bezier curves, not line segments. How do we get from curves to lines? Vello uses Euler spiral approximation for adaptive curve flattening. Unlike naive subdivision (which produces too many or too few segments), Euler spirals provide optimal error bounds. The algorithm evaluates curvature at each point and subdivides only where the error exceeds a tolerance (0.25 pixels by default). A nearly-straight segment becomes one line; a tight curve gets more subdivisions. The result: the minimum number of line segments for maximum visual quality.
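Euler spiral flattening itself is too involved for a short listing, but the adaptive idea — subdivide only where the error exceeds the tolerance — can be sketched with classic de Casteljau midpoint subdivision and a chord-distance flatness test. This is a simplified stand-in for illustration, not Vello's Euler spiral error metric:

```go
package main

import (
	"fmt"
	"math"
)

type pt struct{ x, y float64 }

func lerp(a, b pt, t float64) pt {
	return pt{a.x + (b.x-a.x)*t, a.y + (b.y-a.y)*t}
}

// chordDist is the distance from p to the line through a and b.
func chordDist(p, a, b pt) float64 {
	dx, dy := b.x-a.x, b.y-a.y
	l := math.Hypot(dx, dy)
	if l == 0 {
		return math.Hypot(p.x-a.x, p.y-a.y)
	}
	return math.Abs((p.x-a.x)*dy-(p.y-a.y)*dx) / l
}

// flatten appends endpoints of line segments approximating the cubic
// p0..p3, splitting at t=0.5 while either control point strays more
// than tol from the chord.
func flatten(p0, p1, p2, p3 pt, tol float64, out *[]pt) {
	if chordDist(p1, p0, p3) <= tol && chordDist(p2, p0, p3) <= tol {
		*out = append(*out, p3)
		return
	}
	ab, bc, cd := lerp(p0, p1, 0.5), lerp(p1, p2, 0.5), lerp(p2, p3, 0.5)
	abc, bcd := lerp(ab, bc, 0.5), lerp(bc, cd, 0.5)
	mid := lerp(abc, bcd, 0.5)
	flatten(p0, ab, abc, mid, tol, out)
	flatten(mid, bcd, cd, p3, tol, out)
}

func main() {
	straight := []pt{}
	flatten(pt{0, 0}, pt{1, 0}, pt{2, 0}, pt{3, 0}, 0.25, &straight)
	curved := []pt{}
	flatten(pt{0, 0}, pt{0, 100}, pt{100, 100}, pt{100, 0}, 0.25, &curved)
	fmt.Println(len(straight), len(curved)) // one segment vs. many
}
```

The straight cubic flattens to a single segment while the tight arch subdivides many times — the same adaptivity Vello achieves, except its Euler spiral error bound yields near-optimal segment counts where this chord test is merely conservative.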
## Per-Tile Command Lists (PTCL)

The coarse stage (Stage 7) generates Per-Tile Command Lists — each tile gets a stream of commands like "fill with coverage from segment N", "apply color #FF0000", "begin clip", "end clip". This is what makes the pipeline work for multiple overlapping paths (UI with buttons, text, backgrounds) in a single fine rasterization pass. The fine rasterizer walks each tile's PTCL, executing commands and compositing results with premultiplied alpha — exactly like a GPU would.

## Why Port a GPU Algorithm to CPU?

This is the most common question we get. Here's why:

1. CPU is the Core. After analyzing 8 enterprise 2D engines (Skia, Cairo, Vello, Blend2D, tiny-skia, piet, Qt RHI, Pathfinder), we found that in none of them is CPU rasterization a "backend". It's always the core; GPU is the optional accelerator.
2. Correctness Reference. The CPU implementation serves as a reference for the GPU compute shaders. When GPU and CPU produce different results, we know which one to trust.
3. Universal Availability. Servers, CI/CD, embedded systems — many environments have no GPU. A server generating 10,000 chart images doesn't need GPU acceleration; it needs reliable software rendering.
4. Identical Algorithm. Because the CPU code mirrors the GPU pipeline stage-by-stage, enabling GPU acceleration later is straightforward — same data structures, same logic, different execution model.

## Two Rasterizers, Two Targets

tilecompute is not the only Vello algorithm we've ported. gogpu/gg includes both of Vello's rasterization pipelines, each optimized for a different execution target.

Why 16x16 for GPU? GPU compute shaders process tiles in parallel workgroups. 256 pixels per tile matches the typical workgroup size (256 threads), giving each thread exactly one pixel.

Why 4x4 for CPU? SIMD instructions operate on 128-bit registers. 16 pixels of u8 coverage data fit into a single SSE register, enabling vectorized operations across an entire tile at once — the same approach Intel used in Larrabee.
Both rasterizers use analytic anti-aliasing and Euler spiral flattening, and both support the NonZero and EvenOdd fill rules. The difference is purely in how they partition the canvas and process tiles. Having both means gogpu/gg is ready with the optimal algorithm regardless of the target: GPU compute shaders use the 16x16 pipeline, CPU rendering uses the 4x4 pipeline.

## Smart Multi-Engine Selection

Having multiple rasterizers raises a question: who decides which algorithm handles which path? We analyzed 8 enterprise 2D engines — Skia, Cairo, Blend2D, Vello, Qt, Direct2D, Flutter, SwiftUI — and found that none of them does per-path dynamic algorithm selection. Skia has separate CPU/GPU pipelines but no cross-selection. Vello has a planned vello_api for CPU/GPU choice, not yet built. Direct2D recognizes simple shapes but doesn't switch algorithms. gogpu/gg is the first 2D graphics library with systematic multi-factor per-path selection across 5 rasterization algorithms.

## The Coregex Analogy

The pattern is inspired by coregex — a Go regex library with 17 strategies and an intelligent selector that picks the optimal engine per pattern. Same idea: analyze the input, pick the optimal engine. Both are Go libraries, and both are first-of-their-kind multi-engine approaches.

## Adaptive Threshold

The key insight: scanline cost grows with width × edge crossings, while tile cost grows with fill area. For large shapes, tiles win at lower complexity. For tiny shapes (< 32px), scanline always wins. The selection between scanline and tile-based rasterizers uses an adaptive threshold derived from the path's bounding box area.

## User Override

Auto-selection is the default, but the user always has final say — the same principle as database query hints or GPU driver force flags. The mode is per-Context — different contexts can use different strategies simultaneously. This makes A/B benchmarking trivial: render the same scene with two contexts, compare output and timing.

The package includes 9 WGSL compute shaders alongside the Go code — ready for GPU execution when the compute pipeline is enabled.
## Part of a Larger Ecosystem

tilecompute is the rasterization core of gogpu/gg, which is part of a 466K+ line Pure Go GPU computing ecosystem. The default stack is zero-CGO Pure Go — from shader compilation to GPU command submission. But gogpu also supports a Rust backend via go-webgpu FFI to wgpu-native for maximum GPU performance when needed.

Both backends use the same gg API — the choice is transparent to application code. gg doesn't know or care which WebGPU implementation is underneath; it talks to hal.Queue. When GPU acceleration is available, gg uses the registered WebGPU backend (Pure Go or Rust) with support for Vulkan, Metal, DX12, and OpenGL ES. When it's not, tilecompute and the CPU rasterizer handle everything.
```
Vector Paths (cubics, arcs, lines)
  ↓
Scene Encoding (CPU — pack paths, transforms, styles)
  ↓
1. PathTag Reduce ─┐
2. PathTag Scan   ─┘ monoid prefix sums over path structure
  ↓
Flatten (Euler spiral → line segments)
  ↓
3. Draw Reduce    ─┐
4. Draw Leaf Scan ─┘ monoid prefix sums over draw commands
  ↓
5. Path Count  (DDA walk — which tiles does each segment cross?)
  ↓
6. Backdrop    (prefix sum — left-to-right winding accumulation)
  ↓
7. Coarse      (generate Per-Tile Command Lists)
  ↓
8. Path Tiling (clip segments to tile boundaries, compute yEdge)
  ↓
9. Fine        (per-pixel analytic anti-aliased coverage)
  ↓
Per-pixel RGBA output
```
```go
// PathMonoid — accumulated state from scanning path tags
type PathMonoid struct {
	TransIx       uint32 // Transform index
	PathSegIx     uint32 // Path segment index
	PathSegOffset uint32 // Offset into path segment data
	StyleIx       uint32 // Style index
	PathIx        uint32 // Path index
}

// DrawMonoid — accumulated state from scanning draw tags
type DrawMonoid struct {
	PathIx      uint32 // Which path this draw belongs to
	ClipIx      uint32 // Clip stack depth
	SceneOffset uint32 // Offset into scene data
	InfoOffset  uint32 // Offset into info buffer
}
```
```go
// pathCountMain — DDA walk through the tile grid
func pathCountMain(bump *BumpAllocators, lines []LineSoup, paths []Path, tile []Tile, segCounts []SegmentCount) {
	for lineIx := uint32(0); lineIx < bump.Lines; lineIx++ {
		line := lines[lineIx]
		p0 := vec2FromArray(line.P0)
		p1 := vec2FromArray(line.P1)

		// Sort by Y for consistent DDA walking
		isDown := p1.y >= p0.y
		var xy0, xy1 vec2
		if isDown {
			xy0, xy1 = p0, p1
		} else {
			xy0, xy1 = p1, p0
		}

		// Scale to tile coordinates (pixel / 16)
		s0 := xy0.mul(tileScale)
		s1 := xy1.mul(tileScale)

		// DDA walk: trace the line through tile grid cells,
		// counting X crossings and Y crossings separately
		dx := abs32(s1.x - s0.x)
		dy := s1.y - s0.y
		idxdy := 1.0 / (dx + dy)
		a := dx * idxdy
		// ... walk through tiles, update backdrop and segment counts
	}
}
```
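The elided walk is the heart of the stage. As an illustration of what "visit every tile the line passes through" means, here's a parameter-splitting sketch (not tilecompute's incremental DDA): split the segment's parameter range at every tile-boundary crossing, then sample the midpoint of each span to find its cell.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// tilesCrossed lists the grid cells (tileSize px square) that the
// segment (x0,y0)-(x1,y1) passes through. It collects the parameter t
// of every crossing of a vertical or horizontal tile boundary, sorts
// them, and maps the midpoint of each sub-span to its cell.
func tilesCrossed(x0, y0, x1, y1, tileSize float64) [][2]int {
	ts := []float64{0, 1}
	for _, axis := range [][2]float64{{x0, x1}, {y0, y1}} {
		a, b := axis[0], axis[1]
		if a == b {
			continue // no boundary crossings on this axis
		}
		lo, hi := math.Min(a, b), math.Max(a, b)
		for g := math.Ceil(lo/tileSize) * tileSize; g < hi; g += tileSize {
			ts = append(ts, (g-a)/(b-a))
		}
	}
	sort.Float64s(ts)
	var cells [][2]int
	for i := 0; i+1 < len(ts); i++ {
		tm := (ts[i] + ts[i+1]) / 2 // midpoint lies strictly inside one cell
		cx := int(math.Floor((x0 + (x1-x0)*tm) / tileSize))
		cy := int(math.Floor((y0 + (y1-y0)*tm) / tileSize))
		if len(cells) == 0 || cells[len(cells)-1] != [2]int{cx, cy} {
			cells = append(cells, [2]int{cx, cy})
		}
	}
	return cells
}

func main() {
	// A segment from (1,1) to (33,20) with 16px tiles touches 4 tiles.
	fmt.Println(tilesCrossed(1, 1, 33, 20, 16)) // [[0 0] [1 0] [1 1] [2 1]]
}
```

The real DDA reaches the same cells incrementally with one division per segment, which is why it maps so well onto a GPU thread per line.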
```go
// Backdrop prefix sum — left-to-right winding accumulation
for y := 0; y < bboxH; y++ {
	sum := int32(0)
	for x := 0; x < bboxW; x++ {
		idx := base + y*bboxW + x
		sum += tiles[idx].Backdrop
		tiles[idx].Backdrop = sum
	}
}
```
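A tiny worked example of that inner loop — one row of four tiles, where stage 5 recorded a +1 delta on the first tile (an upward segment crosses its left edge) and a -1 on the last:

```go
package main

import "fmt"

// accumulateBackdrop runs the inner-row loop on one row of per-tile
// winding deltas produced by the path count stage.
func accumulateBackdrop(deltas []int32) []int32 {
	out := make([]int32, len(deltas))
	sum := int32(0)
	for i, d := range deltas {
		sum += d
		out[i] = sum
	}
	return out
}

func main() {
	// +1 delta on tile 0, -1 on tile 3.
	fmt.Println(accumulateBackdrop([]int32{1, 0, 0, -1})) // [1 1 1 0]
	// The middle tiles carry winding 1 with no segments of their own:
	// they are fully solid, so fine rasterization skips per-pixel work.
}
```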
```go
// Convert counts to indices using bitwise NOT
nextSegIx := uint32(0)
for i := range tiles {
	nSegs := tiles[i].SegmentCountOrIx
	if nSegs != 0 {
		tiles[i].SegmentCountOrIx = ^nextSegIx // !seg_ix in Rust
		nextSegIx += nSegs
	}
}
```
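A quick demonstration of why the NOT works as a dual-purpose flag (hypothetical index value, same trick):

```go
package main

import "fmt"

func main() {
	segIx := uint32(5)
	encoded := ^segIx // mark "has segments" and store the index

	// Reinterpreted as int32, the NOT of any valid index is negative,
	// so "does this tile have segments?" is a simple sign test.
	fmt.Println(int32(encoded) < 0) // true

	// NOT is its own inverse: later stages recover the index for free.
	fmt.Println(^encoded) // 5
}
```

One uint32 field thus serves as segment count, has-segments flag, and starting index at different points in the pipeline, with no extra storage.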
```go
// PathSegment — a line segment clipped to tile boundaries
type PathSegment struct {
	Point0 [2]float32 // Tile-relative start point
	Point1 [2]float32 // Tile-relative end point
	YEdge  float32    // Y where segment crosses tile left edge (x=0)
}
```
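For intuition, the crossing Y is just linear interpolation of the segment to x = 0. Here's a sketch of how such a value could be computed in tile-relative coordinates (a hypothetical helper — tilecompute's actual clipping handles more cases):

```go
package main

import "fmt"

// yEdgeOf returns the Y where the segment (x0,y0)-(x1,y1) crosses the
// tile's left edge (x = 0), or the 1e9 sentinel when both endpoints
// lie on the same side of that edge.
func yEdgeOf(x0, y0, x1, y1 float32) float32 {
	if (x0 < 0) == (x1 < 0) { // no crossing of x = 0
		return 1e9
	}
	// Linear interpolation: t at x = 0, then evaluate y at t.
	return y0 + (0-x0)*(y1-y0)/(x1-x0)
}

func main() {
	fmt.Println(yEdgeOf(-2, 3, 6, 7)) // 4 — enters the tile at y = 4
	fmt.Println(yEdgeOf(1, 0, 5, 8))  // 1e+09 — never crosses the edge
}
```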
```go
// fillPath computes per-pixel coverage for a single tile.
// Direct port of fill_path from vello_shaders/src/cpu/fine.rs.
func fillPath(area []float32, segments []PathSegment, backdrop int32, evenOdd bool) {
	// Initialize area with backdrop winding number
	for i := range area {
		area[i] = float32(backdrop)
	}
	for _, seg := range segments {
		// For each column in the tile, compute the area
		// contribution of this segment using analytic integration
		// ...
	}
	// Apply fill rule and convert to alpha [0, 1]
	if evenOdd {
		for i := range area {
			area[i] = abs32(area[i] - 2.0*round32(0.5*area[i]))
		}
	} else {
		for i := range area {
			area[i] = min32(abs32(area[i]), 1.0)
		}
	}
}
```
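The two fill-rule mappings at the end are worth trying on concrete winding numbers. Here's a float64 transcription of the same formulas (note that Go's math.Round rounds halves away from zero, which a shader's round-to-even may differ from exactly at half-integers, though the absolute value masks it for these inputs):

```go
package main

import (
	"fmt"
	"math"
)

// applyFillRule maps an accumulated winding number to alpha in [0, 1],
// mirroring the two branches at the end of fillPath.
func applyFillRule(w float64, evenOdd bool) float64 {
	if evenOdd {
		return math.Abs(w - 2.0*math.Round(0.5*w))
	}
	return math.Min(math.Abs(w), 1.0) // NonZero
}

func main() {
	// A point inside two overlapping loops has winding number 2:
	fmt.Println(applyFillRule(2, false)) // 1 — NonZero: filled
	fmt.Println(applyFillRule(2, true))  // 0 — EvenOdd: a hole
	// A reversed-orientation contour (winding -1) still fills NonZero:
	fmt.Println(applyFillRule(-1, false)) // 1
}
```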
```go
// FlattenFill flattens cubic Beziers using Vello's Euler spiral algorithm
func FlattenFill(cubics []CubicBezier) []LineSoup {
	var lines []LineSoup
	for _, c := range cubics {
		p0 := vec2{c.P0[0], c.P0[1]}
		p1 := vec2{c.P1[0], c.P1[1]}
		p2 := vec2{c.P2[0], c.P2[1]}
		p3 := vec2{c.P3[0], c.P3[1]}
		flattenEulerFill(p0, p1, p2, p3, &lines)
	}
	return lines
}
```
```go
// PTCL commands
const (
	CmdEnd       = 0
	CmdFill      = 1  // Compute coverage from segments
	CmdSolid     = 3  // Full coverage (no segments needed)
	CmdColor     = 5  // Apply RGBA color with source-over blending
	CmdBeginClip = 10 // Push clip layer
	CmdEndClip   = 11 // Pop and composite clip
)
```
```go
func fineRasterizeTile(ptcl *PTCL, segments []PathSegment, bgColor [4]float32) [TileWidth * TileHeight][4]float32 {
	const pixelCount = TileWidth * TileHeight
	var rgba [pixelCount][4]float32
	for i := range rgba {
		rgba[i] = bgColor
	}
	offset := uint32(0)
	for {
		tag, nextOffset := ptcl.ReadCmd(offset)
		switch tag {
		case CmdEnd:
			return rgba
		case CmdFill:
			// Compute coverage, store in area buffer
		case CmdColor:
			// Apply color using area as mask, source-over blend
		case CmdBeginClip:
			// Push current pixels onto clip stack
		case CmdEndClip:
			// Pop and composite with clip mask
		}
		offset = nextOffset
	}
}
```
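The compositing inside CmdColor is plain source-over with premultiplied alpha. The per-pixel operation looks like this (a sketch, with the tile's coverage assumed already folded into src):

```go
package main

import "fmt"

// sourceOver composites a premultiplied-alpha src pixel over dst:
// out = src + dst * (1 - src.alpha)
func sourceOver(dst, src [4]float32) [4]float32 {
	ia := 1 - src[3]
	return [4]float32{
		src[0] + dst[0]*ia,
		src[1] + dst[1]*ia,
		src[2] + dst[2]*ia,
		src[3] + dst[3]*ia,
	}
}

func main() {
	dst := [4]float32{1, 0, 0, 1}     // opaque red background
	src := [4]float32{0, 0.5, 0, 0.5} // 50% green, premultiplied
	fmt.Println(sourceOver(dst, src)) // [0.5 0.5 0 1]
}
```

Premultiplied alpha is what lets the same four-component formula handle color and alpha uniformly — no per-channel division until the final resolve.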
```
gogpu/gg CPU Rasterizer
│
├── tilecompute (16x16 tiles)
│     GPU compute pipeline ported to CPU
│     9 WGSL shaders ready for GPU execution
│
└── SparseStripsRasterizer (4x4 tiles)
      CPU/SIMD-optimized pipeline
      Sort by Y,X + strip rendering
```
```
Path arrives at Context.Fill()
│
├── Shape detection → SDF Accelerator (circles, rrects)
│     GPU SDF or CPU SDF — smoothstep coverage, highest quality
│
├── Adaptive threshold check
│   │
│   ├── Below threshold → AnalyticFiller (scanline)
│   │     Zero tile overhead, O(width × edges)
│   │
│   └── Above threshold → AdaptiveFiller
│       │
│       ├── < 10K segments → SparseStrips (4×4 tiles)
│       │     CPU/SIMD-optimized, lower tile overhead
│       │
│       └── > 10K segments + large canvas → TileCompute (16×16 tiles)
│             GPU workgroup-ready, 16× fewer tiles
│
└── GPU Compute → Vello PTCL pipeline (full scene)
      9-stage GPU compute, massively parallel
```
```go
// threshold = clamp(2048/sqrt(bboxArea), 32, 256)
//
// 100×100 bbox → raw threshold 20 elements (tiles kick in early)
// 500×500 bbox → raw threshold 4 elements (large shapes → always tiles)
// 30×30 bbox → always scanline (below bboxMinDimension)
func adaptiveThreshold(bboxArea float64) int {
	if bboxArea <= 0 {
		return maxElementThreshold
	}
	threshold := int(2048.0 / math.Sqrt(bboxArea))
	return clamp(threshold, 32, 256)
}
```
```go
dc := gg.NewContext(800, 600)

// Auto (default) — intelligent per-path selection
dc.SetRasterizerMode(gg.RasterizerAuto)

// Force scanline for all paths (debugging, isolation)
dc.SetRasterizerMode(gg.RasterizerAnalytic)

// Force 4×4 tiles (benchmarking CPU/SIMD path)
dc.SetRasterizerMode(gg.RasterizerSparseStrips)

// Force 16×16 tiles (benchmarking GPU workgroup path)
dc.SetRasterizerMode(gg.RasterizerTileCompute)

// Force SDF for maximum circle/rrect quality
dc.SetRasterizerMode(gg.RasterizerSDF)
```
```shell
go get github.com/gogpu/gg@latest
```
## Acknowledgments

- The Vello team and Raph Levien for the tile rasterization pipeline and the Euler spiral flattening research
- The variable naming in tilecompute intentionally matches Vello's Rust originals, for easy cross-reference

Links:

- GitHub: gogpu/gg
- Package: internal/gpu/tilecompute/
- Vello: linebender/vello
- Raph Levien's blog: raphlinus.github.io
- Discussion: Join the conversation