# I Built a Compiler with AI Engineering Over a Weekend. These are 3 Core Strategies for Scalable AI Development

Source: Dev.to

## Wait, what is wrong with 1,000 commits per hour?

You know that feeling when you've been doing something for years, and then someone comes along and says "nah, throw all that away"? That is exactly how I felt reading Cursor's blog post about self-driving codebases.

Don't get me wrong, I do believe it is impressive: 3M+ lines of code, approximately 1,000 commits per hour, thousands of agents working together to build a web browser. But something about it bugged me. It ignores everything we have learned about software engineering. Throughput is not progress. Ten meaningful commits targeting goals we actually want to achieve would be more helpful.

The Cursor approach optimizes for raw output: more agents, more commits, more lines of code. But after years of building software the "right way," here is what I know matters:

- Agile development, meaning time-bounded sprints with scoped work, not infinite agent swarms.
- Meaningful changes over large volume. 10 thoughtful PRs might beat 1,000 commits.
- Strong feedback loops like tests, CI, and code review, rather than just hoping the agents figure it out.
- Architecture decisions and interface contracts backed by documented reasoning, not emergent chaos.

So when I set out to build Sifr, a compiled programming language that uses Python syntax and compiles to Rust, I decided to do it with AI engineering. But I wanted disciplined agents. The kind that follow a process. The kind that write PRDs, create tickets, review each other's code, and do not merge without passing tests.

And let me tell you, it works. Really well. 😄

## The project: a whole programming language

Before we get into the workflow, let me give you a taste of what we are building. Sifr is a compiled language with:

- Python syntax plus static typing
- Compilation to Rust for native binaries
- A borrow-by-default ownership model
- TypeScript-style union types, type narrowing, and protocols
- Over 45 standard library modules with zero-panic guarantees
- 21 planned phases, with 11 completed and over 80 milestones

This is not a toy.
It is a full compiler pipeline (lexer, parser, AST, binder, type checker, HIR, Rust codegen, rustc, and finally binary) with a roadmap stretching from language foundations all the way to a web framework, package manager, and ecosystem. And it was built almost entirely using AI engineering following the workflow I am about to describe.

Sponsorship note: This project was initially sponsored by CDON, a leading marketplace in the Nordics (Sweden, Norway, Denmark, Finland).

## The basic workflow: implementing a feature

Let's start small. Here is how a single Task moves through the board, from an idea to merged code. The Task moves across board columns as it progresses: Backlog -> Ready -> In Progress -> Review -> Done. Each step maps to a real action that an AI agent can execute:

- Draft the Task. The agent writes a Task with the current situation, desired situation, and a suggested solution. This is scoped to a small number of changes.
- Add to the board. The agent creates a GitHub issue and adds it to the project board. The Task lands in the Backlog.
- Refine & prioritize. The agent assesses effort versus value and moves the highest-priority Tasks to Ready.
- Work on the Task. The agent picks up the highest-priority Ready Task, creates a branch, implements the changes, runs tests locally, and creates a PR. The PR uses a template that requires an issue link, bullet-point changes, and deployment considerations. The Task moves to Review.
- Review the PR. A different agent, preferably a different model, reviews the PR for logic bugs, unnecessary complexity, test coverage, style, and architecture alignment.
- Adjust. The implementing agent addresses review comments.
- Merge. The PR merges, and the Task moves to Done. Ship it.

## What does a Task actually look like?

Let me show you a real example from Sifr. Here is Task #100: Expand Built-in Functions:

- Current Situation: max(a, b) and min(a, b) with two arguments are not supported (only the list form max([1, 2]) works).
- Desired Situation: All common Python built-in function signatures should work.
- Suggested Solution: Update the compiler's lowering phase to handle 2-argument max/min.
- Acceptance Criteria: max(1, 2) returns 2.

That is it. Small, focused, concrete. The agent implemented this in one PR.

🚨 Gotcha: You do not want to become the bottleneck. Make sure that shipping does not require you to be in the middle. That includes manual testing, manual clicking, manual deployment, all of it. If you are the human doing QA on every PR, you have defeated the purpose.

## But what about bigger features?

A single Task is great for "add a len() method to strings." But what about "implement a borrow checker"? That is where Epics come in.

Here is the critical insight: every Epic starts with a PRDS, a combined PRD and Solution Design document. The agent does not just jump in and start coding. It uses a specific tool to write a single structured document covering both sides:

- Product requirements: problem statement, goals, scope, constraints, acceptance criteria (with Given/When/Then).
- Solution design: architecture, data model, API design, error handling, testing strategy, trade-offs.

The PRDS gets added to the board as an Epic, refined, and then comes the part that matters most. Step 4 is a human reviewing the PRDS. This is the human-in-the-loop checkpoint. You are not reviewing 50 PRs. You are reviewing one document that shapes all of them.

Once approved, the Epic gets broken down into smaller Tasks, and those Tasks follow the basic workflow above. And here is a step the blog posts never talk about: the Epic demo.
Before marking an Epic as Done, you create a working demo that showcases all major features delivered. In Sifr, these live in a ./demos folder, each named after the Epic. If the demo does not work, the Epic is not done. Simple as that.

## What does an Epic look like?

Here is the Epic "Add collections.Counter", which introduced the first class-based API to the standard library:

- Objective: Users need to count hashable objects easily. Implement collections.Counter.
- Scope:
  - Define class Counter in lib/sifr/collections.sifr.
  - Implement methods: __init__, most_common, total, update, keys, values.
  - Add the necessary Rust implementation to support these methods.
- Solution Design:
  - Data Structure: Wrap a Rust HashMap but expose it as a Python class.
  - API: Match Python's Counter API exactly.
  - Testing: Verify counting works, most_common returns sorted results, and empty counters behave correctly.
- Acceptance Criteria: from sifr.collections import Counter works, and Counter("hello").most_common(1) returns [('l', 2)].

The agents broke this down into tasks: implement the intrinsics, implement the Sifr class, add tests, and create a demo.

🚨 Gotcha: Without reviewing the PRDS, you cannot guarantee the results. The agent might build the completely wrong thing, beautifully. I have seen it happen.

In Sifr, every Epic has a PRDS document. The borrow-by-default Phase? It started with a PRDS that defined parameter conventions, escape analysis rules, and codegen patterns, before a single line of code was written.

## Scaling the workflow: many Epics, many Phases

Okay, so you can ship a feature. You can ship a big feature. But what about building an entire programming language with 21 Phases? This is where things get interesting.

## Step 1: Plan multiple Phases top-down

The key word here is top-down. You plan the high-level Phases first, then drill into Epics within each Phase. And critically, avoid parallel Epics. In Sifr's roadmap, each Phase has a clear ordering rationale. For example:

- Type System Power comes before Standard Library, because the stdlib needs generics and closures for proper type signatures.
- Error Safety comes before Stdlib Safety Remediation, because you cannot make intrinsics return Result types if the compiler does not enforce error class hierarchies yet.
- Borrow-by-Default comes before Stdlib Deepening, so new stdlib functions are written with the final ownership model from day one.

Every ordering decision is documented. Not in someone's head, but in the codebase, in the roadmap, with explicit rationale for why Phase N depends on Phase N-1.

## Step 2: Execute Phase by Phase

Each Epic within a Phase follows the epic workflow: PRD -> solution design -> human review -> Task breakdown -> execute. The agents pick up Tasks, implement, create PRs, get reviewed.

🚨 Gotcha: Don't execute too many Phases at once. I tried. The agents start creating workarounds for dependencies that haven't been implemented yet, and you end up with spaghetti. Sequential execution with clear Phase boundaries is the way.

## Step 3: Review with a different model

This is one of my favorite tricks. After a Phase of execution, I use a different agent session (and often a different model) to review the work.
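A minimal sketch of this cross-model review loop, assuming `reviewer` and `fixer` wrap whatever agent or LLM clients you use (both names are hypothetical stand-ins, not part of Sifr's actual tooling):

```python
# Hedged sketch of the review loop: a reviewer model with fresh context
# critiques the work, the implementing agent addresses the feedback, and
# the cycle repeats for a fixed number of rounds.
def review_loop(work: str, reviewer, fixer, rounds: int = 3) -> str:
    """review -> fix -> review -> fix; three iterations is the sweet spot."""
    for _ in range(rounds):
        feedback = reviewer(work)    # fresh-context model critiques the work
        if feedback is None:         # reviewer has no complaints: stop early
            break
        work = fixer(work, feedback) # implementing agent addresses comments
    return work
```

The point of injecting `reviewer` and `fixer` as callables is that they can be different models, which is exactly what gives the reviewer fresh context.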
The reviewer has fresh context, with no sunk cost bias or "I already wrote this so it must be right" mentality. The reviewer runs in a feedback loop: review -> fix -> review -> fix -> review. Three iterations is the sweet spot.

## Step 4: Re-planning with the judge

After review cycles, a "judge" (the smartest model you have access to) evaluates whether the plan needs to be steered. Maybe the type system completion phase revealed that the codegen architecture should be restructured first. Maybe a new constraint emerged. The judge decides whether to continue as planned or adjust. If adjustment is needed, the plan is updated and execution continues. Multiple reviewer agents can also weigh in during this phase.

🚨 Gotcha #1: Parallel work might not be the best idea. There could be unidentified dependencies between Epics. Agents will make workarounds and create sloppy solutions instead of waiting for the right foundation to be in place.

🚨 Gotcha #2: It is good to plan for the future, but don't get stuck on details of later Phases. The first few Phases will teach you things that change your assumptions about later ones (you can also update the plan and insert new phases midway). Plan the current Phase in detail, and keep future Phases as rough outlines.

## What does a Phase Plan look like?

This is how we structure high-level planning: we track all phases, and for each one, we define exactly what capabilities it unlocks. Notice the "What it unlocks" column. We don't just list technical tasks; we list capabilities. Phase 1 unlocks single-file programs. Phase 2 unlocks generics. This helps the AI (and me) understand the purpose of the phase, not just the code.

## The real results

Sifr has completed 11 Phases and over 80 Epics using this workflow. The compiler handles:

- A full type system with generics, protocols, union types, and type narrowing.
- Over 45 stdlib modules.
- Borrow-by-default ownership semantics.
- Error handling with compiler-enforced exhaustiveness checking.

All of this was built with AI engineering following the structured workflow described above. Not thousands of agents racing to commit, but a disciplined process where every feature starts with a plan, gets implemented incrementally, and gets reviewed before merge.
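The "What it unlocks" planning style described above can be sketched as data the agents (and humans) read from the repo. This is a hypothetical format, not Sifr's actual roadmap file; only the Phase 1 and Phase 2 capabilities are taken from the article:

```python
# Hedged sketch: the phase plan as structured data with a "what it
# unlocks" capability per phase. Names and fields are illustrative.
PHASES = [
    {"phase": 1, "name": "Language foundations", "unlocks": "single-file programs"},
    {"phase": 2, "name": "Type system power",    "unlocks": "generics"},
]

def as_markdown(phases) -> str:
    """Render the plan as the roadmap table that lives in the codebase."""
    rows = ["| Phase | Name | What it unlocks |", "| --- | --- | --- |"]
    rows += [f"| {p['phase']} | {p['name']} | {p['unlocks']} |" for p in phases]
    return "\n".join(rows)
```

Keeping the plan as data rather than prose makes it trivial for an agent to regenerate the roadmap table after the judge adjusts the plan.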
An impressive result: the first version of the working compiler for the core language was built over a weekend, literally a Saturday and Sunday!

## Try it yourself

You can find the repo here: Sifr. If you want to adopt this workflow, here is the TL;DR:

- Small Tasks: Draft -> Board -> Refine -> Work -> Review -> Merge.
- Epics: PRDS -> Board -> Refine -> Human Review -> Break Down into Tasks -> Execute -> Epic Demo -> Done.
- Phases: Plan top-down -> Execute sequentially -> Review with different model -> Re-plan with judge.
- Automate the boring stuff: Ticket creation, PR templates, review checklists, board management. Make them commands the agent can run.
- Don't be the bottleneck: If shipping requires you in the loop for every PR, you have lost.

The agents are the hands; the architect is YOU.

What do you think? Have you tried applying AI engineering on a real project? I would love to hear about your workflow, so drop a comment or reach out!
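As a closing illustration of the "automate the boring stuff" point, ticket creation can become a command the agent runs. This sketch wraps GitHub's `gh` CLI; the `task` label and the Current/Desired/Suggested issue layout mirror the article's Task format, but the wrapper itself is hypothetical, not Sifr's actual tooling:

```python
# Hedged sketch: "draft the Task" as a command an agent can run, using
# GitHub's `gh` CLI. Requires `gh auth login` beforehand; label is made up.
import subprocess

def create_task(title: str, current: str, desired: str, suggestion: str,
                dry_run: bool = False) -> list[str]:
    """Open a GitHub issue in the Task format; returns the command used."""
    body = (f"## Current Situation\n{current}\n\n"
            f"## Desired Situation\n{desired}\n\n"
            f"## Suggested Solution\n{suggestion}\n")
    cmd = ["gh", "issue", "create", "--title", title,
           "--body", body, "--label", "task"]
    if not dry_run:
        subprocess.run(cmd, check=True)  # actually creates the issue
    return cmd

# Dry run so nothing is created; shows the command the agent would execute.
cmd = create_task("Expand Built-in Functions",
                  "max(a, b) with two arguments is not supported",
                  "All common Python built-in signatures work",
                  "Handle 2-argument max/min in the lowering phase",
                  dry_run=True)
```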