Tools: Building a 1,056-Test Rust CLI Without Writing Rust — Claude Code Did It
The Subagent Pattern
Week 1: Fork and Rename
Week 2: The 6 New Filters
Week 3: Benchmarks and Honest Failures
What I Actually Did vs. What Claude Did
Final Stats

I don't write Rust. I can read it well enough to catch obvious bugs, but I've never typed impl or fn main() from scratch. Yet I shipped a 40-module Rust CLI with 1,056 tests in 3 weeks.

Claude Code wrote every line of Rust. I wrote prompts, reviewed diffs, and made architecture decisions.

The tool, ContextZip, compresses Claude Code's own context window. So the AI built a tool to make itself work better. That irony wasn't lost on me.

Here's exactly how the process worked, including the parts that went wrong.

The Subagent Pattern

I never gave Claude Code a vague instruction like "build a context compressor." Every task was a subagent dispatch: a scoped prompt with clear inputs, expected outputs, and test requirements.

"Implement an error stacktrace filter for Node.js. Input: raw stderr with Express middleware frames. Output: error message + user code frames only. Write 20+ test cases covering nested errors, empty traces, and mixed stdout/stderr. Put the filter in src/filters/error_stacktrace.rs."

The subagent implements, writes tests, runs them. Then I dispatch a second subagent to review:

"Review the error_stacktrace filter. Check edge cases: what happens with zero frames? Frames with no file path? Stack traces inside JSON output?"

This two-agent cycle (implement, then review) caught 80% of bugs before I even looked at the code.

Week 1: Fork and Rename

The foundation was RTK (Rust Token Killer), an open-source CLI with 34 command modules, 60+ TOML filters, and 950 tests. I forked it and dispatched a subagent to rename every reference from "rtk" to "contextzip" across 70 files. 1,544 insertions, 1,182 deletions. All 950 tests still passing.

Then three agents worked in parallel: one on the install script, one on GitHub Actions CI/CD for 5 platforms, one extending the SQLite tracking system.

By Friday: curl | bash installs the binary on Linux or macOS, and contextzip gain --by-feature shows per-filter savings.

Week 2: The 6 New Filters

This is where ContextZip stops being a rename and starts being a product.
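To make the filter idea concrete, here is a minimal Rust sketch of the kind of Node.js stacktrace filter the prompt quoted earlier asks for: keep the error message and user-code frames, drop framework frames. It is not ContextZip's actual implementation. The function name filter_node_stacktrace, the marker strings used to spot framework frames, and the summary line are all assumptions made for illustration.

```rust
/// Hypothetical sketch, not ContextZip's real filter.
/// Keeps the error message and user-code frames from a Node.js stack trace
/// and drops frames that point into dependencies or Node's internals.
pub fn filter_node_stacktrace(raw: &str) -> String {
    let mut kept: Vec<&str> = Vec::new();
    let mut dropped = 0usize;

    for line in raw.lines() {
        let trimmed = line.trim_start();
        // Node.js prints stack frames as indented lines starting with "at ".
        let is_frame = trimmed.starts_with("at ");
        // Assumption: a frame is "framework" if its path mentions
        // node_modules or a node:internal module.
        let is_framework = is_frame
            && (trimmed.contains("node_modules") || trimmed.contains("node:internal"));

        if is_framework {
            dropped += 1;
        } else {
            kept.push(line);
        }
    }

    let mut out = kept.join("\n");
    if dropped > 0 {
        // Leave a one-line note so the reader knows frames were removed.
        out.push_str(&format!("\n    ... {} framework frames omitted", dropped));
    }
    out
}
```

Given an Express trace, this keeps the message line and the application frame and replaces everything else with a one-line count. A real filter would also have to handle the cases the review prompt lists, such as frames with no file path or traces embedded in JSON output.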
Six new compression filters, each built by a subagent cycle:

- Error stacktraces: strips framework frames from Node.js, Python, Rust, Go, Java
- ANSI preprocessor: removes escape codes, spinners, progress bars
- Web page extraction: strips nav, footer, ads, keeps article content
- Build error grouping: collapses 40 identical TypeScript errors into one group
- Package install compression: removes deprecated warnings, keeps security alerts
- Docker build compression: success = 1 line, failure = full context

Each filter got 15-20 dedicated test cases. The error stacktrace filter alone has 20 tests covering 5 languages.

Week 3: Benchmarks and Honest Failures

I ran 102 benchmark tests with production-scale inputs. The results were not uniformly impressive.

Rust panic compression started at 2%. The subagent's first implementation only stripped the backtrace header line. I rewrote the prompt with explicit examples of Rust panic output and dispatched again. It landed at 80%.

Java stacktrace compression went negative (-12%) on short traces. The formatted output was longer than the raw input. I added a threshold: if compression ratio is below 10%, pass through the original output unchanged. Final result: 20% savings on Java, no negative cases.

Build error grouping hit -10% on single-error inputs. Same fix: threshold passthrough.

Lying about benchmarks is worse than imperfect numbers. The README shows every result, including the weak spots.

What I Actually Did vs. What Claude Did

Me: Architecture decisions, prompt design, review, quality gates, benchmark analysis, bug triage.

Claude Code: All Rust implementation, test writing, CI/CD configuration, README generation, install script.

The split was roughly 20% me (thinking, reviewing, deciding) and 80% Claude (typing, testing, building). But that 20% was the difference between shipping and not shipping. Without review cycles, the Rust panic filter would still be at 2%.

Final Stats

- 1,056 tests, 0 failures
- 102 benchmark cases
- 40+ command modules (34 inherited + 6 new)
- 5-platform CI/CD (Linux x86/musl, macOS arm64/x86, Windows)
- 3 install methods (curl, Homebrew, cargo)
- README in 4 languages

The tool works. I use it daily. My Claude Code sessions last 40-60% longer before hitting context limits. The AI built a tool to extend its own memory, and the humans reviewing it are the reason it actually works.

- GitHub: jee599/contextzip
- 102-test benchmark results
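
For reference, the threshold passthrough from Week 3 can be sketched in a few lines of Rust. This is an illustration under assumptions, not the project's code: the function names savings_ratio and compress_or_passthrough are hypothetical, and savings are measured in bytes here, whereas the real tool may count tokens.

```rust
/// Hypothetical helper: fraction of bytes saved by compression.
/// Negative when the "compressed" text is longer than the original.
pub fn savings_ratio(original: &str, compressed: &str) -> f64 {
    if original.is_empty() {
        // Nothing to save on empty input; also avoids dividing by zero.
        return 0.0;
    }
    1.0 - (compressed.len() as f64 / original.len() as f64)
}

/// Hypothetical sketch of the passthrough rule: return the compressed text
/// only if it saves at least `min_savings` (0.10 means 10%); otherwise
/// return the original unchanged.
pub fn compress_or_passthrough<'a>(
    original: &'a str,
    compressed: &'a str,
    min_savings: f64,
) -> &'a str {
    if savings_ratio(original, compressed) < min_savings {
        original
    } else {
        compressed
    }
}
```

With min_savings set to 0.10, a reformatted short Java trace that came out longer than its input would be returned as the raw input, which is how a negative case becomes a neutral one.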