Tools: Building Taskmaster: A Go-Powered Process Supervisor from Scratch (2026)

What Is Taskmaster?

Why Go?

Architecture Deep Dive

The High-Level Picture

Goroutine per Process

Channel-Based Communication

Process State Machine

Configuration: Simple and Expressive

Hot Reload: Zero-Downtime Config Changes

Signal Propagation and Graceful Shutdown

Key Takeaways

Try It Yourself

How two 42 School students reimagined process management with Go's concurrency model

If you've ever managed a production server, you know the pain: a critical process crashes at 3 AM, no one notices until morning, and your users are left with a broken experience. Tools like Supervisor were built to solve this, but they come with Python overhead, complex configurations, and aging architectures.

We decided to build our own. Taskmaster is a lightweight, production-ready process supervisor written in Go, and building it taught us more about operating systems, concurrency, and daemon design than any textbook could.

What Is Taskmaster?

Taskmaster is a process control daemon. It manages the full lifecycle of your processes (starting, stopping, restarting, and monitoring them), all through a simple interactive shell. Think of it as a modern, Go-native alternative to Supervisor or systemd service management, without the complexity. A single YAML file to configure everything. A single binary to run it.

Why Go?

The choice of Go wasn't arbitrary. Process supervision is inherently concurrent: you need to monitor dozens of processes simultaneously without blocking. Traditionally, this is solved with threads and shared memory, which leads to complex locking, race conditions, and hard-to-debug crashes. Go offers a better model: goroutines and channels. This aligned perfectly with our architecture: one goroutine per managed process, communicating via channels. Elegant, efficient, and easy to reason about.

Architecture Deep Dive

The High-Level Picture

The main process runs the interactive CLI (using the readline library for history and completion) and listens for system signals. It owns a Task Manager, which holds the state of all configured processes.

Goroutine per Process

Each managed process gets its own goroutine, a StartTaskManager, that is entirely responsible for that process's lifecycle.
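To make the goroutine-per-process idea concrete, here is a minimal sketch rather than Taskmaster's actual code. StartTaskManager and CmdChan are the article's names; Command, CmdStart, CmdStop, monitor, and superviseAll are hypothetical, and the child process is simulated by a state string so the example runs anywhere.

```go
package main

import (
	"fmt"
	"sync"
)

// Command is a hypothetical control message sent from the CLI to a monitor.
type Command string

const (
	CmdStart Command = "start"
	CmdStop  Command = "stop"
)

// result carries a monitor's final state back to the caller.
type result struct {
	name  string
	state string
}

// monitor stands in for StartTaskManager: it owns one task's state and
// changes it only in response to messages on its command channel.
func monitor(name string, cmds <-chan Command, results chan<- result, wg *sync.WaitGroup) {
	defer wg.Done()
	state := "STOPPED"
	for cmd := range cmds { // the loop ends when the channel is closed
		switch cmd {
		case CmdStart:
			state = "RUNNING" // a real monitor would launch the child here
		case CmdStop:
			state = "STOPPED" // a real monitor would signal the child here
		}
	}
	results <- result{name, state}
}

// superviseAll starts one monitor goroutine per task, sends each its
// commands, and collects the final state of every task.
func superviseAll(plan map[string][]Command) map[string]string {
	var wg sync.WaitGroup
	results := make(chan result, len(plan))
	for name, cmds := range plan {
		ch := make(chan Command, 8) // buffered, like CmdChan in the article
		wg.Add(1)
		go monitor(name, ch, results, &wg)
		for _, c := range cmds {
			ch <- c
		}
		close(ch)
	}
	wg.Wait()
	close(results)
	final := make(map[string]string, len(plan))
	for r := range results {
		final[r.name] = r.state
	}
	return final
}

func main() {
	final := superviseAll(map[string][]Command{
		"nginx":    {CmdStart},
		"worker_1": {CmdStart, CmdStop},
	})
	fmt.Println("nginx:", final["nginx"])
	fmt.Println("worker_1:", final["worker_1"])
}
```

No monitor ever touches another monitor's state, so the sketch needs no mutex; the only synchronization is the channels and the WaitGroup.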
Each StartTaskManager goroutine does three things independently: it listens for control commands on a buffered channel (CmdChan), monitors the child process for exits and unexpected crashes, and handles restarts according to the configured policy.

Because each process is isolated in its own goroutine, a crash of one managed process, or a slowdown in its monitor, does not affect the others. The system stays responsive even when managing hundreds of processes.

Channel-Based Communication

The CLI never directly kills or starts a process. Instead, it sends a message over that process's command channel, and the goroutine acts on it. This decoupling means the CLI remains non-blocking regardless of how long a process takes to shut down. The goroutine handles timeouts, fallback signals, and cleanup entirely on its own.

Three channel types power the system: command channels, done channels, and timeout channels.

Process State Machine

Every process moves through a well-defined set of states: STOPPED, STARTED, RUNNING, and FATAL.

The successfulStartTimeout parameter is a key reliability feature. A process that starts and immediately crashes is different from one that runs for 30 seconds before failing, and Taskmaster treats them differently.

Configuration: Simple and Expressive

Taskmaster tasks are defined in a single YAML file. Our example configuration manages a web server and a pool of background workers. A few things in it are worth highlighting:

- instances: 5 tells Taskmaster to spawn 5 copies of worker.py and name them worker_1 through worker_5. You manage them individually or all at once with restart all.
- restart: on-failure vs restart: always. The distinction matters in production. on-failure only restarts if the exit code isn't in expectedExitCodes, so an intentional exit(0) won't trigger a restart. always is for long-running daemons that should never stop.
- gracefulStopTimeout: 15 means that when you issue a stop, Taskmaster sends the configured signal and waits up to 15 seconds for a clean exit. If the process hasn't stopped by then, it gets SIGKILL. No zombie processes.

Hot Reload: Zero-Downtime Config Changes

One of the features we're most proud of: you can change your configuration file and apply it without restarting the daemon or killing your processes. Run reload in the Taskmaster shell. Under the hood, this sends a SIGHUP signal (or you can do it from outside with kill -HUP <pid>).
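Applying a changed configuration without touching unaffected tasks comes down to comparing the old and new task definitions. The following is a minimal sketch of that comparison, not Taskmaster's actual reload code: TaskConfig, Plan, and diffConfigs are hypothetical names, only three config fields are modelled, and the "mailer" and "legacy" tasks are invented for the demo.

```go
package main

import (
	"fmt"
	"sort"
)

// TaskConfig is a hypothetical subset of a task's settings. All fields are
// comparable, so two configs can be compared with ==.
type TaskConfig struct {
	Command   string
	Instances int
	Restart   string
}

// Plan lists the task names a reload would have to act on.
type Plan struct {
	Start   []string // present only in the new config
	Stop    []string // present only in the old config
	Restart []string // present in both, but with different settings
}

// diffConfigs compares two configurations. Tasks that are identical in both
// appear in none of the lists and are therefore left running.
func diffConfigs(oldCfg, newCfg map[string]TaskConfig) Plan {
	var p Plan
	for name, n := range newCfg {
		o, existed := oldCfg[name]
		switch {
		case !existed:
			p.Start = append(p.Start, name)
		case o != n:
			p.Restart = append(p.Restart, name)
		}
	}
	for name := range oldCfg {
		if _, kept := newCfg[name]; !kept {
			p.Stop = append(p.Stop, name)
		}
	}
	// Map iteration order is random; sort so the plan is deterministic.
	sort.Strings(p.Start)
	sort.Strings(p.Stop)
	sort.Strings(p.Restart)
	return p
}

func main() {
	oldCfg := map[string]TaskConfig{
		"web_server": {Command: "/usr/local/bin/nginx -g 'daemon off;'", Instances: 1, Restart: "on-failure"},
		"worker":     {Command: "python3 worker.py", Instances: 5, Restart: "always"},
		"legacy":     {Command: "python3 legacy.py", Instances: 1, Restart: "always"},
	}
	newCfg := map[string]TaskConfig{
		"web_server": {Command: "/usr/local/bin/nginx -g 'daemon off;'", Instances: 1, Restart: "on-failure"},
		"worker":     {Command: "python3 worker.py", Instances: 8, Restart: "always"},
		"mailer":     {Command: "python3 mailer.py", Instances: 1, Restart: "on-failure"},
	}
	p := diffConfigs(oldCfg, newCfg)
	fmt.Println("start:", p.Start)
	fmt.Println("stop:", p.Stop)
	fmt.Println("restart:", p.Restart)
}
```

In this demo web_server is identical in both configurations, so it lands in no list and would keep running untouched.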
The config parser re-reads the YAML, diffs it against the current state, and applies changes incrementally. New tasks get started; removed tasks get stopped; modified tasks get restarted. Running tasks that haven't changed keep running, untouched.

Signal Propagation and Graceful Shutdown

Getting shutdown right is tricky. We had to handle three things: the daemon receiving SIGTERM (e.g., from the OS on shutdown), propagating the right signal to each child process, and waiting for all children to exit before the daemon itself exits.

We used Go's sync.WaitGroup for this. Every goroutine registers itself with a global WaitGroup before starting, and signals done when it exits. The main process waits on this group before terminating, guaranteeing that no child processes are left orphaned.

Key Takeaways

Building Taskmaster from scratch taught us three things:

1. Go's concurrency model is genuinely different. Not just syntactically different from threads, but conceptually different. "Don't communicate by sharing memory; share memory by communicating" isn't just a motto. It's a design philosophy that produces cleaner, more correct code.

2. UNIX process management is a deep topic. Process groups, session leaders, signal inheritance, file descriptor leaks, zombie processes: every one of these is a footgun waiting to go off. We hit most of them.

3. Small surface area wins. Taskmaster has one config file, one binary, and one shell. No agents, no web UIs, no databases. This simplicity makes it auditable, embeddable, and easy to debug.

Try It Yourself

Taskmaster is open source and available on GitHub at https://github.com/UBA-code/taskmaster.

If you're building something in Go that needs process management, or if you're curious about how supervisors work under the hood, we hope Taskmaster serves as a useful reference.

Built with ❤️ by Yassine Bel Hachmi and Hassan Idhmmououhya as part of the 42 School curriculum.

⭐ Star the repo if you find it useful!

Example shell session:

    $ ./bin/taskmaster config.yaml
    Taskmaster> status
    Task      Status   PID   Uptime  Restarts  Command
    nginx     RUNNING  1234  2m15s   0         /usr/local/bin/nginx
    worker_1  RUNNING  1235  2m15s   1         python3 worker.py
    worker_2  STOPPED  -     -       0         python3 worker.py
    Taskmaster> start worker_2
    Process 'worker_2' started with PID 1240
    Taskmaster> logs worker_1 5
    [2026-02-02 10:15:25] Processing task #1
    [2026-02-02 10:15:26] Task completed
    [2026-02-02 10:15:27] Waiting for tasks...

Architecture overview:

    ┌─────────────────────────────────────────────────────────────┐
    │                        Main Process                         │
    │  ┌──────────────┐  ┌─────────────┐  ┌──────────────────┐    │
    │  │   CLI Loop   │  │   Config    │  │  Signal Handler  │    │
    │  │  (readline)  │  │   Parser    │  │     (SIGHUP)     │    │
    │  └──────────────┘  └─────────────┘  └──────────────────┘    │
    └─────────────────────────────────────────────────────────────┘
                                   │
                       ┌───────────┴───────────┐
                       │     Task Manager      │
                       └───────────┬───────────┘
                                   │
               ┌───────────────────┼───────────────────┐
               │                   │                   │
          ┌────▼────┐         ┌────▼────┐         ┌────▼────┐
          │ Process │         │ Process │         │ Process │
          │ Monitor │   ...   │ Monitor │         │ Monitor │
          │(gorout.)│         │(gorout.)│         │(gorout.)│
          └────┬────┘         └────┬────┘         └────┬────┘
               │                   │                   │
          ┌────▼────┐         ┌────▼────┐         ┌────▼────┐
          │  Child  │         │  Child  │         │  Child  │
          │ Process │         │ Process │         │ Process │
          └─────────┘         └─────────┘         └─────────┘

Sending a command from the CLI:

    CLI ──── "stop nginx" ───► nginx's CmdChan ──► goroutine acts on it

Process state transitions:

    [STOPPED] → start → [STARTED] → (after successfulStartTimeout) → [RUNNING]
                                                                         │
                                                                         └─ unexpected exit ──► [FATAL]
                                                                                                   │
                                                                                        restart policy applies

Example configuration:

    tasks:
      web_server:
        command: "/usr/local/bin/nginx -g 'daemon off;'"
        instances: 1
        autoLaunch: true
        restart: on-failure
        expectedExitCodes: [0]
        successfulStartTimeout: 3
        restartsAttempts: 3
        stopingSignal: SIGTERM
        gracefulStopTimeout: 10
        stdout: /var/log/taskmaster/nginx.out.log
        stderr: /var/log/taskmaster/nginx.err.log
        environment:
          PORT: "8080"
          ENV: "production"
        workingDirectory: /var/www

      worker:
        command: "python3 worker.py"
        instances: 5
        autoLaunch: true
        restart: always
        restartsAttempts: 5
        gracefulStopTimeout: 15
        stdout: /var/log/taskmaster/worker.out.log
        stderr: /var/log/taskmaster/worker.err.log

Reloading the configuration:

    Taskmaster> reload
    Configuration reloaded.

Shutdown flow:

    // Simplified version of the shutdown flow
    tasks.WaitGroup.Wait() // Block until all process monitors are done
    os.Exit(0)

Getting the code:

    git clone https://github.com/UBA-code/taskmaster.git
    cd taskmaster
    make build
    ./bin/taskmaster  # Generates an example config and starts the shell

What Go's concurrency model provides:

- Goroutines are lightweight (a few KB of stack vs. MB for threads) and can be spawned in the thousands without issue.
- Channels provide safe, structured communication between concurrent components: no mutexes, no shared state hell.
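The two points above about goroutines and channels can be tried in a few lines. This is an illustrative sketch only and has nothing Taskmaster-specific in it: sumIDs is a hypothetical helper that starts one goroutine per ID and gathers every result over a single unbuffered channel, with no mutex and no shared counter.

```go
package main

import "fmt"

// sumIDs starts n goroutines. Each one sends its own ID on the results
// channel, and the caller adds the IDs up as they arrive.
func sumIDs(n int) int {
	results := make(chan int)
	for id := 1; id <= n; id++ {
		go func(v int) {
			results <- v // the only communication between goroutines
		}(id)
	}
	total := 0
	for i := 0; i < n; i++ {
		total += <-results // exactly one receive per goroutine
	}
	return total
}

func main() {
	n := 10000
	got := sumIDs(n)
	want := n * (n + 1) / 2 // closed form for 1 + 2 + ... + n
	fmt.Println(got == want)
}
```

Because only the collecting goroutine ever writes to total, the sum is race-free even with ten thousand senders.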
What each StartTaskManager goroutine does:

- Listens for control commands over a buffered channel (CmdChan)
- Monitors the child process for exits and unexpected crashes
- Handles restarts based on the configured policy

Channel types:

- Command Channels: carry control messages (start, stop, restart)
- Done Channels: signal that a child process has exited
- Timeout Channels: implement deadlines for startup grace periods and graceful shutdowns

Process states:

- STOPPED: Not running (intentionally or not yet started)
- STARTED: Running but in the startup grace period
- RUNNING: Confirmed healthy and past the startup timeout
- FATAL: Crashed and restart attempts exhausted

What shutdown has to handle:

- The daemon receiving SIGTERM (e.g., from the OS on shutdown)
- Propagating the right signal to each child process
- Waiting for all children to exit before the daemon itself exits
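The shutdown requirements above map naturally onto a sync.WaitGroup plus a broadcast channel. The sketch below is a simulation under stated assumptions, not the project's code: monitor and shutdownAll are hypothetical names, closing the quit channel stands in for the daemon receiving SIGTERM, and "stopping the child" is reduced to reporting the task name.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// monitor simulates a process monitor. It blocks until shutdown is
// requested, then "stops its child" and reports which task it handled.
func monitor(name string, quit <-chan struct{}, stopped chan<- string, wg *sync.WaitGroup) {
	defer wg.Done() // signal done when this goroutine exits
	<-quit          // unblocks for every monitor once quit is closed
	// A real monitor would send the configured stop signal here, wait up to
	// gracefulStopTimeout, and fall back to SIGKILL.
	stopped <- name
}

// shutdownAll starts one monitor per task, requests shutdown, and returns
// only after every monitor has finished.
func shutdownAll(names []string) []string {
	var wg sync.WaitGroup
	quit := make(chan struct{})
	stopped := make(chan string, len(names)) // buffered so monitors never block
	for _, n := range names {
		wg.Add(1) // register before the goroutine starts
		go monitor(n, quit, stopped, &wg)
	}
	close(quit) // stand-in for the daemon receiving SIGTERM
	wg.Wait()   // block until all process monitors are done
	close(stopped)
	var out []string
	for n := range stopped {
		out = append(out, n)
	}
	sort.Strings(out) // goroutine completion order is not deterministic
	return out
}

func main() {
	fmt.Println(shutdownAll([]string{"worker_2", "nginx", "worker_1"}))
}
```

Calling wg.Add before the go statement matters: if the goroutine registered itself after starting, Wait could return before a late monitor had been counted.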