How I built a self-hosted PaaS on AWS from scratch - no Docker, no Kubernetes

nelsonramosua / Forge

Forge is a self-hosted PaaS that deploys web apps directly from GitHub webhooks - no Docker, no Kubernetes, just push to deploy.

How it works in plain terms:

- You add a forge.yaml file to your repository describing how to build and run your app.
- You configure a GitHub webhook pointing to your Forge instance.
- Every push to your allowed branch triggers a deployment: Forge clones your repository, builds it using the commands you defined, starts the application process, runs health checks to confirm it is live, and updates the reverse proxy so traffic reaches it at your-app.yourdomain.com.
- Logs stream in real time. If the health checks fail, Forge rolls back to the previous deployment.

View on GitHub


Table of Contents

What Forge is

Infrastructure

The deploy lifecycle

The problems that shaped Forge

Observability

What I would do differently

Conclusion

Every time I pushed to GitHub, the ritual was the same: SSH into the server, git pull, restart the process (hoping the port was free), and pray nothing broke. Logs were scattered, rollbacks were manual, and TLS felt like an afterthought. That works for a small demo, but it gets old very quickly.

I wanted push-to-deploy: commit, push, wait for health checks, and get a live URL. I knew platforms like Render and Fly.io solved this elegantly - but I wanted to understand how. Not at the API level. At the level of what actually happens when a push triggers a deployment.

So I built Forge: a small self-hosted PaaS. A GitHub push becomes a running process on my own AWS infrastructure, with health checks, logs, routing, TLS, rollback, and metrics. No Docker daemon. No Kubernetes cluster. Just Linux processes, cgroups, namespaces, Caddy, SQLite, Terraform, and Ansible.

However, Forge became much more than «run my app after a push». It forced me to learn about VPCs, Linux process isolation, state machines, TLS routing, CI/CD, observability, and failure recovery. The architecture was not designed perfectly upfront. It was shaped by real problems - and that is what this article is actually about.

What Forge is

Forge is a self-hosted deployment platform. You push a commit to GitHub, and Forge builds and runs your application on your own infrastructure - no third-party platform required. Think of it as a stripped-down Render or Fly.io that you own entirely: one server runs the control plane, one or more servers run your apps, and a reverse proxy routes traffic automatically with TLS.

At a high level, the system has one Go control plane, one or more C worker agents, and Caddy as the edge reverse proxy.
The control plane receives webhooks, validates manifests, stores state, schedules tasks, and updates routes. The worker builds and runs applications.

The control plane is written in Go. It exposes a REST API, processes GitHub webhooks, schedules deployments, and manages a Caddy reverse proxy via its Admin API. State lives in SQLite in WAL mode - Write-Ahead Logging, which allows concurrent reads without blocking writes, making it a good fit for a single-node system with frequent status updates. Secrets are encrypted at rest with AES-256-GCM.

The worker agent - forge-agent - is written in C11. It runs on a private worker VM, polls the control plane for tasks, and executes builds and processes. The build runner is also C11: it sets up cgroups v2 limits and Linux user, PID, and mount namespaces before running any user-supplied command. Apps declare everything in a forge.yaml manifest; a full example is listed at the end of this post.

Infrastructure

The infra journey was not linear. The original plan was Oracle Cloud's Always Free tier: two Ampere A1 cores and 24 GB of RAM at zero cost, forever. In practice, free A1 instances were consistently unavailable in the regions that mattered - capacity errors on every attempt, for days. When instances did provision, ARM64 build-toolchain issues and inconsistent networking behaviour on the free tier made the setup brittle enough to abandon. The experiment taught me that «free» and «reliable» tend to be orthogonal properties.

The next logical step was AWS Free Tier: t3.micro instances, 12 months, no cost. That worked - until it didn't. The free tier has no NAT gateway. The worker, architecturally and in terms of security, should sit in a private subnet with no public IP, which means it needs a NAT gateway to reach GitHub for clones and package registries for builds. A NAT gateway costs $0.048/hour before data transfer (as of May 2026) - about $35/month just to exist - a number that surprised me.
That single NAT-gateway line item pushed the project from «free experiment» into «this has a real monthly bill» (which, as a student, is not nothing). Add two t3.micro instances, two 30 GB gp3 volumes, two public IPv4 addresses, and a Route 53 hosted zone, and the total lands around $65/month for an always-on setup. Not expensive in absolute terms, but meaningfully more than zero - and a good reminder that private subnets are not free architecture.

The AWS layout is deliberate. The control plane is a t3.micro in a public subnet - it accepts HTTPS from the internet and serves as the single entry point. The worker is a t3.micro in a private subnet with no public IP; a NAT gateway gives it outbound access for Git clones and package downloads. Security groups enforce the topology - the full rule set is listed at the end of this post, but it boils down to this: Caddy runs on the control plane and proxies HTTPS traffic to private worker ports, and the worker is never directly reachable from the internet.

Terraform provisions the VMs and network. An Ansible playbook builds the Forge binaries on the remote host, renders env files from templates, installs systemd units, configures Caddy, Prometheus, Alertmanager, and Grafana, and validates service health after every restart.

The deploy lifecycle

A deployment starts when GitHub sends a POST to /api/v1/webhook/github; the full step-by-step list is at the end of this post. If health checks fail, the new process is killed. If a previous deployment was running, Forge restores the previous Caddy route.

The problems that shaped Forge

The interesting part of building a deploy platform is not the happy path. It is the sequence of failures that reveal what the system actually needs.

DNS and TLS were not instant. ACME (the protocol that Let's Encrypt and Caddy use to issue certificates automatically) validation sometimes failed because some resolvers and ACME vantage points still saw the old DNS state after others had already updated. Cloudflare's proxy mode added further confusion: Caddy thought the record pointed to Cloudflare, not to my instance.
The deeper issue was that Caddy should not issue certificates for arbitrary hostnames - only for domains that Forge knows have a running deployment. This led to implementing a /api/v1/tls/ask endpoint that Caddy's On-Demand TLS calls before issuing any certificate. If the hostname is not a known running deployment, the certificate request is rejected.

CI is also a production actor. GitHub Actions deploys failed because the security group allowed SSH only from my local admin IP, and the ephemeral runner had a different one. Then the correct IP was allowed, but the SSH key was wrong. Then the key was right but passphrase-protected, which does not work cleanly in an automated pipeline. Each failure pointed to the same conclusion: CI/CD is not just scripting - it is an infrastructure actor that needs the same treatment as any other system: proper credentials, scoped permissions, and explicit access. That led to cleaner thinking around OIDC, IAM roles, temporary ingress, and deploys that run only after CI has passed.

Green Ansible did not mean healthy Forge. A playbook would complete with every task marked ok or changed, and Forge would still not be usable. Grafana needed the API to be ready before the admin password could be reset programmatically. Remote Go builds failed on checksum mismatches due to version inconsistencies. Small VMs ran out of memory during heavier builds because no swap was configured. The lesson: provisioning is not just installing packages. It is verifying that the resulting system behaves correctly, which means adding real readiness checks after every service restart, not just checking that the process started.

The database said running; reality disagreed. This was the most important and frustrating runtime bug. A deployment could be marked running in SQLite while the worker process was dead, the agent had stopped heartbeating, or the VM had been restarted. Forge had state transitions but no reconciliation.
Adding a scheduled reconciliation loop - checking running deployments against actual agent heartbeats, re-syncing Caddy routes on control-plane restart, and handling stale records - changed the system from a state machine into something closer to a control loop.

Ports collided. Early versions assumed apps would bind to fixed ports like 8000. Forge now allocates a host port from 20000-39999 and injects it as $PORT. The run.port value in forge.yaml is best understood as a manifest default; the platform owns the actual runtime port.

Security hardening came from uncomfortable edge cases. An empty admin token must never authenticate - but a naive string comparison against an empty string might allow it. GitHub webhook signatures need to be validated with constant-time comparison to avoid timing attacks. The repo and branch allowlists need to exist and be non-empty - an unconfigured Forge should not accept arbitrary webhook payloads. The agent runs as an unprivileged forge user, not root. None of this is exciting to implement. All of it is necessary.

Private repos exposed a design limitation. Repo credentials were initially stored per owner/repo, but a GitHub classic PAT is scoped to a user account, not to a single repository. Storing the same token three times for three repos in the same org was operationally wrong. Forge now supports owner-level credentials with per-repo override: a credential for myorg covers all of that org's repos unless a more specific entry exists.

Observability

Forge exposes Prometheus metrics from two sources. The control plane serves its own metrics directly on :8080: current deployment and task status counts, and agent heartbeat status. The worker side is trickier - the C agent serves Prometheus text (CPU, memory, running process count, heartbeat timestamps) over a Unix socket using a minimal HTTP-compatible response.
forge-exporter is a small Go bridge that reads the agent's metrics socket and translates it into a standard Prometheus HTTP endpoint on :9108, so Prometheus can scrape it without the agent exposing a TCP listener itself. This keeps the agent lean and focused on process management. The worker metrics are not per-app or per-deployment yet. Grafana has a small dashboard for deployment state, online agents, and worker memory.

The alert I check first is very simple and arguably the most important: ForgeNoOnlineAgents, which fires when forge_agents_online has been 0 for two minutes (the rule is listed at the end of this post). If it fires, the control plane has no workers to schedule on, and pending deployments will stay pending indefinitely. No other metric captures that specific failure mode.

Grafana sits behind the control-plane security group, accessible only from the admin CIDR. Alertmanager is installed and wired into Prometheus, but the default receiver is still a placeholder; the next step is connecting it to Slack, Discord, PagerDuty, or another notification channel.

Observability was added incrementally, not upfront. But it changed how I debug. The difference between «something is wrong» and «the agent on worker-1 stopped heartbeating 4 minutes ago» is the difference between guessing and knowing.

What I would do differently

Start with the agent in Go, then move the low-level isolation parts to C. The C agent is the most technically interesting part of Forge - direct cgroup writes, raw syscalls, /proc parsing. It is also the part that is hardest to iterate on quickly. Starting in Go would have accelerated initial development; the C build runner is the right place to draw that boundary. Debugging a hand-rolled HTTP client while also debugging namespace setup is educational, but painful.

Design the store interface earlier. SQLite in WAL mode is a great choice for single-node Forge - zero external dependencies, concurrent reads, durable writes.
In early Forge, though, the store interface leaked SQLite-specific details into the server layer earlier than it should have. A cleaner repository interface from the start would have made testing easier and kept the option of replacing the backend open. If I ever wanted multi-region or leader election, I would want that boundary to be cleaner.

Use remote Terraform state from day one. Local state works until you need to apply from a different machine, recover from a failed apply, or share infrastructure with a CI pipeline. The migration cost is low, but the friction of doing it mid-project is real.

Finally, I would be clearer about the observability boundary from day one. Forge currently has worker-level metrics, which are enough to see whether a node is busy, but not enough to attribute CPU or memory to a specific app. Per-app resource accounting would be a separate feature, not something the current metrics magically provide.

Conclusion

Forge is not trying to beat Kubernetes. It is a learning project that exposed the real mechanics behind deployment platforms: the state machines, the reconciliation loops, the TLS edge cases, the security groups, the cgroup limits, the rollback logic, the alerting gaps. The most useful thing it taught me is that the interesting engineering in a deploy platform is not the deploy itself - it is everything that happens when the deploy goes wrong.

Clone it, write a forge.yaml, open a PR, and watch your own infrastructure deploy it.


The forge.yaml manifest for an example app:

```yaml
name: myapp
runtime: python3.11
build:
  commands:
    - python3 -m venv .venv
    - . .venv/bin/activate && python -m pip install -r requirements.txt
run:
  command: . .venv/bin/activate && uvicorn app:main --host 0.0.0.0 --port $PORT
  port: 8000
resources:
  memory: 256M
  cpu: 0.5
health:
  path: /health
  interval: 10s
  timeout: 3s
  retries: 3
```

The security-group rules in full:

- The control plane accepts 80/443 from the internet, and SSH, Prometheus, Alertmanager, and Grafana only from my admin CIDR.
- The control plane API on port 8080 is only open inside the VPC.
- The worker accepts nothing from the internet - SSH, the metrics exporter on 9108, and app ports 20000-39999 are open only from the control-plane security group.

The worker security group in Terraform:

```hcl
resource "aws_security_group" "worker" {
  name   = "forge-worker"
  vpc_id = aws_vpc.forge.id

  ingress {
    description     = "Application ports from Caddy/control plane"
    from_port       = 20000
    to_port         = 39999
    protocol        = "tcp"
    security_groups = [aws_security_group.control_plane.id]
  }
}
```

The deploy lifecycle, step by step. A deployment starts when GitHub sends:

```
POST /api/v1/webhook/github
```

The control plane then:

- Verifies X-Hub-Signature-256 with HMAC-SHA256.
- Checks the repo and branch against allowlists.
- Validates the commit SHA format.
- Clones the repo and parses forge.yaml - unknown fields are rejected via gopkg.in/yaml.v3 with KnownFields(true), so misconfigured deploys fail before any worker task is created.
- Creates a pending deployment in SQLite.
- The scheduler picks an online worker based on available CPU and memory headroom.
- forge-agent claims a build task, and the build runner executes the build commands inside cgroups v2 limits and Linux namespaces. It is not Docker or Firecracker, but it exercises the primitives directly. In production mode, if namespace isolation is unavailable, the build fails closed.
- The agent starts the app process with $PORT injected, then health-checks http://127.0.0.1:$PORT/health.
- Once health passes, the control plane updates Caddy's route via the Admin API - without restarting the Caddy service.

The alert rule:

```yaml
- alert: ForgeNoOnlineAgents
  expr: forge_agents_online == 0
  for: 2m
  labels:
    severity: page
```

The exported metrics:

- forge_deployments_total{status}: deployment rows grouped by current status.
- forge_tasks_total{status}: task rows grouped by current status, across build, run, and stop tasks.
- forge_agents_online: agents with a recent heartbeat.
- forge_agent_cpu_used, forge_agent_memory_used_bytes, forge_agent_memory_capacity_bytes, forge_agent_processes, and forge_agent_last_heartbeat_seconds: worker-level metrics exported by the agent.