# I Am Building a Cloud: Lessons From Designing My Own Infrastructure From Scratch

Three months ago, I typed `mkdir my-cloud` and told myself: how hard can it be? Spoiler: very. But also deeply rewarding in ways I didn't expect.

This isn't a tutorial. It's a brutally honest account of what it actually takes when you decide to stop renting compute from AWS, GCP, or Azure and start building your own cloud — even a small, self-hosted one. Whether you're doing this for learning, for cost savings at scale, or because you genuinely want control over your stack, there's a lot nobody tells you upfront.

## Why Build a Cloud at All?

Before we dive into architecture diagrams and YAML files, let's be honest about motivation. Here are the real reasons developers end up down this rabbit hole:

- **Cost at scale.** Managed services are convenient but punishing at volume. At a certain number of VMs or data transfer gigabytes, the math tilts hard toward owning iron.
- **Control and compliance.** Some industries (healthcare, finance, government) need data sovereignty that public clouds make complicated.
- **Learning.** Nothing teaches you how Kubernetes actually works like building the thing that Kubernetes runs on.
- **The itch.** Sometimes you just want to know if you can.

I fall into the last two camps. I have a rack of second-hand servers in a co-location facility, a dangerous amount of free time, and a strong aversion to accepting "just use managed X" as a final answer.

## What "Building a Cloud" Actually Means

A cloud platform is not one thing. It's a layered stack of problems:

```
┌──────────────────────────────┐
│   Developer Portal / API     │  ← Users interact here
├──────────────────────────────┤
│   Orchestration Layer        │  ← Kubernetes, Nomad, etc.
├──────────────────────────────┤
│   Networking Layer           │  ← SDN, load balancers, DNS
├──────────────────────────────┤
│   Storage Layer              │  ← Block, object, file
├──────────────────────────────┤
│   Compute Layer              │  ← Hypervisors, bare metal
├──────────────────────────────┤
│   Physical / Bare Metal      │  ← The actual servers
└──────────────────────────────┘
```

Most tutorials pick one layer and call it a day. Building a cloud means you have to care about all of them — and more importantly, how they talk to each other.

## The Stack I Chose (And Why)

### Compute: Proxmox VE

I run Proxmox VE as my hypervisor layer. It's open-source, handles both KVM virtual machines and LXC containers, and has a solid REST API I can script against. Spinning up a VM via the Proxmox API looks like this:

```shell
curl -s -k -b "PVEAuthCookie=${TICKET}" \
  -H "CSRFPreventionToken: ${CSRF}" \
  -X POST \
  "https://proxmox-host:8006/api2/json/nodes/pve/qemu" \
  -d 'vmid=101&name=my-vm&memory=2048&cores=2&net0=virtio,bridge=vmbr0&ide2=local:iso/ubuntu-22.04.iso,media=cdrom&scsihw=virtio-scsi-pci&scsi0=local-lvm:20'
```

This is the foundation. Every VM, every container, everything runs on top of this.

### Networking: Open vSwitch + VXLANs

This is where things get interesting — and painful. When you want tenant isolation (so different users or projects can't see each other's traffic), you need a software-defined network. I use Open vSwitch (OVS) with VXLAN overlays. Each project gets its own VXLAN segment:

```shell
# Create a VXLAN tunnel between two hypervisor nodes
ovs-vsctl add-br br-overlay
ovs-vsctl add-port br-overlay vxlan0 -- \
  set interface vxlan0 type=vxlan \
  options:remote_ip=10.0.0.2 \
  options:key=1001
```

This gives you Layer 2 connectivity across physically separate hosts — the same trick that AWS and GCP use under the hood (just with a lot more engineering muscle behind it).

### Storage: Ceph

For distributed block and object storage, I run a small Ceph cluster across three nodes. Ceph is the backbone of many production clouds, including parts of OpenStack deployments. Creating a storage pool:

```shell
ceph osd pool create my-cloud-vms 128
rbd pool init my-cloud-vms
```

Is Ceph operationally complex? Absolutely. But it gives you replicated, fault-tolerant storage that behaves like EBS or GCS under the hood.

### Orchestration: Kubernetes (K3s)

For running containerized workloads, I use K3s — a lightweight Kubernetes distribution that doesn't require a team of SREs to operate. It runs comfortably on VMs provisioned by Proxmox.

```shell
# Install K3s on a fresh VM
curl -sfL https://get.k3s.io | sh -s - \
  --cluster-init \
  --disable traefik \
  --node-name cloud-control-01
```

## The Control Plane: The Hardest Part Nobody Talks About

Here's the dirty secret: the compute, storage, and network layers are solved problems. There is open-source software for all of it. The genuinely hard part is the control plane — the API and logic that ties everything together and exposes it to users.

When a user says "give me a 4-core VM with 8GB RAM in region east," something needs to:

- Authenticate the request
- Check quota and billing
- Select the right hypervisor node (scheduling)
- Call the Proxmox API
- Configure networking for the new VM
- Register the VM in a state database
- Return an IP address and credentials to the user

That workflow is your control plane. I'm building mine as a Go service with a PostgreSQL backend:

```go
func (s *Server) CreateVM(ctx context.Context, req *CreateVMRequest) (*VM, error) {
	// 1. Validate and authenticate
	user, err := s.auth.Validate(ctx, req.Token)
	if err != nil {
		return nil, ErrUnauthorized
	}

	// 2. Check quota
	if err := s.quota.Check(ctx, user.ID, req.Resources); err != nil {
		return nil, ErrQuotaExceeded
	}

	// 3. Schedule: pick a hypervisor node
	node, err := s.scheduler.Select(ctx, req.Resources)
	if err != nil {
		return nil, ErrNoCapacity
	}

	// 4. Provision the VM
	vmID, err := s.proxmox.CreateVM(ctx, node, req)
	if err != nil {
		return nil, fmt.Errorf("provisioning failed: %w", err)
	}

	// 5. Configure networking
	ip, err := s.network.Allocate(ctx, vmID, user.ProjectID)
	if err != nil {
		_ = s.proxmox.DeleteVM(ctx, node, vmID) // rollback
		return nil, fmt.Errorf("network allocation failed: %w", err)
	}

	// 6. Persist state
	vm := &VM{ID: vmID, NodeID: node.ID, IP: ip, OwnerID: user.ID}
	if err := s.db.SaveVM(ctx, vm); err != nil {
		return nil, err
	}

	return vm, nil
}
```

This is simplified, but it captures the essence. Notice the rollback on network allocation failure — distributed system failures bite hard when you don't handle partial states.

## Mistakes I've Made (So You Don't Have To)

### 1. Skipping Idempotency Early On

Cloud APIs need to be idempotent. If a VM creation request times out and the client retries, you must not create two VMs. I didn't implement idempotency keys early enough and ended up with orphaned VMs I couldn't account for.

**Fix:** Accept a `client_request_id` on every mutating API call and deduplicate in your database.

### 2. Underestimating Networking Complexity

I naively thought networking was "just routing." It's not. You're dealing with ARP storms, MTU mismatches across VXLAN tunnels, asymmetric routing, and firewall state tables that don't survive node reboots. Budget triple the time you think you need here.

### 3. No Observability From Day One

I added metrics and logging as an afterthought. Big mistake. When your control plane starts behaving weirdly at 2 AM, printf debugging across three hypervisor nodes is a nightmare.

**Fix:** Instrument everything from day one. I now use Prometheus and Grafana for metrics alongside structured JSON logging shipped to a central Loki instance.

```go
// Instrument your handlers from the start
func (s *Server) CreateVM(ctx context.Context, req *CreateVMRequest) (*VM, error) {
	timer := prometheus.NewTimer(vmCreationDuration)
	defer timer.ObserveDuration()
	vmCreationTotal.Inc()

	// ... rest of the logic
}
```

## What This Teaches You About Public Clouds

Building even a tiny cloud fundamentally changes how you read AWS or GCP documentation. Phrases like "availability zone," "VPC peering," "instance scheduling," and "eventual consistency" stop being buzzwords and become concrete engineering decisions you've personally wrestled with.

You start to understand why EBS volumes have the latency characteristics they do, why cross-AZ traffic costs money, and why spot instances can be interrupted. These aren't arbitrary decisions — they're consequences of real physical and logical constraints.

If you want to become a genuinely better cloud engineer (not just a cloud user), infrastructure-as-code tools and books on cloud internals will only take you so far. At some point, you have to build. A few resources that helped me along the way:

- *Designing Data-Intensive Applications* by Martin Kleppmann — essential for understanding the state management challenges
- The Proxmox API documentation (surprisingly good)
- The Ceph documentation (less good, but comprehensive)
- *Cloud Native Patterns* and similar architecture guides — for thinking about multi-tenancy correctly
- The OpenStack source code — not to run it, but to read how they solved problems

## Where I Am Today

After three months, my cloud can:

- ✅ Provision and destroy VMs via API
- ✅ Allocate isolated project networks automatically
- ✅ Serve block storage from Ceph
- ✅ Run Kubernetes workloads across the VM fleet
- ✅ Track basic resource usage per user/project

Still on the roadmap:

- 🔧 Live VM migration between hypervisor nodes
- 🔧 A proper billing and quota system
- 🔧 A usable developer portal (the API is functional but ugly)
- 🔧 Automated certificate management for tenant workloads

The repo is private for now, but I'm planning to open-source the control plane once it's less embarrassing. Follow me here on DEV if you want to be notified when that drops.

## Should You Do This?

If you want to deeply understand distributed systems, networking, and how modern infrastructure actually works — yes, absolutely. Building a cloud, even a toy one, is one of the most educational things I've ever done as an engineer.

If you need to ship a product next quarter — no, rent compute. That's what AWS is for.

But if you have the itch, scratch it. `mkdir my-cloud` and see where it takes you.

If you're building something similar, or you've done this before and want to tell me where I'm going wrong — drop a comment below. I read everything. And if you want updates as this project evolves, follow me here on DEV. There's a lot more to come.