Containerizing a Microservices App with Docker and CI/CD: HNG DevOps Stage 2 - Complete Guide

Contents

- A Quick Recap
- The Task
- But First: Why Containerization?
- Step 1: Reading the Code and Finding the Bugs
- Step 2: Writing the Dockerfiles
- Step 3: Docker Compose Orchestration
- Step 4: The CI/CD Pipeline (Stage 1: Lint; Stage 2: Test; Stage 3: Build; Stage 4: Security Scan; Stage 5: Integration Test; Stage 6: Deploy)
- Final Verification

The Big Picture

This is part of my HNG DevOps internship series. Follow along as I document every stage.

A Quick Recap

Previous articles:

- Stage 0: How I Secured a Linux Server from Scratch
- Stage 1: Build, Deploy and Reverse Proxy a Rust API

Stage 0 was about provisioning and hardening a Linux server. Stage 1 was about writing and deploying an API behind Nginx. Stage 2 is where things get serious.

The Task

This time I was not writing the application. I was given one, told it had bugs, and asked to find them, fix them, containerize the whole thing, wire up a CI/CD pipeline, and ship it to production.

The application is a job processing system made up of four components:

- Frontend (Node.js/Express): where users submit and track jobs
- API (Python/FastAPI): creates jobs and serves status updates
- Worker (Python): picks up jobs from a Redis queue and processes them
- Redis: shared queue and state storage between the API and worker

The source code was intentionally shipped with bugs. Finding them, fixing them, and documenting every single one was a graded part of the task. On top of that, the requirements were:

- Write production-quality Dockerfiles for all three services (multi-stage, non-root user, health checks, no secrets in images)
- Write a docker-compose.yml that orchestrates the full stack (named network, health-based dependencies, resource limits, no hardcoded config)
- Build a full CI/CD pipeline with six stages that must run in strict order: lint -> test -> build -> security scan -> integration test -> deploy
- Document everything in a README.md, a FIXES.md, and a .env.example

The starter repo is here: https://github.com/chukwukelu2023/hng14-stage2-devops. My fork with all the work is here: https://github.com/GideonBature/hng14-stage2-devops.

But First: Why Containerization?

Before Stage 2, I was deploying binaries and managing services directly on the host OS. That works for one service. It does not scale when you have four services that need to talk to each other, each with different language runtimes, different dependencies, and different startup requirements.

Containers solve this by packaging each service with everything it needs to run. Docker gives you isolation (each service cannot interfere with the others), portability (it runs the same on your laptop and on the server), and reproducibility (the exact same image runs in every environment). Docker Compose takes it further by letting you define how all four services relate to each other, what order they start in, and what network they share, all in a single file.

The CI/CD pipeline adds the final piece: automation. Every time you push code, the pipeline runs your tests, scans your images for vulnerabilities, verifies the full stack works end to end, and deploys to production. No manual steps, no "it works on my machine."
Step 1: Reading the Code and Finding the Bugs

The task was explicit about this: read the source files before touching any infrastructure. I went through every file carefully and documented every issue I found in FIXES.md. Here is a summary of what was broken.

In api/main.py: the Redis host was hardcoded to localhost. That works when running the API directly on a machine, but inside Docker, services communicate over a network using their service names, not localhost. A hardcoded localhost means the API can never find Redis inside a container. There was also no health check endpoint, no logging, and the queue name was inconsistent with what the worker expected.

In worker/worker.py: the same hardcoded Redis host problem. On top of that, the infinite loop had no signal handling, meaning Docker could never gracefully shut down the worker; it would just be force-killed, potentially losing a job mid-process. There were also no try/except blocks, so any Redis connection error would crash the entire worker with no recovery.

In frontend/app.js: the API URL was hardcoded to http://localhost:8000. Same problem as above: inside Docker, the frontend needs to reach the API by its service name, not localhost. There was also no health check endpoint, which meant Docker had no way to verify the frontend was actually ready.

In configuration: a .env file containing secrets had been committed to the repository. This is a serious security issue. Any secret committed to git history is compromised, even if you delete the file later. I added .env to .gitignore, removed it from tracking, and created a .env.example with placeholder values instead.

All of this is documented, with exact file names, line numbers, what the problem was, and what was changed, in the FIXES.md in the repository.

Step 2: Writing the Dockerfiles

Each service got its own Dockerfile. The requirements were specific: multi-stage builds, a non-root user, a working health check, and no secrets baked in.

Why multi-stage builds? A naive Dockerfile installs your build tools, compiles your code, and ships everything together in one image. That means your production image contains gcc, build headers, and other tools that have no business being in production: they increase image size and expand the attack surface. Multi-stage builds solve this by using a builder stage to install dependencies, then copying only the final artifacts into a clean production stage.

Notice I switched from python:3.11-slim (Debian-based) to python:3.11-alpine. This was not an arbitrary choice. Alpine images are significantly smaller and ship far fewer packages, which means fewer potential vulnerabilities. This became important later, during the security scanning stage.

The worker does not expose an HTTP port, so a traditional HTTP health check doesn't apply. Instead, the main loop in worker.py touches a heartbeat file on every iteration, and the container health check verifies that the file exists and was written within the last 60 seconds. If the worker hangs, the heartbeat file stops being updated, the timestamp check fails, and Docker marks the container unhealthy. Clean and reliable.

One issue I hit here was that npm ci requires a package-lock.json to exist. The original project only had package.json.
Rather than generating a lockfile, I switched to npm install --omit=dev. The --omit=dev flag skips development dependencies, keeping the image lean without needing a lockfile.

Step 3: Docker Compose Orchestration

The docker-compose.yml defines how all four services relate to each other. A few things are worth understanding.

Health-based dependencies: condition: service_healthy means a service will not start until its dependency has passed its health check. Not just started: actually healthy. This is the difference between a container that is running and a container that is ready. Without it, you get race conditions where the API starts before Redis is ready and immediately crashes.

No exposed Redis port: Redis has no ports mapping in the compose file. It is only accessible from within the app-network Docker network, so the outside world cannot reach it at all. Only the API and worker can talk to Redis, and only because they are on the same internal network.

All config from environment variables: nothing is hardcoded in the compose file. Every value that could vary between environments comes from environment variables, and the .env.example documents every variable needed.

Step 4: The CI/CD Pipeline

This is the most complex part of the task: six stages, all in a single GitHub Actions workflow, each one blocking the next on failure. Let me walk through each stage and the problems I hit.

Stage 1: Lint

Three linters run here: flake8 for Python, eslint for JavaScript, and hadolint for Dockerfiles.

PEP 8 issues: flake8 immediately complained about missing blank lines and missing newlines at the ends of files. These are real code style violations, not noise. Fixed by adding proper spacing before top-level function definitions.

ESLint v9 incompatibility: ESLint v9 dropped support for the .eslintrc.json config format and requires the new eslint.config.js format. Rather than rewriting the config, I pinned ESLint to v8 in the pipeline (npm install -g eslint@8). This is a pragmatic fix: chasing the latest ESLint config format mid-task would have been a rabbit hole.
Hadolint warnings: hadolint complained that apk add and apt-get install did not pin package versions. This is a legitimate concern in some contexts, but for base packages it is standard practice to take the latest. I created a .hadolint.yml that ignores those specific rules (DL3008 and DL3018).

Stage 2: Test

The requirement: at least three unit tests for the API with Redis mocked, plus a coverage report uploaded as an artifact. I wrote 12 tests covering all three endpoints and multiple edge cases.

The tricky one was testing the health check failure path. My first attempt set the mock's side effect to a generic Exception. That did not work, because api/main.py specifically catches redis.ConnectionError, not a generic exception: the test was raising the wrong exception type, so the except block was never hit. The fix was to import redis and use redis.ConnectionError as the side effect. This is exactly the kind of thing unit tests are supposed to catch; the test found a gap between what I expected the code to do and what it actually did.

Stage 3: Build

All three images are built and pushed to a local Docker registry running as a service container within the job. Each image is tagged with both latest and the short git SHA, so every build is traceable. After pushing, the images are saved as a tar artifact using docker save and uploaded, so the subsequent security scan and integration test stages can load the exact same images without rebuilding.

Stage 4: Security Scan

Trivy scans all three images for vulnerabilities and fails the pipeline on any CRITICAL finding. This stage produced the most interesting debugging of the whole task.

First problem: using format: sarif with exit-code: 1 caused Trivy to fail the job, but the SARIF output gave no human-readable information about what failed. I temporarily switched to format: table with exit-code: 0 just to see what was actually being found. This revealed the actual CVEs.
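The narrow-except pitfall is easy to reproduce in isolation. This sketch is illustrative rather than the repo's test code; it uses Python's builtin ConnectionError as a dependency-free stand-in for redis.ConnectionError, and the `health` function is a hypothetical simplification of the API's handler:

```python
from unittest.mock import MagicMock

# Stand-in for the API's health handler: it only catches the specific
# connection-error type, mirroring how api/main.py catches
# redis.ConnectionError rather than a bare Exception.
def health(client):
    try:
        client.ping()
        return {"status": "healthy"}
    except ConnectionError:
        return {"status": "unhealthy"}

mock_redis = MagicMock()

# Wrong exception type: a generic Exception is NOT caught by the
# handler's narrow except clause, so it propagates out of health().
mock_redis.ping.side_effect = Exception("connection refused")
try:
    health(mock_redis)
    generic_was_caught = True
except Exception:
    generic_was_caught = False  # the failure path was never exercised

# Correct exception type: the except block is actually hit and the
# unhealthy branch is exercised.
mock_redis.ping.side_effect = ConnectionError("connection refused")
result = health(mock_redis)
```

With the generic Exception the test never reaches the code path it claims to cover, which is exactly the gap described above.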
Second problem: I had already switched from Debian-based python:3.11-slim to python:3.11-alpine to reduce vulnerabilities, but Trivy still flagged two CVEs, in wheel 0.45.1 and jaraco.context 5.3.0. The root cause was non-obvious: Trivy was not finding these among my installed packages. It was finding them in vendored metadata directories inside setuptools. Setuptools bundles older versions of some packages internally for its own use and ships their .dist-info metadata; Trivy sees those metadata directories and reports them as installed packages. The fix was to upgrade setuptools itself to a version with updated vendored copies, and then explicitly remove those vendored dist-info paths in the Dockerfile. I verified the fix by running the image locally and searching for the directories: empty output confirmed the vendored metadata was gone.

Stage 5: Integration Test

The full stack is brought up inside the GitHub Actions runner, a job is submitted through the frontend, the pipeline polls until it completes, asserts the final status is completed, then tears everything down.

One issue hit here: my integration test was trying to reach the API at localhost:8000, but the API port was not exposed to the host. The API only exists on the internal Docker network. Fixed by using docker compose exec to run the health check from inside the container network. The if: always() condition on the teardown step ensures the stack is brought down cleanly regardless of whether the test passed or failed.

Stage 6: Deploy

The deploy job only runs on pushes to main. It SSH-es into the Oracle Cloud server and performs a rolling update:

- Pull the latest code from main
- Build new images
- Bring up the API first with --force-recreate
- Wait up to 60 seconds for the API health check to pass
- If it passes, bring up the worker and frontend
- If the 60-second timeout expires, abort and leave the old container running

The key insight here is --force-recreate instead of --scale. The --scale approach requires removing container_name from the compose file and needs a load balancer to be meaningful. --force-recreate is simpler and more honest for a single-instance deployment.

One problem I hit during deployment: port 3000 was already in use on the server.
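The poll-until-complete step of the integration test can be sketched as a small bounded loop. The function and parameter names here are hypothetical, not taken from the actual workflow; the status getter is injected so the same loop can hit a real HTTP endpoint in CI or a stub in tests:

```python
import time

def poll_until_complete(get_status, timeout=60.0, interval=0.5,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until it reports a terminal state or the
    deadline passes. Raises TimeoutError if the job never finishes."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in ("completed", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError("job still %r after %.0fs" % (status, timeout))
        sleep(interval)
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting, and the explicit deadline guarantees the CI job fails fast instead of hanging until the runner's own timeout.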
The hng-api.service systemd unit from Stage 1 was still running on that port. Fixed by stopping and disabling it with systemctl.

The SSH authentication for GitHub Actions also needed careful setup. I generated a dedicated ed25519 key pair for this, added the public key to the server's ~/.ssh/authorized_keys, and added the private key contents as a GitHub Actions secret called SERVER_SSH_KEY.

Final Verification

After a successful pipeline run, all services are up on the server. The full stack can also be brought up on any clean machine: clone the repo, copy .env.example to .env, fill in your values, and run docker compose up -d.

Stage 2 introduced a set of concepts that show up in every real production engineering team. The hardest bugs in this task were not the obvious ones. Hardcoded localhost is easy to spot; Trivy flagging vendored metadata inside setuptools is not. That is the difference between reading code and understanding what runs in production.

Stage 3 is next. Follow along as I keep documenting the journey. Find me on Dev.to | GitHub


Appendix: Snippets Referenced in the Article

API Dockerfile (multi-stage, Alpine, non-root user, health check):

```dockerfile
# Build stage
FROM python:3.11-alpine AS builder
WORKDIR /app
RUN apk upgrade --no-cache && \
    apk add --no-cache gcc musl-dev linux-headers
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# Production stage
FROM python:3.11-alpine AS production
RUN apk upgrade --no-cache && \
    addgroup -S appuser && adduser -S appuser -G appuser
WORKDIR /app
COPY --from=builder /root/.local /home/appuser/.local
ENV PATH=/home/appuser/.local/bin:$PATH
COPY main.py .
RUN chown -R appuser:appuser /app /home/appuser/.local
USER appuser
EXPOSE 8000
HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Worker HEALTHCHECK (verifies the heartbeat file is fresher than 60 seconds):

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
    CMD test -f /tmp/worker_healthy && \
        test $(( $(date +%s) - $(stat -c %Y /tmp/worker_healthy) )) -lt 60 || exit 1
```

Heartbeat write in worker.py's main loop:

```python
with open('/tmp/worker_healthy', 'w') as f:
    f.write(str(time.time()))
```

Frontend dependency install without a lockfile:

```dockerfile
RUN npm install --omit=dev
```

Health-based dependencies in docker-compose.yml:

```yaml
depends_on:
  redis:
    condition: service_healthy
  api:
    condition: service_healthy
```

Environment-driven config in docker-compose.yml:

```yaml
environment:
  - REDIS_HOST=redis
  - REDIS_PORT=6379
  - REDIS_PASSWORD=${REDIS_PASSWORD}
```

.env.example:

```
REDIS_HOST=redis
REDIS_PORT=6379
REDIS_PASSWORD=your_redis_password_here
API_HOST=0.0.0.0
API_PORT=8000
FRONTEND_PORT=3000
API_URL=http://api:8000
APP_ENV=production
```

Pinning ESLint to v8 in the workflow:

```yaml
run: npm install -g eslint@8
```

.hadolint.yml:

```yaml
ignored:
  - DL3008
  - DL3018
```

Mocking the wrong vs. the right exception type:

```python
# Wrong: api/main.py catches redis.ConnectionError, so this is never hit
mock_redis.ping.side_effect = Exception("connection refused")

# Right
import redis
mock_redis.ping.side_effect = redis.ConnectionError("connection refused")
```

Local registry as a job service container:

```yaml
services:
  registry:
    image: registry:2
    ports:
      - 5000:5000
```

Vendored metadata paths Trivy flagged:

```
/usr/local/lib/python3.11/site-packages/setuptools/_vendor/wheel-0.45.1.dist-info
/usr/local/lib/python3.11/site-packages/setuptools/_vendor/jaraco.context-5.3.0.dist-info
```

Upgrading setuptools and removing the vendored dist-info directories:

```dockerfile
RUN pip install --upgrade setuptools==80.9.0 && \
    find /usr/local/lib/python3.11/site-packages/setuptools/_vendor \
        -type d \( -name 'wheel-0.45.1*' -o -name 'jaraco.context-5.3.0*' \) \
        -exec rm -rf {} + 2>/dev/null || true
```

Verifying the fix locally:

```shell
docker run --rm local-api-debug sh -lc \
  "find / -type d -name '*wheel*0.45.1*' 2>/dev/null | head -20"
```

Integration test health check from inside the container network:

```shell
docker compose exec -T api python -c \
  "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"
```

Freeing port 3000 on the server:

```shell
sudo systemctl stop hng-api.service
sudo systemctl disable hng-api.service
```

Generating the deploy key pair:

```shell
ssh-keygen -t ed25519 -C "github-actions" -f ~/.ssh/github_actions
```

Deploy output after a successful run:

```
API healthy after 5s
Container hng14-stage2-devops-worker-1 Started
Container hng14-stage2-devops-frontend-1 Started
Deployment complete
```

Bringing the stack up on a clean machine:

```shell
git clone https://github.com/GideonBature/hng14-stage2-devops.git
cd hng14-stage2-devops
cp .env.example .env
# Edit .env with your values
docker compose up -d
```