+-------------+          +-------------+
| Service A   |--------->| Service B   |
| (no         |   HTTP   | (strict     |
| validation) |          | validation) |
+------+------+          +------+------+
       |                        |
       | logs errors            | rejects bad data
       v                        v
+----------------------------------+
|          AI Orchestrator         |
|                                  |
|  1. tail logs from all services  |
|  2. regex match on error pattern |
|  3. build prompt with context    |
|  4. call LLM for a code fix      |
|  5. apply patch, rebuild, verify |
+----------------------------------+
                 |
                 v
           +-----------+
           |  MongoDB  |
           +-----------+
# Stream logs from all monitored services
docker compose logs --tail 0 -f service_a service_b mongodb | while read -r line; do
  # Append to a rolling buffer (keeps last N lines for context)
  echo "$line" >> "$BUFFER_FILE"

  # Check if this line matches our error pattern
  if echo "$line" | grep -Eq "$ERROR_REGEX"; then
    # Hash the line to avoid retriggering on the same error
    signature=$(echo "$line" | sha256sum | awk '{print $1}')
    if should_trigger "$signature"; then
      echo "Detected error. Triggering AI fix."
      run_ai_fix "$line" "$signature"
    fi
  fi
done
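The `should_trigger` helper isn't shown in the loop above. A minimal sketch of the same idea in Python, debouncing by error signature with a cooldown window (the names, cooldown value, and in-memory store are illustrative assumptions, not the actual implementation):

```python
import hashlib
import time
from typing import Optional

_SEEN: dict = {}          # signature -> timestamp of last trigger
COOLDOWN_SECONDS = 600    # don't re-fix the same error within 10 minutes

def signature_of(line: str) -> str:
    """Hash a log line so identical errors share one signature."""
    return hashlib.sha256(line.encode()).hexdigest()

def should_trigger(signature: str, now: Optional[float] = None) -> bool:
    """Return True only if this signature hasn't fired recently."""
    now = time.time() if now is None else now
    last = _SEEN.get(signature)
    if last is not None and now - last < COOLDOWN_SECONDS:
        return False
    _SEEN[signature] = now
    return True
```

The point of the cooldown is to stop a crash-loop from spawning a new LLM fix attempt for every repeated log line while a previous fix is still building.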
You are debugging a running backend service inside Docker.

Detected error pattern: transfer_remote_rejections
Matched log line: [truncated to ~1400 chars]

Recent log context (last 30 lines):
Task:
- The receiving service rejects records with unexpected payload shapes.
- Fix the validation/normalization code to handle these variants:
  - Numbers wrapped as {"$numberInt": "42"} or {"$numberLong": "999"}
  - Object keys with inconsistent casing (e.g., "Category" vs "category")
  - Nested objects serialized as JSON strings instead of dicts
  - Nested objects sent as a list of {key, value} pairs
- Only modify the receiving service's code. Preserve the API contract.
- Rebuild the container, run the transfer again, verify counts.
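To make the four variants in the task concrete, here are hypothetical payload fragments for each shape (field names match the examples in the prompt; the values are made up):

```python
# Four malformed shapes the receiving service must accept.
variants = {
    # 1. Numbers wrapped in MongoDB extended JSON
    "wrapped_number": {"long_value": {"$numberLong": "999"}},
    # 2. Inconsistent key casing
    "mixed_case": {"object_values": {"Category": "ALPHA"}},
    # 3. Nested object serialized as a JSON string
    "json_string": {"object_values": '{"category": "ALPHA"}'},
    # 4. Nested object as a list of {key, value} pairs
    "kv_pairs": {"object_values": [{"key": "category", "value": "ALPHA"}]},
}
```

All four should normalize to the same canonical shape: plain ints and lowercase-keyed dicts.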
MAX_RETRIES=3
attempt=1

while [ "$attempt" -le "$MAX_RETRIES" ]; do
  echo "Fix attempt $attempt/$MAX_RETRIES"

  # Run the LLM with a timeout
  timeout 900 run_llm_fix < "$PROMPT_FILE"
  exit_code=$?

  if [ "$exit_code" -eq 0 ]; then
    echo "Fix succeeded on attempt $attempt"
    break
  fi

  attempt=$((attempt + 1))
  sleep 2
done
# docker-compose.yml (simplified)
services:
  ai_orchestrator:
    volumes:
      - ssh_keys:/shared-keys      # writes the keypair here
      - /var/run/docker.sock:/var/run/docker.sock
  service_a:
    volumes:
      - ssh_keys:/shared-keys:ro   # reads the public key
  service_b:
    volumes:
      - ssh_keys:/shared-keys:ro

volumes:
  ssh_keys:
ssh service_a "tail -n 50 /var/log/app/service.log"
ssh service_b "cat /app/main.py"
#!/bin/sh
LOG_FILE="/var/log/app/service.log"
mkdir -p "$(dirname "$LOG_FILE")"

# Run the actual command, redirect all output to the log file
"$@" >> "$LOG_FILE" 2>&1 &
MAIN_PID=$!

# Tail the log file to stdout (so docker logs still works)
tail -n +1 -F "$LOG_FILE" &
TAIL_PID=$!

wait "$MAIN_PID"
kill "$TAIL_PID" 2>/dev/null
{
  "long_value": {"$numberLong": "900000000000000001"},
  "object_values": "{\"category\": \"ALPHA\", \"quality\": \"HIGH\", \"multiplier\": 2}"
}
import json

def normalize_payload(raw: dict) -> dict:
    """Unwrap MongoDB extended JSON and normalize shapes."""
    # Handle {"$numberLong": "..."} and {"$numberInt": "..."} wrappers
    for field in ["long_value", "short_value", "integer_value"]:
        val = raw.get(field)
        if isinstance(val, dict):
            raw[field] = int(val.get("$numberLong") or val.get("$numberInt", 0))

    # Handle object_values as a JSON string
    obj = raw.get("object_values")
    if isinstance(obj, str):
        obj = json.loads(obj)
        raw["object_values"] = obj

    # Handle object_values as [{key, value}, ...] list
    if isinstance(obj, list):
        raw["object_values"] = {item["key"]: item["value"] for item in obj}
        obj = raw["object_values"]

    # Normalize mixed-case keys
    if isinstance(obj, dict):
        normalized = {k.lower(): v for k, v in obj.items()}
        raw["object_values"] = normalized

    return raw
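A quick sanity check against a record like the one from the failing transfer. The function is repeated here so the snippet runs standalone; the input record is a hand-built example, not captured production data:

```python
import json

def normalize_payload(raw: dict) -> dict:
    """Unwrap MongoDB extended JSON and normalize shapes."""
    # Unwrap {"$numberLong": "..."} / {"$numberInt": "..."}
    for field in ["long_value", "short_value", "integer_value"]:
        val = raw.get(field)
        if isinstance(val, dict):
            raw[field] = int(val.get("$numberLong") or val.get("$numberInt", 0))

    # object_values as a JSON string -> dict
    obj = raw.get("object_values")
    if isinstance(obj, str):
        obj = json.loads(obj)
        raw["object_values"] = obj

    # object_values as [{key, value}, ...] -> dict
    if isinstance(obj, list):
        raw["object_values"] = {item["key"]: item["value"] for item in obj}
        obj = raw["object_values"]

    # Lowercase mixed-case keys
    if isinstance(obj, dict):
        raw["object_values"] = {k.lower(): v for k, v in obj.items()}

    return raw

record = {
    "long_value": {"$numberLong": "900000000000000001"},
    "object_values": '{"Category": "ALPHA", "quality": "HIGH", "multiplier": 2}',
}
out = normalize_payload(record)
# The wrapped long becomes a plain int, and the stringified object
# becomes a dict with lowercase keys.
assert out["long_value"] == 900000000000000001
assert out["object_values"] == {"category": "ALPHA", "quality": "HIGH", "multiplier": 2}
```

Note the int stays exact: 900000000000000001 is past the 2^53 range where JSON float round-tripping would silently corrupt it, which is presumably why the source wraps it as `$numberLong` in the first place.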
source_total: 1000
transferred: 996
rejected: 4
source_total: 1000
transferred: 1000
rejected: 0
- Service A holds raw records. It doesn't validate them; it just stores whatever it gets.
- Service B is the strict one. It receives records from Service A, validates every field against business rules, and writes the good ones to a separate database. Bad records get rejected.
- The AI container sits alongside them. It has access to the Docker socket, can SSH into the other containers, and tails their logs in real time.

- Tell the model exactly which file to modify. Don't let it go exploring the whole repo. In my case, the fix always lives in the receiving service's main application file.
- List the variant shapes explicitly. The model can't guess what "malformed" means in your context. Be specific about what the data looks like and what it should be normalized into.
- Include the verification step. The prompt doesn't just say "fix the code"; it says "fix the code, rebuild, re-run the transfer, check the counts." The AI needs to know when it's done.

- Staging environments where you want fast iteration on integration bugs
- Demo environments that need to self-recover when data gets messy
- Data pipelines where upstream systems send unpredictable payloads and you need the receiving end to adapt
- Internal tools where the cost of an hour of downtime is higher than the risk of an automated fix

- 2x FastAPI services (Python, one stores data, one validates it)
- 1x Init container (seeds test data with intentional malformed records)
- 1x AI orchestrator (tails logs, calls LLM, applies fixes)