Tools: OpenClaw Guide Ch8: Monitoring and Debugging
2026-02-14
admin
Chapter 8: Monitoring and Debugging

Learning Objective: Build a comprehensive OpenClaw monitoring system, master performance debugging techniques, and implement rapid fault diagnosis with automated operations.

Chapter outline:
- Monitoring System Overview
- Monitoring Architecture
  - 8.1 Monitoring Layer Model
- Built-in OpenClaw Monitoring
  - 8.3 Gateway Status Monitoring
    - Basic Status Queries
    - Key Metrics
  - 8.4 Health Checks
    - Automated Health Checks
- Log Management
  - 8.5 Log Configuration
    - Log Viewing Commands
  - 8.6 ELK Stack Integration
  - 8.7 Structured Logging Best Practices
- Performance Monitoring and Debugging
  - 8.8 Prometheus Integration
    - Custom Metrics Endpoint
  - 8.9 Performance Debugging Tools
- Alerting and Notification
  - 8.10 Alert Rule Configuration
    - Prometheus Alert Rules
  - 8.11 Intelligent Alerting Strategy
    - Tiered Alerting and Escalation
- Visualization Dashboards
  - 8.12 Grafana Dashboard
- Fault Diagnosis and Debugging
  - 8.14 Diagnostic Decision Tree
    - Automated Diagnostic Script
- Automated Operations
  - 8.16 Self-Healing System
- Chapter Summary
  - Key Takeaways
  - Monitoring Checklist
  - Practice Tips
- Related Resources

## Monitoring System Overview

A production-grade OpenClaw deployment requires comprehensive monitoring across four areas: real-time monitoring, log management, alerting, and visualization.

## Monitoring Architecture

## 8.1 Monitoring Layer Model

CODE_BLOCK:
┌─────────────────────────────────────────┐
│         Application Monitoring          │
│   Agent perf | Session status | Tools   │
├─────────────────────────────────────────┤
│           Gateway Monitoring            │
│   Connections | Latency | Throughput    │
├─────────────────────────────────────────┤
│            System Monitoring            │
│      CPU | Memory | Disk | Network      │
├─────────────────────────────────────────┤
│        Infrastructure Monitoring        │
│       Servers | Network | Storage       │
└─────────────────────────────────────────┘

## Built-in OpenClaw Monitoring

## 8.3 Gateway Status Monitoring

## Basic Status Queries

COMMAND_BLOCK:
# Full system status
openclaw status

# Detailed monitoring info
openclaw status --all --deep

# JSON output (for scripting)
openclaw status --json

## Key Metrics

CODE_BLOCK:
{
  "gateway": {
    "uptime": "72h 15m",
    "version": "2026.2.9",
    "connections": 42,
    "requests_total": 15847,
    "requests_per_minute": 23.4,
    "memory_usage": "512MB",
    "cpu_usage": "15%"
  },
  "agents": {
    "total": 8,
    "active": 6,
    "sessions": 299,
    "avg_response_time": "1.2s"
  }
}

## 8.4 Health Checks

## Automated Health Checks

COMMAND_BLOCK:
# Run a full health check
openclaw doctor --non-interactive

# Check specific components
openclaw doctor --check-channels
openclaw doctor --check-models
openclaw doctor --check-security

## Log Management

## 8.5 Log Configuration

CODE_BLOCK:
{
  "logging": {
    "level": "info",
    "format": "structured",
    "outputs": [
      {
        "type": "file",
        "path": "/var/log/openclaw/gateway.log",
        "rotation": {
          "maxSize": "100MB",
          "maxFiles": 10,
          "compress": true
        }
      }
    ]
  }
}

## Log Viewing Commands

COMMAND_BLOCK:
# Real-time log stream
openclaw logs --follow

# Filter error logs
openclaw logs --level error --since "1h"

# Agent-specific logs
openclaw logs --agent main --limit 100

# Filter by channel
openclaw logs --channel telegram --since "2026-02-13"

## 8.6 ELK Stack Integration

COMMAND_BLOCK:
# docker-compose-logging.yml
version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"

  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
      - /var/log/openclaw:/logs:ro
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

## 8.7 Structured Logging Best Practices

CODE_BLOCK:
{
  "timestamp": "2026-02-13T11:02:45.123Z",
  "level": "INFO",
  "component": "gateway",
  "agent_id": "main",
  "session_id": "abc123",
  "channel": "telegram",
  "action": "message_received",
  "duration_ms": 1250,
  "metadata": {
    "tool_calls": 3,
    "tokens_used": 1847,
    "model": "claude-sonnet-4"
  }
}

## Performance Monitoring and Debugging

## 8.8 Prometheus Integration

COMMAND_BLOCK:
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'openclaw-gateway'
    static_configs:
      - targets: ['localhost:18789']
    metrics_path: '/metrics'
    scrape_interval: 10s

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

## Custom Metrics Endpoint

COMMAND_BLOCK:
# OpenClaw Prometheus metrics endpoint
curl http://localhost:18789/metrics

# Example metrics output
openclaw_gateway_requests_total{channel="telegram"} 15847
openclaw_gateway_response_time_seconds{quantile="0.5"} 1.2
openclaw_gateway_response_time_seconds{quantile="0.95"} 3.5
openclaw_agent_sessions_active{agent="main"} 12
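
When scripting against the metrics endpoint, it can help to pull a single sample out of the text output. The following is a minimal sketch, assuming the endpoint returns plain Prometheus text-format lines like the example output; `metric_value` is a hypothetical helper written for illustration, not part of the OpenClaw CLI, and it assumes label values contain no spaces.

```shell
#!/bin/bash
# metric_value: hypothetical helper, not part of the OpenClaw CLI.
# Reads Prometheus text-format lines on stdin and prints the value of the
# first sample whose name-plus-labels field matches $1 exactly.
# Assumes label values contain no whitespace.
metric_value() {
  awk -v want="$1" '
    /^#/ { next }                  # skip HELP/TYPE comment lines
    $1 == want { print $2; exit }  # field 1: name{labels}, field 2: value
  '
}

# Sample input copied from the example metrics output in this section.
sample='openclaw_gateway_requests_total{channel="telegram"} 15847
openclaw_gateway_response_time_seconds{quantile="0.5"} 1.2
openclaw_gateway_response_time_seconds{quantile="0.95"} 3.5
openclaw_agent_sessions_active{agent="main"} 12'

p95=$(printf '%s\n' "$sample" |
  metric_value 'openclaw_gateway_response_time_seconds{quantile="0.95"}')
echo "p95 response time: ${p95}s"
```

Against a live gateway, the same function could read from `curl -s http://localhost:18789/metrics` instead of the sample string.
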

## 8.9 Performance Debugging Tools

COMMAND_BLOCK:
# Agent performance profiling
openclaw agent --profile --message "test query"

# Memory search performance test
openclaw memory benchmark --queries 1000

# Gateway load test
openclaw gateway --benchmark --duration 60s

## Alerting and Notification

## 8.10 Alert Rule Configuration

## Prometheus Alert Rules

COMMAND_BLOCK:
# openclaw-alerts.yml
groups:
  - name: openclaw.rules
    rules:
      - alert: HighErrorRate
        expr: rate(openclaw_gateway_errors_total[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"

      - alert: HighResponseTime
        expr: openclaw_gateway_response_time_seconds{quantile="0.95"} > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time"

      - alert: AgentDown
        expr: openclaw_agent_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Agent is down"
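
The `HighErrorRate` expression compares a per-second rate against 0.1, which is easy to misread as a percentage. As a rough illustration of the arithmetic only, the sketch below divides a counter increase by the window length; it ignores the counter-reset handling and extrapolation that Prometheus applies, and the helper names and sample numbers are made up for the example.

```shell
#!/bin/bash
# per_second_rate: hypothetical helper showing the arithmetic behind
# rate(counter[window]); ignores counter resets and Prometheus extrapolation.
# Args: earlier counter value, later counter value, window length in seconds.
per_second_rate() {
  awk -v a="$1" -v b="$2" -v secs="$3" 'BEGIN { printf "%.3f\n", (b - a) / secs }'
}

# exceeds: exits 0 when $1 > $2, using awk for the floating-point comparison.
exceeds() {
  awk -v x="$1" -v limit="$2" 'BEGIN { exit !(x > limit) }'
}

# Illustrative numbers: 45 new errors during a 5-minute (300 s) window.
rate=$(per_second_rate 1200 1245 300)
echo "error rate: ${rate}/s"

if exceeds "$rate" 0.1; then
  echo "HighErrorRate condition is true"
else
  echo "HighErrorRate condition is false"
fi
```

Read this way, a threshold of 0.1 corresponds to more than 30 errors in a 5-minute window, regardless of how many requests succeeded in that time.
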

## 8.11 Intelligent Alerting Strategy

## Tiered Alerting and Escalation

CODE_BLOCK:
{
  "alerting": {
    "levels": [
      { "name": "info", "channels": ["log"], "escalation": false },
      {
        "name": "warning",
        "channels": ["slack", "email"],
        "escalation": { "after": "15m", "to": "critical" }
      },
      {
        "name": "critical",
        "channels": ["telegram", "phone", "pager"],
        "escalation": { "after": "5m", "to": "emergency" }
      }
    ]
  }
}
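
To make the escalation timings concrete, here is a small sketch of how an unacknowledged alert might move through these tiers. It is an illustration rather than OpenClaw behavior: `effective_level` is a hypothetical helper, and it assumes the escalation timer restarts each time an alert moves up a tier, which the configuration does not state.

```shell
#!/bin/bash
# effective_level: hypothetical helper, not an OpenClaw feature.
# Given an alert's starting level and the minutes it has gone unacknowledged,
# prints the level it would currently be at under the tiers in this section:
#   warning  -> critical  after 15 minutes
#   critical -> emergency after 5 more minutes
# Assumption: the timer restarts at each tier.
effective_level() {
  local level="$1" minutes="$2"
  while true; do
    case "$level" in
      warning)
        [ "$minutes" -ge 15 ] || break
        level="critical"
        minutes=$(( minutes - 15 ))
        ;;
      critical)
        [ "$minutes" -ge 5 ] || break
        level="emergency"
        minutes=$(( minutes - 5 ))
        ;;
      *)
        break  # info never escalates; emergency has no higher tier
        ;;
    esac
  done
  echo "$level"
}

effective_level warning 10   # still inside the 15-minute window
effective_level warning 17   # escalated once
effective_level warning 20   # escalated twice
effective_level info 120     # info does not escalate
```

Note that the configuration escalates `critical` alerts to an `emergency` level that is not defined in `levels`, so a real setup would presumably need a matching entry with its own channels.
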

## Visualization Dashboards

## 8.12 Grafana Dashboard

CODE_BLOCK:
{
  "dashboard": {
    "title": "OpenClaw System Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(openclaw_gateway_requests_total[5m])",
            "legendFormat": "{{channel}}"
          }
        ]
      },
      {
        "title": "Response Time Percentiles",
        "type": "graph",
        "targets": [
          {
            "expr": "openclaw_gateway_response_time_seconds{quantile=\"0.5\"}",
            "legendFormat": "p50"
          },
          {
            "expr": "openclaw_gateway_response_time_seconds{quantile=\"0.95\"}",
            "legendFormat": "p95"
          }
        ]
      },
      {
        "title": "Active Sessions",
        "type": "singlestat",
        "targets": [
          { "expr": "sum(openclaw_agent_sessions_active)" }
        ]
      }
    ]
  }
}

## Fault Diagnosis and Debugging

## 8.14 Diagnostic Decision Tree

CODE_BLOCK:
Fault Report
│
├── User cannot access?
│   ├── Check Gateway status → openclaw status
│   ├── Check channel connections → openclaw status --channels
│   └── Check network connectivity → ping/traceroute
│
├── Slow response?
│   ├── Check system load → top/htop
│   ├── Check Agent performance → openclaw logs --level performance
│   └── Check memory usage → openclaw memory stats
│
└── Feature malfunction?
    ├── Check error logs → openclaw logs --level error
    ├── Check configuration → openclaw doctor
    └── Check model status → openclaw status --models

## Automated Diagnostic Script

COMMAND_BLOCK:
#!/bin/bash
# openclaw-troubleshoot.sh

echo "OpenClaw Automated Diagnostics"
echo "================================="

# 1. Basic connectivity check
echo "Checking Gateway connectivity..."
# -f makes curl exit non-zero on HTTP error statuses, not only on connection failures
if ! curl -sf http://localhost:18789/health > /dev/null; then
  echo "❌ Gateway not responding, checking service status"
  systemctl --user status openclaw-gateway
  exit 1
fi
echo "✅ Gateway running normally"

# 2. Channel status check
echo "Checking channel status..."
CHANNELS=$(openclaw status --json | jq -r '.channels | to_entries[] | select(.value.status != "OK") | .key')
if [[ -n "$CHANNELS" ]]; then
  echo "⚠️ Channel issues found: $CHANNELS"
else
  echo "✅ All channels healthy"
fi

# 3. Performance metrics check
echo "⚡ Checking performance metrics..."
echo " - System load: $(uptime)"
echo " - Memory usage: $(free -h | grep Mem | awk '{print $3"/"$2}')"

# 4. Error log analysis
echo "Checking recent errors..."
ERROR_COUNT=$(openclaw logs --level error --since "1h" --json | jq '. | length')
# Default to 0 so an empty result (for example, a failed command) is handled explicitly
if [[ "${ERROR_COUNT:-0}" -gt 10 ]]; then
  echo "⚠️ Found $ERROR_COUNT errors. Recent errors:"
  openclaw logs --level error --limit 5
fi

## Automated Operations

## 8.16 Self-Healing System

COMMAND_BLOCK:
#!/bin/bash
# auto-heal.sh: OpenClaw self-healing script

HEALTH_CHECK_URL="http://localhost:18789/health"
MAX_RETRIES=3
# The alert step posts to ALERT_WEBHOOK; fail fast if it is not set in the environment.
: "${ALERT_WEBHOOK:?ALERT_WEBHOOK must be set}"

check_health() {
  local response
  # --max-time keeps a hung gateway from blocking the loop
  response=$(curl -s --max-time 10 -w "%{http_code}" -o /dev/null "$HEALTH_CHECK_URL")
  [[ "$response" == "200" ]]
}

restart_gateway() {
  echo "$(date): Anomaly detected, preparing to restart..."
  openclaw gateway stop --graceful --timeout 30s
  sleep 5
  openclaw gateway start --background
  sleep 10
  if check_health; then
    echo "$(date): Gateway restart successful"
    return 0
  else
    echo "$(date): Gateway restart failed"
    return 1
  fi
}

# Main loop
while true; do
  if ! check_health; then
    for i in $(seq 1 "$MAX_RETRIES"); do
      echo "$(date): Restart attempt ($i/$MAX_RETRIES)"
      if restart_gateway; then break; fi
      if [[ $i -eq $MAX_RETRIES ]]; then
        echo "$(date): Restart failed, sending alert"
        curl -s -X POST "$ALERT_WEBHOOK" \
          -H "Content-Type: application/json" \
          -d '{"level":"critical","message":"OpenClaw Gateway restart failed"}'
      fi
      sleep 300
    done
  fi
  sleep 60
done

Comprehensive production monitoring spans four areas:
- Real-Time Monitoring: System status, performance metrics, error rates
- Log Management: Structured logging, centralized collection, intelligent analysis
- Alerting: Anomaly detection, tiered alerts, automated response
- Visualization: Dashboards, trend analysis, capacity planning

## Chapter Summary

## Key Takeaways
- Layered Monitoring: Application, Gateway, System, Infrastructure layers
- Full Observability: Metrics, Logs, Traces (the three pillars)
- Intelligent Alerting: Tiered alerts, escalation, silence windows
- Automated Ops: Self-healing, auto-scaling, backup & recovery

## Monitoring Checklist
- ✅ Basic Metrics: CPU, memory, disk, network
- ✅ Application Metrics: Request rate, response time, error rate, concurrency
- ✅ Business Metrics: User activity, token usage, channel distribution
- ✅ Security Metrics: Auth failures, anomalous access, permission changes

## Practice Tips
- Start with basic monitoring, then progressively enhance
- Establish standardized alert response procedures
- Regularly conduct disaster recovery drills
- Continuously optimize monitoring strategies and alert thresholds

Related Resources:
- OpenClaw Monitoring Docs
- Prometheus OpenClaw Exporter
- Grafana Dashboard Templates