Tools: OpenClaw Guide Ch8: Monitoring and Debugging

2026-02-14 0 views admin

Source: Dev.to

Chapter 8: Monitoring and Debugging

🎯 Learning Objective: Build a comprehensive OpenClaw monitoring system, master performance debugging techniques, and implement rapid fault diagnosis with automated operations.

## 📊 Monitoring System Overview

A production-grade OpenClaw deployment requires comprehensive monitoring:

- 🔍 Real-Time Monitoring: System status, performance metrics, error rates
- 📝 Log Management: Structured logging, centralized collection, intelligent analysis
- ⚠️ Alerting: Anomaly detection, tiered alerts, automated response
- 📈 Visualization: Dashboards, trend analysis, capacity planning

## 🏗️ Monitoring Architecture

## 8.1 Monitoring Layer Model

```text
┌─────────────────────────────────────────┐
│         Application Monitoring          │
│   Agent perf | Session status | Tools   │
├─────────────────────────────────────────┤
│           Gateway Monitoring            │
│   Connections | Latency | Throughput    │
├─────────────────────────────────────────┤
│            System Monitoring            │
│      CPU | Memory | Disk | Network      │
├─────────────────────────────────────────┤
│        Infrastructure Monitoring        │
│       Servers | Network | Storage       │
└─────────────────────────────────────────┘
```

## 📋 Built-in OpenClaw Monitoring

## 8.3 Gateway Status Monitoring

## Basic Status Queries

```bash
# Full system status
openclaw status

# Detailed monitoring info
openclaw status --all --deep

# JSON output (for scripting)
openclaw status --json
```

## Key Metrics

```json
{
  "gateway": {
    "uptime": "72h 15m",
    "version": "2026.2.9",
    "connections": 42,
    "requests_total": 15847,
    "requests_per_minute": 23.4,
    "memory_usage": "512MB",
    "cpu_usage": "15%"
  },
  "agents": {
    "total": 8,
    "active": 6,
    "sessions": 299,
    "avg_response_time": "1.2s"
  }
}
```

## 8.4 Health Checks

## Automated Health Checks

```bash
# Run a full health check
openclaw doctor --non-interactive

# Check specific components
openclaw doctor --check-channels
openclaw doctor --check-models
openclaw doctor --check-security
```

## 📝 Log Management

## 8.5 Log Configuration

```json
{
  "logging": {
    "level": "info",
    "format": "structured",
    "outputs": [
      {
        "type": "file",
        "path": "/var/log/openclaw/gateway.log",
        "rotation": {
          "maxSize": "100MB",
          "maxFiles": 10,
          "compress": true
        }
      }
    ]
  }
}
```

## Log Viewing Commands

```bash
# Real-time log stream
openclaw logs --follow

# Filter error logs
openclaw logs --level error --since "1h"

# Agent-specific logs
openclaw logs --agent main --limit 100

# Filter by channel
openclaw logs --channel telegram --since "2026-02-13"
```

## 8.6 ELK Stack Integration

```yaml
# docker-compose-logging.yml
version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"

  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
      - /var/log/openclaw:/logs:ro
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
```

## 8.7 Structured Logging Best Practices

```json
{
  "timestamp": "2026-02-13T11:02:45.123Z",
  "level": "INFO",
  "component": "gateway",
  "agent_id": "main",
  "session_id": "abc123",
  "channel": "telegram",
  "action": "message_received",
  "duration_ms": 1250,
  "metadata": {
    "tool_calls": 3,
    "tokens_used": 1847,
    "model": "claude-sonnet-4"
  }
}
```

## 📈 Performance Monitoring and Debugging

## 8.8 Prometheus Integration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'openclaw-gateway'
    static_configs:
      - targets: ['localhost:18789']
    metrics_path: '/metrics'
    scrape_interval: 10s

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
```

## Custom Metrics Endpoint

```bash
# OpenClaw Prometheus metrics endpoint
curl http://localhost:18789/metrics

# Example metrics output
openclaw_gateway_requests_total{channel="telegram"} 15847
openclaw_gateway_response_time_seconds{quantile="0.5"} 1.2
openclaw_gateway_response_time_seconds{quantile="0.95"} 3.5
openclaw_agent_sessions_active{agent="main"} 12
```

## 8.9 Performance Debugging Tools

```bash
# Agent performance profiling
openclaw agent --profile --message "test query"

# Memory search performance test
openclaw memory benchmark --queries 1000

# Gateway load test
openclaw gateway --benchmark --duration 60s
```

## 🚨 Alerting and Notification

## 8.10 Alert Rule Configuration

## Prometheus Alert Rules

```yaml
# openclaw-alerts.yml
groups:
  - name: openclaw.rules
    rules:
      - alert: HighErrorRate
        expr: rate(openclaw_gateway_errors_total[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"

      - alert: HighResponseTime
        expr: openclaw_gateway_response_time_seconds{quantile="0.95"} > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time"

      - alert: AgentDown
        expr: openclaw_agent_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Agent is down"
```

## 8.11 Intelligent Alerting Strategy

## Tiered Alerting and Escalation

```json
{
  "alerting": {
    "levels": [
      {
        "name": "info",
        "channels": ["log"],
        "escalation": false
      },
      {
        "name": "warning",
        "channels": ["slack", "email"],
        "escalation": { "after": "15m", "to": "critical" }
      },
      {
        "name": "critical",
        "channels": ["telegram", "phone", "pager"],
        "escalation": { "after": "5m", "to": "emergency" }
      }
    ]
  }
}
```

## 📊 Visualization Dashboards

## 8.12 Grafana Dashboard

```json
{
  "dashboard": {
    "title": "OpenClaw System Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(openclaw_gateway_requests_total[5m])",
            "legendFormat": "{{channel}}"
          }
        ]
      },
      {
        "title": "Response Time Percentiles",
        "type": "graph",
        "targets": [
          {
            "expr": "openclaw_gateway_response_time_seconds{quantile=\"0.5\"}",
            "legendFormat": "p50"
          },
          {
            "expr": "openclaw_gateway_response_time_seconds{quantile=\"0.95\"}",
            "legendFormat": "p95"
          }
        ]
      },
      {
        "title": "Active Sessions",
        "type": "singlestat",
        "targets": [
          { "expr": "sum(openclaw_agent_sessions_active)" }
        ]
      }
    ]
  }
}
```

## 🔧 Fault Diagnosis and Debugging

## 8.14 Diagnostic Decision Tree

```text
Fault Report →
├── User cannot access?
│   ├── Check Gateway status → openclaw status
│   ├── Check channel connections → openclaw status --channels
│   └── Check network connectivity → ping/traceroute
│
├── Slow response?
│   ├── Check system load → top/htop
│   ├── Check Agent performance → openclaw logs --level performance
│   └── Check memory usage → openclaw memory stats
│
└── Feature malfunction?
    ├── Check error logs → openclaw logs --level error
    ├── Check configuration → openclaw doctor
    └── Check model status → openclaw status --models
```

## Automated Diagnostic Script

```bash
#!/bin/bash
# openclaw-troubleshoot.sh

echo "🔍 OpenClaw Automated Diagnostics"
echo "================================="

# 1. Basic connectivity check
echo "📡 Checking Gateway connectivity..."
if ! curl -s http://localhost:18789/health > /dev/null; then
  echo "❌ Gateway not responding, checking service status"
  systemctl --user status openclaw-gateway
  exit 1
fi
echo "✅ Gateway running normally"

# 2. Channel status check
echo "📱 Checking channel status..."
CHANNELS=$(openclaw status --json | jq -r '.channels | to_entries[] | select(.value.status != "OK") | .key')
if [[ -n "$CHANNELS" ]]; then
  echo "⚠️ Channel issues found: $CHANNELS"
else
  echo "✅ All channels healthy"
fi

# 3. Performance metrics check
echo "⚡ Checking performance metrics..."
echo "  - System load: $(uptime)"
echo "  - Memory usage: $(free -h | grep Mem | awk '{print $3"/"$2}')"

# 4. Error log analysis
echo "📋 Checking recent errors..."
ERROR_COUNT=$(openclaw logs --level error --since "1h" --json | jq '. | length')
if [[ "$ERROR_COUNT" -gt 10 ]]; then
  echo "⚠️ Found $ERROR_COUNT errors. Recent errors:"
  openclaw logs --level error --limit 5
fi
```

## 🤖 Automated Operations

## 8.16 Self-Healing System

```bash
#!/bin/bash
# auto-heal.sh: OpenClaw self-healing script
# Note: ALERT_WEBHOOK is not defined in this script and must be set in the environment.

HEALTH_CHECK_URL="http://localhost:18789/health"
MAX_RETRIES=3

# Succeeds only when the health endpoint answers with HTTP 200
check_health() {
  local response
  response=$(curl -s -w "%{http_code}" -o /dev/null "$HEALTH_CHECK_URL")
  [[ "$response" == "200" ]]
}

# Stops and restarts the gateway, then re-checks health
restart_gateway() {
  echo "$(date): Anomaly detected, preparing to restart..."
  openclaw gateway stop --graceful --timeout 30s
  sleep 5
  openclaw gateway start --background
  sleep 10

  if check_health; then
    echo "$(date): Gateway restart successful"
    return 0
  else
    echo "$(date): Gateway restart failed"
    return 1
  fi
}

# Main loop: check every 60 seconds, retry restarts up to MAX_RETRIES times
while true; do
  if ! check_health; then
    for i in $(seq 1 "$MAX_RETRIES"); do
      echo "$(date): Restart attempt ($i/$MAX_RETRIES)"
      if restart_gateway; then break; fi

      if [[ $i -eq $MAX_RETRIES ]]; then
        echo "$(date): Restart failed, sending alert"
        curl -X POST "$ALERT_WEBHOOK" \
          -d '{"level":"critical","message":"OpenClaw Gateway restart failed"}'
      fi
      sleep 300
    done
  fi
  sleep 60
done
```

## 📋 Chapter Summary

## Key Takeaways

- Layered Monitoring: Application, Gateway, System, Infrastructure layers
- Full Observability: Metrics, Logs, and Traces (the three pillars)
- Intelligent Alerting: Tiered alerts, escalation, silence windows
- Automated Ops: Self-healing, auto-scaling, backup & recovery

## Monitoring Checklist

- ✅ Basic Metrics: CPU, memory, disk, network
- ✅ Application Metrics: Request rate, response time, error rate, concurrency
- ✅ Business Metrics: User activity, token usage, channel distribution
- ✅ Security Metrics: Auth failures, anomalous access, permission changes

## Practice Tips

- Start with basic monitoring, then progressively enhance
- Establish standardized alert response procedures
- Regularly conduct disaster recovery drills
- Continuously optimize monitoring strategies and alert thresholds

🔗 Related Resources:

- OpenClaw Monitoring Docs
- Prometheus OpenClaw Exporter
- Grafana Dashboard Templates
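
As a companion to the HighResponseTime alert rule and the sample `/metrics` output in this chapter, the same p95 threshold can be spot-checked from the shell without a running Prometheus server. The following is only a minimal sketch: the helper names `metric_value` and `exceeds` are hypothetical, the sample text is copied from the example output above rather than fetched live, and it assumes label values contain no spaces.

```bash
#!/bin/bash
# Sketch: spot-check one series from Prometheus text-format metrics.
# metric_value and exceeds are illustrative helpers, not part of OpenClaw.

# Print the value of the series whose name-plus-labels equals $1 (reads stdin).
metric_value() {
  local series="$1"
  # Literal string comparison on the first field; assumes no spaces in labels.
  awk -v s="$series" '$1 == s { print $2; exit }'
}

# Exit 0 when value $1 is strictly greater than threshold $2 (float-safe via awk).
exceeds() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Sample copied from the example metrics output in this chapter.
sample='openclaw_gateway_requests_total{channel="telegram"} 15847
openclaw_gateway_response_time_seconds{quantile="0.5"} 1.2
openclaw_gateway_response_time_seconds{quantile="0.95"} 3.5
openclaw_agent_sessions_active{agent="main"} 12'

p95=$(printf '%s\n' "$sample" |
  metric_value 'openclaw_gateway_response_time_seconds{quantile="0.95"}')

# 5 seconds mirrors the threshold used by the HighResponseTime rule.
if exceeds "$p95" 5; then
  echo "p95 ${p95}s above threshold"
else
  echo "p95 ${p95}s within threshold"
fi
```

Against a live gateway, the `printf` of the sample could be swapped for the `curl http://localhost:18789/metrics` call shown earlier. Unlike the Prometheus rule, this check looks at a single instant and has no equivalent of the `for: 5m` hold period, so it suits ad hoc diagnosis rather than alerting.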

🏷️ Tags

how-to, tutorial, guide, dev.to, ai, ml, server, bash, network, docker, node