Monitoring Your Laravel App in Production: What to Watch and When to Panic

In this guide:

- The Four Golden Signals
  - Latency: How Long Requests Take
  - Traffic: How Many Requests You Are Handling
  - Errors: What Is Failing
  - Saturation: How Close You Are to the Limit
- Server-Level Metrics
  - CPU Usage
  - Memory Usage
  - Disk Usage
- Application-Level Metrics
  - Queue Health
  - Database Performance
  - Cache Hit Ratio
- Setting Up Meaningful Alerts
  - The Alert Hierarchy
  - Deploynix Health Alerts
  - Avoiding Alert Fatigue
- Building a Monitoring Dashboard
- Conclusion

Your Laravel application is deployed, SSL is green, and the first users are signing up. Everything looks fine. Then at 2 AM, your phone buzzes with a Slack notification: "Site is down." You SSH into the server, discover the disk is 100% full because log files grew unchecked for three weeks, and spend the next hour frantically clearing space and restarting services.

This scenario plays out daily across thousands of production applications. The fix is not heroic debugging at 2 AM — it is monitoring that would have flagged the growing disk usage days before it became critical.

Monitoring is not about collecting data. It is about knowing which data matters, setting thresholds that predict problems before they cause outages, and building enough confidence in your alerts that you can sleep through the night when nothing fires. This guide covers exactly what to monitor in a production Laravel application, what thresholds to set, and which alerts deserve your immediate attention.

The Four Golden Signals

Google's Site Reliability Engineering book defines four golden signals that capture the health of any service: latency, traffic, errors, and saturation. Every monitoring strategy should start here.

Latency: How Long Requests Take

Latency is the time between a request arriving and the response being sent. For a Laravel application, this includes PHP processing time, database queries, cache lookups, and any external API calls.

What to track:

- P50 (median) response time — your typical user experience
- P95 response time — the experience for your slowest 5% of requests
- P99 response time — your worst-case user experience

Healthy thresholds for a typical Laravel app:

- P50 under 200ms
- P95 under 500ms
- P99 under 1 second

When to panic: If P95 suddenly doubles, something has changed. Common causes include a missing database index (a new feature introduced an N+1 query), an external API slowing down, or PHP-FPM running out of workers and queuing requests.

Traffic: How Many Requests You Are Handling

Traffic volume gives you context for other metrics. High CPU usage during a traffic spike is expected behavior, not a problem. High CPU usage with normal traffic is a problem.

What to track:

- Requests per second (RPS)
- Requests per second by endpoint (identifies hot paths)
- Background job throughput (jobs processed per minute)

When to panic: A sudden drop in traffic often indicates something is broken upstream — a DNS issue, a load balancer misconfiguration, or your application returning errors so fast that clients stop retrying.

Errors: What Is Failing

Not all errors are equal. A 404 is a user typing a wrong URL. A 500 is your application failing to handle a request. A spike in 429 responses means your rate limiter is working. Context matters.

What to track:

- 5xx error rate as a percentage of total requests
- Exception frequency by type
- Failed queue jobs
- Log entries at error and critical levels

Healthy thresholds:

- 5xx error rate under 0.1% of total requests
- Zero unhandled exceptions (every exception should be caught and handled meaningfully)

When to panic: A 5xx rate above 1% means something significant is broken. If it is above 5%, your application is effectively down for a meaningful percentage of users.

Saturation: How Close You Are to the Limit

Saturation tells you how close your system is to its limits. A server at 90% CPU utilization is not "90% efficient" — it is on the edge of failing.

This is where Deploynix's real-time monitoring becomes critical. The platform tracks CPU, memory, disk usage, and load average for every managed server, giving you visibility without setting up third-party monitoring tools.

Server-Level Metrics

CPU Usage

CPU utilization is the most watched metric, but also the most misunderstood. A brief spike to 100% during a deployment is normal. Sustained CPU above 80% during normal traffic is a problem.

What to track:

- Overall CPU utilization (user, system, iowait)
- CPU utilization by process (PHP-FPM, MySQL, Nginx)
- Load average (1-minute, 5-minute, 15-minute)

Alert thresholds:

- Warning at 70% sustained for 5 minutes
- Critical at 90% sustained for 2 minutes

Key insight: High iowait (CPU waiting for disk I/O) usually means your disk is the bottleneck, not your CPU. This is common with database-heavy workloads on servers with slow storage. The fix is often faster storage (NVMe) or moving the database to a dedicated server, not adding more CPU cores.

Memory Usage

PHP-FPM workers each consume memory, and the number of workers multiplied by the memory per worker must fit within your server's RAM (for example, 40 workers at 60MB each already need 2.4GB before MySQL, Nginx, and the OS get anything). If total memory usage exceeds physical RAM, the system starts swapping to disk, and performance craters.

What to track:

- Total memory usage vs. available RAM
- Memory usage per PHP-FPM worker
- Swap usage (any swap usage is a warning sign)
- OOM killer activity (check dmesg for killed processes)

Alert thresholds:

- Warning at 80% memory usage
- Critical at 90% memory usage
- Any swap usage above 100MB deserves investigation

Common culprit: A Laravel application that processes large datasets in memory (importing CSVs, generating reports) can cause memory spikes that crash PHP-FPM workers. Use chunked processing (LazyCollection, cursor(), chunk()) for large datasets.

Disk Usage

Disks fill up silently. Log files grow, database binary logs accumulate, failed deployment artifacts pile up, and temporary files are not cleaned. Then everything stops at once.

What to track:

- Disk usage percentage
- Disk usage growth rate (predicts when you will run out)
- Inode usage (you can run out of inodes before running out of space)
- Individual directory sizes (/var/log, /tmp, database data directory)

Alert thresholds:

- Warning at 70% disk usage
- Critical at 85% disk usage
- Growth rate alerts: if disk usage will reach 90% within 48 hours based on current growth

Prevention:

- Configure Laravel's log rotation (daily channel with days limit)
- Set up MySQL binary log expiration (binlog_expire_logs_seconds)
- Use Deploynix's backup feature to offload backups to external storage (S3, DigitalOcean Spaces, Wasabi)
- Clean old deployment releases (Deploynix manages this automatically)

Application-Level Metrics

Queue Health

Laravel queues are where work goes to happen asynchronously. When queues back up, your users experience delays in emails, notifications, webhook deliveries, and any other background processing.

What to track:

- Queue size (number of pending jobs)
- Queue processing rate (jobs completed per minute)
- Failed job count
- Time in queue (how long jobs wait before processing)

Alert thresholds:

- Warning when queue size exceeds 1,000 jobs (adjust based on your normal volume)
- Critical when queue size is growing faster than processing rate for more than 10 minutes
- Any failed job should trigger a notification

When to panic: A steadily growing queue means your workers cannot keep up. Either add more worker processes (Deploynix lets you manage queue workers from the site dashboard) or investigate whether a specific job type is taking too long.

Database Performance

Your database is almost always the bottleneck. Monitoring it effectively prevents the majority of production incidents.

What to track:

- Active connections vs. maximum connections
- Slow query count (queries exceeding your threshold, typically 1 second)
- Query throughput (queries per second)
- Replication lag (if using read replicas)
- Table lock waits
- Buffer pool hit ratio (for MySQL/MariaDB)

Alert thresholds:

- Warning when active connections exceed 70% of max_connections
- Critical when active connections exceed 85%
- Any query consistently taking more than 2 seconds in production
- Replication lag exceeding 10 seconds

Common issues in Laravel applications:

- N+1 queries: A single page load triggers hundreds of queries. Use preventLazyLoading() in development to catch these.
- Missing indexes: New features add where clauses on unindexed columns. Monitor slow query logs after every deployment.
- Connection exhaustion: Each PHP-FPM worker holds a database connection. If you have 50 workers and max_connections is 100, you are already at 50% before counting queue workers, cron jobs, and admin connections.

Cache Hit Ratio

If you are using Valkey (or Redis) for caching, your cache hit ratio tells you how effective your caching strategy is.

What to track:

- Hit ratio (hits / (hits + misses))
- Memory usage vs. maximum memory
- Eviction rate (keys removed to make room for new ones)
- Connected clients

Healthy thresholds:

- Hit ratio above 90% (for a well-cached application)
- Eviction rate should be near zero (if keys are being evicted, increase cache memory)

Setting Up Meaningful Alerts

The Alert Hierarchy

Not all alerts are equal. Structure your alerts into tiers:

Critical (page someone immediately):

- Server unreachable
- 5xx error rate above 5%
- Disk usage above 90%
- SSL certificate expires within 7 days
- Database connections exhausted

Warning (investigate within business hours):

- CPU sustained above 70%
- Memory above 80%
- Disk above 70%
- Queue growing for more than 10 minutes
- P95 latency doubled from baseline
- Failed jobs in the queue

Informational (review weekly):

- Deployment completed
- Backup completed successfully
- Traffic trends
- Resource usage trends

Deploynix Health Alerts

Deploynix provides built-in health alerts for your managed servers. These alerts monitor critical server metrics and notify the server owner via email when thresholds are exceeded. Alerts are categorized as warning or critical — critical alerts trigger email notifications immediately so you can respond before a minor issue becomes an outage.

The advantage of platform-integrated monitoring is that alerts have context. A health alert from Deploynix includes the server identifier, the metric that triggered the alert, the current value, and the threshold that was exceeded, along with a link to the health dashboard where you can investigate further.

Avoiding Alert Fatigue

Alert fatigue is the number one reason monitoring fails. When every minor fluctuation triggers a notification, people start ignoring all notifications — including the critical ones.

Rules for avoiding alert fatigue:

- Every alert must be actionable. If there is nothing you can do about it, it is not an alert — it is a metric.
- Set thresholds based on sustained behavior, not instantaneous spikes. A CPU spike to 95% for 10 seconds during a deployment is fine.
- Review and tune alerts monthly. If an alert fires weekly and you always dismiss it, either fix the underlying issue or raise the threshold.
- Use different notification channels for different severity levels. Critical alerts go to phone calls or high-priority Slack channels. Warnings go to a monitoring channel.

Building a Monitoring Dashboard

A good monitoring dashboard answers the question "Is everything okay?" in five seconds. Here is what to include:

Row 1: The Big Picture

- Current request rate
- Current error rate
- P95 latency
- Active server count

Row 2: Server Health

- CPU utilization per server
- Memory utilization per server
- Disk utilization per server

Row 3: Application Health

- Queue size and processing rate
- Database connections and query latency
- Cache hit ratio
- Failed jobs count

Row 4: Business Metrics

- Active users
- Signup rate
- Key feature usage

Business metrics on a technical dashboard might seem odd, but they provide crucial context. A spike in errors that correlates with a spike in signups tells a different story than a spike in errors with flat traffic.

Conclusion

Monitoring is not about installing a tool and forgetting about it. It is about building a mental model of your application's normal behavior so you can quickly identify when something deviates.

Start with the four golden signals — latency, traffic, errors, and saturation. Add server-level metrics for CPU, memory, and disk. Layer on application-level monitoring for queues, database performance, and cache effectiveness. Set alerts that are meaningful and actionable.

Use Deploynix's built-in health monitoring as your foundation, and add application-level monitoring as your application grows. Review your alerts regularly, tune thresholds based on real-world patterns, and fix the underlying causes of recurring alerts rather than adjusting thresholds to make them stop.

The goal is not to eliminate all production incidents — that is impossible. The goal is to detect problems before your users do, understand them quickly when they occur, and resolve them before they escalate. Good monitoring turns a 2 AM crisis into a 2 PM task, and that is worth every minute you invest in setting it up.
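The log-rotation advice from the disk-usage discussion above is a small change in Laravel's logging config. A minimal sketch of the relevant fragment of config/logging.php, assuming the built-in daily driver and a 14-day retention window (an example value, not a rule):

```php
// config/logging.php (fragment)
// The 'daily' driver writes one file per day and deletes files
// older than 'days', so storage/logs cannot grow unbounded.
'channels' => [
    'daily' => [
        'driver' => 'daily',
        'path' => storage_path('logs/laravel.log'),
        'level' => env('LOG_LEVEL', 'debug'),
        'days' => 14, // assumption: two weeks of history is enough
    ],
],
```

Remember to point LOG_CHANNEL at the daily channel in your .env, otherwise the single-file channel keeps growing.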
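The preventLazyLoading() tip from the database section is enabled in a service provider. A minimal sketch, assuming a Laravel version recent enough to ship this method (8.43+); the production guard keeps a stray lazy load from throwing in front of real users:

```php
// app/Providers/AppServiceProvider.php (fragment)
use Illuminate\Database\Eloquent\Model;

public function boot(): void
{
    // Throw on lazy loading everywhere except production, so an
    // accidental N+1 fails loudly in development and CI instead of
    // surfacing later as a P95 latency regression.
    Model::preventLazyLoading(! $this->app->isProduction());
}
```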
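The chunked-processing advice from the memory section looks like this in practice. A sketch assuming a hypothetical User model: cursor() streams rows one at a time through a LazyCollection, while chunkById() bounds each query for bulk work:

```php
// Streaming: hydrates one model at a time, so memory stays flat
// no matter how large the table is.
User::query()
    ->where('active', true)
    ->cursor()
    ->each(function ($user) {
        // export or transform a single row here
    });

// Batching: processes 500 rows per query, useful when each batch
// should get its own transaction or bulk update.
User::query()->chunkById(500, function ($users) {
    foreach ($users as $user) {
        // ...
    }
});
```

By contrast, User::all() would hydrate the entire table into memory at once, which is exactly the kind of spike that crashes PHP-FPM workers.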
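The queue-size warning threshold above can also be checked from inside the application. A sketch using the Queue facade; the 1,000-job threshold and the 'default' queue name are this article's example values, not universal constants:

```php
use Illuminate\Support\Facades\Log;
use Illuminate\Support\Facades\Queue;

// Run from a scheduled command (e.g. every minute) so a growing
// backlog is recorded before users notice delayed emails.
$pending = Queue::size('default');

if ($pending > 1000) {
    Log::warning('Queue backlog exceeds threshold', [
        'queue' => 'default',
        'pending' => $pending,
    ]);
}
```

Recent Laravel releases also include a built-in artisan command, queue:monitor, which fires an event when a queue crosses a configured size, so you may not need a hand-rolled check at all.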