Monitoring Was Green, Users Were Down
Our dashboards showed everything was healthy while users couldn't access the site. A hard lesson in the gap between what's easy to monitor and what actually matters to users.
3:17 AM. My phone lights up with angry messages from the on-call manager. "The site is completely down. Why didn't we get alerted?"
I check our monitoring dashboards. Everything is green. CPU usage: normal. Memory: normal. Error rates: 0.05%. Request latency: under 100ms. All systems operational.
I refresh the site in my browser. HTTP 502: Bad Gateway.
The site is definitely down. But according to our monitoring, everything is fine.
This was one of the most humbling incidents of my career. It taught me the hard truth: monitoring what's easy isn't the same as monitoring what matters.
What We Were Monitoring
Our monitoring setup looked impressive on paper:
- Infrastructure metrics: CPU, memory, disk I/O, network throughput
- Application metrics: Request rates, error rates, latency percentiles
- Database metrics: Connection pool usage, query performance, replication lag
- Container health: Pod status, restart counts, resource limits
We had Prometheus scraping hundreds of metrics. We had Grafana dashboards that looked beautiful. We had alerts for everything we thought was important.
And yet, we completely missed that our users couldn't access the site.
What Actually Happened
Here's the timeline:
2:43 AM - The Certificate Expires
Our SSL certificate expired. This wasn't a secret—we'd known about the expiration date for weeks, and we had automation to renew it. But that automation ran in a Jenkins pipeline that had failed silently three weeks earlier, when an AWS credentials rotation invalidated its access.
Nobody noticed because the pipeline failure wasn't alerting anyone.
2:43 AM - Load Balancer Rejects Traffic
Our AWS Application Load Balancer (ALB) was configured to only accept HTTPS traffic. With no valid certificate, it started returning 502 errors for all incoming requests.
2:43 AM - Monitoring Shows Everything Is Fine
Here's the critical part: our monitoring was all internal. We monitored:
- The containers running behind the load balancer (healthy)
- The application handling requests that reached it (working perfectly)
- The database and cache layers (no problems)
What we didn't monitor was whether external users could actually reach our application through the load balancer.
Our monitoring stack was inside the VPC, behind the load balancer. It couldn't see what external users were experiencing.
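The missing piece was trivially small: a probe that hits the site from outside and reports what a user would see. Here's a minimal sketch in Python (the URL and error labels are illustrative, not our production code). The key detail is that a TLS failure, like an expired certificate, is caught separately—internal metrics never see it, but this probe does.

```python
import ssl
import urllib.error
import urllib.request

def probe(url: str, timeout: float = 5.0):
    """Hit the site the way an external user would; return (ok, reason)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400, f"HTTP {resp.status}"
    except urllib.error.HTTPError as e:
        # 4xx/5xx responses (like our 502s) arrive as exceptions here.
        return False, f"HTTP {e.code}"
    except ssl.SSLCertVerificationError as e:
        # An expired certificate lands here -- invisible to internal metrics.
        return False, f"TLS failure: {e.reason}"
    except Exception as e:
        return False, f"unreachable: {e.__class__.__name__}"
```

Run from a host outside your VPC on a schedule, this one function would have caught our outage at 2:43 AM instead of 3:17.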
3:15 AM - Users Start Complaining
For 32 minutes, the site was completely inaccessible. No alerts fired. The on-call engineer had no idea anything was wrong. We found out when angry users started messaging our support team and tagging us on social media.
Why Our Monitoring Failed
Looking back, we made several fundamental mistakes:
1. We Monitored Components, Not User Experience
We measured individual system components: servers, databases, caches. Each component was healthy. But the user experience—the only thing that actually matters—was broken.
The lesson: Monitor from your users' perspective, not just from inside your infrastructure.
2. We Monitored What Was Easy
It's easy to scrape Prometheus metrics from your application. It's easy to get CloudWatch metrics from AWS. It's harder to set up external synthetic monitoring that actually hits your site like a user would.
We took the easy path.
The lesson: If something is critical to user experience but hard to monitor, find a way to monitor it anyway.
3. We Didn't Test Our Monitoring
We had never asked: "What happens if our monitoring stack is healthy but users can't reach us?" We assumed that if everything internal looked good, users were fine.
Bad assumption.
The lesson: Test your monitoring. Simulate failures. Make sure your alerts actually fire when they should.
4. We Didn't Monitor the Whole Path
Our traffic path was:
Internet → DNS → Load Balancer → WAF → Application → Database
We monitored the application and database. We had basic checks on the load balancer. But we never verified end-to-end that an external user could successfully complete a request.
The lesson: Monitor every hop in your critical path, especially the external ones.
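One way to make "every hop" concrete is a check that walks the path in order—DNS resolution, TCP connect, TLS handshake—and records where it breaks, so the alert tells you which hop failed. This is a sketch of the idea, not a full implementation (it stops before issuing an HTTP request, which a real check would also do):

```python
import socket
import ssl

def check_path(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Walk the external path hop by hop: DNS -> TCP -> TLS handshake."""
    results = {}
    try:
        results["dns"] = bool(socket.getaddrinfo(host, port))
    except socket.gaierror:
        results["dns"] = False
        return results  # no point continuing without an address
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            results["tcp"] = True
            ctx = ssl.create_default_context()
            with ctx.wrap_socket(sock, server_hostname=host):
                results["tls"] = True
    except ssl.SSLError:
        results["tls"] = False  # an expired certificate fails right here
    except OSError:
        results.setdefault("tcp", False)
    return results
```

In our incident, DNS and TCP would have passed and the TLS hop would have failed—exactly the information no internal dashboard could give us.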
The Fix
We fixed this in multiple layers:
Immediate (During the Incident)
- Manually renewed the SSL certificate: Through the AWS console
- Verified external access: Using curl from outside our network
- Fixed the Jenkins pipeline: And verified the renewal automation worked
Short-term (That Week)
- Added external synthetic monitoring: Using Pingdom and AWS CloudWatch Synthetics
- Set up certificate expiration alerts: 30, 14, and 7 days before expiration
- Created external health checks: That hit our load balancer from multiple regions
- Added SSL certificate validity monitoring: Daily checks from external monitors
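The 30/14/7-day expiration alerts above are just date arithmetic once you have the certificate's `notAfter` field (which Python's `ssl.getpeercert()` returns as a string). A sketch, with escalation labels that are my own illustration:

```python
from datetime import datetime, timezone

def days_until_expiry(not_after, now=None):
    """not_after uses the format ssl.getpeercert() returns,
    e.g. 'Mar 15 12:00:00 2026 GMT'."""
    expires = datetime.strptime(
        not_after, "%b %d %H:%M:%S %Y %Z"
    ).replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days

def alert_level(days_left):
    """Escalate as the 30/14/7-day thresholds approach."""
    if days_left <= 7:
        return "page"      # wake someone up
    if days_left <= 14:
        return "ticket"    # must be fixed this sprint
    if days_left <= 30:
        return "notice"    # renewal should already be in flight
    return "ok"
```

The escalating levels matter: a single alert 30 days out is easy to snooze and forget, which is essentially what happened to our Jenkins pipeline.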
Long-term (Following Month)
- Implemented comprehensive synthetics: Multi-step user journeys checked every 5 minutes from multiple locations
- Created SLIs based on user experience: Success rate of synthetic tests, not internal metrics
- Set up external alerting: Separate from our internal monitoring, so if our VPC goes down, we still get alerted
- Automated certificate management: With multiple layers of monitoring and backup renewal processes
- Built a status page: Fed by external monitors, not internal ones
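The multi-step journeys reduce to a simple pattern: an ordered list of named checks that stops at the first failure, so the alert names the broken step rather than just saying "synthetic failed." The step names below are illustrative—real steps would drive HTTP calls or a headless browser:

```python
def run_journey(steps):
    """Run named checks in order; report the first step that fails."""
    for name, check in steps:
        if not check():
            return False, name
    return True, None

# Illustrative journey -- stand-ins for real HTTP/browser steps.
journey = [
    ("load_homepage", lambda: True),
    ("log_in",        lambda: True),
    ("add_to_cart",   lambda: False),  # simulate a broken step
    ("check_out",     lambda: True),
]
```

Running `run_journey(journey)` here returns `(False, "add_to_cart")`—the on-call engineer knows immediately which part of the user experience broke.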
What You Should Monitor
Based on this and other incidents, here's my recommended monitoring strategy:
User-Centric Metrics (Most Important)
- Synthetic transactions: Can users actually use your site/app?
- Real user monitoring (RUM): What are actual users experiencing?
- Success rate: What percentage of user requests succeed?
- Availability from outside: Can users reach you from the internet?
Application Metrics
- Request rates: How much traffic are you handling?
- Error rates: What percentage of requests fail?
- Latency: How fast are you responding? (P50, P95, P99)
- Saturation: How close to capacity are you?
Infrastructure Metrics (Least Important, But Still Useful)
- CPU/Memory/Disk: Is your infrastructure healthy?
- Container health: Are your pods running?
- Database performance: Query times, connection pool usage
Notice the order: user experience first, application metrics second, infrastructure metrics last.
Most teams (including my past self) do this backwards. They obsess over infrastructure metrics and never verify that users can actually use the product.
Key Principles for Effective Monitoring
1. Monitor Outcomes, Not Outputs
Don't just monitor that your application is running. Monitor that it's doing what users need it to do.
2. Monitor from the Outside
If you only monitor from inside your network, you'll miss issues that affect external users (like our certificate problem).
3. Your Alerts Should Match User Impact
If users are affected, you should get alerted. If users aren't affected, you shouldn't. Any mismatch means your monitoring needs work.
4. Test Your Monitoring
Regularly break things to ensure your monitoring catches it. If you've never tested whether your monitoring works, you don't know if it works.
5. Keep It Simple
You don't need to monitor everything. Focus on what matters to users. More dashboards don't mean better monitoring.
A Better Approach: Site Reliability Engineering (SRE)
After this incident, we adopted SRE principles:
Service Level Indicators (SLIs)
Metrics that matter to users:
- Availability: Can users access the site?
- Latency: How fast does it respond?
- Quality: Does it work correctly?
Service Level Objectives (SLOs)
Targets for those metrics:
- 99.9% of requests should succeed
- 95% of requests should complete in under 200ms
- 99% of transactions should complete successfully
Alerting on SLO Budget Burn
We alert when we're burning through our error budget too quickly, not on arbitrary threshold breaches.
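The arithmetic behind budget-burn alerting is straightforward. A 99.9% SLO leaves a 0.1% error budget; the burn rate is how fast the observed error rate consumes it. The 14.4x paging threshold below follows the multiwindow approach popularized by the Google SRE Workbook (14.4x sustained for one hour consumes about 2% of a 30-day budget)—treat the exact numbers as a starting point, not gospel:

```python
def error_budget(slo):
    """Fraction of requests allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo

def burn_rate(error_rate, slo):
    """How fast we're consuming the budget; 1.0 means exactly on pace."""
    return error_rate / error_budget(slo)

def should_page(rate, threshold=14.4):
    """Fast-burn paging: 14.4x over 1h eats ~2% of a 30-day budget."""
    return rate >= threshold

# With a 99.9% SLO, a 0.5% error rate burns budget 5x faster than allowed:
# noticeable, but not yet page-worthy under the fast-burn rule.
rate = burn_rate(0.005, 0.999)
```

The payoff is that alerts now scale with user impact: a brief error spike that barely dents the budget stays quiet, while a sustained burn pages someone long before the monthly SLO is blown.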
This shifted our focus from "is the server CPU high?" to "are users getting the experience they expect?"
Conclusion
Green dashboards don't mean happy users. They mean your dashboards are green.
The only monitoring that truly matters is monitoring that reflects user experience. Everything else is just data.
After implementing external synthetic monitoring and real user monitoring, we've caught dozens of issues before they impacted users. We've also reduced alert fatigue by focusing on what matters.
Our dashboards might not look as impressive with fewer metrics, but they tell us something far more important: whether our users can actually use our product.
And that's all that really matters.
What monitoring blind spots have you discovered the hard way? I'd love to hear your stories. Reach out or share in the comments.