Level Down your Alerts with User Monitoring

Today, let’s dive into transactional user experience monitoring.  I want to start with a bit of philosophy.  The reason we want our systems to be reliable is to provide value to our businesses.  Business value means the site remains available and performant for...

Happiness is a measured user journey

We mess with Jenkins configs.  We struggle with failing automated tests.  We finally get Kubernetes doing what we want it to do.  We wake up at 2 a.m. to restart a failing API gateway.  Why?  So the user gets the best experience possible.  How do we know if our efforts...

SRE Pain Points

I want to spend a little time reflecting on the Cabot journey over the past few months.  As I mentioned at the outset of this blog, I was touched many years ago by how much Lakshay’s life was impacted by his job.  Of course, everyone’s job impacts their life, but...

Abnormal vs. Bad

Is your solution detecting actual business threats?  Reflecting on alert fatigue, I think much of the problem comes down to conflating abnormal metric values with bad user experiences.  Many monitoring products reinforce the confusion by making it easy...

Alert Fatigue

One issue I’ve run across over the years is alert fatigue.  As the linked article points out, it’s not just a problem for SREs, but we’re definitely victims of it.  I can’t count the number of times the question, “Hey, what is that alert about?” is...