

My name is Mark and I’ve been in the site reliability game for a while now, going on about fifteen years. A lot has changed since I was a fresh-faced consultant joining Wily Technology back in 2000: the rise of AWS, Docker, APM, Nagios, Slack, and on and on. The one thing that hasn’t changed is that the job of site reliability folks hasn’t gotten any easier. Sites have become more reliable, and that’s a testament to just how hard all of you work, but it still takes a tremendous amount of work to keep them that way!

One of my Wily consulting gigs always comes to mind when I hear about truly painful production incidents. My job was to fly out to potential customers when they were in the middle of bad site performance problems, install Introscope, find the problem, and help them get things running again. There was nothing particularly special about the technical problem this customer was having. The gig stands out for me because it was the first time I really saw the human cost of an incident.

This was a potentially huge deal for us so I flew out on a Sunday to be ready to engage with the client on Monday. I went in Monday morning and after the usual security dance to get into the data center (yes, this was before AWS so most of our customers were running in their own data centers), I was ushered back to a row of cubes where a bunch of people (easily a dozen or more) were crowded around a few cubes trying to resolve a problem that had been going on for a couple of weeks: the site would be just fine for a day or two and then, seemingly out of nowhere, it would stop accepting HTTP requests and become totally unusable. I won’t give away the client but I can say that the site handled a complex sign-up form for a service. The last day to apply for the service was approaching and the client knew that a bunch of their customers would be signing up soon. They only had a few days left to get this problem fixed.


Back at the row of cubes where most everyone was huddled, I was introduced to the lead engineer trying to figure out the problem. He had been part of their production support organization for a few years and had recently moved to the US from India with his wife and newborn son. Unfortunately, I don’t remember his name, but he made such an impression on me. I’ll call him Lakshay. This guy was at the center of it all. He had hastily drawn architecture diagrams on his whiteboard, neatly printed (and certainly out of date) network diagrams posted on his cube wall, a couple of WebLogic books open on his desk, and scribbled notes posted all over. He was the only one who seemed to have a handle on the entire system. I came to find out that Lakshay didn’t have a background in development, but he could speak authoritatively about how the whole thing was put together. He was really the only one who could.

Lakshay had been working the problem for almost a week. I’m sure I don’t need to tell you what “working the problem for almost a week” really means: late nights, little sleep, lots of status updates to managers who care but can’t really help. And the stress. You haven’t really experienced stress like this until the CEO calls and asks when you can have the site up again.

So, we got to work. After installing Wily Introscope, Lakshay began to explain the system to me and go through everything they knew so far. It wasn’t much. About all we could do was wait for the problem to happen again and hope that the new monitoring would catch it and give us a clue to the underlying cause. By this time, it was getting to be around 6pm but the crowd of people hanging around Lakshay’s cube wasn’t going home and he didn’t feel comfortable leaving either – even though there was really nothing anyone could do but wait for the problem to reoccur. We were chatting and I got to know him a little bit. Around 7pm, his wife called. When he got off the phone, he told me his son had just taken his first steps.

His son just took his first steps and Lakshay was at work at 7pm trying to solve a production incident. That floored me. I know in the grand scheme of things, this is small, but the small things matter. I have no idea how many small life moments are stolen by production incidents but I picture Lakshay every time I hear about one.

Lakshay got lucky: the problem recurred that night around midnight (yes, we were still there waiting for it) and the new monitoring spotted the source of the problem almost immediately. A few screenshots and some logs sent to the developers got them working on a fix the next day. I don’t know if they got the problem fixed by their deadline, but I do know Lakshay took the following two days off. Well deserved.

Stories like Lakshay’s happen far too often. This is why Cabot exists. We’re looking at the problem of production incidents with fresh eyes so that incidents stop stealing so many of life’s small moments.

Want to help? Sign up for an interview to share your SRE war stories or take the SRE survey.
