December 19, 2019.
During an incident at Digg, a coworker once quipped, “We serve funny cat pictures, who cares if we’re down for a little while?” If that’s your attitude towards reliability, then you probably don’t need to formalize handling incidents, but if you believe what you’re doing matters -- and maybe today’s a good time to start planning how to walk out that door if you don’t -- then at some point your company is going to have to become predictably reliable.
November 25, 2019.
Recently I reread Steve Yegge's 2011 Google Platforms Rant, which states "the Golden Rule of platforms is that you Eat Your Own Dogfood," arguing that APIs are good to the extent that the folks developing them depend on their quality.
November 18, 2019.
One of my foundational learning experiences occurred in 2014, when I designed and rolled out Uber’s original Site Reliability Engineering role and organization. While I’d make many decisions a bit differently if I could rewind and try again, for the most part I’m proud when reviewing the reel of rewound memories.
November 5, 2019.
Imagine you woke up one day and found yourself responsible for a Site Reliability Engineering team. By 10AM, you’ve downloaded a free copy of the SRE book, and are starting to get the hang of things. Then an incident strikes: oh no! Folks rally to mitigate user impact and later diagnose and remediate the underlying cause, but a bunch of your users have a very bad day. Your shoulders are a bit heavier than just a few hours ago. You sit down with your team and declare your bold leader-y goal: next quarter we’ll have _zero_ _incidents_.
October 31, 2019.
A few weeks ago I got the chance to speak at SRECon EMEA 2019, and the videos are up! This is the video of my talk, Investing in technical infrastructure.