During an incident at Digg, a coworker once quipped, “We serve funny cat pictures, who cares if we’re down for a little while?” If that’s your attitude towards reliability, then you probably don’t need to formalize handling incidents, but if you believe what you’re doing matters – and maybe today’s a good time to start planning how to walk out that door if you don’t – then at some point your company is going to have to become predictably reliable.
Having worked with a number of companies as they transition from heroic reliability to nonchalant reliability, I've seen a handful of transitions that can take years to navigate, but can also go much faster. Forewarning streamlines the timeline and skips much of the frustration.
This friction stems from the demands on incident programs shifting as a company scales. They start out with an emphasis on effectively responding to incidents, but with growth they become increasingly responsible for defining the structure and rationale of the company’s investment into reliability. If you haven’t previously navigated these shifts, it can be bewildering to see what worked so well early on become the source of ire.
Here are my tips to fast-forward past the frustration.
Even the smallest company should have an on-call rotation powered by a tool like PagerDuty. A pageable rotation is the starting point for incident response. In the best case, some sort of monitoring triggers an alert; in the slightly worse case, a human – often a human reading support tickets – will learn about a problem and manually trigger an incident. In either case, that alert pages a shared on-call rotation for the entire company, and folks leap into mitigation.
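To make the flow concrete, here is a minimal sketch in Python, with hypothetical names (your monitoring and paging stack will differ), of how both paths, an automated alert and a human-filed report, converge on the same company-wide rotation:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Page:
    summary: str
    source: str  # "monitoring" or "support"
    created_at: datetime

def page_on_call(page: Page) -> None:
    # Hypothetical: in practice this calls your paging provider's API
    # (PagerDuty, Opsgenie, etc.) to notify whoever is currently on call
    # for the single, company-wide rotation.
    print(f"Paging company-wide rotation: {page.summary} (via {page.source})")

# Path 1: monitoring fires an alert automatically.
page_on_call(Page("p99 latency above 2s on checkout", "monitoring", datetime.now(timezone.utc)))

# Path 2: a human reading support tickets opens the incident manually.
page_on_call(Page("Customers report failed logins", "support", datetime.now(timezone.utc)))
```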
This approach works well in a company with a small surface area, as every engineer is familiar enough with the software that having a shared rotation still leads to effective response. It starts to fray as the company scales, where only an increasingly small pool of long-tenured engineers understand enough of the stack to respond to complex incidents.
This overreliance on a small, tenured group leads to resentment in both directions. Long-tenured folks feel put upon and unsupported by the newer team, and newer folks feel like they’re not being given space to contribute. You’ll often start splitting into multiple on-call rotations for areas with clear interfaces and isolation, but most early codebases follow the big ball of mud pattern and resist decomposition.
Dependence on the long-tenured cohort becomes untenable as the company keeps growing, which eventually leads to the transition from this ad-hoc approach to a more structured system that promises to effectively incorporate more folks into incident response. By the time you reach two hundred engineers, it’s definitely time to take the next step, but even much smaller companies will benefit from abandoning the ad-hoc approach to the extent that they provide critical services.
Adopting a formal incident program doesn’t require a huge amount of process, but it certainly requires more process than the ad-hoc reimagination of incident response that previously occurred at the beginning of each incident. This can be challenging as early startups tend to equate structure and inefficiency.
The structure-is-inefficient mindset is particularly challenging in this case because the transition from “excellence in ad-hoc response” to “excellence in structured response” is hard. Hard enough that you’ll initially get significantly worse at response.
Consequently, the first step in this transition is building awareness that it’s going to get worse before it gets better. It’s going to get worse, but the alternative is that it stays bad, and in a short while you careen over a cliff as the long-tenured team you’re depending on revolts. There is simply no way for a small team of long-tenured engineers to support a company through continued rapid growth, and every company must make this transition.
Once you’ve built awareness that it’s going to be a rocky switch, you can start laying the foundation of a durable, scalable incident program. This program will be responsible not only for responding to incidents as they happen (incident response), but also for the full lifecycle of incidents from detection to mitigation (reducing or eliminating the user impact) to remediation (reducing or eliminating the impact of a repeat incident).
The pillars of your initial incident program rollout are:
- Incident training ensures every engineer who responds to incidents goes through training on how to handle them, allowing them to learn the standardized approach. Equally important, ensure every employee goes through training on how to open an incident.
- Incident tooling to streamline opening and responding to incidents, as well as supporting the creation of a high-quality incident dataset. Early on you can conflate alerts with incidents, but as your systems’ complexity scales, you’ll have far more alerts than incidents, and some incidents that don’t trigger any alerts. Blameless is quite similar to the internal tools most companies eventually build. (A minimal sketch of the underlying incident record follows this list.)
- Incident roles define concrete roles during an incident, such as incident commander, to create a clear separation of concerns between folks mitigating and folks coordinating. It also streamlines getting access to folks to fill each role for a given incident, for example an on-call rotation of incident commanders, an on-call rotation for folks who can do external communication during incidents, and so on.
- Incident review supports learning from every incident by authoring and discussing incident reports. I’ve found it best to review individual incidents in on-demand meetings that include folks with the most context (and time to think!), and use recurring meetings to review batches of similar incidents to extract broader learnings. (As an aside, I think there is a bit of a low-key industry crisis around whether typical approaches to incident review are actually very valuable. If you were to find one place to innovate instead of replicate, this might be it.)
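As referenced above, here is a minimal sketch (in Python, with hypothetical field names) of the kind of incident record such tooling should capture: distinct from alerts, and carrying the role assignments and lifecycle timestamps that later analysis depends on.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional

class Severity(Enum):
    SEV1 = "sev1"  # major user-facing outage
    SEV2 = "sev2"  # degraded experience
    SEV3 = "sev3"  # minor or internal impact

@dataclass
class Incident:
    title: str
    severity: Severity
    commander: str                      # incident commander role
    communications_lead: Optional[str]  # external communications role, if staffed
    detected_at: datetime
    mitigated_at: Optional[datetime] = None   # user impact ended
    remediated_at: Optional[datetime] = None  # follow-up work completed
    contributing_factors: list[str] = field(default_factory=list)
    alert_ids: list[str] = field(default_factory=list)  # zero or many alerts per incident
```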
Many companies roll out those components through a project working group, which works well for the first version, but excellence comes from iteratively adapting your approach to your company and systems. That sort of ongoing iteration is best done by a dedicated team, which is why companies navigating this transition do so most effectively with an incident tooling team and an incident program manager.
If you roll out all of these pieces, you can likely get to about five hundred engineers before you run into the next challenge: the emergence of incident lawyers.
Incidents are very stressful, and often prompt difficult questions like, “Why are we having so many incidents?” Often folks take that question to the team working on the incident program and tooling, curious to understand what they need to prevent more – or perhaps all – incidents. It’s quite common for the teams working on incidents to reply that what they need is compliance with the existing process: the system works, if only folks would follow it.
This leads to a phase of incident programs that I think of as the emergence of incident law, facilitated by a cohort of strict incident lawyers, who become so focused on executing the existing process that they lose track of whether the process works.
When you put too much pressure on any process or team, they start to double down on doing more of what they’re doing, as opposed to changing what they’re doing to a new approach that might be more effective. This digging in under duress is the trap that many incident programs fall into, shifting their entire focus to compliance with the existing process. If you don’t file your tickets on time, you get a strict follow-up. If your incident report is missing details, you get a reminder. Over time, the rift between the folks designing the incident process and the folks following it starts to widen.
As much as I sympathize with the folks who end up in the role of driving compliance, it’s misguided in the case of incident response. There is an asymmetry in exposure between the folks designing and the folks following the incident process.
The folks designing the process are deep in the details, looking at dozens of incident reports each week and drilling into the process gaps that hampered response. Conversely, the folks following incident response generally experience so few incidents that each incident feels like learning the process from scratch (because the process and tooling have often shifted since their last, infrequent, incident), and they bring good intent but little familiarity to the process. It simply isn’t practical to assume folks who rarely perform a frequently changing process will perform it with mastery; instead, you have to assume that they’ll respond as experts on their own code but amateurs at the incident process itself.
The path out of this morass is switching mindsets from owning a process to instead owning a product. Rather than setting goals around ticket completion rates, set goals around the usability of the workflow. This only works with a dedicated engineering team bringing a product and reliability mindset to the tooling, and an incident management team who can bring accurate data and the voice of the user into that product to ensure it’s effective.
Until you make the process-to-product evolution, you’ll increasingly focus on measures of compliance, which are linked to reliability by definition rather than by results.
As important as it is to escape this “form over function” approach, it’s not quite enough on its own. You also need to evolve from reactively investing in reliability to an approach that supports proactive investment. Unlike the previous phases, there is no organization size threshold you need to reach before you can leave incident law behind you; rather, escaping these legalities is a matter of changing your perspective: sooner is much better than later.
Advocates of centralized incident response will note that you can build mastery among incident responders by creating a centralized incident response or reliability engineering organization that responds to all (critical?) incidents. I’ll talk about this further down, but roughly my argument is (a) that it’s quite hard to do well at any scale, and (b) it’s impossible to do well until your business lines start to mature, which isn’t yet the case for most 500-person engineering organizations undergoing hypergrowth.
Each incident is a gift of learning. From that learning, you’ll identify a bunch of work you could do to increase your reliability. Then you have another incident, and you learn more. You also identify more work. Then you have another incident, and pretty soon there is far more work to be done than you’re able to commit resources towards. This often leads to the antipattern of identifying the same remediations for every incident. If the same remediations come up for every incident, doesn’t that mean they’re not useful remediations? Then why do we keep talking about them?
Any incident program operating at scale will generate more work than the organization can rationally resource at a given time. Reliability _is_ a critical feature for your users, but it is not the _only_ feature. As a result, you have to learn how to prioritize reliability work from a proactive perspective across all incidents, and to rely less on prioritizing reliability work reactively from a single-incident perspective.
There are four ingredients for doing this well: an _investment thesis_, _incident management_, _reliable architecture_, and _quality assertions_.
Your _investment thesis_ is a structured set of goals and baselines (video version from SRECon) that you’ll use to determine the extent you should invest in reliability. This gives you a framework to trade off against other business priorities like feature development, market expansion, etc.
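As one concrete (and hypothetical) form such a thesis can take, an availability target with an error budget turns “how much should we invest in reliability?” into arithmetic you can weigh against feature work:

```python
# Hypothetical availability target; your thesis may use different goals and baselines.
target_availability = 0.999          # "three nines"
minutes_per_30_days = 30 * 24 * 60   # 43,200 minutes

error_budget_minutes = (1 - target_availability) * minutes_per_30_days
print(f"Allowed downtime per 30 days: {error_budget_minutes:.1f} minutes")  # ~43.2

# If incidents have already consumed most of this budget, that's a signal to
# shift investment toward reliability; if plenty remains, feature development
# can take priority.
```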
_Incident management_ is reviewing incidents in bulk as opposed to individually. Understanding the trends in your incidents allows you to make improvements that prevent entire categories of issues rather than narrow one-off improvements. This allows you to shift from whack-a-mole to creating technical leverage.
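As a rough illustration (a Python sketch, assuming a hypothetical incident dataset with contributing factors recorded per incident), bulk review can start with something as simple as counting which factors recur, rather than reading each report in isolation:

```python
from collections import Counter

# Hypothetical incident dataset; in practice this comes from your incident tooling.
incidents = [
    {"title": "Checkout outage", "contributing_factors": ["untested config change", "no canary"]},
    {"title": "Login errors", "contributing_factors": ["expired certificate"]},
    {"title": "Search latency", "contributing_factors": ["untested config change"]},
]

# Count how often each contributing factor recurs across all incidents.
factor_counts = Counter(
    factor for incident in incidents for factor in incident["contributing_factors"]
)

# The most common factors point at category-level fixes (e.g. config validation,
# mandatory canaries) rather than one-off remediations per incident.
for factor, count in factor_counts.most_common():
    print(f"{count}x {factor}")
```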
_Reliable architecture_ is identifying the fundamental properties that make systems reliable and investing to bake those properties into the systems, workflows, and tools that your engineers rely upon. This is the transition from asking folks to identify remediations a priori, to letting them benefit from your reliability team’s years of learning.
_Quality assertions_ are a method of maintaining a quality baseline, allowing you to preserve important properties in your systems that make it possible to rely on common mitigation tools (e.g. to have confidence that requests are idempotent and it’s safe to shift traffic away from a failing node and reattempt a request against a different node when canarying new code).
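For example, here is a minimal sketch (in Python, with a hypothetical `create_order` handler and an in-memory store) of an idempotency assertion, the kind of property you’d verify continuously so that responders can safely retry requests against a different node during an incident:

```python
# Hypothetical handler and store; a real service and test harness will differ.
orders: dict[str, dict] = {}

def create_order(idempotency_key: str, payload: dict) -> dict:
    # Replaying the same idempotency key must return the original order
    # rather than creating a duplicate.
    if idempotency_key in orders:
        return orders[idempotency_key]
    order = {"id": len(orders) + 1, **payload}
    orders[idempotency_key] = order
    return order

def test_create_order_is_idempotent():
    first = create_order("key-123", {"sku": "widget", "quantity": 2})
    retry = create_order("key-123", {"sku": "widget", "quantity": 2})
    # If this holds, it's safe to shift traffic away from a failing node and
    # reattempt the request elsewhere without creating duplicate orders.
    assert first == retry and len(orders) == 1

test_create_order_is_idempotent()
```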
The language of process optimization simply doesn’t resonate with folks who don’t have a deep intuition around creating and operating reliable systems, which is why this transition is so important to make as your company grows to thousands of engineers. Many organizations never quite land this evolution, which is a shame, because reliability will never get the investment it needs until you’re speaking simultaneously in business impact and technical architecture.
Once you’ve guided your reliability efforts through these evolutions, you’ll find yourself in an interesting place. You’re making calculated investments into reliability that resonate with the business leaders. Your architecture channels systems towards predictable reliability. You have a healthy product powering incident response and review, one that continues to evolve over time.
Things are looking good.
Many organizations choose to stop here, and personally I think that’s a reasonably good place to stop. Others decide the financial or reputational impact of downtime is so extreme that they want to keep driving down their time to mitigate, by introducing a centralized team that performs incident response as its core function rather than as an auxiliary one. These teams, often called Site Reliability Engineering teams, become the first line of response for critical, mature systems.
The advantage of these teams is that they have enough repetition in incident response to develop mastery in the processes and tools. You no longer have to build incident response tools that are solely optimized for infrequent responders, but can instead expect a high level of consistency, practice, and training. They’re also valuable for organizations which never created an incident tooling team to begin with, which often causes the impact of SRE teams and incident tooling teams to be conflated.
Expanding a bit on the first of those downsides, companies which are skilled at incident response focus their incident response on mitigating incidents rather than remediating incidents. It’s critical that the impact to your users or business is resolved, and if at all possible you’d much rather eliminate the impact before going through the steps to understand the contributing factors. As such, having a centralized team forces good hygiene as they’re less familiar with the code and rely more on standardized mitigation techniques like traffic shifting, circuit breaking, and so on.
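To ground what one of those standardized mitigations looks like, here is a minimal circuit-breaker sketch in Python (hypothetical thresholds; real implementations add metrics, per-dependency configuration, and more careful state handling):

```python
import time
from typing import Optional

class CircuitBreaker:
    """Stop calling a failing dependency so the rest of the system stays
    responsive while responders investigate the underlying problem."""

    def __init__(self, failure_threshold: int = 5, reset_after_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn, *args, **kwargs):
        # While open, fail fast instead of piling more load onto the broken dependency.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_seconds:
                raise RuntimeError("circuit open: dependency is failing")
            self.opened_at = None  # allow a trial call after the cool-off period
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```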
The most important thing to keep in mind as you invest into reliability is to maintain an intentional balance between the many different perspectives that come into play. You should think about reliability from an organizational program perspective. You should think about it from an investment thesis perspective. You should think about it through the lens of code quality. You should think about it from a product perspective.
The one thing you must not do is lean so heavily on any one perspective that you diminish the others. Keep an open mind, keep a broad set of skills on your team, and keep evolving: excellence is transient because good companies never stop changing.