高生长年龄的生产力。
管理(130)你没有听到这个词超生长那么你做了几年前。肯定的是,you might hear it any given week, but you might open up Techmeme and not see it, which is a monumental return to a kinder, gentler past. (Or perhaps we’re just独角兽现在。)
幸运的是,对于各地的工程管理人员,迅速管理公司的挑战仍然存在。
当我开始Uber时,我们近1000名员工每六个月一次,每六个月加倍。旧计时器总结了他们的经验:
We’re growing so quickly, that every six months we’re a new company.
旁观者迅速添加了一个必然结果:
Which means our process is always six months behind our headcount.
Helping your team be successful when defunct process merges with a constant influx of new engineers and system load has been one of the most rewarding opportunities I’ve had in my career, and this is an attempt to explore the challenges and some strategies I’ve seen for mitigating and overcoming them.
更多工程师,更多的问题
All real-world systems have some degree of inherent self-healing properties: an overloaded database will slow down enough that someone fixes it, overwhelmed employees will get slow at finishing work until someone finds a way to help.
很少有现实世界系统具有高效和刻意的自我修复特性,这就是当您双倍工程师和客户持续一年中的令人兴奋的情况。
生产的大量工程师很难。
Just how challenging depends on how quickly you can ramp engineers up to self-sufficient productivity, but if you’re doubling every six months and it takes six to twelve months to ramp up, then you can quickly find a scenario where untrained engineers increasing outnumber the trained engineers, and each trained engineer is devoting much of their time to training a couple of newer engineers.
想象一个场景在哪里:
- 每个训练有素的工程师每周训练大约需要十个小时,
- 未经训练的工程师是培训工程师的1/3,
and you reach the above chart’s (admittedly, pretty worst-case scenario) ratio of two untrained to one trained. Worse, for those three people you’re only getting the productivity of1.16
trained engineers (2 * .33
对于未训练的工程师加上.5 * 1
for the trainer).
您还需要考虑在招聘上花费的时间。
If you’re trying to double every six months, and about ten percent of candidates undertaking phone screens eventually join, then you need to do ten interviews per existing engineer in that time period, with each interview taking about two hours to prep, perform and debrief.
That’s only four hours per month if you can leverage your entire existing team, but training comes up again here: if it takes you six months to get the average engineer onto your interview loop, each trained engineer is now doing three to four hours of hiring related work a week, and your trained engineers are down to0.4
效率。整个团队正在得到1.06
每三名工程师的工作工程师。
这不仅仅是培训和招聘:
- 对于您需要设计和维护新的管理层的每增加一级工程级。
- 对于每〜10名工程师,您需要额外的团队,需要more coordination。
- 每个工程师都意味着每天更多的提交和部署,在开发工具上创建负载。
- 大多数中断都是由部署引起的,因此更多部署驱动更多的中断,这反过来需要事件管理,减轻和后期。
- 更多工程师导致更专业的团队和系统,这需要越来越小的onCall旋转,为您的oncall工程师有足够的系统上下文来调试和解决生产问题,因此在oncall中投入的相对时间上升。
让我们做更多的手势数学来定位这些数学。
只有你训练有素的工程师可以坐下来,他们每月一周才能oncall,并且忙着大约一半的时间oncall。因此,每周为您的培训工程师为每周五小时的全部影响,现在是谁0.275
效率,您的团队整体现在越来越少于您雇用的每三名工程师的一次培训工程师的输出。
(This is admittedly an unfair comparison because it’s not accounting for the oncall load on the smaller initial teams, but if you accept the premise that oncall load grows as engineer headcount grows and load grows as the number of rotation grows, then the conclusion should still roughly hold.)
Although it’s rarely quite this extreme, this is where the oft raised concern that hiring is slowing us down comes from: at high enough rates, the marginal added value of hiring gets very slow, especially if your training process is weak.
有时候很低意味着负面!
系统存活了一个幅度的增长
我们已经看起来有点有效地与工程挂单的折磨关系,所以现在让我们也思考你系统上的负载都在增长。
了解增加负荷的整体影响归结为几个重要趋势:
- 大多数系统实现的系统旨在支持从电流负载的一个到两个增长的一个级别。甚至设计用于更多增长的系统往往会在一到两级的序列内陷入局限。
- 如果您的交通每六个月加倍,那么您的负载每十八个月增加一个数量级。(有时新功能或产品导致负荷增加更快。)
- 支持系统的基数随着组成团队而随着时间的推移而增加,并作为“琐碎”的系统从未支持的追随者到整个团队的焦点,因为他们达到了缩放的平台(像Kafka这样的东西,邮件交付,REDIS等)。
如果您的公司正在将系统设计为持续一个数量级并每六个月加倍,那么您必须每三三年重新实现每种系统。这会创造大量的风险 - 几乎每个平台团队正在研究一个关键的缩放项目 - 也可以创造大量的资源争用来完成这些并发重写。
However, the real productivity killer is not system rewrites, but the migrations which follow those rewrites. Poorly designed migrations expand the consequences of this rewrite loop from the individual teams supporting the systems to the entire surrounding organization.
If each migration takes a week, each team is eight engineers, and you’re doing four migrations a year, then you’re losing about 1% of your company’s total productivity. If each of those migrations takes closer to a month, or if they are only possible for your small cadre of trained engineers whose time is already tightly contended for, then the impact becomes far more pronounced.
There is a lot more that could be said here–companies that mature rapidly often have tight and urgent deadlines around moving to multiple datacenters, to active-active designs, to new international regions and other critical projects–but I think we’ve covered our bases on how increasing system load can become a drag on overall engineering throughput.
真正的问题是,我们对此有何了解?
管理熵的方法
我最喜欢的观察凤凰项目是,当他们完成时,您只能从项目中获取价值:在其他方面取得进展,您必须确保您的一些项目完成。
That might imply that there is an easy solution, but finishing projects is pretty hard when most of your time is consumed by other demands.
Let’s tackle hiring first, as hiring and training are often a team’s biggest time investment.
When your company has decided it is going to grow, you cannot stop it from growing, but on the other hand you absolutely can concentrate that growth, such that your teams alternate between periods of rapid hiring and periods of consolidation and gelling. Most teams work best when scoped to approximately eight engineers, so as each team gets to that point, you can move the hiring spigot to another team (or to a new team). As the post-hiring team gels, eventually the entire group will be trained and able to push projects forward.
You can do something similar on an individual basis, rotating engineers off of interviewing periodically to give them time to recouperate. With high interview loads, you’ll sometimes notice last year’s solid interviewer giving a poor candidate experience or rejecting every incoming candidate. If they’re doing more than three interviews a week, it is a useful act of mercy to give them a month off every three or four months.
我的证据表明如何解决这个问题的培训组成部分,但通常您开始看到大公司为新雇用训练营和经常性教育课进行重大投资。
I’m optimistically confident that we’re not entirely cargo-culting this idea from each other, so it probably works, but I hope to get an opportunity to spend more time understanding how effective those programs can be. If you could get training down to four weeks, imagine how quickly you could hire without overwhelming the existing team!
我发现的第二个最有效的时间小偷是adhoc中断:在肩膀上徘徊或松弛,在肩膀上轻拍,从您的oncall系统,高批量电子邮件列表等警报。
The strategy here is to funnel interrupts into an increasingly small surface area, and then automate that surface area as much as possible. Asking people to file tickets, creating chatbots which automate filing tickets, creating a服务食谱, and so on.
通过该设置到位,为可用于回答问题的人创造旋转,并培训团队不回答其他形式的中断。这是非常不舒服的,因为我们想要有用的人类,但随着中断的数量爬得更高。
我发现这里发现非常有用的一个特定工具是一个所有权注册表,它允许您查找谁拥有什么,消除频繁的“拥有X?”各种问题。你需要这种事情来自动分页正确的oncall旋转,所以也可能得到两个有用的工具!
A similar variant of this is adhoc meeting requests. The best tool I’ve found for this is to block out a few large blocks of time each week to focus. This can range from working from home on Thursday, to blocking out Monday and Wednesday afternoons, to blocking out from eight to eleven each morning. Experiment a bit and find something that works well for you.
最后,我发现的一件事是在非常少数中断的公司,几乎无处可去的是:真的很棒,一贯有用的文件。It’s probably even harder to bootstrap documentation into a non-documenting company than it is to bootstrap unittests into a non-testing company, but the best solution to frequent interruptions I’ve seen is a culture of documentation, documentation reading, and a documentation search that actually works.
有一个非零数量的公司,内部文件很好,但我不太肯定是否有没有超过二十台工程师的公司数量。如果你知道,请告诉我,所以我可以挑选他们的大脑。
在我看来,可能是最重要的机会正在设计你的软件是灵活的。我有described this as “fail open and layer policy”; the best system rewrite is the one that didn’t happen, and if you can avoid baking in arbitrary policy decisions which will change frequently over time, then you are much more likely to be able to keep using a system for the long term.
如果您将每隔几年重写您的系统,因此避免任何不必要的重写,雅知道?
沿着这些行,如果您可以将接口通用保留,则可以跳过系统重新实现的迁移阶段,这往往是最长且最棘手的阶段,并且可以更快地遍历并保持更快的并发版本。There is absolutely a cost to maintaining this extra layer of indirection, but if you’ve already rewritten a system twice, take the time to abstract the interface as part of the third rewrite and thank yourself later (by the time you’d do the fourth rewrite you’d be dealing with migrating six times the engineers).
Finally, a related antipattern is the gatekeeper pattern. Having humans who perform gatekeeping activities creates very odd social dynamics, and is rarely a great use of a human’s time. When at all possible, build systems with sufficient isolation that you can allow most actions to go forward and when they do occasion to fail, make sure they fail with a limited blast radius.
有一些情况有法律或合规性原因所必需的守门员,或者因为系统非常虚弱,但我认为我们通常应该将网守视为一个重要的实现错误而不是要仿真的稳定性功能。
结束思想
这里的想法都没有即时获胜。我的感觉是,在堆积小胜的线上比识别银子,这是一个感觉。我已经使用了所有这些技术,并且在某种程度上使用了大多数人,所以希望他们至少给你一些想法。
东西有点忽略了一点就是how to handle urgent projects requests when you’re already underwater with your existing work and maintenance. The most valuable skill there is learning to say no in a way that is appropriate to your company’s culture, which probably deserves it’s own article. There are probably some companies where saying no is culturally impossible, and in those places I guess you either learn to say your no’s as yes’s or maybe you find a slightly easier environment to participate in.
How do you remain productive in times of hypergrowth?