你没有听到这个词超生长那么你做了几年前。肯定的是,you might hear it any given week, but you might open up Techmeme and not see it, which is a monumental return to a kinder, gentler past. (Or perhaps we’re just独角兽现在。）
We’re growing so quickly, that every six months we’re a new company.
Which means our process is always six months behind our headcount.
Helping your team be successful when defunct process merges with a constant influx of new engineers and system load has been one of the most rewarding opportunities I’ve had in my career, and this is an attempt to explore the challenges and some strategies I’ve seen for mitigating and overcoming them.
All real-world systems have some degree of inherent self-healing properties: an overloaded database will slow down enough that someone fixes it, overwhelmed employees will get slow at finishing work until someone finds a way to help.
Just how challenging depends on how quickly you can ramp engineers up to self-sufficient productivity, but if you’re doubling every six months and it takes six to twelve months to ramp up, then you can quickly find a scenario where untrained engineers increasing outnumber the trained engineers, and each trained engineer is devoting much of their time to training a couple of newer engineers.
and you reach the above chart’s (admittedly, pretty worst-case scenario) ratio of two untrained to one trained. Worse, for those three people you’re only getting the productivity of
1.16trained engineers (
2 * .33对于未训练的工程师加上
.5 * 1for the trainer).
If you’re trying to double every six months, and about ten percent of candidates undertaking phone screens eventually join, then you need to do ten interviews per existing engineer in that time period, with each interview taking about two hours to prep, perform and debrief.
That’s only four hours per month if you can leverage your entire existing team, but training comes up again here: if it takes you six months to get the average engineer onto your interview loop, each trained engineer is now doing three to four hours of hiring related work a week, and your trained engineers are down to
- 对于每〜10名工程师，您需要额外的团队，需要more coordination。
(This is admittedly an unfair comparison because it’s not accounting for the oncall load on the smaller initial teams, but if you accept the premise that oncall load grows as engineer headcount grows and load grows as the number of rotation grows, then the conclusion should still roughly hold.)
Although it’s rarely quite this extreme, this is where the oft raised concern that hiring is slowing us down comes from: at high enough rates, the marginal added value of hiring gets very slow, especially if your training process is weak.
如果您的公司正在将系统设计为持续一个数量级并每六个月加倍，那么您必须每三三年重新实现每种系统。这会创造大量的风险 - 几乎每个平台团队正在研究一个关键的缩放项目 - 也可以创造大量的资源争用来完成这些并发重写。
However, the real productivity killer is not system rewrites, but the migrations which follow those rewrites. Poorly designed migrations expand the consequences of this rewrite loop from the individual teams supporting the systems to the entire surrounding organization.
If each migration takes a week, each team is eight engineers, and you’re doing four migrations a year, then you’re losing about 1% of your company’s total productivity. If each of those migrations takes closer to a month, or if they are only possible for your small cadre of trained engineers whose time is already tightly contended for, then the impact becomes far more pronounced.
There is a lot more that could be said here–companies that mature rapidly often have tight and urgent deadlines around moving to multiple datacenters, to active-active designs, to new international regions and other critical projects–but I think we’ve covered our bases on how increasing system load can become a drag on overall engineering throughput.
That might imply that there is an easy solution, but finishing projects is pretty hard when most of your time is consumed by other demands.
Let’s tackle hiring first, as hiring and training are often a team’s biggest time investment.
When your company has decided it is going to grow, you cannot stop it from growing, but on the other hand you absolutely can concentrate that growth, such that your teams alternate between periods of rapid hiring and periods of consolidation and gelling. Most teams work best when scoped to approximately eight engineers, so as each team gets to that point, you can move the hiring spigot to another team (or to a new team). As the post-hiring team gels, eventually the entire group will be trained and able to push projects forward.
You can do something similar on an individual basis, rotating engineers off of interviewing periodically to give them time to recouperate. With high interview loads, you’ll sometimes notice last year’s solid interviewer giving a poor candidate experience or rejecting every incoming candidate. If they’re doing more than three interviews a week, it is a useful act of mercy to give them a month off every three or four months.
I’m optimistically confident that we’re not entirely cargo-culting this idea from each other, so it probably works, but I hope to get an opportunity to spend more time understanding how effective those programs can be. If you could get training down to four weeks, imagine how quickly you could hire without overwhelming the existing team!
The strategy here is to funnel interrupts into an increasingly small surface area, and then automate that surface area as much as possible. Asking people to file tickets, creating chatbots which automate filing tickets, creating a服务食谱, and so on.
A similar variant of this is adhoc meeting requests. The best tool I’ve found for this is to block out a few large blocks of time each week to focus. This can range from working from home on Thursday, to blocking out Monday and Wednesday afternoons, to blocking out from eight to eleven each morning. Experiment a bit and find something that works well for you.
最后，我发现的一件事是在非常少数中断的公司，几乎无处可去的是：真的很棒，一贯有用的文件。It’s probably even harder to bootstrap documentation into a non-documenting company than it is to bootstrap unittests into a non-testing company, but the best solution to frequent interruptions I’ve seen is a culture of documentation, documentation reading, and a documentation search that actually works.
在我看来，可能是最重要的机会正在设计你的软件是灵活的。我有described this as “fail open and layer policy”; the best system rewrite is the one that didn’t happen, and if you can avoid baking in arbitrary policy decisions which will change frequently over time, then you are much more likely to be able to keep using a system for the long term.
沿着这些行，如果您可以将接口通用保留，则可以跳过系统重新实现的迁移阶段，这往往是最长且最棘手的阶段，并且可以更快地遍历并保持更快的并发版本。There is absolutely a cost to maintaining this extra layer of indirection, but if you’ve already rewritten a system twice, take the time to abstract the interface as part of the third rewrite and thank yourself later (by the time you’d do the fourth rewrite you’d be dealing with migrating six times the engineers).
Finally, a related antipattern is the gatekeeper pattern. Having humans who perform gatekeeping activities creates very odd social dynamics, and is rarely a great use of a human’s time. When at all possible, build systems with sufficient isolation that you can allow most actions to go forward and when they do occasion to fail, make sure they fail with a limited blast radius.
东西有点忽略了一点就是how to handle urgent projects requests when you’re already underwater with your existing work and maintenance. The most valuable skill there is learning to say no in a way that is appropriate to your company’s culture, which probably deserves it’s own article. There are probably some companies where saying no is culturally impossible, and in those places I guess you either learn to say your no’s as yes’s or maybe you find a slightly easier environment to participate in.
How do you remain productive in times of hypergrowth?