It’s hard to write about engineering leadership in 2020 and not mention the research from Accelerate and DORA. They provide a data-driven perspective on how to improve developer productivity, which is a genuinely magical thing. So why aren’t they used more?
There are three core problems I see:
I’m personally convinced that these companies are selling products that harm the companies using them rather than help them. Using productivity metrics to measure individuals this way is akin to incident retrospectives that identify human error as the root cause. It’s performative, and if you want to blame someone, just go ahead and blame them; don’t waste your time gathering arbitrary metrics to support it.
As long as tooling keeps privileging the manager who wants to grade their team over the manager who wants to learn from the development process, these tools will be actively distrusted by the engineers they instrument and will create false confidence in the managers using them. The right tool here should be designed exclusively from a learning perspective.
Need observability over monitoring
In the monitoring and observability space, Honeycomb and Lightstep have pushed a definition of observability centered on supporting ad-hoc rather than precomputed queries. Monitoring tools like Grafana/Graphite/Statsd might push you to emit measures that are pre-aggregated to support certain queries, like “what’s the p50 latency by service, by datacenter?” Observability tools like Honeycomb and Lightstep push you to emit events which are then aggregated at query time, which supports answering the same question you’d ask Grafana, but also questions like “what are the latencies for requests to servers canarying this new software commit?” or even “show me full traces for all requests running this commit.”
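To make the distinction concrete, here’s a minimal sketch of the two emission styles. The `emit_metric` and `emit_event` helpers are hypothetical stand-ins, not any particular library’s API:

```python
import json
import time

def emit_metric(name: str, value_ms: float) -> None:
    # Monitoring style: a statsd-format timer line. The backend rolls this
    # up into fixed aggregates, so only the precomputed views survive.
    print(f"{name}:{value_ms}|ms")

def emit_event(**fields) -> None:
    # Observability style: store the whole event and defer every question
    # to query time.
    print(json.dumps(fields))

emit_metric("api.request.latency", 42.0)
emit_event(
    timestamp=time.time(),
    service="api",
    datacenter="us-east-1",
    commit="abc123",     # enables "latencies for servers canarying this commit?"
    trace_id="f3a9c2",   # enables "show me full traces for this commit"
    duration_ms=42.0,
)
```

The metric can only ever answer the questions it was pre-aggregated for; the event can answer any question its fields support.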
Too many of these developer meta-productivity tools focus on monitoring-style solutions, which are a mediocre fit for measuring productivity and an even poorer fit for supporting learning. This is a shame, because the scale challenges that push infrastructure monitoring towards pre-aggregation simply don’t exist when looking at human-scale events.
For example, if you’re measuring server response times across a fleet of Kubernetes nodes running ten pods per node, then you might be looking at 100 requests per second * 10 pods per node * 1,000 nodes * 3 availability zones, which is three million measurements per second that you need to record. This drives tradeoffs towards reducing the quantity of data being stored. The observability infrastructure required to store all those events without aggregating, at a reasonable price, is complex and bespoke. (Roughly, my understanding is that most observability tools instead capture events into a ring buffer, evicting interesting events from the ring buffer into more durable storage and aggregating away the rest.)
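Here’s a rough sketch of that ring-buffer approach as I understand it; the thresholds for what counts as “interesting” are assumptions for illustration:

```python
from collections import deque

durable_storage: list = []    # stand-in for the durable event store
aggregates: dict = {}         # stand-in for rolled-up metrics

buffer: deque = deque(maxlen=10_000)  # ring buffer of recent raw events

def record(event: dict) -> None:
    if len(buffer) == buffer.maxlen:
        oldest = buffer[0]  # about to be evicted by the append below
        if oldest.get("error") or oldest.get("duration_ms", 0) > 1_000:
            # Keep interesting events (errors, slow requests) whole.
            durable_storage.append(oldest)
        else:
            # Aggregate the rest away: here, just a count per service.
            svc = oldest.get("service", "unknown")
            aggregates[svc] = aggregates.get(svc, 0) + 1
    buffer.append(event)

record({"service": "api", "duration_ms": 42.0})
```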
However, the scale that these developer meta-productivity tools operate at is so much smaller that there’s no need to solve those underlying infrastructure problems. Just write it to MySQL or Kafka (streaming to S3 for historical data) or something. There’s no reason the underlying events shouldn’t be available to support deeper understanding.
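A minimal sketch of the “just write it to a database” version, using sqlite3 as a stand-in for MySQL (the table schema here is an assumption, not a prescription):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pr_events (
        pr_id      TEXT,
        event      TEXT,   -- e.g. 'created', 'merged', 'deployed'
        actor      TEXT,
        created_at REAL
    )
""")

def record_event(pr_id: str, event: str, actor: str) -> None:
    conn.execute(
        "INSERT INTO pr_events VALUES (?, ?, ?, ?)",
        (pr_id, event, actor, time.time()),
    )

record_event("PR-1234", "created", "alice")
record_event("PR-1234", "merged", "bob")

# An ad-hoc question answered at query time, not via a precomputed metric:
# how long did each PR take from its first to its last recorded event?
for row in conn.execute(
    "SELECT pr_id, MAX(created_at) - MIN(created_at) "
    "FROM pr_events GROUP BY pr_id"
):
    print(row)
```

Even a large organization’s pull requests generate a trickle of events compared to the three-million-per-second infrastructure case above, so keeping every raw event is cheap.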
The basic workflow I’d like to see these systems offer is the same as a request trace. Each pull request has a unique identifier that’s passed along to the developer productivity tooling, and you can see each step of its journey in your dashboarding tooling. Critically, this lets you follow the typical trace instrumentation workflow of starting broad (maybe just the PR being created, being merged, and then being deployed) and adding more spans into that trace over time to increase insight into the problematic segments.
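A rough sketch of what that could look like, assuming a hypothetical `Span` shape and `emit` backend rather than any specific tracing product:

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class Span:
    trace_id: str   # the PR's unique identifier, threaded through every tool
    name: str       # which segment of the PR's journey this span covers
    start: float
    end: float

def emit(span: Span) -> None:
    # Stand-in for sending spans to whatever backs your dashboards.
    print(json.dumps(asdict(span)))

now = time.time()

# Start broad: the whole journey as two coarse spans...
emit(Span("PR-1234", "pr.open_to_merge", now, now + 3_600))
emit(Span("PR-1234", "merge_to_deploy", now + 3_600, now + 7_200))

# ...then add finer spans over time where the trace looks problematic.
emit(Span("PR-1234", "ci.test_suite", now + 120, now + 1_500))
```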
So, who cares?
I’m writing this (a) in the hope that there are folks out there working on this problem from this perspective, and (b) as a reusable explanation of the sort of developer meta-productivity tools I’m excited about when folks email me for feedback/angel investment/etc.
For what it’s worth, I’m not necessarily saying I think this will be a good business. I’m confident it’s the tool that engineering leadership teams need to more effectively invest in developer productivity, but I’m less confident it’s the sort of thing people want to pay for; figuring out the go-to-market and distribution strategy is probably the hardest part of this sort of product. This points to another advantage of the observability/traces/spans approach: you can import the event history and show value immediately, instead of having to have folks use the tool to build out new metric aggregations.
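As a hedged illustration of that import-the-history point: GitHub’s REST API exposes closed pull requests via `GET /repos/{owner}/{repo}/pulls`, so a backfill could replay each PR’s created/merged timestamps as trace events. The `emit_event` helper below is a hypothetical stand-in for writing into the event store:

```python
import json
import requests

def emit_event(**fields) -> None:
    print(json.dumps(fields))  # stand-in for writing into the event store

def backfill(owner: str, repo: str) -> None:
    # Pull recently closed PRs and replay their lifecycle as events.
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/pulls",
        params={"state": "closed", "per_page": 100},
    )
    for pr in resp.json():
        emit_event(trace_id=str(pr["number"]), name="pr.created",
                   at=pr["created_at"])
        if pr.get("merged_at"):
            emit_event(trace_id=str(pr["number"]), name="pr.merged",
                       at=pr["merged_at"])
```

A dashboard populated from a backfill like this has something to show on day one, before anyone has changed their workflow.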