But what do you do with this newfound knowledge? Why, you kick off a Cost Accounting or Efficiency project to start optimizing your spend! These projects tend to take on a two-pronged approach of:
First, let’s start with the fundamental unit of infrastructure cost: a server. For each server you probably have a role (your `mysql01-west` is probably in a `mysql` role), and for each role you can probably assign an owner to it (perhaps your `database` team in this case). Now you write a quick script which queries your server metadata store every hour and emits the current servers and owners into Graphite or some such.
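That hourly script might look something like the sketch below. It assumes the metadata store hands back a list of records with `host`, `role` and `owner` fields, and the `infra.servers.<owner>.<role>.count` metric path is made up for illustration. The only real convention it leans on is Graphite's plaintext protocol, which takes one `path value timestamp` line per metric, by default on port 2003.

```python
import socket
import time
from collections import Counter

# Hypothetical output of the server metadata store; in practice this
# would come from whatever inventory system you already run.
SERVERS = [
    {"host": "mysql01-west", "role": "mysql", "owner": "database"},
    {"host": "mysql02-west", "role": "mysql", "owner": "database"},
    {"host": "kafka01-west", "role": "kafka", "owner": "streaming"},
]


def graphite_lines(servers, now=None):
    """Render one Graphite plaintext line per (owner, role) pair."""
    timestamp = int(time.time()) if now is None else int(now)
    counts = Counter((s["owner"], s["role"]) for s in servers)
    return [
        f"infra.servers.{owner}.{role}.count {count} {timestamp}"
        for (owner, role), count in sorted(counts.items())
    ]


def send_to_graphite(lines, host="localhost", port=2003):
    """Ship the lines over TCP; host and port are assumptions."""
    payload = "".join(line + "\n" for line in lines)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload.encode("ascii"))


if __name__ == "__main__":
    # Print instead of sending so the sketch runs without a Graphite server.
    for line in graphite_lines(SERVERS):
        print(line)
```

Run it from cron every hour and you get a time series of server counts per owner, which is enough to multiply by a per-host price later.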
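Job well done! Your work here is finished... until later that week, when you’re chatting with a team who have just migrated from dedicated hosts to Mesos or Kubernetes, and they raise an interesting question: “Sure, we glorious engineers on the Scheduling team run the Kubernetes cluster, but we don’t write any of the apps. Why are we responsible for their costs?”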
So you head back to your desk and write down the resources provided by each machine:
Thinking about these resources and your existing per-host costs, you’re able to establish a more granular pricing model for each resource. Then you take that model and add it as a layer on top of your per-host model, where the Scheduling team is able to attribute their server costs downstream to the processes which run on their servers, as long as they’re able to start capturing high-fidelity per-process utilization metrics and maintaining a process-to-team mapping (that was on their roadmap anyway).
- team server costs,
- team process costs.
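To make the layering of server costs and process costs concrete, here is a minimal sketch of a per-resource pricing model. Every number in it is invented: the host cost, the capacities, and especially the weights that split a host's price across CPU, memory and disk are assumptions you would have to pick for your own fleet. The process names and the process-to-team mapping are hypothetical too.

```python
# Assumed monthly cost and capacity of one host (illustrative numbers only).
HOST_MONTHLY_COST = 1000.0
HOST_CAPACITY = {"cpu_cores": 32, "memory_gb": 128, "disk_gb": 2000}

# Assumed share of the host's cost carried by each resource; sums to 1.0.
COST_WEIGHTS = {"cpu_cores": 0.5, "memory_gb": 0.3, "disk_gb": 0.2}


def unit_prices(host_cost, capacity, weights):
    """Monthly price of one unit of each resource."""
    return {res: host_cost * weights[res] / capacity[res] for res in capacity}


def process_cost(usage, prices):
    """Price a single process from its measured resource usage."""
    return sum(prices[res] * amount for res, amount in usage.items())


def team_process_costs(process_usage, process_owner, prices):
    """Roll per-process costs up to the owning team."""
    totals = {}
    for process, usage in process_usage.items():
        team = process_owner[process]
        totals[team] = totals.get(team, 0.0) + process_cost(usage, prices)
    return totals


if __name__ == "__main__":
    prices = unit_prices(HOST_MONTHLY_COST, HOST_CAPACITY, COST_WEIGHTS)
    usage = {
        "search-api": {"cpu_cores": 8, "memory_gb": 32, "disk_gb": 100},
        "checkout": {"cpu_cores": 4, "memory_gb": 16, "disk_gb": 50},
    }
    owners = {"search-api": "search", "checkout": "payments"}
    for team, cost in sorted(team_process_costs(usage, owners, prices).items()):
        print(f"{team}: ${cost:.2f}/month")
```

Whatever capacity isn't claimed by a process stays on the Scheduling team's bill as a server cost, which is a design choice in its own right: it gives them a reason to pack hosts tightly.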
Over the following weeks, you’re surprised as every infrastructure team pings you to chat. The team running Kafka wants to track traffic per topic and to attribute cost back to publishers and consumers by utilization; that’s fine with you, it fits into your existing attribution model. The Database team wants to do the same with their MySQL databases, which is a little more confusing because they want to build a model which attributes disk space, IOPS, and CPU, but you’re eventually able to figure out some heuristics that are passable enough to create visibility. (What are query comments meant for if not injecting increasingly complex structured data into every query?)
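Attribution by utilization mostly comes down to splitting a shared bill in proportion to measured usage. A rough sketch for the Kafka case follows, assuming you can already collect bytes published and consumed per topic per team. The record format, the team names and the cluster cost are all hypothetical.

```python
from collections import defaultdict

# Hypothetical usage records: (topic, team, direction, bytes in the period).
USAGE = [
    ("orders", "payments", "publish", 400),
    ("orders", "search", "consume", 300),
    ("clicks", "growth", "publish", 100),
    ("clicks", "payments", "consume", 200),
]


def bytes_by_team(records):
    """Sum published and consumed bytes for each team across all topics."""
    totals = defaultdict(int)
    for _topic, team, _direction, nbytes in records:
        totals[team] += nbytes
    return dict(totals)


def split_by_utilization(total_cost, usage_by_team):
    """Split a shared cost across teams in proportion to their usage."""
    total_usage = sum(usage_by_team.values())
    if total_usage == 0:
        # Nobody used anything; leave the cost with the platform owner.
        return {team: 0.0 for team in usage_by_team}
    return {
        team: total_cost * used / total_usage
        for team, used in usage_by_team.items()
    }


if __name__ == "__main__":
    shares = split_by_utilization(9000.0, bytes_by_team(USAGE))
    for team, cost in sorted(shares.items()):
        print(f"{team}: ${cost:.2f}")
```

Weighting publishers and consumers equally per byte is itself a heuristic; a MySQL version would need one such split each for disk space, IOPS and CPU.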
The new SRE manager started scheduling weekly incident review meetings, and you listen absentmindedly while the Kafka team talks about an outage caused by a publisher which started generating far more load than usual. It’s a bummer that Kafka keeps going down, but at least their spend is going down, and it has nothing to do with you. Later, you awake in a panic when someone suggests that we just massively overprovision the Kafka cluster to avoid these problems. You sputter out an incoherent squawk of rage at this suggestion–we’ve made too much progress on reducing costs to regress now!–and leave the meeting shaken.
Next week, the MySQL team is in the incident review meeting because they ran out of disk space and had a catastrophic outage. A sense of indigestion starts to creep into your gut as you see the same person as last week gear up to speak, and then she says it, she says it again: “Shouldn’t we spend our way to reliability here?”
Demoralized on the way to get your fifth LaCroix for the day, you see the CFO walking your way. He’s been one of your biggest supporters on the cost initiative, and you perk up anticipating a compliment. Unfortunately, he starts to drill into why infrastructure costs have returned to the same growth curve and almost the same levels they were at when you started your project.
Maybe, you ponder to yourself on the commute home, no one is even looking at their cost data. Could a couple of thoughtful nudges fix this?
You program up automated monthly cost reports for each team, showing how much they are spending, and also where their costs fit into the overall team spend distribution. Teams with low spends start asking you if they can trade their savings in for a fancy offsite, and teams with high spends start trying to reduce their spend again. (You even roll out a daily report that only fires if it detects anomalously high spend after a new EMR job spends a couple million dollars analyzing Nginx logs.)
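The daily anomaly report can start out very simple. The sketch below flags a day whose spend lands more than three standard deviations above the trailing average; both the three-sigma threshold and the sample figures are assumptions, and a real version would probably want to account for weekly seasonality and steady growth.

```python
import statistics


def is_anomalous(history, today, sigmas=3.0):
    """Return True if today's spend is far above the trailing history.

    history: recent daily spend figures for one team (at least two days).
    today:   the spend figure being checked.
    """
    if len(history) < 2:
        # Not enough data to estimate spread; stay quiet rather than guess.
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return today > mean + sigmas * stdev


def daily_report(spend_by_team, today_by_team):
    """List the teams whose spend today looks anomalously high."""
    return sorted(
        team
        for team, history in spend_by_team.items()
        if is_anomalous(history, today_by_team.get(team, 0.0))
    )


if __name__ == "__main__":
    history = {
        "search": [100, 105, 98, 102, 101, 99, 103],
        "data": [400, 410, 395, 405, 398, 402, 407],
    }
    today = {"search": 104, "data": 2_000_000}
    flagged = daily_report(history, today)
    if flagged:
        # Only fire when there is something to say.
        print("Anomalously high spend:", ", ".join(flagged))
```

Because the report stays silent on normal days, people are more likely to read it on the day it does fire.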
As you type up notes from your latest tech phone screen, you can hear an argument through the thin, hastily constructed conference room walls. The product manager is saying that the best user experience is to track and retain all incoming user messages in their mailbox forever, and the engineer is yelling that it isn’t possible: only bounded queues have predictable behavior, an unbounded queue is just a queue with undefined failure conditions!
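For the Kafka team, constrained on throughput, you decide on strict per-application per-minute rate limits. The MySQL team, where bad queries are saturating IOPS, starts circuit breaking applications generating poor queries. To work around Splunk’s strict enforcement of daily indexing quotas, you roll out a simple quality of service strategy: applications specify a log priority in the structured data segment of their syslogs, and you shed lower priority logs as needed to avoid overages. For users hammering your external API, you start injecting latency, which causes them to behave like kinder clients with backoffs, even though they haven’t changed a thing. (Assuming you can handle a large number of concurrent, idle connections. Otherwise you’ve just handed yourself a new problem, as you’ll realize later to your great chagrin.)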
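As one illustration, a strict per-application per-minute limit can be as small as a fixed-window counter. This is a sketch under assumptions rather than what any particular Kafka setup does: the class name, the in-memory state and the injectable clock are all invented here, and a real deployment would need shared state across brokers or proxies.

```python
import time


class PerMinuteRateLimiter:
    """Fixed-window limiter: at most `limit` requests per app per minute."""

    def __init__(self, limit_per_minute, clock=time.time):
        self.limit = limit_per_minute
        self.clock = clock  # injectable so the limiter can be tested
        self.state = {}  # app -> (minute window, requests seen in it)

    def allow(self, app):
        """Return True and count the request if `app` is under its limit."""
        window = int(self.clock() // 60)
        last_window, count = self.state.get(app, (window, 0))
        if last_window != window:
            # A new minute has started, so the counter resets.
            count = 0
        if count >= self.limit:
            self.state[app] = (window, count)
            return False
        self.state[app] = (window, count + 1)
        return True


if __name__ == "__main__":
    limiter = PerMinuteRateLimiter(limit_per_minute=2)
    for attempt in range(3):
        print(attempt, limiter.allow("checkout"))
```

A fixed window lets a client burst at the boundary between two minutes; a sliding window or token bucket smooths that out at the cost of a little more bookkeeping.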