A couple days ago Lyft released Envoy, a very exciting layer 7 proxy. This is an area that I’ve been thinking about a fair amount, both in terms of rolling out widespread quality of service and in terms of request routing across large, multi-datacenter environments.
Increasingly, centralized load balancers seem to be going out of style. Partially this is due to their cost and complexity, but my guess is that it’s mostly their lack of flexibility and control. As a programmer, it’s relatively intuitive to manage HAProxy programmatically, in a way that most centralized load balancers don’t offer. (It’s not even that easy to manage HAProxy programmatically, although it’s making steady progress.)
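As a sketch of what “managing HAProxy programmatically” tends to look like in practice, here is a hypothetical Python snippet that renders a backend stanza from a service registry snapshot (the registry contents and names are invented for illustration):

```python
# Sketch of programmatic HAProxy management: render a backend section
# from a service registry snapshot, then reload HAProxy to pick it up.
# The registry contents below are hypothetical placeholder data.

def render_backend(name, servers, check_interval_ms=2000):
    """Render an HAProxy backend section for one service."""
    lines = [f"backend {name}", "    balance roundrobin"]
    for i, (host, port) in enumerate(servers):
        lines.append(
            f"    server {name}-{i} {host}:{port} check inter {check_interval_ms}"
        )
    return "\n".join(lines)

registry = {"users": [("10.0.0.1", 8080), ("10.0.0.2", 8080)]}
cfg = "\n\n".join(render_backend(n, s) for n, s in registry.items())
print(cfg)

# A real deployment would write this into haproxy.cfg and then run
# something like `haproxy -f haproxy.cfg -sf $(pidof haproxy)` to
# reload without dropping in-flight connections.
```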
At the other end of the continuum is per-process load balancing along the lines of Finagle, where each one of your processes maintains its own health-checking and routing tables. This tends to allow a truly epic level of customization and control, but requires that you solve the fairly intimidating domain of library management.
For companies running on one language or runtime, typically the JVM, and who operate a monorepo, the challenges of library management are pretty tractable. For most everyone else–those with three or more languages or with a service per repo–these problems require very significant ecosystem investment.
Some companies committed to supporting many languages and many repositories (Netflix, Uber) are slowly building out that infrastructure, but by and large this isn’t a problem you want to take on, which has been the prime motivation to develop the sidecar approach.
The sidecar model is to have a process running per server, and to have all processes on a server communicate through that local process. Many companies–including any company that ever hired a Digg systems engineer–have been running HAProxy in this mode for quite a while, and it gives a good combination of allowing flexibility without requiring that you solve the library management problem.
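A hypothetical haproxy.cfg fragment shows the shape of the sidecar setup: every process on the host talks to a port on 127.0.0.1, and the local HAProxy fans those requests out to the real service instances (all names and addresses below are made up):

```haproxy
# Hypothetical sidecar fragment: local processes connect to
# 127.0.0.1:9001 for the "users" service; the sidecar routes.
frontend users_local
    bind 127.0.0.1:9001
    default_backend users

backend users
    balance roundrobin
    server users-0 10.0.0.1:8080 check
    server users-1 10.0.0.2:8080 check
```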
The downside of sidecar routers has traditionally been poor support for anything beyond HTTP/1.0 and HTTP/1.1, and that there have been few options which are both high-performance and highly customizable (HAProxy has amazing performance, but becomes fairly brittle when you try to do more complex things with it).
That is where Envoy fits in.
First, the docs are simply excellent. They do a wonderful job in particular of explaining the high-level architecture and giving a sense of the design constraints which steered them towards that architecture. I imagine it was a hard discussion to decide to delay releasing the project until the documentation was polished to this state, but I’m really glad they did.
The approach to supporting ratelimiting is also great. So many systems require adopting their ecosystem of supporting tools (Kafka, requiring that you also roll out Zookeeper, is a good example of this), but Envoy defines a GRPC interface that a ratelimiting server needs to provide, and thus abstracts itself from the details of how ratelimiting is implemented. I’d love to see more systems take this approach to allow greater interoperability. (Very long term, I look forward to a future where we have shared standard interfaces for common functionality like ratelimiting, updating routing tables, logging, tracing, and so on.)
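As a sketch of the idea (simplified names, not Envoy’s exact proto), the interface boils down to a single RPC that any ratelimiting server can implement, however it happens to store its counters:

```proto
// Hedged sketch of a pluggable ratelimit interface; field and service
// names here are illustrative, not copied from Envoy's definition.
syntax = "proto3";

message RateLimitRequest {
  string domain = 1;               // e.g. a per-service namespace
  repeated string descriptors = 2; // what is being limited: user, IP, route
}

message RateLimitResponse {
  enum Code { OK = 0; OVER_LIMIT = 1; }
  Code overall_code = 1;
}

service RateLimitService {
  rpc ShouldRateLimit(RateLimitRequest) returns (RateLimitResponse);
}
```

The proxy only depends on this contract, so the backing store (Redis, in-memory, whatever) stays swappable.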
Hot restart using a unix domain socket is interesting as well, in particular how containers have become a first-class design constraint. Instead of something hacked on later, “How will this work with container deployment?” is asked at the start of new system design!
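The underlying mechanism is ordinary fd passing: a listening socket’s file descriptor travels to a new process over a unix domain socket (SCM_RIGHTS), so the replacement can serve the same port without the port ever closing. A minimal Python sketch (Python 3.9+ on Unix; the fd is passed within one process here for brevity, where a real hot restart sends it to the freshly exec’d binary):

```python
import socket

# The "old" process owns a listening socket.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen()
port = listener.getsockname()[1]

# The channel the two processes would share (a named unix socket in practice).
parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Old process hands over the listening fd alongside a message.
socket.send_fds(parent, [b"take-over"], [listener.fileno()])

# New process receives the fd and rebuilds a socket object around it.
msg, fds, _, _ = socket.recv_fds(child, 1024, 1)
inherited = socket.socket(fileno=fds[0])
assert inherited.getsockname()[1] == port  # same port, never closed
```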
For a long time it felt like Thrift was the front-running IDL, but lately it’s hard to argue with GRPC: it has good language support and the decision to build on HTTP/2 means it has been able to introduce streaming bidirectional communication as a foundational building block for services. It’s very exciting to see Envoy betting on GRPC in a way which will reduce the overhead of adopting GRPC and HTTP/2 for companies which are already running sidecar routers.
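A hypothetical service definition shows what HTTP/2 framing buys: streaming in either direction is expressible directly in the IDL rather than bolted on afterwards (names below are invented for illustration):

```proto
syntax = "proto3";

message Event { string body = 1; }

service EventBus {
  rpc Publish(Event) returns (Event);                // classic unary RPC
  rpc Subscribe(Event) returns (stream Event);       // server streaming
  rpc Exchange(stream Event) returns (stream Event); // bidirectional streaming
}
```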
All of those excellent things aside, I think what’s most interesting is the possibility of a broader ecosystem developing on top of the HTTP and TCP filter mechanisms. The approach seems sufficiently generalized that someone will be able to add more protocols and to customize behavior to their needs (to trade out the centralized ratelimiting for a local ratelimiter or the local circuit breaker for a centralized circuit breaker, etc).
You can imagine an ecosystem with TCP filters that understand the MySQL and PostgreSQL protocols, a fuller suite of backpressure tooling for latency injection, global circuit breaking along the lines of Hystrix (it would be pretty amazing to integrate with Hystrix’s dashboard versus needing to recreate something similar), and even the ability to manually block specific IPs to provide easier protection from badly behaving clients (certainly not at the level of a DDoS, rather a runaway script or some such).
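For the local flavor of circuit breaking, the core state machine is small; here is a minimal, hypothetical Python sketch (thresholds and names invented for illustration, not taken from Hystrix or Envoy):

```python
import time

class CircuitBreaker:
    """Fail fast locally after repeated upstream failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures to trip
        self.reset_after = reset_after    # seconds before probing again
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        """Return False while the circuit is open (failing fast)."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_after:
            self.opened_at = None  # half-open: let one request probe
            self.failures = 0
            return True
        return False

    def record(self, ok):
        """Record the outcome of a request."""
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()

cb = CircuitBreaker(max_failures=2, reset_after=10.0)
cb.record(False)
cb.record(False)       # second consecutive failure trips the breaker
assert not cb.allow()  # requests now fail fast locally
```

The interesting design question the filter mechanism opens up is exactly where this state lives: per-process, per-sidecar, or in a shared centralized service.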
In terms of a specific example of a gap in the initial offering, my experience is that distributed healthchecking–even with local caching of health check results–is too expensive at some scale (as a shot in the dark, let’s say somewhere around the scale of 10,000 servers with 10 services per server). Providing a centralized healthcheck cache would be an extremely valuable extension to Envoy for very large-scale deployments. (Many people will never run into this issue, either due to not needing that many servers and services, or because you scope your failure zones to something smaller. It’s probably true that a proactive engineering team with enough resources would never need this functionality. If someone does want to take a stab at this, the RAFT protocol has some pretty good ideas about decentralized healthchecking.)
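A back-of-envelope calculation with those hypothetical numbers shows why full-mesh healthchecking hurts; the 15-second check interval below is my own assumption, chosen only to make the arithmetic concrete:

```python
# Worst-case model: every sidecar independently healthchecks every
# endpoint, using the post's hypothetical scale.
servers = 10_000
endpoints = servers * 10  # 10 services per server -> 100,000 instances
interval_s = 15           # assumed per-sidecar check interval

checks_per_sec = servers * endpoints / interval_s
received_per_endpoint = servers / interval_s  # checks hitting each instance

print(f"{checks_per_sec:,.0f} checks/sec cluster-wide")
print(f"{received_per_endpoint:,.0f} checks/sec arriving at each instance")

# A centralized healthcheck cache collapses this to one checker (or a
# small quorum) per endpoint, with sidecars reading the cached results.
```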