Service 間通信を Envoy 経由で行い、Metrics を取得する

はじめに

現職では Application はすべて Kubernetes 上で動いている。その場合、インターネットからの通信経路は以下のようになる。

Internet -> Reverse Proxy(Nginx) -> Service Router(Nginx) -> Kubernetes Service -> Pod

で、後半の Service Router から先が Kubernetes Cluster となっている。Type: LoadBalancer で受けたあと、forwarding して service-router と呼んでいる Nginx に飛ばしている。

現状、一部で Microservices が動いていたり、Distributed Monolith を Microservices に切ろうとしていたりするが、Service Router 以降の、Pod（Application）同士の通信は Kubernetes Service 作成時に kube-dns に自動的に登録される domain 名で、直接通信している。

インターネット経由でのアクセスの場合は Reverse Proxy や Service Router での Metrics は取得できるが、そうではない、Service 間の通信は現状 Application Layer でしか Metrics を取得できないという問題がある。

それの何が問題かというと、Microservices 単位で SLI/SLO を設定するときに困る。NewRelic などの APM でも取得できるのだが、現状 SLI/SLO は DataDog Widget で管理しているので、なんにせよ Datadog のほうで SLI の計測は行いたいのだ。

どうする

直接通信する代わりに、間に Envoy を Proxy として挟む。

Before

Service A --(http://service_b)--> Service B

After

Service A --(http://localhost:10000)-->Envoy-->(http://service_b)-->Service B

Service A の Container と Envoy の Container は同一 Pod として動いている。いわゆる Sidecar である。

具体的な設定差分はこんな感じになる。

configmap でも Deployment の env でもなんでもいいが、（最悪直接書いてもいいが）Service A の Container の環境変数に、Service B への URL をいれる。

            - name: SERVICE_B_URL
              value: localhost:10000

Envoy の config は以下のようになる。

apiVersion: v1
kind: ConfigMap
metadata:
  name: "${SERVICE_NAME}-envoy-config"
data:
  envoy.yaml: |
    admin:
      access_log_path: /dev/stdout
      address:
        socket_address: { address: 127.0.0.1, port_value: 9901 }
    static_resources:
      listeners:
      - name: listener_0
        address:
          socket_address: { address: 0.0.0.0, port_value: 10000 }
        filter_chains:
        - filters:
          - name: envoy.http_connection_manager
            config:
              stat_prefix: ingress_http
              codec_type: AUTO
              route_config:
                name: local_route
                virtual_hosts:
                - name: service1_grpc
                  domains: ["*"]
                  routes:
                  # Envoy admin endpoints
                  - match: { prefix: "/server_info" }
                    route: { cluster: envoy_admin }
                  - match: { prefix: "/stats" }
                    route: { cluster: envoy_admin }
                  # HTTP endpoint
                  - match: { prefix: "/v1/example/todo" }
                    route: { cluster: service_b_http }
                  - match: { prefix: "/v2/example/todo" }
                    route: { cluster: service_b_http }
              http_filters:
              - name: envoy.router
                config: {}
      clusters:
      - name: service_b_http
        connect_timeout: 5s
        type: STRICT_DNS
        lb_policy: ROUND_ROBIN
        dns_lookup_family: V4_ONLY
        load_assignment:
          cluster_name: service_b_http
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: service_b
                    port_value: 80
      - name: envoy_admin
        connect_timeout: 0.250s
        type: LOGICAL_DNS
        lb_policy: ROUND_ROBIN
        hosts:
        - socket_address:
            protocol: TCP
            address: 127.0.0.1
            port_value: 9901

実はこれ以外にも gRPC の Client Load Balancing としても Envoy を以前から利用していた。今回は http を proxy することで metrics を取得した。envoy の metris はこの設定のように /stats をあけておいて、annotation を設定すればよい。

docs.datadoghq.com

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: "${SERVICE_NAME}"
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: "${SERVICE_NAME}"
  template:
    metadata:
      name: "${SERVICE_NAME}"
      labels:
        app: "${SERVICE_NAME}"
        version: v0.1
      annotations:
        ad.datadoghq.com/${SERVICE_NAME}-envoy.check_names: |
          ["envoy"]
        ad.datadoghq.com/${SERVICE_NAME}-envoy.init_configs: |
          [{}]
        ad.datadoghq.com/${SERVICE_NAME}-envoy.instances: |
          [
            {
              "stats_url": "http://%%host%%:10000/stats"
            }
          ]

Envoy の config

以下のドキュメントを参考にすれば簡単に動いた。

www.envoyproxy.io

google.com へ https で proxy してる部分を http にしただけだ。

ただ、知っておくべき概念だけ簡単に復習しておく。

static_resources の対になるところは dynamic_resources だと思うが、それはまたそのうち。

Listeners

listeners (Listener) Static Listeners. These listeners are available regardless of LDS configuration.

はい。

www.envoyproxy.io

Listeners 以下の設定例はこれだが、そもそも何かというと、名前の通り、どういう基準で通信を受け付けるか、ということを記述するところである。

address のみが Required になっている。

Listener は複数設定できる。今回の例では address はこうなっている。

      listeners:
      - name: listener_0
        address:
          socket_address: { address: 0.0.0.0, port_value: 10000 }

任意の IP Address の port 10000 で受け付ける。

さらに、条件を絞るために filter_chains という設定がある。

www.envoyproxy.io

filter_chain_match でその criteria を記載する。

www.envoyproxy.io

ここの apply ordering は重要そうである。

The following order applies:

Destination port.
Destination IP address.
Server name (e.g. SNI for TLS protocol),
Transport protocol.
Application protocols (e.g. ALPN for TLS protocol).
Source type (e.g. any, local or external network).
Source IP address.
Source port.

今回はこの filter_chains ではなく、 filters を使っている。

Order matters as the filters are processed sequentially as connection events happen とある通り順番重要。

さらに typed_config に潜っていきたいが、それ自体の説明はない。

もう一度 Getting Started の config を見てみると、

listeners:
- name: listener_0
  address:
    socket_address: { address: 0.0.0.0, port_value: 10000 }
  filter_chains:
  - filters:
    - name: envoy.http_connection_manager
      typed_config:
        "@type": type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager
        stat_prefix: ingress_http
        codec_type: AUTO
        route_config:
          name: local_route
          virtual_hosts:
          - name: local_service
            domains: ["*"]
            routes:
            - match: { prefix: "/" }
              route: { host_rewrite: www.google.com, cluster: service_google }

http_connection_manager とやらを使っていて、そこの設定を使っているのだろう。

http_connection_manager とは何かというと

www.envoyproxy.io

やばい！飽きてきた。まぁこんな風に filters として使えるものがプロトコルによって異なるものを採用できる、プラガブルになっていることがわかった。

Network Filters の一覧はこのへん。

www.envoyproxy.io

例えば MySQL とか MongoDB とか Redis とかありますね。これらへの接続も Envoy Proxy 経由で行うことでいい感じにすることができそう。今は掘らない。

Clusters

Clusters は Listeners で受けたものを流す対象。Listener は Clusters のみを知っていればよく、実際の通信先は Clusters がいい感じに知っているという仕組みになっているようだ。Service Discovery を Clusters が担っている。

www.envoyproxy.io

基本的には通信を proxy する upstream やらをどうやって discovery するかを書けば良さそうだ。

もう一度 Getting Started の Example から。

clusters:
- name: service_google
  connect_timeout: 0.25s
  type: LOGICAL_DNS
  # Comment out the following line to test on v6 networks
  dns_lookup_family: V4_ONLY
  lb_policy: ROUND_ROBIN
  load_assignment:
    cluster_name: service_google
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: www.google.com
              port_value: 443
  transport_socket:
    name: envoy.transport_sockets.tls
    typed_config:
      "@type": type.googleapis.com/envoy.api.v2.auth.UpstreamTlsContext
      sni: www.google.com

ここで多分きっと重要な type を深掘りしておこう。

type (Cluster.DiscoveryType) The service discovery type to use for resolving the cluster.

Only one of type, cluster_type may be set.

service discovery type だと言っている。

さて詳細はこちらの Document にのっている。

www.envoyproxy.io

5種類の Type がある。

Static

Static is the simplest service discovery type. The configuration explicitly specifies the resolved network name (IP address/port, unix domain socket, etc.) of each upstream host.

その名の通り IP Address なり Range なりを Static に記述するタイプですね。具体的な設定例はこんな感じになりそう。

  clusters:
    - name: service_ipsum_echo_1
      connect_timeout: 0.25s
      http_protocol_options: {}
      lb_policy: ROUND_ROBIN
      type: STATIC
      hosts:
        - socket_address: {address: 172.16.221.30, port_value: 8000 }

Strict DNS

When using strict DNS service discovery, Envoy will continuously and asynchronously resolve the specified DNS targets. Each returned IP address in the DNS result will be considered an explicit host in the upstream cluster. This means that if the query returns three IP addresses, Envoy will assume the cluster has three hosts, and all three should be load balanced to. If a host is removed from the result Envoy assumes it no longer exists and will drain traffic from any existing connection pools. Note that Envoy never synchronously resolves DNS in the forwarding path. At the expense of eventual consistency, there is never a worry of blocking on a long running DNS query.

決められたタイミングで DNS 問い合わせを行い、その結果を採用する。3つの IP アドレスが解決されれば 3つホストがあると認識するし、いなくなればいなくなったと判断する。毎回の DNS Query にて service discovery を行うように読み取れる。ふむという感じ。

Logical DNS

Logical DNS uses a similar asynchronous resolution mechanism to strict DNS. However, instead of strictly taking the results of the DNS query and assuming that they comprise the entire upstream cluster, a logical DNS cluster only uses the first IP address returned when a new connection needs to be initiated. Thus, a single logical connection pool may contain physical connections to a variety of different upstream hosts. Connections are never drained. This service discovery type is optimal for large scale web services that must be accessed via DNS. Such services typically use round robin DNS to return many different IP addresses. Typically a different result is returned for each query. If strict DNS were used in this scenario, Envoy would assume that the cluster’s members were changing during every resolution interval which would lead to draining connection pools, connection cycling, etc. Instead, with logical DNS, connections stay alive until they get cycled. When interacting with large scale web services, this is the best of all possible worlds: asynchronous/eventually consistent DNS resolution, long lived connections, and zero blocking in the forwarding path.

非同期的に DNS 問い合わせを行う点は Strict DNS と同じだが、Logical DNS は新規接続時に DNS 問い合わせを行い、得た結果の最初の1つのみを採用して通信を行う。

ポイントは Connections are never drained というところで、一度 Connection を確立した場合、再度問い合わせを行ったりせずにずっとその IP アドレスを使うという点だろう。

イマイチ When interacting with large scale web services, this is the best of all possible worlds なポイントがしっくりきていない。Strict だと Member が変わった時一気に接続先が更新されて Gradually じゃない点がよくないってことなんだろうか、？

Original destination

Original destination cluster can be used when incoming connections are redirected to Envoy either via an iptables REDIRECT or TPROXY target or with Proxy Protocol. In these cases requests routed to an original destination cluster are forwarded to upstream hosts as addressed by the redirection metadata, without any explicit host configuration or upstream host discovery. Connections to upstream hosts are pooled and unused hosts are flushed out when they have been idle longer than cleanup_interval, which defaults to 5000ms. If the original destination address is not available, no upstream connection is opened. Envoy can also pickup the original destination from a HTTP header. Original destination service discovery must be used with the original destination load balancer.

when incoming connections are redirected to Envoy either via an iptables REDIRECT or TPROXY target or with Proxy Protocol

iptables REDIRECT か TPROXY target で受けた場合に機能すると言っている。のでそうなんだろう。Istio とかがやってるのはこれなのかもしれない。わからない。

Endpoint discovery service (EDS)

The endpoint discovery service is a xDS management server based on gRPC or REST-JSON API server used by Envoy to fetch cluster members. The cluster members are called “endpoint” in Envoy terminology. For each cluster, Envoy fetch the endpoints from the discovery service. EDS is the preferred service discovery mechanism for a few reasons:

xDS management server を使って dynamic に routing 先を取得したりするんだろうか。

今後

軽く公式ドキュメント追いかけながらメモしようと思ってはじめたが思いの外長くなったのでこのへんで終わる。

年末に http で通信している Service 間通信に Envoy をいれる PR を出したので年明け早々に Production に出すつもりである。

まだまだ Envoy は入門したばかりなので、今後も定期的に Official Document を見て機能を理解していきたい。具体的には、Learn Envoy をざっと見て、Envoy でできることをざっくり理解する。

www.envoyproxy.io

目下現実的に優先度高くやりたいと思っているのは Circuit Breaking 。さっさと当たり前の世界にしていきたい。あとは Dynamic Configuration もおさえておきたい。

あとは Kakku さんがやってる Try Envoy も一通りみておきたい。

kakakakakku.hatenablog.com

タイトル眺めた感じ。。。どれも見ておいたほうがよさそうだな。

最終的には全 Service 間通信に Envoy を経由するようにしたい。手元でやるには限界がくるので、そのタイミングでより上位の Service Mesh の Software を検討することになると思う。

Envoy でやれることをちゃんと理解した上で、Istio / AppMesh も動作検証していきたい。

追記

タイトルにある通り、ちゃんと Metrics はとれました。

代表的な HTTP での Availability / Latency の SLI となる Metrics がとれたので安心。

f:id:take_she12:20200105183137p:plain

ツナワタリマイライフ

日常ネタから技術ネタ、音楽ネタまで何でも書きます。

Service 間通信を Envoy 経由で行い、Metrics を取得する

はじめに

どうする

Before

After

Envoy の config

Listeners

Clusters

Static

Strict DNS

Logical DNS

Original destination

Endpoint discovery service (EDS)

今後

追記