Deploying Envoy and Kafka to collect broker-level metrics

Adam Kotwasinski
3 min read · Feb 20, 2020

Envoy 1.13 provides a Kafka broker-level filter that allows us to collect request/response metrics for a given broker.

The filter decodes the requests and responses passing through it and updates the corresponding metrics. This way we can find out how many requests a given broker received, how many responses it sent, and how much time it spent processing them.

In this short article I will cover two ways of deploying Envoy with Kafka:

  • routing all Kafka-related traffic through Envoy (including internal cluster communication),
  • routing only Kafka-client traffic through Envoy.

In the examples I will use a very simple local deployment: 2 Kafka brokers (listening on 9092 & 9093) and 1 Envoy proxy instance that proxies them on ports 19092 & 19093. To avoid a single point of failure, in a production scenario it might be preferable to have one Envoy instance per Kafka broker.

Routing all traffic through Envoy (including replication)

On the Envoy side, all we need to do is route incoming traffic to the correct broker ports. We are going to route all traffic coming to 19092 to 9092, and all traffic coming to 19093 to 9093.

Kafka clients that want to connect to the cluster should use bootstrap.servers=localhost:19092,localhost:19093 to establish initial connections.

configuration for Envoy
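The routing described above could look roughly like the sketch below. It is an illustration, not the original configuration: the listener and cluster names are hypothetical, the stat_prefix values are chosen so that metric names come out as kafka.broker1.* and kafka.broker2.*, and the "@type" names follow the current (v3) Envoy documentation, so an older release such as 1.13 may need the earlier type names instead.

```yaml
# Sketch only: names are hypothetical, API type names assume the v3 Envoy API.
static_resources:
  listeners:
  - name: broker1_listener
    address:
      socket_address: { address: 127.0.0.1, port_value: 19092 }
    filter_chains:
    - filters:
      # The Kafka filter only observes traffic and updates metrics,
      # so it is placed before the tcp_proxy that forwards the bytes.
      - name: envoy.filters.network.kafka_broker
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.kafka_broker.v3.KafkaBroker
          stat_prefix: broker1
      - name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: tcp_broker1
          cluster: broker1_cluster
  - name: broker2_listener
    address:
      socket_address: { address: 127.0.0.1, port_value: 19093 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.kafka_broker
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.kafka_broker.v3.KafkaBroker
          stat_prefix: broker2
      - name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: tcp_broker2
          cluster: broker2_cluster
  clusters:
  # One single-endpoint cluster per broker, so 19092 always maps to 9092
  # and 19093 always maps to 9093.
  - name: broker1_cluster
    connect_timeout: 0.25s
    type: STATIC
    load_assignment:
      cluster_name: broker1_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: 127.0.0.1, port_value: 9092 }
  - name: broker2_cluster
    connect_timeout: 0.25s
    type: STATIC
    load_assignment:
      cluster_name: broker2_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: 127.0.0.1, port_value: 9093 }
```

Each broker gets its own listener and its own cluster rather than one load-balanced cluster, because a Kafka client must reach the specific broker it asked for.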

The server configuration is really simple. The only trick is that our Kafka servers need to advertise Envoy’s ports (advertised.listeners). As this is the only address that gets published to ZooKeeper, everyone (both Kafka clients and other Kafka brokers) will need to go through Envoy to reach our broker.

configuration for Kafka broker 1
configuration for Kafka broker 2
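As a minimal sketch of what such a broker configuration might contain, assuming plaintext listeners: the broker binds to its own port but advertises Envoy's port. The broker.id, log.dirs and zookeeper.connect values are assumptions made for illustration and do not come from the article.

```properties
# server.properties for broker 1 (sketch)
broker.id=1
# Port the broker process actually binds to.
listeners=PLAINTEXT://127.0.0.1:9092
# Address published to ZooKeeper: Envoy's listener, not the broker's own port.
advertised.listeners=PLAINTEXT://127.0.0.1:19092
# Assumed values, not taken from the article:
log.dirs=/tmp/kafka-logs-1
zookeeper.connect=127.0.0.1:2181
```

Broker 2 would follow the same pattern with its own broker.id and log.dirs, listening on 9093 and advertising 19093.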

When we get everything running, we can see that some internal traffic has been collected by our metrics-collecting filter:

kafka.broker1.request.update_metadata_request: 2
kafka.broker1.response.update_metadata_response: 2
kafka.broker2.request.update_metadata_request: 1
kafka.broker2.response.update_metadata_response: 1

After we create a topic (e.g. via the kafka-topics tool), we can see that replication is happening: the fetch_request/fetch_response metrics increase even if there are no clients. This is caused by the broker replication threads, which periodically send fetch requests to partition leaders.

Routing only client traffic through Envoy

As an alternative, we might want only Kafka client traffic to be routed through Envoy, while internal cluster communication happens without any proxying.

Our Envoy configuration does not change: the clients still reach out to ports 19092 & 19093, and they are routed as before.

On the broker side, we need to make our brokers listen on a new “replication” port (we will use 9192 & 9193), and make sure the brokers know which listener they should use for replication (with inter.broker.listener.name):

configuration for Kafka broker 1
configuration for Kafka broker 2
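A sketch of how the two listeners might be declared is shown below. The listener names CLIENT and REPLICATION are hypothetical (any names work as long as they are mapped to a security protocol), and the broker.id, log.dirs and zookeeper.connect values are assumptions as well.

```properties
# server.properties for broker 1 (sketch; listener names are hypothetical)
broker.id=1
# Two listeners: one for clients (proxied by Envoy), one for other brokers.
listeners=CLIENT://127.0.0.1:9092,REPLICATION://127.0.0.1:9192
# Clients are told to use Envoy's port; brokers are told the direct port.
advertised.listeners=CLIENT://127.0.0.1:19092,REPLICATION://127.0.0.1:9192
# Custom listener names must be mapped to a security protocol.
listener.security.protocol.map=CLIENT:PLAINTEXT,REPLICATION:PLAINTEXT
# Inter-broker traffic (including replication) uses the direct listener.
inter.broker.listener.name=REPLICATION
# Assumed values, not taken from the article:
log.dirs=/tmp/kafka-logs-1
zookeeper.connect=127.0.0.1:2181
```

Broker 2 would use 9093 and 9193 for its listeners and advertise 19093 to clients.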

After starting all the services, but before starting any clients, we can see that the Kafka metrics are not present (as Envoy is not involved in the inter-broker traffic).

Only after we connect to Kafka with some kind of client can we see something happening (let’s use kafka-console-consumer):

# create topic 'mytesttopic' and make sure it is replicated:
bin/kafka-topics.sh \
--bootstrap-server localhost:19092,localhost:19093 \
--create \
--topic mytesttopic \
--replication-factor 2 \
--partitions 20
# start the consumer:
bin/kafka-console-consumer.sh \
--bootstrap-server localhost:19092,localhost:19093 \
--topic mytesttopic

When the consumer is active (which internally means sending FetchRequest objects in a loop), we can see the metrics increase in Envoy:

# caused by `kafka-topics` invocation
kafka.broker1.request.create_topics_request: 1
# constantly increasing while consumer is alive
kafka.broker1.request.fetch_request: 107
kafka.broker1.response.fetch_response: 107
kafka.broker2.request.fetch_request: 107
kafka.broker2.response.fetch_response: 107

Once we stop the consumer, the fetch_* metrics stop growing.
