What is Istio?
Istio is an open-source service mesh that runs on Kubernetes to help you connect, monitor, and secure microservices. With Istio, you can move a microservice to a different server or cluster without rewriting your application. However, this flexibility comes at a cost: Istio injects an Envoy sidecar proxy alongside each service instance, so there are many more moving parts to manage and monitor in your network. In this post, you'll learn how to monitor the performance of your Istio service mesh with Prometheus and New Relic, making it easier to find and fix issues when something goes wrong.
Istio uses Prometheus to report data from your service mesh. If you're not familiar with Prometheus yet, check out How to monitor with Prometheus. Prometheus is a powerful open source monitoring tool, but it can be difficult to scale and analyze your data. You can overcome those challenges by sending your Prometheus metrics to New Relic, and our Istio quickstart makes it simple to monitor the performance of Istio using Prometheus, including:
- Communication between your Istio proxies and services.
- Istio Envoy upstream requests, including identifying services with high and low latencies and 5xx response codes.
- Istio Envoy downstream requests, including 5xx and 404 requests as well as the average performance of your microservices.
- Overall health of your service mesh, including total requests and the total amount of raw data being transmitted.
Key metrics for Istio monitoring
Prometheus collects various metrics to provide insights into the performance and behavior of your microservices. Here are some key metrics for Istio monitoring:
istio_requests_total
This metric provides insight into the overall traffic volume handled by the Istio proxy. Monitoring the total number of requests helps you understand the scale of interactions between your microservices.
istio_request_duration_milliseconds
Understanding the duration of requests is essential for assessing the latency and performance of your services. This metric allows you to identify potential bottlenecks and optimize the response time of your microservices.
istio_upstream_rq_5xx
Monitoring the number of 5xx responses (server errors) is critical for identifying issues and potential service outages. This metric helps you quickly detect and respond to service failures.
istio_requests_in_flight
Keeping track of the number of requests in flight provides real-time information about the current load on your services. Monitoring this metric helps prevent overloading and ensures the stability of your microservices.
istio_destination_service_request_total
This metric allows you to analyze the traffic patterns and workload distribution among different destination services. Understanding how requests are distributed across services is crucial for optimizing resource allocation and maintaining a balanced system.
istio_tcp_connections_opened_total
Monitoring TCP connections is vital for understanding the network-level interactions between services. This metric helps identify potential issues related to network connectivity and resource utilization.
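For example, once this metric is flowing into New Relic, a minimal NRQL sketch along these lines charts the rate of new TCP connections between workloads (assuming the metric carries Istio's standard source and destination labels):
FROM Metric SELECT rate(sum(istio_tcp_connections_opened_total), 1 minute) AS 'TCP connections opened/min' FACET source_workload, destination_workload TIMESERIES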
Monitoring Istio's Prometheus endpoints
Istio uses Envoy, a high-performance service proxy, to handle inbound and outbound traffic through a service mesh. Istio’s Envoy proxies automatically collect and report detailed metrics that provide high-level application information (since they are reported for every service proxy) via a Prometheus endpoint.
To query data from Istio’s Prometheus endpoints, you need to stand up a Prometheus server. However, as you scale your application, it becomes harder to maintain Prometheus infrastructure. With New Relic’s Prometheus OpenMetrics integration, you no longer need to maintain your own Prometheus servers. When you install the integration via New Relic’s Kubernetes agent, it automatically detects Istio’s native Prometheus endpoints and imports the data into New Relic.
If you already have a Prometheus server running, you can use our remote-write integration to forward your data to New Relic.
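Either way, a quick way to confirm that Istio data is arriving is to list the Istio metric names New Relic has received. A simple NRQL sketch for that check might look like this:
FROM Metric SELECT uniques(metricName) WHERE metricName LIKE 'istio_%' SINCE 30 minutes ago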
Monitoring service communication between proxies
Istio provides the following Prometheus metrics for monitoring communication between proxies.
- The counter metric istio_requests_total measures the total number of requests handled by an Istio proxy.
- The distribution metric istio_request_duration_milliseconds measures the time it takes for the Istio proxy to process HTTP requests.
- The distribution metrics istio_request_bytes and istio_response_bytes measure HTTP request and response body sizes.
These metrics provide information on behaviors such as the overall volume of traffic being handled by the service mesh, the error rates within that traffic, and the response times for requests. The overview page of the Istio quickstart summarizes the service mesh's throughput, latency, traffic volume, and response codes for all requests.
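For instance, a simple NRQL sketch like the following charts mesh-wide request throughput broken out by response code, similar to what the quickstart's overview widgets show:
FROM Metric SELECT rate(sum(istio_requests_total), 1 minute) AS 'Requests/min' FACET response_code TIMESERIES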
Using Data explorer in New Relic, you can view the attributes of istio_requests_total and other metrics from Istio's Prometheus endpoint.
As you'll see in the next section, attributes of istio_requests_total such as source_workload and reporter can help you pinpoint issues.
Debugging the service mesh
When you're trying to pinpoint communication errors in the service mesh, you have to look at the requests and responses of your microservices. As the image below shows, separating the two directions of communication lets you see the entire flow of traffic through the requests and responses of each microservice: the frontend, the API gateway, and the backend inventory service.
In this example, we see a request from an application frontend to an API gateway, which then makes another request to an inventory service. The inventory service responds with an error, which passes through the API gateway back to the frontend. To trace the root of the frontend error (such as a 503 error), you need to trace the downstream requests by looking at the source_workload attribute of istio_requests_total for each microservice returning the error response. Then, you can also trace the upstream requests to find the inputs to each of the services and perform root cause analysis.
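As an illustration, a query along these lines (a sketch that reuses the patterns from the quickstart dashboards) facets server errors by both source_workload and destination_workload, so you can follow an error response back through each hop in the mesh:
FROM Metric SELECT filter(rate(sum(istio_requests_total), 1 minute), WHERE response_code LIKE '5%') AS 'Errors/min' FACET source_workload, destination_workload TIMESERIES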
With this context, let’s look at which metrics from upstream and downstream requests will best help you perform root cause analysis and find areas for optimization.
Monitoring Istio Envoy upstream requests
You can measure all requests inbound to the Istio Envoy proxies in the service mesh by filtering istio_requests_total to those with the attribute reporter set to destination.
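For example, a minimal NRQL sketch for inbound request volume per service could look like this (destination_canonical_service is one of the standard Istio labels used elsewhere in the quickstart):
FROM Metric SELECT rate(sum(istio_requests_total), 1 minute) AS 'Inbound req/min' WHERE reporter = 'destination' FACET destination_canonical_service TIMESERIES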
The histogram visualization of istio_request_duration_milliseconds in the next image is useful for troubleshooting issues and optimizing the performance of particular services.
Here's the NRQL query for the histogram:
FROM Metric SELECT histogram(istio_request_duration_milliseconds_bucket, 1000) WHERE reporter = 'destination' FACET source_canonical_service
Unexpectedly low latencies (suspiciously fast responses) can indicate an issue with the source service, which could be returning error messages or failing to fetch the requested data even while returning 200 response codes. You can also use the histogram to identify services with abnormally high latencies and optimize their performance.
The next image shows inbound errors faceted by service, which lets you see which services are returning HTTP responses with 5xx status codes.
Here's the NRQL query that generates the visualization:
FROM Metric SELECT (filter(rate(sum(istio_requests_total), 1 minute), WHERE response_code LIKE '5%')) AS 'Errors' WHERE reporter = 'destination' FACET source_workload TIMESERIES
Monitoring Istio Envoy downstream requests
You can measure all requests outbound from the Istio Envoy proxies in the service mesh by filtering istio_requests_total to those with the attribute reporter set to source.
The next image shows a time series visualization of outbound requests faceted by source and response code. This visualization shows the most common responses from your microservices. If there is a spike in 5xx or 404 responses, you can quickly pinpoint which microservice is the culprit.
Here's the NRQL query that generates the previous visualization:
FROM Metric SELECT rate(sum(istio_requests_total), 1 SECOND) AS 'Req/Sec' WHERE reporter = 'source' FACET destination_canonical_service, response_code TIMESERIES
The next image is a visualization of the client request duration, which gives you a bird's-eye view of the average performance of all your microservices.
Here's the NRQL query for this histogram:
FROM Metric SELECT histogram(istio_request_duration_milliseconds_bucket, 1000) AS 'Requests' WHERE reporter = 'source'
Filtering by destination service or source service
If you want to analyze metrics for a particular service, use the filter bar at the top of the quickstart dashboard to specify the source_workload or destination_workload attribute, as shown in the next image.
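If you prefer to filter in NRQL directly, the equivalent is a WHERE clause on the same attributes. For example, this sketch narrows the downstream request rate to the inventory service from the earlier debugging example (the workload name inventory-service is illustrative; substitute your own):
FROM Metric SELECT rate(sum(istio_requests_total), 1 SECOND) AS 'Req/Sec' WHERE reporter = 'source' AND destination_workload = 'inventory-service' FACET response_code TIMESERIES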
Monitoring the ingress gateway (service mesh)
With Istio, you manage your inbound traffic with a gateway: a set of Envoy proxies that act as load balancers for ingress traffic, which is traffic coming from an external public network into your private network. With the gateway acting as the central load balancer, you can take advantage of advanced routing configurations such as traffic splitting, redirects, and retry logic.
Istio gateways report metrics such as latency, throughput, and error rate, just like sidecar proxies do. On the Ingress Gateways tab of the Istio quickstart, you can visualize total requests.
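For example, a query along these lines charts gateway request volume by response code (this sketch assumes the gateway pods carry the label.app = 'istio-ingressgateway' label, as in the byte-count query below):
FROM Metric SELECT rate(sum(istio_requests_total), 1 minute) AS 'Gateway req/min' WHERE label.app = 'istio-ingressgateway' FACET response_code TIMESERIES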
With Istio, you get raw data directly from Envoy. By measuring envoy_cluster_upstream_cx_rx_bytes_total for the ingress gateway, faceted by cluster and namespace, you can see the total connection bytes received for each cluster and namespace. Here's the NRQL query:
FROM Metric SELECT average(envoy_cluster_upstream_cx_rx_bytes_total) WHERE label.app='istio-ingressgateway' FACET clusterName, namespaceName TIMESERIES AUTO
The next image shows a visualization of this query.
Istio monitoring best practices
Monitoring Istio effectively involves implementing best practices to ensure comprehensive observability and proactive issue identification. Here are some Istio monitoring best practices:
Define monitoring objectives
Clearly define your monitoring objectives and key performance indicators (KPIs) based on the specific requirements of your microservices architecture. Understand the critical aspects such as latency, error rates, and overall system health.
Establish service level objectives (SLOs)
Define SLOs for critical services and set up alerts based on these objectives. This proactive approach allows you to identify and address potential issues before they impact the user experience.
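For example, a success-rate service level indicator for each service can be sketched in NRQL along these lines (treating any non-5xx response as a success; adjust the definition to match your own SLOs):
FROM Metric SELECT filter(rate(sum(istio_requests_total), 1 minute), WHERE response_code NOT LIKE '5%') / rate(sum(istio_requests_total), 1 minute) * 100 AS 'Success rate (%)' WHERE reporter = 'destination' FACET destination_canonical_service TIMESERIES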
Monitor key metrics
Focus on key Istio metrics such as request rates, success rates, latency, error rates, and resource utilization. Regularly review and adjust your monitoring strategy based on the evolving needs of your applications.
Implement alerts and notifications
Set up alerts and notifications for critical metrics to receive timely notifications when certain thresholds are breached. This allows for proactive issue resolution and reduces downtime.
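As one example, a query like the following sketch could serve as the signal for a NRQL alert condition that fires when a service's server errors exceed a threshold you choose:
FROM Metric SELECT filter(rate(sum(istio_requests_total), 1 minute), WHERE response_code LIKE '5%') AS 'Server errors/min' WHERE reporter = 'destination' FACET destination_canonical_service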
Monitor traffic routing and load balancing
Keep a close eye on Istio's traffic routing and load balancing features. Monitoring these aspects helps ensure that traffic is distributed evenly and that any changes to routing policies do not negatively impact the system.
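For instance, faceting request rate by destination workload and version shows whether a traffic split is sending the intended share of requests to each version (this sketch assumes the destination_version label from Istio's standard metrics is populated):
FROM Metric SELECT rate(sum(istio_requests_total), 1 minute) WHERE reporter = 'source' FACET destination_workload, destination_version TIMESERIES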
Regularly review and tune configurations
Periodically review and tune Istio configurations, especially as your microservices architecture evolves. Ensure that the configurations align with the changing requirements of your applications.
Achieving full-scale Istio observability with New Relic
Install New Relic's Istio quickstart and start visualizing Istio's Prometheus data. You can send us your Prometheus data in either of two ways:
- Are you tired of storing, maintaining, and scaling your Prometheus data? Try New Relic’s Prometheus OpenMetrics Integration, which automatically scrapes, stores, and scales your data.
- Already have a Prometheus server and want to send data to New Relic? Try New Relic’s Prometheus Remote Write integration.
If you don't already have a free New Relic account, sign up now. Let New Relic manage your Prometheus data while you focus on innovation. Your free account includes 100 GB/month of free data ingest, one full-platform user, and unlimited basic users.