Istio Observability Series (1/2): Golden Signals for the Data Plane — HTTP, TLS, and gRPC
This is Part 1 of 2 in the Istio Observability series. This post covers the data plane (Envoy sidecars). Part 2 covers the control plane (istiod).
Why Golden Signals for the Data Plane?
Google’s SRE book defined the four golden signals — latency, traffic, errors, and saturation — as the minimum set of metrics you need to understand the health of any system.
In Istio’s data plane, every request between services passes through Envoy sidecars that emit rich telemetry. But the signals aren’t one-size-fits-all — different protocols expose different dimensions:
- HTTP gives you status codes, request duration, and body sizes
- TLS gives you handshake metrics, certificate errors, and session reuse
- gRPC has its own status code system that hides behind HTTP 200, plus streaming semantics that break traditional RPS metrics
This post breaks down all four golden signals for each protocol, with the exact Prometheus metrics, PromQL queries, and alert rules you need for production.
HTTP Protocol
HTTP traffic is the most common in a service mesh. Envoy emits detailed L7 metrics for every HTTP request/response.
Golden Signal 1: Latency
What it measures: How long it takes for the upstream service to respond to an HTTP request, as observed by the sidecar.
Key metrics:
| Metric | Type | Description |
|---|---|---|
istio_request_duration_milliseconds | Histogram | Request duration from the Envoy sidecar’s perspective |
envoy_cluster_upstream_rq_time | Histogram | Time from Envoy sending the request upstream to receiving the first byte of the response |
PromQL queries:
# P50 latency for a specific service (server-side, inbound)
histogram_quantile(0.50,
sum(rate(istio_request_duration_milliseconds_bucket{
reporter="destination",
destination_service_name="my-service",
request_protocol="http"
}[5m])) by (le)
)
# P99 latency (the tail that catches slowdowns)
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{
reporter="destination",
destination_service_name="my-service",
request_protocol="http"
}[5m])) by (le)
)
# P99 latency broken down by source → destination
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{
reporter="destination",
request_protocol="http"
}[5m])) by (le, source_workload, destination_workload)
)What to alert on:
# Alert: P99 latency exceeds 500ms for 5 minutes
- alert: HighP99Latency
expr: |
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{
reporter="destination",
request_protocol="http"
}[5m])) by (le, destination_service_name)
) > 500
for: 5m
labels:
severity: warning
annotations:
summary: "High P99 latency on {{ $labels.destination_service_name }}"Important nuance: Istio reports latency from two perspectives:
reporter="source"— measured by the client-side sidecar (includes network latency + server processing)reporter="destination"— measured by the server-side sidecar (mostly server processing time)
The difference between source and destination latency tells you how much time is spent in the network (including mTLS handshake overhead).
Golden Signal 2: Traffic (Throughput)
What it measures: The volume of HTTP requests flowing through the mesh.
Key metrics:
| Metric | Type | Description |
|---|---|---|
istio_requests_total | Counter | Total count of requests, labeled by source, destination, response code, method |
istio_request_bytes | Histogram | Request body size in bytes |
istio_response_bytes | Histogram | Response body size in bytes |
PromQL queries:
# Requests per second to a service
sum(rate(istio_requests_total{
reporter="destination",
destination_service_name="my-service",
request_protocol="http"
}[5m]))
# RPS broken down by HTTP method
sum(rate(istio_requests_total{
reporter="destination",
destination_service_name="my-service",
request_protocol="http"
}[5m])) by (request_method)
# Bandwidth: bytes/sec received by a service
sum(rate(istio_request_bytes_sum{
reporter="destination",
destination_service_name="my-service"
}[5m]))
# Top 10 busiest service-to-service edges
topk(10,
sum(rate(istio_requests_total{reporter="source"}[5m]))
by (source_workload, destination_workload)
)What to alert on:
# Alert: Traffic drop > 50% compared to same time last week
- alert: TrafficDrop
expr: |
sum(rate(istio_requests_total{
reporter="destination",
destination_service_name="my-service"
}[5m]))
/
sum(rate(istio_requests_total{
reporter="destination",
destination_service_name="my-service"
}[5m] offset 7d))
< 0.5
for: 10m
labels:
severity: critical
annotations:
summary: "Traffic to {{ $labels.destination_service_name }} dropped >50% vs last week"Golden Signal 3: Errors
What it measures: The rate of failed HTTP requests — 4xx client errors and 5xx server errors.
Key metrics:
| Metric | Type | Description |
|---|---|---|
istio_requests_total | Counter | Labeled with response_code (200, 404, 500, etc.) and response_flags |
envoy_cluster_upstream_rq_xx | Counter | Aggregated by response class (2xx, 4xx, 5xx) |
PromQL queries:
# Error rate (5xx only) as a percentage
sum(rate(istio_requests_total{
reporter="destination",
destination_service_name="my-service",
response_code=~"5.*",
request_protocol="http"
}[5m]))
/
sum(rate(istio_requests_total{
reporter="destination",
destination_service_name="my-service",
request_protocol="http"
}[5m])) * 100
# Error rate broken down by response code
sum(rate(istio_requests_total{
reporter="destination",
destination_service_name="my-service",
response_code=~"[45].*"
}[5m])) by (response_code)
# Errors caused by Envoy itself (not the upstream)
# response_flags tells you WHY a request failed
sum(rate(istio_requests_total{
reporter="destination",
destination_service_name="my-service",
response_flags!=""
}[5m])) by (response_flags)Envoy response flags you should know:
| Flag | Meaning | Indicates |
|---|---|---|
UH | No healthy upstream | All endpoints are unhealthy |
UF | Upstream connection failure | TCP connect failed |
UT | Upstream request timeout | Request exceeded timeout |
UC | Upstream connection termination | Connection reset mid-request |
NR | No route configured | Missing VirtualService or routing rule |
URX | Upstream retry limit exceeded | All retries exhausted |
DC | Downstream connection termination | Client disconnected |
RL | Rate limited | Local or global rate limit hit |
What to alert on:
# Alert: 5xx error rate exceeds 1% for 5 minutes
- alert: HighErrorRate
expr: |
(
sum(rate(istio_requests_total{
reporter="destination",
response_code=~"5.*",
request_protocol="http"
}[5m])) by (destination_service_name)
/
sum(rate(istio_requests_total{
reporter="destination",
request_protocol="http"
}[5m])) by (destination_service_name)
) > 0.01
for: 5m
labels:
severity: critical
annotations:
summary: "5xx error rate >1% on {{ $labels.destination_service_name }}"Golden Signal 4: Saturation
What it measures: How close the data plane is to its resource limits.
Key metrics:
| Metric | Type | Description |
|---|---|---|
envoy_cluster_upstream_cx_active | Gauge | Active connections to upstream |
envoy_cluster_upstream_cx_overflow | Counter | Connection pool overflows (circuit breaker tripped) |
envoy_cluster_upstream_rq_pending_active | Gauge | Requests waiting for a connection |
envoy_cluster_upstream_rq_pending_overflow | Counter | Requests rejected due to pending limit |
envoy_server_memory_allocated | Gauge | Envoy process memory usage |
envoy_server_concurrency | Gauge | Number of worker threads |
PromQL queries:
# Connection pool utilization per cluster
envoy_cluster_upstream_cx_active{
cluster_name="outbound|8080||my-service.default.svc.cluster.local"
}
# Circuit breaker trips (connection overflow)
sum(rate(envoy_cluster_upstream_cx_overflow{
cluster_name=~"outbound.*my-service.*"
}[5m]))
# Pending request queue depth (requests waiting for a connection)
envoy_cluster_upstream_rq_pending_active{
cluster_name=~"outbound.*my-service.*"
}
# Envoy memory usage per pod
envoy_server_memory_allocated / 1024 / 1024 # in MBWhat to alert on:
# Alert: Circuit breaker is tripping (connection overflow)
- alert: CircuitBreakerTripping
expr: |
sum(rate(envoy_cluster_upstream_cx_overflow[5m]))
by (pod, cluster_name) > 0
for: 2m
labels:
severity: warning
annotations:
summary: "Circuit breaker overflow on {{ $labels.pod }} for {{ $labels.cluster_name }}"
# Alert: Envoy memory usage exceeds 200MB
- alert: EnvoyHighMemory
expr: envoy_server_memory_allocated > 200 * 1024 * 1024
for: 10m
labels:
severity: warningHTTPS / TLS Protocol
When traffic is encrypted (either mTLS between sidecars or TLS origination to external services), additional metrics become critical. TLS failures are one of the most common issues in a service mesh, and without the right signals they are invisible.
Golden Signal 1: Latency
TLS adds handshake overhead on top of HTTP latency. You need to measure both.
Key metrics:
| Metric | Type | Description |
|---|---|---|
istio_request_duration_milliseconds | Histogram | Same as HTTP — includes TLS termination time |
envoy_listener_ssl_handshake_duration | Histogram | Time spent in the TLS handshake alone |
envoy_cluster_ssl_handshake | Counter | Total TLS handshakes completed to upstream |
PromQL queries:
# mTLS handshake overhead: compare source vs destination latency
# The delta approximates network + TLS handshake time
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{
reporter="source",
destination_service_name="my-service"
}[5m])) by (le)
)
-
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{
reporter="destination",
destination_service_name="my-service"
}[5m])) by (le)
)What to watch for: If the source-destination latency gap is consistently >5ms, you may have TLS configuration issues (cipher negotiation, certificate chain validation), or the sidecar is doing too many new handshakes (check connection reuse).
Golden Signal 2: Traffic
TLS traffic volume is best tracked through connection-level metrics since TLS operates at the connection layer.
Key metrics:
| Metric | Type | Description |
|---|---|---|
envoy_cluster_ssl_connection_error | Counter | TLS connection failures |
envoy_listener_ssl_connection_error | Counter | Inbound TLS connection failures |
envoy_cluster_upstream_cx_total | Counter | Total connections (TLS and non-TLS) |
PromQL queries:
# New TLS connections per second (high rate = poor connection reuse)
sum(rate(envoy_cluster_ssl_handshake[5m])) by (pod)
# TLS connection reuse ratio
# Low ratio means many new handshakes (expensive)
1 - (
rate(envoy_cluster_ssl_handshake[5m])
/
rate(istio_requests_total{reporter="source"}[5m])
)Golden Signal 3: Errors
TLS errors are distinct from HTTP errors — they happen before any HTTP exchange occurs.
Key metrics:
| Metric | Type | Description |
|---|---|---|
envoy_cluster_ssl_connection_error | Counter | Outbound TLS handshake failures |
envoy_listener_ssl_connection_error | Counter | Inbound TLS handshake failures |
envoy_cluster_ssl_fail_verify_san | Counter | SAN verification failures (certificate doesn’t match expected identity) |
envoy_cluster_ssl_fail_verify_cert_hash | Counter | Certificate hash mismatch |
envoy_cluster_ssl_fail_verify_no_cert | Counter | Peer didn’t present a certificate (mTLS enforcement) |
istio_requests_total{response_flags="UF"} | Counter | Upstream connection failures (often TLS-related) |
PromQL queries:
# TLS handshake failure rate
sum(rate(envoy_cluster_ssl_connection_error[5m])) by (pod)
# SAN verification failures (wrong identity / cert mismatch)
sum(rate(envoy_cluster_ssl_fail_verify_san[5m])) by (pod)
# Requests failing due to upstream connection failure (often TLS)
sum(rate(istio_requests_total{
response_flags="UF",
reporter="source"
}[5m])) by (source_workload, destination_workload)
# No-cert failures (client didn't present mTLS cert)
sum(rate(envoy_cluster_ssl_fail_verify_no_cert[5m])) by (pod)What to alert on:
# Alert: TLS handshake failures are occurring
- alert: TLSHandshakeFailures
expr: |
sum(rate(envoy_cluster_ssl_connection_error[5m])) by (pod) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "TLS handshake failures on {{ $labels.pod }}"
# Alert: SAN verification failures (certificate identity mismatch)
- alert: TLSSANVerificationFailure
expr: |
sum(rate(envoy_cluster_ssl_fail_verify_san[5m])) by (pod) > 0
for: 2m
labels:
severity: critical
annotations:
summary: "TLS SAN verification failing on {{ $labels.pod }} — check certificate identities"Golden Signal 4: Saturation
TLS-specific saturation relates to certificate management and connection overhead.
Key metrics:
| Metric | Type | Description |
|---|---|---|
envoy_server_ssl_context_session_cache_hits | Counter | TLS session cache hits (connection reuse) |
envoy_server_ssl_context_session_cache_misses | Counter | TLS session cache misses (new handshakes) |
PromQL queries:
# TLS session cache hit ratio (higher = better, less handshake overhead)
sum(rate(envoy_server_ssl_context_session_cache_hits[5m])) by (pod)
/
(
sum(rate(envoy_server_ssl_context_session_cache_hits[5m])) by (pod)
+
sum(rate(envoy_server_ssl_context_session_cache_misses[5m])) by (pod)
)gRPC Protocol
gRPC uses HTTP/2 under the hood, so it shares many HTTP metrics. But gRPC has its own status code system and streaming semantics that require dedicated signals.
Golden Signal 1: Latency
Key metrics:
| Metric | Type | Description |
|---|---|---|
istio_request_duration_milliseconds | Histogram | Same metric, filtered by request_protocol="grpc" |
envoy_cluster_grpc_upstream_rq_time | Histogram | gRPC-specific upstream request time |
PromQL queries:
# P99 latency for gRPC services
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{
reporter="destination",
destination_service_name="my-grpc-service",
request_protocol="grpc"
}[5m])) by (le)
)
# Latency by gRPC method (requires method to be in labels)
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{
reporter="destination",
request_protocol="grpc"
}[5m])) by (le, destination_service_name, request_path)
)gRPC latency nuance: For unary RPCs, latency is straightforward — time from request to response. For streaming RPCs, istio_request_duration_milliseconds measures the total stream duration, which can be minutes or hours for long-lived streams. You need to interpret this differently:
# For streaming gRPC, track stream duration separately
# Long durations are NORMAL for streams — don't alert on these
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{
reporter="destination",
request_protocol="grpc",
grpc_response_status="0" # only successful streams
}[5m])) by (le, destination_service_name)
)Golden Signal 2: Traffic
Key metrics:
| Metric | Type | Description |
|---|---|---|
istio_requests_total | Counter | Request count with request_protocol="grpc" |
envoy_cluster_grpc_total | Counter | Total gRPC requests by cluster |
istio_request_messages_total | Counter | Total gRPC messages sent (for streaming) |
istio_response_messages_total | Counter | Total gRPC messages received (for streaming) |
PromQL queries:
# gRPC RPS by service
sum(rate(istio_requests_total{
reporter="destination",
request_protocol="grpc"
}[5m])) by (destination_service_name)
# gRPC message rate for streaming services
# (messages per second, not requests per second)
sum(rate(istio_request_messages_total{
reporter="destination",
destination_service_name="my-streaming-service"
}[5m]))
# gRPC traffic by method path
sum(rate(istio_requests_total{
reporter="destination",
request_protocol="grpc"
}[5m])) by (destination_service_name, request_path)Why message rate matters for gRPC: A single gRPC stream can carry thousands of messages. If you only track istio_requests_total, you’ll see 1 RPS for a stream that’s processing 10,000 messages/sec. istio_request_messages_total and istio_response_messages_total give the true throughput.
Golden Signal 3: Errors
gRPC has its own status code system that is independent of HTTP status codes. A gRPC error returns HTTP 200 with a gRPC status in the grpc-status trailer.
gRPC status codes you must monitor:
| Code | Name | Meaning |
|---|---|---|
| 0 | OK | Success |
| 1 | CANCELLED | Client cancelled the request |
| 2 | UNKNOWN | Unknown error (often a panic in the server) |
| 4 | DEADLINE_EXCEEDED | Timeout |
| 5 | NOT_FOUND | Resource not found |
| 7 | PERMISSION_DENIED | Auth failed |
| 8 | RESOURCE_EXHAUSTED | Rate limited or quota exceeded |
| 13 | INTERNAL | Server bug |
| 14 | UNAVAILABLE | Service temporarily unavailable (retryable) |
Key metrics:
| Metric | Type | Description |
|---|---|---|
istio_requests_total | Counter | Labeled with grpc_response_status (0, 1, 2, …) |
envoy_cluster_grpc_request_message_count | Counter | gRPC messages sent |
envoy_cluster_grpc_response_message_count | Counter | gRPC messages received |
PromQL queries:
# gRPC error rate (all non-OK status codes)
sum(rate(istio_requests_total{
reporter="destination",
request_protocol="grpc",
grpc_response_status!="0"
}[5m])) by (destination_service_name)
/
sum(rate(istio_requests_total{
reporter="destination",
request_protocol="grpc"
}[5m])) by (destination_service_name) * 100
# gRPC errors broken down by status code
sum(rate(istio_requests_total{
reporter="destination",
request_protocol="grpc",
grpc_response_status!="0"
}[5m])) by (destination_service_name, grpc_response_status)
# DEADLINE_EXCEEDED specifically (timeout issues)
sum(rate(istio_requests_total{
reporter="destination",
request_protocol="grpc",
grpc_response_status="4"
}[5m])) by (source_workload, destination_workload)
# UNAVAILABLE errors (service overloaded or crashing)
sum(rate(istio_requests_total{
reporter="destination",
request_protocol="grpc",
grpc_response_status="14"
}[5m])) by (destination_service_name)The HTTP 200 trap: A common mistake is alerting only on HTTP 5xx errors for gRPC services. gRPC errors come back as HTTP 200 with a non-zero grpc-status trailer. If your monitoring only looks at response_code, you’ll miss all gRPC errors. Always use grpc_response_status for gRPC services.
What to alert on:
# Alert: gRPC error rate exceeds 5%
- alert: GRPCHighErrorRate
expr: |
(
sum(rate(istio_requests_total{
reporter="destination",
request_protocol="grpc",
grpc_response_status!="0",
grpc_response_status!="1"
}[5m])) by (destination_service_name)
/
sum(rate(istio_requests_total{
reporter="destination",
request_protocol="grpc"
}[5m])) by (destination_service_name)
) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "gRPC error rate >5% on {{ $labels.destination_service_name }}"
# Alert: DEADLINE_EXCEEDED spike (timeout issues)
- alert: GRPCDeadlineExceeded
expr: |
sum(rate(istio_requests_total{
reporter="destination",
request_protocol="grpc",
grpc_response_status="4"
}[5m])) by (destination_service_name) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "gRPC DEADLINE_EXCEEDED errors on {{ $labels.destination_service_name }}"Note: We exclude status code 1 (CANCELLED) from the error rate because client-initiated cancellations are normal behavior, not server errors.
Golden Signal 4: Saturation
gRPC uses HTTP/2 multiplexing, so connection-level saturation looks different from HTTP/1.1.
Key metrics:
| Metric | Type | Description |
|---|---|---|
envoy_cluster_upstream_cx_active | Gauge | Active HTTP/2 connections |
envoy_cluster_upstream_rq_active | Gauge | Active requests (streams) across all connections |
envoy_cluster_upstream_cx_rx_bytes_total | Counter | Bytes received |
envoy_cluster_upstream_cx_tx_bytes_total | Counter | Bytes sent |
envoy_cluster_http2_pending_send_bytes | Gauge | HTTP/2 data buffered waiting to send (flow control backpressure) |
PromQL queries:
# Active streams per connection (high = good multiplexing, but watch limits)
envoy_cluster_upstream_rq_active / envoy_cluster_upstream_cx_active
# HTTP/2 flow control backpressure (buffered data waiting to send)
envoy_cluster_http2_pending_send_bytes > 0
# Max concurrent streams approaching limit
# Default Envoy max concurrent streams per connection: 2147483647
# If your server sets a lower limit, watch this
envoy_cluster_upstream_rq_activeData Plane Summary Dashboard
Here’s a consolidated view of the essential panels for a data plane Grafana dashboard:
| Panel | Query | Protocol |
|---|---|---|
| Request Rate | sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name) | All |
| P99 Latency | histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])) by (le, destination_service_name)) | All |
| Error Rate % | sum(rate(istio_requests_total{response_code=~"5.*"}[5m])) / sum(rate(istio_requests_total[5m])) * 100 | HTTP |
| gRPC Error Rate % | sum(rate(istio_requests_total{grpc_response_status!="0"}[5m])) / sum(rate(istio_requests_total{request_protocol="grpc"}[5m])) * 100 | gRPC |
| TLS Errors | sum(rate(envoy_cluster_ssl_connection_error[5m])) by (pod) | TLS |
| Circuit Breaker Trips | sum(rate(envoy_cluster_upstream_cx_overflow[5m])) by (cluster_name) | All |
| Envoy Memory | envoy_server_memory_allocated | All |
| Active Connections | envoy_cluster_upstream_cx_active | All |
Common Pitfalls
1. Only Monitoring HTTP Status Codes for gRPC
gRPC errors return HTTP 200. If your dashboards only show response_code=~"5.*", you’re blind to gRPC failures. Always use grpc_response_status for gRPC services.
2. Ignoring the Source vs Destination Reporter
reporter="source" and reporter="destination" give different perspectives. Source includes network latency; destination doesn’t. If you mix them in the same query, you’ll get double-counted or misleading results.
3. Missing TLS Error Metrics
mTLS issues are the #1 source of mysterious 503 errors in Istio. Without TLS handshake error metrics, you’ll spend hours debugging what looks like an application error but is actually a certificate problem.
4. Not Using Envoy Response Flags
response_flags tells you why Envoy returned an error, not just what the error code was. A 503 with UH (no healthy upstream) is a completely different problem from a 503 with UC (upstream connection termination). Always include response flags in your error dashboards.
5. Ignoring Saturation Until It’s Too Late
Circuit breaker trips (envoy_cluster_upstream_cx_overflow) are the clearest early warning that a service is under stress. By the time you see elevated error rates, the circuit breaker has already been tripping for a while.
What’s Next
This covered the data plane — the Envoy sidecars processing your traffic. But a healthy data plane depends entirely on a healthy control plane.
Continue to Part 2: Golden Signals for the Control Plane (istiod) — where we cover xDS push latency, config convergence time, xDS rejections, certificate signing health, and istiod scaling thresholds.
Related posts: