IT Dribble

Mutterings, inconsistent tips, rants and randomness

OpenSearch pre-emptive monitoring with Zabbix

10 Pre-emptive Health Checks for an OpenSearch Cluster

Why each metric matters and how to implement it in Zabbix 7

The common failure mode in cluster monitoring is alerting on symptoms rather than causes. status: red tells you something already broke; it doesn’t give you the 30-minute window to act before users notice. The checks below are biased toward leading indicators - metrics that predict problems - and are ordered by API endpoint to make implementation cleaner.

All checks follow the same Zabbix architecture: one master HTTP agent item per endpoint, an LLD rule to discover nodes from the response, per-node dependent items with JSONPath preprocessing, and *_foreach aggregate calculations (calculated items with a trigger on top) where you need fleet-wide views. This avoids polling the cluster 25 times per scrape interval and keeps all node metrics in sync.

Architecture pattern (read this first)

Before building any individual check, establish this structure once:

  • Master items - one HTTP agent item per endpoint (e.g., _cluster/stats, _nodes/stats/jvm). These do the actual HTTP request.
  • LLD rules - parse the master item’s JSON to discover node IDs/names and create per-node dependent items automatically.
  • Dependent items - extract a single field per item using JSONPath preprocessing. Zero additional HTTP requests.
  • Triggers - operate on dependent items directly, or on a calculated item that uses *_foreach aggregate functions across the discovered set (foreach functions are not available in trigger expressions themselves).
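
To make the data flow concrete, here is a minimal Python sketch of the same pattern outside Zabbix: one fetched document, a discovery pass over it, and per-node extraction with no further requests. The JSON is a hypothetical, heavily trimmed response in the shape of _nodes/stats/jvm, and the function names and node names are illustrative only, not Zabbix APIs.

```python
import json

# Hypothetical, trimmed response in the shape of GET /_nodes/stats/jvm.
# In Zabbix this string would be the stored value of the master item.
MASTER_ITEM_VALUE = json.dumps({
    "nodes": {
        "nodeA": {"name": "os-data-01", "jvm": {"mem": {"heap_used_percent": 61}}},
        "nodeB": {"name": "os-data-02", "jvm": {"mem": {"heap_used_percent": 88}}},
    }
})


def discover_nodes(master_value: str) -> list:
    """LLD step: emit one discovery row per node found in the master value."""
    nodes = json.loads(master_value)["nodes"]
    return [
        {"{#NODE_ID}": node_id, "{#NODE_NAME}": body["name"]}
        for node_id, body in nodes.items()
    ]


def dependent_item(master_value: str, node_id: str, path: list):
    """Dependent item step: walk a fixed path under one node, no new HTTP request."""
    value = json.loads(master_value)["nodes"][node_id]
    for key in path:
        value = value[key]
    return value


rows = discover_nodes(MASTER_ITEM_VALUE)
heap = {
    row["{#NODE_NAME}"]: dependent_item(
        MASTER_ITEM_VALUE, row["{#NODE_ID}"], ["jvm", "mem", "heap_used_percent"]
    )
    for row in rows
}
print(heap)
```

Every per-node value comes from the same snapshot, which is why the dependent items stay in sync with each other.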

Two preprocessing gotchas that will bite you:

Counter resets: GC collection_time_in_millis, collection_count, etc. are cumulative and reset when a node restarts. “Simple change” preprocessing will produce a large negative spike on restart. Either add a custom JS step to discard negative values, or prefer “Change per second” which handles this more gracefully.
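
As an illustration of the reset problem, the sketch below computes a per-second rate from a cumulative counter and drops any sample where the counter went backwards. It is plain Python rather than Zabbix preprocessing, and the sample values are invented.

```python
from typing import Optional


def change_per_second(prev_value: float, prev_ts: float,
                      cur_value: float, cur_ts: float) -> Optional[float]:
    """Rate of a cumulative counter; None means 'store nothing for this sample'."""
    elapsed = cur_ts - prev_ts
    if elapsed <= 0:
        return None
    delta = cur_value - prev_value
    if delta < 0:
        # Counter went backwards: the node restarted and the counter reset.
        return None
    return delta / elapsed


# (timestamp_s, collection_time_in_millis) - the third sample follows a restart.
samples = [(0, 12_000), (30, 12_450), (60, 300), (90, 600)]
rates = [
    change_per_second(v0, t0, v1, t1)
    for (t0, v0), (t1, v1) in zip(samples, samples[1:])
]
print(rates)
```

The restart sample is skipped instead of being recorded as a large negative value, and the rate resumes from the next pair of samples.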

*_foreach syntax: If all OpenSearch nodes are discovered under a single Zabbix host via LLD, use item key patterns in your foreach expression. If each node is its own Zabbix host, use host group patterns. The syntax differs and the wrong one produces silently empty results.

1. Unassigned Shard Trend

Why this matters

Cluster status (green/yellow/red) is a lagging, binary indicator - by the time it flips, shards have already failed to assign. More useful is watching indices.shards.unassigned from _cluster/stats over time. A slow, steady climb in unassigned shard count - even while cluster status remains yellow - often precedes a node falling out of the cluster entirely. Catching the trend early gives you time to investigate disk pressure, heap exhaustion, or network partitioning before the situation becomes a full red.

Zabbix implementation

  • Master item: HTTP agent polling GET /_cluster/stats, stored as text.
  • Dependent item: JSONPath $.indices.shards.unassigned, numeric unsigned.
  • Trigger (absolute): Fire at Warning if unassigned count exceeds your shard-per-node baseline (e.g., >10 for a 25-node cluster).
  • Trigger (trend): Use change() to alert if the count is increasing across consecutive checks. Combine with avg(/host/opensearch.shards.unassigned,30m) to reduce noise from transient relocation.
  • Supplement: A second dependent item on $.status with a nodata trigger catches complete API unavailability.

2. Heap Pressure Across the Fleet

Why this matters

Averaging heap utilisation across 25 nodes hides the problem. Three nodes sitting at 87% heap while the rest idle at 55% is a critical situation - those three nodes are close to triggering aggressive GC, potential circuit breaker trips, or OOM kills - but a cluster-average metric reads as unremarkable. The alert needs to fire on the number of nodes above threshold, not the mean.

Zabbix implementation

  • Master item: HTTP agent polling GET /_nodes/stats/jvm.
  • LLD rule: Discover node IDs from $.nodes.* keys, create dependent items per node.
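  • Dependent item per node: Extract jvm.mem.heap_used_percent directly, or extract jvm.mem.heap_used_in_bytes and jvm.mem.heap_max_in_bytes and derive heap_used_in_bytes / heap_max_in_bytes * 100 in a calculated item.
  • Aggregate check (Zabbix 7 *_foreach): foreach functions are evaluated in calculated items, not directly in trigger expressions, so count the hot nodes in a calculated item (the key opensearch.heap.nodes_over_85 is a placeholder):
count(last_foreach(/*/opensearch.heap.pct?[group="opensearch"]),"gt",85)

Then trigger on that item:
last(/host/opensearch.heap.nodes_over_85) >= 3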

Adjust the node count threshold to suit your fleet size and SLA.

  • Severity: Warning at >= 3 nodes above 85%, High at >= 5 nodes or any single node above 92%.
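
The arithmetic behind the "count, not mean" argument is easy to check. This short Python sketch uses the fleet from the example above (three nodes at 87%, the remaining 22 at 55%); the numbers are illustrative.

```python
from statistics import mean

# 25-node fleet: three hot nodes, 22 comfortable ones (heap used, percent).
heap_pct = [87, 87, 87] + [55] * 22

fleet_mean = mean(heap_pct)
nodes_over_85 = sum(1 for pct in heap_pct if pct > 85)

print(f"fleet mean: {fleet_mean:.2f}%")      # well below any sensible threshold
print(f"nodes above 85%: {nodes_over_85}")   # what the alert should key on
```

A mean-based alert at 85% would stay silent here, while the count-based condition (>= 3 nodes) fires.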

3. Disk Space - Absolute and Projected

Why this matters

OpenSearch’s low/high/flood-stage disk watermarks mean that a filling disk degrades the cluster well before it is actually full: the low watermark stops new shards being allocated to the node, the high watermark triggers automatic shard relocation away from it, and the flood-stage watermark places a read-only block on every index with a shard on that node, effectively freezing writes. The absolute alert catches situations where you’re already close to a watermark. The projection alert catches runaway index growth - a new data source ingesting at 3 times the expected rate - before it trips a watermark and starts forcing relocation.

Zabbix implementation

  • Source: $.nodes.*.fs.total.available_in_bytes from _nodes/stats/fs.
  • Dependent item per node: Available bytes and total bytes; derived calculated item for percentage free.
  • Trigger (absolute): last(/host/opensearch.fs.pct_free) < 15 - High severity. Set a second trigger at < 8 for Disaster.
  • Trigger (projected): Zabbix’s built-in timeleft() function handles this cleanly:
timeleft(/host/opensearch.fs.available_bytes,1h,0) < 86400

Fires when linear regression on the last hour of data projects reaching zero within 24 hours. Adjust the history window (1h) if your ingestion is bursty.
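
For intuition about what a linear projection like this does, here is a small Python sketch that fits a least-squares line to (timestamp, free bytes) samples and estimates the time until the line reaches zero. It is an approximation of the idea only, not Zabbix’s implementation: it measures from the last sample rather than from "now", and the sample data is invented.

```python
GIB = 1024 ** 3


def seconds_until_zero(samples):
    """samples: list of (timestamp_s, value). Returns seconds from the last
    sample until the fitted line hits zero, or None if it never will."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # need at least two distinct timestamps
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope >= 0:
        return None  # free space is flat or growing
    intercept = mean_v - slope * mean_t
    zero_at = -intercept / slope
    return max(zero_at - samples[-1][0], 0.0)


# One hour of samples: 100 GiB free, losing 1 GiB every 10 minutes.
history = [(t, 100 * GIB - (t // 600) * GIB) for t in range(0, 3601, 600)]
remaining = seconds_until_zero(history)
print(f"projected exhaustion in {remaining / 3600:.1f} h")
print("trigger fires" if remaining < 86400 else "ok")
```

With 94 GiB left and a steady 6 GiB/hour loss, the projection lands under the 24-hour threshold even though the disk is nowhere near a watermark yet.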

4. Old-Gen GC Time as a Fraction of Wall Clock

Why this matters

Individual GC metrics - “collection count” or “last collection duration” - miss the compound picture. A node running 200 short old-gen GCs per minute can be just as degraded as one running 2 long ones, but the duration-per-collection metric looks very different in each case. The better signal is: what fraction of wall-clock time is this node spending stopped for GC? Compute Δgc.collectors.old.collection_time_in_millis / scrape_interval_ms. Above ~5% means the node is materially impaired; above ~15% means it’s effectively in a GC spiral and should be investigated immediately.

Zabbix implementation

  • Dependent item per node: Extract $.nodes.*.jvm.gc.collectors.old.collection_time_in_millis, stored as numeric float because the preprocessing below produces fractional values.
  • Preprocessing: Apply “Change per second” to get ms/s of GC time. Multiply by 0.1 to express as a percentage of wall clock (since 1000ms/s = 100%).
  • Discard negative step: Add a custom JS preprocessing step after “Change per second”: return value < 0 ? null : value; - this silently drops restart-reset artefacts.
  • Trigger: avg(/host/opensearch.gc.old.pct,5m) > 5 - High. Add a second trigger at > 15 for Disaster.
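
A worked example of the percentage calculation, as a Python sketch with invented counter values. The 5% and 15% cut-offs are the ones suggested above; the function names are illustrative.

```python
def old_gc_wall_clock_pct(prev_ms: int, cur_ms: int, interval_s: float) -> float:
    """Percent of wall-clock time spent in old-gen GC between two samples."""
    gc_ms_per_s = (cur_ms - prev_ms) / interval_s  # what "Change per second" yields
    return gc_ms_per_s * 0.1                       # 1000 ms/s == 100%


def classify(pct: float) -> str:
    """Map a percentage onto the severities used in this check."""
    if pct > 15:
        return "Disaster"
    if pct > 5:
        return "High"
    return "OK"


# Counter moved from 120000 ms to 123000 ms over a 30 s scrape interval:
# 3000 ms of GC in 30000 ms of wall clock.
pct = old_gc_wall_clock_pct(120_000, 123_000, 30)
print(f"{pct:.1f}% -> {classify(pct)}")
```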

5. Young-Gen + Old-Gen GC Frequency Trending Together

Why this matters

Young-gen GC alone is expected and healthy - it’s the JVM doing normal object lifecycle management. What signals heap trouble is when young-gen collection rate and old-gen collection rate both trend upward together. This is the canonical signature of either a heap size that’s too small for the workload, or a memory leak promoting live objects into old-gen faster than GC can reclaim them. Either condition will eventually cause the node to spiral into GC thrash. Catching both rates climbing in the same window gives you 10-30 minutes of warning before the node becomes unresponsive.

Zabbix implementation

  • Items per node: Two “Change per second” items - one on gc.collectors.young.collection_count, one on gc.collectors.old.collection_count.
  • Individual triggers: Alert if young-gen rate exceeds a baseline (tune per cluster) or old-gen rate exceeds ~0.5/min (about 0.0083/s, since “Change per second” items store per-second rates) - these are useful alerts in their own right.
  • Compound trigger: Use an expression that requires both to be elevated simultaneously:
avg(/host/opensearch.gc.young.rate,10m) > {$GC_YOUNG_THRESHOLD}
and
avg(/host/opensearch.gc.old.rate,10m) > {$GC_OLD_THRESHOLD}

Use macros for thresholds so they can be tuned per host without modifying template expressions.

6. Open File Descriptors Approaching Limit

Why this matters

OpenSearch is file-descriptor-intensive: each shard segment, network connection, and log file consumes one. The commonly configured limit of 65535 sounds generous until a shard explosion (too many small indices, a misconfigured ISM policy, or a runaway dynamic template) multiplies open files rapidly. Hitting the FD limit causes cryptic errors - failed connections, shard assignment failures, index corruption - that are difficult to trace back to their source. This check is unglamorous but reliably catches both misconfiguration and organic growth before they cause an outage.

Zabbix implementation

  • Source: $.nodes.*.process.open_file_descriptors and $.nodes.*.process.max_file_descriptors from _nodes/stats/process.
  • Calculated item per node: open / max * 100 for percentage utilisation.
  • Triggers:
    • last(/host/opensearch.fd.pct) > 70 - Warning
    • last(/host/opensearch.fd.pct) > 85 - High

7. Sustained Per-Node CPU

Why this matters

In a well-balanced 25-node cluster, individual nodes shouldn’t sustain high CPU under normal load - OpenSearch distributes query and indexing work across shards. When a single node sits at >80% CPU sustained, the likely causes are: a hot shard (query traffic concentrated on shards only that node holds), a mapping explosion (wildcard or overly dynamic mappings generating enormous field counts), or a runaway aggregation query. None of these will resolve on their own, and all of them will cascade - the hot node slows, its response times increase, the coordinator retries, making it hotter. A per-node sustained CPU alert gives you the signal to identify and redirect or kill the offending shard/query before the node falls out.

Zabbix implementation

  • Source: $.nodes.*.process.cpu.percent from _nodes/stats/process.
  • Trigger: avg(/host/opensearch.process.cpu.pct,5m) > 80 - High.
  • Use a 5-minute average, not last(), to avoid firing on brief spikes during segment merges or snapshot operations.
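
To show why the averaging window matters, this Python sketch replays two invented CPU series through a 5-minute sliding average with an 80% threshold: one with a single-sample spike, one with a sustained climb. The class is a hypothetical stand-in for the avg(...,5m) trigger, not Zabbix code.

```python
from collections import deque


class SustainedAverage:
    """Fires only when the mean over the trailing window exceeds the threshold."""

    def __init__(self, window_s: float, threshold: float):
        self.window_s = window_s
        self.threshold = threshold
        self.samples = deque()  # (timestamp_s, value)

    def add(self, ts: float, value: float) -> bool:
        self.samples.append((ts, value))
        # Evict samples that have aged out of the window.
        while self.samples and self.samples[0][0] <= ts - self.window_s:
            self.samples.popleft()
        avg = sum(v for _, v in self.samples) / len(self.samples)
        return avg > self.threshold


def run(values, interval_s=30):
    check = SustainedAverage(window_s=300, threshold=80)
    return [check.add(i * interval_s, v) for i, v in enumerate(values)]


spike = run([40] * 10 + [95] + [40] * 5)   # one 30 s spike, e.g. a segment merge
sustained = run([40] * 10 + [90] * 10)     # node pinned at 90% for five minutes

print("spike fired:", any(spike))
print("sustained fired at sample:", sustained.index(True))
```

The spike never moves the average far enough to fire, while the sustained load fires once most of the window is hot. A last()-based trigger would have fired on both.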

8. Admission Control Signals (resource_usage_stats)

Why this matters

OpenSearch’s admission control subsystem uses CPU utilisation, IO utilisation, and memory utilisation metrics to decide when to start rejecting requests to protect cluster stability. These are the same signals OpenSearch itself acts on - when they’re elevated, the cluster is actively considering shedding traffic. Alerting on them gives you a warning window before clients start seeing HTTP 429 rejections from admission control. This is particularly important for CPU and IO, which can spike faster than GC metrics surface.

Zabbix implementation

  • Master item: HTTP agent polling GET /_nodes/stats/resource_usage_stats.
  • LLD + dependent items: Same pattern as heap - discover nodes, extract cpu_utilization_percent, io_usage_percent, memory_utilization_percent per node.
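  • Aggregate check: Wrap a foreach function in an outer aggregate inside a calculated item - for example count(last_foreach(...),"gt",70) for the number of nodes above threshold, or max(max_foreach(...,5m)) for the worst node - and trigger when multiple nodes show elevated signals simultaneously. A single node spiking during a merge is less interesting than three nodes simultaneously elevated.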
  • Thresholds: These metrics are already percentage-of-capacity figures, so 70/85% thresholds are reasonable starting points, tuned against observed baseline.

9. Search Pipeline Mean Latency and Failure Rate

Why this matters

If you’re using OpenSearch search pipelines, failures and regressions in pipeline processing manifest as generalised query slowness that is difficult to distinguish from shard-level performance issues. The _nodes/stats/search_pipeline endpoint exposes cumulative time_in_millis and count per pipeline, plus a failed counter. Monitoring derived mean latency and failure rate catches pipeline regressions early and isolates the cause - an updated processor, a new pipeline config, or a downstream service the pipeline calls.

A note on p95: the endpoint does not expose histogram buckets, so true percentile latency (p95, p99) is not derivable from this data. Zabbix 7’s bucket_percentile() function requires actual histogram bucket items to operate on, which OpenSearch doesn’t provide here. Mean latency and failure rate are what’s actually available; don’t attempt to infer p95 from the cumulative counters.

Zabbix implementation

  • LLD: Discover pipeline names from the response.
  • Dependent items per pipeline: Two “Change per second” items - on total_request.time_in_millis and total_request.count. Derived calculated item: latency_rate / count_rate for mean latency (handle divide-by-zero with a JS preprocessing step).
  • Failure rate item: “Change per second” on total_request.failed.
  • Triggers:
    • Mean latency increasing beyond baseline: avg(/host/opensearch.pipeline.latency_ms,10m) > {$PIPELINE_LATENCY_THRESHOLD}
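    • Any failures: change(/host/opensearch.pipeline.failures) > 0 on the raw failed counter (or last() > 0 on the rate item) - Warning; an average failure rate above 1/min (about 0.017/s on the rate item) - High.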
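
The derivation of mean latency and failure rate from cumulative counters can be sketched in a few lines of Python. The two snapshots below are invented and only mimic the count, time_in_millis and failed fields described above; returning None for an idle interval is one possible way to handle the divide-by-zero case.

```python
from typing import Optional


def mean_latency_ms(prev: dict, cur: dict) -> Optional[float]:
    """Mean latency of the requests completed between two snapshots."""
    d_count = cur["count"] - prev["count"]
    d_time = cur["time_in_millis"] - prev["time_in_millis"]
    if d_count <= 0 or d_time < 0:
        return None  # idle interval or counter reset: no meaningful mean
    return d_time / d_count


def failures_per_minute(prev: dict, cur: dict, interval_s: float) -> Optional[float]:
    """Failure rate between two snapshots, scaled to per-minute."""
    d_failed = cur["failed"] - prev["failed"]
    if d_failed < 0 or interval_s <= 0:
        return None
    return d_failed * 60 / interval_s


# Two hypothetical snapshots of one pipeline's total_request stats, 30 s apart.
before = {"count": 1000, "time_in_millis": 50_000, "failed": 2}
after = {"count": 1200, "time_in_millis": 62_000, "failed": 3}

print(mean_latency_ms(before, after))          # 12000 ms over 200 requests
print(failures_per_minute(before, after, 30))  # 1 failure in 30 s
print(mean_latency_ms(after, after))           # idle interval
```

This is a mean over the interval only. As noted above, nothing in these counters allows a percentile to be recovered.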

10. Node Count Stability

Why this matters

On a 25-node cluster, losing a node is not immediately obvious. OpenSearch will attempt to relocate the absent node’s shards, cluster status may stay yellow rather than red, and depending on replica configuration, no data loss occurs. The operational risk is that this can go unnoticed for hours or days - and during that window, the cluster is running with degraded redundancy. A second node loss during that period could be catastrophic. This check is trivial to implement and closes the gap between “the cluster is still serving requests” and “the cluster is operating as expected.”

Zabbix implementation

  • Source: $.nodes.count.total from _cluster/stats (same master item as check #1).
  • Dependent item: Single JSONPath extraction, numeric unsigned.
  • Trigger: change(/host/opensearch.nodes.count) <> 0 - High.
  • Optionally set the expected count as a host macro {$OPENSEARCH_NODE_COUNT} and trigger on last() <> {$OPENSEARCH_NODE_COUNT} for a more explicit check that also catches a newly added node (if that’s operationally significant); use < instead if only node loss matters.
  • Add a nodata(/host/opensearch.nodes.count,300) = 1 trigger to catch scenarios where the entire _cluster/stats endpoint becomes unreachable.

Putting It Together

#   Metric                        API Endpoint                        Zabbix Pattern
1   Unassigned shard trend        _cluster/stats                      HTTP agent -> dependent item -> change() trigger
2   Heap % on N nodes             _nodes/stats/jvm                    LLD -> per-node items -> count(last_foreach()) aggregate
3   Disk absolute + projected     _nodes/stats/fs                     LLD -> per-node items -> last() + timeleft()
4   Old-gen GC % wall clock       _nodes/stats/jvm                    LLD -> change/s preprocessing -> avg trigger
5   Young + old GC trending       _nodes/stats/jvm                    LLD -> two rate items -> compound AND trigger
6   File descriptor %             _nodes/stats/process                LLD -> calculated item -> threshold trigger
7   Sustained per-node CPU        _nodes/stats/process                LLD -> avg(5m) trigger
8   Admission control signals     _nodes/stats/resource_usage_stats   LLD -> per-node items -> max_foreach()
9   Pipeline latency + failures   _nodes/stats/search_pipeline        LLD -> rate preprocessing -> compound triggers
10  Node count stability          _cluster/stats                      Dependent item -> change() <> 0 trigger

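The master HTTP agent items cover six distinct endpoints as listed above. The jvm, fs and process metrics can also be fetched in a single call (GET /_nodes/stats/jvm,fs,process), which brings that down to four. Everything else derives from those requests. At a 30-second scrape interval on a 25-node cluster, that’s four to six HTTP requests per interval to OpenSearch - manageable overhead with no per-node polling.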