10 Pre-emptive Health Checks for an OpenSearch Cluster
Why each metric matters and how to implement it in Zabbix 7
The common failure mode in cluster monitoring is alerting on symptoms rather than causes. status: red tells you something already broke; it doesn’t give you the 30-minute window to act before users notice. The checks below are biased toward leading indicators - metrics that predict problems - and are ordered by API endpoint to make implementation cleaner.
All checks follow the same Zabbix architecture: one master HTTP agent item per endpoint, an LLD rule to discover nodes from the response, per-node dependent items with JSONPath preprocessing, and *_foreach aggregate calculations where you need fleet-wide views. This avoids polling the cluster 25 times per scrape interval and keeps all node metrics in sync.
—
Architecture pattern (read this first)
Before building any individual check, establish this structure once:
- Master items - one HTTP agent item per endpoint (e.g., _cluster/stats, _nodes/stats/jvm). These do the actual HTTP request.
- LLD rules - parse the master item’s JSON to discover node IDs/names and create per-node dependent items automatically.
- Dependent items - extract a single field per item using JSONPath preprocessing. Zero additional HTTP requests.
- Triggers - operate on dependent items directly, or on a calculated item that applies *_foreach aggregate functions across the discovered set (foreach functions are evaluated in calculated items, not in trigger expressions).
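As a rough illustration of the LLD step, the sketch below turns a node-stats response into the array of macro objects that a discovery rule consumes. It is written as a plain function so it can be run outside Zabbix; in a JavaScript preprocessing step only the function body would be used, with `value` being the master item's raw response. The macro names `{#NODE_ID}` and `{#NODE_NAME}` and the sample payload are assumptions made for the example.

```javascript
// Sketch of an LLD preprocessing step (assumed macro names, made-up sample data).
// `value` is the master item's JSON response as a string.
function discoverNodes(value) {
  var nodes = JSON.parse(value).nodes || {};
  // One discovery row per node ID found under $.nodes
  var rows = Object.keys(nodes).map(function (id) {
    return { '{#NODE_ID}': id, '{#NODE_NAME}': nodes[id].name };
  });
  return JSON.stringify(rows);
}

// Trimmed, hypothetical sample of a _nodes/stats response.
var sample = JSON.stringify({
  nodes: {
    abc123: { name: 'os-data-01' },
    def456: { name: 'os-data-02' }
  }
});

console.log(discoverNodes(sample));
```

Item prototypes can then reference `{#NODE_ID}` in their JSONPath, so each discovered node gets its own dependent items without any extra HTTP request.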
Two preprocessing gotchas that will bite you:
Counter resets: GC collection_time_in_millis, collection_count, etc. are cumulative and reset when a node restarts. “Simple change” preprocessing will produce a large negative spike on restart. Either add a custom JS step to discard negative values, or prefer “Change per second”, which discards a sample whose value is lower than the previous one instead of storing a negative rate.
*_foreach syntax: If all OpenSearch nodes are discovered under a single Zabbix host via LLD, use item key patterns in your foreach expression. If each node is its own Zabbix host, use host group patterns. The syntax differs and the wrong one produces silently empty results.
—
1. Unassigned Shard Trend
Why this matters
Cluster status (green/yellow/red) is a lagging, binary indicator - by the time it flips, shards have already failed to assign. More useful is watching indices.shards.unassigned from _cluster/stats over time. A slow, steady climb in unassigned shard count - even while cluster status remains yellow - often precedes a node falling out of the cluster entirely. Catching the trend early gives you time to investigate disk pressure, heap exhaustion, or network partitioning before the situation becomes a full red.
Zabbix implementation
- Master item: HTTP agent polling GET /_cluster/stats, stored as text.
- Dependent item: JSONPath $.indices.shards.unassigned, numeric unsigned. Confirm this path against a live response before relying on it: the unassigned count is documented under _cluster/health as unassigned_shards, so the item may need to read from that endpoint instead.
- Trigger (absolute): Fire at Warning if the unassigned count exceeds your shard-per-node baseline (e.g., >10 for a 25-node cluster).
- Trigger (trend): Use change() to alert if the count is increasing across consecutive checks. Combine with avg(/host/opensearch.shards.unassigned,30m) to reduce noise from transient relocation.
- Supplement: A second dependent item on $.status with a nodata trigger catches complete API unavailability.
—
2. Heap Pressure Across the Fleet
Why this matters
Averaging heap utilisation across 25 nodes hides the problem. Three nodes sitting at 87% heap while the rest idle at 55% is a critical situation - those three nodes are close to triggering aggressive GC, potential circuit breaker trips, or OOM kills - but a cluster-average metric reads as unremarkable. The alert needs to fire on the number of nodes above threshold, not the mean.
Zabbix implementation
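- Master item: HTTP agent polling GET /_nodes/stats/jvm.
- LLD rule: Discover node IDs from the $.nodes.* keys and create dependent items per node.
- Dependent item per node: Extract jvm.mem.heap_used_percent directly, or extract jvm.mem.heap_used_in_bytes and jvm.mem.heap_max_in_bytes and derive used / max * 100 in a calculated item.
- Aggregate item (Zabbix 7 *_foreach): foreach functions are evaluated in calculated items rather than in trigger expressions, so count the nodes above threshold in a calculated item:
count(last_foreach(/*/opensearch.heap.pct?[group="opensearch"]),"gt",85)
Then trigger on that item, for example last(/host/opensearch.heap.nodes_over_85) >= 3 (the item key here is illustrative). Adjust the node count threshold to suit your fleet size and SLA.
- Severity: Warning at >= 3 nodes above 85%, High at >= 5 nodes or any single node above 92%.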
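To make the "count, not mean" argument concrete, here is a minimal sketch that computes both figures from a node-stats payload. The function names and the five-node sample are invented for illustration; only the heap_used_in_bytes and heap_max_in_bytes field names come from the text above.

```javascript
// Heap utilisation for one node, as a percentage.
function heapPercent(mem) {
  return mem.heap_used_in_bytes * 100 / mem.heap_max_in_bytes;
}

// Number of nodes whose heap utilisation is above thresholdPct.
function nodesAboveHeap(value, thresholdPct) {
  var nodes = JSON.parse(value).nodes || {};
  return Object.keys(nodes).filter(function (id) {
    return heapPercent(nodes[id].jvm.mem) > thresholdPct;
  }).length;
}

// Fleet-wide mean, shown only to contrast with the count.
function meanHeapPercent(value) {
  var nodes = JSON.parse(value).nodes || {};
  var ids = Object.keys(nodes);
  var sum = ids.reduce(function (acc, id) {
    return acc + heapPercent(nodes[id].jvm.mem);
  }, 0);
  return ids.length ? sum / ids.length : NaN;
}

// Hypothetical sample: three hot nodes, two idle ones (byte values scaled down).
function node(used) {
  return { jvm: { mem: { heap_used_in_bytes: used, heap_max_in_bytes: 1000 } } };
}
var sample = JSON.stringify({
  nodes: { n1: node(870), n2: node(880), n3: node(900), n4: node(550), n5: node(550) }
});

console.log('mean %:', meanHeapPercent(sample));
console.log('nodes above 85%:', nodesAboveHeap(sample, 85));
```

In this sample the mean sits well below the 85% threshold while three nodes are above it, which is exactly the case an average-based alert would miss.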
—
3. Disk Space - Absolute and Projected
Why this matters
OpenSearch’s low/high/flood watermarks mean that disk exhaustion doesn’t just cause data loss - it degrades the cluster in stages. At the low watermark (85% used by default) OpenSearch stops allocating new shards to the node, at the high watermark (90%) it starts relocating shards away from it, and at the flood-stage watermark (95%) it marks every index with a shard on that node read-only, which effectively freezes writes. The absolute alert catches situations where you’re already close to a watermark. The projection alert catches runaway index growth - a new data source ingesting at 3 times the expected rate - before it trips a watermark and starts forcing relocation.
Zabbix implementation
- Source: $.nodes.*.fs.total.available_in_bytes from _nodes/stats/fs.
- Dependent item per node: Available bytes and total bytes; derived calculated item for percentage free.
- Trigger (absolute): last(/host/opensearch.fs.pct_free) < 15 - High severity. Set a second trigger at < 8 for Disaster.
- Trigger (projected): Zabbix’s built-in timeleft() function handles this cleanly:
timeleft(/host/opensearch.fs.available_bytes,1h,0) < 86400
Fires when linear regression on the last hour of data projects reaching zero within 24 hours. Adjust the history window (1h) if your ingestion is bursty.
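The idea behind the projection can be sketched as an ordinary least-squares fit: fit a line through the recent samples and ask when it crosses the threshold. This is an illustration of the technique only, not Zabbix's own implementation; the function name, the sample shape ({t, v}) and the choice to return null for a non-shrinking series are assumptions of the sketch.

```javascript
// samples: [{ t: <seconds>, v: <available bytes> }, ...] in time order.
// Returns seconds from the last sample until the fitted line reaches `threshold`,
// or null when the series is too short or is not shrinking.
function timeLeftSeconds(samples, threshold) {
  var n = samples.length;
  if (n < 2) return null;
  var sumT = 0, sumV = 0, sumTT = 0, sumTV = 0;
  samples.forEach(function (s) {
    sumT += s.t;
    sumV += s.v;
    sumTT += s.t * s.t;
    sumTV += s.t * s.v;
  });
  var denom = n * sumTT - sumT * sumT;
  if (denom === 0) return null;            // all samples share one timestamp
  var slope = (n * sumTV - sumT * sumV) / denom;
  var intercept = (sumV - slope * sumT) / n;
  if (slope >= 0) return null;             // free space is flat or growing
  var tHit = (threshold - intercept) / slope;
  return Math.max(0, tHit - samples[n - 1].t);
}

// Hypothetical hour of samples, one every 10 minutes, losing 1 unit per second.
var samples = [];
for (var t = 0; t <= 3600; t += 600) {
  samples.push({ t: t, v: 100000 - t });
}

var left = timeLeftSeconds(samples, 0);
console.log('seconds left:', left, 'would alert:', left !== null && left < 86400);
```

A short window reacts quickly but over-projects on bursts; widening it trades reaction time for stability, which is the same trade-off as adjusting the 1h parameter in the trigger.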
—
4. Old-Gen GC Time as a Fraction of Wall Clock
Why this matters
Individual GC metrics - “collection count” or “last collection duration” - miss the compound picture. A node running 200 short old-gen GCs per minute is just as degraded as one running 2 long ones, but the duration-per-collection metric looks very different in each case. The better signal is: what fraction of wall-clock time is this node spending stopped for GC? Compute Δgc.collectors.old.collection_time_in_millis / scrape_interval_ms. Above ~5% means the node is materially impaired; above ~15% means it’s effectively in a GC spiral and should be investigated immediately.
Zabbix implementation
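- Dependent item per node: Extract $.nodes.<node_id>.jvm.gc.collectors.old.collection_time_in_millis. Store the result as numeric (float), because the rate computed below is fractional.
- Preprocessing: Apply “Change per second” to get ms/s of GC time. Multiply by 0.1 to express as a percentage of wall clock (since 1000ms/s = 100%).
- Discard negative step: “Change per second” already drops a sample whose counter is lower than the previous one, so this step is only needed if you use “Simple change” instead. In that case, add a custom JS step after it that throws on negative input, e.g. if (value < 0) throw 'counter reset'; return value; and set “Custom on fail” to “Discard value” so restart-reset artefacts are dropped rather than stored.
- Trigger: avg(/host/opensearch.gc.old.pct,5m) > 5 - High. Add a second trigger at > 15 for Disaster.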
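The arithmetic behind the percentage is easy to get wrong by a factor of ten, so here is a small worked sketch. The function name and sample numbers are made up; the calculation is simply the counter delta divided by the elapsed wall-clock time, with a guard for counter resets.

```javascript
// prev / curr: { ts: <unix seconds>, gcMs: <cumulative old-gen GC time in ms> }
// Returns the percentage of wall-clock time spent in old-gen GC between the
// two samples, or null when the counter went backwards (node restart).
function gcWallClockPct(prev, curr) {
  var elapsedSec = curr.ts - prev.ts;
  var deltaMs = curr.gcMs - prev.gcMs;
  if (elapsedSec <= 0 || deltaMs < 0) return null;
  // deltaMs / elapsedSec is ms of GC per second; 1000 ms/s would be 100%.
  return deltaMs / (elapsedSec * 1000) * 100;
}

// Hypothetical samples 30 s apart: 1500 ms of old-gen GC in that window.
var before = { ts: 1700000000, gcMs: 120000 };
var after = { ts: 1700000030, gcMs: 121500 };
console.log('GC % of wall clock:', gcWallClockPct(before, after));

// After a restart the cumulative counter starts again from a low value.
var restarted = { ts: 1700000060, gcMs: 40 };
console.log('after restart:', gcWallClockPct(after, restarted));
```

1500 ms of collection in a 30-second window is 50 ms/s, which lands right at the ~5% level the text describes as materially impaired.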
—
5. Young-Gen + Old-Gen GC Frequency Trending Together
Why this matters
Young-gen GC alone is expected and healthy - it’s the JVM doing normal object lifecycle management. What signals heap trouble is when young-gen collection rate and old-gen collection rate both trend upward together. This is the canonical signature of either a heap size that’s too small for the workload, or a memory leak promoting live objects into old-gen faster than GC can reclaim them. Either condition will eventually cause the node to spiral into GC thrash. Catching both rates climbing in the same window gives you 10-30 minutes of warning before the node becomes unresponsive.
Zabbix implementation
- Items per node: Two “Change per second” items - one on gc.collectors.young.collection_count, one on gc.collectors.old.collection_count.
- Individual triggers: Alert if young-gen rate exceeds a baseline (tune per cluster) or old-gen rate exceeds ~0.5/min (about 0.0083/s, since “Change per second” yields a per-second rate) - these are independently useful alerts.
- Compound trigger: Use an expression that requires both to be elevated simultaneously:
avg(/host/opensearch.gc.young.rate,10m) > {$GC_YOUNG_THRESHOLD}
and
avg(/host/opensearch.gc.old.rate,10m) > {$GC_OLD_THRESHOLD}
Use macros for thresholds so they can be tuned per host without modifying template expressions.
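The compound condition can be sketched outside Zabbix as follows, mainly to show that neither rate alone satisfies it. The function names, sample rates and threshold values are placeholders, not recommendations.

```javascript
// Arithmetic mean of an array of numbers (NaN for an empty array).
function mean(values) {
  if (values.length === 0) return NaN;
  var sum = values.reduce(function (acc, v) { return acc + v; }, 0);
  return sum / values.length;
}

// True only when BOTH average rates over the window exceed their thresholds,
// mirroring the "and" in the trigger expression.
function gcPressure(youngRates, oldRates, youngThreshold, oldThreshold) {
  return mean(youngRates) > youngThreshold && mean(oldRates) > oldThreshold;
}

// Hypothetical per-second collection rates sampled over a 10-minute window.
var youngBusy = [2.1, 2.4, 2.6, 2.9];
var oldQuiet = [0.001, 0.002, 0.001, 0.002];
var oldBusy = [0.02, 0.03, 0.04, 0.05];

console.log('young only:', gcPressure(youngBusy, oldQuiet, 2.0, 0.01));
console.log('both elevated:', gcPressure(youngBusy, oldBusy, 2.0, 0.01));
```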
—
6. Open File Descriptors Approaching Limit
Why this matters
OpenSearch is file-descriptor-intensive: each shard segment, network connection, and log file consumes one. A limit of 65535 (a common setting on OpenSearch hosts) sounds generous until a shard explosion (too many small indices, a misconfigured ISM policy, or a runaway dynamic template) multiplies open files rapidly. Hitting the FD limit causes cryptic errors - failed connections, shard assignment failures, index corruption - that are difficult to trace back to their source. This check is unglamorous but reliably catches both misconfiguration and organic growth before they cause an outage.
Zabbix implementation
- Source: $.nodes.<node_id>.process.open_file_descriptors and $.nodes.<node_id>.process.max_file_descriptors from _nodes/stats/process.
- Calculated item per node: open / max * 100 for percentage utilisation.
- Triggers:
  - last(/host/opensearch.fd.pct) > 70 - Warning
  - last(/host/opensearch.fd.pct) > 85 - High
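As an alternative to a calculated item, the percentage can be derived in a single preprocessing step on the node's process object. The sketch below assumes a preceding JSONPath step has already narrowed the value to that object; the function name and sample numbers are illustrative.

```javascript
// `value` is the JSON text of one node's "process" object.
// Throws when the limit is missing or zero, so that a "Custom on fail"
// handler can decide what to do instead of storing a bogus percentage.
function fdPercent(value) {
  var proc = JSON.parse(value);
  if (!(proc.max_file_descriptors > 0)) {
    throw new Error('max_file_descriptors missing or zero');
  }
  return proc.open_file_descriptors * 100 / proc.max_file_descriptors;
}

// Hypothetical sample: half of the descriptors in use.
var sample = JSON.stringify({
  open_file_descriptors: 32768,
  max_file_descriptors: 65536
});
console.log('FD utilisation %:', fdPercent(sample));
```

Doing the division in preprocessing keeps the check to one item per node, at the cost of not storing the raw open and max values separately.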
—
7. Sustained Per-Node CPU
Why this matters
In a well-balanced 25-node cluster, individual nodes shouldn’t sustain high CPU under normal load - OpenSearch distributes query and indexing work across shards. When a single node sits at >80% CPU sustained, the likely causes are: a hot shard (query traffic concentrated on shards only that node holds), a mapping explosion (wildcard or overly dynamic mappings generating enormous field counts), or a runaway aggregation query. None of these will resolve on their own, and all of them will cascade - the hot node slows, its response times increase, the coordinator retries, making it hotter. A per-node sustained CPU alert gives you the signal to identify and redirect or kill the offending shard/query before the node falls out.
Zabbix implementation
- Source: $.nodes.<node_id>.process.cpu.percent from _nodes/stats/process.
- Trigger: avg(/host/opensearch.process.cpu.pct,5m) > 80 - High.
- Use a 5-minute average, not last(), to avoid firing on brief spikes during segment merges or snapshot operations.
—
8. Admission Control Signals (resource_usage_stats)
Why this matters
OpenSearch’s admission control subsystem uses CPU utilisation, IO utilisation, and memory utilisation metrics to decide when to start rejecting requests to protect cluster stability. These are the same signals OpenSearch itself acts on - when they’re elevated, the cluster is actively considering shedding traffic. Alerting on them gives you a warning window before clients start seeing admission control rejections (HTTP 429). This is particularly important for CPU and IO, which can spike faster than GC metrics surface.
Zabbix implementation
- Master item: HTTP agent polling GET /_nodes/stats/resource_usage_stats.
- LLD + dependent items: Same pattern as heap - discover nodes, extract cpu_utilization_percent, io_usage_percent, memory_utilization_percent per node.
- Aggregate check: Use max_foreach or count_foreach in a calculated item (foreach functions are not evaluated in trigger expressions) and trigger on that item, so the alert fires when multiple nodes show elevated signals simultaneously - a single node spiking during a merge is less interesting than three nodes simultaneously elevated.
- Thresholds: These metrics are already percentage-of-capacity figures, so 70/85% thresholds are reasonable starting points, tuned against observed baseline.
—
9. Search Pipeline Mean Latency and Failure Rate
Why this matters
If you’re using OpenSearch search pipelines, failures and regressions in pipeline processing manifest as generalised query slowness that is difficult to distinguish from shard-level performance issues. The _nodes/stats/search_pipeline endpoint exposes cumulative time_in_millis and count per pipeline, plus a failed counter. Monitoring derived mean latency and failure rate catches pipeline regressions early and isolates the cause - an updated processor, a new pipeline config, or a downstream service the pipeline calls.
A note on p95: the endpoint does not expose histogram buckets, so true percentile latency (p95, p99) is not derivable from this data. Zabbix 7’s bucket_percentile() function requires actual histogram bucket items to operate on, which OpenSearch doesn’t provide here. Mean latency and failure rate are what’s actually available; don’t attempt to infer p95 from the cumulative counters.
Zabbix implementation
- LLD: Discover pipeline names from the response.
- Dependent items per pipeline: Two “Change per second” items - on total_request.time_in_millis and total_request.count. Derived calculated item: latency_rate / count_rate for mean latency (guard against divide-by-zero for intervals in which no requests were processed).
- Failure rate item: “Change per second” on total_request.failed.
- Triggers:
  - Mean latency increasing beyond baseline: avg(/host/opensearch.pipeline.latency_ms,10m) > {$PIPELINE_LATENCY_THRESHOLD}
  - Any failures: last(/host/opensearch.pipeline.failures) > 0 - Warning; a sustained average above 1/min (about 0.0167/s, since the item is a per-second rate) - High.
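The derived mean latency is just the ratio of two counter deltas, and the edge cases are where it tends to break. A minimal sketch, with an invented function name and sample values, and with the choice of returning null (no sample) for idle or reset intervals being an assumption of the example:

```javascript
// prev / curr: { count: <cumulative requests>, time_in_millis: <cumulative ms> }
// Returns mean latency in ms for requests completed between the two samples,
// or null when nothing was processed or a counter went backwards.
function pipelineMeanLatencyMs(prev, curr) {
  var deltaCount = curr.count - prev.count;
  var deltaTime = curr.time_in_millis - prev.time_in_millis;
  if (deltaCount < 0 || deltaTime < 0) return null; // counters reset on restart
  if (deltaCount === 0) return null;                // idle interval: avoid 0/0
  return deltaTime / deltaCount;
}

// Hypothetical consecutive samples: 200 requests took 12000 ms in total.
var earlier = { count: 1000, time_in_millis: 50000 };
var later = { count: 1200, time_in_millis: 62000 };
console.log('mean latency ms:', pipelineMeanLatencyMs(earlier, later));
console.log('idle interval:', pipelineMeanLatencyMs(later, later));
```

Returning no value for an idle interval keeps a quiet pipeline from dragging the 10-minute average towards zero and masking a regression.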
—
10. Node Count Stability
Why this matters
On a 25-node cluster, losing a node is not immediately obvious. OpenSearch will attempt to relocate the absent node’s shards, cluster status may stay yellow rather than red, and depending on replica configuration, no data loss occurs. The operational risk is that this can go unnoticed for hours or days - and during that window, the cluster is running with degraded redundancy. A second node loss during that period could be catastrophic. This check is trivial to implement and closes the gap between “the cluster is still serving requests” and “the cluster is operating as expected.”
Zabbix implementation
- Source: $.nodes.count.total from _cluster/stats (same master item as check #1).
- Dependent item: Single JSONPath extraction, numeric unsigned.
- Trigger: change(/host/opensearch.nodes.count) <> 0 - High.
- Optionally set the expected count as a host macro {$OPENSEARCH_NODE_COUNT} and trigger on last(/host/opensearch.nodes.count) < {$OPENSEARCH_NODE_COUNT} for a more explicit check. Use <> instead of < if you also want to catch a newly added node (if that’s operationally significant).
- Add a nodata(/host/opensearch.nodes.count,5m) trigger to catch scenarios where the entire _cluster/stats endpoint becomes unreachable.
—
Putting It Together
| # | Metric | API Endpoint | Zabbix Pattern |
|---|---|---|---|
| 1 | Unassigned shard trend | _cluster/stats | HTTP agent - dependent item - change() trigger |
| 2 | Heap % on N nodes | _nodes/stats/jvm | LLD - per-node items - count_foreach() aggregate |
| 3 | Disk absolute + projected | _nodes/stats/fs | LLD - per-node items - last() + timeleft() |
| 4 | Old-gen GC % wall clock | _nodes/stats/jvm | LLD - change/s preprocessing - avg trigger |
| 5 | Young + old GC trending | _nodes/stats/jvm | LLD - two rate items - compound AND trigger |
| 6 | File descriptor % | _nodes/stats/process | LLD - calculated item - threshold trigger |
| 7 | Sustained per-node CPU | _nodes/stats/process | LLD - avg(5m) trigger |
| 8 | Admission control signals | _nodes/stats/resource_usage_stats | LLD - per-node items - max_foreach() |
| 9 | Pipeline latency + failures | _nodes/stats/search_pipeline | LLD - rate preprocessing - compound triggers |
| 10 | Node count stability | _cluster/stats | Dependent item - change() <> 0 trigger |
The master HTTP agent items cover six distinct endpoints: _cluster/stats plus the five _nodes/stats variants listed above. Everything else derives from those six requests. At a 30-second scrape interval on a 25-node cluster, that’s six HTTP requests per interval to OpenSearch - manageable overhead with no per-node polling.