IT Dribble

Mutterings, inconsistent tips, rants and randomness

OpenSearch pre-emptive monitoring with Zabbix

10 Pre-emptive Health Checks for an OpenSearch Cluster

Why each metric matters and how to implement it in Zabbix 7

The common failure mode in cluster monitoring is alerting on symptoms rather than causes. status: red tells you something already broke; it doesn’t give you the 30-minute window to act before users notice. The checks below are biased toward leading indicators - metrics that predict problems - and are grouped roughly by API endpoint to make implementation cleaner.

All checks follow the same Zabbix architecture: one master HTTP agent item per endpoint, an LLD rule to discover nodes from the response, per-node dependent items with JSONPath preprocessing, and *_foreach aggregate triggers where you need fleet-wide views. This avoids polling the cluster 25 times per scrape interval and keeps all node metrics in sync.

Architecture pattern (read this first)

Before building any individual check, establish this structure once:

  • Master items - one HTTP agent item per endpoint (e.g., _cluster/stats, _nodes/stats/jvm). These do the actual HTTP request.
  • LLD rules - parse the master item’s JSON to discover node IDs/names and create per-node dependent items automatically.
  • Dependent items - extract a single field per item using JSONPath preprocessing. Zero additional HTTP requests.
  • Triggers - operate on dependent items directly, or use *_foreach aggregate functions across the discovered set.
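To make the master item / LLD / dependent item split concrete, here is a minimal Python sketch of what each layer does with one response. The sample JSON is a hand-trimmed, hypothetical stand-in for a real _nodes/stats/jvm reply, and the macro names {#NODE_ID} and {#NODE_NAME} are my own choice, not something Zabbix or OpenSearch mandates. In Zabbix itself you would do this with JSONPath or JS preprocessing rather than Python.

```python
import json

# Hypothetical, trimmed-down stand-in for a GET /_nodes/stats/jvm response.
SAMPLE = json.dumps({
    "nodes": {
        "node-a": {"name": "os-data-01", "jvm": {"mem": {"heap_used_percent": 61}}},
        "node-b": {"name": "os-data-02", "jvm": {"mem": {"heap_used_percent": 87}}},
    }
})


def discover_nodes(master_value):
    """LLD step: turn the master item's raw JSON text into one row per node."""
    nodes = json.loads(master_value)["nodes"]
    return [
        {"{#NODE_ID}": node_id, "{#NODE_NAME}": body["name"]}
        for node_id, body in sorted(nodes.items())
    ]


def heap_used_percent(master_value, node_id):
    """Dependent item step: pull a single field for a single node.

    Roughly what a JSONPath such as
    $.nodes['<node id>'].jvm.mem.heap_used_percent would return.
    """
    return json.loads(master_value)["nodes"][node_id]["jvm"]["mem"]["heap_used_percent"]


# One HTTP response (SAMPLE) feeds both discovery and every per-node value.
rows = discover_nodes(SAMPLE)
for row in rows:
    print(row["{#NODE_NAME}"], heap_used_percent(SAMPLE, row["{#NODE_ID}"]))
```

The point of the sketch is that the HTTP request happens once; discovery and every per-node value are just different reads of the same text.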

Two preprocessing gotchas that will bite you:

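Counter resets: GC collection_time_in_millis, collection_count, etc. are cumulative and reset to zero when a node restarts, so the first delta after a restart is negative. Zabbix’s documentation states that both “Simple change” and “Change per second” discard a value that is smaller than the previous one, but confirm that on your version: if negative spikes do reach history, add a custom JS step to discard them. Prefer “Change per second” either way, since it also normalises for irregular polling intervals.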

*_foreach syntax: If all OpenSearch nodes are discovered under a single Zabbix host via LLD, use item key patterns in your foreach expression. If each node is its own Zabbix host, use host group patterns. The syntax differs and the wrong one produces silently empty results.

1. Unassigned Shard Trend

Why this matters

Cluster status (green/yellow/red) is a lagging, binary indicator - by the time it flips, shards have already failed to assign. More useful is watching indices.shards.unassigned from _cluster/stats over time. A slow, steady climb in unassigned shard count - even while cluster status remains yellow - often precedes a node falling out of the cluster entirely. Catching the trend early gives you time to investigate disk pressure, heap exhaustion, or network partitioning before the situation becomes a full red.

Zabbix implementation

  • Master item: HTTP agent polling GET /_cluster/stats, stored as text.
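  • Dependent item: JSONPath $.indices.shards.unassigned, numeric unsigned. If that path returns nothing on your OpenSearch version, GET /_cluster/health exposes the same count as unassigned_shards.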
  • Trigger (absolute): Fire at Warning if unassigned count exceeds your shard-per-node baseline (e.g., >10 for a 25-node cluster).
  • Trigger (trend): Use change() to alert if the count is increasing across consecutive checks. Combine with avg(/host/opensearch.shards.unassigned,30m) to reduce noise from transient relocation.
  • Supplement: A second dependent item on $.status with a nodata trigger catches complete API unavailability.

2. Heap Pressure Across the Fleet

Why this matters

Averaging heap utilisation across 25 nodes hides the problem. Three nodes sitting at 87% heap while the rest idle at 55% is a critical situation - those three nodes are close to triggering aggressive GC, potential circuit breaker trips, or OOM kills - but a cluster-average metric reads as unremarkable. The alert needs to fire on the number of nodes above threshold, not the mean.

Zabbix implementation

  • Master item: HTTP agent polling GET /_nodes/stats/jvm.
  • LLD rule: Discover node IDs from $.nodes.* keys, create dependent items per node.
  • Dependent item per node: Either extract jvm.mem.heap_used_percent directly, or extract jvm.mem.heap_used_in_bytes and jvm.mem.heap_max_in_bytes and derive heap_used / heap_max * 100 in a calculated item.
  • Aggregate trigger (Zabbix 7 *_foreach):
count(last_foreach(/*/opensearch.heap.pct?[group="opensearch"]),"gt",85) >= 3

Adjust the node count threshold to suit your fleet size and SLA.

  • Severity: Warning at >= 3 nodes above 85%, High at >= 5 nodes or any single node above 92%.
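A quick way to convince yourself that the mean hides the problem is to run the numbers. The sketch below uses a made-up 25-node fleet (22 nodes at 55%, 3 at 87%) and applies the severity rules from the bullet above; the function names are mine and purely illustrative.

```python
from statistics import mean

# Hypothetical fleet: 22 nodes idling at 55% heap, 3 nodes at 87%.
heap_pct = [55.0] * 22 + [87.0] * 3


def nodes_above(values, threshold):
    """Count how many nodes are strictly above the threshold."""
    return sum(1 for v in values if v > threshold)


def severity(values):
    """Severity rules from the text: count-based, plus a single-node ceiling."""
    hot = nodes_above(values, 85.0)
    if hot >= 5 or any(v > 92.0 for v in values):
        return "High"
    if hot >= 3:
        return "Warning"
    return "OK"


fleet_mean = mean(heap_pct)
hot_nodes = nodes_above(heap_pct, 85.0)
print(f"mean={fleet_mean:.2f}% hot_nodes={hot_nodes} severity={severity(heap_pct)}")
```

The fleet mean comes out at 58.84%, which no sane threshold would flag, while the count-based rule reports three hot nodes and a Warning.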

3. Disk Space - Absolute and Projected

Why this matters

OpenSearch’s low/high/flood watermarks mean that disk pressure causes trouble long before the disk is actually full: the node stops receiving new shards (low watermark), shards are automatically relocated away from it (high watermark), and finally indices with shards on the node are marked read-only (flood-stage watermark), effectively freezing writes. The absolute alert catches situations where you’re already close to a watermark. The projection alert catches runaway index growth - a new data source ingesting at 3 times the expected rate - before it trips a watermark and starts forcing relocation.

Zabbix implementation

  • Source: $.nodes.*.fs.total.available_in_bytes from _nodes/stats/fs.
  • Dependent item per node: Available bytes and total bytes; derived calculated item for percentage free.
  • Trigger (absolute): last(/host/opensearch.fs.pct_free) < 15 - High severity. Set a second trigger at < 8 for Disaster.
  • Trigger (projected): Zabbix’s built-in timeleft() function handles this cleanly:
timeleft(/host/opensearch.fs.available_bytes,1h,0) < 86400

Fires when linear regression on the last hour of data projects reaching zero within 24 hours. Adjust the history window (1h) if your ingestion is bursty.
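If you want to sanity-check the projection outside Zabbix, the idea is an ordinary least-squares line through the recent samples, extended until it crosses the threshold. The sketch below is my own approximation of that idea, not Zabbix’s implementation: it measures from the last sample rather than from “now”, and the sample data (500 GiB free, shrinking by 1 GiB per minute) is invented.

```python
def time_left(samples, threshold=0.0):
    """Seconds from the last sample until a least-squares line hits threshold.

    samples: list of (timestamp_seconds, value) pairs.
    Returns None when the fitted line does not reach the threshold in the future.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope == 0:
        return None
    intercept = mean_v - slope * mean_t
    t_hit = (threshold - intercept) / slope
    remaining = t_hit - samples[-1][0]
    return remaining if remaining > 0 else None


# Hypothetical hour of data: one sample per minute, free space in GiB.
samples = [(t, 500.0 - t / 60.0) for t in range(0, 3601, 60)]
remaining = time_left(samples)
print(f"projected seconds until empty: {remaining:.0f}")
print("alert" if remaining is not None and remaining < 86400 else "ok")
```

With this data the line reaches zero roughly 26400 seconds (a bit over 7 hours) after the last sample, so the 24-hour trigger would fire even though 440 GiB is still free.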

4. Old-Gen GC Time as a Fraction of Wall Clock

Why this matters

Individual GC metrics - “collection count” or “last collection duration” - miss the compound picture. A node running 200 short old-gen GCs per minute is just as degraded as one running 2 long ones, but the duration-per-collection metric looks very different in each case. The better signal is: what fraction of wall-clock time is this node spending stopped for GC? Compute Δgc.collectors.old.collection_time_in_millis / scrape_interval_ms. Above ~5% means the node is materially impaired; above ~15% means it’s effectively in GC spiral and should be investigated immediately.

Zabbix implementation

  • Dependent item per node: Extract $.nodes.*.jvm.gc.collectors.old.collection_time_in_millis. Store it as numeric float rather than unsigned, because the preprocessing below yields fractional values.
  • Preprocessing: Apply “Change per second” to get ms/s of GC time. Multiply by 0.1 to express as a percentage of wall clock (since 1000ms/s = 100%).
  • Discard negative step: Add a custom JS preprocessing step after “Change per second”: return value < 0 ? null : value; - this silently drops restart-reset artefacts.
  • Trigger: avg(/host/opensearch.gc.old.pct,5m) > 5 - High. Tune to > 15 for Disaster.
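The arithmetic behind those preprocessing steps is small enough to check by hand. This Python sketch mirrors it under my own naming: take two samples of the cumulative counter, turn the delta into ms of GC per second of wall clock, multiply by 0.1 for a percentage, and drop the sample if the counter went backwards.

```python
def gc_wall_clock_pct(prev, curr):
    """Percentage of wall-clock time spent in old-gen GC between two samples.

    prev, curr: (timestamp_seconds, collection_time_in_millis) pairs.
    Returns None when the counter went backwards (node restart) or when
    no time elapsed between the samples.
    """
    (t0, ms0), (t1, ms1) = prev, curr
    elapsed = t1 - t0
    delta_ms = ms1 - ms0
    if elapsed <= 0 or delta_ms < 0:
        return None
    ms_per_second = delta_ms / elapsed  # what "Change per second" produces
    return ms_per_second * 0.1          # 1000 ms/s == 100%


# Hypothetical samples 30 s apart: 1500 ms of old-gen GC in that window.
print(gc_wall_clock_pct((0, 120000), (30, 121500)))
# Counter reset after a restart: the sample is dropped rather than going negative.
print(gc_wall_clock_pct((30, 121500), (60, 200)))
```

In the first call, 1500 ms of GC in 30 s is 50 ms/s, i.e. 5% of wall clock, right on the High threshold.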

5. Young-Gen + Old-Gen GC Frequency Trending Together

Why this matters

Young-gen GC alone is expected and healthy - it’s the JVM doing normal object lifecycle management. What signals heap trouble is when young-gen collection rate and old-gen collection rate both trend upward together. This is the canonical signature of either a heap size that’s too small for the workload, or a memory leak promoting live objects into old-gen faster than GC can reclaim them. Either condition will eventually cause the node to spiral into GC thrash. Catching both rates climbing in the same window gives you 10-30 minutes of warning before the node becomes unresponsive.

Zabbix implementation

  • Items per node: Two “Change per second” items - one on gc.collectors.young.collection_count, one on gc.collectors.old.collection_count.
  • Individual triggers: Alert if young-gen rate exceeds a baseline (tune per cluster) or old-gen rate exceeds ~0.5/min - these are independent useful alerts.
  • Compound trigger: Use an expression that requires both to be elevated simultaneously:
avg(/host/opensearch.gc.young.rate,10m) > {$GC_YOUNG_THRESHOLD}
and
avg(/host/opensearch.gc.old.rate,10m) > {$GC_OLD_THRESHOLD}

Use macros for thresholds so they can be tuned per host without modifying template expressions.

6. Open File Descriptors Approaching Limit

Why this matters

OpenSearch is file-descriptor-intensive: each shard segment, network connection, and log file consumes one. The commonly configured limit of 65535 sounds generous until a shard explosion (too many small indices, a misconfigured ISM policy, or a runaway dynamic template) multiplies open files rapidly. Hitting the FD limit causes cryptic errors - failed connections, shard assignment failures, index corruption - that are difficult to trace back to their source. This check is unglamorous but reliably catches both misconfiguration and organic growth before they cause an outage.

Zabbix implementation

  • Source: $.nodes.*.process.open_file_descriptors and $.nodes.*.process.max_file_descriptors from _nodes/stats/process.
  • Calculated item per node: open / max * 100 for percentage utilisation.
  • Triggers:
    • last(/host/opensearch.fd.pct) > 70 - Warning
    • last(/host/opensearch.fd.pct) > 85 - High

7. Sustained Per-Node CPU

Why this matters

In a well-balanced 25-node cluster, individual nodes shouldn’t sustain high CPU under normal load - OpenSearch distributes query and indexing work across shards. When a single node sits at >80% CPU sustained, the likely causes are: a hot shard (query traffic concentrated on shards only that node holds), a mapping explosion (wildcard or overly dynamic mappings generating enormous field counts), or a runaway aggregation query. None of these will resolve on their own, and all of them will cascade - the hot node slows, its response times increase, the coordinator retries, making it hotter. A per-node sustained CPU alert gives you the signal to identify and redirect or kill the offending shard/query before the node falls out.

Zabbix implementation

  • Source: $.nodes.*.process.cpu.percent from _nodes/stats/process.
  • Trigger: avg(/host/opensearch.process.cpu.pct,5m) > 80 - High.
  • Use a 5-minute average, not last(), to avoid firing on brief spikes during segment merges or snapshot operations.

8. Admission Control Signals (resource_usage_stats)

Why this matters

OpenSearch’s admission control subsystem uses CPU utilisation, IO utilisation, and memory utilisation metrics to decide when to start rejecting requests to protect cluster stability. These are the same signals OpenSearch itself acts on - when they’re elevated, the cluster is actively considering shedding traffic. Alerting on them gives you a warning window before clients start seeing 503s from admission control rejection. This is particularly important for CPU and IO, which can spike faster than GC metrics surface.

Zabbix implementation

  • Master item: HTTP agent polling GET /_nodes/stats/resource_usage_stats.
  • LLD + dependent items: Same pattern as heap - discover nodes, extract cpu_utilization_percent, io_usage_percent, memory_utilization_percent per node.
  • Aggregate trigger: Use max_foreach or count_foreach to fire when multiple nodes show elevated signals simultaneously - a single node spiking during a merge is less interesting than three nodes simultaneously elevated.
  • Thresholds: These metrics are already percentage-of-capacity figures, so 70/85% thresholds are reasonable starting points, tuned against observed baseline.

9. Search Pipeline Mean Latency and Failure Rate

Why this matters

If you’re using OpenSearch search pipelines, failures and regressions in pipeline processing manifest as generalised query slowness that is difficult to distinguish from shard-level performance issues. The _nodes/stats/search_pipeline endpoint exposes cumulative time_in_millis and count per pipeline, plus a failed counter. Monitoring derived mean latency and failure rate catches pipeline regressions early and isolates the cause - an updated processor, a new pipeline config, or a downstream service the pipeline calls.

A note on p95: the endpoint does not expose histogram buckets, so true percentile latency (p95, p99) is not derivable from this data. Zabbix 7’s bucket_percentile() function requires actual histogram bucket items to operate on, which OpenSearch doesn’t provide here. Mean latency and failure rate are what’s actually available; don’t attempt to infer p95 from the cumulative counters.

Zabbix implementation

  • LLD: Discover pipeline names from the response.
  • Dependent items per pipeline: Two “Change per second” items - on total_request.time_in_millis and total_request.count. Derived calculated item: latency_rate / count_rate for mean latency (handle divide-by-zero with a JS preprocessing step).
  • Failure rate item: “Change per second” on total_request.failed.
  • Triggers:
    • Mean latency increasing beyond baseline: avg(/host/opensearch.pipeline.latency_ms,10m) > {$PIPELINE_LATENCY_THRESHOLD}
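    • Any failures: last(/host/opensearch.pipeline.failures) > 0 - Warning (the item is already a per-second rate, so any non-zero value means new failures); avg() of the same item above 1/min (about 0.017 per second) - High.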
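The mean-latency derivation is worth spelling out, because the two “Change per second” rates share the same elapsed time and it cancels: (Δtime/Δt) / (Δcount/Δt) = Δtime/Δcount. A minimal Python sketch, with invented sample numbers and my own function name, including the divide-by-zero guard:

```python
def pipeline_mean_latency_ms(prev, curr):
    """Mean milliseconds per request between two cumulative samples.

    prev, curr: dicts with cumulative 'time_in_millis' and 'count'.
    Returns None when no requests ran in the interval (avoids dividing by
    zero) or when either counter went backwards (node restart).
    """
    d_time = curr["time_in_millis"] - prev["time_in_millis"]
    d_count = curr["count"] - prev["count"]
    if d_time < 0 or d_count <= 0:
        return None
    return d_time / d_count


# Hypothetical samples: 20 requests took 600 ms in total during the interval.
busy = pipeline_mean_latency_ms(
    {"time_in_millis": 9000, "count": 300},
    {"time_in_millis": 9600, "count": 320},
)
# Idle interval: counters did not move, so there is no latency to report.
idle = pipeline_mean_latency_ms(
    {"time_in_millis": 9600, "count": 320},
    {"time_in_millis": 9600, "count": 320},
)
print(busy, idle)
```

Returning nothing for an idle interval, rather than zero, keeps quiet periods from dragging the 10-minute average down and masking a real regression.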

10. Node Count Stability

Why this matters

On a 25-node cluster, losing a node is not immediately obvious. OpenSearch will attempt to relocate the absent node’s shards, cluster status may stay yellow rather than red, and depending on replica configuration, no data loss occurs. The operational risk is that this can go unnoticed for hours or days - and during that window, the cluster is running with degraded redundancy. A second node loss during that period could be catastrophic. This check is trivial to implement and closes the gap between “the cluster is still serving requests” and “the cluster is operating as expected.”

Zabbix implementation

  • Source: $.nodes.count.total from _cluster/stats (same master item as check #1).
  • Dependent item: Single JSONPath extraction, numeric unsigned.
  • Trigger: change(/host/opensearch.nodes.count) <> 0 - High.
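  • Optionally set the expected count as a host macro {$OPENSEARCH_NODE_COUNT} and trigger on last() <> {$OPENSEARCH_NODE_COUNT}. Unlike change(), this stays in problem state until the count is back to the expected value, and it also catches a newly added node (if that’s operationally significant); use < instead of <> if only node loss matters.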
  • Add a nodata() trigger, e.g. nodata(/host/opensearch.nodes.count,5m) = 1, to catch scenarios where the entire _cluster/stats endpoint becomes unreachable.

Putting It Together

#  | Metric | API Endpoint | Zabbix Pattern
1  | Unassigned shard trend | _cluster/stats | HTTP agent - dependent item - change() trigger
2  | Heap % on N nodes | _nodes/stats/jvm | LLD - per-node items - count_foreach() aggregate
3  | Disk absolute + projected | _nodes/stats/fs | LLD - per-node items - last() + timeleft()
4  | Old-gen GC % wall clock | _nodes/stats/jvm | LLD - change/s preprocessing - avg trigger
5  | Young + old GC trending | _nodes/stats/jvm | LLD - two rate items - compound AND trigger
6  | File descriptor % | _nodes/stats/process | LLD - calculated item - threshold trigger
7  | Sustained per-node CPU | _nodes/stats/process | LLD - avg(5m) trigger
8  | Admission control signals | _nodes/stats/resource_usage_stats | LLD - per-node items - max_foreach()
9  | Pipeline latency + failures | _nodes/stats/search_pipeline | LLD - rate preprocessing - compound triggers
10 | Node count stability | _cluster/stats | Dependent item - change() <> 0 trigger

The master HTTP agent items cover six distinct endpoints. Everything else derives from those six requests. At a 30-second scrape interval on a 25-node cluster, that’s six HTTP requests per interval to OpenSearch - manageable overhead with no per-node polling.

Debian Lighttpd does infinite redirect loop and fails to connect

Just imagine you’re running a blog that requires zero maintenance, and one day you access it and it doesn’t load!

You try Firefox and then Chrome and finally Edge (the new IE)

You notice that Firefox and Chrome seem to loop and then finally fail. You notice that Edge works...

You notice that cURL works.

Things are working, but they aren’t.

Finally, you notice Firefox is trying to do TLS 1.3! Interesting. How do you disable that on Debian 9 with Lighttpd? You can’t!

What’s the fix?

In lighttpd.conf, in your SSL section, add:

ssl.disable-client-renegotiation = "disable"

ssl.disable-client-renegotiation exists because of a TLS renegotiation bug back in 2009. That bug has long been patched in newer versions of OpenSSL, so client renegotiation is safe to turn back on.

Disabling this setting allowed you to find the answer to your troubles :-)

Squid HTTPS interception and filtering without client certificates

I had a requirement to filter (all) web traffic on a few servers. This is typically easy with Squid and its transparent proxy function. Where it gets difficult is filtering domains for HTTPS traffic.
I don’t want to SSL-intercept the traffic and I don’t want to install CA certificates on the clients; I only want to filter by domain, based on a whitelist of sites the servers may access. This is how it is done:

yum install squid
# I used squid 3.5.20

/usr/lib64/squid/ssl_crtd -c -s /var/lib/ssl_db
chown -R squid.squid /var/lib/ssl_db

mkdir /etc/squid/ssl_cert/
chown -R squid.squid /etc/squid/ssl_cert/
cd /etc/squid/ssl_cert
# 2048-bit key: 1024-bit RSA is too weak and is rejected by many current TLS stacks
openssl req -new -newkey rsa:2048 -days 1365 -nodes -x509 -keyout myca.pem -out myca.pem

echo "www.google.com" > /etc/squid/whitelist
chmod 640 /etc/squid/whitelist
chown root:squid /etc/squid/whitelist

/etc/squid/squid.conf:

acl localnet src 10.0.0.0/8	# RFC1918 possible internal network
acl localnet src 127.0.0.1/32	# loopback (needed for locally DNAT'ed traffic)
acl localnet src 172.16.0.0/12	# RFC1918 possible internal network
acl localnet src 192.168.0.0/16	# RFC1918 possible internal network
acl localnet src fc00::/7       # RFC 4193 local private network range
acl localnet src fe80::/10      # RFC 4291 link-local (directly plugged) machines

acl SSL_ports port 443
acl Safe_ports port 80		# http
acl Safe_ports port 21		# ftp
acl Safe_ports port 443		# https
acl Safe_ports port 70		# gopher
acl Safe_ports port 210		# wais
acl Safe_ports port 1025-65535	# unregistered ports
acl Safe_ports port 280		# http-mgmt
acl Safe_ports port 488		# gss-http
acl Safe_ports port 591		# filemaker
acl Safe_ports port 777		# multiling http
acl CONNECT method CONNECT

http_access deny !Safe_ports

http_access deny CONNECT !SSL_ports

http_access allow localhost manager
http_access deny manager

acl step1 at_step SslBump1
acl whitelist_ssl ssl::server_name "/etc/squid/whitelist"
acl whitelist dstdomain "/etc/squid/whitelist"
acl port_80 port 80
acl http proto http

ssl_bump peek step1
ssl_bump splice whitelist_ssl
ssl_bump terminate all !whitelist_ssl

http_access deny http port_80 localnet !whitelist
http_access allow localnet
http_access deny all

https_port 3127 intercept ssl-bump generate-host-certificates=on dynamic_cert_mem_cache_size=4MB cert=/etc/squid/ssl_cert/myca.pem key=/etc/squid/ssl_cert/myca.pem
http_port 3128 transparent

coredump_dir /var/spool/squid

refresh_pattern ^ftp:		1440	20%	10080
refresh_pattern ^gopher:	1440	0%	1440
refresh_pattern -i (/cgi-bin/|\?) 0	0%	0
refresh_pattern .		0	20%	4320

# Test it with:

iptables -m owner --uid-owner cm -t nat -A OUTPUT -p tcp --dport 80 -j DNAT --to 127.0.0.1:3128
iptables -m owner --uid-owner cm -t nat -A OUTPUT -p tcp --dport 443 -j DNAT --to 127.0.0.1:3127

# Closing notes and thoughts

Around this section here:
http_access deny http port_80 localnet !whitelist
http_access allow localnet
http_access deny all

It looks a bit funny because we ‘allow localnet’, which would typically give our clients open access. However, look at these rules first:

ssl_bump terminate all !whitelist_ssl
http_access deny http port_80 localnet !whitelist

They filter out all sites other than the whitelist with an explicit ‘deny’ or SSL ‘terminate’.
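To reason about what ends up allowed, it can help to model the decision in a few lines of Python. This is only a rough mental model of the rules above, not how Squid is implemented. It assumes whitelist entries follow dstdomain-style matching (an entry such as www.google.com matches only that exact name, while an entry with a leading dot such as .google.com also matches subdomains), and it assumes a connection without a usable server name gets terminated. All function names are made up for illustration.

```python
def matches(entry, server_name):
    """dstdomain-style match (assumed): a leading dot also covers subdomains."""
    entry = entry.lower()
    name = server_name.lower().rstrip(".")
    if entry.startswith("."):
        return name == entry[1:] or name.endswith(entry)
    return name == entry


def decide_https(sni, whitelist):
    """Model of: peek at step1, splice whitelisted names, terminate the rest."""
    if sni and any(matches(entry, sni) for entry in whitelist):
        return "splice"
    return "terminate"


def decide_http(host, whitelist):
    """Model of: deny port-80 HTTP from localnet unless the host is whitelisted."""
    if any(matches(entry, host) for entry in whitelist):
        return "allow"
    return "deny"


whitelist = ["www.google.com"]
print(decide_https("www.google.com", whitelist))   # whitelisted name
print(decide_https("mail.google.com", whitelist))  # not an exact match
print(decide_http("example.org", whitelist))       # plain HTTP, not whitelisted
```

The practical takeaway is that an exact entry like www.google.com does not cover sibling hosts; if you want a whole domain, list it with a leading dot and test it.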

Also, a proxy-aware application will not work with the above configuration, because the proxy is configured in transparent / intercept mode ONLY. This is likely due to not having a normal http_port directive. That is good for me, as it minimizes the avenues for abuse.

As a final, final step, you need to configure your edge (or local) firewall to destination-NAT web traffic to the two Squid ports.

Block network traffic based on UID / User and GID / Group

I just found out that you can apply different iptables rules based on UID and GID.

Just check that your kernel / iptables supports the module:

iptables -m owner --help

Near the bottom of the output you should see something like:

owner match options:
[!] --uid-owner userid[-userid]      Match local UID
[!] --gid-owner groupid[-groupid]    Match local GID
[!] --socket-exists                  Match if socket exists

Then make a rule as required. E.g. user ‘cm’ gets their web traffic transparently proxied via Squid. Note that the OUTPUT chain only accepts an outgoing interface match (-o), not -i:

iptables -m owner --uid-owner cm -t nat -A OUTPUT -o eth0 -p tcp --dport 80 -j DNAT --to 127.0.0.1:3128

Pretty cool!

sshd without-password vs prohibit-password

Upgrading a server from Debian 8 to Debian 9 - I noticed in /etc/ssh/sshd_config that ‘PermitRootLogin’ had the argument ‘prohibit-password’. Having not seen that before I wondered what the difference was between that and ‘without-password’.
Turns out they mean and do the same thing - but ‘prohibit-password’ was introduced to be less ambiguous. So there you have it!

Check out the release notes here for proof :-)