Wednesday, May 27, 2026

OpenSearch pre-emptive monitoring with Zabbix

by cam

10 Pre-emptive Health Checks for an OpenSearch Cluster

Why each metric matters and how to implement it in Zabbix 7

The common failure mode in cluster monitoring is alerting on symptoms rather than causes. status: red tells you something already broke; it doesn’t give you the 30-minute window to act before users notice. The checks below are biased toward leading indicators - metrics that predict problems - and are ordered by API endpoint to make implementation cleaner.

All checks follow the same Zabbix architecture: one master HTTP agent item per endpoint, an LLD rule to discover nodes from the response, per-node dependent items with JSONPath preprocessing, and aggregate calculated items built on *_foreach functions where you need fleet-wide views (triggers then read the aggregate item). This avoids polling the cluster 25 times per scrape interval and keeps all node metrics in sync.

—

Architecture pattern (read this first)

Before building any individual check, establish this structure once:

Master items - one HTTP agent item per endpoint (e.g., _cluster/stats, _nodes/stats/jvm). These do the actual HTTP request.
LLD rules - parse the master item’s JSON to discover node IDs/names and create per-node dependent items automatically.
Dependent items - extract a single field per item using JSONPath preprocessing. Zero additional HTTP requests.
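Triggers - operate on dependent items directly, or on a calculated item that aggregates the discovered set with *_foreach functions (foreach functions are evaluated in calculated items, not in trigger expressions).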
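
To make the master/LLD/dependent split concrete, here is a minimal Python sketch of the same data flow outside Zabbix. The sample payload, node IDs and node names are made up for illustration; only the general shape (a "nodes" object keyed by node ID) follows the _nodes/stats response.

```python
import json

# Hypothetical, trimmed response from GET /_nodes/stats/jvm (two nodes only).
SAMPLE = json.loads("""
{"nodes": {
  "aaa111": {"name": "os-node-1", "jvm": {"mem": {"heap_used_percent": 61}}},
  "bbb222": {"name": "os-node-2", "jvm": {"mem": {"heap_used_percent": 88}}}
}}
""")


def discover_nodes(master):
    """Mimic the LLD rule: one discovery row per node ID in the payload."""
    return [{"{#NODE_ID}": node_id, "{#NODE_NAME}": node["name"]}
            for node_id, node in master["nodes"].items()]


def dependent_value(master, node_id, path):
    """Mimic a dependent item: walk a dotted path under one node, no new HTTP call."""
    value = master["nodes"][node_id]
    for key in path.split("."):
        value = value[key]
    return value


rows = discover_nodes(SAMPLE)
heap = {row["{#NODE_NAME}"]: dependent_value(SAMPLE, row["{#NODE_ID}"],
                                             "jvm.mem.heap_used_percent")
        for row in rows}
print(heap)
```

The point of the sketch is that every per-node value comes out of one already-fetched document, which is what keeps the node metrics in sync with each other.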

Two preprocessing gotchas that will bite you:

Counter resets: GC collection_time_in_millis, collection_count, etc. are cumulative and reset when a node restarts. A naive delta (current minus previous) produces a large negative spike on restart. Zabbix’s “Change per second” step is documented to discard a sample that is lower than the previous one, leaving a one-interval gap instead; prefer it, and confirm that behaviour on your version before relying on it.

*_foreach syntax: If all OpenSearch nodes are discovered under a single Zabbix host via LLD, use item key patterns in your foreach expression. If each node is its own Zabbix host, use host group patterns. The syntax differs and the wrong one produces silently empty results.

—

1. Unassigned Shard Trend

Why this matters

Cluster status (green/yellow/red) is a lagging, coarse indicator - by the time it flips, shards have already failed to assign. More useful is watching unassigned_shards from _cluster/health over time (_cluster/stats reports shard totals but no unassigned count). A slow, steady climb in unassigned shard count - even while cluster status remains yellow - often precedes a node falling out of the cluster entirely. Catching the trend early gives you time to investigate disk pressure, heap exhaustion, or network partitioning before the situation becomes a full red.

Zabbix implementation

Master item: HTTP agent polling GET /_cluster/health, stored as text.
Dependent item: JSONPath $.unassigned_shards, numeric unsigned.
Trigger (absolute): Fire at Warning if unassigned count exceeds your shard-per-node baseline (e.g., >10 for a 25-node cluster).
Trigger (trend): Use change() to alert if the count is increasing across consecutive checks. Combine with avg(/host/opensearch.shards.unassigned,30m) to reduce noise from transient relocation.
Supplement: A second dependent item on $.status with a nodata() trigger catches complete API unavailability.

—

2. Heap Pressure Across the Fleet

Why this matters

Averaging heap utilisation across 25 nodes hides the problem. Three nodes sitting at 87% heap while the rest idle at 55% is a critical situation - those three nodes are close to triggering aggressive GC, potential circuit breaker trips, or OOM kills - but a cluster-average metric reads as unremarkable. The alert needs to fire on the number of nodes above threshold, not the mean.

Zabbix implementation

Master item: HTTP agent polling GET /_nodes/stats/jvm.
LLD rule: Discover node IDs from $.nodes.* keys, create dependent items per node.
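Dependent item per node: Extract $.nodes.*.jvm.mem.heap_used_percent directly, or extract heap_used_in_bytes and heap_max_in_bytes and derive heap_used / heap_max * 100 in a calculated item.
Aggregate item (Zabbix 7 *_foreach, defined as a calculated item):

count(last_foreach(/*/opensearch.heap.pct?[group="opensearch"]),"gt",85)

Trigger: last(/host/opensearch.heap.nodes_over_85) >= 3, where opensearch.heap.nodes_over_85 is the key given to the calculated item above (a name chosen here for illustration).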

Adjust the node count threshold to suit your fleet size and SLA.

Severity: Warning at >= 3 nodes above 85%, High at >= 5 nodes or any single node above 92%.
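
A quick Python sketch of why the count matters more than the mean, using the 25-node scenario described above (three nodes at 87%, the rest at 55%). The node names are invented for the example.

```python
from statistics import mean


def nodes_over(heap_pct_by_node, threshold):
    """Return the sorted names of nodes whose heap % is strictly above threshold."""
    return sorted(name for name, pct in heap_pct_by_node.items() if pct > threshold)


# 22 quiet nodes plus 3 hot ones (hypothetical names).
fleet = {f"node-{i:02d}": 55.0 for i in range(1, 23)}
fleet.update({"node-23": 87.0, "node-24": 87.0, "node-25": 87.0})

fleet_mean = mean(fleet.values())
hot = nodes_over(fleet, 85)
print(f"mean={fleet_mean:.2f}% hot_nodes={len(hot)}")
```

The mean lands just under 59%, which no sensible threshold would flag, while the count of nodes over 85% is already at the Warning level.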

—

3. Disk Space - Absolute and Projected

Why this matters

OpenSearch’s low/high/flood watermarks mean that disk exhaustion doesn’t just cause data loss - it first stops new shards being allocated to the node (low watermark), then triggers automatic shard relocation away from it (high watermark), then puts every index with a shard on that node into a read-only block (flood-stage watermark), effectively freezing writes. The absolute alert catches situations where you’re already close to a watermark. The projection alert catches runaway index growth - a new data source ingesting at 3 times the expected rate - before it trips a watermark and starts forcing relocation.

Zabbix implementation

Source: $.nodes.*.fs.total.available_in_bytes from _nodes/stats/fs.
Dependent item per node: Available bytes and total bytes; derived calculated item for percentage free.
Trigger (absolute): last(/host/opensearch.fs.pct_free) < 15 - High severity. Set a second trigger at < 8 for Disaster.
Trigger (projected): Zabbix’s built-in timeleft() function handles this cleanly:

timeleft(/host/opensearch.fs.available_bytes,1h,0) < 86400

Fires when linear regression on the last hour of data projects reaching zero within 24 hours. Adjust the history window (1h) if your ingestion is bursty.
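
For intuition, the projection can be sketched in a few lines of Python. This is an ordinary least-squares line through the samples, not Zabbix’s actual implementation, and the sample data (100 GiB free, shrinking by 1 GiB per minute) is invented.

```python
def time_left(samples, threshold=0.0):
    """samples: list of (unix_seconds, value) pairs, oldest first.

    Returns the seconds from the last sample until the least-squares line
    reaches threshold, or None if the line is flat, the crossing is in the
    past, or there are too few samples to fit.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope == 0:
        return None
    intercept = mean_v - slope * mean_t
    t_cross = (threshold - intercept) / slope
    remaining = t_cross - samples[-1][0]
    return remaining if remaining > 0 else None


# Six samples, one per minute: free space in GiB.
samples = [(t, 100.0 - t / 60.0) for t in range(0, 301, 60)]
remaining = time_left(samples)
print(remaining, remaining is not None and remaining < 86400)
```

With perfectly linear data the answer is exact; with bursty ingestion the fitted slope swings with the window, which is why the history window is the main tuning knob.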

—

4. Old-Gen GC Time as a Fraction of Wall Clock

Why this matters

Individual GC metrics - “collection count” or “last collection duration” - miss the compound picture. A node running 200 short old-gen GCs per minute is just as degraded as one running 2 long ones, but the duration-per-collection metric looks very different in each case. The better signal is: what fraction of wall-clock time is this node spending stopped for GC? Compute Δgc.collectors.old.collection_time_in_millis / scrape_interval_ms. Above ~5% means the node is materially impaired; above ~15% means it’s effectively in GC spiral and should be investigated immediately.

Zabbix implementation

Dependent item per node: Extract $.nodes.*.jvm.gc.collectors.old.collection_time_in_millis, stored as numeric (float) because the later steps produce fractions.
Preprocessing: Apply “Change per second” to get ms/s of GC time. Multiply by 0.1 to express as a percentage of wall clock (since 1000ms/s = 100%).
Restart artefacts: If negative values still appear after “Change per second”, add a JavaScript step that throws on value < 0 and set “Custom on fail” to “Discard value”, so restart resets are dropped instead of stored.
Trigger: avg(/host/opensearch.gc.old.pct,5m) > 5 - High. Add a second trigger at > 15 for Disaster.
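
The arithmetic behind the percentage, as a small Python sketch. The counter readings are invented; the function only mirrors the “rate, then multiply by 0.1” steps and drops samples where the counter went backwards.

```python
def gc_wall_clock_pct(prev, curr):
    """prev/curr: (unix_seconds, cumulative collection_time_in_millis).

    Returns the percentage of wall-clock time spent in old-gen GC between the
    two samples, or None if the counter went backwards (node restart) or no
    time elapsed.
    """
    (t0, ms0), (t1, ms1) = prev, curr
    if t1 <= t0 or ms1 < ms0:
        return None
    ms_per_second = (ms1 - ms0) / (t1 - t0)
    return ms_per_second * 0.1  # 1000 ms/s corresponds to 100 %


# 1800 ms of GC during a 30 s scrape interval.
busy = gc_wall_clock_pct((0, 120_000), (30, 121_800))
# Counter dropped after a restart: no value rather than a negative spike.
after_restart = gc_wall_clock_pct((30, 121_800), (60, 400))
print(busy, after_restart)
```

The first pair works out to 60 ms/s, i.e. 6% of wall clock, which is just over the High threshold used above.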

—

5. Young-Gen + Old-Gen GC Frequency Trending Together

Why this matters

Young-gen GC alone is expected and healthy - it’s the JVM doing normal object lifecycle management. What signals heap trouble is when young-gen collection rate and old-gen collection rate both trend upward together. This is the canonical signature of either a heap size that’s too small for the workload, or a memory leak promoting live objects into old-gen faster than GC can reclaim them. Either condition will eventually cause the node to spiral into GC thrash. Catching both rates climbing in the same window gives you 10-30 minutes of warning before the node becomes unresponsive.

Zabbix implementation

Items per node: Two “Change per second” items - one on gc.collectors.young.collection_count, one on gc.collectors.old.collection_count.
Individual triggers: Alert if young-gen rate exceeds a baseline (tune per cluster) or old-gen rate exceeds ~0.5/min (about 0.0083/s, since “Change per second” items store per-second rates) - each is a useful alert on its own.
Compound trigger: Use an expression that requires both to be elevated simultaneously:

avg(/host/opensearch.gc.young.rate,10m) > {$GC_YOUNG_THRESHOLD}
and
avg(/host/opensearch.gc.old.rate,10m) > {$GC_OLD_THRESHOLD}

Use macros for thresholds so they can be tuned per host without modifying template expressions.
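
The compound condition is simple enough to sketch in Python. The rate samples and thresholds below are illustrative only; the thresholds play the role of the two macros.

```python
from statistics import mean


def gc_trouble(young_rates, old_rates, young_threshold, old_threshold):
    """True only when BOTH window averages exceed their thresholds."""
    return mean(young_rates) > young_threshold and mean(old_rates) > old_threshold


# Per-second collection rates sampled over one window (made-up numbers).
both_climbing = gc_trouble([0.8, 0.9, 1.1], [0.01, 0.02, 0.03], 0.5, 0.0083)
young_only = gc_trouble([0.8, 0.9, 1.1], [0.0, 0.0, 0.0], 0.5, 0.0083)
print(both_climbing, young_only)
```

A busy young generation on its own stays quiet here, which is the behaviour the compound trigger is after.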

—

6. Open File Descriptors Approaching Limit

Why this matters

OpenSearch is file-descriptor-intensive: segment files, network connections, and log files each consume one. A limit of 65535 (a commonly configured value; the stock OS default is usually far lower) sounds generous until a shard explosion (too many small indices, a misconfigured ISM policy, or a runaway dynamic template) multiplies open files rapidly. Hitting the FD limit causes cryptic errors - failed connections, shard assignment failures, index corruption - that are difficult to trace back to their source. This check is unglamorous but reliably catches both misconfiguration and organic growth before they cause an outage.

Zabbix implementation

Source: $.nodes.*.process.open_file_descriptors and $.nodes.*.process.max_file_descriptors from _nodes/stats/process.
Calculated item per node: open / max * 100 for percentage utilisation.
Triggers:
- last(/host/opensearch.fd.pct) > 70 - Warning
- last(/host/opensearch.fd.pct) > 85 - High

—

7. Sustained Per-Node CPU

Why this matters

In a well-balanced 25-node cluster, individual nodes shouldn’t sustain high CPU under normal load - OpenSearch distributes query and indexing work across shards. When a single node sits at >80% CPU sustained, the likely causes are: a hot shard (query traffic concentrated on shards only that node holds), a mapping explosion (wildcard or overly dynamic mappings generating enormous field counts), or a runaway aggregation query. None of these will resolve on their own, and all of them will cascade - the hot node slows, its response times increase, the coordinator retries, making it hotter. A per-node sustained CPU alert gives you the signal to identify and redirect or kill the offending shard/query before the node falls out.

Zabbix implementation

Source: $.nodes.*.process.cpu.percent from _nodes/stats/process.
Trigger: avg(/host/opensearch.process.cpu.pct,5m) > 80 - High.
Use a 5-minute average, not last(), to avoid firing on brief spikes during segment merges or snapshot operations.

—

8. Admission Control Signals (resource_usage_stats)

Why this matters

OpenSearch’s admission control subsystem uses CPU utilisation, IO utilisation, and memory utilisation metrics to decide when to start rejecting requests to protect cluster stability. These are the same signals OpenSearch itself acts on - when they’re elevated, the cluster is actively considering shedding traffic. Alerting on them gives you a warning window before clients start seeing HTTP 429 (Too Many Requests) rejections from admission control. This is particularly important for CPU and IO, which can spike faster than GC metrics surface.

Zabbix implementation

Master item: HTTP agent polling GET /_nodes/stats/resource_usage_stats.
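LLD + dependent items: Same pattern as heap - discover nodes, extract cpu_utilization_percent, io_usage_percent, memory_utilization_percent per node. Field names and nesting vary between OpenSearch versions, so check them against a live response before writing the JSONPath.
Aggregate item: In a calculated item, use max(last_foreach(...)) for the worst node or count(last_foreach(...),"gt",70) for the number of nodes above a threshold, then trigger on that item - a single node spiking during a merge is less interesting than three nodes simultaneously elevated. (count_foreach counts values per item over a period, not nodes.)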
Thresholds: These metrics are already percentage-of-capacity figures, so 70/85% thresholds are reasonable starting points, tuned against observed baseline.

—

9. Search Pipeline Mean Latency and Failure Rate

Why this matters

If you’re using OpenSearch search pipelines, failures and regressions in pipeline processing manifest as generalised query slowness that is difficult to distinguish from shard-level performance issues. The _nodes/stats/search_pipeline endpoint exposes cumulative time_in_millis and count per pipeline, plus a failed counter. Monitoring derived mean latency and failure rate catches pipeline regressions early and isolates the cause - an updated processor, a new pipeline config, or a downstream service the pipeline calls.

A note on p95: the endpoint does not expose histogram buckets, so true percentile latency (p95, p99) is not derivable from this data. Zabbix 7’s bucket_percentile() function requires actual histogram bucket items to operate on, which OpenSearch doesn’t provide here. Mean latency and failure rate are what’s actually available; don’t attempt to infer p95 from the cumulative counters.

Zabbix implementation

LLD: Discover pipeline names from the response.
Dependent items per pipeline: Two “Change per second” items - on total_request.time_in_millis and total_request.count. Derived calculated item: latency_rate / count_rate for mean latency. When no requests arrive in an interval the denominator is zero, so guard the division or expect gaps in the item while the pipeline is idle.
Failure rate item: “Change per second” on total_request.failed.
Triggers:
- Mean latency increasing beyond baseline: avg(/host/opensearch.pipeline.latency_ms,10m) > {$PIPELINE_LATENCY_THRESHOLD}
- Any failures: last(/host/opensearch.pipeline.failures) > 0 - Warning; avg(/host/opensearch.pipeline.failures,10m) > 0.0167 (about 1/min, since the item stores a per-second rate) - High.
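
As a sanity check on the derived metrics, here is a Python sketch that turns two readings of the cumulative counters into a mean latency and a failure rate. The dictionary keys and the numbers are assumptions for illustration, not the exact field layout of the endpoint.

```python
def pipeline_rates(prev, curr, interval_s):
    """prev/curr: dicts of cumulative 'time_in_millis', 'count' and 'failed'.

    Returns mean latency per request (ms) and failures per minute for the
    interval, or None on a counter reset or a non-positive interval.
    Mean latency is None when no requests completed in the interval.
    """
    deltas = {k: curr[k] - prev[k] for k in ("time_in_millis", "count", "failed")}
    if interval_s <= 0 or any(d < 0 for d in deltas.values()):
        return None
    mean_latency = (deltas["time_in_millis"] / deltas["count"]
                    if deltas["count"] else None)
    return {"mean_latency_ms": mean_latency,
            "failures_per_min": deltas["failed"] * 60 / interval_s}


before = {"time_in_millis": 50_000, "count": 1_000, "failed": 2}
after = {"time_in_millis": 56_000, "count": 1_200, "failed": 5}
busy = pipeline_rates(before, after, 30)
idle = pipeline_rates(after, after, 30)
print(busy, idle)
```

Note that the mean is per interval, not per pipeline lifetime: dividing the raw cumulative totals would flatten any recent regression into the long-run average.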

—

10. Node Count Stability

Why this matters

On a 25-node cluster, losing a node is not immediately obvious. OpenSearch will attempt to relocate the absent node’s shards, cluster status may stay yellow rather than red, and depending on replica configuration, no data loss occurs. The operational risk is that this can go unnoticed for hours or days - and during that window, the cluster is running with degraded redundancy. A second node loss during that period could be catastrophic. This check is trivial to implement and closes the gap between “the cluster is still serving requests” and “the cluster is operating as expected.”

Zabbix implementation

Source: $.nodes.count.total from _cluster/stats.
Dependent item: Single JSONPath extraction, numeric unsigned.
Trigger: change(/host/opensearch.nodes.count) <> 0 - High.
Optionally set the expected count as a host macro {$OPENSEARCH_NODE_COUNT} and trigger on last(/host/opensearch.nodes.count) < {$OPENSEARCH_NODE_COUNT} for a more explicit check; use <> instead of < if a newly added node should also alert.
Add a nodata(/host/opensearch.nodes.count,5m) = 1 trigger to catch scenarios where the entire _cluster/stats endpoint becomes unreachable.

—

Putting It Together

#	Metric	API Endpoint	Zabbix Pattern
1	Unassigned shard trend	`_cluster/health`	HTTP agent - dependent item - `change()` trigger
2	Heap % on N nodes	`_nodes/stats/jvm`	LLD - per-node items - `count(last_foreach())` aggregate item
3	Disk absolute + projected	`_nodes/stats/fs`	LLD - per-node items - `last()` + `timeleft()`
4	Old-gen GC % wall clock	`_nodes/stats/jvm`	LLD - change/s preprocessing - avg trigger
5	Young + old GC trending	`_nodes/stats/jvm`	LLD - two rate items - compound AND trigger
6	File descriptor %	`_nodes/stats/process`	LLD - calculated item - threshold trigger
7	Sustained per-node CPU	`_nodes/stats/process`	LLD - avg(5m) trigger
8	Admission control signals	`_nodes/stats/resource_usage_stats`	LLD - per-node items - `max_foreach()`
9	Pipeline latency + failures	`_nodes/stats/search_pipeline`	LLD - rate preprocessing - compound triggers
10	Node count stability	`_cluster/stats`	Dependent item - `change() <> 0` trigger

The master HTTP agent items cover only the distinct endpoints in the table above, and everything else derives from those requests. At a 30-second scrape interval on a 25-node cluster, that’s one HTTP request per endpoint per interval to OpenSearch, however many nodes are discovered - manageable overhead with no per-node polling.

Linux, Tips, Docker

Wednesday, March 4, 2020

The fun task of getting the results of Jenkins builds back into GitLab.

by cam

Today I had the fun task of getting the results of Jenkins builds back into GitLab.
This post hopefully describes some of the errors I saw (or didn’t see) and the multiple steps that were not clear or not noted in documentation or on the internetz.

So following the documentation here (https://github.com/jenkinsci/gitlab-plugin) I went through the normal steps of triggering builds from GitLab to Jenkins, in particular:

In Jenkins

* Create the Project
* Configure the SCM (Source Code Management) checkout of the GitLab repo as per normal for Jenkins (I’ll also add a pipeline example at the bottom)
* This usually involves adding a ‘deploy’ SSH key to the GitLab project, or however you have SSH keys configured in your Jenkins

In the Jenkins project tick ‘Build when a change is pushed to GitLab’

* Click Advanced > ‘Secret Token’ > ‘Generate’
* Note the ‘GitLab webhook URL’

In GitLab, for the Project, go to: Settings > Integrations

* Input the Gitlab webhook URL in the ‘URL’ box
* Input the ‘Secret Token’
* Trigger on ‘Push Events’
* Click ‘Add webhook’
* Then Test it

If you’re following along with the doco, you will notice we are using the “Configuring per-project authentication” method.
That completes our “GitLab-to-Jenkins authentication”

The next bit will be where I experienced the most issues and hence this blog post - “Jenkins-to-gitlab-authentication”.

Now the documentation is correct - and I will repeat it here with some additional verbiage.

1. Create a new (Jenkins) user in GitLab

* I won’t go into details here, but set the password to something long and random and forget the password, as no-one should log in as this user.

2. Give this user ‘Developer’ permissions on each repo you want Jenkins to send build status to

* Yes BUT if using the default GitLab(?) configuration and you yolo your Git and always commit to the master branch then ‘Developer’ can not commit (build statuses) directly to master.
Failing to fix this results in the following errors in your Jenkins log:
“c.d.g.util.CommitStatusUpdater#updateCommitStatus: Failed to update Gitlab commit status for project ‘Your-Project-Name’”
“javax.ws.rs.ClientErrorException: HTTP 403 Forbidden” More Java Stack Trace.

* To resolve this error, in GitLab go to your Project > Settings > Repository (Settings) > Protected Branches (Expand)
From there either “Unprotect” the master branch or change the permissions to something more suited.

3. Log in or ‘Impersonate’ that (Jenkins) user in GitLab, click the user’s icon/avatar and choose Settings
Click on ‘Impersonation Tokens’
Create a token named e.g. ‘jenkins’ with ‘api’ scope; expiration is optional
Copy/Note the token immediately, it cannot be accessed after you leave this page

4. On the Configure System (Manage Jenkins > Configure System) page in Jenkins, in the GitLab configuration section,
Supply the ‘Connection Name’ - I recommend using underscores instead of spaces, especially if you have more than one Connection.
Supply the GitLab host URL, e.g. https://your.gitlab.server
Click the ‘Add’ button to add a credential, choose ‘GitLab API token’ as the kind of credential
Scope select ‘Global (Jenkins, nodes, items, all child items etc)’
* If you select ‘System (Jenkins and nodes only)’ you will get the following error in your jobs: “Can’t submit build status: No GitLab connection configured”
Paste your GitLab user’s API key into the ‘API token’ field
I also recommend creating a human friendly ‘ID’ like jenkins-api-user-at-gitlab to use in your pipeline, especially if you have more than one Connection.
Click ‘Add’ to save the credentials
Click the ‘Test Connection’ button, it should succeed

5. Finally, scroll to the bottom of the page and click ‘Save’.

It should be noted that when re-visiting the ‘Configure System’ page it’s common for the GitLab section to report “API Token for Gitlab access required” - this doesn’t seem to matter, as the credentials work fine.
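
If you want to test the token outside Jenkins, the commit-status call can be reproduced by hand against GitLab’s commit status API. Below is a minimal Python sketch that only builds the request; the host, project path, commit SHA and token are placeholders, and the helper name is made up for this example.

```python
import urllib.parse
import urllib.request


def build_status_request(gitlab_url, project_path, sha, token, state, name):
    """Build (but do not send) a POST to GitLab's commit status endpoint."""
    # The project can be addressed by its URL-encoded path instead of a numeric ID.
    project = urllib.parse.quote(project_path, safe="")
    url = f"{gitlab_url}/api/v4/projects/{project}/statuses/{sha}"
    data = urllib.parse.urlencode({"state": state, "name": name}).encode()
    return urllib.request.Request(url, data=data, method="POST",
                                  headers={"PRIVATE-TOKEN": token})


req = build_status_request("https://your.gitlab.server",
                           "SysAdmin/my-log-collector",
                           "0123abc",            # placeholder commit SHA
                           "REPLACE_WITH_TOKEN",  # the impersonation token
                           "success", "build_container")
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen() sends it; a 403 at that point suggests a GitLab-side permission problem (such as the protected branch above) rather than anything in the Jenkins configuration.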

That’s the authentication sorted both ways, along with some errors I encountered and resolved. All that’s left to do is trigger your job - either via a git commit or via Jenkins - and check for the update in GitLab!

As a bonus to anyone that got this far, here is a ‘Jenkinsfile’ that goes through a simple Docker container build and also triggers a script inside the container.
You will have to excuse the mess around dockerImage.withRun - as it’s rather difficult to get credentials into the container without committing them to the image etc.

#!groovy

def errorFriendly = "My Log Collector"
def imageName = "unittest"

def registryNamespace = "sysadmin"
// Our custom docker registry
def dockerRegistry = "reg.acme.com"
def dockerImageName = "${dockerRegistry}/${registryNamespace}/${imageName}"

def slackSendMessage(color, message) {
  slackSend channel: 'cicd-notifs',
  tokenCredentialId: 'slack-notifications-token',
  baseUrl: 'https://acme.slack.com/services/hooks/jenkins-ci/',
  color: color,
  message: message
}

node("sysadmin") {

  stage('GIT Clone DockerFile Config') {
    try {
      gitlabBuilds(builds: ['build_container', 'test_ansible_playbook']) {
      // Do nothing but notify GitLab about expected builds so that it sets them as pending
      }
      git branch: 'master',
          credentialsId: 'jenkins-gitlab-ssh-key',
              url: 'git@git.acme.com:SysAdmin/my-log-collector.git'
      sh "ls -lat"
      sh "pwd"
    } catch (Exception ex) {
        println("Unable to git clone: ${ex}")
        slackSendMessage("#FF0000", "Stage 1/3 - Git clone failed for ${errorFriendly}")
        error 'Git Clone failure'
    }
  }

  stage ('Build Docker Image') {
    try {
      updateGitlabCommitStatus name: 'build_container', state: 'running'
      dir("./${imageName}") {
        // withRegistry expects a full registry URL, including the scheme
        docker.withRegistry("https://${dockerRegistry}", 'sysadmin_jenkins_docker_reg') {
          def customImage = docker.build("${dockerImageName}:${env.BUILD_ID}")
        }
      }
      updateGitlabCommitStatus name: 'build_container', state: 'success'
    } catch (Exception ex) {
        println("Unable to build image: ${ex}")
        slackSendMessage("#FF0000", "Stage 2/3 - Docker Container build failed for ${errorFriendly}")
        updateGitlabCommitStatus name: 'build_container', state: 'failed'
        error 'Docker Build failure'
    }
  }

  // Dont bother pushing the image as we are just testing a script and checking for results

  stage('Clone and Execute the playbook') {
    try {
        updateGitlabCommitStatus name: 'test_ansible_playbook', state: 'running'
        withCredentials(bindings: [sshUserPrivateKey(credentialsId: 'jenkins-gitlab-ssh-key', 
            keyFileVariable: 'myLogKey', 
            passphraseVariable: '', 
            usernameVariable: '')]) {
          // Damn Jenkins Docker devs - Cant stuff just work as advertised ie. .inside()!!?!!!
          docker.image("${dockerImageName}:${env.BUILD_ID}").withRun('-t -u root --entrypoint=cat -v "/srv/jenkins/workspace/$JOB_NAME@tmp:/srv/jenkins/workspace/$JOB_NAME@tmp" -e myLogKey=$myLogKey' ) { c ->
          sh "docker cp ./${imageName}/docker-entry.sh ${c.id}:/docker-entry.sh"
          sh "docker exec ${c.id} /docker-entry.sh"
          }
        }
        updateGitlabCommitStatus name: 'test_ansible_playbook', state: 'success'
    } catch (Exception ex) {
        println("Ansible-Playbook failure")
        slackSendMessage("#FF0000", "Stage 3/3 - UnitTest failed for ${errorFriendly}")
        updateGitlabCommitStatus name: 'test_ansible_playbook', state: 'failed'
        error 'Ansible-Playbook failure'
    }
  }
}

Linux, Tips, Hardware, Disks

Thursday, January 30, 2020

Mount image file under Linux

by cam

Sometimes you just *need* to mount an image file under Linux (e.g. forensics and/or data recovery).
This isn’t always easy: if you dd a whole disk, you then need to work out the partition maths.

Easiest way is to ‘fdisk -l’ the image file:

root@HackerBox:~/forensics# fdisk -l /mnt/temp/ewf1
Disk /mnt/temp/ewf1: 10 GiB, 10737418240 bytes, 20971520 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x39bf39be

Device            Boot Start      End  Sectors Size Id Type
/mnt/temp/ewf1p1  *       63 20948759 20948697  10G  7 HPFS/NTFS/exFAT

From the above you should see that sectors are 512 bytes (pretty normal) and the partition starts 63 sectors in.

So finally all we need to do is mount the image with the command:

mount /mnt/temp/ewf1 /mnt/temp1 -o ro,loop,show_sys_files,streams_interface=windows,offset=$((63*512))

You can possibly leave out the ‘show_sys_files,streams_interface=windows’ parameters if you aren’t doing forensics.
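
The offset is just the start sector multiplied by the sector size. Spelled out in Python for the fdisk output above (the helper name is only for illustration):

```python
def partition_offset(start_sector, sector_size=512):
    """Byte offset of a partition inside a raw disk image."""
    return start_sector * sector_size


offset = partition_offset(63)
print(f"offset={offset}")
```

That gives 32256 bytes, the same value the shell computes from $((63*512)) in the mount command.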

Operating Systems, Linux, Tips, Hardware, Disks

Monday, July 22, 2019

Resize Linux partition while online

by cam

Previously I always had to fumble around using fdisk, doing the dangerous task of deleting the partition table while in use, then re-adding the table with the new extended settings!
Too difficult and too danger prone (although it never failed on me…)

Anyway I just stumbled on a way easier process! Introducing ‘growpart’.
The process for resizing an EXT4 partition, ‘/dev/xvda1’, would be:

Backup the partition table (just in case)

sfdisk -d /dev/xvda > partition_backup

Resize the partition

growpart -v /dev/xvda 1

And finally resize the filesystem to make use of the larger partition

resize2fs /dev/xvda1

Happy resizing.

Note: From a quick investigation, it seems ‘growpart’ comes from the ‘cloud-utils-growpart’ RPM (‘cloud-guest-utils’ on Debian/Ubuntu), which I found on Amazon AWS hosted systems. So I’m not sure about regular availability.

Friday, January 11, 2019

Elasticsearch: search_context_missing_exception - No search context found for id

by cam

Consolidating daily logstash indexes into monthly logstash indexes sometimes results in the following error:

      {
        "index": "logstash-2018-11-01",
        "shard": 0,
        "node": "aodScdJuQ5OWLyucQ6Px5Q",
        "reason": {
          "type": "search_context_missing_exception",
          "reason": "No search context found for id [3708723]"
        }
      }

This error is typically caused by the following:

Reindexing uses the scroll api under the covers to read a “point-in-time” view of the source data.

This point in time consists of a set of segments (Lucene files) that are essentially locked and prevented from being deleted by the usual segment merging process that works in the background to reorganise the index in response to ongoing CUD (create/update/delete) operations.

It is costly to preserve this view of data which is why users of the scroll API must renew their lock with each new request for a page of results. Locks will timeout if the client fails to return within the timespan which they said they would return.
The error you are seeing is because the reindex function has requested another page of results but the scroll ID which represents a lock on a set of files is either:

* timed out (i.e the reindex client spent too long indexing the previous page) or
* lost because the node serving the scroll api was restarted or otherwise became unavailable

(Source: https://discuss.elastic.co/t/problem-when-reindexing-large-index/117421)

Looking at the Elasticsearch logs I can see:

[INFO ][o.e.t.LoggingTaskListener] 3406201 finished with response BulkByScrollResponse[took=2.1h,timed_out=false,sliceId=null,updated=0,created=37036000,deleted=0,batches=37036,versionConflicts=0,noops=0,retries=0,throttledUntil=0s,bulk_failures=[],search_failures=[{"shard":-1,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [1102015]"}}, {"shard":-1,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [1102016]"}}, {"shard":-1,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [1102023]"}}]]
[INFO ][o.e.m.j.JvmGcMonitorService] [server-ls2.local] [gc][40994] overhead, spent [335ms] collecting in the last [1s]

[WARN ][o.e.t.TransportService   ] [server-ls2.local] Received response for a request that has timed out, sent [51770ms] ago, timed out [21769ms] ago, action [internal:discovery/zen/fd/master_ping], node [{server-ls5.local}{aM_wxa2mTY2XI9P7bsobSg}{gsNCWS-GQ0md1u-yuhxSNw}{192.168.11.55}{192.168.11.55:9300}{site_id=rack1, ml.machine_memory=67279155200, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], id [11840408]

So there was a timeout issue - but the cause is unknown at this time….
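
For what it’s worth, if the scroll really is expiring between pages, the two knobs on the reindex call are the scroll keep-alive and the batch size. I haven’t verified that this fixes the case above; the Python sketch below only builds the request, and the index names and values are examples.

```python
import json


def build_reindex_request(source_index, dest_index, batch_size=500, scroll="10m"):
    """Return (path, body) for a _reindex call with a longer scroll and smaller pages."""
    # scroll: how long each search context is kept alive between pages.
    path = f"/_reindex?scroll={scroll}&wait_for_completion=false"
    body = {
        # source.size is the number of documents fetched per scroll page.
        "source": {"index": source_index, "size": batch_size},
        "dest": {"index": dest_index},
    }
    return path, body


path, body = build_reindex_request("logstash-2018-11-*", "logstash-2018-11")
print("POST", path)
print(json.dumps(body, indent=2))
```

Smaller pages mean each bulk write finishes sooner, so the client gets back to renew the scroll well inside the keep-alive; neither setting helps if the node holding the search context drops out of the cluster.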