Restarting an Elasticsearch node: the safe way

Elasticsearch is used by many companies as a search engine thanks to its speed and scalability. At Dynamic Yield, we use Elasticsearch as part of our recommendations engine, where it handles thousands of requests per second.

Terminology

An Elasticsearch cluster contains several nodes, each of which can play one or more roles: master, data, ingest, etc.

Elasticsearch cluster

Concepts

Before we proceed, there are a few Elasticsearch principles to keep in mind:

  1. Elasticsearch tries to balance all shards (both primary and replica) across all data nodes. It does this by dividing the total number of shards by the number of data nodes (and not by shard size! There's a great post reviewing that issue). Say we have 1,000 shards and 8 data nodes: 125 shards will be allocated to each node.
  2. If a data node goes down for some reason, all of its shards become "unassigned shards", and Elasticsearch will try to assign them to different nodes (by duplicating other replicas of those shards). During that time, the cluster state might be red or yellow.
  3. Latency: the more replicas we have, the more nodes can handle our search queries.
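The arithmetic behind the first two points can be sketched in a few lines of Python. This is only an illustration of count-based balancing, assuming shards spread perfectly evenly; the function name `shards_per_node` is hypothetical and is not part of Elasticsearch.

```python
def shards_per_node(total_shards: int, data_nodes: int) -> list:
    """Spread shards across nodes by count only, ignoring shard size."""
    if data_nodes <= 0:
        raise ValueError("need at least one data node")
    base, extra = divmod(total_shards, data_nodes)
    # The first `extra` nodes carry one additional shard.
    return [base + 1 if i < extra else base for i in range(data_nodes)]


# The example from the text: 1,000 shards on 8 data nodes.
print(shards_per_node(1000, 8))  # [125, 125, 125, 125, 125, 125, 125, 125]

# If one node goes down, the same 1,000 shards must fit on the other 7.
print(shards_per_node(1000, 7))  # [143, 143, 143, 143, 143, 143, 142]
```

In this idealized picture, losing one of eight nodes means each surviving node ends up with roughly 18 extra shards, all of which have to be copied over the network first.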

Mitigating risks

But what if the restart is planned? There are several possible reasons for one:

  1. Node termination (for example, to move to a different availability zone).
  2. Changing Elasticsearch's static settings.
  3. Unexpected behavior, such as high JVM heap usage, that requires restarting the Elasticsearch process.

The simple way

Just stop the Elasticsearch service:

service elasticsearch stop
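After a plain stop like this, the node's shards become unassigned and the cluster may turn yellow or red. A minimal sketch of how one might watch for that in Python, using the `status` and `unassigned_shards` fields of the `GET _cluster/health` response. The helper names and the `base_url` value are assumptions for illustration, and the sample response is hand-written rather than taken from a real cluster.

```python
import json
from urllib.request import urlopen


def describe_health(health: dict) -> str:
    """Summarize a `GET _cluster/health` response body."""
    status = health["status"]                 # "green", "yellow" or "red"
    unassigned = health["unassigned_shards"]  # shards not yet placed on a node
    if status == "green":
        return "green: all shards assigned"
    return f"{status}: {unassigned} unassigned shard(s)"


def fetch_health(base_url: str) -> dict:
    """Fetch cluster health; base_url is an assumed address, e.g. http://localhost:9200."""
    with urlopen(f"{base_url}/_cluster/health") as resp:
        return json.load(resp)


# Offline example with a hand-written response (values are made up).
sample = {"status": "yellow", "unassigned_shards": 125}
print(describe_health(sample))  # yellow: 125 unassigned shard(s)
```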

The safe way

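Elasticsearch provides a rich REST API, so why not use it? The idea is to move the shards off the node before stopping it, so that no shard becomes unassigned:

  1. Exclude the node from shard allocation. Elasticsearch will start relocating its shards to the other data nodes. Note that the `_name` attribute matches the node's name:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "<node_name>"
  }
}

  2. Wait for the relocation to finish. The node is ready once its shard count drops to 0; the second request shows how the shards are spread across the rest of the cluster:

GET _cat/allocation/<node_name>
GET _cat/allocation

  3. Restart the Elasticsearch service on the node:

service elasticsearch stop
service elasticsearch start

  4. Clear the exclusion so the node receives shards again:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": ""
  }
}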
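The REST calls above could be scripted roughly as follows. This is a sketch rather than a hardened tool: it assumes an unauthenticated cluster reachable over plain HTTP, uses only the Python standard library, and relies on the `?format=json` option of the cat APIs. The helper names (`exclusion_body`, `shards_on_node`, `call`, `drain`, `reinclude`) are all hypothetical.

```python
import json
import time
from urllib.request import Request, urlopen


def exclusion_body(node_name: str) -> dict:
    """Settings body that excludes a node by name ("" clears the exclusion)."""
    return {"transient": {"cluster.routing.allocation.exclude._name": node_name}}


def shards_on_node(rows: list, node_name: str) -> int:
    """Count shards for node_name in `_cat/allocation?format=json` rows."""
    # The cat API reports the shard count as a string, hence int().
    return sum(int(r["shards"]) for r in rows if r.get("node") == node_name)


def call(base_url: str, method: str, path: str, body=None):
    """Send one request to the cluster and decode the JSON reply."""
    data = json.dumps(body).encode() if body is not None else None
    req = Request(f"{base_url}/{path}", data=data, method=method,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)


def drain(base_url: str, node_name: str, poll_seconds: float = 10.0) -> None:
    """Exclude the node, then block until it holds no shards."""
    call(base_url, "PUT", "_cluster/settings", exclusion_body(node_name))
    while True:
        rows = call(base_url, "GET", f"_cat/allocation/{node_name}?format=json")
        # An unknown node name yields no rows, which also counts as 0 shards.
        if shards_on_node(rows, node_name) == 0:
            return
        time.sleep(poll_seconds)


def reinclude(base_url: str) -> None:
    """Clear the exclusion once the node is back up."""
    call(base_url, "PUT", "_cluster/settings", exclusion_body(""))
```

A run would then look like `drain("http://localhost:9200", "es-data-3")`, followed by the service stop and start on the node itself, and finally `reinclude("http://localhost:9200")`. Both the address and the node name here are placeholders.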

Summary

We saw how to safely restart an Elasticsearch node with minimal risk and only a small additional effort. That way, we keep our production environment safe and stable.

Dad | Husband | Principal Software Engineer at Dynamic Yield | Tech Geek | https://www.linkedin.com/in/itaybittan