If the workloads on an Elasticsearch cluster's data plane change, you can scale the cluster vertically by changing its node specifications or node storage type.
| Change Type | Scenario | Change Process |
|---|---|---|
| Changing node specifications | Typically, you increase node specifications instead of decreasing them. Common scenarios include: Alternatively, you may also decrease node specifications, but doing so will decrease the cluster's data processing and storage capacities. Exercise caution. | The node specifications are changed one node at a time. This is to ensure that there are sufficient resources to keep services running. |
| Changing the node storage type (disk type) | Change the node storage type if disk I/O has become a performance bottleneck that impacts query and write performance. | The nodes are changed one at a time to prevent service interruptions. |
For a pay-per-use cluster, the new price is displayed when you confirm the node specifications or storage type change on the console. The new price applies once the change is complete. For pricing details, see .
Before changing a cluster's node specifications or storage type, it is essential to assess the potential impacts and review operational recommendations. This enables proper scheduling of the change, minimizing service interruptions.
Changing the node storage type does not interrupt services. However, the data migration that occurs during this process consumes I/O resources, and taking individual nodes offline still has some impact on overall cluster performance.
To minimize this impact, adjust the data migration rate based on the cluster's traffic cycle: increase the rate during off-peak hours to shorten the task duration, and decrease it before peak hours arrive to preserve cluster performance. The data migration rate is controlled by the indices.recovery.max_bytes_per_sec parameter. Its default value is the number of vCPUs multiplied by 32 MB/s; for example, with four vCPUs the default rate is 128 MB/s. Set this parameter to a value between 40 MB/s and 1000 MB/s based on your service requirements, for example:
```
PUT /_cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "1000MB"
  }
}
```
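After the change task finishes, you may want to check the currently effective value and restore the default. This is a minimal sketch using the standard Elasticsearch cluster settings API; setting a transient value to null removes the override.

```
# Check the currently effective value (transient settings override persistent ones and defaults)
GET /_cluster/settings?include_defaults=true&filter_path=*.indices.recovery.max_bytes_per_sec

# Remove the transient override to fall back to the default rate
PUT /_cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": null
  }
}
```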
Taking nodes offline one at a time usually does not interrupt services. However, requests sent to offline nodes may fail. To mitigate this impact, the following measures may be taken:
Shards that have no replicas become unavailable when the nodes that store them are taken offline, causing service interruptions. You are advised to add replicas for all important indexes before making the change described in this topic, as shown in the example below.
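The following sketch uses the standard Elasticsearch _cat and index settings APIs to find indexes without replicas and add one; my-index is a placeholder for an actual index name.

```
# List indexes with primary and replica counts; any index with rep = 0 has no replicas
GET /_cat/indices?v&h=index,pri,rep,store.size

# Add one replica to an index that currently has none (my-index is a placeholder)
PUT /my-index/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}
```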
Changing the node storage type for a cluster causes Kibana and Cerebro to be rebuilt, during which time they are temporarily unavailable. During a node specifications change, if Kibana and Cerebro become unavailable because the node running them is taken offline, refresh the web page or log in again, and the system will reschedule Kibana and Cerebro to an available node.
Once started, a change task cannot be stopped until it succeeds or fails. A failed change task affects only a single node and does not interrupt services as long as data replicas exist, but the failed node still needs to be restored promptly.
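To monitor a change task and shard recovery after a node is restored, you can use the standard Elasticsearch health and recovery APIs; this is an illustrative sketch rather than a CSS-specific interface.

```
# Overall cluster status (green/yellow/red) plus relocating and initializing shard counts
GET /_cluster/health

# Shard recovery tasks that are still in progress
GET /_cat/recovery?v&active_only=true
```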
For a node specifications change:
Change duration (min) = 10 (min) x Total number of nodes to change + Data recovery duration (min)
where
Data recovery duration (min) = Total data size (MB)/[Total number of vCPUs of the data nodes x 32 (MB/s) x 60 (s)]
For a node storage type change:
Change duration (min) = 15 (min) x Total number of nodes to change + Data migration duration (min)
where
Data migration duration (min) = Total data size (MB)/[Total number of vCPUs of the data nodes x 32 (MB/s) x 60 (s)]
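As a worked example with assumed numbers, consider changing 3 data nodes that have 4 vCPUs each (12 vCPUs in total) and hold about 100 GB (102,400 MB) of data. The estimated data recovery or migration duration is 102,400/(12 x 32 x 60) ≈ 4.4 minutes, so a node specifications change would take roughly 10 x 3 + 4.4 ≈ 34 minutes, and a node storage type change roughly 15 x 3 + 4.4 ≈ 49 minutes.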
| Parameter | Description |
|---|---|
| Action | Select Change specifications. |
| Resources | Shows the change of resources for this operation. |
| Nodes | Configure the changes you want to make. The node specifications and storage type cannot be changed at the same time. |
| Item | Description |
|---|---|
| Verify index copies | By default, CSS checks for indexes that do not have any replicas created for them. You can skip this step, but the lack of index replicas may impact service availability during a node specifications change. |
| Cluster status check | During a node specifications change, the cluster status is checked by default to improve the success rate and ensure data security. The nodes are changed one at a time. For each node, the system changes its specifications, restarts it, and checks that all its processes are started successfully before moving on to the next node. In emergencies (for example, when a cluster is overloaded and services are faulty, which may prevent a specifications change request from being delivered), you can skip the cluster status check so that more resources are available for cluster recovery. However, doing so may cause the cluster to become faulty and interrupt services. Exercise caution. |
| Check cluster load | During a node storage type change, data migration between nodes and the stopping and restarting of nodes will consume cluster resources, causing the cluster load to increase. A cluster load check can identify possible overload risks for a cluster and reduce the likelihood of an overload condition causing the node storage type change to fail. The cluster load check items are as follows: |
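The load check itself is handled by the console, but you can also spot-check node load before submitting the change. The following sketch uses the standard Elasticsearch _cat/nodes API.

```
# Per-node CPU usage, load average, heap usage, and RAM usage
GET /_cat/nodes?v&h=name,cpu,load_1m,heap.percent,ram.percent
```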
If the change request fails to be submitted and a message is displayed indicating that the cluster needs to be upgraded, the current cluster version does not support a node storage type change. Upgrade the cluster to the latest image version and try again. For a detailed upgrade guide, see Upgrading the Version of an Elasticsearch Cluster.