Nutanix has developed an intelligent, self-healing system that proactively detects upcoming issues and raises warnings as alerts in Prism and via e-mail. Nutanix recently added one more feature to the cluster: the degraded node forwarding state.
A Nutanix cluster runs an algorithm in the background that detects degraded nodes based on each node's performance score in the peer health global database. If a node's score is not up to the mark, and its Nutanix Controller VM (CVM) is struggling to complete operations in real time because of slowness, unresponsiveness, or performance issues in the network, disks, or memory (DIMMs), the cluster places that node in the forwarding state to prevent a failure.
In other words, when a CVM has trouble hosting cluster services and is no longer fully reliable for normal cluster operations, the cluster places that node in the forwarding state.
Once the cluster detects a node as degraded (because of poor performance), leadership roles and critical services are no longer hosted on that node.
Until a degraded node is unmarked from the degraded state, the node's Cassandra services remain in the forwarding state (see Cluster Components). Once the node is marked as fixed in Nutanix Prism, Cassandra restarts its services on the node.
Degraded Node State Cause
The Nutanix cluster detects nodes with performance issues by comparing each node against predefined health and performance scores.
The node health score depends on the following factors:
- Network bandwidth reduction
- Network packet loss or drops, e.g. the OVS packet looping issue on Nutanix AHV
- Network latency
- Soft lockups
- Partially bad disks, e.g. bit-rot data corruption
- Disk failure, e.g. SSD failure
- Hardware issues (such as unreliable DIMM with ECC errors)
- Remote Procedure Call (RPC) failures or timeouts
- RPC latency
- A failed metadata drive, an initiated node removal, or an unexpected subsystem fault
- Frequent reboots of the Nutanix Controller VM (CVM) or of a critical service
Note: if one node consistently receives poor scores for approximately 10 minutes, its peers mark that node as degraded. Clustering algorithms are used to identify outlier scores.
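The rule in the note above can be sketched in Python. This is only an illustration under stated assumptions, not the algorithm Nutanix actually runs: the score scale (0 to 100, higher is healthier), the one-sample-per-minute cadence, the median-based outlier test, and the names `is_outlier`, `find_degraded`, `WINDOW`, `MAD_THRESHOLD` and `MIN_GAP` are all hypothetical.

```python
from statistics import median

# All three constants are assumptions made for this sketch.
WINDOW = 10          # one score sample per minute, so roughly 10 minutes
MAD_THRESHOLD = 3.0  # cut-off in median absolute deviations
MIN_GAP = 20.0       # minimum distance below the median to count as poor


def is_outlier(node, scores):
    """Return True if the node's score sits far below the peer median."""
    values = list(scores.values())
    med = median(values)
    # Median absolute deviation: a spread measure that a single
    # bad node cannot inflate much.
    mad = median(abs(v - med) for v in values)
    gap = med - scores[node]
    return gap > max(MAD_THRESHOLD * mad, MIN_GAP)


def find_degraded(history):
    """history: list of {node: score} samples, oldest first.

    A node is reported only if it is an outlier in every one of the
    last WINDOW samples, i.e. it scored poorly consistently.
    """
    if len(history) < WINDOW:
        return set()
    recent = history[-WINDOW:]
    return {
        node
        for node in recent[0]
        if all(node in sample and is_outlier(node, sample) for sample in recent)
    }
```

With ten consecutive samples of `{"A": 95, "B": 92, "C": 94, "D": 40}`, `find_degraded` reports only node `D`. A single healthy sample inside the window is enough to clear it, which mirrors the "consistently" wording of the note.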
Degraded Node Event Alert
When the cluster declares a node degraded, Nutanix Prism raises the following degraded node alert messages:
1. Metadata service on CVM ip_address is running in forwarding mode due to reason.
2. Cassandra on CVM ip_address is running in forwarding mode due to reason.
3. Possible degraded node
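When scanning exported alert text in a script, the three messages above can be matched with regular expressions. A minimal Python sketch, assuming the wording is exactly as listed and that `ip_address` is an IPv4 address; the function name `classify_alert` and the labels it returns are hypothetical and not part of any Nutanix tool.

```python
import re

# Patterns built from the three alert messages listed above.
_FORWARDING = re.compile(
    r"^(?P<service>Metadata service|Cassandra) on CVM "
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"is running in forwarding mode due to (?P<reason>.+?)\.?$"
)
_POSSIBLE = re.compile(r"^Possible degraded node", re.IGNORECASE)


def classify_alert(message):
    """Return a dict describing a degraded node alert, or None if no match."""
    text = message.strip()
    match = _FORWARDING.match(text)
    if match:
        # service, ip and reason come from the named groups above.
        return {"kind": "forwarding", **match.groupdict()}
    if _POSSIBLE.match(text):
        return {"kind": "possible_degraded_node"}
    return None
```

Any line that matches none of the three patterns returns `None`, so unrelated alerts pass through untouched.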
Determine Degraded Node
To check for a degraded node in your cluster, run the following NCC degraded node check on any Nutanix CVM in the cluster.
nutanix@cvm$ ncc health_checks system_checks degraded_node_check
Degraded Node Impact
A degraded node affects the Nutanix cluster in the following ways:
Impact 1 : Cluster performance may be significantly degraded. In the case of multiple nodes with the same condition, the cluster may become unable to service I/O requests.
Impact 2 : Containers or data stores might be unavailable for 10 minutes until the node is marked as degraded.
Impact 3 : Upgrades and break fixes are not allowed until the degraded node is fixed.
Impact 4 : Continuing to run a degraded node can affect overall cluster and user VM performance.
Impact 5 : ZooKeeper might place the node into maintenance mode and forward its leadership position to another Nutanix CVM.
Impact 6 : Cassandra services remain in the forwarding state.
Impact 7 : Leadership roles and critical services will not be hosted on that node.
Impact 8 : A degraded node can adversely affect the performance of the entire cluster.
Nutanix Auto-healing Action
When the cluster detects a degraded node, the Nutanix auto-healing system works to mitigate the effects of the failure and reduce the overall impact on the cluster.
To mitigate the impact, the software can perform one of the following actions:
- Prevent components on the degraded node from acquiring leadership roles
- Place the degraded node's CVM in maintenance mode and reboot the CVM to stop its services
- Shut down the host
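The first action in the list, keeping a degraded node out of leadership roles, can be illustrated with a small Python sketch. It is a toy model under stated assumptions and does not reflect how Nutanix services actually elect leaders: the function names `eligible_leaders` and `pick_leader`, the node labels, and the "first eligible node wins" rule are all hypothetical.

```python
def eligible_leaders(nodes, degraded):
    """Nodes allowed to hold a leadership role, in their original order."""
    return [node for node in nodes if node not in degraded]


def pick_leader(nodes, degraded, current=None):
    """Keep the current leader while it is healthy, otherwise move the role.

    Assumption for this sketch: the first eligible node takes over.
    """
    candidates = eligible_leaders(nodes, degraded)
    if not candidates:
        raise RuntimeError("no healthy node available for leadership")
    if current in candidates:
        return current
    return candidates[0]
```

For example, with nodes `cvm-a`, `cvm-b` and `cvm-c`, where `cvm-a` is the current leader and has just been marked degraded, `pick_leader` hands the role to `cvm-b`.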
Conclusion
Nutanix keeps evolving the intelligence of the system with each new release of Nutanix AOS and the AHV hypervisor, automating tasks and self-healing as much as possible. The degraded node forwarding state is one important feature added to Nutanix AOS / CVM to minimize the impact of failures and performance degradation.
Thanks for reading the HyperHCI Tech Blog and learning a new tech topic every day!