Nutanix Failover Techniques : Redundancy Factor Vs Replication Factor Explained

Redundancy factor and resiliency factor (also called replication factor), each available as RF 2 or RF 3, work through different mechanisms but share the same goal: an extra layer of protection against component failure, whether the failed component is hardware or software-defined storage media.

Let's look at the redundancy factor and the resiliency factor in detail.

Nutanix Fail-over techniques

  • What is Redundancy Factor ?
  • What is Resiliency Factor / Replication Factor ?
  • How Does the Resiliency / Replication Factor Work ?
  • How to change Nutanix cluster RF Number ?

What is Redundancy ?

Redundancy is the provision of additional functional capability so that a system keeps operating without interruption when one or more components fail.

Redundancy depends on an automatic fault tolerance mechanism that relies on specialized hardware to detect a hardware fault or component failure and instantly switch to a redundant component. The failed component may be a processor, memory, a power supply unit (PSU), an I/O subsystem, a storage subsystem, or storage media or drives. The cut-over is apparently seamless and offers non-stop service.

Redundancy does not necessarily preserve full functionality or fidelity, however. The system may operate in a degraded state, but it is not put immediately into a dangerous state.

Hardware Component Failure List

In real-world scenarios, every hardware vendor provides redundancy against hardware component failure to keep operations uninterrupted.

Nutanix Fail-over Component Table

Here is a list of the component failures that can be covered by redundancy and a fault tolerance mechanism.

  • Network interface failure
  • Network interface card (NIC) failure
  • Memory module failure
  • CPU failure
  • Power supply unit (PSU) failure
  • Storage media or drive failure
  • Software-defined service failure

What is Redundancy Factor ?

Redundancy factor determines how many component failures a cluster can sustain while still delivering uninterrupted operation.

Nutanix offers two redundancy factors for component failure:

  • Redundancy Factor 2
  • Redundancy Factor 3

Redundancy Factor 2 ( RF 2 )

Redundancy Factor 2 ( RF 2 ) is the default redundancy factor when a Nutanix cluster is built. Nutanix requires a minimum of three nodes to form an enterprise-level, fault-tolerant cluster that can sustain a single component failure.

If a Nutanix cluster is configured with Redundancy Factor 2, the hardware has two-way component redundancy: if one component fails, the backup component takes over the load without interruption, so the cluster can sustain one component failure.

Use Case : A Nutanix server has two redundant power supply units ( PSUs ). If one PSU fails for any reason, the other takes over and keeps supplying power to the server without interruption or downtime.

Redundancy Factor 3 ( RF 3 )

Redundancy Factor 3 ( RF 3 ) delivers a more advanced level of redundancy. A Nutanix hyper-converged infrastructure ( HCI ) cluster requires a minimum of five nodes for RF 3 and can then sustain up to two simultaneous component failures.

Redundancy Factor 3 ( RF 3 ) means the hardware has three-way component redundancy: if two components fail, the remaining redundancy takes over the load without interruption, so the cluster can sustain two component failures.

Redundancy Factor 3 ( RF 3 ) is useful in the most critical environments, where Redundancy Factor 2 ( RF 2 ) is not enough because more than one component may fail at the same time.

Use Case : If a Nutanix cluster has five or more nodes, it can sustain the failure of up to two nodes or storage drives and continue operating without interruption.
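The node counts and failure limits quoted for the two redundancy factors can be captured in a small helper. This is only an illustrative sketch: the function names are hypothetical and are not part of any Nutanix tool, and the formula 2 × failures + 1 is simply one that reproduces the two figures given above (three nodes for RF 2, five nodes for RF 3).

```python
# Illustrative helper, not a Nutanix API: it encodes the figures quoted in the article.

def tolerated_failures(redundancy_factor: int) -> int:
    """Number of simultaneous component failures the cluster can absorb."""
    if redundancy_factor not in (2, 3):
        raise ValueError("only redundancy factor 2 and 3 are described here")
    return redundancy_factor - 1


def minimum_nodes(redundancy_factor: int) -> int:
    """Smallest cluster size: 3 nodes for RF 2, 5 nodes for RF 3."""
    return 2 * tolerated_failures(redundancy_factor) + 1


def supports(redundancy_factor: int, node_count: int) -> bool:
    """True when a cluster of node_count nodes is large enough for the given RF."""
    return node_count >= minimum_nodes(redundancy_factor)


if __name__ == "__main__":
    for rf in (2, 3):
        print(f"RF {rf}: at least {minimum_nodes(rf)} nodes, "
              f"survives {tolerated_failures(rf)} failure(s)")
```

For example, supports(3, 4) is false because a four-node cluster is one node short of the five needed for RF 3.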

What is Resiliency or Replication Factor ?

Resiliency Factor ( RF ), also known as Replication Factor ( RF ), is a data protection method that creates redundant copies of each master data block and distributes them across the Nutanix cluster to provide the highest degree of availability.

To ensure that data blocks are redundant, with one or two additional copies, a majority of nodes must agree before anything is committed; this is enforced using the Paxos algorithm. This ensures strict consistency for all data and global metadata stored as part of the platform.
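The "majority of nodes must agree" rule can be illustrated with a one-line test. The sketch below is not Paxos itself, which also involves proposal numbers and multiple message rounds; it only shows the majority check, and the function name is hypothetical.

```python
def has_majority(votes_for: int, cluster_size: int) -> bool:
    """True when strictly more than half of the nodes agree."""
    if not 0 <= votes_for <= cluster_size:
        raise ValueError("votes_for must be between 0 and cluster_size")
    return votes_for > cluster_size // 2


if __name__ == "__main__":
    # With 5 nodes, 3 agreeing nodes form a majority; 2 do not.
    print(has_majority(3, 5), has_majority(2, 5))
```

Under this rule a five-node cluster that loses two nodes still has three nodes left, which is a majority, while a three-node cluster can lose only one.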

The Nutanix platform currently uses a resiliency factor, also known as a replication factor (RF), together with checksums to ensure data redundancy and availability in the case of a node or disk failure or corruption. The OpLog acts as a staging area that absorbs incoming writes onto a low-latency SSD tier.

Upon being written to the local OpLog, the data is synchronously replicated to the OpLog of one or two other Nutanix CVMs (depending on the RF) before the write is acknowledged (Ack) to the host as successful. This ensures that the data exists in at least two or three independent locations and is fault tolerant.
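The write path described above, where the acknowledgement is held back until enough OpLogs hold the data, can be sketched as a toy model. Everything here is hypothetical: the Cvm class, its fields, and replicated_write are stand-ins for illustration and do not reflect how the real services are implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Cvm:
    """Hypothetical stand-in for a Controller VM and its OpLog."""
    name: str
    oplog: list = field(default_factory=list)
    healthy: bool = True

    def append(self, data: bytes) -> bool:
        """Store data in this CVM's OpLog; fails if the CVM is down."""
        if not self.healthy:
            return False
        self.oplog.append(data)
        return True


def replicated_write(data: bytes, local: Cvm, peers: list, rf: int) -> bool:
    """Return True (the Ack) only once rf OpLogs hold the data.

    Simplification: a failed write is not rolled back in this sketch.
    """
    if not local.append(data):
        return False
    copies = 1  # the local OpLog
    for peer in peers:
        if copies == rf:
            break
        if peer.append(data):
            copies += 1
    return copies == rf


if __name__ == "__main__":
    a, b, c = Cvm("cvm-a"), Cvm("cvm-b"), Cvm("cvm-c")
    print(replicated_write(b"block-1", a, [b, c], rf=2))
```

With RF 2 the write is acknowledged after the local OpLog and one peer hold the data; with RF 3 two peers are needed, so the same three-node model fails the write if one peer is down.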


Replication Factor 2 ( RF 2 )

Nutanix Resiliency Factor / Replication Factor 2 ( RF 2 ) requires a minimum of three nodes before it can be applied to the container(s) on a Nutanix cluster.

Resiliency Factor / Replication Factor 2 keeps one additional copy of each master data block ( VM data and OpLog ), so two copies exist in total. This lets the Nutanix cluster survive a single storage drive failure, bit rot or data corruption on a single drive, or a single node failure.

Nutanix Node Failure Scenario

Replication Factor 3 ( RF 3 )

Nutanix Resiliency Factor / Replication Factor 3 ( RF 3 ) requires a minimum of five nodes before it can be applied to the container(s) on a Nutanix cluster.

Resiliency Factor / Replication Factor 3 keeps two additional copies of each master data block ( VM data and OpLog ), so three copies exist in total. This lets the Nutanix cluster survive two storage drive failures, bit rot or data corruption on drives, or two node failures.

How Does the Replication Factor Work ?

The data Resiliency Factor / Replication Factor (RF) is configured via Prism at the container level. All nodes participate in OpLog replication to eliminate any “hot nodes”, ensuring linear performance at scale.

While the data is being written, a checksum is computed and stored as part of its metadata. Data is then asynchronously drained to the extent store, where the RF is implicitly maintained. In the case of a node or disk failure, the data is re-replicated among all nodes in the cluster to maintain the RF.
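Re-replication after a node failure can be pictured with a toy placement map. This is a sketch under stated assumptions: the function name, the placement dictionary, and the "pick the least-loaded healthy node" rule are all hypothetical, chosen only to show copies being spread across the remaining nodes until every block is back at its RF.

```python
def rereplicate(placement: dict, failed: str, nodes: list, rf: int) -> dict:
    """Return a new placement in which every block again has rf copies.

    placement maps a block id to the set of node names holding a copy.
    Assumes every holder listed in placement is also listed in nodes.
    """
    healthy = [n for n in nodes if n != failed]
    # Drop the copies that lived on the failed node.
    repaired = {block: set(holders) - {failed}
                for block, holders in placement.items()}

    # Count surviving copies per node so new copies go to the emptiest node.
    load = {n: 0 for n in healthy}
    for holders in repaired.values():
        for n in holders:
            load[n] += 1

    for block in sorted(repaired):
        holders = repaired[block]
        while len(holders) < rf:
            candidates = [n for n in healthy if n not in holders]
            if not candidates:
                raise RuntimeError("not enough healthy nodes to restore the RF")
            target = min(candidates, key=lambda n: (load[n], n))
            holders.add(target)
            load[target] += 1
    return repaired


if __name__ == "__main__":
    before = {"b1": {"A", "B"}, "b2": {"B", "C"}, "b3": {"A", "C"}}
    print(rereplicate(before, failed="B", nodes=["A", "B", "C", "D"], rf=2))
```

In this example node B fails, so blocks b1 and b2 each lose a copy and receive a new one on node D, while b3 is untouched.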

Any time the data is read, the checksum is computed to ensure the data is valid.  In the event where the checksum and data don’t match, the replica of the data will be read and will replace the non-valid copy.

Data is also continuously monitored to ensure integrity even when active I/O isn’t occurring. The scrubber operation of Stargate, a Nutanix cluster component, continuously scans through extent groups and performs checksum validation when disks aren’t heavily utilized. This protects against things like bit rot or corrupted sectors.
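The checksum handling described in this section (compute at write time, validate on read, repair from a replica, and scrub in the background) can be sketched as follows. The class and function names are hypothetical, and CRC32 from Python's zlib module is used only as a convenient stand-in, since the article does not say which checksum algorithm Nutanix uses.

```python
import zlib


def checksum(data: bytes) -> int:
    """Stand-in checksum (CRC32); the real algorithm is not named in the article."""
    return zlib.crc32(data)


class Replica:
    """One copy of a piece of data plus the checksum stored at write time."""

    def __init__(self, data: bytes):
        self.data = data
        self.stored_checksum = checksum(data)

    def is_valid(self) -> bool:
        return checksum(self.data) == self.stored_checksum


def read_with_repair(replicas: list) -> bytes:
    """Return data from the first valid replica and overwrite invalid copies with it."""
    good = next((r for r in replicas if r.is_valid()), None)
    if good is None:
        raise IOError("no valid replica left")
    for r in replicas:
        if not r.is_valid():
            r.data = good.data
    return good.data


def scrub(extent_groups: dict) -> int:
    """Background pass: validate every replica, repair bad ones, return the repair count."""
    repaired = 0
    for replicas in extent_groups.values():
        bad = sum(1 for r in replicas if not r.is_valid())
        if bad:
            read_with_repair(replicas)
            repaired += bad
    return repaired


if __name__ == "__main__":
    first, second = Replica(b"vm-data"), Replica(b"vm-data")
    second.data = b"vm-dat\x00"  # simulate bit rot on one copy
    print(read_with_repair([second, first]), second.is_valid())
```

The read path and the scrubber share the same repair step here; the only difference is that the scrubber walks all extent groups without waiting for a read.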

Nutanix Topx Video

Nutanix Redundancy Factor Vs Resiliency / Replication Factor

How to change RF on Nutanix cluster ?

The RF number can be changed from 2 to 3, but not vice versa, from either the Nutanix Prism web console or the command line.

Watch the video to see how to do it.

Change Nutanix cluster RF Number

Conclusion

Nutanix provides two fail-over mechanisms: Redundancy Factor ( RF 2 / 3 ) for hardware component fault tolerance, and Resiliency Factor / Replication Factor ( RF 2 / 3 ) for data protection, which creates redundant copies of each data block. Together they handle one or two component failures without interrupting operations in a production environment.
