Nutanix Metro Availability Troubleshooting

Today I will walk through the top 5 common errors and issues with Nutanix Disaster Recovery Metro Availability and share tips and tricks for troubleshooting each one through to a final resolution. This post should help anyone who is stuck on a Metro Availability technical issue and looking for a solution.

The Nutanix Metro Availability feature is currently available with VMware ESXi only; support for Nutanix AHV is coming in the future. The errors and issues covered here are ones you might face during or after Metro Availability configuration.

Nutanix Metro Availability common Issues & Troubleshooting

Listed here are the top 5 common errors and issues with Nutanix Disaster Recovery Metro Availability, and the good news is that each one comes with a resolution.

Let's go through them one by one and work through the solution step by step.

Automatic Promoting Active-Passive Issue

Issue 1: Alert – A130116 – Automatic Promote Metro Availability

Symptoms:

The A130116 – Automatic Promote Metro Availability alert is raised on the standby site in a Witness Metro configuration when the PD (protection domain) on the standby site is promoted to Active.

Possible conditions are:

  • Witness VM cannot access the VIP (virtual IP address) of the site where the PD was initially active
  • Communication between the VIP of the Active and the Standby site is interrupted

Resolution

  • Verify that the 2 sites are up and can communicate with each other over the VIP
  • Verify that the 2 sites can communicate with the Witness VM.
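
The two checks above can be scripted. What follows is a minimal sketch under my own assumptions, not a Nutanix tool: check_reachable and PROBE_CMD are hypothetical names made up for illustration, and the default probe is a plain 3-packet ping. Run it from a CVM on each site.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of the given addresses answer a probe.
# PROBE_CMD is an assumed override hook; by default it is a 3-packet ping
# with a 2-second wait per reply.
PROBE_CMD=${PROBE_CMD:-"ping -c 3 -W 2"}

check_reachable() {
  local failed=0 target
  for target in "$@"; do
    # PROBE_CMD is left unquoted on purpose so its options split into words.
    if $PROBE_CMD "$target" >/dev/null 2>&1; then
      echo "OK    $target"
    else
      echo "FAIL  $target"
      failed=1
    fi
  done
  # Non-zero exit status if any target failed the probe.
  return "$failed"
}

# Usage (replace the placeholders with your own addresses):
#   check_reachable ACTIVE_SITE_VIP STANDBY_SITE_VIP WITNESS_VM_IP
```

A FAIL line for the other site's VIP or for the Witness VM points at the network path to investigate first.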

Metro Availability Failure Issue

Issue 2: Alert – A130117 – Failed to update Metro Availability failure handling

Issue 2.1: Alert – A130118 – Metro Availability- Failed to update Metro Availability failure handling on the remote site

Symptoms

This alert indicates an error in updating failure handling on the source Metro Availability protection domain. It may be generated for the following reasons:

  • Network issue between the Nutanix clusters
  • Services on either cluster are in a crash loop or stopped
  • Ports 2009 and 2020 are temporarily blocked or unreachable

Resolution

  • Run the following command on any CVM to find the Cerebro master:
cvm$ cerebro_cli get_master_location
  • Run the following command on one of the source cluster's CVMs and check whether the ping statistics between the Active and Standby clusters look healthy:
cvm$ cat ~/data/logs/sysstats/ping_remotes.INFO | egrep -v "IP : time" | awk '/^#TIMESTAMP/ || $3>10.00 || $3=="unreachable"' | egrep -B1 " ms|unreachable" | egrep -v "\-\-"

The output looks like this:

#TIMESTAMP 2941677438847 : 10/02/2020 09:10:50 PM
10.X.X.X : 180 ms
10.Y.Y.Y : unreachable
10.Z.Z.Z : 180 ms
10.Q.Q.Q : unreachable
  • Verify that the required ports are open between the source and the remote cluster using the netcat utility:
cvm$ nc -v <remote site CVM IP> 2009
cvm$ nc -v <remote site CVM IP> 2020
  • Verify whether any services are crashing on the source or target clusters:
cvm$ watch -d "genesis status"
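
If you have several remote CVMs to test, the port check can be looped. This is only a sketch under stated assumptions: check_ports, tcp_probe and PORT_PROBE are hypothetical names, and instead of nc the default probe uses bash's built-in /dev/tcp redirection, so it does not depend on which netcat variant is installed.

```shell
#!/usr/bin/env bash
# Default probe: try to open a TCP connection through bash's /dev/tcp
# redirection and give up after 3 seconds. This is swapped in for nc.
tcp_probe() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# PORT_PROBE is an assumed hook naming the command used for each check.
PORT_PROBE=${PORT_PROBE:-tcp_probe}

# check_ports "<space-separated ports>" <host> [<host> ...]
check_ports() {
  local ports=$1 host port failed=0
  shift
  for host in "$@"; do
    for port in $ports; do
      if "$PORT_PROBE" "$host" "$port"; then
        echo "open     $host:$port"
      else
        echo "blocked  $host:$port"
        failed=1
      fi
    done
  done
  # Non-zero exit status if any port could not be reached.
  return "$failed"
}

# Usage (replace the placeholders with remote site CVM IPs):
#   check_ports "2009 2020" REMOTE_CVM_IP_1 REMOTE_CVM_IP_2
```

Any "blocked" line for 2009 or 2020 is worth raising with the network or firewall team before looking at the services.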

High Network Latency Issue

Issue 3: High network latency between Metro Availability Protection Domains (also known as stretched clusters)

Symptoms

When network latency between the active and standby Metro clusters stays above 5 ms for 10 seconds, the commit acknowledgement to the VM is delayed. As a result, the remote site is shown in the Incompatible Remote Sites list with LATENCY marked "Bad", and the Metro relationship is automatically disabled.

Resolution

  • Check the ping latency between the Metro Availability clusters at regular intervals, for example every hour or every 4 hours
  • Determine whether there was any snapshot or replication activity around the time of the latency spike
  • Review the ping status and network latency captured in the ping_remotes.INFO log file, located in the /home/nutanix/data/logs/sysstats/ directory on each CVM.
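
To compare the captured samples against the 5 ms limit, a small awk summary can help. This is a sketch only: latency_summary is a hypothetical name, and it assumes the line layout seen in the Issue 2 sample output ("<ip> : <latency> ms" or "<ip> : unreachable"). Adjust it if your log looks different.

```shell
#!/usr/bin/env bash
# Summarise ping samples per remote IP against a latency threshold in ms.
# Reads the log from standard input; the threshold defaults to 5.
latency_summary() {
  local threshold_ms=${1:-5}
  awk -v limit="$threshold_ms" '
    /^#TIMESTAMP/ || /IP : time/ { next }   # skip timestamp and header lines
    $2 != ":" { next }                      # skip lines in any other layout
    {
      samples[$1]++
      if ($3 == "unreachable") {
        lost[$1]++
      } else {
        if ($3 + 0 > limit) high[$1]++
        if ($3 + 0 > max[$1]) max[$1] = $3 + 0
      }
    }
    END {
      for (ip in samples)
        printf "%s samples=%d over_limit=%d unreachable=%d max_ms=%s\n",
               ip, samples[ip], high[ip], lost[ip], max[ip] + 0
    }
  ' | sort
}

# Usage:
#   latency_summary 5 < /home/nutanix/data/logs/sysstats/ping_remotes.INFO
```

An IP with a high over_limit or unreachable count is the one to correlate with snapshot or replication activity at the same timestamps.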

Witness VM Not Reachable Issue

Issue 4: Alert – A130115 – Witness VM Not Reachable

Symptoms

The Nutanix Metro Availability Witness VM Not Reachable alert is generated when a cluster involved in Metro Availability cannot contact the Witness VM over the network, does not get a response from it, or cannot authenticate to it. This alert may be generated for the following reasons:

  • Witness VM is down
  • Witness VM is not reachable from the Nutanix cluster, possibly because of a temporary or permanent network issue or the firewall configuration
  • Witness VM internal server errors: the Witness VM is not responding to requests
  • The Witness VM admin user password has been changed, so the clusters involved in Metro Availability cannot authenticate to the Witness server

Resolution

  1. Witness VM is down or not reachable over the network
    • Check if the Witness VM is up and running
    • Ping the Witness VM to confirm if it is accessible over the network.
  2. Witness VM internal server errors
    • Check for any errors or alerts reported on the Witness VM.
  3. Witness VM authentication/password errors
    • Confirm that the password used to connect to the Witness VM is valid.

Nutanix Files Server Migration Issue

Issue 5: Nutanix Files: Issues while migrating a Nutanix Files server between ESXi Nutanix clusters (Metro Availability pair)

Symptoms

When migrating a Nutanix Files server cluster to the remote site, where the remote site is the other side of a Nutanix Metro Availability pair, you may see an issue during activation where the task to activate the Nutanix Files server hangs at 47% until it finally times out.

Resolution

  • Review the VMs in vSphere. You will see two entries for each FSVM, one labeled “<FSVM_name> (Orphaned)” and another labeled “<FSVM_name> (1)”.
  • Remove each orphaned FSVM entry from the vCenter inventory.

Note: Removing a VM from inventory does not delete the VM from disk.

  • For each “<FSVM_name> (1)”, right-click the VM, select Rename, and remove the “ (1)” from the end of the name.

You should now have just one entry per FSVM, each with the correct name.

Having completed the workaround above, you should be able to run the Activate workflow for the migrated Files server without issue.

Conclusion

Hopefully this post helps you resolve the Nutanix Metro Availability issues described above.
