Nutanix Metro Availability Troubleshooting

HyperHCI Admin Nutanix Cluster February 18, 2020

Today I will explore the top 5 common errors and issues with Nutanix Disaster Recovery Metro Availability and share tips and tricks for troubleshooting them through to resolution. This post should be helpful to anyone who is stuck on a Nutanix Metro Availability technical issue and looking for a solution.

The Nutanix Metro Availability feature is currently available with VMware ESXi only; support on Nutanix AHV is planned for the future. The errors and issues covered here are ones you might face during or after Metro Availability configuration.

Nutanix Metro Availability common Issues & Troubleshooting

Below are the top 5 common errors and issues with Nutanix Disaster Recovery Metro Availability, each with its resolution. Let's go through them step by step.

Automatic Promoting Active-Passive Issue

Issue 1: Alert – A130116 – Automatic Promote Metro Availability

Symptoms:

The A130116 – Automatic Promote Metro Availability alert is raised on the standby site in a Witness Metro configuration when the PD (protection domain) on the standby site is promoted to Active.

Possible conditions are:

Witness VM cannot access the VIP (virtual IP address) of the site where the PD was initially active
Communication between the VIP of the Active and the Standby site is interrupted

Resolution

Verify that the 2 sites are up and can communicate with each other over the VIP
Verify that the 2 sites can communicate with the Witness VM.
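
The two checks above can be sketched as a small shell helper, run from a CVM on each site. This is only an illustration: the function name check_reachable and the example addresses are hypothetical, and it assumes a Linux ping that accepts -c (count) and -W (timeout in seconds).

```shell
# Ping each address a few times and report whether it answered.
# All addresses are placeholders supplied by the caller.
check_reachable() {
  for host in "$@"; do
    if ping -c 3 -W 2 "$host" >/dev/null 2>&1; then
      echo "$host: reachable"
    else
      echo "$host: unreachable"
    fi
  done
}

# Example (hypothetical addresses): both cluster VIPs and the Witness VM.
# check_reachable 10.1.1.10 10.2.2.10 10.3.3.10
```

A ping reply only shows basic network reachability; it does not prove that the cluster services behind the VIP are healthy.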

Metro Availability Failure Issue

Issue 2: Alert – A130117 – Failed to update Metro Availability failure handling

Issue 2.1: Alert – A130118 – Metro Availability- Failed to update Metro Availability failure handling on the remote site

Symptoms

An error occurred while updating failure handling on the source Metro Availability protection domain. This alert may be generated for the following reasons:

Network issue between the Nutanix clusters
Services on either cluster are in a crash loop or stopped
Ports 2009 and 2020 are temporarily blocked or unreachable

Resolution

Run the following command on any CVM to find the Cerebro master:

cvm$ cerebro_cli get_master_location

Run the following command on one of the source cluster's CVMs and review whether the ping statistics between the Active and Standby clusters look healthy. It prints the remotes whose latency exceeds 10 ms or that are unreachable.

cvm$ cat ~/data/logs/sysstats/ping_remotes.INFO | egrep -v "IP : time" | awk '/^#TIMESTAMP/ || $3>10.00 || $3=="unreachable"' | egrep -B1 " ms|unreachable" | egrep -v "\-\-"

The output looks like this:

#TIMESTAMP 2941677438847 : 10/02/2020 09:10:50 PM
10.X.X.X : 180 ms
10.Y.Y.Y : unreachable
10.Z.Z.Z : 180 ms
10.Q.Q.Q : unreachable
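
If you run this check often, the same filtering can be wrapped in a function with an adjustable threshold. The sketch below is an assumption-laden illustration: filter_ping_log is a hypothetical name, and it assumes the log layout shown above (a #TIMESTAMP line followed by "<IP> : <value> ms" or "<IP> : unreachable" lines).

```shell
# Print each timestamp header followed by any remote whose latency exceeds
# the threshold (in ms) or that is marked unreachable.
# Usage: filter_ping_log <log file> [threshold_ms]   (default threshold: 10)
filter_ping_log() {
  log_file=$1
  threshold_ms=${2:-10}
  awk -v limit="$threshold_ms" '
    /^#TIMESTAMP/ { header = $0; next }          # remember the latest timestamp
    /IP : time/   { next }                       # skip the column header line
    $3 == "unreachable" || ($3 + 0) > limit {    # slow or unreachable remote
      if (header != "") { print header; header = "" }
      print
    }
  ' "$log_file"
}

# Example:
# filter_ping_log ~/data/logs/sysstats/ping_remotes.INFO 5
```

Unlike the one-liner, this version prints a timestamp only when at least one remote under it is slow or unreachable.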

Verify that the required ports are open between the source and remote clusters using the netcat utility:

cvm$ nc -v <remote site CVM IP> 2009
cvm$ nc -v <remote site CVM IP> 2020
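
To test both ports against several remote CVMs without holding a connection open, a loop like the following may help. It is a sketch, not a Nutanix tool: check_ports is a hypothetical name, and it assumes the installed netcat supports -z (scan without sending data) and -w (timeout in seconds).

```shell
# Report whether each given TCP port on a remote CVM accepts connections.
# Usage: check_ports <remote site CVM IP> <port> [<port> ...]
check_ports() {
  remote_ip=$1
  shift
  for port in "$@"; do
    if nc -z -w 3 "$remote_ip" "$port" >/dev/null 2>&1; then
      echo "$remote_ip:$port open"
    else
      echo "$remote_ip:$port blocked or unreachable"
    fi
  done
}

# Example, using the two ports named above:
# check_ports <remote site CVM IP> 2009 2020
```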

Verify whether any services are crashing on the source or target cluster:

cvm$ watch -d "genesis status"

High Network Latency Issue

Issue 3: High network latency between Metro Availability Protection Domains (also known as stretched clusters)

Symptoms

When network latency between the active and standby metro clusters stays above 5 ms for 10 seconds, commit acknowledgements to the VM are delayed. As a result, the remote site appears in the Incompatible Remote Sites list with LATENCY marked "Bad", and the metro relationship is automatically disabled.

Resolution

Check the ping latency between the Metro Availability clusters at regular intervals, for example every hour or every 4 hours.
Determine whether there was any snapshot or replication activity around the time of the latency spike.
Review the captured ping status and network latency in the ping_remotes.INFO log file, located in the /home/nutanix/data/logs/sysstats/ directory on each CVM.
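
For a quick spot check against the 5 ms limit, the average round-trip time can be pulled from ping's summary line and compared to a threshold. The helper names avg_rtt_ms and latency_verdict are hypothetical, and the sketch assumes the summary format printed by Linux ping ("rtt min/avg/max/mdev = ...").

```shell
# Read ping output on stdin and print the average round-trip time in ms,
# taken from a summary line such as "rtt min/avg/max/mdev = 0.4/0.5/0.6/0.1 ms".
avg_rtt_ms() {
  awk -F'/' '/^(rtt|round-trip)/ { print $5 }'
}

# Compare an average RTT with a threshold in ms (default 5, the limit cited above).
latency_verdict() {
  avg=$1
  limit=${2:-5}
  awk -v a="$avg" -v l="$limit" \
    'BEGIN { if (a + 0 > l + 0) print "above threshold"; else print "within threshold" }'
}

# Example:
# avg=$(ping -c 10 -q <remote site CVM IP> | avg_rtt_ms)
# latency_verdict "$avg"
```

A single run only samples the moment it executes, so repeat it at the intervals suggested above before drawing conclusions.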

Witness VM Not Reachable Issue

Issue 4: Alert – A130115 – Witness VM Not Reachable

Symptoms

The Nutanix Metro Availability Witness VM Not Reachable alert is generated when a cluster involved in Metro Availability cannot contact the Witness VM over the network, cannot get a response from it, or cannot authenticate to it. This alert may be generated for the following reasons:

Witness VM is down
Witness VM is not reachable from the Nutanix cluster, possibly due to a temporary or permanent network issue or the firewall configuration.
Witness VM internal server errors: the Witness VM is not responding to requests.
The Witness VM admin user password has been changed, and the clusters involved in Metro Availability cannot authenticate to the Witness server.

Resolution

Witness VM is down or not reachable over the network
- Check if the Witness VM is up and running
- Ping the Witness VM to confirm if it is accessible over the network.
Witness VM internal server errors
- Check for any errors or alerts being reported on the Witness VM.
Witness VM authentication/password errors
- Confirm the password used to connect to the Witness VM is valid.
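
The order of these checks can be sketched as a staged triage run from a CVM. Treat it as an illustration only: witness_triage is a hypothetical name, and the default port 9440 is an assumption about where the Witness VM web service listens, so confirm it for your deployment. It also assumes ping accepts -c/-W and netcat accepts -z/-w.

```shell
# Staged check of a Witness VM: network reachability first, then the TCP
# port of its web service. The port is a parameter; 9440 is an assumed default.
witness_triage() {
  witness_ip=$1
  port=${2:-9440}
  if ! ping -c 3 -W 2 "$witness_ip" >/dev/null 2>&1; then
    echo "no ping reply: check that the Witness VM is powered on and the network path is open"
  elif ! nc -z -w 3 "$witness_ip" "$port" >/dev/null 2>&1; then
    echo "ping ok, port $port closed: check firewall rules and Witness VM services"
  else
    echo "reachable on port $port: check Witness VM alerts and the stored password"
  fi
}

# Example (hypothetical address):
# witness_triage 10.3.3.10
```

The script cannot verify the password itself; if both network stages pass, the remaining causes are internal server errors or a changed admin password.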

Nutanix Files Server Migration Issue

Issue 5: Nutanix Files : Issues while Migrating Nutanix Files server between ESXi Nutanix clusters (Metro Availability pair)

Symptoms

When migrating a Nutanix Files server cluster to the remote site, where the remote site is the other side of a Nutanix Metro Availability pair, you may see an issue during activation: the task to activate the Nutanix Files server hangs at 47% until it finally times out.

Resolution

Reviewing the VMs in vSphere, you will see two entries for each FSVM: one labeled "<FSVM_name> (Orphaned)" and another labeled "<FSVM_name> (1)".

Remove each orphaned FSVM entry from the vCenter inventory.

Note: Removing a VM from inventory does not delete the VM from disk.

For each "<FSVM_name> (1)" entry, right-click the VM, select Rename, and remove the " (1)" from the end of the name.

You should now have just one entry per FSVM, with the correct name.

Having completed the workaround above, you should be able to run the Activate workflow for the migrated Files server without issue.

Conclusion

Hopefully, this post helps you resolve the Nutanix Metro Availability issues described here.

Thanks for reading the HyperHCI Tech Blog. Stay tuned and follow us on social networks.
