A Nutanix cluster runs many services on each Controller VM (CVM) to operate and maintain the cluster. When some of those services go down, you need to troubleshoot them and bring them back up so that the cluster status returns to green.
Troubleshooting down services on a Nutanix cluster or CVM can be difficult for any administrator, so I have listed simple steps below for the most common cases.
Nutanix Cluster Services Troubleshooting
Let’s troubleshoot the common Nutanix cluster / CVM service-down issues and turn the cluster / CVM health back to green.
Read also: Nutanix Cluster Most Critical Services
Issue 1: Upgrade stuck because Genesis cannot start services after the Cassandra service
Resolution: Run the following command from any Nutanix CVM in the cluster to restart Genesis on all CVMs.
nutanix@cvm$ allssh 'genesis restart'
Issue 2: An unreachable DNS server can prevent 2-node clusters from starting services after a failure
Resolution: Check the DNS / name server entries in the cluster configuration and in each CVM configuration file, then check connectivity to the DNS server.
Command 1: Check the DNS / name server entry in the cluster configuration
nutanix@cvm:~$ zeus_config_printer | grep name_server
Command 2: Check the DNS / name server entry in the configuration file on all CVMs
nutanix@cvm:~$ allssh "cat /etc/resolv.conf"
If no DNS entry is found, add the DNS server IP address / host name from Prism.
Make sure the DNS server is reachable before you add its IP address / host name.
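When the resolv.conf output from several CVMs is long, a small filter can make missing entries easier to spot. The following is only a sketch: list_nameservers is a hypothetical helper (not a Nutanix command), and the sample text is made up for illustration.

```shell
#!/bin/sh
# list_nameservers: hypothetical helper, not part of any Nutanix tooling.
# Reads resolv.conf-style text on stdin and prints each nameserver address.
# Commented-out entries are skipped because their first field is not
# exactly "nameserver".
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }'
}

# Illustrative sample only; real CVM output will differ.
sample='# example resolv.conf
nameserver 10.0.0.53
nameserver 10.0.0.54
search example.local'

servers=$(printf '%s\n' "$sample" | list_nameservers)
if [ -z "$servers" ]; then
    echo "no nameserver entries found"
else
    echo "$servers"
fi
```

On a CVM, the same function could be fed from the real file, for example: list_nameservers < /etc/resolv.conf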
Issue 3: SSP: Enabling Self-Service Portal Services
Resolution: Enable the SSP services on all Nutanix CVMs.
Services for the Self-Service Portal (SSP) feature are disabled by default on AHV hosts on which the Controller VM has less than 24 GB of memory.
SSP is supported on AHV hosts only.
Step 1: Check the Nutanix CVM memory allocation. It must be at least 24 GB.
nutanix@cvm$ free -m
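As a rough aid for reading the free -m output, the total can be pulled out and compared with a threshold. This is a sketch under stated assumptions: mem_total_mb and below_threshold are hypothetical helper names, the sample output is invented, and because free reports slightly less than the memory allocated to a VM, the threshold may need to be set a little below 24 * 1024 MB.

```shell
#!/bin/sh
# mem_total_mb: hypothetical helper. Reads `free -m` output on stdin and
# prints the "total" column of the "Mem:" row (in MB).
mem_total_mb() {
    awk '$1 == "Mem:" { print $2 }'
}

# below_threshold: succeeds when $1 (total MB) is less than $2 (threshold MB).
below_threshold() {
    [ "$1" -lt "$2" ]
}

# Illustrative sample only; real CVM output will differ.
sample='              total        used        free      shared  buff/cache   available
Mem:          32011       18234        2100         512       11677       12900
Swap:             0           0           0'

total=$(printf '%s\n' "$sample" | mem_total_mb)
# Assumption: 24 GB is treated as 24 * 1024 MB; lower this if needed.
threshold=$((24 * 1024))

if below_threshold "$total" "$threshold"; then
    echo "CVM memory ${total} MB is below ${threshold} MB"
else
    echo "CVM memory ${total} MB meets the ${threshold} MB threshold"
fi
```

On a CVM, the live value could be read with: free -m | mem_total_mb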
Step 2: If the Nutanix CVM memory allocation is less than 24 GB, scale it up to at least 24 GB.
Option 1: Increase the Nutanix CVM memory from the Prism console
Option 2: Increase the Nutanix CVM memory from the command line
Step 3: Restart the Genesis service on all Nutanix CVMs, then restart the Prism service and start the cluster services
nutanix@cvm$ allssh genesis restart
nutanix@cvm$ allssh genesis stop prism
nutanix@cvm$ cluster start
Issue 4: Nutanix CVM / Cluster Services are down
Let’s troubleshoot the Nutanix cluster / CVM services-down issue. First, get familiar with the critical Nutanix cluster services.
A few of the critical Nutanix services are listed here:
- acropolis
- anduril
- aplos
- aplos_engine
- catalog
- cluster_config
- cluster_sync
- delphi
- ergon
- flow
- lazan
- minerva_cvm
- snmp_manager
- sys_stat_collector
- uhura
- xtrim
Read more: Nutanix Cluster Most Critical Services
Resolution: Check the Nutanix CVM / cluster services status and restart the services that are down.
Step 1: Check the Nutanix CVM / cluster services status
nutanix@cvm$ ncc health_checks run_all
nutanix@cvm$ ncc health_checks system_checks cluster_services_status
nutanix@cvm$ ncc health_checks system_checks cvm_services_status
nutanix@cvm$ ncc health_checks hypervisor_checks check_services
nutanix@cvm$ ncc health_checks system_checks cluster_services_down_check
The NCC health check cluster_services_status verifies if the Controller VM (CVM) services have restarted recently across the cluster.
The following services are checked:
- alert_manager
- arithmos
- cassandra_monitor
- cerebro
- chronos_node_main
- cluster_manager_monitor
- hyperint_monitor
- pithos
- prism_monitor
- stargate
- stargate_monitor_main
- stats_aggregator_monitor
- zookeeper_monitor
- curator
Step 2: List the services that are down on all Nutanix CVMs
nutanix@cvm$ cluster status | grep -v UP
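If you want the down services grouped by CVM instead of a raw grep, the output can be post-processed. This is a minimal sketch that assumes each CVM section starts with a line beginning "CVM:" followed by its IP address, and that each service line has the service name in the first column and UP or DOWN in the second. Verify that layout against your own cluster status output. down_services is a hypothetical helper, and the sample text is invented.

```shell
#!/bin/sh
# down_services: hypothetical helper. Reads `cluster status`-style text on
# stdin and prints "<cvm_ip> <service>" for every service marked DOWN.
# Assumed layout: "CVM: <ip> <state>" header lines, then one line per
# service with the name in column 1 and UP/DOWN in column 2.
down_services() {
    awk '
        $1 == "CVM:"  { cvm = $2; next }   # remember which CVM we are in
        $2 == "DOWN"  { print cvm, $1 }    # report services that are down
    '
}

# Illustrative sample only; real output will differ.
sample='CVM: 10.0.0.11 Up
                Zeus   UP       [3148, 3161]
            Stargate   UP       [4021, 4050]
CVM: 10.0.0.12 Up
                Zeus   UP       [3090, 3102]
            Stargate   DOWN     []'

printf '%s\n' "$sample" | down_services
```

On a CVM, the live output could be piped in the same way: cluster status | down_services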
Step 3: Start the Nutanix CVM / cluster services
nutanix@cvm$ cluster start
Note: The above command does not impact your running production VMs.
Optional Step 4: If the command in Step 3 does not start the down services, you can reboot either the Nutanix CVM or the Nutanix node.
Step 4.1: Reboot Nutanix CVM
nutanix@cvm$ cvm_shutdown -r now
Step 4.1.1: OR shut down the Nutanix CVM
nutanix@cvm$ cvm_shutdown -P now
Step 4.1.2: Power on the shut-down Nutanix CVM
SSH to the Nutanix AHV host
root# virsh list --all | grep CVM
In the output you will see the CVM name. Copy it and run the following command to start the Nutanix CVM.
root# virsh start <CVM_Name>
Wait about 5 minutes for the Nutanix CVM and its services to boot up.
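Instead of waiting a fixed time, you could poll until a check of your choice succeeds. The helper below is a generic sketch, not a Nutanix tool: wait_until is a hypothetical name, and the check command you pass to it (for example an SSH probe to the CVM) is your own assumption.

```shell
#!/bin/sh
# wait_until: hypothetical helper. Runs a command repeatedly until it
# succeeds or the attempt limit is reached.
#   $1 = maximum number of attempts
#   $2 = seconds to sleep between attempts
#   remaining arguments = the command to run
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Demo with a command that always succeeds; replace `true` with a real check.
if wait_until 3 0 true; then
    echo "check succeeded"
else
    echo "check did not succeed in time"
fi
```

For a 5-minute budget, something like wait_until 30 10 followed by your check command would try every 10 seconds for up to 30 attempts.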
OR Step 4.2: Put the host in maintenance mode and then reboot the node
Read more: Enable Nutanix CVM, AHV Maintenance mode
Read more: How to Shutdown / Reboot Nutanix AHV Host and Nutanix CVM
Final Step: Now check the Nutanix cluster status and the running services.
nutanix@cvm$ cluster status
Issue 5: Nutanix Gateway not reachable. Http request error
Resolution: Restart the Prism service on the CVM that is the Prism leader.
Step 1: Find the Prism leader, that is, the CVM currently running the Prism services for the cluster.
nutanix@cvm$ curl http://0:2019/prism/leader && echo
The output should look similar to the following:
{"leader":"xx.xx.xx.10:9080", "is_local":false}
This means the CVM xx.xx.xx.10 is the Prism leader.
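To use the leader address in a script, the IP can be extracted from that JSON line. The following sketch assumes the response keeps the "leader":"<ip>:<port>" shape shown above. prism_leader_ip is a hypothetical helper, and the sample address is invented.

```shell
#!/bin/sh
# prism_leader_ip: hypothetical helper. Reads the JSON response on stdin and
# prints only the IP part of the "leader" value ("<ip>:<port>").
# Prints nothing if the expected pattern is not found.
prism_leader_ip() {
    sed -n 's/.*"leader"[[:space:]]*:[[:space:]]*"\([^":]*\):[0-9]*".*/\1/p'
}

# Illustrative sample only; a real response carries your own CVM address.
sample='{"leader":"10.0.0.10:9080", "is_local":false}'

leader=$(printf '%s\n' "$sample" | prism_leader_ip)
echo "Prism leader: ${leader}"
```

On a CVM, the live response could be piped in directly: curl -s http://0:2019/prism/leader | prism_leader_ip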
Step 2: SSH to the Prism leader and run the following commands to restart the Prism service.
nutanix@cvm$ genesis stop prism
nutanix@cvm$ cluster start
Note: The above commands have no impact on running production workloads.
Read also: Nutanix Prism web console is slow, not working, hanging issues troubleshooting
Issue 6: Critical: Cluster Service: Aplos is down on the Controller VM
Issue 7: LCM upgrade fails with error “Services not up” on a 2-node cluster
Resolution: Both of the issues above involve the LCM framework.
This is a known issue, so it is recommended to upgrade Nutanix NCC and the LCM framework to the latest available versions.
Read also: How Nutanix LCM Life Cycle Management Framework Works ?
Hopefully, you have learned something new and interesting today.
Thanks for being with HyperHCI Tech Blog. Stay tuned and keep learning.