Nutanix Cluster Services Down – Troubleshooting


A Nutanix cluster and its Controller VMs (CVMs) run many services to operate and maintain the cluster. When those services go down, you need to troubleshoot them and bring them back up so that the cluster status returns to green.

Troubleshooting down services on a Nutanix cluster or CVM can be difficult for an administrator, so this post lists simple steps for the most common cases.

Nutanix Cluster Services Troubleshooting

Let's work through the common Nutanix cluster / CVM service-down issues, resolve them, and turn Nutanix cluster / CVM health back to green.

Read also: Nutanix Cluster Most Critical Services

Issue 1: Upgrade stuck because Genesis cannot start services after the Cassandra service

Resolution: Restart Genesis on all CVMs by running the following command from any Nutanix CVM in the cluster.

nutanix@cvm$ allssh 'genesis restart'

Issue 2: An unreachable DNS server can prevent 2-node clusters from starting services after a failure

Resolution: Check the DNS / name server entries in the cluster configuration and in the CVM configuration file, then check connectivity to those servers.

Command 1: Check the DNS / name server entry in the cluster configuration

nutanix@cvm:~$ zeus_config_printer | grep name_server

Command 2: Then check the DNS / name server entry in the configuration file on all CVMs.

nutanix@cvm:~$ allssh "cat /etc/resolv.conf"

If no DNS entry is found, add the DNS server IP address / host name from Prism.

Make sure the DNS server is reachable before adding its IP address / host name.
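The DNS checks above can be sketched as a small shell helper. This is only an illustration, not a Nutanix tool: the function names list_nameservers and check_nameservers are hypothetical, and a single ping is used as a rough stand-in for "the DNS server is reachable".

```shell
# Print the address on every "nameserver" line of a resolv.conf-style file.
# The file path is passed as $1 and defaults to /etc/resolv.conf.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Hypothetical helper: report whether each listed server answers one ping.
# ICMP reachability is only a rough proxy for a working DNS service.
check_nameservers() {
    local ns
    for ns in $(list_nameservers "$1"); do
        if ping -c 1 -W 2 "$ns" > /dev/null 2>&1; then
            echo "$ns reachable"
        else
            echo "$ns UNREACHABLE"
        fi
    done
}
```

Running check_nameservers with no argument on a CVM checks the servers in that CVM's own /etc/resolv.conf.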

Issue 3: SSP: Enabling Self-Service Portal Services

Resolution: Enable the SSP services on all Nutanix CVMs.

Services for the Self-Service Portal (SSP) feature are disabled by default on AHV hosts on which the Controller VM has less than 24 GB of memory.

SSP is supported on AHV hosts only.

Step 1: Check the Nutanix CVM memory allocation. It must be at least 24 GB.

nutanix@cvm$ free -m
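As a minimal sketch, the memory check can be scripted by reading the "Mem:" row of the free -m output. The function names mem_total_mib and meets_min_memory are hypothetical, and the default threshold of 24576 MiB is an assumption: the total reported inside the guest is usually slightly lower than the allocated size, so a lower threshold may be more practical.

```shell
# Read `free -m` output on stdin and print the total memory in MiB
# (second column of the "Mem:" row).
mem_total_mib() {
    awk '$1 == "Mem:" { print $2 }'
}

# Succeed if the given total (MiB) is at least the threshold (MiB).
# The 24576 MiB default is an assumption; adjust it as needed.
meets_min_memory() {
    local total_mib="$1" min_mib="${2:-24576}"
    [ "$total_mib" -ge "$min_mib" ]
}
```

For example, meets_min_memory "$(free -m | mem_total_mib)" succeeds only when the CVM reports at least the threshold.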

Step 2: If the Nutanix CVM memory allocation is less than 24 GB, scale it up to at least 24 GB.

Option 1: Increase / scale up Nutanix CVM memory from the Prism console

Option 2: Increase / scale up Nutanix CVM memory from the command line

Step 3: Restart the Genesis service on all Nutanix CVMs

nutanix@cvm$ allssh genesis restart
nutanix@cvm$ allssh genesis stop prism
nutanix@cvm$ cluster start

Issue 4: Nutanix CVM / Cluster Services are down

Let's troubleshoot the Nutanix cluster / CVM services down issue. First, get familiar with the critical Nutanix cluster services.

A few of the critical Nutanix services:

  • acropolis
  • anduril
  • aplos
  • aplos_engine
  • catalog
  • cluster_config
  • cluster_sync
  • delphi
  • ergon
  • flow
  • lazan
  • minerva_cvm
  • snmp_manager
  • sys_stat_collector
  • uhura
  • xtrim

Read more: Nutanix Cluster Most Critical Services

Resolution: Check the status of the Nutanix CVM / cluster services and restart the ones that are down.

Step 1: Check Nutanix CVM / Cluster services status

nutanix@cvm$ ncc health_checks run_all
nutanix@cvm$ ncc health_checks system_checks cluster_services_status
nutanix@cvm$ ncc health_checks system_checks cvm_services_status
nutanix@cvm$ ncc health_checks hypervisor_checks check_services
nutanix@cvm$ ncc health_checks system_checks cluster_services_down_check

The NCC health check cluster_services_status verifies if the Controller VM (CVM) services have restarted recently across the cluster.

The following services are checked:

  • alert_manager
  • arithmos
  • cassandra_monitor
  • cerebro
  • chronos_node_main
  • cluster_manager_monitor
  • hyperint_monitor
  • pithos
  • prism_monitor
  • stargate
  • stargate_monitor_main
  • stats_aggregator_monitor
  • zookeeper_monitor
  • curator

Step 2: List the down services on all Nutanix CVMs

nutanix@pcvm$ cluster status | grep -v UP
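The grep filter above also prints header and CVM lines. As a hedged alternative, the sketch below prints only the CVM address and the service name for each down service. It assumes that each CVM block starts with a line beginning "CVM:" and that each service row reads "<name> <state> [pids]" with a single-word name. The function name list_down_services is hypothetical.

```shell
# Read `cluster status` output on stdin and print "<cvm_ip> <service>"
# for every service row whose state column is DOWN.
# Assumed format: "CVM: <ip> ..." header lines followed by
# "<service> <UP|DOWN> [pids]" rows.
list_down_services() {
    awk '
        $1 == "CVM:" { cvm = $2; next }   # remember which CVM the next rows belong to
        $2 == "DOWN" { print cvm, $1 }    # service row marked DOWN
    '
}
```

Typical use would be cluster status | list_down_services, which prints nothing when every service is up.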

Step 3: Start Nutanix CVM / Cluster services

nutanix@pcvm$ cluster start

Note: The above command does not impact your running production VMs.

Optional Step 4: If the command in step 3 does not start the down services, you can reboot either the Nutanix node or the Nutanix CVM.

Step 4.1: Reboot Nutanix CVM

nutanix@cvm$ cvm_shutdown -r now

Step 4.1.1: Or shut down the Nutanix CVM

nutanix@cvm$ cvm_shutdown -P now

Step 4.1.2: Power on the shut-down Nutanix CVM

SSH to the Nutanix AHV host

root# virsh list --all | grep CVM

In the output you will see the CVM name. Copy it and run the following command to start the Nutanix CVM.

root# virsh start <CVM_Name>

Wait about 5 minutes for the Nutanix CVM and its services to boot up.

Or Step 4.2: Put your host in maintenance mode and then reboot the node
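Instead of waiting a fixed time, you could poll until the CVM responds. The helper below is a generic sketch, not a Nutanix command. The name retry_until and the choice of probe command are assumptions.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# Usage: retry_until <attempts> <delay_seconds> <command> [args...]
retry_until() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        # Sleep only if another attempt is still to come.
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

For example, retry_until 30 10 ping -c 1 -W 2 <CVM_IP> probes the CVM every 10 seconds for up to roughly 5 minutes. A successful ping only shows that the CVM is on the network, so still confirm the services afterwards with cluster status.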

Read more: Enable Nutanix CVM, AHV Maintenance mode

Read more: How to Shutdown / Reboot Nutanix AHV Host and Nutanix CVM

Final Step: Now check the Nutanix cluster status and the running services.

nutanix@pcvm$ cluster status

Issue 5: Nutanix Gateway not reachable. Http request error

Resolution: Restart the Nutanix console (Prism) services on the CVM that is the Prism leader.

Step 1: Find the Nutanix Prism leader. Verify which cluster node is the Prism leader, that is, the CVM running the Prism container services.

nutanix@cvm$ curl http://0:2019/prism/leader && echo

The output should look similar to the following:

{"leader":"xx.xx.xx.10:9080", "is_local":false}

This means the CVM xx.xx.xx.10 is the Prism leader.

Step 2: SSH to the Prism leader and run the following commands to restart the Prism service.
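If you want only the leader's IP address, for use in a script, the JSON above can be trimmed with sed. This sketch assumes the response has the same shape as the sample output. The function name prism_leader_ip is hypothetical.

```shell
# Read the Prism leader JSON on stdin and print only the IP address,
# i.e. the part of the "leader" value before the ":<port>" suffix.
prism_leader_ip() {
    sed -n 's/.*"leader" *: *"\([^:"]*\):.*/\1/p'
}
```

For example, curl -s http://0:2019/prism/leader | prism_leader_ip prints the address of the CVM to SSH into.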

nutanix@cvm$ genesis stop prism 
nutanix@cvm$ cluster start

Note: The above commands have no impact on running production workloads.

Read also: Nutanix Prism web console is slow, not working, hanging issues troubleshooting

Issue 6: Critical : Cluster Service: Aplos is down on the Controller VM

Issue 7: LCM upgrade fails with error “Services not up” on a 2-node cluster

Resolution: Both of the above services run in the LCM framework.

This is a known issue, so it is recommended to upgrade Nutanix NCC and the LCM framework to the latest available versions.

Read also: How Nutanix LCM Life Cycle Management Framework Works ?

Hopefully, you have learned something new and interesting today.

Thanks for being with HyperHCI Tech Blog. Stay tuned and keep learning till your last breath.