Nutanix Cluster Services Down – Troubleshooting


HyperHCI Admin Nutanix CVM July 29, 2020

A Nutanix cluster runs many services on each Controller VM (CVM) to operate and maintain the cluster. When some of those services go down, the cluster health is no longer green, and you need to troubleshoot the affected services to bring them back up.

Troubleshooting down services on a Nutanix cluster or CVM can be difficult for an administrator, so I have listed some simple steps for the most common cases.

Nutanix Cluster services Troubleshooting

Let’s go through the common Nutanix cluster / CVM service-down issues, resolve them, and turn the Nutanix cluster / CVM health back to green.

Issue 1: Upgrade stuck because Genesis cannot start services after the Cassandra service

Resolution: Run the following command from any Nutanix CVM in the cluster to restart Genesis on all CVMs.

nutanix@cvm$ allssh 'genesis restart'

Issue 2: Unreachable DNS server can prevent 2 node clusters from starting services after failure

Resolution: Check the DNS / name server entries in the cluster configuration and in the CVM configuration files, then check connectivity to those servers.

Command 1: Check the DNS / name server entry in the cluster configuration

nutanix@cvm:~$ zeus_config_printer | grep name_server

Command 2: Then check the DNS / name server entry in the configuration file on all CVMs.

nutanix@cvm:~$ allssh "cat /etc/resolv.conf"

If no DNS entry is found, add the DNS server IP address / host name from Prism.

Make sure the DNS server is reachable before adding its IP address / host name.
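To check each configured name server in turn, it can help to pull the addresses out of the resolv.conf text first. The following is a minimal sketch; the function name list_nameservers is hypothetical and is not a Nutanix tool.

```shell
# Print the address of every active "nameserver" line read from stdin.
# Commented-out entries and other directives (search, options) are skipped.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }'
}
```

One possible use is `list_nameservers < /etc/resolv.conf | while read -r ip; do ping -c 1 -W 2 "$ip" >/dev/null && echo "$ip reachable" || echo "$ip unreachable"; done`. A ping reply only shows that the server answers ICMP, not that it resolves names, so treat it as a first check.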

Issue 3: SSP: Enabling Self-Service Portal Services

Resolution: Enable the SSP services on all Nutanix CVMs.

Services for the Self-Service Portal (SSP) feature are disabled by default on AHV hosts on which the Controller VM has less than 24 GB of memory.

SSP is supported on AHV hosts only.

Step 1: Check the Nutanix CVM memory allocation. It must be at least 24 GB.

nutanix@cvm$ free -m
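The total can also be read out of the `free -m` output by a small helper, which is handy when the check is repeated on several CVMs. This is only a sketch: the function names are hypothetical, and the threshold is left to the caller because `free` reports usable memory, which is typically a little below the amount allocated to the VM.

```shell
# Print the "total" column (in MiB) of the "Mem:" line from `free -m` output on stdin.
mem_total_mib() {
    awk '$1 == "Mem:" { print $2; exit }'
}

# Succeed when the total read from stdin is at least the given number of MiB.
mem_at_least() {
    threshold_mib=$1
    total_mib=$(mem_total_mib)
    [ -n "$total_mib" ] && [ "$total_mib" -ge "$threshold_mib" ]
}
```

For example, `free -m | mem_at_least 24000 || echo "CVM memory is below the threshold"`, where 24000 is an assumed value chosen slightly under 24 GB for the reason given above.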

Step 2: If the Nutanix CVM memory allocation is less than 24 GB, scale it up to at least 24 GB.

Option 1: Increase the Nutanix CVM memory from the Prism console

Option 2: Increase the Nutanix CVM memory from the command line

Step 3: Restart the Genesis service on all Nutanix CVMs

nutanix@cvm$ allssh genesis restart
nutanix@cvm$ allssh genesis stop prism
nutanix@cvm$ cluster start

Issue 4: Nutanix CVM / Cluster Services are down

Let’s troubleshoot the Nutanix cluster / CVM services-down issue. First, get to know the critical Nutanix cluster services.

Some of the critical Nutanix services are listed here:

acropolis
anduril
aplos
aplos_engine
catalog
cluster_config
cluster_sync
delphi
ergon
flow
lazan
minerva_cvm
snmp_manager
sys_stat_collector
uhura
xtrim

Resolution: Check the Nutanix CVM / cluster services status and restart the services that are down.

Step 1: Check the Nutanix CVM / cluster services status

nutanix@cvm$ ncc health_checks run_all
nutanix@cvm$ ncc health_checks system_checks cluster_services_status
nutanix@cvm$ ncc health_checks system_checks cvm_services_status
nutanix@cvm$ ncc health_checks hypervisor_checks check_services
nutanix@cvm$ ncc health_checks system_checks cluster_services_down_check

The NCC health check cluster_services_status verifies if the Controller VM (CVM) services have restarted recently across the cluster.

The following services are checked:

alert_manager
arithmos
cassandra_monitor
cerebro
chronos_node_main
cluster_manager_monitor
hyperint_monitor
pithos
prism_monitor
stargate
stargate_monitor_main
stats_aggregator_monitor
zookeeper_monitor
curator

Step 2: List the down services on all Nutanix CVMs

nutanix@cvm$ cluster status | grep -v UP
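Filtering with grep still leaves the CVM header lines in the output. A slightly more targeted filter can print only the CVM address and the name of each down service. This is a sketch under an assumed output layout (a "CVM: <ip> ..." header line followed by lines of service name, state, and process IDs); verify it against the output of your AOS version. The function name is hypothetical.

```shell
# Read `cluster status` output on stdin and print "<cvm_ip> <service>"
# for every service whose state column is DOWN.
# Assumed layout:
#   CVM: <ip> Up
#       <Service>   UP|DOWN   [pids]
list_down_services() {
    awk '
        $1 == "CVM:" { cvm = $2; next }   # remember which CVM the next lines belong to
        $2 == "DOWN" { print cvm, $1 }    # report the service under that CVM
    '
}
```

It would be used as `cluster status | list_down_services`. No output means no service was reported as DOWN in the parsed text.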

Step 3: Start Nutanix CVM / Cluster services

nutanix@cvm$ cluster start

Note: The above command does not impact your running production VMs.

Optional Step 4: If the command in step 3 does not start the down services, you can reboot either the Nutanix CVM or the Nutanix node.

Step 4.1: Reboot Nutanix CVM

nutanix@cvm$ cvm_shutdown -r now

Step 4.1.1: Or shut down the Nutanix CVM

nutanix@cvm$ cvm_shutdown -P now

Step 4.1.2: Power on the shut-down Nutanix CVM

SSH to the Nutanix AHV host

root# virsh list --all | grep CVM

In the output you will see the CVM name. Copy it and run the following command to start the Nutanix CVM:

root# virsh start <CVM_Name>

Wait about 5 minutes for the Nutanix CVM and its services to boot up.

Or Step 4.2: Put the host in maintenance mode and then reboot the node.

Final Step: Check the Nutanix cluster status and the running services.
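Instead of waiting a fixed time, you can poll until a check succeeds or a timeout expires. The helper below is a generic sketch, not a Nutanix command; its name and arguments are hypothetical.

```shell
# Run a command repeatedly until it succeeds.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 once the timeout has elapsed.
wait_until() {
    timeout_s=$1
    interval_s=$2
    shift 2
    elapsed_s=0
    until "$@"; do
        [ "$elapsed_s" -ge "$timeout_s" ] && return 1
        sleep "$interval_s"
        elapsed_s=$((elapsed_s + interval_s))
    done
}
```

For example, `wait_until 300 15 ping -c 1 -W 2 <CVM_IP>` run from the AHV host gives the CVM up to 5 minutes to answer. A ping reply only shows that the CVM is on the network; the services may need more time, so still confirm with the cluster status check.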

nutanix@cvm$ cluster status

Issue 5: Nutanix Gateway not reachable. Http request error

Resolution: Restart the Nutanix Prism (web console) service on the CVM that is the Prism leader.

Step 1: Find the Nutanix Prism leader, that is, the CVM in the cluster that is running the Prism services.

nutanix@cvm$ curl http://0:2019/prism/leader && echo

The output should look similar to the following:

{"leader":"xx.xx.xx.10:9080", "is_local":false}

This means the CVM xx.xx.xx.10 is the Prism leader.
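If you want only the address, for example to SSH to the leader in the next step, the "leader" field can be cut out of that JSON. This is a minimal sketch that assumes the response has the shape shown above; the function name is hypothetical.

```shell
# Read the Prism leader JSON on stdin and print the IP address from the
# "leader" field, dropping the ":port" suffix.
prism_leader_ip() {
    sed -n 's/.*"leader"[[:space:]]*:[[:space:]]*"\([^":]*\):.*/\1/p'
}
```

It would be used as `curl -s http://0:2019/prism/leader | prism_leader_ip`. If the response does not contain a "leader" field in this form, nothing is printed.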

Step 2: SSH to the Prism leader and run the following commands to restart the Prism service.

nutanix@cvm$ genesis stop prism
nutanix@cvm$ cluster start

Note: The above commands have no impact on running production VMs.

Issue 6: Critical : Cluster Service: Aplos is down on the Controller VM

Issue 7: LCM upgrade fails with error “Services not up” on a 2-node cluster

Resolution: Both of the above issues involve services that run in the LCM framework.

These are known issues, so it is recommended to upgrade Nutanix NCC and the LCM framework to the latest available versions.

Hopefully you have learned something new and interesting today.

Thanks for being with HyperHCI Tech Blog. Stay tuned and keep learning till your last breath.
