A Nutanix cluster runs a number of services to operate and maintain itself, but what happens when the cluster or CVM services go down? Of course, you need to troubleshoot the Nutanix cluster / CVM services to get them back up and running and turn the cluster status green.
Troubleshooting down cluster or CVM services can be difficult for any administrator, so this post lays out simple steps to troubleshoot Nutanix cluster / CVM services.
Running a Nutanix HCI environment demands stability, predictable performance, and continuous availability. When a cluster service goes down, even for a short period, administrators may face VM downtime, management plane delays, upgrade failures, and unpredictable cluster behaviour.
This updated and fully refreshed guide explains, in simple and professional language, how Nutanix Cluster Services work, why they may fail, how to troubleshoot them step-by-step, and what administrators must do to prevent recurring issues.
The goal of this post is to help every Nutanix administrator understand the problem clearly, troubleshoot in a structured flow, and keep the Nutanix environment healthy.
If you are exploring Nutanix AHV-AOS versions or compatibility, you may also check our guide: Latest Nutanix AHV & AOS versions compatible matrix
Follow this step-by-step guide, Nutanix Cluster Services Down: Latest Troubleshooting Guide for Continuous Nutanix Cluster Availability
- Understand Nutanix Cluster Services
- Cluster Services Down: Production Impact
- Nutanix Cluster Services Troubleshooting
- Issue 1#: Nutanix Cluster Health Service Is Down
- Issue 2#: Nutanix Ergon Service Is Down / Ergon Inaccessible on Nodes
- Issue 3#: How to Fix Alert A200000 – Cluster Connectivity Status
- Issue 4#: How to Fix Alert A400102 – Epsilon Service Down
- Issue 5#: Unreachable DNS server can prevent clusters from starting services after failure
- Issue 6#: SSP: Enabling Self-Service Portal Services
- Issue 7#: Nutanix CVM / Cluster Services are down
- Issue 8#: Nutanix Gateway not reachable. Http request error
- Issue 9#: Critical : Cluster Service: Aplos is down on the Controller VM
- Issue 10#: LCM upgrade fails with error "Services not up" on a 2-node cluster
- Frequently Asked Questions (FAQs)
- Q1. Why do Nutanix cluster services go down even when CVMs are running?
- Q2. What is the quickest way to diagnose Nutanix service failures?
- Q3. How do I fix Prism UI being slow or unreachable?
- Q4. Why do alerts like A200000, A300001, or Epsilon Service Down keep repeating?
- Q5. When is it safe to reboot a Nutanix CVM?
- Conclusion
Understand Nutanix Cluster Services
Nutanix follows a distributed architecture where several microservices run inside every Controller VM (CVM). These services communicate with each other continuously to maintain cluster operations. When even one of these services goes down, the entire environment may experience instability.
These services manage almost everything inside the environment, such as:
- Virtual machine orchestration
- Data replication and recovery
- Storage input and output operations
- Data compaction and balancing
- Metadata consistency
- UI and API communication
- Cluster leadership and quorum
- Security and certificate management
Some of the core Nutanix services include:
- Genesis – Service manager for starting and stopping all other services
- Stargate – Data I/O manager that serves all storage read and write requests
- Cerebro – Replication metadata and disaster recovery service
- Curator – Background data scans, cleanup, compaction, and storage optimisation
- Acropolis – VM orchestration and AHV operations
- Aplos – REST API gateway used by Prism and automation tools
- Medusa – Access layer for the distributed metadata store
- Pithos – vDisk configuration data service
- Epsilon – Container orchestration service supporting Nutanix Self-Service (Calm)
- Zookeeper – Cluster configuration, quorum, and leader election
- Cluster-Gateway and API Services – For Prism communication
Core Nutanix Services and Their Purpose:
| Service Name | Purpose |
|---|---|
| Genesis | Controls start and stop of all services on CVMs |
| Stargate | Data I/O manager serving all storage read and write requests |
| Cerebro | Powering disaster recovery, snapshots, and replication |
| Curator | Data balancing, cleanup, compaction, and storage optimization |
| Acropolis | VM orchestration, AHV operations, cluster-wide VM tasks |
| Aplos | REST API gateway for Prism and automation tools |
| Medusa | Access layer for the distributed metadata store |
| Pithos | vDisk configuration data store |
| Epsilon | Container orchestration for Nutanix Self-Service (Calm) |
| Cluster-Gateway | API communication for Prism Element and Prism Central |
| Zookeeper | Distributed coordination, configuration tracking, quorum, and leader election |
These services work like a tightly connected engine. If one component stops responding, the entire Nutanix environment may lose functionality.
To understand all Nutanix cluster components in detail, you can read our post: Nutanix Cluster Components and related services
Cluster Services Down: Production Impact
When Prism shows Cluster Services Down, it indicates that one or more essential services inside the CVM have stopped, crashed, hung, or cannot communicate with other nodes.
This may result in:
- Delayed VM operations
- Slow storage performance
- Prism UI unresponsiveness
- Replication or DR errors
- LCM or AOS upgrade failures
- Alerts showing missing services
- Node status appearing offline even when IP is reachable
In simple words: If Nutanix services are down, your cluster cannot operate as designed.
Administrators must treat this as a priority issue because the longer the service remains down, the higher the risk of cluster imbalance or metadata inconsistency.
Nutanix Cluster Services Troubleshooting
Nutanix Cluster Services Troubleshooting is one of the most important responsibilities for administrators who manage Nutanix AHV or hybrid cloud environments. Every Nutanix cluster depends on a set of distributed microservices running inside each Controller VM (CVM).
These services handle critical operations such as storage I/O, metadata management, VM orchestration, replication, and Prism communication. When even one of these services becomes unhealthy, the entire environment may show performance delays, service disruption, or “Cluster Services Down” alerts in Prism.
Most service failures are caused by common underlying factors such as DNS misconfiguration, NTP time drift, low CVM memory, disk space issues, or network connectivity interruptions. That is why troubleshooting Nutanix services requires a clear understanding of how these microservices work, how they depend on each other, and how to use commands like cluster status, genesis restart, and NCC checks to identify the exact root cause.
Let’s troubleshoot the common Nutanix cluster / CVM services down issues and turn cluster / CVM health back to green.
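As a quick triage sketch, the status-filtering step can be exercised on captured output before running it live. The abbreviated `cluster status` output below is hypothetical; on a real CVM you would pipe the command's output directly:

```shell
# Hypothetical, abbreviated 'cluster status' output for illustration only;
# on a CVM you would simply run:  cluster status | grep -v UP
status='CVM: 10.0.0.11 Up
        Genesis    UP    [3456, 3489]
        Ergon      DOWN  []
        Aplos      UP    [4012]'

# Filtering out UP lines shortlists the unhealthy services immediately
echo "$status" | grep -v "UP"
```

Only the CVM header and the `Ergon DOWN` line survive the filter, which is exactly the shortlist you want before restarting anything.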
Issue 1#: Nutanix Cluster Health Service Is Down
This usually refers to issues detected by NCC health checks or major cluster service failures.
Step by step Resolution
Step 1: Log in to any CVM in the Nutanix cluster
You can use PuTTY (SSH) or log in directly from Nutanix Prism > VM > select any CVM > Launch console > enter the credentials.
If you do not know the CVM credentials, refer to: Nutanix CVM Default User ID and Password
Run the following command to check the cluster services health:
nutanix@cvm$ ncc health_checks system_checks cluster_services_status
OR: Run full Cluster services check as mentioned below:
nutanix@cvm$ ncc health_checks run_all
Step 2: Validate Service Status
nutanix@cvm$ cluster status | grep -v UP
If the NCC health check shows related services as down, you can start all services with a single command.
Step 3: Restart All Cluster Services (Safest Way)
nutanix@cvm$ genesis restart
OR
nutanix@cvm$ cluster start
Important: The above commands are safe to run on a Nutanix cluster without any impact (zero impact) on running workloads.
Issue 2#: Nutanix Ergon Service Is Down / Ergon Inaccessible on Nodes
What Ergon Does
Ergon is the cluster task management service. It backs:
- Life Cycle Manager (LCM)
- Internal workflows
- Automated upgrades
- System automation tasks
When Ergon is down:
- LCM fails
- Prism Central tasks get stuck
- Some NCC modules fail
Step by step Resolution
Step 1: Log in to any CVM and check the Ergon service status
nutanix@cvm$ cluster status | grep ergon
OR
nutanix@cvm$ cluster status | grep -v UP
If the Ergon service is down, check the following possible causes and fix them.
Step 2: Check CVM Memory
nutanix@cvm$ free -m
nutanix@cvm$ top
If the CVM memory is lower than the recommendations below, increase it immediately:
Put the CVM in maintenance mode > shut down the CVM > increase the memory > power on the CVM
Important:
Production workloads: A minimum of 32 GB is typically used for normal workloads.
All-flash clusters: At least 40 GB is required for all-flash clusters, with 48 GB recommended to enable features such as Blockstore.
Heavy workloads: 64 GB is recommended for high-performance workloads, such as large databases or Oracle environments, and when using AOS features like compression, deduplication, or erasure coding.
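As a sketch, the `free -m` output can be parsed and compared against whichever floor you pick from the recommendations above. The sample `Mem:` line and the 32 GB floor below are illustrative assumptions; on a CVM you would capture the real `free -m` output:

```shell
# Floor chosen from the recommendations above (32 GB for normal production)
floor_mb=32768

# Sample 'free -m' output line for illustration; on a CVM use:
#   mem_line=$(free -m | grep '^Mem:')
mem_line='Mem:          24576        8000       16576'

# Column 2 of the Mem: line is total memory in MB
total_mb=$(echo "$mem_line" | awk '{print $2}')
if [ "$total_mb" -lt "$floor_mb" ]; then
  echo "CVM memory ${total_mb} MB is below the ${floor_mb} MB floor - scale up"
else
  echo "CVM memory ${total_mb} MB meets the ${floor_mb} MB floor"
fi
```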
Step 3: Validate DNS resolution
Ergon relies heavily on endpoint communication.
nutanix@cvm$ nslookup <cluster-ip>
The above Nutanix cluster IP (Prism VIP) should resolve to its FQDN.
Step 4: Check Network Port Access
Ensure no firewall blocks internal ports between CVMs. Refer Nutanix cluster firewall ports list
Step 5: Review the Ergon logs for further detail
nutanix@cvm$ cat /home/nutanix/data/logs/ergon.out
Step 6: Finally Restart Genesis
nutanix@cvm$ genesis restart
OR
nutanix@cvm$ cluster start
Issue 3#: How to Fix Alert A200000 – Cluster Connectivity Status
New AOS releases include improved alerting, better dependency visibility, and deeper service-level intelligence. Alerts such as A200000, A400102, and A300001 directly point to critical cluster services. Understanding how to troubleshoot them helps administrators reduce downtime, prevent cascading failures, and bring the cluster back to a healthy state faster.
Below are the recommended steps to diagnose and fix each alert.
This alert appears when cluster nodes cannot communicate with each other properly.
It affects:
- CVM heartbeat
- Metadata replication
- Leader election
- NCC checks
- Prism responsiveness
Root Causes
- DNS forward or reverse lookup failure
- NTP drift between nodes
- IP conflict
- VLAN or firewall rules blocking internal ports
- Incorrect subnet or gateway
- Packet drops between CVMs
- CVM host instability
How to Fix It (Step-by-Step)
Step 1: Validate CVM-to-CVM Connectivity
Run from any CVM:
nutanix@cvm$ ping <all-peer-cvm-ip>
If latency or drops appear, troubleshoot network path.
Step 2: Confirm DNS Resolution
Forward and reverse lookup must work for every CVM.
Use nslookup or dig command:
nutanix@cvm$ nslookup <all-cvm-ip>
nutanix@cvm$ nslookup <cvm-hostname>
nutanix@cvm$ dig <cluster-name>
Fix DNS entries if mismatched.
Step 3: Validate NTP Time Synchronization
nutanix@cvm$ ntpq -p
If time offset is large, fix NTP configuration.
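As a sketch, the offset column of `ntpq -p` can be screened automatically. The sample peer line and the 100 ms threshold below are assumptions; `offset` is the ninth column of `ntpq -p` output, in milliseconds:

```shell
# Hypothetical 'ntpq -p' peer line; on a CVM pipe the real output
# (skipping the two header lines) into the same awk filter
sample='*time.example.com .GPS.  1 u   32   64  377   1.234  250.101   0.456'

# Flag any peer whose absolute offset (column 9, in ms) exceeds 100 ms
echo "$sample" | awk '{ off = $9; if (off < 0) off = -off;
                        if (off > 100) print $1, "offset", $9, "ms - fix NTP" }'
```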
Step 4: Check Network Ports
Nutanix needs several internal communication ports open.
Most common issues involve firewall or ACL blocks.
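A minimal reachability sketch can use bash's built-in `/dev/tcp`, so no extra tools are needed. The port 9440 (Prism web/API) is a common example, and the loopback IP below is a placeholder you would replace with peer CVM IPs; this assumes bash with `/dev/tcp` support and the `timeout` utility:

```shell
# Sketch: test whether a TCP port answers from this CVM
check_port() {
  local host=$1 port=$2
  if timeout 2 bash -c "echo > /dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} open"
  else
    echo "${host}:${port} blocked or closed"
  fi
}

# Example: check the Prism port locally; loop over peer CVM IPs in practice
check_port 127.0.0.1 9440
```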
Step 5: Check CVM Network Configuration
nutanix@cvm$ ifconfig
nutanix@cvm$ arp -a
Look for duplicated IPs, wrong subnet masks, or VLAN issues.
Step 6: Validate Host Stability
If the hypervisor host is overloaded or unhealthy, CVMs cannot communicate reliably.
Step 7: Restart Genesis (If Services Are Unresponsive)
nutanix@cvm$ genesis restart
Issue 4#: How to Fix Alert A400102 – Epsilon Service Down
The Epsilon service provides container orchestration for Nutanix Self-Service (Calm) tasks and related workflows.
If Epsilon goes down, Self-Service operations stall and dependent cluster tasks may hang.
Symptoms
- Prism UI slow or stuck
- Metadata operations delayed
- Self-Service (Calm) tasks stuck
- NCC checks fail
- Upgrade operations stop
Root Causes
- CVM resource exhaustion
- DNS or NTP mismatch
- Zookeeper service issues
- Network path failures
- Corrupted Epsilon configuration
- Node isolation
How to Fix It
Step 1: Check Epsilon Status
nutanix@cvm$ genesis status | grep -i epsilon
Step 2: Validate DNS and NTP
Incorrect DNS or time drift breaks inter-service communication.
nutanix@cvm$ nslookup <cvm-hostname>
nutanix@cvm$ ntpq -p
Step 3: Identify Epsilon Hosting Node
Only one CVM hosts Epsilon.
Check if that CVM is healthy.
nutanix@cvm$ cluster status | grep epsilon
Step 4: Restart Epsilon
nutanix@cvm$ cluster start
This command starts any stopped services, including Epsilon, without impacting running VMs.
Step 5: Restart Genesis if Needed (optional)
nutanix@cvm$ genesis restart
Issue 5#: Unreachable DNS server can prevent clusters from starting services after failure
Resolution: Check the DNS / name server entries in the cluster and CVM configuration files and verify connectivity to the DNS server.
Command 1: Check DNS / Name server entry on cluster configuration
nutanix@cvm:~$ zeus_config_printer | grep name_server
Command 2: Check the DNS / name server entry in every CVM's configuration file.
nutanix@cvm:~$ allssh "cat /etc/resolv.conf"
If the DNS entry is not found, add the DNS server IP address / hostname from Prism as shown in the following screenshot.

Make sure the DNS server is reachable before adding its IP address / hostname.
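As a quick sketch, each CVM's resolv.conf content can be screened for a nameserver entry. The sample file content below is hypothetical; in practice you would inspect the real output of the allssh command above:

```shell
# Hypothetical /etc/resolv.conf content; on a CVM, read the real file
resolv='search example.local
nameserver 10.0.0.2
nameserver 10.0.0.3'

# Flag the CVM if no nameserver line is present
if echo "$resolv" | grep -q '^nameserver'; then
  echo "DNS entries found: $(echo "$resolv" | grep -c '^nameserver')"
else
  echo "No DNS entry - add a name server from Prism"
fi
```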
Issue 6#: SSP: Enabling Self-Service Portal Services
Resolution: Enable the SSP service on all Nutanix CVMs.
Services for the Self-Service Portal (SSP) feature are disabled by default on AHV hosts on which the Controller VM has less than 24 GB of memory.
SSP is supported on AHV hosts only.
Step 1: Check the Nutanix CVM memory allocation; it must be at least 24 GB.
nutanix@cvm$ free -m
Step 2: If the Nutanix CVM memory allocation is less than 24 GB, scale it up to at least 24 GB.
Option 1: Increase / scale up the Nutanix CVM memory from the Prism console
Option 2: Increase / scale up the Nutanix CVM memory from the command line
Step 3: Restart Genesis service on all Nutanix CVMs
nutanix@cvm$ allssh genesis restart
nutanix@cvm$ allssh genesis stop prism
nutanix@cvm$ cluster start
Issue 7#: Nutanix CVM / Cluster Services are down
Let’s troubleshoot the Nutanix cluster / CVM services down issue. First, try to understand the critical Nutanix cluster services:
Read more: Nutanix Cluster Most Critical Services
Resolution: check the Nutanix CVM / Cluster services status and restart them.
Step 1: Check Nutanix CVM / Cluster services status
nutanix@CVM$ ncc health_checks run_all
The NCC health check cluster_services_status verifies if the Controller VM (CVM) services have restarted recently across the cluster.
Step 2: Shortlist the down services on all Nutanix CVM
nutanix@cvm$ cluster status | grep -v UP
Step 3: Start Nutanix CVM / Cluster services
nutanix@cvm$ cluster start
Note: Above command will not impact your production running VMs.
Optional Step 4: If the Step 3 command does not start the down services, you can reboot either the Nutanix node or the Nutanix CVM.
Step 4.1: Reboot Nutanix CVM
nutanix@cvm$ cvm_shutdown -r now
Step 4.1.1: OR Shutdown Nutanix CVM
nutanix@cvm$ cvm_shutdown -P now
Step 4.1.2: Power on the shut-down Nutanix CVM
You can log in to Nutanix Prism > VM > select the shut-down CVM > Power on
Or start it from the AHV command line: SSH to the Nutanix AHV host
root# virsh list --all | grep CVM
The output shows the CVM name; copy it and run the following command to start the Nutanix CVM:
root# virsh start <CVM_Name>
Wait about 5 minutes for the Nutanix CVM and its services to boot up.
Step 4.2 (alternative): Put the host in maintenance mode and then reboot the node.
Read more: Enable Nutanix CVM, AHV Maintenance mode
Read more: How to Shutdown / Reboot Nutanix AHV Host and Nutanix CVM
Final Step: Now check the Nutanix cluster status and running services.
nutanix@cvm$ cluster status
Issue 8#: Nutanix Gateway not reachable. Http request error
Resolution: Restart the Prism services on the CVM that is the Prism leader.
Step 1: Find the Nutanix Prism Leader – Verify which cluster node is the Prism leader, that is, the CVM running the Prism container services.
nutanix@cvm$ curl http://0:2019/prism/leader && echo
Output should look similar as following
{"leader":"xx.xx.xx.10:9080", "is_local":false}
This means the CVM at xx.xx.xx.10 is the Prism leader.
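To script against that output, the leader IP can be extracted with `sed`. The sample response below is hypothetical; on a CVM you would capture the `curl` output instead:

```shell
# Hypothetical leader-endpoint response; on a CVM capture it with:
#   resp=$(curl -s http://0:2019/prism/leader)
resp='{"leader":"10.0.0.10:9080", "is_local":false}'

# Strip the JSON wrapper and the port, leaving just the leader CVM IP
leader_ip=$(echo "$resp" | sed -n 's/.*"leader":"\([^:"]*\).*/\1/p')
echo "Prism leader CVM: $leader_ip"
```

You would then SSH to that IP for Step 2 below.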
Step 2: SSH to Prism Leader and run the following command to restart Prism service.
nutanix@cvm$ genesis stop prism
nutanix@cvm$ cluster start
Note: The above commands have no impact on running production VMs.
Read also: Nutanix Prism web console is slow, not working, hanging issues troubleshooting
Issue 9#: Critical : Cluster Service: Aplos is down on the Controller VM
Aplos is the REST API gateway used by Prism and automation tools. If Aplos fails:
- API-driven VM creation fails
- API-driven VM migration fails
Check status:
nutanix@cvm$ cluster status | grep aplos
Fix: Restart the Aplos services via Genesis:
nutanix@cvm$ genesis stop aplos aplos_engine
nutanix@cvm$ cluster start
Issue 10#: LCM upgrade fails with error “Services not up” on a 2-node cluster
Resolution: The affected services run within the LCM framework.
This is a known issue, so it is recommended to upgrade Nutanix NCC and the LCM framework to the latest available versions.
Read also: How Nutanix LCM Life Cycle Management Framework Works?
Hopefully, you have learned something new and interesting today.
Thanks for being with HyperHCI Tech Blog; stay tuned and keep learning.
Frequently Asked Questions (FAQs)
Troubleshooting Nutanix cluster services often raises many recurring questions for administrators, especially when dealing with alerts, service failures, Prism issues, or dependency errors.
Whether you are diagnosing service failures, improving cluster health, or learning how Nutanix microservices interact, these FAQs will guide you with straightforward, reliable information. Each answer is written in clear, simple language with practical steps that help you quickly understand the issue and apply the right solution.
Q1. Why do Nutanix cluster services go down even when CVMs are running?
Answer:
Cluster services may fail due to DNS issues, NTP drift, low CVM memory, disk full, or network instability.
Even if the CVM is reachable, internal microservices may be unhealthy.
Running NCC and checking cluster status | grep -v UP usually identifies the cause.
Q2. What is the quickest way to diagnose Nutanix service failures?
Answer:
The following three commands can detect missing services, show faulty components, and safely restart them in the correct order.
nutanix@cvm$ ncc health_checks run_all
nutanix@cvm$ cluster status | grep -v UP
nutanix@cvm$ genesis restart
Q3. How do I fix Prism UI being slow or unreachable?
Answer:
Check Prism services, CVM memory, disk usage, and Zookeeper health.
Restart the Prism service using genesis stop prism followed by cluster start.
If still stuck, restart Genesis or reboot the affected CVM.
Refer: How to fix if Prism Web UI not working or stuck
Q4. Why do alerts like A200000, A300001, or Epsilon Service Down keep repeating?
Answer:
These alerts repeat when core issues like DNS failure, time mismatch, packet loss, or CVM resource shortage remain unresolved. Restarting services does not help until DNS, NTP, and network path issues are fixed completely.
Q5. When is it safe to reboot a Nutanix CVM?
Answer:
Reboot only if multiple services are down, Genesis cannot restart them, or disk/memory is critically low.
One CVM reboot is safe because the Nutanix cluster redistributes services automatically.
Note: Always put the CVM in maintenance mode first, then reboot.
Conclusion
Maintaining a healthy Nutanix cluster is not only about using the right commands or restarting services. It is about understanding how each component inside the Controller VM works together and how one small issue, such as DNS mismatch, NTP drift, low memory, or a blocked port, can affect the entire environment.
The modern Nutanix AOS architecture uses a strong microservices design, where every service has a clear dependency chain. This design helps administrators quickly identify which component has failed and why specific alerts appear, but it also means that a simple failure in Medusa, Pithos, or Zookeeper can cause multiple dependent services to stop working.
By following the structured troubleshooting steps covered in this guide, administrators can diagnose service failures confidently, avoid unnecessary downtime, and bring the environment back to a stable state quickly. Commands such as cluster status, genesis restart, NCC health checks, and individual service restarts form the foundation of daily troubleshooting.
At the same time, long-term stability comes from preventive best practices, including keeping DNS and NTP consistent, ensuring CVM resources are sufficient, monitoring disk usage, and running periodic NCC checks.
If the same alerts or service failures appear repeatedly, it is a sign that the underlying network, DNS configuration, or hardware resource allocation requires attention. Fixing the root cause always provides a permanent solution and prevents recurring service instability.

I’m Manish Kumar, founder of HyperHCI.com and a senior IT consultant with 13+ years of experience in infrastructure design and cybersecurity. I am an officially certified SME for ISC2 and Nutanix, and also certified in CISSP, CompTIA Security+, VMware, and AWS. My expertise covers HCI, virtualization, cloud computing, networking, and security across Nutanix, VMware, and AWS platforms. Read more




