A Nutanix cluster runs a number of services to operate and maintain itself, but what happens when the cluster or CVM services go down? Of course, you need to troubleshoot the Nutanix cluster / CVM services to get them back up and running and turn the cluster status green.
Troubleshooting down cluster or CVM services can be difficult for any administrator, so this post lays out simple steps to troubleshoot Nutanix cluster / CVM services.
Running a Nutanix HCI environment demands stability, predictable performance, and continuous availability. When a cluster service goes down, even for a short period, administrators may face VM downtime, management plane delays, upgrade failures, and unpredictable cluster behaviour.
This updated and fully refreshed guide explains, in simple and professional language, how Nutanix Cluster Services work, why they may fail, how to troubleshoot them step-by-step, and what administrators must do to prevent recurring issues.
The goal of this post is to help every Nutanix administrator understand the problem clearly, troubleshoot in a structured flow, and keep the Nutanix environment healthy.
If you are exploring Nutanix AHV-AOS versions or compatibility, you may also check our guide: Latest Nutanix AHV & AOS versions compatible matrix
Follow this step-by-step guide, Nutanix Cluster Services Down: Latest Troubleshooting Guide for Continuous Nutanix Cluster Availability
- Understand Nutanix Cluster Services
- Cluster Services Down: Production Impact
- Nutanix Cluster Services Troubleshooting
- Issue 1#: Nutanix Cluster Health Service Is Down
- Issue 2#: Nutanix Ergon Service Is Down / Ergon Inaccessible on Nodes
- Issue 3#: How to Fix Alert A200000 – Cluster Connectivity Status
- Issue 4#: How to Fix Alert A400102 – Epsilon Service Down
- Issue 5#: Unreachable DNS server can prevent clusters from starting services after failure
- Issue 6#: SSP: Enabling Self-Service Portal Services
- Issue 7#: Nutanix CVM / Cluster Services are down
- Issue 8#: Nutanix Gateway not reachable. Http request error
- Issue 9#: Critical : Cluster Service: Aplos is down on the Controller VM
- Issue 10#: LCM upgrade fails with error "Services not up" on a 2-node cluster
- Frequently Asked Questions (FAQs)
- Q1. Why do Nutanix cluster services go down even when CVMs are running?
- Q2. What is the quickest way to diagnose Nutanix service failures?
- Q3. How do I fix Prism UI being slow or unreachable?
- Q4. Why do alerts like A200000, A300001, or Epsilon Service Down keep repeating?
- Q5. When is it safe to reboot a Nutanix CVM?
- Conclusion
Understand Nutanix Cluster Services
Nutanix follows a distributed architecture where several microservices run inside every Controller VM (CVM). These services communicate with each other continuously to maintain cluster operations. When even one of these services goes down, the entire environment may experience instability.
These services manage almost everything inside the environment, such as:
- Virtual machine orchestration
- Data replication and recovery
- Storage input and output operations
- Data compaction and balancing
- Metadata consistency
- UI and API communication
- Cluster leadership and quorum
- Security and certificate management
Some of the core Nutanix services include:
- Genesis – Service manager for starting and stopping all other services
- Stargate – Data I/O manager that serves all storage read and write requests
- Cerebro – Replication metadata and disaster recovery service
- Curator – Background data scans, cleanup, compaction, and storage optimisation
- Acropolis – VM orchestration and AHV operations
- Aplos – REST API gateway used by Prism and automation tools
- Medusa – Access layer for the distributed metadata store
- Pithos – vDisk configuration data service
- Epsilon – Container orchestration service supporting Nutanix Self-Service (Calm)
- Zookeeper – Cluster configuration, quorum, and leader election
- Cluster-Gateway and API Services – For Prism communication
Core Nutanix Services and Their Purpose:
| Service Name | Purpose |
|---|---|
| Genesis | Controls start and stop of all services on CVMs |
| Stargate | Data I/O manager serving all storage read and write requests |
| Cerebro | Powering disaster recovery, snapshots, and replication |
| Curator | Data balancing, cleanup, compaction, and storage optimization |
| Acropolis | VM orchestration, AHV operations, cluster-wide VM tasks |
| Aplos | REST API gateway for Prism and automation tools |
| Medusa | Access layer for the distributed metadata store |
| Pithos | vDisk configuration data store |
| Epsilon | Container orchestration for Nutanix Self-Service (Calm) |
| Cluster-Gateway | API communication for Prism Element and Prism Central |
| Zookeeper | Distributed coordination, configuration tracking, quorum, and leader election |
These services work like a tightly connected engine. If one component stops responding, the entire Nutanix environment may lose functionality.
To understand all Nutanix cluster components in detail, you can read our post: Nutanix Cluster Components and related services
Cluster Services Down: Production Impact
When Prism shows Cluster Services Down, it indicates that one or more essential services inside the CVM have stopped, crashed, hung, or cannot communicate with other nodes.
This may result in:
- Delayed VM operations
- Slow storage performance
- Prism UI unresponsiveness
- Replication or DR errors
- LCM or AOS upgrade failures
- Alerts showing missing services
- Node status appearing offline even when IP is reachable
In simple words: If Nutanix services are down, your cluster cannot operate as designed.
Administrators must treat this as a priority issue because the longer the service remains down, the higher the risk of cluster imbalance or metadata inconsistency.
Nutanix Cluster Services Troubleshooting
Nutanix Cluster Services Troubleshooting is one of the most important responsibilities for administrators who manage Nutanix AHV or hybrid cloud environments. Every Nutanix cluster depends on a set of distributed microservices running inside each Controller VM (CVM).
These services handle critical operations such as storage I/O, metadata management, VM orchestration, replication, and Prism communication. When even one of these services becomes unhealthy, the entire environment may show performance delays, service disruption, or “Cluster Services Down” alerts in Prism.
Most service failures are caused by common underlying factors such as DNS misconfiguration, NTP time drift, low CVM memory, disk space issues, or network connectivity interruptions. That is why troubleshooting Nutanix services requires a clear understanding of how these microservices work, how they depend on each other, and how to use commands like cluster status, genesis restart, and NCC checks to identify the exact root cause.
Let’s troubleshoot the common Nutanix cluster / CVM services down issues and turn cluster / CVM health back to green.
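As a quick triage sketch, the status-filtering step can be exercised on captured output before running it live. The abbreviated `cluster status` output below is hypothetical; on a real CVM you would pipe the command's output directly:

```shell
# Hypothetical, abbreviated 'cluster status' output for illustration only;
# on a CVM you would simply run:  cluster status | grep -v UP
status='CVM: 10.0.0.11 Up
        Genesis    UP    [3456, 3489]
        Ergon      DOWN  []
        Aplos      UP    [4012]'

# Filtering out UP lines shortlists the unhealthy services immediately
echo "$status" | grep -v "UP"
```

Only the CVM header and the `Ergon DOWN` line survive the filter, which is exactly the shortlist you want before restarting anything.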
Issue 1#: Nutanix Cluster Health Service Is Down
This usually refers to issues detected by NCC health checks or major cluster service failures.
Step by step Resolution
Step 1: Log in to any CVM in the Nutanix cluster
You can use PuTTY (SSH) or log in directly from Nutanix Prism > VM > select any CVM > Launch console > enter the credentials.
If you do not know the CVM credentials, refer to: Nutanix CVM Default User ID and Password
Run the following command to check the cluster services health:
nutanix@cvm$ ncc health_checks system_checks cluster_services_status
OR: Run full Cluster services check as mentioned below:
nutanix@cvm$ ncc health_checks run_all
Step 2: Validate Service Status
nutanix@cvm$ cluster status | grep -v UP
If the NCC health check shows related services as down, you can start all services with a single command.
Step 3: Restart All Cluster Services (Safest Way)
nutanix@cvm$ genesis restart
OR
nutanix@cvm$ cluster start
Important: The above commands are safe to run on a Nutanix cluster without any impact (zero impact) on running workloads.
Issue 2#: Nutanix Ergon Service Is Down / Ergon Inaccessible on Nodes
What Ergon Does
Ergon is the cluster task management service. It backs:
- Life Cycle Manager (LCM)
- Internal workflows
- Automated upgrades
- System automation tasks
When Ergon is down:
- LCM fails
- Prism Central tasks get stuck
- Some NCC modules fail
Step by step Resolution
Step 1: Log in to any CVM and check the Ergon service status
nutanix@cvm$ cluster status | grep ergon
OR
nutanix@cvm$ cluster status | grep -v UP
If the Ergon service is down, check the following possible causes and fix them.
Step 2: Check CVM Memory
nutanix@cvm$ free -m
nutanix@cvm$ top
If the CVM memory is lower than the recommendations below, increase it immediately:
Put the CVM in maintenance mode > shut down the CVM > increase the memory > power on the CVM
Important:
Production workloads: A minimum of 32 GB is typically used for normal workloads.
All-flash clusters: At least 40 GB is required for all-flash clusters, with 48 GB recommended to enable features such as Blockstore.
Heavy workloads: 64 GB is recommended for high-performance workloads, such as large databases or Oracle environments, and when using AOS features like compression, deduplication, or erasure coding.
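As a sketch, the `free -m` output can be parsed and compared against whichever floor you pick from the recommendations above. The sample `Mem:` line and the 32 GB floor below are illustrative assumptions; on a CVM you would capture the real `free -m` output:

```shell
# Floor chosen from the recommendations above (32 GB for normal production)
floor_mb=32768

# Sample 'free -m' output line for illustration; on a CVM use:
#   mem_line=$(free -m | grep '^Mem:')
mem_line='Mem:          24576        8000       16576'

# Column 2 of the Mem: line is total memory in MB
total_mb=$(echo "$mem_line" | awk '{print $2}')
if [ "$total_mb" -lt "$floor_mb" ]; then
  echo "CVM memory ${total_mb} MB is below the ${floor_mb} MB floor - scale up"
else
  echo "CVM memory ${total_mb} MB meets the ${floor_mb} MB floor"
fi
```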
Step 3: Validate DNS resolution
Ergon relies heavily on endpoint communication.
nutanix@cvm$ nslookup <cluster-ip>
The above Nutanix cluster IP (Prism VIP) should resolve to its FQDN.
Step 4: Check Network Port Access
Ensure no firewall blocks internal ports between CVMs. Refer Nutanix cluster firewall ports list
Step 5: Review the Ergon logs for further detail
nutanix@cvm$ cat /home/nutanix/data/logs/ergon.out
Step 6: Finally Restart Genesis
nutanix@cvm$ genesis restart
OR
nutanix@cvm$ cluster start
Issue 3#: How to Fix Alert A200000 – Cluster Connectivity Status
New AOS releases include improved alerting, better dependency visibility, and deeper service-level intelligence. Alerts such as A200000, A400102, and A300001 directly point to critical cluster services. Understanding how to troubleshoot them helps administrators reduce downtime, prevent cascading failures, and bring the cluster back to a healthy state faster.
Below are the recommended steps to diagnose and fix each alert.
This alert appears when cluster nodes cannot communicate with each other properly.
It affects:
- CVM heartbeat
- Metadata replication
- Leader election
- NCC checks
- Prism responsiveness
Root Causes
- DNS forward or reverse lookup failure
- NTP drift between nodes
- IP conflict
- VLAN or firewall rules blocking internal ports
- Incorrect subnet or gateway
- Packet drops between CVMs
- CVM host instability
How to Fix It (Step-by-Step)
Step 1: Validate CVM-to-CVM Connectivity
Run from any CVM:
nutanix@cvm$ ping <all-peer-cvm-ip>
If latency or drops appear, troubleshoot network path.
Step 2: Confirm DNS Resolution
Forward and reverse lookup must work for every CVM.
Use nslookup or dig command:
nutanix@cvm$ nslookup <all-cvm-ip>
nutanix@cvm$ nslookup <cvm-hostname>
nutanix@cvm$ dig <cluster-name>
Fix DNS entries if mismatched.
Step 3: Validate NTP Time Synchronization
nutanix@cvm$ ntpq -p
If time offset is large, fix NTP configuration.
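As a sketch, the offset column of `ntpq -p` can be screened automatically. The sample peer line and the 100 ms threshold below are assumptions; `offset` is the ninth column of `ntpq -p` output, in milliseconds:

```shell
# Hypothetical 'ntpq -p' peer line; on a CVM pipe the real output
# (skipping the two header lines) into the same awk filter
sample='*time.example.com .GPS.  1 u   32   64  377   1.234  250.101   0.456'

# Flag any peer whose absolute offset (column 9, in ms) exceeds 100 ms
echo "$sample" | awk '{ off = $9; if (off < 0) off = -off;
                        if (off > 100) print $1, "offset", $9, "ms - fix NTP" }'
```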
Step 4: Check Network Ports
Nutanix needs several internal communication ports open.
Most common issues involve firewall or ACL blocks.
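A minimal reachability sketch can use bash's built-in `/dev/tcp`, so no extra tools are needed. The port 9440 (Prism web/API) is a common example, and the loopback IP below is a placeholder you would replace with peer CVM IPs; this assumes bash with `/dev/tcp` support and the `timeout` utility:

```shell
# Sketch: test whether a TCP port answers from this CVM
check_port() {
  local host=$1 port=$2
  if timeout 2 bash -c "echo > /dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} open"
  else
    echo "${host}:${port} blocked or closed"
  fi
}

# Example: check the Prism port locally; loop over peer CVM IPs in practice
check_port 127.0.0.1 9440
```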
Step 5: Check CVM Network Configuration
nutanix@cvm$ ifconfig
nutanix@cvm$ arp -a
Look for duplicated IPs, wrong subnet masks, or VLAN issues.
Step 6: Validate Host Stability
If the hypervisor host is overloaded or unhealthy, CVMs cannot communicate reliably.
Step 7: Restart Genesis (If Services Are Unresponsive)
nutanix@cvm$ genesis restart
Issue 4#: How to Fix Alert A400102 – Epsilon Service Down
The Epsilon service provides container orchestration for Nutanix Self-Service (Calm) tasks and related workflows.
If Epsilon goes down, Self-Service operations stall and dependent cluster tasks may hang.
Symptoms
- Prism UI slow or stuck
- Metadata operations delayed
- Self-Service (Calm) tasks stuck
- NCC checks fail
- Upgrade operations stop
Root Causes
- CVM resource exhaustion
- DNS or NTP mismatch
- Zookeeper service issues
- Network path failures
- Corrupted Epsilon configuration
- Node isolation
How to Fix It
Step 1: Check Epsilon Status
nutanix@cvm$ genesis status | grep -i epsilon
Step 2: Validate DNS and NTP
Incorrect DNS or time drift breaks inter-service communication.
nutanix@cvm$ nslookup <cvm-hostname>
nutanix@cvm$ ntpq -p
Step 3: Identify Epsilon Hosting Node
Only one CVM hosts Epsilon.
Check if that CVM is healthy.
nutanix@cvm$ cluster status | grep epsilon
Step 4: Restart Epsilon
nutanix@cvm$ cluster start
This command starts any stopped services, including Epsilon, without impacting running VMs.
Step 5: Restart Genesis if Needed (optional)
nutanix@cvm$ genesis restart
Issue 5#: Unreachable DNS server can prevent clusters from starting services after failure
Resolution: Check the DNS / name server entries in the cluster and CVM configuration files and verify connectivity to the DNS server.
Command 1: Check DNS / Name server entry on cluster configuration
nutanix@cvm:~$ zeus_config_printer | grep name_server
Command 2: Check the DNS / name server entry in every CVM's configuration file.
nutanix@cvm:~$ allssh "cat /etc/resolv.conf"
If the DNS entry is not found, add the DNS server IP address / hostname from Prism as shown in the following screenshot.

Make sure the DNS server is reachable before adding its IP address / hostname.
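As a quick sketch, each CVM's resolv.conf content can be screened for a nameserver entry. The sample file content below is hypothetical; in practice you would inspect the real output of the allssh command above:

```shell
# Hypothetical /etc/resolv.conf content; on a CVM, read the real file
resolv='search example.local
nameserver 10.0.0.2
nameserver 10.0.0.3'

# Flag the CVM if no nameserver line is present
if echo "$resolv" | grep -q '^nameserver'; then
  echo "DNS entries found: $(echo "$resolv" | grep -c '^nameserver')"
else
  echo "No DNS entry - add a name server from Prism"
fi
```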
Issue 6#: SSP: Enabling Self-Service Portal Services
Resolution: Enable the SSP service on all Nutanix CVMs.
Services for the Self-Service Portal (SSP) feature are disabled by default on AHV hosts on which the Controller VM has less than 24 GB of memory.
SSP is supported on AHV hosts only.
Step 1: Check the Nutanix CVM memory allocation; it must be at least 24 GB.
nutanix@cvm$ free -m
Step 2: If the Nutanix CVM memory allocation is less than 24 GB, scale it up to at least 24 GB.
Option 1: Increase / scale up the Nutanix CVM memory from the Prism console
Option 2: Increase / scale up the Nutanix CVM memory from the command line
Step 3: Restart Genesis service on all Nutanix CVMs
nutanix@cvm$ allssh genesis restart
nutanix@cvm$ allssh genesis stop prism
nutanix@cvm$ cluster start
Issue 7#: Nutanix CVM / Cluster Services are down
Let’s troubleshoot the Nutanix cluster / CVM services down issue. First, try to understand the critical Nutanix cluster services:
Read more: Nutanix Cluster Most Critical Services
Resolution: check the Nutanix CVM / Cluster services status and restart them.
Step 1: Check Nutanix CVM / Cluster services status
nutanix@CVM$ ncc health_checks run_all
The NCC health check cluster_services_status verifies if the Controller VM (CVM) services have restarted recently across the cluster.
Step 2: Shortlist the down services on all Nutanix CVM
nutanix@cvm$ cluster status | grep -v UP
Step 3: Start Nutanix CVM / Cluster services
nutanix@cvm$ cluster start
Note: Above command will not impact your production running VMs.
Optional Step 4: If the Step 3 command does not start the down services, you can reboot either the Nutanix node or the Nutanix CVM.
Step 4.1: Reboot Nutanix CVM
nutanix@cvm$ cvm_shutdown -r now
Step 4.1.1: OR Shutdown Nutanix CVM
nutanix@cvm$ cvm_shutdown -P now
Step 4.1.2: Power on the shut-down Nutanix CVM
You can log in to Nutanix Prism > VM > select the shut-down CVM > Power on
Or start it from the AHV command line: SSH to the Nutanix AHV host
root# virsh list --all | grep CVM
The output shows the CVM name; copy it and run the following command to start the Nutanix CVM:
root# virsh start <CVM_Name>
Wait about 5 minutes for the Nutanix CVM and its services to boot up.
Step 4.2 (alternative): Put the host in maintenance mode and then reboot the node.
Read more: Enable Nutanix CVM, AHV Maintenance mode
Read more: How to Shutdown / Reboot Nutanix AHV Host and Nutanix CVM
Final Step: Now check the Nutanix cluster status and running services.
nutanix@cvm$ cluster status
Issue 8#: Nutanix Gateway not reachable. Http request error
Resolution: Restart the Prism services on the CVM that is the Prism leader.
Step 1: Find the Nutanix Prism Leader – Verify which cluster node is the Prism leader, that is, the CVM running the Prism container services.
nutanix@cvm$ curl http://0:2019/prism/leader && echo
Output should look similar as following
{"leader":"xx.xx.xx.10:9080", "is_local":false}
This means the CVM at xx.xx.xx.10 is the Prism leader.
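To script against that output, the leader IP can be extracted with `sed`. The sample response below is hypothetical; on a CVM you would capture the `curl` output instead:

```shell
# Hypothetical leader-endpoint response; on a CVM capture it with:
#   resp=$(curl -s http://0:2019/prism/leader)
resp='{"leader":"10.0.0.10:9080", "is_local":false}'

# Strip the JSON wrapper and the port, leaving just the leader CVM IP
leader_ip=$(echo "$resp" | sed -n 's/.*"leader":"\([^:"]*\).*/\1/p')
echo "Prism leader CVM: $leader_ip"
```

You would then SSH to that IP for Step 2 below.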
Step 2: SSH to Prism Leader and run the following command to restart Prism service.
nutanix@cvm$ genesis stop prism
nutanix@cvm$ cluster start
Note: The above commands have no impact on running production VMs.
Read also: Nutanix Prism web console is slow, not working, hanging issues troubleshooting
Issue 9#: Critical : Cluster Service: Aplos is down on the Controller VM
Aplos is the REST API gateway used by Prism and automation tools. If Aplos fails:
- API-driven VM creation fails
- API-driven VM migration fails
Check status:
nutanix@cvm$ cluster status | grep aplos
Fix: Restart the Aplos services via Genesis:
nutanix@cvm$ genesis stop aplos aplos_engine
nutanix@cvm$ cluster start
Issue 10#: LCM upgrade fails with error “Services not up” on a 2-node cluster
Resolution: The affected services run within the LCM framework.
This is a known issue, so it is recommended to upgrade Nutanix NCC and the LCM framework to the latest available versions.
Read also: How Nutanix LCM Life Cycle Management Framework Works?
Hopefully, you have learned something new and interesting today.
Thanks for being with HyperHCI Tech Blog; stay tuned and keep learning.
Frequently Asked Questions (FAQs)
Troubleshooting Nutanix cluster services often raises many recurring questions for administrators, especially when dealing with alerts, service failures, Prism issues, or dependency errors.
Whether you are diagnosing service failures, improving cluster health, or learning how Nutanix microservices interact, these FAQs will guide you with straightforward, reliable information. Each answer is written in clear, simple language with practical steps that help you quickly understand the issue and apply the right solution.
Q1. Why do Nutanix cluster services go down even when CVMs are running?
Answer:
Cluster services may fail due to DNS issues, NTP drift, low CVM memory, disk full, or network instability.
Even if the CVM is reachable, internal microservices may be unhealthy.
Running NCC and checking cluster status | grep -v UP usually identifies the cause.
Q2. What is the quickest way to diagnose Nutanix service failures?
Answer:
The following three commands can detect missing services, show faulty components, and safely restart them in the correct order.
nutanix@cvm$ ncc health_checks run_all
nutanix@cvm$ cluster status | grep -v UP
nutanix@cvm$ genesis restart
Q3. How do I fix Prism UI being slow or unreachable?
Answer:
Check Prism services, CVM memory, disk usage, and Zookeeper health.
Restart the Prism service using genesis stop prism followed by cluster start.
If still stuck, restart Genesis or reboot the affected CVM.
Refer: How to fix if Prism Web UI not working or stuck
Q4. Why do alerts like A200000, A300001, or Epsilon Service Down keep repeating?
Answer:
These alerts repeat when core issues like DNS failure, time mismatch, packet loss, or CVM resource shortage remain unresolved. Restarting services does not help until DNS, NTP, and network path issues are fixed completely.
Q5. When is it safe to reboot a Nutanix CVM?
Answer:
Reboot only if multiple services are down, Genesis cannot restart them, or disk/memory is critically low.
One CVM reboot is safe because the Nutanix cluster redistributes services automatically.
Note: Always put the CVM in maintenance mode first, then reboot.
Conclusion
Maintaining a healthy Nutanix cluster is not only about using the right commands or restarting services. It is about understanding how each component inside the Controller VM works together and how one small issue, such as DNS mismatch, NTP drift, low memory, or a blocked port, can affect the entire environment.
The modern Nutanix AOS architecture uses a strong microservices design, where every service has a clear dependency chain. This design helps administrators quickly identify which component has failed and why specific alerts appear, but it also means that a simple failure in Medusa, Pithos, or Zookeeper can cause multiple dependent services to stop working.
By following the structured troubleshooting steps covered in this guide, administrators can diagnose service failures confidently, avoid unnecessary downtime, and bring the environment back to a stable state quickly. Commands such as cluster status, genesis restart, NCC health checks, and individual service restarts form the foundation of daily troubleshooting.
At the same time, long-term stability comes from preventive best practices, including keeping DNS and NTP consistent, ensuring CVM resources are sufficient, monitoring disk usage, and running periodic NCC checks.
If the same alerts or service failures appear repeatedly, it is a sign that the underlying network, DNS configuration, or hardware resource allocation requires attention. Fixing the root cause always provides a permanent solution and prevents recurring service instability.

I’m Manish Kumar, founder of HyperHCI.com and a senior IT consultant with 13+ years of experience in infrastructure design and cybersecurity. I am an officially certified SME for ISC2 and Nutanix, and also certified in CISSP, CompTIA Security+, VMware, and AWS. My expertise covers HCI, virtualization, cloud computing, networking, and security across Nutanix, VMware, and AWS platforms. Read more




