Nutanix CVM Memory is 100% in vSphere 6.5 or later
An issue that has been popping up around the Internet is that Nutanix Controller VMs (CVMs) trigger VM Memory Usage alerts and show ~100% usage continuously, typically after installing or upgrading to vSphere 6.5 or later. This happened to my own installation, so after some research and discussion with my VAR, we determined it was the behavior described in Nutanix Support KB 4465. To save you the click: according to Nutanix, this is not unusual behavior.
As you can see from the image, the Nutanix CVM is pegging at 100% utilization, causing alarm after alarm to be triggered by vSphere. Of course, that is telling me something is very, very wrong with the CVM…right? Not so fast.
So, according to the Nutanix KB article, the primary issue is actually how ESXi 6.5 and later monitors VM memory usage. Though you can read VMware KB2149787 for the entire explanation, the two key points are:
- In 6.5 or later, it matters whether the VM is configured with any of the following:
  - PCI passthrough devices
  - Fault Tolerance (FT) enabled
  - Latency Sensitivity set to High
- When any of the above settings is enabled, “[they] force a full memory reservation and are not subjected to reclamation techniques, hence memory sampling is disabled and active memory will default to a display value of 100%”.
Two pieces matter for the CVMs here: PCI passthrough and the 100% display value. Each CVM uses PCI passthrough to get full access to the underlying hardware; thus, even though the CVMs are not actually using all the vRAM they are given, ESXi reports that they are. There is nothing Nutanix can do to convince it otherwise, short of changing the underlying functionality of the CVMs. As Nutanix notes, it is simply a cosmetic issue, albeit a really frustrating one.
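To make the rule from the VMware KB concrete, here is a minimal Python sketch of it. This is only a model of the behavior as quoted above, not anything ESXi exposes: the `VmConfig` class, its field names, and both functions are hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class VmConfig:
    """Hypothetical, simplified view of the three settings named in the KB."""
    pci_passthrough: bool = False
    fault_tolerance: bool = False
    latency_sensitivity_high: bool = False


def sampling_disabled(vm: VmConfig) -> bool:
    # Any one of the three settings forces a full memory reservation,
    # which in turn disables memory sampling for that VM.
    return vm.pci_passthrough or vm.fault_tolerance or vm.latency_sensitivity_high


def displayed_active_memory_pct(vm: VmConfig, sampled_pct: float) -> float:
    # With sampling disabled, the display value defaults to 100%,
    # regardless of what the guest is really using.
    return 100.0 if sampling_disabled(vm) else sampled_pct


cvm = VmConfig(pci_passthrough=True)   # a CVM uses PCI passthrough
app_vm = VmConfig()                    # an ordinary VM with none of the settings

print(displayed_active_memory_pct(cvm, 37.0))     # 100.0
print(displayed_active_memory_pct(app_vm, 37.0))  # 37.0
```

The point of the sketch is that the real usage figure never enters the picture for a CVM, which is why the alarm can never clear on its own.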
Unfortunately, the solution from both KBs is less a solution and more of a ‘turn a blind eye’ situation. Since you cannot stop vSphere from showing the CVMs pegged at 100%, you must effectively ignore it, namely by turning off the Virtual Machine Memory Usage alarm at the vCenter level. Obviously that is an awful idea in a production environment, but in case you just need it off immediately, the procedure is:
- Select the vCenter object in the navigation pane of the vSphere Web Client.
- Click the Monitor tab > Issues sub-tab.
- Select Alarm Definitions and search for Virtual machine memory usage.
- Highlight the alarm and click Edit.
- Deselect Enable this alarm.
- Click Finish.
Since alarms automatically propagate down, all child objects under vCenter inherit this disabled alarm, and there is no way to re-enable it on those child objects individually. Thus, in a production environment you will want to recreate the alarm definition lower down in your vCenter hierarchy.
- If your child objects are not already organized into a folder structure, do so. You can apply alarms to any object, so folders let you create custom alarms and other triggers per folder.
- For each folder, create a new alarm definition, matching the original’s settings:
  - Trigger: VM Memory Usage
  - Operator: is above
  - Warning Condition: 85% for 10 minutes
  - Critical Condition: 95% for 10 minutes
- Tip #1: You will need to do this in the Flash client, since the HTML5 client will only allow you to select 10% increments. Bizarre, I know…
- Tip #2: Alarm definition names must be unique within vCenter. Thus, you cannot use ‘Custom Memory Alarm’ as the name on each object; you will get an error that the alarm already exists somewhere else. Adding the object name to the front is enough to get past this, such as ‘SQL – VM Memory Usage’.
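The settings and the naming rule above can be sketched in a few lines of Python. This does not talk to vCenter at all; `MemoryAlarm`, `build_folder_alarms`, and `status` are hypothetical names, the folder names are examples, and the "sustained for 10 minutes" check is my simplified reading of the condition (one reading per minute, all of the last ten above the threshold), not vCenter's actual evaluation logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryAlarm:
    """Hypothetical record mirroring the original alarm's settings."""
    name: str
    folder: str
    warning_pct: int = 85
    critical_pct: int = 95
    sustain_minutes: int = 10


def build_folder_alarms(folders):
    """Create one alarm per folder, prefixing the folder name for uniqueness."""
    alarms, seen = [], set()
    for folder in folders:
        name = f"{folder} – VM Memory Usage"
        if name in seen:
            # Mirrors the 'alarm already exists' error described in Tip #2.
            raise ValueError(f"alarm name already exists: {name}")
        seen.add(name)
        alarms.append(MemoryAlarm(name=name, folder=folder))
    return alarms


def status(alarm, samples_pct):
    """samples_pct holds one usage reading per minute, oldest first."""
    window = samples_pct[-alarm.sustain_minutes:]
    if len(window) < alarm.sustain_minutes:
        return "normal"  # not enough history to call it sustained
    lowest = min(window)
    if lowest > alarm.critical_pct:   # operator: is above
        return "critical"
    if lowest > alarm.warning_pct:
        return "warning"
    return "normal"


alarms = build_folder_alarms(["SQL", "Web"])
print([a.name for a in alarms])
print(status(alarms[0], [90] * 10))
print(status(alarms[0], [96] * 10))
```

Keeping the thresholds as defaults on the record means every folder gets the same 85/95 values unless you deliberately override them, which is the whole idea of matching the original definition.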
If you do not wish to create multiple folders for some reason, simply create two folders under your Datacenter, e.g. CVMs and Production, then move all CVMs into the CVMs (duh) folder and all other VMs into Production.
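If you have more than a handful of VMs, working out which one goes where is easy to sketch. The Python below only sorts names into the two folders; it does not move anything in vCenter. It assumes your CVM names end in `-CVM`, which is an assumption about your naming you should check against your own inventory, and the VM names shown are made up.

```python
def assign_folder(vm_name: str) -> str:
    # Assumption: CVM names end in "-CVM". Adjust the test if yours differ.
    return "CVMs" if vm_name.upper().endswith("-CVM") else "Production"


def partition(vm_names):
    """Group VM names under the two folders from the text."""
    folders = {"CVMs": [], "Production": []}
    for name in vm_names:
        folders[assign_folder(name)].append(name)
    return folders


# Hypothetical inventory, for illustration only.
inventory = ["NTNX-node1-CVM", "sql01", "NTNX-node2-CVM", "web01"]
layout = partition(inventory)
print(layout["CVMs"])
print(layout["Production"])
```

With this split, the recreated memory alarm only needs to be defined on Production, leaving the CVMs folder without one.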
So, the more skilled sysadmins out there will by now be screaming “Why not just write a script to create the new alarms?!”. Well, skilled sysadmin, there is one flaw with PowerCLI: there is no ‘New-AlarmDefinition’ cmdlet yet available. So you can create new alarm actions and triggers via scripts, but not the actual definition that the trigger connects to.
There are some workarounds, such as Luc Dekens’ script, Alarm Expressions – Part 2 : Event Alarms, yet that script is 9 years old now and will only get you part of the way there. Thankfully, there is work being done on this, and I found a module in the VMware GitHub that does what we need. How stable it is, though, I will find out and report back.