Lost NFS + network – why were a few VMs powered off after connectivity was restored?

Hello All,

This is one of the issues I have experienced.

This was a 5-node cluster running on Nutanix boxes with ESXi 5.5 Update 2.

The Nutanix boxes are connected to Cisco 6k switches, and due to a network failure all the ESXi hosts lost network access. This led to host isolation, as none of the nodes could ping each other on the management network. The sad part was that the Nutanix controller VMs were communicating on the same network, so the entire NFS export went down.

The network outage lasted close to 50 minutes. The cluster's HA isolation response was set to "Leave Powered On". Well, after connectivity was restored, most of the VMs remained powered on, but a few (around ten) of them were powered off, and vCenter was one of them. I had to log into multiple hosts just to find the vCenter VM, then power it on first... Ahh, what a waste of time.

So I started going through the logs to understand why these VMs had powered off. At first I suspected something specific to the Nutanix boxes, but then I realized that if this had been an ordinary NFS export, we would still have hit the same problem.

After several hours of reviewing the logs, I found the issue that led to these VMs powering off.


esx-NTNX-xxxxxx-2015-08-01–08.25/var/run/log/vmkernel.4:2015-08-01T00:33:09.607Z cpu9:36450494)WARNING: World: vm 36450494: 11151: vmm0:xxxxxxx:vmk: vcpu-0:Unable to read swapped out pgNum(0xfe9e0) from swap slot(0x2ec7c) for VM(36450494)
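When you have log bundles from several hosts, hunting for this warning by hand is tedious. Here is a minimal sketch of how I could have automated the search; the glob pattern for the extracted bundle paths is an assumption, so adjust it to wherever your vm-support bundles live:

```python
import glob
import re

# Matches the vmkernel swap-read failure shown above. The pgNum, swap slot,
# and VM world ID are captured for reporting.
PATTERN = re.compile(
    r"Unable to read swapped out pgNum\((0x[0-9a-f]+)\) "
    r"from swap slot\((0x[0-9a-f]+)\) for VM\((\d+)\)"
)

def find_swap_errors(lines):
    """Return (pgNum, slot, vm_world_id) tuples for matching vmkernel lines."""
    hits = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            hits.append(m.groups())
    return hits

if __name__ == "__main__":
    # Path layout is hypothetical -- point it at your extracted log bundles.
    for path in glob.glob("esx-*/var/run/log/vmkernel*"):
        with open(path, errors="replace") as f:
            for pg, slot, vm in find_swap_errors(f):
                print(f"{path}: VM world {vm} failed swap read pgNum={pg} slot={slot}")
```

Running this across all the bundles quickly shows which VM world IDs hit swap-read failures, which you can then map back to VM names.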

The reason these VMs remained powered off was that their VMX processes crashed after the storage was lost.
When the storage went away, the swap memory (the .vswp files) went away with it.

The VMs were swapping memory to their respective .vswp files while running.
Once the Nutanix NFS export became unreachable, the VMs could no longer reach their swap files either, and hence crashed.

Thus, we reviewed the memory over-commitment and allocations on the cluster and fixed the swapping issue.
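Reducing this exposure comes down to simple arithmetic: a VM's .vswp file is sized as its configured memory minus its memory reservation, so a full reservation means no swap file to lose. Here is a minimal sketch of that relationship; the VM names and sizes are made-up examples:

```python
def vswp_size_mb(configured_mb, reservation_mb):
    """Per-VM .vswp size = configured memory minus memory reservation."""
    if reservation_mb > configured_mb:
        raise ValueError("reservation cannot exceed configured memory")
    return configured_mb - reservation_mb

# Illustrative inventory (hypothetical names and sizes, in MB).
vms = {
    "vcenter": (16384, 0),     # no reservation: full 16 GB of swap exposure
    "app01":   (8192, 8192),   # fully reserved: zero .vswp exposure
}

for name, (mem, resv) in vms.items():
    print(f"{name}: vswp = {vswp_size_mb(mem, resv)} MB")
```

Walking the cluster inventory with this kind of calculation makes it easy to spot which VMs would be hurt worst by losing the datastore that holds their swap files.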

However, at the end of the day, the networking problem itself has to be fixed to truly resolve this.