Hey Checkyourlogs fans,
Today I wanted to talk about a case that I am currently working on for a customer running Veeam Backup & Replication 9.5 Update 3. Veeam is protecting our Storage Spaces Direct Hyper-V Hyper-Converged infrastructure running on Windows Server 2016.
We recently upgraded all of the servers from Windows Server 2012 R2 and started to see a weird error with Veeam over the weekend. Some of my backup jobs were failing with WMI error 32775, as you can see in the screenshot below.
When I checked the Hyper-V Cluster using Failover Cluster Manager everything looked fine.
Further, the cluster wasn’t showing any errors either. So I decided to have a look at the Hyper-V nodes. This is a two-node Storage Spaces Direct configuration.
What I found was very interesting: the VM from the failed SQL job was sitting on Node 2. Look at what Hyper-V Manager was showing.
It appeared that the jobs had stalled because either the Hyper-V management service or the VMWP.exe worker processes were hung for these VMs. I had seen this before with VSS snapshots. The most interesting part was the VM WAC-ADM, which was going to be my new Windows Admin Center virtual machine. It had failed on creation late last week, before I had a chance to dig into the issue.
Something was definitely up with this node.
Event Viewer showed error 19060 in the Hyper-V-VMMS log, stating that the VM was creating a checkpoint. That checkpoint never finished, which is why my new backups and replicas were failing.
I have found rebooting nodes very problematic in this particular situation, because the Cluster Service hangs and the Hyper-V host sits there saying "Shutting down Cluster Service". I am then forced to do a hard reboot of the server, which is never really a good idea, especially with any type of Hyper-Converged solution.
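If you would rather pull these events from the command line than click through Event Viewer, here is a minimal sketch in Python that builds a `wevtutil` query for event 19060. The channel name `Microsoft-Windows-Hyper-V-VMMS-Admin` and the helper name `build_event_query` are my assumptions, not something taken from this case, so check the log name on your own host first.

```python
import subprocess
import sys

# Assumed channel name for the Hyper-V-VMMS Admin log; confirm it in Event Viewer.
VMMS_LOG = "Microsoft-Windows-Hyper-V-VMMS-Admin"


def build_event_query(event_id: int, max_events: int = 10) -> list[str]:
    """Build a wevtutil command that returns the newest matching events as text."""
    xpath = f"*[System[(EventID={event_id})]]"
    return [
        "wevtutil", "qe", VMMS_LOG,
        f"/q:{xpath}",        # filter on the event ID
        f"/c:{max_events}",   # limit the number of events returned
        "/rd:true",           # read newest events first
        "/f:text",            # plain-text output
    ]


# Only attempt to run the query on a Windows host.
if __name__ == "__main__" and sys.platform == "win32":
    result = subprocess.run(build_event_query(19060), capture_output=True, text=True)
    print(result.stdout or result.stderr)
```

Run it from an elevated prompt on the affected node. If nothing comes back, the checkpoint events may be in a different VMMS channel on your build.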
NOTE: Storage Spaces Direct seems to be fine with the hard power outages. I just don’t like to push my luck because with other Hyper-Converged platforms this has caused me a lot of grief in the past.
There is a way to kill a hung worker process, and I normally use Sysinternals Process Explorer to get the job done.
NOTE: You need to run this from an Administrative Command Prompt to elevate Process Explorer to Admin Rights. If you don’t do this, you won’t be able to see the VMWP.exe processes that are controlling these Virtual Machines from the Parent Partition.
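Before killing anything, it helps to know which VMWP.exe belongs to which VM. Here is a minimal sketch in Python, assuming each worker process carries its VM's GUID on the command line (this is what Process Explorer shows in the Command Line column, but verify it on your host). The function name `map_vm_ids_to_pids` and the sample PIDs and GUID are hypothetical.

```python
import re

# Matches a GUID such as 0F5C3E2A-1B7D-4C9E-8A6F-2D3B4C5E6F70.
GUID_RE = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def map_vm_ids_to_pids(processes):
    """Return {VM GUID (upper case): PID} for every vmwp.exe entry.

    `processes` is an iterable of (pid, command_line) pairs, for example
    copied out of Process Explorer.
    """
    mapping = {}
    for pid, command_line in processes:
        if "vmwp.exe" not in command_line.lower():
            continue  # not a Hyper-V worker process
        match = GUID_RE.search(command_line)
        if match:
            mapping[match.group(0).upper()] = pid
    return mapping


# Hypothetical sample data, not taken from the customer's host.
sample = [
    (4312, r"C:\Windows\System32\vmwp.exe 0F5C3E2A-1B7D-4C9E-8A6F-2D3B4C5E6F70"),
    (5120, r"C:\Windows\System32\svchost.exe -k netsvcs"),
]
print(map_vm_ids_to_pids(sample))
```

Compare the GUIDs it returns with the IDs of the stuck VMs (for example the VMId property from Get-VM) so you only go after the worker processes that are actually hung.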
I tried killing each of the processes one by one, watching Hyper-V Manager to see if it cleared up my issue.
In the end, I wasn’t able to kill the above VMWP.exe processes. I got a general access denied error, as they were tied to an orphaned VMcompute.exe process. I ended up rebooting the node after disabling the Cluster Service and manually killing it.
Upon reboot of the node I could see that the orphaned VMWP.exe processes were gone.
I started the Cluster Service back up and checked the Storage Jobs.
I also checked Hyper-V Manager to see if it was responsive again.
Instead of being stuck on "Loading..." in the MMC, it was now back to normal.
I live migrated the SQL Servers back to Node 2 and, as you can see, the locks from the checkpoint creation were cleared.
At this point, I wanted to test the backups again and see if this fixed the problem. It looked good as the checkpoint for the Veeam Backup did create properly this time.
From what I can tell in troubleshooting this case, the Hyper-V Server locked up while creating checkpoints on the Virtual Machines, which made it impossible to proceed with backups or perform any other operations on the server. I hate rebooting a node, but that is what ended up solving my issue.
Backups are now working again, and the customer is once again happy.
I hope this helps you if you run into this issue.