Hey Checkyourlogs Fans,

In today’s post, I wanted to discuss a case that I had this week with a customer who had a couple of VMs stuck in a Pending state in Failover Cluster Manager. They are running Windows Server 2016 with the April Cumulative Update.

Leading up to this issue, the customer had done a routine operation where he restored a Virtual Machine to an alternate location using Veeam. This did restore the Virtual Machine to a standalone Hyper-V host, but it also re-registered the same GUID for the VM. When the cluster detected this, it had an issue with duplicate GUIDs for the VM resources and they got stuck in a pending state.

Below is a screenshot of what the GUI looked like when I first logged on.


We couldn’t do anything with these resources while they were in this state. We tried removing them with PowerShell, but the attempt just hung the PowerShell console.


Every operation we tried would also fail in Failover Cluster Manager.


I did notice above that the resource went from Online Pending to Failed and then back to Online Pending.

This was because the Failure Policy was trying to restart the cluster resource and I just happened to catch it between cycles.

Step number one was to check the error message that was occurring when trying to start this resource.


I noticed that we were getting Error Event ID 21502: “Virtual Machine Configuration SCVMM COMT Server 2012 Resources” failed to unregister the virtual machine with the virtual machine management service.
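
If you would rather pull these events from PowerShell than from the GUI, a rough sketch is below. I am assuming here that the event is written to the System log on the node that owns the resource; if yours lands in a different log, change the log name.

```powershell
# Sketch: list recent occurrences of Event ID 21502 on the local node.
# Assumption: the event is recorded in the System log on this node.
Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 21502 } -MaxEvents 20 -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, ProviderName, Id, Message |
    Format-List
```

If nothing comes back, run it on the other cluster nodes as well, since the event is logged by the node that tried to bring the resource online.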

Note: There was also an issue with Failover Cluster Manager where we couldn’t add any Virtual Machines. Trying to add a standalone Hyper-V Virtual Machine to the cluster threw the error “An item with the same key has already been added.”


I figured the issue was caused by the customer using Veeam to restore the Virtual Machine to an alternate location with the same GUID, and then trying to add it back into the cluster. Before I could troubleshoot further, I wanted to get this resource into a Failed state so I could try to fix it. What was preventing this was the failure policy, which kept restarting the resource.

Note: I normally don’t restore to an alternate location inside the same cluster with Veeam. If I were to use that restore option, it would be to another cluster or a standalone host.


I ended up setting Maximum failures in the specified period to 0 and Period (hours) to 0.

I did this for all of the resources associated with the failed virtual machines.
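
For reference, the same change can probably be made from PowerShell. This is only a sketch: the role name is a placeholder, and I am assuming that the two GUI fields map to the FailoverThreshold and FailoverPeriod properties of the cluster group. Keep in mind that our removal attempts from PowerShell hung while the resources were in this state, so this may not respond either and the GUI may be your only option.

```powershell
Import-Module FailoverClusters

# Hypothetical role name; replace with the clustered VM role that is stuck.
$groupName = 'MyStuckVM'

$group = Get-ClusterGroup -Name $groupName

# Assumption: these two group properties back the "Maximum failures in the
# specified period" and "Period (hours)" fields shown in the GUI.
$group.FailoverThreshold = 0
$group.FailoverPeriod    = 0

# Confirm the new values.
$group | Format-List Name, State, FailoverThreshold, FailoverPeriod
```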

Eventually, I did end up getting them into a failed state.

Then I tried removing the resources again with both PowerShell and the GUI, and every attempt failed.

I did eventually get an error message stating that there was a lock on the file preventing me from changing or deleting it.

Next, I wanted to find the VMs with the duplicate GUIDs to confirm this was indeed my issue.


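# Run on a cluster node: show every VmId that is registered on more than one 'Virtual Machine' resource
Get-ChildItem 'HKLM:\Cluster\Resources' |
    Where-Object { ($_ | Get-ItemProperty -Name Type).Type -eq 'Virtual Machine' } |
    ForEach-Object { ($_ | Get-ChildItem | Get-ItemProperty -Name VmId).VmId } |
    Group-Object | Where-Object Count -gt 1 | Out-GridView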


I then went into the registry and searched for the duplicated Resource IDs.

This can be done by browsing on one of the cluster nodes to: HKLM\Cluster\Resources
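
If you don’t want to click through every key by hand, a read-only sketch like the one below lists the resource keys that carry a given VmId. The GUID is a placeholder, and I am assuming that each resource key has a Name value and that the VmId sits in one of its subkeys, which is what the one-liner above relies on. It changes nothing in the registry.

```powershell
# Hypothetical placeholder; paste in a duplicated VmId from the previous step.
$duplicateVmId = '00000000-0000-0000-0000-000000000000'

Get-ChildItem 'HKLM:\Cluster\Resources' | ForEach-Object {
    $resourceKey = $_
    $props = Get-ItemProperty -Path $resourceKey.PSPath

    if ($props.Type -eq 'Virtual Machine') {
        # Read VmId from the subkeys of the resource key (assumed location).
        $vmIds = ($resourceKey | Get-ChildItem |
            Get-ItemProperty -Name VmId -ErrorAction SilentlyContinue).VmId

        if ($vmIds -contains $duplicateVmId) {
            # Emit the registry key name (the resource ID) and the resource name.
            [pscustomobject]@{
                ResourceId = $resourceKey.PSChildName
                Name       = $props.Name
                VmId       = $duplicateVmId
            }
        }
    }
}
```

The ResourceId column is the name of the key under HKLM\Cluster\Resources, so it tells you exactly which keys to look at in Regedit.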


I found the two VMs that were causing the issues and deleted them in the registry. This cleared up the “An item with the same key has already been added” error. However, the Virtual Machines were still showing up in Failover Cluster Manager.


We tried everything we could think of and that file lock issue was still there. The only thing I could think of that would release the lock was a shutdown and restart of the cluster. This was a big deal because there were more than 100 VMs inside this cluster.


Note: It took approximately 7 minutes for the cluster to fully shut down. This was because the stop action of the Virtual Machines was set to Save the State, which took a while for close to 100 VMs.
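
If you want to do the shutdown and restart from PowerShell, a minimal sketch is below. This is an assumption about how you might do it rather than the exact steps we took, and it takes every clustered role offline, so plan an outage window first.

```powershell
Import-Module FailoverClusters

# Stops the Cluster service on every node; all clustered roles go offline.
Stop-Cluster -Force

# When you are ready, start the cluster again and check that the nodes are up.
Start-Cluster
Get-ClusterNode | Format-Table Name, State
```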


After the cluster came back online, the resources were still there. I felt that the file locks should be released by now, so we tried to remove the Virtual Machine and its failed resources, and it WORKED!
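
For anyone doing the removal from PowerShell, it could look something like the sketch below. The role name is a placeholder. This removes the clustered role and its resources from the cluster, so check the list it prints before running the last line.

```powershell
Import-Module FailoverClusters

# Hypothetical role name; replace with the failed clustered VM role.
$groupName = 'MyStuckVM'

# Review what belongs to the role before removing anything.
Get-ClusterGroup -Name $groupName | Get-ClusterResource |
    Format-Table Name, State, ResourceType

# Remove the role together with its resources, without a confirmation prompt.
Remove-ClusterGroup -Name $groupName -RemoveResources -Force
```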

To clean things up we re-created the Virtual Machines from scratch in Hyper-V Manager and then added them back into the cluster.
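
Adding a VM back into the cluster can also be done from PowerShell. A minimal sketch is below, assuming the rebuilt VM already exists on one of the cluster nodes with its files on cluster storage. The VM name is a placeholder.

```powershell
Import-Module FailoverClusters

# Hypothetical VM name; replace with the VM you re-created in Hyper-V Manager.
$vmName = 'MyRebuiltVM'

# Make the existing Hyper-V VM highly available as a clustered role.
Add-ClusterVirtualMachineRole -VirtualMachine $vmName

# List the clustered VM roles to confirm the new one shows up.
Get-ClusterGroup | Where-Object GroupType -eq 'VirtualMachine' |
    Format-Table Name, OwnerNode, State
```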

This was definitely a tricky issue, and I hope this post helps you solve it if you run into it.

Thanks,


Dave

