The Case of: Windows Server 2019 Hyper-V Checkpoint stuck at 9% after Veeam 9.5 UR 4 Upgrade – #Veeam #Hyper-V

Posted by Kawula Dave | Jan 29, 2019 | Veeam, Windows Server | 16 |

Hey Checkyourlogs Fans,

Today I had an interesting case where a customer called in to let us know that Checkpoints were stuck on their Windows Server 2019 Hyper-V Host. Everything had been working great on this host since it was upgraded from Windows Server 2016 back in November. The only change was that we upgraded to Veeam 9.5 Update Rollup 4, after which we started experiencing these issues with Checkpoints and Disk Locks.

We noticed this because the Veeam Backup and Replica Jobs to and from this host were failing, stuck with Creating Checkpoint at 9%. We hadn’t patched this host since it had been deployed and figured that might be the issue.

In this situation I normally try to avoid rebooting the Hyper-V Host, as that is pretty invasive. So I tried stopping the Hyper-V VMMS Service first. That just hung in a stopping state. At this point, we decided just to patch and reboot the host.

This did solve our problem. So now I have a note to self to patch my Hyper-V Hosts before moving to Veeam 9.5 UR4. Although I’m not 100 percent certain the two are related, I’m pretty sure, because that was the only change in our environment.

Thanks,

Dave

16 Comments

Chad Holm on March 15, 2019 at 4:08 pm

Hi Dave,
I now have this exact same scenario, except that patching and rebooting the host only fixed it temporarily. The last time it happened, the host had to be forced to restart because the vmms service would not shut down. The event corrupted two of the VMs running on the host server, which I was thankfully able to restore with Veeam. I am going to remove the Veeam and 5nine agents from the host and see how it goes before adding them back in again. I just added the Veeam endpoint agents in the meantime to make sure I still have backups of the VMs. Is there a particular patch you credit with fixing the host server? The only update shown as still available when I ran updates was for Silverlight, and I passed on it because it had hung previously.
The other issue may be that we still have a mix of 2012 R2 hosts with the 2019 server. This was the first 2019 in the environment. VMs on the new host are all V9 now too, which poses a different problem for me. Last, a couple of the VMs are not updating their integration services, which concerns me too and may have some bearing on the issue.
- Chad Holm on March 15, 2019 at 4:11 pm
  
  P.S. The 2019 server was not an upgrade. It was a fresh install on a new server.
- Dave Kawula on March 27, 2019 at 2:33 am
  
  Happened again last night. It was directly related to the update of Veeam to 9.5 UR4. Everything was fine before that.
  It locked up the node solid, and we had to do a hard reboot to clear things up.
  We were quite far behind on patches, so maybe that has something to do with it.
  So glad Google found my article to fix this lol.
James Bertram on May 23, 2019 at 1:49 pm

Sadly I have it on a Server 2019 Dell 740xd, fully patched. What type of VM load do you have on your affected hosts? Mine has three very light VMs and a monster (6TB+) SQL server. All my other hypervisors never have this issue.
James Bertram on June 18, 2019 at 1:05 pm

Sadly I seem completely unable to get rid of this issue, and now it has spread to another one of our Dell R740s. It does not seem to happen on any of our older generation Dells (R730s and R710s). This is just killing operations. Both servers are fully patched on the Microsoft side. On the driver/firmware side, one server is fully up to date and the other is lagging behind (the issue only started showing up there last week).
jim bailey on June 27, 2019 at 3:42 pm

We are experiencing a similar problem: Nutanix, Veeam 9.5 Update 4, Server 2016, 3 hosts. Veeam backup will work for months and then stall at 9% creating a checkpoint on a Veeam backup site. This site is off-grid for security reasons, so the hosts have not been patched for a while. Any tips on this issue would be most useful, as there is not a lot to go on.
Alex on July 5, 2019 at 8:08 am

Hello,

I have the same problem. Sometimes backup & replication works for 3 weeks, and then the problem appears.
I patched WS2019, Veeam (Update 4a), the NIC, and BIOS/firmware (Dell R540).
I had the problem again last night…
I opened a Veeam support ticket, and they told me that it’s a Hyper-V problem with snapshots.

Did you solve your problem?

Thanks
Alexandre
Alexandre on July 10, 2019 at 6:42 am

Hello
I have the same problem on a customer’s fresh install.
Environment:
2 servers running Windows Server 2019 Datacenter (1809) with the Hyper-V role
Veeam Backup & Replication 9.5.4.2753
Veeam replica job HyperV01 to HyperV02
Dell PowerEdge R540
The servers are directly connected with Broadcom NetXtreme E-Series Advanced Dual-port 10GBASE-T for the replication.

I changed production checkpoints to standard checkpoints in the Hyper-V VM properties, but the problem came back. I cannot reboot Hyper-V properly.
Every time, I had to hard reboot Hyper-V.

I tried many options in Veeam, but the problem appears again.
In the Hyper-V VMMS logs on the day of the problem, I had ID 19060: VM01 could not perform the Create Control Point operation. The virtual machine is currently performing the following task: Creating the checkpoint. (Virtual Machine ID: 4C9D3F65-B731-487F-A2C6-4002E018103C)

I also get ID 18016, “Can not create production control points for VM01 (Virtual Machine ID: E9E041FE-8C34-494B-83AF-4FE43D58D063)”, each night during backup, even though I switched from production checkpoints to standard checkpoints.

My customer has older OSes such as 2000, 2003 R2, and 2008, and I tried a lot of options in the Veeam backup job. But I’m not sure that the problem comes from there.

Many people have reported the same problem here recently, but there is no solution.
https://social.technet.microsoft.com/Forums/en-US/0d99f310-77cf-43b8-b20b-1f5b1388a787/hyperv-2016-vms-stuck-creating-checkpoint-9-while-starting-backups?forum=winserverhyperv

Did you have the problem again?

Thank you
Alexandre
peti1212 on September 4, 2019 at 6:29 pm

We have had a case open with Microsoft for 3 months now. We have 3 clusters, 2 of which now have the issue. Initially it was only 1; the 2nd one started having the issue about 2-3 weeks ago. The first 2 clusters, configured back in March and April with Server 2019, didn’t have the issue at first. The third cluster, which has had the issue since the beginning, was installed in May-June with Server 2019. I have a feeling one of the newer updates is causing the issue. The 1st cluster, which is not having the problem, has not been patched since.

To this day nothing was resolved and they have no idea what it might be. Now they are closing the case on us because the issue went from one Host in our Cluster to another host, and our scope was the first Hyper-V host having the issue. Unbelievable. The issue is still there though just happening on another host in the Cluster.

The clusters experiencing the issues have the latest generation Dell Servers in them, PE 640s, while the one not having the issue only has older generation PE 520, PE 630, etc.

The way we notice the issue is that we have a PRTG sensor checking our host for responsiveness. At some random point in the day or night, PRTG will report that the sensor is not responding to general Hyper-V Host checks (WMI). After this, no checkpoints, backups, migrations, or setting changes can happen because everything is stuck. We can’t restart the VMMS service or kill it.

Here is what we have tested with no solution yet:

*Remove all 3rd party applications – BitDefender (AV), Backup Software (Backup Exec 20.4), SupportAssist, WinDirStat, etc. – Didn’t fix it.

*Make sure all VMSwitches and Network adapters were identical in the whole cluster, with identical driver versions (Tried Intel, and Microsoft drivers on all hosts) – Didn’t fix it.

*Check the worker process of each VM that got stuck during a checkpoint or migration – Didn’t fix it. Steps:

  *get-vm | ft name, vmid

  *compare the vmid to the vmwp.exe (VM worker process) entries in Task Manager -> Details

  *kill the process

  *Hyper-V showed the VM as Running-Critical

  *Restart VMMS service (didn’t work)

  *net stop vmms (didn’t work)

  *Restart Server -> VMs went unmonitored

  *After restart everything works fine as expected

*Evict Server experiencing issues in Cluster -> This just causes the issue to go to another host, but the issue is still there. – Didn’t fix it.

*Create two VMS (one from template, one new one) on the evicted host -> No issues here, never gets stuck, but other hosts still experience the issue.

*Install latest drivers, updates, BIOS, firmware for all hardware in all the hosts of the cluster. – didn’t fix it.

*We migrated our hosts to a new Datacenter, running up to date switches (old Datacenter – HP Switches, new Datacenter – Dell Switches), and the issue still continues.

*New Cat6 wiring was put in place for all the hosts – Issue still continues.

*Disable “Allow management operating system to share this network adapter” on all VMSwitches – issue still continues

*Disable VMQ and IPSec offloading on all Hyper-V VMs and adapters – issue still continues

*We’re currently patched all the way to August 2019 Patches – issue still continues.
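
The worker-process check in the list above (match each VM's `vmid` to its `vmwp.exe` process) can be sketched offline. This is only an illustration, assuming you have already exported the `get-vm` output and a process list, and assuming the worker process command line contains the VM ID, which is how it usually appears in Task Manager's Command line column. The function name and the sample data are made up.

```python
def find_worker_pids(vms, processes):
    """Map each VM name to the PIDs of vmwp.exe processes whose command
    line contains that VM's ID (compared case-insensitively)."""
    result = {}
    for name, vm_id in vms:
        needle = vm_id.lower()
        result[name] = [
            pid
            for pid, image, cmdline in processes
            if image.lower() == "vmwp.exe" and needle in cmdline.lower()
        ]
    return result


# Hypothetical sample data: (name, vmid) pairs as listed by get-vm, and
# (pid, image name, command line) rows from a process listing.
vms = [
    ("VM01", "11111111-1111-1111-1111-111111111111"),
    ("VM02", "22222222-2222-2222-2222-222222222222"),
]
processes = [
    (4120, "vmwp.exe", r"C:\Windows\System32\vmwp.exe 11111111-1111-1111-1111-111111111111"),
    (5312, "vmms.exe", r"C:\Windows\System32\vmms.exe"),
]

print(find_worker_pids(vms, processes))
```

A VM that maps to an empty list has no matching worker process in the listing, so there is nothing to kill for it.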

We asked Microsoft to assign us a higher Tier technician to do a deep dive into kernel dumps and process dumps, but they would not do it until we exhausted all the basic troubleshooting steps. Now they are not willing to work further because the issue moved from 1 host to another after we moved from one datacenter to another. So it seems that how the Cluster comes up, and who owns the disks and network, might determine which host has the issue.

Also, our validation testing passes for all the hosts, besides minor warnings due to CPU differences.

Any ideas would be appreciated.
- Bostjan Cvelbar on September 19, 2019 at 5:23 am
  
  Same issue here.
  Error message that we receive is:
  
  Failed to create VM recovery checkpoint (mode: Crash consistent) Details: Job failed (‘’). Error code: ‘32774’. Failed to create VM recovery snapshot, VM ID ‘dd9d9847-7efe-4195-852a-c34f71b15d5e’.
  Retrying snapshot creation attempt (Failed to create production checkpoint.)
  
  Only rebooting the Hyper-V 2019 host solves the issue :/ Updating the NIC from the Intel site did not help.
  Any solutions yet guys?
  - peti1212 on October 7, 2019 at 11:22 pm
    
    Disable VMQ on the host and on the VMs one by one, then restart your VMs and your host. This will resolve the issue. When we first tested this, we didn’t restart the host, and the issue still happened. Once we restarted the host as well, it never came back. Verified on 3 different clusters. It’s a bug on Microsoft’s side.
    - Bostjan Cvelbar on October 8, 2019 at 8:06 am
      
      Hi. Thank you for your reply. Please advise where we can disable the VMQ setting.
      - peti1212 on October 8, 2019 at 4:21 pm
        
        Go to Hyper-V -> Right click on a VM -> Settings -> Network Adapters -> Advanced Settings. Do it on all the VMs. Also go into Control Panel -> Network and Sharing Center -> Change Adapter Settings -> Right click and “Properties” on your adapters -> Configure -> Advanced Tab -> Check for any Virtual Machine Queues or VMQ here.
        
        Then restart VMs and Host.
        
        Also, take a look at this thread: https://social.technet.microsoft.com/Forums/en-US/0d99f310-77cf-43b8-b20b-1f5b1388a787/hyperv-2016-vms-stuck-creating-checkpoint-9-while-starting-backups?forum=winserverhyperv
        
        Someone claims that they got a proper fix from Microsoft. There are no details on what was changed, but something was updated and we might be expecting a patch soon if all goes well.
Boštjan Cvelbar on October 10, 2019 at 2:08 pm

Hi peti1212. Unfortunately it did not help.
On the Hyper-V 2019 host I have several VMs (Windows and Linux).
I can only solve the issue by rebooting the Hyper-V host; then the backups start to work.
I took your advice and tried it on only one of the VMs: I turned off VMQ, restarted the VM, and restarted the host.
Now the funny thing is that this error appears periodically, and the VM where I turned off VMQ is now the first to fail (the other VMs back up successfully until the day they all fail).

Error message in Veeam Backup & Replication 9.5 Update 4b:

9. 10. 2019 22:36:04 :: Failed to create VM recovery checkpoint (mode: Crash consistent) Details: Job failed (‘’). Error code: ‘32774’.
Failed to create VM recovery snapshot, VM ID ‘a22ac904-57bb-42d1-ae95-022bfbe2f04a’.

9. 10. 2019 22:36:15 :: Retrying snapshot creation attempt (Failed to create production checkpoint.)

9. 10. 2019 22:36:17 :: Unable to allocate processing resources. Error: Job failed (‘’). Error code: ‘32774’.
Failed to create VM recovery snapshot, VM ID ‘a22ac904-57bb-42d1-ae95-022bfbe2f04a’.
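
When the same failure shows up across many job sessions, it can help to pull the error codes and VM IDs out of the pasted log text to see which VMs are affected. The following is a minimal sketch, assuming the log lines look like the ones quoted above; the function name and the regular expressions are illustrative and are not part of any Veeam tooling.

```python
import re

# Pasted logs mix straight and curly quotes, so accept either kind.
QUOTE = "['\u2018\u2019]"
ERROR_CODE = re.compile(r"Error code:\s*" + QUOTE + r"(\d+)" + QUOTE)
VM_ID = re.compile(r"VM ID\s*" + QUOTE + r"([0-9a-fA-F-]{36})" + QUOTE)


def summarize_failures(log_text):
    """Return (sorted unique error codes, sorted unique lowercase VM IDs)."""
    codes = sorted(set(ERROR_CODE.findall(log_text)))
    vm_ids = sorted({vm_id.lower() for vm_id in VM_ID.findall(log_text)})
    return codes, vm_ids


# Sample text copied from the job log quoted in this comment.
sample = (
    "9. 10. 2019 22:36:04 :: Failed to create VM recovery checkpoint "
    "(mode: Crash consistent) Details: Job failed (''). Error code: '32774'.\n"
    "Failed to create VM recovery snapshot, "
    "VM ID 'a22ac904-57bb-42d1-ae95-022bfbe2f04a'.\n"
)

codes, vm_ids = summarize_failures(sample)
print("error codes:", codes)
print("VM IDs:", vm_ids)
```

The VM IDs can then be compared against the `get-vm | ft name, vmid` output mentioned earlier in the thread to put names to the failing machines.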
- peti1212 on October 15, 2019 at 10:28 pm
  
  You need to do it on all of the VMs. My understanding is that there is a bad implementation of VMQ in the drivers that is not compatible with Hyper-V 2019, so VMQ needs to be disabled on all VMs and the host, and then everything restarted. Once that is done, the issue will go away. We have thankfully had 15 Hyper-V hosts running smoothly now for over a month.
  - Boštjan Cvelbar on October 22, 2019 at 4:28 am
    
    Hi. Even disabling VMQ on all the VMs and rebooting the VMs and the host didn’t help.
    After several days, even months, of debugging I finally found the reason why it wasn’t working.
    It was bloody damn Group Policy. We had 1 wrong GPO set, which messed with the “Log on as a service” right. It is a similar issue to this guy’s (article below: https://social.technet.microsoft.com/Forums/en-US/8a6e0f16-49a1-433c-aaec-9bff21a37181/hyperv-not-merging-snapshots-automatically-disk-merge-failed?forum=winserverhyperv)