Hey Checkyourlogs Fans,
I am writing to you tonight after some not-so-fun nights dealing with persistent issues with Windows Server 2019 and Storage Spaces Direct this month. Let me preface this with the fact that we are early adopters, and all of my clients so far understand this and are willing to work with Microsoft and the vendors to improve the experience.
WSSD is slated to come out with full certifications starting in March of 2019.
My customer has decided to purchase a brand new Hyper-Converged Cluster running all NVMe SSD flash drives. We have seen the following issues so far in our deployment, some resolved and some not:
#1 – Mellanox Firmware and Driver Issues – We saw a ton of Paused Packets on the switches and in the Mellanox Performance Counters. This was breaking our lossless configuration of RDMA (RoCE) for the Storage Spaces Direct nodes. This has been resolved with the help of Mellanox.
#2 – Mellanox SN2700 switches have a new RoCE configuration for lossless networks. Specifically, they have some special settings for Advanced Buffer Management. Without these settings configured properly, you will see abnormally high Paused Packets, which also break the lossless configuration required for RDMA (RoCE). Here is a great link to some configurations authored by a Microsoft Premier Field Engineer (Jan Mortenen) – https://www.s2d.dk/2019/01/monitor-roce-mellanox_5.html . For the record, we have been configuring with Layer 3 DSCP. You should check out his blog; he has some great stuff up there.
#3 – The SDDC Management Resource, which Windows Admin Center uses as a polling mechanism to return results for the HTML5 UI, was crashing cluster roles (VMs). There was a confirmed bug, and it is said to have been fixed in the 1D Cumulative Update for January. For now, we have been disabling this resource in Failover Cluster Manager until we can confirm what is happening. Things have been pretty stable since we stopped using Windows Admin Center. We don’t anticipate this being a prolonged issue, but it is one that we can’t have, and hopefully it is indeed fixed in 1D.
#4 – We have had reports of customers getting files locked up in their Cluster Shared Volumes (CSV). In some cases, this has caused some production data loss. For our customer, we had good backups and replicas and were able to avoid prolonged outages. It is still unclear at this point what was causing this problem. The Microsoft product teams are investigating.
#5 – NVMe Performance Issues – Working with one of the vendors I’m close with, I took an identical cluster, and we ran identical VMFleet tests on the same hardware under both operating systems. The results are pretty shocking. I discovered this issue in the customer’s production cluster when my all-NVMe cluster started showing 15+ ms latency. The same cluster reformatted with Windows Server 2016 shows <1 ms (microsecond-range) latency. Digging in, we have been working with Mellanox and have validated that their updated drivers and firmware look good. RoCE at the hardware level appears to be configured correctly with the Mellanox SN2700 switches. This issue has now been escalated to Microsoft, and there is still no clear path as to what the issue might be at this time. Below are some screenshot examples of VMFleet running on both platforms. Despite the different host names, it is the same hardware.
Random 4K, 8 Threads, 8 Outstanding I/O, 100% Read
Windows Server 2016
Pretty good numbers, right? 3+ million IOPS and <1 ms latency.
Look at the Bandwidth – 12 GB/sec
These are running on 40 GbE Mellanox CX4 adapters.
Looks great to me.
Now let's try the same thing on Windows Server 2019.
Umm, 800K IOPS with 3.5 GB/sec.
WHAT? 62 ms latency at the top end.
Yeah, something is not right here.
Random 4K, 8 Threads, 8 Outstanding I/O, 100% Write
Windows Server 2016 – Not bad: 800K IOPS, 3.25 GB/sec, and about 3.8 ms latency
Windows Server 2019 – 500K IOPS, 2.1 GB/sec, and <20 ms latency
And for the final test:
Good old 70/30 Read/Write
Random 4K, 8 Threads, 8 Outstanding I/O, 70% Read / 30% Write
Windows Server 2016 performs quite well, with over 1.7 million IOPS, 7+ GB/sec bandwidth, and <1 ms latency.
Windows Server 2019:
We hit 700K IOPS, 3 GB/sec, and 35+ ms latency.
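As a quick sanity check on the numbers above, here is a minimal Python sketch that works out the bandwidth implied by each IOPS figure at a 4K block size, plus the 2016-to-2019 IOPS ratio for each test. The IOPS values are the rounded figures quoted in this post, not raw VMFleet output, and the names (QUOTED_IOPS, bandwidth_gb_per_sec, iops_ratio) are made up for illustration.

```python
# Sanity check of the VMFleet figures quoted in this post.
# All numbers are the rounded values from the text, not raw measurements.

BLOCK_SIZE_BYTES = 4 * 1024  # "Random 4K" I/O size used in every test

# Approximate IOPS per test and OS, as quoted in the post (hypothetical layout).
QUOTED_IOPS = {
    "100% read": {"ws2016": 3_000_000, "ws2019": 800_000},
    "100% write": {"ws2016": 800_000, "ws2019": 500_000},
    "70/30 read/write": {"ws2016": 1_700_000, "ws2019": 700_000},
}


def bandwidth_gb_per_sec(iops, block_size=BLOCK_SIZE_BYTES):
    """Bandwidth implied by an IOPS figure at a fixed block size, in GB/s (10^9 bytes)."""
    return iops * block_size / 1e9


def iops_ratio(ws2016_iops, ws2019_iops):
    """How many times more IOPS Windows Server 2016 delivered than Windows Server 2019."""
    return ws2016_iops / ws2019_iops


if __name__ == "__main__":
    for test, iops in QUOTED_IOPS.items():
        bw_2016 = bandwidth_gb_per_sec(iops["ws2016"])
        bw_2019 = bandwidth_gb_per_sec(iops["ws2019"])
        ratio = iops_ratio(iops["ws2016"], iops["ws2019"])
        print(
            f"{test}: WS2016 ~{bw_2016:.1f} GB/s, "
            f"WS2019 ~{bw_2019:.1f} GB/s, IOPS ratio ~{ratio:.2f}x"
        )
```

The implied bandwidths land close to what the screenshots report (about 12.3 GB/sec for 3 million IOPS, for example), so the IOPS and bandwidth figures are at least consistent with each other. On IOPS alone the read test comes out at roughly 3.75x in favor of Windows Server 2016, and that is before you factor in the latency difference.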
Now, I know what you are thinking: Aren’t you supposed to be a huge fanboy of Storage Spaces Direct, Dave? The answer to that is absolutely. However, my customers always come first, and this one, in particular, is running into some really weird issues at this time. At this point, Microsoft has taken our issue internally and is working through it to see what exactly is up. At the end of the day, all I know is that if I take the same hardware, run Windows Server 2016, and get 4x the performance or more, something is wrong.
Also, please remember that we are early adopters, and as such we expect to hit roadblocks. My goal with this post is to see if anyone else is experiencing something similar, so we can work toward the common goal of getting things resolved. As always, I would highly recommend waiting until your vendors of choice have certified their hardware with the Windows Server Software Defined Program (WSSD). That program's single goal in life is to prevent these types of issues from occurring in the field. It has vendors certify their hardware and ensure that it can pass stress tests.
We feel this is not a hardware problem, because the same configuration works so well on Windows Server 2016. Like I said, Microsoft is evaluating the problems at this point, and as soon as I hear back, I’ll let you know.
So for right now, continue your testing of Windows Server 2019, and if you ask me, I would wait about 45 more days for production until the certified builds are ready from both Microsoft and the vendors.
I really hope you enjoyed this post and it saves you a ton of time,