Hey Checkyourlogs fans,

Today I want to talk about what happens if you lose your East/West (storage) switch fabric with Storage Spaces Direct. In this customer's design, they had a dedicated pair of Mellanox switches for their storage network and separate core switches for their North/South (VM and MGMT) networks. We had to do some emergency maintenance on the pair of Mellanox switches that would require a reboot of both.

I let the customer know that when this happens, the storage traffic would simply fail over from RDMA (RoCE) to non-RDMA for the duration of the outage. They couldn't believe this and wanted to check it out for themselves.
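If you want to see this for yourself, here is a rough sketch of how you might check it from one of the cluster nodes. These are standard cmdlets from the NetAdapter and SmbShare modules, but the properties I picked for the output are just my assumption of what is most useful, so adjust to taste. Run it before and during the outage and compare the results.

```powershell
# Which physical/virtual adapters currently have RDMA enabled?
Get-NetAdapterRdma | Format-Table Name, InterfaceDescription, Enabled

# Which interfaces does the SMB client see, and are they RDMA capable?
Get-SmbClientNetworkInterface |
    Format-Table FriendlyName, RdmaCapable, RssCapable, LinkSpeed

# Which connections is SMB Multichannel actually using right now?
# ClientRdmaCapable = True means the selected path can use RDMA.
Get-SmbMultichannelConnection |
    Format-Table ServerName, ClientIpAddress, ServerIpAddress, ClientRdmaCapable, Selected

# S2D node-to-node storage traffic runs in its own SMB instance,
# so it is worth checking the Storage Bus Layer connections as well.
Get-SmbMultichannelConnection -SmbInstance SBL |
    Format-Table ServerName, ClientIpAddress, ServerIpAddress, ClientRdmaCapable, Selected
```

With the storage fabric down, you would expect the selected connections to move to the remaining (non-RDMA) cluster network.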

Here is how our Virtual Adapters are configured:

MGMT is configured through a SET switch on NIC_1 and NIC_2 to the core Cisco switches

HB/LM/SMB_1/SMB_2 are configured through a second SET switch to the Mellanox core

So, I reloaded both switches. (One had failed, so I only had one left.) In essence, this was a complete failure of the storage network at this point. (This is one major reason why I like dedicated storage and client networks.)

If you check in Failover Cluster Manager, this is what it looks like:


After …

Our VMs for this cluster stayed online without any issues.
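If you would rather confirm this from PowerShell than from the GUI, something along these lines should do it. This is a minimal sketch, assuming you run it on a cluster node with the FailoverClusters, Storage, and Hyper-V modules available. The columns I chose are just my preference.

```powershell
# Cluster networks: the storage networks should show as down/partitioned
# during the outage, while the MGMT network stays up.
Get-ClusterNetwork | Format-Table Name, State, Role, Address

# Per-node view of the same thing
Get-ClusterNetworkInterface | Format-Table Node, Name, Network, State

# Health of the S2D virtual disks, plus any repair jobs in flight
Get-VirtualDisk | Format-Table FriendlyName, HealthStatus, OperationalStatus
Get-StorageJob

# State of the VMs across all cluster nodes
Get-VM -ComputerName (Get-ClusterNode).Name |
    Format-Table ComputerName, Name, State
```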

Pretty cool, right? It just shows off a bit of the resilience of Storage Spaces Direct.

NOTE: All of the traffic will be a bit more congested, but things stay alive rather than dying and bluescreening. This obviously isn't a permanent solution, but knowing that you can do this gives you some nice options for outage windows with your S2D clusters.

BTW: This works the same for Windows Server 2016 and 2019.

Hope you enjoy,