Hey Checkyourlogs fans,
I wanted to chat with you today about what happens if you lose your East/West (Storage) switch fabric with Storage Spaces Direct. In my design for this customer, they had a dedicated pair of Mellanox switches for their Storage Network and separate core switches for their North/South (VM & MGMT) networks. We had to do some emergency maintenance on the pair of Mellanox switches that would require a reboot of both.
I let the customer know that when this happens, the storage traffic would just fail over from RDMA (RoCE) to non-RDMA during the outage. They couldn’t believe this and wanted to check it out for themselves.
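If you want to see the fallback for yourself, one option is to export the SMB Multichannel connections on a node during the outage (for example with Get-SmbMultichannelConnection piped to ConvertTo-Json) and count how many paths still report RDMA on both ends. Here is a minimal Python sketch of that count. The property names ClientRdmaCapable and ServerRdmaCapable are what I would expect in that export, but treat them as assumptions and check your own output. The sample rows, node name, and interface indexes below are made up for illustration and are not from this customer's cluster.

```python
import json


def summarize_smb_paths(connections):
    """Count SMB paths by whether both ends report RDMA capability."""
    # ConvertTo-Json emits a single object (not a list) when there is one row.
    if isinstance(connections, dict):
        connections = [connections]
    summary = {"rdma": 0, "non_rdma": 0}
    for conn in connections:
        # Assumed property names; a missing property is treated as not RDMA capable.
        both_ends = bool(conn.get("ClientRdmaCapable")) and bool(conn.get("ServerRdmaCapable"))
        summary["rdma" if both_ends else "non_rdma"] += 1
    return summary


# Hypothetical export taken while the storage switches were down.
sample_json = """
[
  {"ServerName": "NODE2", "ClientInterfaceIndex": 5,
   "ClientRdmaCapable": false, "ServerRdmaCapable": false},
  {"ServerName": "NODE2", "ClientInterfaceIndex": 7,
   "ClientRdmaCapable": false, "ServerRdmaCapable": false}
]
"""

sample = json.loads(sample_json)
print(summarize_smb_paths(sample))
```

During normal operation you would expect the rdma count to be non-zero. During the outage it should drop to zero while the connections themselves stay up.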
Here is how our Virtual Adapters are configured:
MGMT is configured through a SET Switch on NIC_1 and NIC_2 to the Core Cisco Switches
HB/LM/SMB_1/SMB_2 are configured through a 2nd SET Switch to the Mellanox Switches
So, I reloaded both switches (one had failed, so I only had one left). In essence, this was a complete failure of the Storage Network at this point. (This is one major reason why I like dedicated Storage and Client Networks.)
If you check in Failover Cluster Manager, this is what it looks like:
Our VMs for this cluster stayed online without any issues.
Pretty cool, right? Just shows off a bit of the resilience of Storage Spaces Direct for you today.
NOTE: All of the traffic would be a bit more congested, but things would stay alive and not just die and bluescreen. This obviously isn’t a permanent solution, but knowing that you can do this gives you some nice options for outage windows with your S2D Clusters.
BTW: This works the same for Windows Server 2016 and 2019.
Hope you enjoy,