Hey Storage Spaces Direct Fans,
Today we seem to have finally resolved the case of the jobs that were stuck on our production 4-node S2D cluster running on HPE DL380 G9s and Mellanox CX3-Pro NICs. We had been banging our heads against the wall, so I decided to re-test our RDMA configuration using the Test-RDMA.ps1 script mentioned in an earlier post.
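For anyone who wants to follow along, here is a rough sketch of how we run Test-RDMA.ps1 (it comes from Microsoft's SDN repository on GitHub). The interface index and remote IP below are placeholders for this example; substitute the values from your own storage adapters.

```powershell
# Find the interface index (IfIndex) of the storage vNIC you want to test
Get-NetAdapter | Format-Table Name, IfIndex, Status

# Run the test against the partner node's storage IP.
# -IsRoCE $true because our Mellanox CX3-Pro NICs use RoCE (use $false for iWARP).
# The IfIndex and IP here are example values only.
.\Test-RDMA.ps1 -IfIndex 23 -IsRoCE $true -RemoteIP 172.16.1.2
```

If RDMA is healthy, the script pushes traffic between the nodes and reports that RDMA traffic was detected; if it falls back to TCP, something in the configuration is off.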
The results of the above script indicated that we had some issues with our RDMA configuration. This is kind of funny, because after I had originally set this up everything was working great; we tested all of it at the time. Now, time to put on the big boy pants and figure this problem out.
The next step was to see if RDMA was even listening on our storage virtual adapters.
We ran netstat -xan…
This didn’t look right to me, because it is the output we get prior to running Enable-ClusterS2D to build Storage Spaces Direct. It’s as if the nodes won’t talk RDMA to one another.
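Besides netstat -xan, a couple of built-in cmdlets can cross-check whether RDMA is actually available to SMB — a quick sketch of what we look at:

```powershell
# Confirm RDMA is enabled on the physical NICs and the storage vNICs
Get-NetAdapterRdma | Format-Table Name, Enabled

# Confirm SMB sees the interfaces as RDMA-capable on both the client and
# server side; RdmaCapable should be True for the storage vNICs
Get-SmbClientNetworkInterface | Format-Table FriendlyName, RdmaCapable
Get-SmbServerNetworkInterface | Format-Table FriendlyName, RdmaCapable
```

If RdmaCapable shows False on interfaces that should be doing RDMA, SMB Direct will silently fall back to TCP, which matches the symptoms we were seeing.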
Just before we were about to give up we had a look at the Networks in Failover Cluster Manager. Here is the before shot:
On a hunch, I figured that this might have some kind of relationship to Storage Spaces Direct traffic. When we had cleaned up the networks in Failover Cluster Manager, I remembered changing this setting to None.
We changed it back to Cluster and Client, as below…
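The same change can be made from PowerShell instead of Failover Cluster Manager — a small sketch, where "Cluster Network 1" is a placeholder name (check Get-ClusterNetwork for yours):

```powershell
# List the cluster networks and their current roles
# Role values: 0 = None, 1 = Cluster only, 3 = Cluster and Client
Get-ClusterNetwork | Format-Table Name, Role, Address

# Set the storage network back to Cluster and Client
(Get-ClusterNetwork -Name "Cluster Network 1").Role = 3
```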
Then, like magic, look at the result from netstat -xan…
That is much more like what I would expect to see after running Enable-ClusterS2D.
We then re-ran the Test-RDMA.ps1 script with much better results.
Lastly, we decided to reboot all the nodes to see if the stuck storage jobs would start working again, and it appears they have. My client ended up having to leave for the day, so I don’t yet know whether this fixed both the RDMA issue and the stuck jobs. We will check back tomorrow and find out.
Here is a shot of the jobs running now post reboot of the cluster.
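If you want to keep an eye on the jobs from PowerShell rather than screenshots, here is a simple sketch we use to watch progress (the 10-second refresh interval is just an example):

```powershell
# One-shot view of all storage jobs and their progress
Get-StorageJob | Format-Table Name, JobState, PercentComplete, BytesProcessed, BytesTotal

# Loop to watch the repair/rebalance jobs tied to the virtual disks,
# refreshing every 10 seconds (Ctrl+C to stop)
while ($true) {
    Get-VirtualDisk | Get-StorageJob |
        Format-Table Name, JobState, PercentComplete
    Start-Sleep -Seconds 10
}
```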
Hopefully this makes that painful stuck job and RDMA issue disappear so we can move into production.
Thanks, and happy learning.
Great series so far, we’ve learnt a lot from your blog! Our 2-node setup has been working well, except an issue where any repair job takes hours and copies TBs of data. Have you experienced this with your deployments?
Repair jobs are throttled by default so they don’t overload the cluster. That is the nature of the way it was designed, and as far as I know no changes to this behavior are currently planned. We have asked the Microsoft product team to make this a configurable setting, and they are taking it under advisement.
I have found that with a proper RDMA configuration, these repairs do complete a lot faster.