Deploying Storage Spaces Direct – Part 42 – HPE RDMA Issues Mellanox CX3-Pro Prod Issue #StorageSpacesDirect #HPE #S2D @MellanoxTech

Hey Storage Spaces Direct Fans,

Today we seem to have finally resolved the case of the stuck jobs on our production 4-node S2D cluster running on HPE DL380 G9s and Mellanox CX3-Pro NICs. We had been banging our heads against the wall, so I decided to re-test our RDMA configuration using the Test-RDMA.ps1 script mentioned in an earlier post.

The results of the script indicated that we had some issues with the RDMA configuration. This is kind of funny, because after I had set this up everything was working great. We tested all of this and it was working. Now, time to put on the big boy pants and figure this problem out.

Next step was to see if RDMA was even listening on our Storage Virtual Adapters.

We ran Netstat -xan…

This actually didn’t look right to me because this is the output we get prior to running Enable-ClusterS2D to build up Storage Spaces Direct. It’s like the nodes won’t talk RDMA to one another.
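If you want to run the same kind of check on your own nodes, here is a minimal sketch. It is not the exact set of commands from our session, just the standard cmdlets I would reach for alongside netstat, and the columns you see will depend on your adapters and drivers.

```powershell
# Which adapters report RDMA as enabled at the NIC level?
Get-NetAdapterRdma | Where-Object Enabled |
    Format-Table Name, InterfaceDescription, Enabled

# Does the SMB client consider those interfaces RDMA capable?
Get-SmbClientNetworkInterface |
    Format-Table FriendlyName, RdmaCapable, RssCapable

# NetworkDirect (RDMA) listeners, connections, and shared endpoints
netstat -xan
```

If the adapters show RDMA enabled but netstat shows no listeners on the storage IP addresses, the problem is likely above the NIC layer, which is where we went looking next.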

Just before we were about to give up we had a look at the Networks in Failover Cluster Manager. Here is the before shot:

On a hunch, I figured that the cluster network role might have some kind of relationship to Storage Spaces Direct traffic. When we had cleaned up the networks in Failover Cluster Manager, I remembered changing the cluster use setting on these networks to None.

We changed it back to Cluster and Client as below…
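The same change can be made from PowerShell instead of Failover Cluster Manager. This is a rough sketch under one assumption: the network name `Storage1` is hypothetical, so substitute the names that the first command returns on your cluster.

```powershell
# List every cluster network with its current role.
# Role values: 0 = None, 1 = Cluster only, 3 = Cluster and Client
Get-ClusterNetwork | Format-Table Name, Role, Address, State

# 'Storage1' is a placeholder name; use one of the names listed above.
$storageNetwork = Get-ClusterNetwork -Name 'Storage1'
$storageNetwork.Role = 3   # 3 = Cluster and Client

# Confirm the change took effect.
Get-ClusterNetwork -Name 'Storage1' | Format-Table Name, Role
```

Repeat for each storage network that was set to None.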

Then, like magic, look at the result from netstat -xan:

That is much closer to what I would expect to see after running Enable-ClusterS2D.

We then re-ran the Test-RDMA.ps1 script with much better results.

Lastly, we decided to reboot all the nodes to see if the stuck storage jobs would start working, and it appears they did. My client ended up having to leave for the day, so I don’t know yet if this fixed both the RDMA issue and the stuck jobs. We will check back tomorrow and find out.

Here is a shot of the jobs running now post reboot of the cluster.
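For reference, the job status can also be pulled from PowerShell on any node. A minimal sketch, assuming you just want a quick view of progress and virtual disk health:

```powershell
# Show running storage jobs (repair, rebalance, etc.) and their progress.
Get-StorageJob |
    Format-Table Name, JobState, PercentComplete, BytesProcessed, BytesTotal

# Check whether the virtual disks have returned to a healthy state.
Get-VirtualDisk |
    Format-Table FriendlyName, OperationalStatus, HealthStatus
```

A job whose PercentComplete and BytesProcessed do not move between runs is a good sign that it is stuck rather than just slow.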

Hopefully this makes that painful stuck job and RDMA issue disappear so we can move into production.

Thanks, and happy learning.

Dave

2 Comments

Reshad A on September 17, 2017 at 9:51 pm

Hi Dave,

Great series so far, we’ve learnt a lot from your blog! Our 2-node setup has been working well, except an issue where any repair job takes hours and copies TBs of data. Have you experienced this with your deployments?

Dave Kawula on November 8, 2017 at 5:40 pm

Hey Reshad,
Repair jobs are throttled by default so they do not overload the cluster. That is the nature of the way it was designed, and as far as I know right now no changes to this behavior are planned. We have asked the Microsoft product team to make this a configurable setting and they are taking this under advisement.
I have found that with a proper RDMA configuration, repairs do complete a lot faster.

Thanks,

Dave
