Important UPDATE: June 27, 2017 – by Dave Kawula

Hey folks –> We now have an active support case open with Mellanox Technical Support. They are working through the issue and we are communicating daily about it. I will post more as we know more.

Your Mellanox support case #00358620 has been updated with the following comment:
————————————————————–
Dave – I am now seeing “Responder CQE Errors” as you say… please hold for diagnostics.


As we know more I will provide some more updates.

Thanks,
Dave


Important UPDATE: June 16, 2017 – by Dave Kawula

Hey folks –> I have posted this issue in the Mellanox support forums and the community has been great about getting back to me. Mellanox, to this point, not so much. Basically, there was a user who said RoCE was broken on Windows Server 2016. He downgraded the firmware and went back to the oldest driver for 2016, and that fixed it for him. Here is his reply:


Hi, Dave!

Thanks for confirming that I’m not alone with this problem.


I’d also like to say that we faced this problem while running the previous firmware (2.40.5032) and drivers (5.35.12970.0). I saw that 2.40.7000 fixes two [irrelevant] bugs, but updated to this version anyway in the hope it would suddenly fix RoCE.

Actually, I’m not building S2D, but trying to connect Hyper-V hosts to a SOFS via RDMA, and I can see that RDMA is not working at all (nd_rping and nd_send_bw can’t establish a RoCE connection).


By the way, even without using RDMA we have problems with these servers: the virtual machines constantly lose their connection to storage (which is on the SOFS). I now think that may be caused by the firmware too.


I’ll try to downgrade to older fw/drivers…


I’ve downgraded to the oldest driver for Windows Server 2016 (v5.25, the first release for Windows 2016) and chose the option to flash the firmware included with those drivers.

RoCE works now!


***************************************************


Hey Storage Spaces Direct fans, I have been working on a deployment for a customer in Washington (health care).

We are building out a development environment for them, and they have purchased Mellanox NICs for both DEV and Production (Mellanox CX3-Pro MCX-3112B-XCCT).

Like any normal deployment, we decided to grab the latest firmware and drivers from Mellanox’s website.

We downloaded the latest Drivers and Firmware and this is what we get for performance on the cards when copying to the CSV Volume in the Cluster.

10 MB/sec, are you kidding me… (We tried everything before looking at this firmware: changed the Avago controller “LSI 9300”, updated the firmware on our drives, tore apart the failover cluster, and tried every conceivable configuration change in Storage Spaces Direct.) Finally, I noticed that the firmware version in my lab and the customer’s was different. So I upgraded mine, and this is what happened.

For all you networking junkies out there… I can repro this on demand every time I upgrade the firmware to 2.40.7000.
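If you want to reproduce the numbers yourself, a rough throughput check is just a timed copy to the CSV volume. A minimal sketch; the test file and CSV path below are placeholders, so point them at your own file and volume:

# Rough repro: time a large file copy to the CSV volume and work out MB/sec
# The source file and CSV path are placeholders - adjust for your own cluster
$source      = 'D:\Temp\testfile-10GB.bin'
$destination = 'C:\ClusterStorage\Volume1\testfile-10GB.bin'

$elapsed = Measure-Command { Copy-Item -Path $source -Destination $destination }
$sizeMB  = (Get-Item $source).Length / 1MB
'{0:N0} MB in {1:N1} sec = {2:N1} MB/sec' -f $sizeMB, $elapsed.TotalSeconds, ($sizeMB / $elapsed.TotalSeconds)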


I am able to somewhat fix the issue by reverting the firmware, reinstalling the old driver, and tearing down the entire Virtual networking stack.
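For reference, the firmware flash itself is done with the Mellanox firmware tools (MFT). This is only a sketch; the device name below is an example, so use whatever mst status reports on your server:

# Find the Mellanox device name, then burn the older firmware image with flint (MFT)
mst status
# mt4103_pci_cr0 is an example device name from mst status - substitute your own
flint -d mt4103_pci_cr0 -i fw-ConnectXPro-rel-2_40_5030-MCX3112B-XCC_Ax-FlexBoot-3.4.746.bin burn
# Confirm the firmware version after the burn (a reboot is still required)
flint -d mt4103_pci_cr0 query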

Here are the detailed steps I followed to fix this:

  1. Revert the Firmware on the Card to: 2.40.5030 (fw-ConnectXPro-rel-2_40_5030-MCX3112B-XCC_Ax-FlexBoot-3.4.746.bin)
  2. Uninstall the Mellanox Driver (This will break your networking Setup but that is fine)
  3. Reinstall the Mellanox Driver with the older version: 5.35.51100.0 (MLNX_VPI_WinOF-5_35_All_win2016_x64)


  4. Restart the server
  5. Validate that the firmware and driver versions are updated in Device Manager
  6. Rebuild the SET team, VM switch, and virtual adapters, and re-enable RDMA
#Remove the existing SMB virtual adapters and the SET team switch

Get-VMNetworkAdapter -ManagementOS -Name SMB_3 | Remove-VMNetworkAdapter
Get-VMNetworkAdapter -ManagementOS -Name SMB_4 | Remove-VMNetworkAdapter

Get-VMSwitch -Name teamedvswitch02 | Remove-VMSwitch -Confirm:$false

#Recreate the SET team switch, re-add the SMB virtual adapters, and re-enable RDMA on them
New-VMSwitch -Name teamedvswitch02 -NetAdapterName "Ethernet 3", "Ethernet 4" -EnableEmbeddedTeaming $true -Confirm:$false

Add-VMNetworkAdapter -SwitchName teamedvswitch02 -Name SMB_3 -ManagementOS
Add-VMNetworkAdapter -SwitchName teamedvswitch02 -Name SMB_4 -ManagementOS
Enable-NetAdapterRdma -Name "vEthernet (SMB_3)","vEthernet (SMB_4)"
Get-NetAdapterRdma

  7. Reconfigure the IP addresses on the SMB_3 and SMB_4 virtual adapters (see the sketch below)
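That last step is just a couple of New-NetIPAddress calls against the rebuilt vNICs. A minimal sketch; the 172.16.x.x addresses are only examples, so substitute your own SMB subnets:

# Re-assign the storage IPs to the rebuilt SMB virtual adapters
# 172.16.3.0/24 and 172.16.4.0/24 are example subnets - use your own SMB networks
New-NetIPAddress -InterfaceAlias "vEthernet (SMB_3)" -IPAddress 172.16.3.11 -PrefixLength 24
New-NetIPAddress -InterfaceAlias "vEthernet (SMB_4)" -IPAddress 172.16.4.11 -PrefixLength 24

# Optional: keep the SMB networks out of DNS registration
Set-DnsClient -InterfaceAlias "vEthernet (SMB_3)","vEthernet (SMB_4)" -RegisterThisConnectionsAddress $false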

This is still slow, as it is about half of the speed I used to get in my lab prior to touching the drivers and firmware. I was getting 500-700 MB/sec before.


I also tested this configuration upgrading to the latest Mellanox driver – MLNX_VPI_WinOF-5.35_All_Win2016_x64 (5.35.12978.0 in Device Manager, 5.35.52000.0 on the driver package). My results were about the same –> better than the 10 MB/sec, but not faster than 200 MB/sec yet.
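A quick way to sanity-check whether the SMB traffic is actually riding over RDMA after each driver/firmware change is to look at the SMB client interfaces and the RDMA Activity performance counters. A rough sketch (run it while a file copy is in progress):

# RdmaCapable should be True on the SMB vNICs
Get-SmbClientNetworkInterface | Format-Table FriendlyName, RdmaCapable, IpAddresses -AutoSize

# The RDMA Activity counters should show real traffic while the copy is running
Get-Counter -Counter '\RDMA Activity(*)\RDMA Inbound Bytes/sec','\RDMA Activity(*)\RDMA Outbound Bytes/sec' |
    Select-Object -ExpandProperty CounterSamples |
    Format-Table InstanceName, CookedValue -AutoSize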


One really big note, if you are reading this, is that the WinOF driver is for the CX3 and CX3-Pro lineup, and WinOF-2 is for the CX4/CX5 lineup of Mellanox cards.

You can see this in the screen shot below from their website.


Make sure you download the correct drivers for your Card!
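A quick way to confirm which family of card (and therefore which driver package) Windows thinks you have is to look at the interface description and driver details. A minimal sketch:

# ConnectX-3 / ConnectX-3 Pro adapters use WinOF; ConnectX-4 / ConnectX-5 use WinOF-2
Get-NetAdapter |
    Where-Object InterfaceDescription -like 'Mellanox*' |
    Format-Table Name, InterfaceDescription, DriverVersionString, DriverDate -AutoSize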

I did some digging on the Firmware and this is what I have been able to come up with so far:

There only appear to be two fixes in this latest firmware:

RM#980151: Fixed an issue where a virtual MAC address which is configured by set_port (ifconfig), remained after driver restart

RM#913926: Fixed an issue where the two interfaces reported the same MAC address when bonding configuration was used.


http://www.mellanox.com/pdf/firmware/ConnectX3Pro-FW-2_40_7000-release_notes.pdf

I don’t think it will be that big of a deal to NOT install this firmware and to go with the latest drivers, at least until Mellanox comes back to the community with a fix and/or a workaround.

There was something rather disturbing that I read in the release notes for this flaky firmware. It appears to me that no testing was done on this firmware on Windows Server 2016. I may be totally off base here, and I am just going off the release notes published on the www.mellanox.com website for this firmware.

Have a look at the screen shot below from the release notes:


Ok, that is all that I have for you for now on this issue. I will try to report back with more once Mellanox comes back to me and/or the community.

At this time it is still unconfirmed whether this is an actual bug from Mellanox… it sure feels that way to me.

Thanks and happy learning,

Dave