I recently came across a post on the Dell support forum describing a blue screen on Dell R730xd servers with Mellanox ConnectX-3 Pro (CX3-Pro) adapters. It is a really interesting read, and here are some of the highlights of the case. You can check it out yourself here: http://en.community.dell.com/support-forums/servers/f/956/t/20010905

Basically, user M.Olsson was deploying a small three-node Storage Spaces Direct cluster on brand new Dell PowerEdge R730xd servers. He ordered the Mellanox ConnectX-3 Pro NICs directly from Dell. This caught my eye because I have the exact same configuration coming up in a few weeks.

Here is where the train went off the tracks. He built his cluster, enabled Storage Spaces Direct, and started stress testing with VMFleet. Seems straightforward, right? In my opinion that is pretty much a standard Storage Spaces Direct deployment. As soon as one of the nodes was rebooted, he got the following blue screen:


The MEMORY_MANAGEMENT blue screen would just keep looping on that node. Not a good scenario, especially on a brand new build on hardware that was purchased in April of 2017.

Then, about 3 days later, on April 24, 2017, a moderator on the forum posted that the issue could be related to the IO Non Posted Prefetching setting in the UEFI BIOS. He also posted a link to a thread on the Mellanox support forum that was much more helpful than the post on the Dell support site. You can check that one out here: https://community.mellanox.com/thread/3593

Apparently, this issue goes all the way back to January of 2017 or earlier, when user t3chyphil posted the following on the Mellanox support forum:

I have copied some of the post here for completeness of this blog post.

Storage Spaces Direct Windows Server 2016 (1607) BSOD – Mellanox ConnectX-3 Pro (Dell)

Good afternoon,

There is very little documentation specific to Windows Server 2016; much of the RDMA/RoCE documentation refers to Windows Server 2012 (R2) Storage Spaces. So I figured I’d start a conversation in here to help others also looking at Microsoft Storage Spaces Direct (S2D) in Windows Server 2016.

I currently have an open case with Dell ProSupport regarding a BSOD my 2 Node cluster encounters. Either node will just halt and restart after 60 seconds when stress testing the environment. Each server is configured as follows…

  • Dell 13th Gen R730XD
  • 2x 120GB Intel SSDs SSDSC2BB120G6R (OS Mirror)
  • 6x 1.6TB SSDs SSDSC2BX016T4R
  • 6x 8TB HDDs ST8000NM0055-1RM112
  • 2x Intel DC P3700 800GB (Journal / Cache)
  • 256GB 2400Mhz Memory
  • HBA330 Mini Controller
  • 1x Mellanox ConnectX-3 Pro (MT04103) Dual Port SFP+ 10GbE (Firmware Version: 2.26.50.80 / Driver Version: 2.25.12665.0)
  • Running Windows Server 2016 DataCenter 1607 Build 14393.693

    Each server has two links to a Dell N4032F Switch.

    To rule out a possible fault with my switch config, Dell advised I directly connect the two nodes together. RDMA is engaged because I can see the traffic using performance monitor.

    Here’s the order in which I’ve setup my environment…

  1. Install the OS and fully update/patch
  2. Set Windows Power Mode to Performance
  3. Install Windows Features – Hyper-V / File-Services / Failover-Clustering / Data-Center-Bridging
  4. Install Dell drivers for all hardware including the Mellanox nics. (I’ve tried both the Mellanox drivers and Dell’s. They appear to be the same. MLNX_VPI_WinOF-5_25_All_Win2016_x64 / Driver Version: 2.25.12665.0)
  5. I perform the network configuration. Essentially create a Hyper-V SET Switch joined to both ports of the Mellanox nic. I then create two vNics connected to the new Switch with a VLAN tag. (See attached file)
  6. I then create the Failover-Cluster and enable Storage Spaces Direct (See attached file)

    Everything appears to be okay then it’ll randomly crash. Below is a memory dump. This is what I receive on either host. I want to upgrade the firmware but it’s a Dell product code so I’m stuck. It’s been three weeks and we still don’t have a working environment. I also have another debug output further below…

    *******************************************************************************

    *                                                                             *

    *                        Bugcheck Analysis                                    *

    *                                                                             *

    *******************************************************************************

    DRIVER_POWER_STATE_FAILURE (9f)

    A driver has failed to complete a power IRP within a specific time.

    Arguments:

    Arg1: 0000000000000003, A device object has been blocking an Irp for too long a time

    Arg2: ffffa48778febe20, Physical Device Object of the stack

    Arg3: ffffc080258f4960, nt!TRIAGE_9F_POWER on Win7 and higher, otherwise the Functional Device Object of the stack

    Arg4: ffff9c8fe2328010, The blocked IRP

    Debugging Details:

    ------------------

    Implicit thread is now ffff9c8f`e23a8080

    DUMP_CLASS: 1

    DUMP_QUALIFIER: 401

    BUILD_VERSION_STRING:  14393.693.amd64fre.rs1_release.161220-1747

    SYSTEM_MANUFACTURER:  Dell Inc.

    SYSTEM_PRODUCT_NAME:  PowerEdge R730xd

    SYSTEM_SKU:  SKU=NotProvided;ModelName=PowerEdge R730xd

    BIOS_VENDOR:  Dell Inc.

    BIOS_VERSION:  2.3.4

    BIOS_DATE:  11/08/2016

    BASEBOARD_MANUFACTURER:  Dell Inc.

    BASEBOARD_PRODUCT:  0WCJNT

    BASEBOARD_VERSION:  A04

    DUMP_TYPE:  1

    BUGCHECK_P1: 3

    BUGCHECK_P2: ffffa48778febe20

    BUGCHECK_P3: ffffc080258f4960

    BUGCHECK_P4: ffff9c8fe2328010

    DRVPOWERSTATE_SUBCODE:  3

    FAULTING_THREAD:  e23a8080

    CPU_COUNT: 38

    CPU_MHZ: 960

    CPU_VENDOR:  GenuineIntel

    CPU_FAMILY: 6

    CPU_MODEL: 4f

    CPU_STEPPING: 1

    CPU_MICROCODE: 6,4f,1,0 (F,M,S,R)  SIG: B00001E'00000000 (cache) B00001E'00000000 (init)

    DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT

    BUGCHECK_STR:  0x9F

    PROCESS_NAME:  System

    CURRENT_IRQL:  2

    ANALYSIS_SESSION_HOST:  PHALFORDPC

    ANALYSIS_SESSION_TIME:  01-26-2017 10:07:27.0372

    ANALYSIS_VERSION: 10.0.14321.1024 amd64fre

    LAST_CONTROL_TRANSFER:  from fffff800d1ce5f5c to fffff800d1dcf506

    STACK_TEXT:

    ffffc080`2afcd6a0 fffff800`d1ce5f5c : 00000000`00000000 00000000`00000001 ffffa487`79d23801 fffff800`d1d47359 : nt!KiSwapContext+0x76

    ffffc080`2afcd7e0 fffff800`d1ce59ff : ffffa487`70040100 00000000`00000000 00000000`00000000 fffff800`00000000 : nt!KiSwapThread+0x17c

    ffffc080`2afcd890 fffff800`d1ce77c7 : ffffc080`00000000 fffff80d`41a33a01 ffffa487`70040130 00000000`00000000 : nt!KiCommitThreadWait+0x14f

    ffffc080`2afcd930 fffff80d`41a0aaba : ffffa487`790a6c90 ffffa487`00000000 fffff80d`41a44000 ffffa487`00000000 : nt!KeWaitForSingleObject+0x377

    ffffc080`2afcd9e0 fffff80d`3b05debf : 00000000`00000000 00000000`00000006 ffffa487`78fd3980 fffff80d`3b428bf9 : mlx4eth63+0x4aaba

    ffffc080`2afcda30 fffff80d`3b0f6f80 : ffffa487`71c971a0 00000000`00000000 ffff9c8f`e2328010 00000000`00000000 : NDIS!ndisMInvokeShutdown+0x53

    ffffc080`2afcda60 fffff80d`3b0b910a : ffffa487`71c971a0 00000000`00000000 0000007f`fffffff8 ffff9c8e`c5249bb0 : NDIS!ndisMShutdownMiniport+0xb4

    ffffc080`2afcda90 fffff80d`3b09d342 : 00000000`00000000 00000000`00000000 ffff9c8f`e2328010 ffffa487`71c971a0 : NDIS!ndisSetSystemPower+0x1bdc6

    ffffc080`2afcdb10 fffff80d`3b01fc28 : ffff9c8f`e2328010 ffffa487`78febe20 ffff9c8f`e2328200 ffffa487`71c97050 : NDIS!ndisSetPower+0x96

    ffffc080`2afcdb40 fffff800`d1d9a1c2 : ffff9c8f`e23a8080 ffffc080`2afcdbf0 fffff800`d1f80600 ffffa487`71c97050 : NDIS!ndisPowerDispatch+0xa8

    ffffc080`2afcdb70 fffff800`d1c82729 : ffffffff`fa0a1f00 fffff800`d1d99fe4 ffff9c8e`c9cb8120 00000000`000001d1 : nt!PopIrpWorker+0x1de

    ffffc080`2afcdc10 fffff800`d1dcfbb6 : ffffc080`25955180 ffff9c8f`e23a8080 fffff800`d1c826e8 00000000`00000000 : nt!PspSystemThreadStartup+0x41

    ffffc080`2afcdc60 00000000`00000000 : ffffc080`2afce000 ffffc080`2afc8000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16

    STACK_COMMAND:  .thread 0xffff9c8fe23a8080 ; kb

    THREAD_SHA1_HASH_MOD_FUNC:  b7cf6cc0234897f6fd93ad4ead1f75c9e7fd9df1

    THREAD_SHA1_HASH_MOD_FUNC_OFFSET:  263f1d39481efd9f34c4df5786cc37534825cc6e

    THREAD_SHA1_HASH_MOD:  1de60aba82b9f9b6af56a445a099815cd801e5d9

    FOLLOWUP_IP:

    mlx4eth63+4aaba

    fffff80d`41a0aaba 488d152f050300  lea     rdx,[mlx4eth63+0x7aff0 (fffff80d`41a3aff0)]

    FAULT_INSTR_CODE:  2f158d48

    SYMBOL_STACK_INDEX:  4

    SYMBOL_NAME:  mlx4eth63+4aaba

    FOLLOWUP_NAME:  MachineOwner

    MODULE_NAME: mlx4eth63

    IMAGE_NAME:  mlx4eth63.sys

    DEBUG_FLR_IMAGE_TIMESTAMP:  57c2dc3b

    BUCKET_ID_FUNC_OFFSET:  4aaba

    FAILURE_BUCKET_ID:  0x9F_3_POWER_DOWN_mlx4eth63!unknown_function

    BUCKET_ID:  0x9F_3_POWER_DOWN_mlx4eth63!unknown_function

    PRIMARY_PROBLEM_CLASS:  0x9F_3_POWER_DOWN_mlx4eth63!unknown_function

    TARGET_TIME:  2017-01-26T09:54:25.000Z

    OSBUILD:  14393

    OSSERVICEPACK:  0

    SERVICEPACK_NUMBER: 0

    OS_REVISION: 0

    SUITE_MASK:  400

    PRODUCT_TYPE:  3

    OSPLATFORM_TYPE:  x64

    OSNAME:  Windows 10

    OSEDITION:  Windows 10 Server TerminalServer DataCenter SingleUserTS

    OS_LOCALE:

    USER_LCID:  0

    OSBUILD_TIMESTAMP:  2016-12-21 06:50:57

    BUILDDATESTAMP_STR:  161220-1747

    BUILDLAB_STR:  rs1_release

    BUILDOSVER_STR:  10.0.14393.693.amd64fre.rs1_release.161220-1747

    ANALYSIS_SESSION_ELAPSED_TIME: 6ba

    ANALYSIS_SOURCE:  KM

    FAILURE_ID_HASH_STRING:  km:0x9f_3_power_down_mlx4eth63!unknown_function

    FAILURE_ID_HASH:  {476104f0-13a3-bd96-8e08-ff1f10ccd888}

    Followup:     MachineOwner
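When triaging a dump like the one above, the first things worth pulling out are the bugcheck code, the faulting module, and the failure bucket. Here is a minimal Python sketch of that, assuming the `!analyze -v` output has been saved as plain text. The function name `summarize_dump` and the choice of fields are my own for illustration and are not part of any debugger tooling; the sample values are copied from the dump above.

```python
import re

# Fields of interest in saved `!analyze -v` text (an illustrative selection).
FIELDS = ("BUGCHECK_STR", "MODULE_NAME", "IMAGE_NAME",
          "FAILURE_BUCKET_ID", "BIOS_VERSION")


def summarize_dump(text):
    """Return a dict of the selected 'KEY: value' fields found in the text."""
    summary = {}
    for line in text.splitlines():
        # Match lines of the form 'UPPER_CASE_KEY:  value', ignoring indentation.
        match = re.match(r"\s*([A-Z0-9_]+):\s*(.*\S)\s*$", line)
        if match and match.group(1) in FIELDS:
            # Keep only the first occurrence of each field.
            summary.setdefault(match.group(1), match.group(2))
    return summary


# A few lines copied from the dump in this post.
SAMPLE = """
    BIOS_VERSION:  2.3.4
    BUGCHECK_STR:  0x9F
    MODULE_NAME: mlx4eth63
    IMAGE_NAME:  mlx4eth63.sys
    FAILURE_BUCKET_ID:  0x9F_3_POWER_DOWN_mlx4eth63!unknown_function
"""

if __name__ == "__main__":
    for key, value in summarize_dump(SAMPLE).items():
        print(f"{key}: {value}")
```

In this case the summary points straight at mlx4eth63.sys, the Mellanox Ethernet driver, failing to complete a power IRP during shutdown.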

    On the forum, users have been fighting with Dell ProSupport and Mellanox Technical Support to figure out the issue. Remember that a vendor like Dell can OEM cards from a vendor like Mellanox. The catch is that Dell will typically only support the cards it sells as OEM parts if you use Dell’s drivers. I have run into this exact issue with Dell in the past with Intel NICs, where Intel released the latest driver with lots of fixes and Dell would not support it until they had QA’d it. I understand why Dell has to do this from their perspective, and I feel it is important for my readers to understand it too.

    Dell can’t support the latest revision of every driver for every manufacturer instantly. Each one needs to go through a formal QA process and be certified to work on the Dell platform. This is not unique to Dell, as pretty much every manufacturer that OEMs hardware is in the same boat.

    Folks, this happens, and it is what it is. So you have two choices as a Dell customer:

    1. If you really want the latest Mellanox drivers and are willing to get support from Mellanox rather than rely on Dell ProSupport, purchase the cards directly from Mellanox. In this scenario Mellanox will be more than happy to help you.
    2. If you purchase the cards from Dell, you get one uniform support provider, but not the latest and greatest driver versions. The drivers you do get will be fully supported and certified, though.

    Now my dilemma is that my customer has purchased all of their equipment from Dell. This means that I will have to live with whatever driver versions Dell has certified (option 2 above). Thankfully, the same poster who had the issue in the first place found a solution. I have copied his fix below:

    I managed to fix the issue. I had a support case open with Dell ProSupport for about 3 weeks. They too had issues trying to replicate the fault. I suggested the firmware was out of sync with the drivers they’d released. Anyway, they said try BIOS settings. I then spent the next 3 weeks reinstalling windows over and over because it would corrupt the install of Windows on occasion because of the BSOD’s.

    In the end I was able to resolve the issue. There’s a BIOS setting IO Non Posted Prefetching. This was enabled by default on delivery of the servers. I disabled this setting and was able to run VMFleet for a few days hammering the system with no crashes. I fed this info back to Dell who then closed the case. They did acknowledge the firmware is a problem but said they can’t do anything about it other than raise a case for it to be updated. We just have to wait.

    I think I’d buy Mellanox cards directly from Mellanox in future. I can’t see a way of upgrading the firmware as the firmware tools don’t recognise the cards at all. There’s no way to discover them because Dell have changed the identifiers the MFT’s look for. Mellanox was very unhelpful as I tried to raise a case with them, only to be told I don’t have support. Pretty annoyed at the time. Dell won’t give me a time or date for firmware or even if it’s on the cards. Mellanox did not want to know unless I paid more. Anyway, I hope this helps others.

    May I add. The servers have been running fine for about a month and now we’re experiencing similar crashes again (not as often). This time Microsoft have a case open as I believe the mellanox side of things are sorted. Who knows, Microsoft might turn around and say there’s a firmware + driver mismatch on the Mellanox cards. It’s been a nightmare.

    Anyway, I hope that BIOS setting helps others.

    It looks like the initial issue was resolved, but they were still having what appeared to be driver issues. For my own build, I will make sure to fully patch my Windows hosts (Dell R730xd servers), apply the latest firmware, and find out whether Dell has released newer drivers. I have also opened a support case with Mellanox, as the forum advised contacting them directly, to see if there has been any movement on this issue. I will update this blog post once I hear back from Mellanox Technical Support.
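One practical note for checking whether a newer driver package is available: dotted version strings such as 2.25.12665.0 need to be compared number by number, not as text, or "2.100" will sort before "2.25". A minimal Python sketch follows. The helper names `parse_version` and `is_newer` are hypothetical, and the candidate version is a made-up value for illustration, not a real Dell or Mellanox release.

```python
def parse_version(version):
    """Split a dotted version string such as '2.25.12665.0' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def is_newer(candidate, installed):
    """Return True if candidate is a strictly higher version than installed."""
    return parse_version(candidate) > parse_version(installed)


INSTALLED_DRIVER = "2.25.12665.0"   # driver version quoted in the forum post
EXAMPLE_CANDIDATE = "2.25.13000.0"  # made-up value, for illustration only

if __name__ == "__main__":
    if is_newer(EXAMPLE_CANDIDATE, INSTALLED_DRIVER):
        print(f"{EXAMPLE_CANDIDATE} is newer than {INSTALLED_DRIVER}")
    else:
        print(f"{EXAMPLE_CANDIDATE} is not newer than {INSTALLED_DRIVER}")
```

Comparing tuples of integers keeps the check correct when one component gains a digit, which is exactly where a plain string comparison goes wrong.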

    Happy Learning and have a nice weekend.


    Dave
