Hey Checkyourlogs Fans,

 

I know a lot of you have been reaching out to me asking why you are getting Event 5120 errors with a status code of STATUS_IO_TIMEOUT or STATUS_CONNECTION_DISCONNECTED when a node is rebooted.

It appears that in the May cumulative update Microsoft introduced a new feature, SMB Resilient Handles, for the Storage Spaces Direct intra-cluster network to improve resiliency to transient network failures. A side effect is that timeouts increase when a node is rebooted, which can affect a system under stress.

Until Microsoft releases a fix, here is a workaround that addresses the issue: put the node into Storage Maintenance Mode before rebooting it on a Storage Spaces Direct cluster.

Here is an example:

First drain the node, then enable Storage Maintenance Mode, then reboot.

Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Enable-StorageMaintenanceMode 

Once the node is back online disable Storage Maintenance Mode.

Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Disable-StorageMaintenanceMode 
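
If you want the whole sequence in one place, here is a rough sketch of how the steps above could be strung together. Treat it as an illustration, not a tested runbook. It assumes you run it from another node in the same cluster (not the one being rebooted), that the FailoverClusters and Storage modules are available, and that PowerShell remoting to the node works. The $NodeName variable, the Suspend-ClusterNode / Resume-ClusterNode calls for the drain, the Restart-Computer call, and the final Get-StorageJob check are my additions and are not part of the original workaround commands.

```powershell
# Sketch only. Assumption: run from ANOTHER node in the same cluster.
$NodeName = "<NodeName>"   # placeholder, replace with the node you are rebooting

# 1. Drain roles off the node and wait for the drain to complete.
Suspend-ClusterNode -Name $NodeName -Drain -Wait

# 2. Put the node's drives into Storage Maintenance Mode.
Get-StorageFaultDomain -Type StorageScaleUnit |
    Where-Object { $_.FriendlyName -eq $NodeName } |
    Enable-StorageMaintenanceMode

# 3. Reboot the node and block until PowerShell remoting responds again.
Restart-Computer -ComputerName $NodeName -Wait -For PowerShell -Force

# 4. Once the node is back, take its drives out of Storage Maintenance Mode.
Get-StorageFaultDomain -Type StorageScaleUnit |
    Where-Object { $_.FriendlyName -eq $NodeName } |
    Disable-StorageMaintenanceMode

# 5. Resume the node so it can host roles again.
Resume-ClusterNode -Name $NodeName

# 6. List storage jobs so you can see any repair/resync still running.
Get-StorageJob
```

It is a good idea to let the storage jobs from step 6 finish before you move on to the next node.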

 

 

I really hope this helps you resolve some of your issues.

Thanks,

 

Dave