Hey Checkyourlogs fans,

Today I was working with a friend who is running a seven-node Storage Spaces Direct cluster on Lenovo hardware. Everything had been running great over the past six months until a few of his drives started to report a “Lost Communication” warning. Interestingly enough, the cluster itself was reporting healthy, and all the associated Virtual Disks and Volumes were also reporting healthy.

The reason for this is that Storage Spaces Direct had already removed them from the pool. They were showing up as missing in Failover Cluster Manager and in a Lost Communication status when running:

Get-PhysicalDisk

 

Here is the output that we had when we first looked at things. There are more than 90 disks in this cluster, so the output is a bit long.


The next thing to check was the Storage Health Actions:

Get-StorageSubSystem clu* | Get-StorageHealthAction

 


I wanted to check which hotfix level the customer was on, which turned out to be July 2018.

wmic qfe

 


I then tried to see if it was an issue with the disks being stuck in Storage Maintenance Mode, which is a known bug right now if you have installed the June or a later cumulative update.

Get-StorageFaultDomain -Type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "S2D01"} | Disable-StorageMaintenanceMode

 


Everything seemed OK, but when I checked with Get-PhysicalDisk, the drives were still stuck in a “Lost Communication” state.


Before trying to Retire and Reset the disks, I wanted to check the status of the Virtual Disks:

Get-VirtualDisk

 


I also wanted to run a Storage Health Report:

Get-StorageSubsystem clu* | Get-StorageHealthReport

 


When dealing with a lot of drives in a Storage Spaces Direct Cluster, I find it handy to get a count of the drives that are working and the drives that are broken.

Get-PhysicalDisk | Where-Object operationalstatus -eq ok | Measure-Object
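If you would rather see every status at once instead of counting one status at a time, a grouped count can help. This is only a minimal sketch and was not part of the original troubleshooting session. Get-DiskStatusSummary is a hypothetical helper name, and it assumes nothing beyond the OperationalStatus property that Get-PhysicalDisk objects expose.

```powershell
# Hypothetical helper (not part of the Storage module): counts disks per OperationalStatus.
function Get-DiskStatusSummary {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [object[]]$Disk
    )
    begin   { $all = @() }          # collect everything that arrives on the pipeline
    process { $all += $Disk }
    end {
        $all |
            Group-Object -Property OperationalStatus |
            Sort-Object -Property Count -Descending |
            Select-Object -Property Name, Count
    }
}

# Usage on a cluster node (commented out so the sketch loads anywhere):
# Get-PhysicalDisk | Get-DiskStatusSummary
```

The result is one row per status, with the most common status first, so a handful of Lost Communication disks stands out quickly in a 90+ disk listing.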

 


I had a total of 7 nodes, each with 4 cache drives and 10 HDDs: 7 x (4 + 10) = 98 drives.

NOTE: When you run the above command, remember that it also counts the boot drive from the node you are running it on. In my case, there was only 1 boot drive, so the total was 98 + 1 = 99. If you have additional drives, you have to account for them as well. I had a total of 5 drives that were in a state of Lost Communication, so that left me with the following calculation: 99 drives total - 1 boot drive - 5 failed drives = 93. After I fixed this issue, my count of real disks should be 98.
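The same arithmetic can be written down as a small sketch so it is easy to adapt to a different cluster. The variable names are my own for illustration, and the values are the ones from this cluster.

```powershell
# Values from this cluster; change them to match your own hardware.
$nodes          = 7
$cachePerNode   = 4
$hddPerNode     = 10
$bootDrivesSeen = 1   # boot drive(s) visible on the node where the command runs
$lostDisks      = 5   # disks currently in Lost Communication

$poolDisks          = $nodes * ($cachePerNode + $hddPerNode)    # disks that belong to the pool
$totalDisksSeen     = $poolDisks + $bootDrivesSeen              # pool disks plus local boot drive
$healthyPoolDisks   = $totalDisksSeen - $bootDrivesSeen - $lostDisks

"Pool disks expected : $poolDisks"
"Total disks seen    : $totalDisksSeen"
"Healthy pool disks  : $healthyPoolDisks"
```

With the numbers above, this gives 98 pool disks, 99 disks seen in total, and 93 healthy pool disks while the 5 disks are still broken.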

The process that I followed to fix this issue was to Retire the Physical Disks, Remove them from the Pool, and then Reset them. At this point, the Storage Pool would automatically add them back in if they were indeed healthy.

First, create an array of the disks that are failing or failed.

$disk = Get-PhysicalDisk | Where-Object OperationalStatus -like "*Lost*"
$disk

 


Next, retire the disks.

$disk | Set-PhysicalDisk -Usage Retired
Get-PhysicalDisk

 


Next, remove the disks from the Storage Pool.

Get-StoragePool s2d* | Remove-PhysicalDisk -PhysicalDisks $disk
Get-PhysicalDisk

 


Once completed, the drives will change from a Lost Communication state to Unrecognized Metadata, and the health status will change to Unhealthy. This is because the disks have stale metadata sitting in the first 1GB of the disk. To add them back into the pool, we have to Reset the disks.

Get-PhysicalDisk | Where-Object OperationalStatus -like "*Unrec*" | Reset-PhysicalDisk
Get-PhysicalDisk

 


Next, I wanted to check the Storage Health Actions and Storage Health Reports to see if anything had changed.

Get-StorageSubSystem clu* | Get-StorageHealthAction

 


As you can see, the drives were getting added back into the pool successfully, and the unresponsive state moved from failed to canceled.

Get-StorageSubSystem clu* | Get-StorageHealthReport

 


After doing this, the disks all showed up as healthy. We also had a chance to get Windows Admin Center up and running for the client, and this is the view once everything was fixed up with the disks.


You can see all 98 disks are back to a healthy state.

I hope this helps you out if you encounter this issue.

Thanks,


Dave