The Case of the Physical Disk “Lost Communication” – #StorageSpacesDirect #HCI

Hey Checkyourlogs fans,

Today I was working with a friend who is running a seven-node Storage Spaces Direct cluster on Lenovo hardware. Everything had been running great for the past six months until a few of his drives started to report a “Lost Communication” warning. Interestingly enough, the cluster itself was reporting healthy, and all the associated Virtual Disks and Volumes were also reporting healthy.

The reason for this is that Storage Spaces Direct had already removed the drives from the pool. They were showing up as missing in Failover Cluster Manager and in a Lost Communication status when running:

Get-PhysicalDisk

Here is the output that we had when we first looked at things. There are 90+ disks in this cluster, so the output is a bit long.
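With this many disks, a per-status tally can be easier to scan than the full listing. The following is a minimal sketch, assuming it runs on a cluster node where the Storage module is available; it only reads state and changes nothing.

```powershell
# Tally physical disks by operational status instead of listing every disk.
# Each output row shows a status (for example OK or Lost Communication)
# and how many disks currently report it.
Get-PhysicalDisk |
    Group-Object -Property OperationalStatus -NoElement |
    Sort-Object -Property Count -Descending
```

Any row other than OK is a quick pointer to the disks worth a closer look.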

The next thing to check was the Storage Health Actions:

Get-StorageSubSystem clu* | Get-StorageHealthAction

I wanted to check which hotfix level the customer was on, which turned out to be July 2018.

wmic qfe
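If you prefer objects over the text output of wmic, Get-HotFix reads the same Win32_QuickFixEngineering data. The sketch below is one possible way to see the latest updates on every node; it assumes the FailoverClusters module is loaded and that the nodes accept remote WMI queries, and the choice of three updates per node is arbitrary.

```powershell
# Show the three most recently installed updates on each cluster node.
# Assumption: run from a cluster node with remote WMI access to the others.
foreach ($node in Get-ClusterNode) {
    Get-HotFix -ComputerName $node.Name |
        Sort-Object -Property InstalledOn -Descending |
        Select-Object -First 3 -Property CSName, HotFixID, Description, InstalledOn
}
```

Comparing the newest HotFixID across nodes is a quick way to spot a node that missed a cumulative update.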

I then checked whether the disks were stuck in Storage Maintenance Mode, which is a known bug right now if you have installed the June or a later cumulative update.

Get-StorageFaultDomain -Type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "S2D01"} | Disable-StorageMaintenanceMode
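Before or after disabling maintenance mode, it can help to list any disks that still report it. This is a hedged sketch: it assumes the operational status text contains the word "Maintenance", which is how the state is normally displayed, so treat the filter string as an assumption to verify on your own system.

```powershell
# List disks whose operational status mentions maintenance mode.
# Assumption: the status is displayed as text containing "Maintenance".
Get-PhysicalDisk |
    Where-Object OperationalStatus -like '*Maintenance*' |
    Select-Object -Property FriendlyName, SerialNumber, OperationalStatus, HealthStatus
```

An empty result suggests maintenance mode is not what is holding the disks back.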

Everything seemed OK, but when I checked with Get-PhysicalDisk, the drives were still stuck in a “Lost Communication” state.

Before trying to Retire and Reset the disks, I wanted to check the status of the Virtual Disks:

Get-VirtualDisk
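On a cluster with many volumes, filtering for anything that is not healthy makes this check quicker to read. A minimal sketch, assuming the same Storage module session; the message text is just illustrative.

```powershell
# Report only the virtual disks that are not Healthy, if there are any.
$unhealthy = Get-VirtualDisk | Where-Object HealthStatus -ne 'Healthy'

if ($unhealthy) {
    $unhealthy | Format-Table -Property FriendlyName, OperationalStatus, HealthStatus
} else {
    'All virtual disks report Healthy.'
}
```

If anything shows up here, it is probably worth resolving that first before retiring more disks.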

I also wanted to run a Storage Health Report:

Get-StorageSubsystem clu* | Get-StorageHealthReport

When dealing with a lot of drives in a Storage Spaces Direct Cluster, I find it handy to get a count of the drives that are working and the drives that are broken.

Get-PhysicalDisk | Where-Object OperationalStatus -eq OK | Measure-Object

I had a total of 7 nodes, each with 4 cache drives and 10 HDDs: 7 x (4 + 10) = 98 drives.

NOTE: The above command also counts the boot drive of the node you run it on. In my case, there was only 1 boot drive, so the total was 98 + 1 = 99. If you have additional non-pool drives, you have to account for them too. I had 5 drives in a Lost Communication state, which left me with the following calculation: 99 drives total - 1 boot drive - 5 failed drives = 93. After I fixed this issue, my count of real disks should be back to 98.
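The drive arithmetic above can be written out as a small script so it is easy to adapt. This is only a sketch using the numbers from this cluster; the variable names are made up for illustration, and you would replace the values with your own layout.

```powershell
# Layout described in this post; adjust the values for your own cluster.
$nodes        = 7
$cachePerNode = 4
$hddPerNode   = 10
$bootDrives   = 1   # boot drive visible on the node running Get-PhysicalDisk
$lostDrives   = 5   # drives reporting Lost Communication

$poolDrives   = $nodes * ($cachePerNode + $hddPerNode)     # 98
$totalVisible = $poolDrives + $bootDrives                  # 99
$healthyPool  = $totalVisible - $bootDrives - $lostDrives  # 93

"Pool drives: $poolDrives, visible: $totalVisible, healthy pool drives: $healthyPool"
```

Once the repair is finished, the healthy pool drive count should match $poolDrives again.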

The process that I followed to fix this issue was to Retire the Physical Disks, Remove them from the Pool, and then Reset them. At this point, the Storage Pool would automatically add them back in if they were indeed healthy.

First, create an array of the disks that are failing or failed.

$disk = Get-PhysicalDisk | Where-Object OperationalStatus -like *lost*
$disk

Next, retire the disks.

$disk | Set-PhysicalDisk -Usage Retired
Get-PhysicalDisk

Next, remove the disks from the Storage Pool.

Get-StoragePool s2d* | Remove-PhysicalDisk -PhysicalDisks $disk
Get-PhysicalDisk

Once completed, the drives change from a Lost Communication operational status to Unrecognized Metadata, and the health status changes to Unhealthy. This is because the disks still have stale metadata sitting in the first 1 GB of the disk. To add them back into the pool, we have to reset the disks.

Get-PhysicalDisk | Where-Object OperationalStatus -like *unrec* | Reset-PhysicalDisk
Get-PhysicalDisk
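After the reset, a quick read-only check can confirm whether any disk is still stuck and whether repair jobs are running. This is a rough sketch, assuming the two problem states keep the names seen earlier in this post.

```powershell
# Look for disks that are still in one of the two problem states.
$stuck = Get-PhysicalDisk |
    Where-Object { $_.OperationalStatus -match 'Lost|Unrecognized' }

if ($stuck) {
    $stuck | Format-Table -Property FriendlyName, SerialNumber, OperationalStatus, HealthStatus
} else {
    'No disks report Lost Communication or Unrecognized Metadata.'
}

# Show any storage jobs (for example repair or rebalance) still in progress.
Get-StorageJob
```

It can take a little while for the pool to pick the disks back up, so running this more than once is reasonable.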

Next, I wanted to check the Storage Health Actions and Storage Health Reports to see if anything had changed.

Get-StorageSubSystem clu* | Get-StorageHealthAction

As you can see, the drives were getting added back into the pool successfully, and the unresponsive state moved from failed to canceled.

Get-StorageSubSystem clu* | Get-StorageHealthReport

After doing this, the disks all showed up as healthy. We also had a chance to get Windows Admin Center up and running for the client, and this is the view once everything was all fixed up with the disks.

You can see all 98 disks are back to a healthy state.

I hope this helps you out if you encounter this issue.

Thanks,

Dave

3 Comments

Rob on January 7, 2020 at 4:47 pm

thanks 😉

Rogier Koek on October 2, 2023 at 9:32 am

Removed the disks from the pool but they did not show up to be re-added. Are they lost forever?

Kawula Dave on October 10, 2023 at 4:30 pm

You might have to run Get-PhysicalDisk and go through the manual process of re-adding them in.