ZFS hangs after removing physical drive from zpool

Does MD immediately detect the removed drive?

Also, any firmware updates available for your HBA? OEM drivers?

Ah I see. ‘eject’ did not work: “non hot-pluggable device”.

But ‘echo 1 > /sys/block/sdg/device/delete’ worked fine and did a clean eject that the zpool could handle. Even when performing a ‘zpool scrub’, zfs did not hang and worked as expected.
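For anyone wanting to repeat the soft removal, here is a minimal sketch of the sysfs write wrapped in a small function. The function name `soft_remove_disk` and the `SYSFS_ROOT` override are made up for illustration (the override only exists so the function can be tried against a scratch directory); on a real box it needs root, and `sdg` is just the device from my case.

```shell
#!/bin/sh
# Sketch: ask the kernel to detach a SCSI disk cleanly before pulling it.
# SYSFS_ROOT is a hypothetical override for dry testing; it defaults to /sys.
soft_remove_disk() {
    dev="$1"
    root="${SYSFS_ROOT:-/sys}"
    node="$root/block/$dev/device/delete"

    # Refuse to continue if the device has no delete node (wrong name, not SCSI).
    if [ ! -e "$node" ]; then
        echo "no delete node for $dev at $node" >&2
        return 1
    fi

    # Writing 1 tells the SCSI layer to remove the device.
    echo 1 > "$node"
}
```

Usage would be `soft_remove_disk sdg`, followed by `zpool status` to confirm the pool noticed the missing drive.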

So another sign pointing to the raid controller and driver not properly communicating a hard remove of the drive.

MD behaved like ZFS, as in only noticing the drive was missing when the next io was issued, and not right away when the drive was removed.

So I went onto the firmware update odyssey of HPE servers.

There was actually an available update for the P440ar, from 4.42 to 6.06. Release notes gave no hint about something like hardware failure communication. Performed the update anyways, behavior did not change. ZFS still hangs in the described scenario.

When browsing the abomination called “Hewlett Packard Enterprise Support Center” I also noticed a “CRITICAL” firmware update for the drives we use in the zpool.
Release notes again kinda worthless, but might as well try it.

Though, when in HBA mode, firmware updates to drives can only be performed in “offline” mode, meaning having to boot the “Service Pack for ProLiant” ISO and performing the update from there.

Well, we did that. Instructions again kinda meh, but finally got it working after a few hours. Though, the SPP also updated the firmware of the UEFI, NICs and iLo and some other crap.

Guess what, now the ZFS hot removal works as expected, no hanging zfs module, even when running ‘zpool scrub’ directly after!!! :open_mouth:

Sadly, now I can't pinpoint exactly which of the firmware updates actually fixed it: UEFI, drives, others or any combination of those. In retrospect it would have been possible to do the firmware updates one by one. Luckily, we are going to install a second system with identical hardware configuration later on, so I will have a chance at figuring it out completely.

Still super weird that FreeNAS could seemingly handle the misbehaving firmware in the driver, whereas CentOS could not.

Oh, almost forgot, now the fault indicator LED on 2 of the drives is permanently lit amber, meaning “complete drive failure”. The drives work fine, SMART says they are fine, the P440ar says they are fine, the iLO says they are fine, ZFS says they are fine, just the drives themselves think they are dead.

GG HP, going to open a support case now. :poop:

EDIT: second installation with identical hardware never had the hanging issue. Why, I have no idea.

Exporting and importing the zpool cleared the drives’ fault LED. Apparently zfs can directly talk to the drives’ controller and convince it that it is dead.
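In case someone hits the same stuck LED, this is roughly what the export/import cycle looks like as a sketch. The function name `reimport_pool` and the `DRY_RUN` switch are my own additions for illustration, and the pool name is whatever yours is called; make sure nothing is using the pool before exporting.

```shell
#!/bin/sh
# Sketch: export a pool and import it again.
# DRY_RUN=1 (hypothetical switch) only prints the commands instead of running them.
reimport_pool() {
    pool="$1"
    if [ -z "$pool" ]; then
        echo "usage: reimport_pool <pool>" >&2
        return 2
    fi

    # Helper: print the command in dry-run mode, otherwise execute it.
    run() {
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$*"
        else
            "$@"
        fi
    }

    # Only import if the export succeeded.
    run zpool export "$pool" && run zpool import "$pool"
}
```

Running it with `DRY_RUN=1` first shows what would be executed without touching the pool.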

tl;dr when in doubt, always update all the firmwares.


:clap:
