ZFS hangs after removing physical drive from zpool


Paging @wendell

I’m not sure if this will help, because I haven’t been crazy enough to just pull a drive from a running zpool, but try setting failmode=continue on your pool.
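For reference, setting and verifying the property looks like this (‘tank’ is just a placeholder pool name):

```
# 'continue' returns EIO on new writes to a faulted pool instead of
# suspending all I/O the way the default 'wait' does
zpool set failmode=continue tank

# confirm the property took effect
zpool get failmode tank
```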

Is the drive removed hot or cold?
Is there a UEFI setting for “SATA hotplug enabled”?

1 Like

We tried setting failmode=continue, but have not experienced any difference in behavior.
According to the documentation the default value is ‘wait’, which stops the zpool until a ‘zpool clear’ is performed. ‘continue’ is supposed to put the pool into a read-only mode, which is still kinda meh.
But in our experience with both settings the pool and the kernel module hang, so we would not even be able to issue a ‘zpool clear’ either way. We also tried putting the drive back in, but the zpool keeps hanging with both options until the system is rebooted.

It is a hot remove by just pulling the drive out of the cage, basically simulating a complete drive failure.

Regarding hotplug support, I am not sure. There is no setting in the UEFI. The drives are connected to an HP P440ar in HBA mode, whose documentation only mentions that hotplug-compatible drives can be hotplugged… And yes, the drives are hotplug compatible according to their data sheet.

Before the zpool hangs, we see a clean remove event by the P440ar in dmesg.

Other drives connected to the same controller, but which are members of an md RAID, disconnect and reconnect fine without hanging the array.
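For anyone wanting to reproduce the md side without physically pulling hardware, the software analogue would be something like this (array and device names are just examples):

```
# mark the disk as failed and drop it from the array
mdadm --manage /dev/md0 --fail /dev/sdc --remove /dev/sdc

# later, re-add it and watch the resync progress
mdadm --manage /dev/md0 --add /dev/sdc
cat /proc/mdstat
```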

1 Like

Someone’s going to ask it, so let’s get it out of the way. P440ar is in HBA mode, yes?

Is it possible to test in FreeBSD?

Yes, P440ar is in HBA mode.
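If anyone wants to double-check the controller mode on their own box, something along these lines should show it, assuming HPE’s ssacli tool is installed (the slot number may differ):

```
# show controller details, including whether it runs in HBA or RAID mode
ssacli ctrl slot=0 show detail | grep -i mode
```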

1 Like

Found this: https://github.com/zfsonlinux/zfs/issues/3461

Looks like the hang is intended behavior. Also the failmode option seems to be completely useless, since you will need a reboot to restore function to the zpool anyway.

Other file systems don’t behave like this, right? So wtf?
Are complete drive failures really that uncommon?

Sounds like there’s a way to recover from this without a reboot. If you put the drive back in, does the machine come back? Maybe it needs to sit for a spell to make sure that the data is accurate?

But I totally agree. This is weird behavior. I mean, given the mission of ZFS, Data Integrity Uber Alles, it kinda makes sense. After a fashion. Still really, really weird.

Yes, we tried putting the drive back in. It just keeps hanging. Supposedly this is because of the pool option ‘failmode=wait’ being the default, which halts the pool until a ‘zpool clear’ is issued. But then again we can’t use the ‘zpool’ command, because the whole module is hanging.

We also tried this:

you can set the failmode=continue zpool property to prevent the pool from suspending when the drive is hard removed from a non-redundant configuration. This will result in errors to the applications but it should not hang the system.

It hangs the system, too.

So is this specific to our hardware, or has like hardly anyone ever tested what happens when an online vdev gets pulled from a zpool? :thinking:

In that github thread, the drive that’s pulled actually kills a zvol, and it makes more sense to hang the system in that case so that the admin can try to scrap together whatever data is there (in ARC for instance).

I don’t think it’s supposed to behave that way if a drive is pulled from a redundant array. I tested this on FreeNAS a while back. I was able to crash services by pulling random drives, but not ZFS, and the system wouldn’t hang. I believe the service crashes were due to non-redundant swap partitions on the raidz drives. After I disabled swap, I wasn’t able to crash services by pulling drives. That said, my testing wasn’t exhaustive (I didn’t have time to pull drives all day).

So maybe it’s a bug or compatibility issue with ZFS on Linux? Is it possible to pop in a FreeNAS USB and see if you experience the same issue in FreeBSD?

2 Likes

Just out of curiosity, does the same thing happen when you eject the drive?

The devil is in the details! …in this case in my testing methodology.

When pulling the drive and then issuing a “zpool status”, it looks like this:

 pool: motherducker
 state: ONLINE
config:

    NAME          STATE     READ WRITE CKSUM
    motherducker  ONLINE       0     0     0
      mirror-0    ONLINE       0     0     0
        sdb       ONLINE       0     0     0
        sda       ONLINE       0     0     0
      mirror-1    ONLINE       0     0     0
        sde       ONLINE       0     0     0
        sdf       ONLINE       0     0     0
      mirror-2    ONLINE       0     0     0
        sdg       ONLINE       0     0     0
        sdh       ONLINE       0     0     0

errors: No known data errors

All vdevs are shown as online. zpool does not know about the missing drive; even up to 10 minutes later zpool has no idea about the change in its hardware, unless an I/O has been issued (more on that later).

So what I did in my testing was to run ‘zpool scrub motherducker’, with the intention of getting zpool to update the status of the pool. However, this is what was in fact causing the hang.

‘zpool scrub’ seems to use the information from ‘zpool status’ to figure out which vdevs to operate on. Since all vdevs are supposedly still online, zpool scrub attempts to scrub all of them. Obviously, it can’t do that, because one vdev is in fact unavailable, which causes zfs to catch on fire and cry for help.

The proper way of notifying zpool about its failed hardware seems to be one of the following (sketched below):

  • wait up to 10 minutes for an automatic status update, or
  • issue an I/O operation, like ‘touch /motherducker/fu.txt’
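In practice the second option is just this (the file name is arbitrary):

```
# any I/O against the pool forces ZFS to notice the missing vdev
touch /motherducker/fu.txt
zpool status motherducker
```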

Then zfs does not hang and ‘zpool status’ looks like this:

  pool: motherducker
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 0h10m with 0 errors on Wed Jan 31 12:45:56 2018
config:

        NAME          STATE     READ WRITE CKSUM
        motherducker  DEGRADED     0     0     0
          mirror-0    ONLINE       0     0     0
            sdb       ONLINE       0     0     0
            sda       ONLINE       0     0     0
          mirror-1    ONLINE       0     0     0
            sde       ONLINE       0     0     0
            sdf       ONLINE       0     0     0
          mirror-2    DEGRADED     0     0     0
            sdg       UNAVAIL      0    46     0  corrupted data
            sdh       ONLINE       0     0     0

errors: No known data errors

Sanity, finally!

This looks like a bug in ‘zpool’ to me. Other operations like ‘zpool clear’ also hang the zfs module when ‘zpool status’ is not in sync with reality.
I would expect zpool to update the status of the pool before every operation, to avoid taking action based on outdated information and then hanging the whole module.

Granted, this is an edge case, since the lag time of zpool status is 10 minutes max.

Still, in this day and age of computing, this feels like blindly trusting the navigation system in a car which has not been updated in the last 5 years. You will reach your destination 99% of the time and be fine, but 1% of the time things will go horribly wrong. :confused:

2 Likes

Tried importing the pool read-only with a live FreeNAS 11. ZFS on FreeNAS actually catches the error, amazing!

When issuing a ‘zpool clear’ after the drive was pulled, it does not blindly trust the last status update like ZoL does, but actually checks whether the vdev is accessible.

This check fails, of course, and returns errors to the console like you would expect.
After that the vdev shows up as removed in ‘zpool status’.

So it looks like the issue can be narrowed down to the particular implementation of zfsonlinux.

You mean ‘zpool detach’ and ‘zpool attach’? Yep, this works fine. Though, that is a different use case.

Glad you got to the bottom of it!

If you open a ticket with ZoL on GitHub (I think you should, FWIW), could you link it here so we can track progress?

No, I meant by using the command eject, as in eject /dev/sdg.

Glad you got it sorted, btw!

Edit: Another way to safely remove a device would be with echo 1 > /sys/block/sdg/device/delete
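If you need the device back afterwards without physically reseating it, rescanning its SCSI host usually does the trick (the host number varies per system):

```
# bring a deleted disk back by rescanning the SCSI host it hangs off of
echo "- - -" > /sys/class/scsi_host/host0/scan
```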

1 Like

To me this sounds like the HBA isn’t giving proper updates to the OS. And…

Given your experience here, I would suspect it’s how the driver in Linux handles your HBA. Where the driver in FreeBSD is properly handling the hot removal of a device, Linux is not.

I only suggest this because having ZFS go 10 minutes without realizing a disk doesn’t exist anymore is a completely foreign experience to me. Not that I’m the end-all be-all of ZoL experience, but I have worked on several different machines, and have dealt with failed disks or just straight-up replacing working disks for a larger pool. ZFS has always known when the disk goes away.

1 Like

Does MD immediately detect the removed drive?

Also, any firmware updates available for your HBA? OEM drivers?

Ah I see. ‘eject’ did not work: “non hot-pluggable device”.

But ‘echo 1 > /sys/block/sdg/device/delete’ worked fine and did a clean eject that the zpool could handle. Even when performing a ‘zpool scrub’, zfs did not hang and worked as expected.

So another sign pointing to the RAID controller and driver not properly communicating a hard removal of the drive.

MD behaved like ZFS, as in only noticing the drive was missing when the next I/O was issued, and not right away when the drive was removed.
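(We checked the member state with the usual tools; the array name is an example:)

```
# overview of all md arrays and their member states
cat /proc/mdstat

# per-array detail; the pulled disk shows up as faulty/removed once md notices
mdadm --detail /dev/md0
```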

So I went on the firmware update odyssey of HPE servers.

There was actually an available update for the P440ar, from 4.42 to 6.06. The release notes gave no hint about anything like hardware failure communication. Performed the update anyway; the behavior did not change. ZFS still hangs in the described scenario.

When browsing the abomination called “Hewlett Packard Enterprise Support Center” I also noticed a “CRITICAL” firmware update for the drives we use in the zpool.
The release notes were again kinda worthless, but might as well try it.

Though, when in HBA mode, firmware updates to drives can only be performed in “offline” mode, meaning having to boot the “Service Pack for ProLiant” ISO and performing the update from there.

Well, we did that. The instructions were again kinda meh, but we finally got it working after a few hours. Though, the SPP also updated the firmware of the UEFI, NICs, iLO and some other crap.

Guess what, now the ZFS hot removal works as expected, no hanging zfs module, even when running ‘zpool scrub’ directly after!!! :open_mouth:

Sadly, now I can’t exactly pinpoint which of the firmware updates actually fixed it: the UEFI, the drives, the others, or any combination of those. In retrospect it would have been possible to apply the firmware updates one by one. Luckily, we are going to install a second system with an identical hardware configuration later on, so I will have a chance at figuring it out completely.

Still super weird that FreeNAS could seemingly handle the misbehaving firmware in its driver, whereas CentOS could not.

Oh, almost forgot, now 2 of the drives’ fault indicator LEDs are permanently lit amber, meaning “complete drive failure”. The drives work fine, SMART says they are fine, the P440ar says they are fine, the iLO says they are fine, ZFS says they are fine, just the drives themselves think they are dead.

GG HP, going to open a support case now. :poop:

EDIT: the second installation with identical hardware never had the hanging issue. Why, I have no idea.

Exporting and importing the zpool cleared the drives’ fault LEDs. Apparently zfs can talk directly to a drive’s controller and convince it that it is dead.
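For the record, that was just (pool name as above):

```
# export and re-import the pool; this cleared the drive fault LEDs for us
zpool export motherducker
zpool import motherducker
```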

tl;dr when in doubt, always update all the firmwares.

3 Likes

:clap:

1 Like