ZFS hangs after removing physical drive from zpool

We are running a zpool with 6 drives, consisting of a stripe of 3 mirrors:

NAME                     STATE     READ WRITE CKSUM
motherducker             ONLINE       0     0     0
  mirror-0               ONLINE       0     0     0
    sdb                  ONLINE       0     0     0
    sda                  ONLINE       0     0     0
  mirror-1               ONLINE       0     0     0
    sde                  ONLINE       0     0     0
    sdf                  ONLINE       0     0     0
  mirror-2               ONLINE       0     0     0
    sdg                  ONLINE       0     0     0
    sdh                  ONLINE       0     0     0

When physically removing one of the drives, the pool hangs the next time zfs tries to sync:

kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: qemu-kvm D ffff885b60ab8fd0 0 21248 1 0x00000080
kernel: Call Trace:
kernel: [] schedule+0x29/0x70
kernel: [] schedule_timeout+0x239/0x2c0
kernel: [] ? __wake_up+0x44/0x50
kernel: [] ? taskq_dispatch_ent+0x57/0x170 [spl]
kernel: [] io_schedule_timeout+0xad/0x130
kernel: [] ? prepare_to_wait_exclusive+0x56/0x90
kernel: [] io_schedule+0x18/0x20
kernel: [] cv_wait_common+0xb2/0x150 [spl]
kernel: [] ? wake_up_atomic_t+0x30/0x30
kernel: [] __cv_wait_io+0x18/0x20 [spl]
kernel: [] zio_wait+0x10b/0x1b0 [zfs]
kernel: [] zil_commit.part.12+0x4ae/0x830 [zfs]
kernel: [] zil_commit+0x17/0x20 [zfs]
kernel: [] zfs_fsync+0x77/0xf0 [zfs]
kernel: [] zpl_fsync+0x65/0x90 [zfs]
kernel: [] do_fsync+0x65/0xa0
kernel: [] SyS_fdatasync+0x13/0x20
kernel: [] system_call_fastpath+0x16/0x1b

When this situation occurs, the zfs kernel module stays unresponsive until the system is hard reset, as it even prevents a normal reboot.

Shouldn't you be able to replace a physical device in a zpool without the whole pool freezing and the system needing a hard reset? Isn't that the whole point of having redundant storage, in this case in the form of mirrors?

(running zfsonlinux 0.7.5 on CentOS 7.4)

From the FreeNAS wiki; the advice is not specific to FreeNAS, but applies to ZFS pools in general:

If the disk is formatted with ZFS, click the disk's entry then its "Offline" button in order to change that disk's status to OFFLINE. This step is needed to properly remove the device from the ZFS pool and to prevent swap issues. If your hardware supports hot-pluggable disks, click the disk's "Offline" button, pull the disk, then skip to step 3. If there is no "Offline" button but only a "Replace" button, then the disk is already offlined and you can safely skip this step.

Note

If the process of changing the disk's status to OFFLINE fails with a "disk offline failed - no valid replicas" message, you will need to scrub the ZFS volume first using its "Scrub Volume" button in Storage ‣ Volumes ‣ View Volumes. Once the scrub completes, try to "Offline" the disk again before proceeding.

If the hardware is not AHCI capable, shutdown the system in order to physically replace the disk. When finished, return to the GUI and locate the OFFLINE disk.

Once the disk has been replaced and is showing as OFFLINE, click the disk again and then click its "Replace" button. Select the replacement disk from the drop-down menu and click the "Replace Disk" button. If the disk is a member of an encrypted ZFS pool, the menu will also prompt you to input and confirm the passphrase for the pool. Once you click the "Replace Disk" button, the ZFS pool will start to resilver and the status of the resilver will be displayed.

Basically, you need to mark the failed drive as offline before you replace it, then you can begin the resilvering process.
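
On the command line (outside the FreeNAS GUI), that procedure would look roughly like the following sketch; the pool and device names are taken from this thread, and /dev/sdX is just a placeholder for the new disk:

    # offline the failing disk before pulling it
    zpool offline motherducker sdg
    # physically swap the disk, then resilver onto the replacement
    zpool replace motherducker sdg /dev/sdX
    # watch the resilver progress
    zpool status motherducker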

Goto 8.1.10

https://doc.freenas.org/9.3/freenas_storage.html

Fair enough, but what if there is an actual hardware defect that causes the drive to vanish from the OS? Surely it can't be expected behavior that the whole storage pool becomes unresponsive, right?

Unsure.

Paging @wendell

I'm not sure if this will help, because I haven't been crazy enough to just pull a drive from a running zpool, but try setting failmode=continue on your pool.
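
For reference, that would be something like this (pool name taken from this thread):

    # set the failmode property on the imported pool
    zpool set failmode=continue motherducker
    # verify the setting
    zpool get failmode motherducker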

Is the drive removed hot or cold?
UEFI setting for "SATA hotplug enabled"?

We tried setting failmode=continue, but have not experienced any difference in behavior.
According to the documentation the default value is 'wait', which stops the zpool until a 'zpool clear' is performed. 'continue' is supposed to put the pool into a read-only mode, which is still kinda meh.
But in our experience the pool and the kernel module hang with both settings, so we would not even be able to issue a 'zpool clear' either way. We also tried putting the drive back in, but the zpool keeps hanging with both options until the system is rebooted.

It is a hot remove by just pulling the drive out of the cage, basically simulating a complete drive failure.

Regarding hotplug support, I am not sure. There is no setting in the UEFI. The drives are connected to an HP P440ar in HBA mode, whose documentation only mentions that hotplug-compatible drives can be hotplugged… And yes, the drives are hotplug compatible according to their data sheet.

Before the zpool hangs, we see a clean remove event by the P440ar in dmesg.

Other drives connected to the same controller, but which are members of an md RAID, disconnect and reconnect fine without hanging the RAID.

Someone's going to ask it, so let's get it out of the way. P440ar is in HBA mode, yes?

Is it possible to test in FreeBSD?

Yes, P440ar is in HBA mode.

Found this: https://github.com/zfsonlinux/zfs/issues/3461

Looks like the hang is intended behavior. Also the failmode option seems to be completely useless, since you will need a reboot to restore function to the zpool anyway.

Other file systems don't behave like this, right, so wtf?
Are complete drive failures really that uncommon?

Sounds like there's a way to recover from this without a reboot. If you put the drive back in, does the machine come back? Maybe it needs to sit for a spell to make sure that the data is accurate?

But I totally agree. This is weird behavior. I mean, given the mission of ZFS, Data Integrity Uber Alles, it kinda makes sense. After a fashion. ??? Still really, really weird.

Yes, we tried putting the drive back in. It just keeps hanging. Supposedly this is because of the pool option 'failmode=wait' at default, which halts the pool until a 'zpool clear' is issued. But then again we can't use the 'zpool' command, because the whole module is hanging.

We also tried this:

you can set the failmode=continue zpool property to prevent the pool from suspending when the drive is hard removed from a non-redundant configuration. This will result in errors to the applications but it should not hang the system.

It hangs the system, too.

So is this specific to our hardware, or has like hardly anyone ever tested what happens when an online vdev gets pulled from a zpool? :thinking:

In that github thread, the drive that's pulled actually kills a zvol, and it makes more sense to hang the system in that case so that the admin can try to scrape together whatever data is there (in ARC for instance).

I don't think it's supposed to behave that way if a drive is pulled from a redundant array. I tested this on FreeNAS a while back. I was able to crash services by pulling random drives, but not ZFS, and the system wouldn't hang. I believe the service crashes were due to non-redundant swap partitions on the raidz drives. After I disabled swap, I wasn't able to crash services by pulling drives. That said, my testing wasn't exhaustive (I didn't have time to pull drives all day).

So maybe it's a bug or compatibility issue with ZFS on Linux? Is it possible to pop in a FreeNAS USB and see if you experience the same issue in FreeBSD?

Just out of curiosity, does the same thing happen when you eject the drive?

The devil is in the details! …in this case in my testing methodology.

When pulling the drive and then issuing a "zpool status", it looks like this:

 pool: motherducker
 state: ONLINE
config:

    NAME          STATE     READ WRITE CKSUM
    motherducker  ONLINE       0     0     0
      mirror-0    ONLINE       0     0     0
        sdb       ONLINE       0     0     0
        sda       ONLINE       0     0     0
      mirror-1    ONLINE       0     0     0
        sde       ONLINE       0     0     0
        sdf       ONLINE       0     0     0
      mirror-2    ONLINE       0     0     0
        sdg       ONLINE       0     0     0
        sdh       ONLINE       0     0     0

errors: No known data errors

All vdevs are shown as online. zpool does not know about the missing drive; even up to 10 minutes later zpool has no idea about the change in its hardware, unless an I/O has been issued (more on that later).

So what I did in my testing was to run 'zpool scrub motherducker', with the intention of getting zpool to update the status of the pool. However, this is what was in fact causing the hang.

'zpool scrub' seems to use the information of 'zpool status' to figure out which vdevs to operate on. Since all vdevs are supposedly still online, zpool scrub attempts to scrub all of them. Obviously, it can't do that, because one vdev is in fact unavailable, which causes zfs to catch on fire and cry for help.

The proper way of notifying zpool about its failed hardware seems to be either:

  • wait up to 10 minutes for an automatic status update, or
  • issue an I/O operation, like 'touch /motherducker/fu.txt'

Then zfs does not hang and 'zpool status' looks like this:

  pool: motherducker
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 0h10m with 0 errors on Wed Jan 31 12:45:56 2018
config:

        NAME          STATE     READ WRITE CKSUM
        motherducker  DEGRADED     0     0     0
          mirror-0    ONLINE       0     0     0
            sdb       ONLINE       0     0     0
            sda       ONLINE       0     0     0
          mirror-1    ONLINE       0     0     0
            sde       ONLINE       0     0     0
            sdf       ONLINE       0     0     0
          mirror-2    DEGRADED     0     0     0
            sdg       UNAVAIL      0    46     0  corrupted data
            sdh       ONLINE       0     0     0

errors: No known data errors

Sanity, finally!
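
Put together, the sequence that avoided the hang on our setup looks roughly like this (file path and device names are examples from above; /dev/sdX is just a placeholder for a replacement disk):

    # after pulling the disk, force an I/O so ZFS notices the missing vdev
    touch /motherducker/fu.txt
    # the pool should now report DEGRADED and the pulled vdev as UNAVAIL
    zpool status motherducker
    # only then run maintenance commands such as scrub, clear or replace
    zpool replace motherducker sdg /dev/sdX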

This looks like a bug in 'zpool' to me. Other operations like 'zpool clear' also hang the zfs module when 'zpool status' is not in sync with reality.
I would expect zpool to update the status of the pool every time before an operation is performed, to avoid taking action based on outdated information and then hanging the whole module.

Granted, this is an edge case, since the lag time of zpool status is 10 minutes max.

Still, in this day and age of computing, this feels like blindly trusting the navigation system in a car which has not been updated in the last 5 years. You will reach your destination 99% of the time and be fine, but 1% of the time things will go horribly wrong. :confused:

Tried importing the pool readonly with a live FreeNAS 11. ZFS on FreeNAS actually catches the error, amazing!

When issuing a 'zpool clear' after the drive was pulled, it does not blindly trust the last status update like ZoL, but actually checks whether the vdev is accessible.

This check fails, of course, and returns errors to the console like you would expect.
After that the vdev shows up as removed in 'zpool status'.

So it looks like the issue can be narrowed down to the particular implementation of zfsonlinux.

You mean 'zpool detach' and 'zpool attach'? Yep, this works fine. Though, that is a different use case.

Glad you got to the bottom of it!

If you open a ticket with ZOL on Github (I think you should FWIW), could you link it here so we can track progress?

No, I meant by using the command eject, as in eject /dev/sdg.

Glad you got it sorted, btw!

Edit: Another way to safely remove a device would be with echo 1 > /sys/block/sdg/device/delete
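
Putting those two together, a software-side removal before physically pulling the disk might look like this sketch (device name taken from this thread):

    # tell ZFS the disk is going away first
    zpool offline motherducker sdg
    # then drop the device from the SCSI layer before pulling it
    echo 1 > /sys/block/sdg/device/delete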

To me this sounds like the HBA isn't giving proper updates to the OS. And…

Given your experience here, I would suspect it's how the driver in Linux handles your HBA. Where the driver in FreeBSD is properly handling the hot removal of a device, Linux is not.

I only suggest this because having ZFS go 10 minutes without realizing a disk doesn't exist anymore is a completely foreign experience to me. Not that I'm the end-all be-all of ZoL experience, but I have worked on several different machines, and have dealt with failed disks or just straight up replacing working disks for a larger pool. ZFS has always known when the disk goes away.
