ZFS hangs after removing physical drive from zpool

We are running a zpool with 6 drives, consisting of a stripe of 3 mirrors:

NAME                     STATE     READ WRITE CKSUM
motherducker             ONLINE       0     0     0
  mirror-0               ONLINE       0     0     0
    sdb                  ONLINE       0     0     0
    sda                  ONLINE       0     0     0
  mirror-1               ONLINE       0     0     0
    sde                  ONLINE       0     0     0
    sdf                  ONLINE       0     0     0
  mirror-2               ONLINE       0     0     0
    sdg                  ONLINE       0     0     0
    sdh                  ONLINE       0     0     0

When physically removing one of the drives, the pool hangs the next time zfs tries to sync:

kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: qemu-kvm D ffff885b60ab8fd0 0 21248 1 0x00000080
kernel: Call Trace:
kernel: [] schedule+0x29/0x70
kernel: [] schedule_timeout+0x239/0x2c0
kernel: [] ? __wake_up+0x44/0x50
kernel: [] ? taskq_dispatch_ent+0x57/0x170 [spl]
kernel: [] io_schedule_timeout+0xad/0x130
kernel: [] ? prepare_to_wait_exclusive+0x56/0x90
kernel: [] io_schedule+0x18/0x20
kernel: [] cv_wait_common+0xb2/0x150 [spl]
kernel: [] ? wake_up_atomic_t+0x30/0x30
kernel: [] __cv_wait_io+0x18/0x20 [spl]
kernel: [] zio_wait+0x10b/0x1b0 [zfs]
kernel: [] zil_commit.part.12+0x4ae/0x830 [zfs]
kernel: [] zil_commit+0x17/0x20 [zfs]
kernel: [] zfs_fsync+0x77/0xf0 [zfs]
kernel: [] zpl_fsync+0x65/0x90 [zfs]
kernel: [] do_fsync+0x65/0xa0
kernel: [] SyS_fdatasync+0x13/0x20
kernel: [] system_call_fastpath+0x16/0x1b

When this situation occurs, the zfs kernel module stays unresponsive until the system is hard reset, as it even prevents a normal reboot.

Shouldn't you be able to replace a physical device in a zpool without the whole pool freezing and the system needing a hard reset? Isn't that the whole point of having redundant storage, in this case in the form of mirrors?

(running zfsonlinux 0.7.5 on CentOS 7.4)

From the FreeNAS wiki; the advice is not specific to FreeNAS, but applies to ZFS pools in general:

If the disk is formatted with ZFS, click the disk's entry then its "Offline" button in order to change that disk's status to OFFLINE. This step is needed to properly remove the device from the ZFS pool and to prevent swap issues. If your hardware supports hot-pluggable disks, click the disk's "Offline" button, pull the disk, then skip to step 3. If there is no "Offline" button but only a "Replace" button, then the disk is already offlined and you can safely skip this step.

Note

If the process of changing the disk's status to OFFLINE fails with a "disk offline failed - no valid replicas" message, you will need to scrub the ZFS volume first using its "Scrub Volume" button in Storage ‣ Volumes ‣ View Volumes. Once the scrub completes, try to "Offline" the disk again before proceeding.

If the hardware is not AHCI capable, shutdown the system in order to physically replace the disk. When finished, return to the GUI and locate the OFFLINE disk.

Once the disk has been replaced and is showing as OFFLINE, click the disk again and then click its "Replace" button. Select the replacement disk from the drop-down menu and click the "Replace Disk" button. If the disk is a member of an encrypted ZFS pool, the menu will also prompt you to input and confirm the passphrase for the pool. Once you click the "Replace Disk" button, the ZFS pool will start to resilver and the status of the resilver will be displayed.

Basically, you need to mark the failed drive as offline before you replace it, then you can begin the resilvering process.
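
On the command line (outside the FreeNAS GUI), that procedure would look roughly like the following sketch; the pool and device names are taken from this thread, and /dev/sdX is just a placeholder for the new disk:

    # offline the failing disk before pulling it
    zpool offline motherducker sdg
    # physically swap the disk, then resilver onto the replacement
    zpool replace motherducker sdg /dev/sdX
    # watch the resilver progress
    zpool status motherducker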

Goto 8.1.10

https://doc.freenas.org/9.3/freenas_storage.html

Fair enough, but what if there is an actual hardware defect that causes the drive to vanish from the OS? Surely it can't be expected behavior that the whole storage pool becomes unresponsive, right?

Unsure.

Paging @wendell

I'm not sure if this will help, because I haven't been crazy enough to just pull a drive from a running zpool, but try setting failmode=continue on your pool.
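
For reference, that would be something like this (pool name taken from this thread):

    # set the failmode property on the imported pool
    zpool set failmode=continue motherducker
    # verify the setting
    zpool get failmode motherducker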

Is the drive removed hot or cold?
UEFI setting for "SATA hotplug enabled"?

We tried setting failmode=continue, but have not experienced any difference in behavior.
According to the documentation the default value is 'wait', which stops the zpool until a 'zpool clear' is performed. 'continue' is supposed to put the pool into a read-only mode, which is still kinda meh.
But in our experience the pool and the kernel module hang with both settings, so we would not even be able to issue a 'zpool clear' either way. We also tried putting the drive back in, but the zpool keeps hanging with both options until the system is rebooted.

It is a hot remove by just pulling the drive out of the cage, basically simulating a complete drive failure.

Regarding hotplug support, I am not sure. There is no setting in the UEFI. The drives are connected to an HP P440ar in HBA mode, whose documentation only mentions that hotplug-compatible drives can be hotplugged… And yes, the drives are hotplug compatible according to their data sheet.

Before the zpool hangs, we see a clean remove event by the P440ar in dmesg.

Other drives connected to the same controller, but which are members of an md RAID, disconnect and reconnect fine without hanging the RAID.

Someone's going to ask it, so let's get it out of the way. P440ar is in HBA mode, yes?

Is it possible to test in FreeBSD?

Yes, P440ar is in HBA mode.

Found this: https://github.com/zfsonlinux/zfs/issues/3461

Looks like the hang is intended behavior. Also the failmode option seems to be completely useless, since you will need a reboot to restore function to the zpool anyway.

Other file systems don't behave like this, right, so wtf?
Are complete drive failures really that uncommon?

Sounds like there's a way to recover from this without a reboot. If you put the drive back in, does the machine come back? Maybe it needs to sit for a spell to make sure that the data is accurate?

But I totally agree. This is weird behavior. I mean, given the mission of ZFS, Data Integrity Uber Alles, it kinda makes sense. After a fashion. ??? Still really, really weird.

Yes, we tried putting the drive back in. It just keeps hanging. Supposedly this is because of the pool option 'failmode=wait' at default, which halts the pool until a 'zpool clear' is issued. But then again we can't use the 'zpool' command, because the whole module is hanging.

We also tried this:

you can set the failmode=continue zpool property to prevent the pool from suspending when the drive is hard removed from a non-redundant configuration. This will result in errors to the applications but it should not hang the system.

It hangs the system, too.

So is this specific to our hardware, or has like hardly anyone ever tested what happens when an online vdev gets pulled from a zpool? :thinking:

In that github thread, the drive that's pulled actually kills a zvol, and it makes more sense to hang the system in that case so that the admin can try to scrape together whatever data is there (in ARC for instance).

I don't think it's supposed to behave that way if a drive is pulled from a redundant array. I tested this on FreeNAS a while back. I was able to crash services by pulling random drives, but not ZFS, and the system wouldn't hang. I believe the service crashes were due to non-redundant swap partitions on the raidz drives. After I disabled swap, I wasn't able to crash services by pulling drives. That said, my testing wasn't exhaustive (I didn't have time to pull drives all day).

So maybe it's a bug or compatibility issue with ZFS on Linux? Is it possible to pop in a FreeNAS USB and see if you experience the same issue in FreeBSD?

Just out of curiosity, does the same thing happen when you eject the drive?

The devil is in the details! …in this case in my testing methodology.

When pulling the drive and then issuing a "zpool status", it looks like this:

 pool: motherducker
 state: ONLINE
config:

    NAME          STATE     READ WRITE CKSUM
    motherducker  ONLINE       0     0     0
      mirror-0    ONLINE       0     0     0
        sdb       ONLINE       0     0     0
        sda       ONLINE       0     0     0
      mirror-1    ONLINE       0     0     0
        sde       ONLINE       0     0     0
        sdf       ONLINE       0     0     0
      mirror-2    ONLINE       0     0     0
        sdg       ONLINE       0     0     0
        sdh       ONLINE       0     0     0

errors: No known data errors

All vdevs are shown as online. zpool does not know about the missing drive; even up to 10 minutes later zpool has no idea about the change in its hardware, unless an I/O has been issued (more on that later).

So what I did in my testing was to run 'zpool scrub motherducker', with the intention of getting zpool to update the status of the pool. However, this is what was in fact causing the hang.

'zpool scrub' seems to use the information of 'zpool status' to figure out which vdevs to operate on. Since all vdevs are supposedly still online, zpool scrub attempts to scrub all of them. Obviously, it can't do that, because one vdev is in fact unavailable, which causes zfs to catch on fire and cry for help.

The proper way of notifying zpool about its failed hardware seems to be either:

  • wait up to 10 minutes for an automatic status update, or
  • issue an I/O operation, like 'touch /motherducker/fu.txt'

Then zfs does not hang and 'zpool status' looks like this:

  pool: motherducker
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 0h10m with 0 errors on Wed Jan 31 12:45:56 2018
config:

        NAME          STATE     READ WRITE CKSUM
        motherducker  DEGRADED     0     0     0
          mirror-0    ONLINE       0     0     0
            sdb       ONLINE       0     0     0
            sda       ONLINE       0     0     0
          mirror-1    ONLINE       0     0     0
            sde       ONLINE       0     0     0
            sdf       ONLINE       0     0     0
          mirror-2    DEGRADED     0     0     0
            sdg       UNAVAIL      0    46     0  corrupted data
            sdh       ONLINE       0     0     0

errors: No known data errors

Sanity, finally!
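
Put together, the sequence that avoided the hang on our setup looks roughly like this (file path and device names are examples from above; /dev/sdX is just a placeholder for a replacement disk):

    # after pulling the disk, force an I/O so ZFS notices the missing vdev
    touch /motherducker/fu.txt
    # the pool should now report DEGRADED and the pulled vdev as UNAVAIL
    zpool status motherducker
    # only then run maintenance commands such as scrub, clear or replace
    zpool replace motherducker sdg /dev/sdX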

This looks like a bug in 'zpool' to me. Other operations like 'zpool clear' also hang the zfs module when 'zpool status' is not in sync with reality.
I would expect zpool to update the status of the pool every time before an operation is performed, to avoid taking action based on outdated information and then hanging the whole module.

Granted, this is an edge case, since the lag time of zpool status is 10 minutes max.

Still, in this day and age of computing, this feels like blindly trusting the navigation system in a car which has not been updated in the last 5 years. You will reach your destination 99% of the time and be fine, but 1% of the time things will go horribly wrong. :confused:

Tried importing the pool readonly with a live FreeNAS 11. ZFS on FreeNAS actually catches the error, amazing!

When issuing a 'zpool clear' after the drive was pulled, it does not blindly trust the last status update like ZoL, but actually checks whether the vdev is accessible.

This check fails, of course, and returns errors to the console like you would expect.
After that the vdev shows up as removed in 'zpool status'.

So it looks like the issue can be narrowed down to the particular implementation of zfsonlinux.

You mean 'zpool detach' and 'zpool attach'? Yep, this works fine. Though, that is a different use case.

Glad you got to the bottom of it!

If you open a ticket with ZOL on Github (I think you should FWIW), could you link it here so we can track progress?

No, I meant by using the command eject, as in eject /dev/sdg.

Glad you got it sorted, btw!

Edit: Another way to safely remove a device would be with echo 1 > /sys/block/sdg/device/delete
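
Putting those two together, a software-side removal before physically pulling the disk might look like this sketch (device name taken from this thread):

    # tell ZFS the disk is going away first
    zpool offline motherducker sdg
    # then drop the device from the SCSI layer before pulling it
    echo 1 > /sys/block/sdg/device/delete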

To me this sounds like the HBA isn't giving proper updates to the OS. And…

Given your experience here, I would suspect it's how the driver in Linux handles your HBA. Where the driver in FreeBSD is properly handling the hot removal of a device, Linux is not.

I only suggest this because having ZFS go 10 minutes without realizing a disk doesn't exist anymore is a completely foreign experience to me. Not that I'm the end-all be-all of ZoL experience, but I have worked on several different machines, and have dealt with failed disks or just straight up replacing working disks for a larger pool. ZFS has always known when the disk goes away.
