[Solved] ZFS pool goes kaput in Debian 11, now what?

I created a ZFS pool (named, with great originality, tank) and started copying files over from my old degraded NAS using mv (I know, bad idea, I should’ve used rsync -a instead), with the intent of importing this pool into a potential TrueNAS server in the future (pending a system upgrade on my end). Then we had a brief electrical outage.
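
For reference, the rsync approach I should have used would have looked something like this (paths are placeholders):

$ rsync -a --progress /mnt/old-nas/ /tank/   # copy first; delete the source only after verifying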

After rebooting back into my system, tank no longer exists; the mount point /tank is still there, but now it’s just a plain folder.

Here’s what I’ve tried so far:

$ hexdump -C /dev/disk/by-id/ata-STXXXXXXXXXX-XXXXXX_XXXXXXX | head -20

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000001b0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 ff  |................|
000001c0  ff ff ee ff ff ff 01 00  00 00 ff ff ff ff 00 00  |................|
000001d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000001f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 55 aa  |..............U.|
00000200  45 46 49 20 50 41 52 54  00 00 01 00 5c 00 00 00  |EFI PART....\...|
00000210  b0 63 77 93 00 00 00 00  01 00 00 00 00 00 00 00  |.cw.............|
00000220  af be c0 d1 01 00 00 00  22 00 00 00 00 00 00 00  |........".......|
00000230  8e be c0 d1 01 00 00 00  67 18 8d 11 93 80 47 d0  |........g.....G.|
00000240  80 ca 6f 66 e7 df 44 ec  02 00 00 00 00 00 00 00  |..of..D.........|
00000250  80 00 00 00 80 00 00 00  05 65 ef 52 00 00 00 00  |.........e.R....|
00000260  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000400  c3 8c 89 6a d2 1d b2 11  99 a6 08 00 20 73 66 31  |...j........ sf1|
00000410  1a 0d 2f b2 53 93 4b f2  9b 4c e4 d1 0a 43 62 49  |../.S.K..L...CbI|
00000420  00 08 00 00 00 00 00 00  ff 77 c0 d1 01 00 00 00  |.........w......|
00000430  00 00 00 00 00 00 00 00  7a 00 66 00 73 00 2d 00  |........z.f.s.-.|
00000440  33 00 38 00 35 00 37 00  34 00 34 00 39 00 34 00  |3.8.5.7.4.4.9.4.|
$ zdb -l /dev/disk/by-id/ata-STXXXXXXXXXX-XXXXXX_XXXXXXX
failed to unpack label 0
failed to unpack label 1
failed to unpack label 2
failed to unpack label 3
$ zpool import -a -f -d /dev/disk/by-id/ata-STXXXXXXXXXX-XXXXXX_XXXXXXX
no pools available to import
$ lsblk
NAME            MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
[...]
sdc               8:32   0   3.6T  0 disk  
├─sdc1            8:33   0   3.6T  0 part  
└─sdc9            8:41   0     8M  0 part  
[...]
$ zpool import -f tank
cannot import 'tank': no such pool available
$ zpool import -N tank
cannot import 'tank': no such pool available
$ zpool import tank /dev/sdc 
cannot import 'tank': no such pool available
$ zpool import tank /dev/disk/by-id/ata-STXXXXXXXXXX-XXXXXX_XXXXXXX
cannot import 'tank': no such pool available

System Information:

$ uname -ar
Linux debian 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64 GNU/Linux
$ zfs --version
zfs-2.0.3-9
zfs-kmod-2.0.3-9
$ lsb_release -a                                                                                                           
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 11 (bullseye)
Release:	11
Codename:	bullseye

Do a file system check and retry.

  • What geometry did you originally use to create the pool, a set of mirrors or raidzX?
  • Does “zpool import -D” show anything?
  • Any errors shown in “smartctl -a /dev/sdX” for each disk?
  • What does “blkid -p /dev/sdc1” show?
2 Likes
  • I used sudo zpool create tank /dev/sda to create the pool

  • Output:

    $ zpool import -D
    no pools available to import
    
  • Output:

      smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-9-amd64] (local build)
      Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
    
      === START OF INFORMATION SECTION ===
      Model Family:     Seagate IronWolf
      Device Model:     XXXXXXXXXXX-XXXXXX
      Serial Number:    XXXXXXXX
      LU WWN Device Id: 5 XXXXX XXXXXXXXXX
      Firmware Version: SC60
      User Capacity:    4,000,787,030,016 bytes [4.00 TB]
      Sector Sizes:     512 bytes logical, 4096 bytes physical
      Rotation Rate:    5980 rpm
      Form Factor:      3.5 inches
      Device is:        In smartctl database [for details use: -P show]
      ATA Version is:   ACS-3 T13/2161-D revision 5
      SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
      Local Time is:    Sun Dec 19 13:42:30 2021 IST
      SMART support is: Available - device has SMART capability.
      SMART support is: Enabled
    
      === START OF READ SMART DATA SECTION ===
      SMART overall-health self-assessment test result: PASSED
    
      General SMART Values:
      Offline data collection status:  (0x82)	Offline data collection activity
                          was completed without error.
                          Auto Offline Data Collection: Enabled.
      Self-test execution status:      (   0)	The previous self-test routine completed
                          without error or no self-test has ever 
                          been run.
      Total time to complete Offline 
      data collection: 		(  581) seconds.
      Offline data collection
      capabilities: 			 (0x7b) SMART execute Offline immediate.
                          Auto Offline data collection on/off support.
                          Suspend Offline collection upon new
                          command.
                          Offline surface scan supported.
                          Self-test supported.
                          Conveyance Self-test supported.
                          Selective Self-test supported.
      SMART capabilities:            (0x0003)	Saves SMART data before entering
                          power-saving mode.
                          Supports SMART auto save timer.
      Error logging capability:        (0x01)	Error logging supported.
                          General Purpose Logging supported.
      Short self-test routine 
      recommended polling time: 	 (   1) minutes.
      Extended self-test routine
      recommended polling time: 	 ( 600) minutes.
      Conveyance self-test routine
      recommended polling time: 	 (   2) minutes.
      SCT capabilities: 	       (0x50bd)	SCT Status supported.
                          SCT Error Recovery Control supported.
                          SCT Feature Control supported.
                          SCT Data Table supported.
    
      SMART Attributes Data Structure revision number: 10
      Vendor Specific SMART Attributes with Thresholds:
      ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   081   064   044    Pre-fail  Always       -       114528690
      3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   097   097   020    Old_age   Always       -       3109
      5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   086   060   045    Pre-fail  Always       -       392671551
      9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7267 (207 94 0)
      10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
      12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       193
      184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
      187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
      188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
      189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
      190 Airflow_Temperature_Cel 0x0022   059   044   040    Old_age   Always       -       41 (Min/Max 27/46)
      191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
      192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1115
      193 Load_Cycle_Count        0x0032   072   072   000    Old_age   Always       -       57925
      194 Temperature_Celsius     0x0022   041   056   000    Old_age   Always       -       41 (0 27 0 0 0)
      197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
      198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
      199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
      240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       4260 (139 4 0)
      241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       42612098970
      242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       45957996994
    
      SMART Error Log Version: 1
      No Errors Logged
    
      SMART Self-test log structure revision number 1
      Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
      # 1  Short offline       Completed without error       00%      2712         -
      # 2  Short offline       Completed without error       00%       155         -
      # 3  Extended offline    Completed without error       00%       119         -
      # 4  Short offline       Completed without error       00%         1         -
      # 5  Short offline       Completed without error       00%         1         -
      # 6  Short offline       Completed without error       00%         1         -
      # 7  Short offline       Completed without error       00%         0         -
      # 8  Short offline       Completed without error       00%         0         -
      # 9  Short offline       Completed without error       00%         0         -
    
      SMART Selective self-test log data structure revision number 1
      SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
          1        0        0  Not_testing
          2        0        0  Not_testing
          3        0        0  Not_testing
          4        0        0  Not_testing
          5        0        0  Not_testing
      Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
      If Selective self-test is pending on power-up, resume after 0 minute delay.
    
  • Output:

    $ blkid -p /dev/sdc1
    /dev/sdc1: UUID="424be1fb-d239-1bce-ba39-38c2ecea6031" UUID_SUB="03ebe2a2-b5f2-c91a-bf41-215eb4c4d2df" LABEL="Vanilla:0" VERSION="1.2" TYPE="linux_raid_member" USAGE="raid" PART_ENTRY_SCHEME="gpt" PART_ENTRY_NAME="zfs-385744946b2f74a8" PART_ENTRY_UUID="b22f0d1a-9353-f24b-9b4c-e4d10a436249" PART_ENTRY_TYPE="6a898cc3-1dd2-11b2-99a6-080020736631" PART_ENTRY_NUMBER="1" PART_ENTRY_OFFSET="2048" PART_ENTRY_SIZE="7814017024" PART_ENTRY_DISK="8:32"
    

Wait… where are sdc{2-8}…? What happened here?

Did you maybe point ZFS at the whole devices while leaving the partition table backup at the end of the device untouched, and it eventually got restored on top of the ZFS superblock somehow? Really weird stuff.

This matches what my ZFS system does: you give it a whole disk and it partitions it as it pleases.

1 Like

This creates a single-disk ZFS pool; if so, where did /dev/sdc come from?

Ignore huge raw values like this; this is Seagate packing multi-byte counters into a single field.

I’ve had this indicate a bad SATA cable in the past, but if it’s only 1 then it’s probably pure chance.

I don’t see anything obviously bad in the smartctl output, so let’s assume that at least that disk is fine. How do the other disks look?

This blkid output is a bit weird; it’s showing both linux_raid_member and zfs. Did you reuse a disk for ZFS that was previously mdraid, or did you try doing mdraid on top of ZFS?

Either way this implies that the partition/disk is part of mdraid and not zfs. Is /dev/sdc definitely the disk you used for zfs?

As an example this is what I get:

/dev/sdb1: VERSION="5000" LABEL="tank" UUID="..." UUID_SUB="..." TYPE="zfs_member" USAGE="filesystem" PART_ENTRY_SCHEME="gpt" PART_ENTRY_NAME="zfs-..." PART_ENTRY_UUID="..." PART_ENTRY_TYPE="... PART_ENTRY_NUMBER="1" PART_ENTRY_OFFSET="2048" PART_ENTRY_SIZE="...6" PART_ENTRY_DISK="8:32"

1 Like

Ahhh I see, …

… I never tried giving it raw devices due to bad experiences with bootloaders in the past. I always point it at a partition, and I try to leave enough buffer at the beginning and end of the disk (I pinch 1G on either side).

Apparently this 8M reservation is some weird Solaris thing and is normal, despite looking confusing to me; good to know.

So the actual ZFS data is on /dev/sdX1, I’m guessing, so zdb should let you walk the data structures on this auto-created partition.
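
Something like this, I’d assume (pointing at partition 1 via the by-id path rather than the whole disk):

$ zdb -l /dev/disk/by-id/ata-STXXXXXXXXXX-XXXXXX_XXXXXXX-part1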

1 Like

/dev/sdc was /dev/sda at one point, it changed because I inserted an SSD into my hot swap bay and I fetched the command from my .bash_history file from a time before that.

The disk was formerly part of a RAID1 array in a WD My Cloud EX2 Ultra (hence the label “Vanilla”, which for some reason never went away). I formatted the entire drive and then created a new GPT partition table, but “Vanilla” still stayed as a label, so I figured it was just Debian derping out and that it wasn’t representative of what’s actually on disk.

Also, I presumed it was being reported as linux_raid because perhaps GParted doesn’t know how to interpret ZFS? Maybe that was foreshadowing that something was gonna go wrong :frowning:

In my defense, I did try to recreate the zpool thrice (after creating a new partition table in GParted each time) and it still retained the linux_raid label.

No damn clue; I thought that was a ZFS-ism.

2 Likes

It’s possible that the data is still there, but Linux is only loading the first filesystem it finds. (Very simply, there are magic bytes that identify filesystems, and it’s possible for the magic bytes of one filesystem to still exist after a quick reformat with a different filesystem.)

Can you try running this command to read and report the filesystems that the system can find on the disk? (The command is called wipefs, but by default it doesn’t do any wiping - it’s very badly named).

wipefs -n /dev/sdc*

I’m guessing it will show both zfs and md-raid.

2 Likes

I always figured that stuff is stored in the partition table, à la partition type GUIDs, and that any magic bytes would be ignored once a new partition table had been created.

That being said…

$ wipefs -n /dev/sdc*
DEVICE OFFSET        TYPE              UUID                                 LABEL
sdc    0x200         gpt                                                    
sdc    0x3a3817d5e00 gpt                                                    
sdc    0x1fe         PMBR                                                   
sdc1   0x1000        linux_raid_member 424be1fb-d239-1bce-ba39-38c2ecea6031 Vanilla:0
sdc1   0x3f000       zfs_member        8199371577609406066                  tank
[...]
sdc1   0x60000       zfs_member        8199371577609406066                  tank
sdc1   0x3a380dbf000 zfs_member        8199371577609406066                  tank
[...]
sdc1   0x3a380de0000 zfs_member        8199371577609406066                  tank

…I’m feeling pretty stupid right now. Rewriting the whole disk with zeros is time-consuming and I had limited time, but now even more time has been spent trying to fix something because I was trying to save time.

yay

1 Like

I would try zpool import -D and plain zpool import with no drive address (or letters). If it still comes up blank, I would just create a new pool and consider the old one trashed. If it does not come up blank, you might need -N (a non-destructive check of whether it can import), and possibly -F / -X in case it needs to dump transactions. Check the manual for what each does first.
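
Roughly this sequence, as a sketch (read the man page before the -F / -X options; they can discard recent transactions):

$ zpool import              # scan the default device locations and list anything importable
$ zpool import -D           # also list destroyed pools
$ zpool import -N -f tank   # import without mounting datasets, forcing if it looks in use
$ zpool import -F tank      # roll back the last few transactions if the pool is damaged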

It’s just a single drive, so it’s only for transient/temporary data.

I’m not knocking a single drive pool; I have a laptop with a single drive pool, and a pi.
But they are doomed to die an unrecoverable death, and when they do, no tear shall be shed.

As for the initial population, if the source is an existing ZFS pool, you could zfs send/receive the required datasets (or the whole pool) with all the snapshots you need?
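
Something along these lines, assuming made-up pool and dataset names:

$ zfs snapshot -r oldpool/data@migrate
$ zfs send -R oldpool/data@migrate | zfs receive -u tank/data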

The partition 1 and partition 9 layout is normal Solaris behaviour, as is keeping disk labels intact. Leaving the raid member signature behind is a bit odd, but I’ve not used a former raid member as part of a pool.

ZFS *should* write several copies of its labels across the drive (four of them, going by the label 0-3 lines in the zdb output: two at the start of the device and two at the end) to survive a partial overwrite, but I don’t know if it requires another part of a vdev to validate what is good on a resilver (as in, needing to be part of at least a mirror).

2 Likes

When switching filesystems at the very least you can use wipefs to remove magic bytes (use the -a option to do the dangerous stuff), or use dd to zero the first 4MB or so - most magic bytes are stored there. Though some filesystems also have a footer, like some forms of mdraid.
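
For example (sdX is a placeholder; both of these are destructive, so triple-check the target device):

$ wipefs -a /dev/sdX                          # remove every signature wipefs knows about, headers and footers
$ dd if=/dev/zero of=/dev/sdX bs=1M count=4   # belt and braces: zero the first 4 MiB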

Partition type IDs are basically ignored by Windows and Linux; I think some BSDs and macOS still look at them, though. Most reformats will wipe the old headers, but mdraid is probably using a footer here.

Okay, so I’m of two minds. On one hand, the mdraid array may have assembled automatically and tried to “correct” any problems, wiping your data (check /proc/mdstat); on the other hand, the pool may just import if you remove that linux_raid_member header. If you really want to try this, make a backup of the whole disk first, and read the wipefs man page on how to remove a single signature. Removing that linux_raid_member signature may do something. I’d expect ZFS to find its superblocks and import the pool, but it may be trusting the blkid identification on Linux? At this point assume your data is missing.
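
Roughly like this, as a sketch (the 0x1000 offset comes from the wipefs -n output above; --backup stashes a copy of what it removes in your home directory):

$ cat /proc/mdstat                       # check whether mdraid has already grabbed the disk
$ wipefs -n /dev/sdc1                    # dry run: list signatures and their offsets
$ wipefs --backup -o 0x1000 /dev/sdc1    # remove only the linux_raid_member signature at that offset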

The next step is to try running an “undelete” program on your old drive; the data you moved may still be there.

1 Like

When switching filesystems at the very least you can use wipefs to remove magic bytes (use the -a option to do the dangerous stuff), or use dd to zero the first 4MB or so - most magic bytes are stored there. Though some filesystems also have a footer, like some forms of mdraid.

Will keep this in mind in the future. I didn’t know that kind of non-standard behavior was to be expected; I kinda hoped GParted would do this for me by default, or would at least suggest the option if it saw a linux_raid_member and knew about these shenanigans.

At this point assume your data is missing.

Presuming mv doesn’t delete anything until everything is copied, nothing is lost. Speaking of which, does it work that way? If it does, I’ll just zero out the first and last gig of the drive, do a quick format, and run rsync -a like I should’ve. If it doesn’t, I’ll need to recover that data.

1 Like

AFAIK yes, mv waits till everything has been copied before deleting at the source. But you should go through and verify manually.
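
A quick way to verify would be something like this (paths are made up):

$ diff -qr /mnt/old-nas/share /tank/share        # report files that differ or exist on only one side
$ rsync -rcni /mnt/old-nas/share/ /tank/share/   # dry-run checksum compare, itemizing any mismatches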

2 Likes

I’ve run wipefs -fo 0x1000 /dev/sdc1 (the offset specified in the output of wipefs -n /dev/sdc*) and verified that the signature is gone using wipefs -t linux_raid_member -f /dev/sdc* (and by opening GParted, which now reports it as ZFS).

Still didn’t work. Tried all my above commands, a system reboot, a reinstallation of ZFS.

Fuck it, time to start clean.

Here’s a little command I wrote…

function scrubby {
  # Zero out roughly the first and last 1 GB of a block device
  # (512-byte sectors * 2048000 = 1,048,576,000 bytes, ~1.0 GB per end)
  local SCRUB_QUANTITY=2048000
  [[ $EUID -gt 0 ]] && echo "${FUNCNAME[0]}: This tool must be run as superuser" && return 1
  [[ -z "$1" ]] && echo "${FUNCNAME[0]}: No block device specified (e.g. /dev/sdx)" && return 1
  # Wipe the start of the device (partition table plus most filesystem headers)
  dd if=/dev/zero of="$1" bs=512 count=$SCRUB_QUANTITY
  # Wipe the end of the device (backup GPT and any trailing mdraid/ZFS labels)
  dd if=/dev/zero of="$1" bs=512 seek=$(( $(blockdev --getsz "$1") - $SCRUB_QUANTITY )) count=$SCRUB_QUANTITY
  return 0
}
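
Invocation (from a root shell, with the by-id path from earlier) is just:

$ scrubby /dev/disk/by-id/ata-STXXXXXXXXXX-XXXXXX_XXXXXXX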

Here’s what it did…

$ hexdump -Cv /dev/disk/by-id/ata-STXXXXXXXXXX-XXXXXX_XXXXXXX | head -20
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000080  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000b0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000e0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000100  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000110  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000120  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000130  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

The ZFS gods are still mad at me for some reason:

$ zpool create tank /dev/disk/by-id/ata-STXXXXXXXXXX-XXXXXX_XXXXXXX
/dev/disk/by-id/ata-STXXXXXXXXXX-XXXXXX_XXXXXXX is in use and contains a unknown filesystem.

The lsof output is empty and so is my patience. I’ll just make a plain, boring partition and move on; ZFS can be the subject of a future project.

Thank you everyone for your help! I learned a thing or three about ZFS, even if I didn’t really want to right now, but I’m running out of time.

Hope this thread is useful to someone, someday.

Cheers!

2 Likes

-f would force zpool to create the new pool from disks marked as unclean.

Instead of wiping parts of the disk, you might just use fdisk / cfdisk to give it a new partition table, then create the pool?
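
As a sketch of that route (device paths as in the thread; sgdisk is a non-interactive alternative to fdisk/cfdisk, and everything here is destructive):

$ wipefs -a /dev/sdc          # clear any remaining filesystem/raid signatures
$ sgdisk --zap-all /dev/sdc   # blow away the GPT and protective MBR
$ zpool create -f tank /dev/disk/by-id/ata-STXXXXXXXXXX-XXXXXX_XXXXXXX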