Hey guys. So, I’ve been having a problem with my FreeNAS for a while now. The pool has been degraded with Disk 7 showing errors.
Last week I decided to attempt to replace the drive with a spare that I had. I followed some instructions I found in a YouTube video, as I have never had to replace a drive in a ZFS pool before.
Basically the process was:
Go to pool, hit cog button, go to status, select dropdown for drive and hit offline
Shutdown, replace drive, startup
Go back into pool, cog button, dropdown for offline drive, replace, select new drive
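For reference, I believe those GUI steps map roughly to these ZFS commands (a sketch only — the gptid values here are pulled from my status output below, and on an encrypted pool the GUI is also doing geli setup behind the scenes, which these raw commands do not):

```shell
# Take the failing member offline (name as shown in `zpool status`)
zpool offline pool01 gptid/e65f4b08-5da0-11e9-8186-38d5477b3538.eli

# ...shut down, swap the physical drive, boot back up...

# Ask ZFS to rebuild onto the new device; the resilver starts automatically
zpool replace pool01 \
    gptid/e65f4b08-5da0-11e9-8186-38d5477b3538.eli \
    gptid/90597b18-9d21-11e9-a473-38d5477b3538.eli

# Watch resilver progress
zpool status -v pool01
```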
It then spends a lot of time resilvering, but nothing ever happens. The new disk is in the list, but so is the offline failed disk. It never replaces it.
I tried this twice, and now I have two offline failed disks plus the new replacement disk.
I think this might be happening because the pool is encrypted.
Now, I have been on vacation since Wednesday, and today I got this email from my FreeNAS:
Checking status of zfs pools:
NAME          SIZE   ALLOC  FREE   CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH    ALTROOT
freenas-boot  111G   759M   110G   -        -         -     0%   1.00x  ONLINE    -
pool01        14.5T  5.74T  8.76T  -        -         0%    39%  1.00x  DEGRADED  /mnt
pool: pool01
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: resilvered 697G in 0 days 06:45:09 with 3 errors on Tue Jul 2 23:17:14 2019
config:
NAME                                              STATE     READ WRITE CKSUM
pool01                                            DEGRADED     0     0     3
  raidz1-0                                        DEGRADED     0     0     6
    gptid/8945c09e-5da0-11e9-8186-38d5477b3538.eli  ONLINE     0     0     0
    gptid/961dd4f6-5da0-11e9-8186-38d5477b3538.eli  ONLINE     0     0     0
    gptid/a43533f7-5da0-11e9-8186-38d5477b3538.eli  ONLINE     0     0     0
    gptid/b0facdff-5da0-11e9-8186-38d5477b3538.eli  ONLINE     0     0     0
    gptid/bee72936-5da0-11e9-8186-38d5477b3538.eli  ONLINE     0     0     0
    gptid/cbace841-5da0-11e9-8186-38d5477b3538.eli  ONLINE     0     0     0
    gptid/d9830107-5da0-11e9-8186-38d5477b3538.eli  ONLINE     0     0     0
    replacing-7                                   OFFLINE      0     0     0
      15864807740488987168                        OFFLINE      0     0     0  was /dev/gptid/e65f4b08-5da0-11e9-8186-38d5477b3538.eli
      4455510982713987333                         OFFLINE      0     0     0  was /dev/gptid/40c390fb-9c82-11e9-bbc5-38d5477b3538.eli
    gptid/90597b18-9d21-11e9-a473-38d5477b3538.eli  ONLINE     0     0     0
errors: 3 data errors, use '-v' for a list
-- End of daily output --
So, yeah, it’s borked. I have a backup of the data, but I wanted to try to recover this. I’ll be home later today, but I wanted to get this out there to see if anyone can give me a hand in getting this working again.
They are definitely not the correct drives for this use case: 2TB Seagate ST2000DM001. Just regular desktop-class drives. So far I’ve had to replace two in about four years of use. All eight are running through an IT-flashed PERC H310. Someday I’ll get some 4TB NAS rated drives.
The first failure was a couple years ago, and I was running Fedora with ZFS on top of LUKS. Now, with this one, it’s with FreeNAS and I know very little about it or BSD in general.
I think a scrub happens once a week. Or maybe once a day. I’m not sure. I get a ton of emails from FreeNAS with reports.
The two /dev/gptid/… entries shouldn’t be there. I tried replacing, it looked like it worked, then it didn’t. Tried again, it looked like it was working, then didn’t, and a second /dev/gptid/… appeared.
I have no idea where to go from here. Like I said, the pool is encrypted, and after reboot I have to supply both the passphrase and the keyfile to unlock it. I don’t know if that is what’s causing this or not.
So, after reading the FreeNAS guide (RTFM, right?) I have pretty much come to the conclusion that this thing is hooped. Like, fucked off into low earth orbit, it’s dead Jim, sort of hooped.
Apparently if you don’t do this janky rekey thing immediately after hitting replace drive, and then you reboot the system, “access to the pool might be permanently lost.”
I mean, I guess it makes some sense in retrospect. But seriously? That couldn’t be taken care of behind the scenes? No, let’s hope the user knows this.
Anyway. If anyone knows if there’s a way out of this, it’d be great to know.
Otherwise, since the data on the pool is still accessible, I’d like to find a way to transfer two of the datasets to another pool. One has a couple VMs, the other has the jails. Plex, to be specific, which I want to avoid recreating at all costs. That was a nightmare to get working right.
I’ve got four WD RE4 500GB drives I can throw in there and use as a second pool just for the jails and VMs. Can I move the datasets from the degraded pool to the new one?
It looks like all of your disks are still there: disks 0–7 are online and the resilver says it completed. Make sure you have an offline backup of everything first (which it sounds like you do), then attempt to remove the offline disks from the pool via their respective menus, do an encryption rekey on the pool, and export a new recovery key. Hopefully everything appears as normal after that. Whatever you do, don’t power cycle any further until you get this straightened out.
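If the GUI won’t cooperate, the stuck members of that replacing-7 vdev can in principle be detached by GUID from the shell — a sketch only, using the GUIDs from your status output, and only worth trying with a verified backup since I’m not sure the FreeNAS middleware will pick the change up cleanly:

```shell
# Detach the two dead members of the replacing-7 vdev by their GUIDs
# (GUIDs as printed in `zpool status`)
zpool detach pool01 15864807740488987168
zpool detach pool01 4455510982713987333

# Confirm the replacement disk is now the plain eighth member of raidz1-0
zpool status pool01
```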
All three disks (the two gptid ones and the actual disk da7) are set as Offline. There isn’t any option to remove the disks. Just Edit, Replace, and Online.
Yeah, it’s pretty terrible. And yes, good that I have a backup. I’m going to have to root around and try to backup the Plex database. The last two or three times I’ve had to set Plex up from scratch because I didn’t get the right files or put them in the right place or something.
Feels like I’m stuck in a loop. If you miss that crucial step that’s buried in the manual in an “oh, by the way” sort of way, you’re hosed. Might be nice to add a warning in the regular disk replacement section for encrypted scenarios.
Anyway, is it possible to transfer datasets between pools? Encrypted pools, maybe?
To be honest, if it’s this much of a hassle to encrypt I might go back to ZoL. The LUKS route worked great, and now ZoL has native encryption. Not sure if I will come across this issue in the future with that.
If you have your geli.key backed up, you can attempt this…
Honestly, at this point, you can try anything if you are confident in your backup. But, as you said, you don’t want to wipe your whole config or screw with any VMs you already have working. So I would attempt anything I could find if it’ll save you that hassle and then just know that worst-case, you might have to start over again.
I’ll wait until the resilver is done. Then I’ll see if I can truly hotswap by slamming those 500 GB drives in, creating a second pool, and attempting to move the jails and VMs to it.
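The usual mechanism for that move is snapshot plus send/receive. A sketch of what I’d plan to run — the dataset names (pool01/jails, pool01/vms) and new pool name (pool02) are assumptions, so adjust to whatever the real layout is:

```shell
# Snapshot the datasets (recursively, to catch child datasets)
zfs snapshot -r pool01/jails@migrate
zfs snapshot -r pool01/vms@migrate

# Replicate each dataset tree to the new pool
zfs send -R pool01/jails@migrate | zfs recv -F pool02/jails
zfs send -R pool01/vms@migrate   | zfs recv -F pool02/vms
```

After that, the jail and VM configuration would need to be pointed at the new pool before destroying anything on the old one.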
Well, it’s fixed. This morning when I checked, the resilver hadn’t done anything. Still had the three gptid entries and da7. Decided to run the -v option as it suggested, and it listed three files that were corrupted. I deleted those and did another scrub. An hour or two later I got an email at work from FreeNAS saying the scrub was done, then another saying that the notification for the degraded pool had cleared.
Just checked when I got home and everything’s happy. Dunno.
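For anyone who hits this later, a sketch of the commands behind what ended up fixing it (pool name from this thread; run as root, and only delete files you can restore from backup):

```shell
# List exactly which files are affected by the data errors
zpool status -v pool01

# ...delete (or restore from backup) the files it lists...

# Re-scrub so ZFS re-verifies everything, then clear the error counters
zpool scrub pool01
zpool clear pool01
```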