Proxmox - ZPool Upgrade Error

tldr: “zpool upgrade” failed and now I can’t get to the data in my main ZFS pool.

Despite my better judgement, I ran “zpool upgrade”. “zpool status” kept telling me that I needed to run “zpool upgrade” on my main ZFS pool. I did so, and immediately started getting some pretty major errors.

I’ve been fighting this for a few hours now, and I’m still terrified I lost all of my data.

So far:

I can boot just fine into Proxmox if the 5 drives of the ZFS pool aren’t physically inserted.

Inserting the drives causes no errors. “zpool import ZFSPool1” fails, hangs, and causes issues in Proxmox. It starts throwing “task zpool:126913 blocked for more than XXX seconds” errors every 2 minutes. ps aux doesn’t find a task ID 126913.

“zpool import -N ZFSPool1” works perfectly, but doesn’t mount any datasets (which is what the -N option is for) - it DOES show the datasets, though.

“zpool import -nF ZFSPool1” (a dry run of the -F recovery import, to check whether the pool can be repaired) ALSO starts throwing those same “task zpool:126913 blocked for more than XXX seconds” or “task txg_sync” errors.
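
For the record, the kernel usually prints a stack trace right underneath each of those “blocked for more than XXX seconds” messages, and that trace is what the ZFS folks will want to see. A rough sketch of how to capture it (the output path is just an example):

dmesg | grep -A 20 "blocked for more than"    # grabs the hung-task messages plus the stack traces printed under them
journalctl -k -b > /tmp/hung-task-log.txt     # or save the whole kernel log from the current boot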

I’m terrified I’ve lost my data again. It’s all replaceable (everything irreplaceable is safely backed up) but would take a LOT of time to regather everything.

The main glimmer of hope I have is that importing with the “-N” option shows the pool, the data, and the data usage. I just can’t touch it from Proxmox.

Proxmox “sees” my big ZFS Pool (ZFSPool1):

[image: Proxmox WebUI showing ZFSPool1]

[image: errors coming up on the server’s console]


Hi, I have no useful help (sorry) but I feel your pain, and commiserate… When the pool is imported with -N, are you able to send snapshots to another pool / replicate / zfs send -> zfs receive?

Don’t lose hope yet, the data is still very likely to be there, but make sure not to thrash around and experiment in a panic.

I’m not all that familiar with this particular scenario but I can have a look much later tonight to see what the best path forward is if someone better doesn’t come along first.

You might try posting on the mailing list. The official mailing list web frontend: Topicbox

And the proxmox forums (the anti-spam can sometimes be too vigilant on new accounts): https://forum.proxmox.com/


I’ll try that later tonight. I’m not sure.


I’m definitely not thrashing around yet. I’ve still got hope - I’ve thought I’d lost all of my data before to small/stupid mistakes that I recovered from. It looks like it’s there, but there’s something keeping me from touching it.

My intention is to carefully try fixing things…but I’m emotionally prepared to write the data off.

I’ve already posted on the Proxmox forum, but I’ll poke around the Topicbox you linked and maybe post the question in there as well.


What’s the correct protocol I should be trying for the ZFS send command?

Looks like maybe:
zfs send /ZFSPool1/storage/Pictures|/NewPool/storage/Pictures


First, try to stay read-only on the main zpool until you know what the problem actually is…

Um, what I was referring to is more a case of having 2 pools; then one can send a snapshot from one to the other.

If you don’t already have snapshots then it may be a moot point.

It is like, if one has zpool1/dataset1@snapshot1243 and also has zpool2/datasetmedia, then the command is like

zfs send zpool1/dataset1@snapshot1243 | zfs receive zpool2/datasetmedia/newlycreatedplacetosendto

It also kind of benefits from having several datasets, instead of copying one whole large lump.

I was just throwing the idea out there, as you mentioned you had partial success importing unmounted, which implies maybe read-only might work, without changing stuff, in case some major component on zpool1 is damaged.

I would simply try slight poking at little things that do not cause any, or many, writes, until the cause of the broken upgrade is discovered.
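
To make that concrete, here’s a rough sketch of the whole flow, assuming a second pool called NewPool already exists and ZFSPool1 already has a snapshot to send (the snapshot name below is just a placeholder):

zpool import -o readonly=on -N ZFSPool1    # import read-only so nothing can be written to the damaged pool
zfs list -t snapshot -r ZFSPool1           # see which snapshots already exist
zfs send ZFSPool1/storage/Pictures@somesnap | zfs receive NewPool/storage/Pictures

The receive side needs the parent dataset (NewPool/storage here) to exist first, or zfs receive will complain.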

Deffo wait on Log, as this is something he really delves into, if he has the time.

Also also, I like the name… Even Ender emulates him, with the long space trip before becoming Speaker…


Luckily, I do have multiple datasets, including a few rather small ones (<50GiB iirc).

I did some reading, and it LOOKS like zfs send works on snapshots. Like, the “zfs send” process handles sending one for me - I just need a snapshot to send.

The only issue I have here is that none of the data is visible to me when I import with the “-N” option. I see the ZFS pool mounted in the right spot, and I see the primary dataset structure (i.e., the three directories in the root of my ZFS pool). Furthermore, the Proxmox WebUI sees the ZFS pool and recognizes its 40TiB total size, with 27TiB used and 13TiB available.

I may need to try “zpool import -N ZFSPool1 && zfs mount -o ro ZFSPool1” or something like that. That way I can import the ZFS pool without mounting the filesystems (-N), which gave me some positive traction, and then use “zfs mount -o ro” to hopefully mount the datasets read-only?
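
A possibly simpler variant I’ve seen suggested (just a sketch, nothing here has been verified against this pool): make the whole pool read-only at import time, so nothing can write to it no matter what gets mounted.

zpool import -o readonly=on ZFSPool1    # pool-wide read-only; datasets mount automatically unless -N is also given
zfs mount -a                            # only needed if -N was used on the import

With readonly=on set at import, even the mounts themselves can’t change anything on disk.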

It looks like I may want to try “zpool upgrade -a” as well. If I go this route, I’d unplug all of my drives besides the 5 in question, and boot into Ubuntu on a flash drive. That’d protect the rest of my drives and my Proxmox install.

I’ll wait for Log, though. I can be patient for an issue this big.


Alright, gonna head to bed soon, but here’s some stuff I found so far.

First, we’ll want to make sure that one of your disks isn’t going bad. It appears what can sometimes happen is that a dying disk appears fine enough but takes forever to respond, which causes everything else down the line to block, I guess.

You can hopefully see what disks are in a zpool with zpool list -v. In my case, because my zpool was imported using zpool import -d /dev/disk/by-id npool, an edited output for example would be:

npool <-poolname
  mirror-0 <- The vdev name, also indicating its type
    nvme-INTEL_SSDPF2KX076TZ_PHAC140402KB7P6CGN <- The physical disk
    nvme-INTEL_SSDPF2KX076TZ_PHAC140402AN7P6CGN 
  mirror-1
    nvme-INTEL_SSDPF2KX076TZ_PHAC1404029K7P6CGN
    nvme-INTEL_SSDPF2KX076TZ_PHAC1404029J7P6CGN

Then the following ls -l /dev/disk/by-id/ | grep -v part will provide this:

lrwxrwxrwx 1 root root 13 Oct 11 12:46 nvme-INTEL_SSDPF2KX076TZ_PHAC140402KB7P6CGN -> ../../nvme2n1
lrwxrwxrwx 1 root root 13 Oct 11 12:46 nvme-INTEL_SSDPF2KX076TZ_PHAC140402AN7P6CGN -> ../../nvme9n1
lrwxrwxrwx 1 root root 13 Oct 11 12:46 nvme-INTEL_SSDPF2KX076TZ_PHAC1404029K7P6CGN -> ../../nvme6n1
lrwxrwxrwx 1 root root 13 Oct 11 12:46 nvme-INTEL_SSDPF2KX076TZ_PHAC1404029J7P6CGN -> ../../nvme1n1

So I can either check my disks with
smartctl -A /dev/disk/by-id/nvme-INTEL_SSDPF2KX076TZ_PHAC140402KB7P6CGN
or
smartctl -A /dev/nvme2n1
They do the same thing, just using different ways to point to the same disk. Labels like nvmeXnY or sdX are subject to change, and are not recommended for setting up a zpool, as things can get confused. This can be fixed by exporting the pool, and using zpool import -d /dev/disk/by-id your_pool

You likely have HDDs, so they probably show up as sda, sdb, etc., so you’d use smartctl -A /dev/sd<X>

You’d then look through the output for signs of errors showing up, though frankly the output can be very cryptic and require googling. You may also consider running a selftest on your disks with smartctl -t <short|long> /dev/sd<X>
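
If it helps, here’s a quick way to run the same checks across every disk in one go (just a sketch; sda through sde are assumptions, adjust them to match your actual five drives):

# assuming the five pool disks show up as sda through sde
for d in sda sdb sdc sdd sde; do
  echo "=== /dev/$d ==="
  smartctl -H -A /dev/$d    # -H prints the overall health verdict, -A the attribute table
done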

After that, my suspicion is potentially something in regards to the zpool import cache file stuff, but I’m unfamiliar with that for now.
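
For whenever that does get looked at, the usual first move (a sketch only, not specific advice for this pool) is to take the cache file out of the equation and let zpool scan the disks directly:

mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bak    # set the cache file aside so it can’t feed the import stale info
zpool import -d /dev/disk/by-id ZFSPool1            # scan the disks themselves instead of trusting the cache

The cache file is rebuilt on the next successful import, so moving it aside is reversible.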


And, yeah, the “smartctl -a /dev/sd” output was SUPER cryptic. This was for /dev/sda, I think, but they all gave nearly identical outputs. Nothing jumped out at me.

However, “smartctl -t short /dev/sd” gave MUCH clearer feedback, and every disk showed as passing, no failures. Here’s my output from that: Zero smart errors.

In all of the reading I’ve been doing over the last ~2.5 days, it seems more and more likely that there’s something broken in the zpool import or zfs filesystem headers or SOMETHING that’s causing this stupid stuff to happen. I just can’t diagnose or resolve it for the life of me.