ZFS fails to import a zpool on Proxmox server startup

Hello everyone, I’m a beginner with a Proxmox server mostly used for network storage, making images, and lab projects. I have two LVM pools and one ZFS zpool. I only used the ZFS pool for some images and as a network drive for VMs. I could delete it, but I’d rather try to recover what I have and learn from it.

The zpool is called ssd-pool (creative, I know); it’s three SSDs in a raidz1 config, with the cache set to write-through.

Two weeks ago we lost power, and on reboot the import service failed to import the zpool.
I am also seeing AHCI errors, so maybe I need to investigate those further.

Startup error: "Failed to start Import ZFS pool ssd\x2dpool." But zpool status shows the pool is online, and it’s mounted, even though in the defaults page I set it not to auto-mount. I also added a ZFS_POOL_IMPORT="ssd-pool" line to the defaults file.
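For reference, the line I added to the defaults file (/etc/default/zfs here) and the commands to inspect the failing unit look roughly like this (a sketch; the escaped unit name is my guess at what systemd calls the per-pool import unit, adjust if yours differs):

# /etc/default/zfs (relevant line only)
ZFS_POOL_IMPORT="ssd-pool"

# inspect the unit that fails at boot ("\x2d" is just systemd's escape for the dash)
systemctl status 'zfs-import@ssd\x2dpool.service'
journalctl -b -u 'zfs-import@ssd\x2dpool.service'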

zpool import ssd-pool results in "cannot import 'ssd-pool': a pool with that name already exists".

I noticed that the log appears to have the name wrong: the dash is replaced by "\x2d".

The ZFS cache page has the name written correctly, and zpool status shows it is online.

I’m going to try renaming the pool.

If anyone has any ideas or if I need to add more info, please let me know, many thanks!

1 Like

Have you checked the "mounted" property, in case the mount point was already grabbed by the OS before ZFS could mount the dataset to it?

By which I mean there are two components: the pool, and the filesystem on it. It looks like the pool is ONLINE, but the filesystem/dataset might not be?

If it isn’t, you could try zfs mount -a?
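Something along these lines (a sketch, assuming the pool really is called ssd-pool):

# check whether the datasets on the pool are actually mounted
zfs list -r -o name,mounted,mountpoint ssd-pool

# if they aren't, mount everything that has canmount=on
zfs mount -a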

2 Likes

OK, thanks for the suggestion. I checked using “zfs list -o mounted” and it reports yes.

I unmounted the pool and restarted, and it auto-mounted, but I looked closer at the logs and saw it’s because ZFS or Proxmox does another import using the cache file and then mounts the ZFS filesystems. It’s failing right before it does the cache import.
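In case it helps anyone else, this is roughly how I compared the two import paths (a sketch; the unit names assume the stock OpenZFS/Proxmox services):

# the per-pool unit that fails vs. the cache-file import that succeeds
systemctl --no-pager status 'zfs-import@ssd\x2dpool.service' zfs-import-cache.service zfs-mount.service
journalctl -b -u 'zfs-import@ssd\x2dpool.service'
systemctl list-units --all 'zfs-import*'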

The server keeps booting into emergency mode. I assumed it was because of the failed import, but now that I see it is importing successfully from the cache file, I’m thinking it’s the AHCI errors that are making Proxmox freak out.

1 Like

Do you have pool members connected to a SAS HBA? I recently ran into an issue on my server where the ZFS import was attempted too early, before the SAS controller had finished enumeration. Adding a 10 second delay in the systemd script for zfs import fixed the issue.
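One way to add such a delay is a systemd drop-in along these lines (a sketch; it targets zfs-import-cache.service, so adjust the unit name if your pool is imported by a different service):

# delay the cache-file import until the HBA has finished enumerating disks
mkdir -p /etc/systemd/system/zfs-import-cache.service.d
cat > /etc/systemd/system/zfs-import-cache.service.d/wait-for-hba.conf <<'EOF'
[Service]
ExecStartPre=/bin/sleep 10
EOF
systemctl daemon-reload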

3 Likes

Fugg. I was away for 7 days and my most important Proxmox server now fails to start with "Failed to import ZFS pool usb\x2ddisk". I have no USB storage devices plugged into my server and never have, so what’s wrong with it? I am really far away and don’t have a Pi-KVM, so I need to keep the debugging to a minimum, because I’m having my computer-illiterate mom enter the commands. I’m running Proxmox 7.

2 Likes

I wonder if it might have something to do with udisks2 or gvfs-udisks2?

Sorry, I never played with Proxmox.
I guess you might be setting up a jump-host Pi in the future… which doesn’t help you at all right now.

2 Likes

Uh oh, I may have to boot up mine and see if it’s still working, or whether an update is messing with ZFS and Proxmox. If I can, I’ll rig mine up today or tomorrow and see what it does.

1 Like

Now that I’ve had some time to think about it, it might be a race condition or something like that. But then again, I don’t have my root on ZFS, just on a single ext4 SSD. Even if my raidz2 pool fails to import, it shouldn’t block the main OS from booting, should it?

I need to buy a Pi-KVM and send it to my mom… Thankfully I have 1 of 3 Proxmox servers still functional, and I could flash the software onto an SD card either through that or through my mom’s laptop.

1 Like

I can’t figure this out. The systemd error I get when my system starts is:
“Failed to start import ZFS pool usb\x2ddisk”

Proxmox just hangs there and does nothing. I can import the pool just fine when I’m in rescue mode. When I’m in rescue mode, journalctl shows no errors, and neither does dmesg.

I really wish I had a Pi-KVM… Pfff… I can’t find anything online related to this issue. After I import the zpool in rescue mode, zpool status says everything is just fine.

Of note: my root is on a single SSD, an ext4 partition with a small swap and lots of unallocated space; only my storage pool is on ZFS. Also, while this is not really necessary, I delayed the root mounting in /etc/default/grub, like you would do for the "cannot mount rpool" error (when you have root on ZFS).
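For anyone hitting the same thing, these checks should show which per-pool import units are wired into boot, and they can be run from rescue mode (a sketch; I’m assuming the stock OpenZFS/Proxmox unit layout):

# which per-pool import units are enabled at boot?
ls -l /etc/systemd/system/zfs-import.target.wants/
systemctl list-unit-files 'zfs-import*'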

1 Like

Is usb\x2ddisk maybe a pool you created before, e.g. for testing purposes, on any currently or formerly attached storage media? It might be in the list of datasets that zfs-mount-generator mounts at boot, and that would be the failure message you’re receiving. These lists usually reside in the /etc/zfs/zfs-list.cache directory, with one file per pool, as far as I’m aware.
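Something like this should show whether such a list exists (a sketch; paths are from memory and may differ slightly on Proxmox):

# one file per pool; each line is a dataset plus its mount-related properties
ls -l /etc/zfs/zfs-list.cache/
cat /etc/zfs/zfs-list.cache/* 2>/dev/null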

1 Like

I have not created any other pool. I also removed the ZFS cache file from /etc while I was in rescue mode and recreated it with only my zpool imported. After a reboot, I got the same error. Maybe Proxmox 7 / Debian 11 does something? And I do not remember ever plugging a ZFS-formatted thumb drive or HDD into my server via USB…
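For completeness, recreating the cache file should boil down to something like this (a sketch; "mypool" is just a placeholder for the real pool name, and the initramfs step is only in case a stale copy of the cache is baked into it):

# regenerate /etc/zfs/zpool.cache for the one pool that actually exists
zpool set cachefile=/etc/zfs/zpool.cache mypool   # "mypool" = placeholder pool name
# rebuild the initramfs so it doesn't carry an old copy of the cache
update-initramfs -u -k all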

I just wish I knew how to debug this issue further; everything seems fine when I’m in rescue mode, and I can’t find any errors. I thought maybe it was LXC (I recalled something about it using a ZFS backend), so I just deleted my two LXC containers (they were disposable anyway).

2 Likes

I’m reviving this topic since my posts above give context for this comment. Thankfully it’s just two months old.

I randomly looked through my last working Proxmox server, and it seems I have a failed systemd startup service (actually two) related to "\x2d", called:

[email protected]
zfs-import@zastoru\x2dmini.service

zastoru is my ZFS pool name. How did systemd fail at such a simple task as importing a ZFS pool? Also, my ZFS pool seems to be working; otherwise my VMs wouldn’t be running, and I wouldn’t be able to run basic zfs commands on the pool (which I can).

I now see a lot more "\x2d"-related stuff, like:

sys-devices-virtual-block-dm\x2d0.device                                                 loaded active     plugged         /sys/devices/virtual/block/dm-0
sys-devices-virtual-block-dm\x2d1.device                                                 loaded active     plugged         /sys/devices/virtual/block/dm-1
sys-devices-virtual-block-dm\x2d2.device                                                 loaded active     plugged         /sys/devices/virtual/block/dm-2
sys-devices-virtual-block-dm\x2d3.device                                                 loaded active     plugged         /sys/devices/virtual/block/dm-3
sys-devices-virtual-block-dm\x2d4.device                                                 loaded active     plugged         /sys/devices/virtual/block/dm-4
sys-devices-virtual-block-dm\x2d5.device                                                 loaded active     plugged         /sys/devices/virtual/block/dm-5
systemd-fsck@dev-disk-by\x2duuid-5B59\x2d1AD8.service                                    loaded active     exited     File System Check on /dev/disk/by-uuid/5B59-1AD8
system-lvm2\x2dpvscan.slice                                                              loaded active     active     system-lvm2\x2dpvscan.slice
system-pve\x2dcontainer.slice                                                            loaded active     active     PVE LXC Container Slice
system-systemd\x2dfsck.slice                                                             loaded active     active     system-systemd\x2dfsck.slice
system-zfs\x2dimport.slice                                                               loaded active     active     system-zfs\x2dimport.slice

Now I’m afraid to reboot my last server too. Anyway, does anyone know what "\x2d" could be referring to?

Since I have no idea how to debug systemd without being booted into it, can anyone help me here too? I still have the USB live environment on my poor old server and can SSH into it, but if I chroot into my Proxmox installation, I cannot run systemd commands (systemctl complains because systemd is not PID 1). I can see that in /usr/lib/systemd/system there is a similar system-pve\x2dcontainer.slice and a system-systemd\x2dcryptsetup.slice (the former I can also see on my working Proxmox installation, the latter I cannot).

Since I am in the chroot anyway, I will run an apt upgrade; there are quite a lot of updates, including a kernel update (pve-kernel-5.4 → pve-kernel-5.11) and some more. Maybe, just maybe, it will fix things. I don’t have anything to lose, since I already moved all my stuff over to the last server.
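As an aside, systemctl can inspect an offline installation with --root, so no chroot (and no PID 1) is needed for that part. A minimal sketch, assuming the installation is mounted at /mnt:

# list the per-pool import units of the offline system
systemctl --root=/mnt list-unit-files 'zfs-import*'
ls -l /mnt/etc/systemd/system/zfs-import.target.wants/

# only if a stale instance actually shows up, it can also be disabled offline
systemctl --root=/mnt disable 'zfs-import@usb\x2ddisk.service'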


While I was surfing the web, I found this GitHub issue:

That got me thinking, so I kept looking:
https://manpages.debian.org/stretch/systemd/systemd.unit.5.en.html

Some unit names reflect paths existing in the file system namespace. Example: a device unit dev-sda.device refers to a device with the device node /dev/sda in the file system namespace. If this applies, a special way to escape the path name is used, so that the result is usable as part of a filename. Basically, given a path, "/" is replaced by "-", and all other characters which are not ASCII alphanumerics are replaced by C-style "\x2d" escapes (except that "_" is never replaced and "." is only replaced when it would be the first character in the escaped path).

So I might have hit a bug (or a feature™) in systemd. systemd has given me plenty of headaches, but I wasn’t a systemd hater despite them. But if it turns out that my Proxmox server failed to boot because of systemd, I might start hating it for real.
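By the way, you can see the mapping in both directions with systemd-escape:

# systemd-escape shows how a pool name maps to the unit instance name and back
systemd-escape 'zastoru-mini'            # prints: zastoru\x2dmini
systemd-escape --unescape 'usb\x2ddisk'  # prints: usb-disk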


Now, the question really remains the same. Why in the world does Proxmox think that I ever had a ZFS pool called usb-disk? So ridiculous…

On my failed server and on my working one, there is no /etc/zfs/zfs-list.cache. There is a zpool.cache, but it’s a binary file. When I try to view it in less, it spews out garbage (obviously), but cat does show some information… It looks like a conf file with my HDDs’ UUIDs (which I used to create the ZFS pool and to mount it). The file looks strange (obviously cat cannot display it correctly), but the information displayed covers only the 10 HDDs (UUIDs) from my pool, all of which work correctly, as I can confirm by running zpool status (the pool is fine, all disks show up, no errors reported).
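A less painful way to read that cache file than cat (a sketch; zdb ships with the ZFS userland tools and can decode it):

# decode the pool configuration(s) stored in the cache file
zdb -C -U /etc/zfs/zpool.cache
# or just pull the readable strings out of it and look for the phantom name
strings /etc/zfs/zpool.cache | grep -i usb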

However, I have no idea how to check whether a ZFS pool called "usb-disk" ever existed on my previous server; if I can do that through chroot, that would be awesome.
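From what I’ve read, checking should be something like this from the chroot or a rescue shell (a sketch; zpool needs /dev, /proc and /sys bind-mounted into the chroot and the zfs module loaded in the running kernel):

# list pools that are visible on the attached devices but not imported
zpool import
# optionally point the scan at a specific device directory
zpool import -d /dev/disk/by-id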

1 Like