My ZFS Pool Is Corrupted... How?

I have OpenZFS 2.1.5 running on an Ubuntu 22.04 system that I use to host a handful of virtual machines via KVM. Each VM gets its own dataset, with the VM's disk image file stored inside it.
The pool runs on a mirror of two Samsung PM983 3.84 TB NVMe SSDs.
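
For reference, the layout is roughly what the following commands would produce (the dataset name, image path and size are illustrative, not my exact setup):

# Mirrored pool on the two NVMe drives, referenced by their stable by-id names
zpool create VMSTOR mirror \
  /dev/disk/by-id/nvme-SAMSUNG_MZQLB3T8HBLS-00AAZ_S5GUNE0N121688-part3 \
  /dev/disk/by-id/nvme-SAMSUNG_MZQLB3T8HBLS-00AAZ_S5GUNE0N124416-part3

# One dataset per VM, with that VM's disk image stored inside it
zfs create VMSTOR/forum
qemu-img create -f qcow2 /VMSTOR/forum/forum.qcow2 500G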

On Monday, the largest and most active of my VMs (running a XenForo forum site) hung. I reset the VM, but it would not power back on due to the following error:

libvirtd[1914]: internal error: qemu unexpectedly closed the monitor: 2023-05-08T23:10:55.259267Z qemu-system-x86_64: -blockdev {"node-name":"libvirt-2-format","read-only":false,"discard":"unmap","driver":"qcow2","file":"libvirt-2-storage","backing":null}: Failed to get "write" lock#012Is another process using the image

No, nothing was using the image. lsof reported nothing locking the file. I ended up having to reboot the entire host to resolve this. The VM fired up fine after a host restart.
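
For anyone who runs into the same error: the kind of checks I mean look like this (the image path is a placeholder):

virsh list --all                      # is the guest really shut off?
lsof /VMSTOR/forum/forum.qcow2        # any process still holding the image open?
fuser -v /VMSTOR/forum/forum.qcow2    # same question asked a second way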
The following morning (Tuesday), I rebuilt the forum’s search index (elasticsearch) because it had not started up with the VM. It wasn’t set to autostart, so no big deal there.

Then on Tuesday evening, I started to get reports of the site again having issues. This time I could actually log in since the system was not hung, but the database was in a restart loop:

May 10 03:01:01  kernel: [98042.450194] blk_update_request: I/O error, dev vda, sector 1329356896 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
May 10 03:01:01  kernel: [98042.450802] Buffer I/O error on dev vda2, logical block 165894156, async page read
May 10 03:01:01  kernel: [98042.452217] blk_update_request: I/O error, dev vda, sector 1329356896 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
May 10 03:01:01  kernel: [98042.452892] Buffer I/O error on dev vda2, logical block 165894156, async page read
May 10 03:01:01  mariadbd[441371]: 2023-05-10  3:01:01 0 [Warning] InnoDB: 10678272 bytes should have been read. Only 2338816 bytes read. Retrying for the remaining bytes.
May 10 03:01:01  mariadbd[441371]: 2023-05-10  3:01:01 0 [Warning] InnoDB: Retry attempts for reading partial data failed.
May 10 03:01:01  mariadbd[441371]: 2023-05-10  3:01:01 0 [ERROR] InnoDB: Operating system error number 5 in a file operation.
May 10 03:01:01  mariadbd[441371]: 2023-05-10  3:01:01 0 [ERROR] InnoDB: Error number 5 means 'Input/output error'
May 10 03:01:01  mariadbd[441371]: 2023-05-10  3:01:01 0 [Note] InnoDB: Some operating system error numbers are described at https://mariadb.com/kb/en/library/operating-system-error-codes/
May 10 03:01:01  mariadbd[441371]: 2023-05-10  3:01:01 0 [ERROR] InnoDB: File (unknown): 'read' returned OS error 205. Cannot continue operation
May 10 03:01:01  mariadbd[441371]: 230510  3:01:01 [ERROR] mysqld got signal 6 ;
May 10 03:01:01  mariadbd[441371]: This could be because you hit a bug. It is also possible that this binary
May 10 03:01:01  mariadbd[441371]: or one of the libraries it was linked against is corrupt, improperly built,
May 10 03:01:01  mariadbd[441371]: or misconfigured. This error can also be caused by malfunctioning hardware.
May 10 03:01:01  mariadbd[441371]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs

That is no bueno for sure. I/O errors inside the VM. So I check the host and the ZFS pool and I see this:

zpool status -v
  pool: VMSTOR
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:16:58 with 0 errors on Sun May  7 00:46:59 2023
config:

        NAME                                                      STATE     READ WRITE CKSUM
        VMSTOR                                                    ONLINE       0     0     0
          mirror-0                                                ONLINE       0     0     0
            nvme-SAMSUNG_MZQLB3T8HBLS-00AAZ_S5GUNE0N121688-part3  ONLINE       0     0 1.58K
            nvme-SAMSUNG_MZQLB3T8HBLS-00AAZ_S5GUNE0N124416-part3  ONLINE       0     0 1.58K

There are checksum errors on both drives and data corruption in the VM’s image file. I have a backup of all the datasets via ZFS send/receive to a separate server, but the sync was behind due to the data churn from the events the day before. I have more recent snapshots on this pool, but I’m not sure when the corruption started so I’m not sure how far I need to go back.

Any thoughts on what might have caused this, how I can find out when it started, and/or how to prevent this from happening again?

SMART looks clean on these two drives, and the other VMs in the other datasets don’t seem to have any problems. I’m surprised a ZFS mirror could crater this hard. As Wendell said in the RAID video a while back, ZFS has “paranoid levels of paranoia” around data integrity.

Do you have any “Unaligned write command” entries in the logs (might be in dmesg) on the physical machine?

If you do, it might be caused by SATA power management. Run this on the physical machine:
cat /sys/class/scsi_host/host*/link_power_management_policy

It should return “max_performance” for all SATA links; if it doesn’t, you need to configure ALPM for performance instead of power saving.
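
If any link comes back with med_power_with_dipm or similar instead, something like this switches it for the running session (a udev rule or TLP setting is needed to make it persist across reboots):

echo max_performance | sudo tee /sys/class/scsi_host/host*/link_power_management_policy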

EDIT: Missed the NVMe part of your machine config; I might be barking up the wrong tree.

I’ve gone through the syslog of the host and don’t see anything that sticks out. Everything there seems to be normal.

I also ran another scrub on the pool (it runs weekly via cron) and that did nothing - repaired nothing and still shows the checksum errors.

This is a rented box, so I’m not sure exactly how the drives are connected but there may be some U.2 adapters in play. I don’t think there’s an HBA here.

As the box is rented, this should be an issue for the hosting company, IMO.

I’ve opened a ticket with them. Is there anything in particular I should ask them to do here to investigate? I’m not sure where to start other than asking if there have been any power issues recently and to tell me how the drives are connected to the server (cable, backplane, adapter card). Seems pretty easy for them to go with the “no issues here, sorry” reply.

zpool status -v (not zpool list -v) should give you the exact files that were affected and give you certainty.
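
The tail of that output has an errors section listing the affected paths; it looks something like this (the path here is made up):

errors: Permanent errors have been detected in the following files:

        /VMSTOR/forum/forum.qcow2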

The vdev and pool are still online, so this was an isolated incident on some blocks. ZFS would take the pool down entirely if this were a recurring problem, and so far it doesn’t look like one.

Your scheduled scrub that completed clean at 00:46 on May 7th suggests the corruption happened after that point. Without knowing any other facts, I’d assume any snapshot taken before that scrub will be fine.

What happened is that ZFS wrote a copy of the data plus a checksum to each side of the mirror. When that block was read back this week, neither stored copy matched its checksum. ZFS never returns data it can’t verify, so you get an I/O error, and because both copies fail verification it can’t repair the block by itself either. One side of the mirror may even hold data that is actually fine, but ZFS has no way to prove that, so it throws an I/O error rather than handing back potentially bad data.

The cause could be a thousand things. Usually it’s some hardware/firmware/cabling issue. @dali pointed out a possible problem with how the OS handles things.

I’d restore the files in question from backup. This certainly isn’t rampant corruption that is continuously happening; if it were, we would be talking about offline pools and how to import them read-only right now.

I’d monitor and scrub regularly in the coming weeks to check if it was indeed an isolated incident.

You can also reset the error counter via zpool clear.
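
For reference, that is just (pool name taken from the status output above):

zpool clear VMSTOR     # resets the READ/WRITE/CKSUM counters
zpool scrub VMSTOR     # optionally kick off a fresh scrub to see whether new errors show up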

It does. This is why you get I/O errors instead of running possibly corrupted data. And ZFS can only work with the redundancy and checksums you defined: if some bad breakout cable sends different stuff to both drives connected via that same cable, you still have a single point of failure. And you can’t always avoid this.

It’s a good idea to establish the order of events: did the VM hang because of the ZFS errors, or were those caused by the subsequent hard reboot you performed? And even if the hosting company had swapped drives, they should have told you in advance, especially if it was planned maintenance, and at the latest (in the case of an emergency repair) by the time they finished said repairs.

As it happens, I’m aware that Linus from LTT had issues with his EPYC 24-NVMe-drive monster server for video editing some time ago. He did an explanatory follow-up video about the cause of his troubles (which mentioned @wendell), and yours may have a similar root cause too.

The connection to the drives may be at fault here too, and poking at it with zfs scrub may make it worse. Best to pull the drives and image them to do the analysis. I’ve seen PCIe risers, backplanes and repeaters go bad and cause all sorts of stuff like this, especially if both NVMe drives sit on the same root port. Those drives may even share the same cable back to the mobo…

This. I’ve seen it too many times. Bad cable or even a bad connection on one drive corrupting the data stream.

I always route my mirrored drives over discrete cables back to my RAID card.

I even go as far as splitting each side of my mirrors across different controllers. And no complete mirror is connected to the same backplane.

You can’t avoid every single point of failure in a single system (which is why we need backups/remote replicas), but if you can avoid single components that bypass drive redundancy when they fail, you absolutely should.

I was talking about my home system and mirrors for my example, but professionally I would have redundant links to all storage / hosts through independent switches, all with dual PSUs and A and B power feeds shared between them.

You can’t avoid disaster, but you can delay it. :slight_smile:

Same here. Rather easy to do with mirrors, but once you deal with more complex RAID configs, things get problematic. There is certainly low-hanging fruit to reduce the risks, though.

Agree. And even ECC memory can only delay the inevitable. Murphy is gonna get you eventually…so better be prepared.

Had a janitor named Murphy plug an industrial floor polisher into an electrical outlet on a UPS. 10 minutes later, alarms and emails going off to all IT staff as a UPS went down. I got to the server room just as he was looking for another working outlet to plug into.

So, yeah. :roll_eyes:

Yeah, this makes sense, thank you for the explanation. I would have hoped that with such an error only one of the drives would hold bad data, so that the checksum would still be correct on one side of the mirror and could be used to rebuild the block. But both copies went bad somewhere, probably due to some of the reasons mentioned in this thread.

Agreed. I wish I could narrow down the time when it happened so I can limit the data loss. I’ll probably need to experiment with a few of the snapshots I do have and see if I can find one where the database starts up correctly.
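
My rough plan for that, sketched out (snapshot and clone names are placeholders):

# See which snapshots of the forum VM's dataset are still around
zfs list -t snapshot -r VMSTOR/forum

# Clone a candidate snapshot so the live dataset stays untouched
zfs clone VMSTOR/forum@autosnap_2023-05-09 VMSTOR/forum-test

# Boot a throwaway VM off the cloned image, check whether MariaDB starts
# cleanly, then discard the clone
zfs destroy VMSTOR/forum-test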

Well, honestly, it might be for the best if I can get it to fail again, because that would help zero in on the actual issue and might get it fixed. Right now this could just be a cosmic ray event that can’t be blamed on the hardware.

The hosting provider has requested to power down the server so they can pull it and have a look, so that’s some progress I suppose. Thanks for all the ideas and suggestions so far!

ZFS certainly doesn’t think so; a stray flip affecting only one copy would have been repaired on the fly or during a scrub. Without knowing the cause or precise impact, I’m not sure even a 3-way mirror or RAIDZ2 could have avoided this. Certainly not your fault.

zfs receive -c (corrective receive) can use a snapshot to heal only the corrupted data on the target. Maybe this works for you and there is no need to roll back the entire pool or dataset. LINK to the relevant man page

Make sure the snapshot in question is healthy.
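
A rough sketch of how the corrective receive could look, run from the backup box (pool/dataset/snapshot and host names are placeholders, and check the zfs-receive man page that your installed OpenZFS version actually has -c, since corrective receive is a fairly recent addition):

# Send a known-good snapshot that also still exists on the damaged pool,
# letting the receiving side rewrite only the blocks that fail their checksums
zfs send backuppool/forum@before-corruption | \
    ssh vmhost zfs receive -c VMSTOR/forum@before-corruption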

Here is a recent zpool events -v entry after starting the VM back up. I’m not sure what this means:

May 10 2023 20:21:08.405065262 ereport.fs.zfs.data
        class = "ereport.fs.zfs.data"
        ena = 0x52e4691d2703c01
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0xd1186eb8422828d7
        (end detector)
        pool = "VMSTOR"
        pool_guid = 0xd1186eb8422828d7
        pool_state = 0x0
        pool_context = 0x0
        pool_failmode = "wait"
        zio_err = 0x34
        zio_flags = 0x808881
        zio_stage = 0x1000000
        zio_pipeline = 0x1080000
        zio_delay = 0x0
        zio_timestamp = 0x0
        zio_delta = 0x0
        zio_priority = 0x0
        zio_objset = 0x111
        zio_object = 0x2
        zio_level = 0x0
        zio_blkid = 0x1e69e95
        time = 0x645bfcb4 0x1824ce2e
        eid = 0x7f

I was under the impression that ZFS checksums the data on every read. If a block is corrupt, it will check any other provider for a redundant copy, and if that copy matches, it serves the good copy and overwrites the bad one.
On the fly.
If both / all copies are bad, then it refuses to serve the data and shows the corrupt file error?

Regular scrubs detect (and repair) errors regularly

Apologies, I misread this as saying ZFS doesn’t do that.

https://manpages.debian.org/testing/zfsutils-linux/zfs-events.5.en.html

The errno for a failure when handling a given I/O. The errno is compatible with errno(3) with the value for EBADE (0x34) used to indicate ZFS checksum error.

I’m not an expert in debugging and error codes, but this moonshot Google search seemed to match.
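
For anyone else decoding these: 0x34 is 52 in decimal, which maps to EBADE on Linux, and a quick one-liner confirms it:

python3 -c 'import errno, os; print(errno.errorcode[0x34], os.strerror(0x34))'
# prints: EBADE Invalid exchange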

This is a recent event. It probably doesn’t report the time the checksum errors occurred (they occurred much earlier) but rather why there was an I/O error at this moment. We already know ZFS refuses to hand out bad data, and you presumably attempted to read that block/file at that time.

Yes exactly. This was my conclusion as well.

The hosting provider replaced the PSU and the NVMe adapters on the server. I’m not sure they had any concrete evidence beyond what I provided them, but at least that is something.

I believe I have located the first instance of the error. The VM was in the middle of writing to the InnoDB log and encountered a write error:

May 10 02:30:56  kernel: [96236.573358] blk_update_request: I/O error, dev vda, sector 1329356896 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
May 10 02:30:56  mariadbd[1185]: 2023-05-10  2:30:56 1448055 [Warning] InnoDB: 3584 bytes should have been written. Only 1536 bytes written. Retrying for the remaining bytes.
May 10 02:30:56  mariadbd[1185]: 2023-05-10  2:30:56 1448055 [Warning] InnoDB: Retry attempts for writing partial data failed.
May 10 02:30:56  mariadbd[1185]: 2023-05-10  2:30:56 1448055 [ERROR] [FATAL] InnoDB: write("ib_logfile0") returned I/O error
May 10 02:30:56  mariadbd[1185]: 230510  2:30:56 [ERROR] mysqld got signal 6 ;
May 10 02:30:56  mariadbd[1185]: This could be because you hit a bug. It is also possible that this binary
May 10 02:30:56  mariadbd[1185]: or one of the libraries it was linked against is corrupt, improperly built,
May 10 02:30:56  mariadbd[1185]: or misconfigured. This error can also be caused by malfunctioning hardware.
May 10 02:30:56  mariadbd[1185]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
May 10 02:30:56  mariadbd[1185]: We will try our best to scrape up some info that will hopefully help
May 10 02:30:56  mariadbd[1185]: diagnose the problem, but since we have already crashed,
May 10 02:30:56  mariadbd[1185]: something is definitely wrong and this may fail.
May 10 02:30:56  mariadbd[1185]: Server version: 10.10.2-MariaDB-1:10.10.2+maria~ubu2204

This makes me wonder if it’s possible to clear the error in ZFS and let MariaDB attempt to recover the data itself (since right now ZFS is not even allowing the read to go through). This might be better than restoring from backup as far as data loss, but there may still be something corrupt left over here.
The way it is behaving (the database is stuck in a restart loop) seems to point to one specific sector/block in a single file from a single transaction that was corrupted, so the damage seems mostly contained.

UPDATE:

I have gotten things back and functional as far as the VM and the forum site. I found a post by mercenary_sysadmin on Reddit that mentions you can use dd to copy file contents while skipping the I/O errors and replacing the bad bits with zeroes:

https://www.reddit.com/r/zfs/comments/11wqmlr/data_corruption_and_accessing_files/jd1v0pg/

I identified the file causing the problem in the VM. It was the ib_logfile0 file. That’s pretty important for InnoDB, but MariaDB could maybe recover it if it’s not totally mangled… so I tried it.

dd if=ib_logfile0.bak bs=4k conv=sync,noerror of=./ib_logfile0

dd: error reading 'ib_logfile0.bak': Input/output error
260108+0 records in
260108+0 records out
1065402368 bytes (1.1 GB, 1016 MiB) copied, 1.61114 s, 661 MB/s
dd: error reading 'ib_logfile0.bak': Input/output error
260108+1 records in
260109+0 records out
1065406464 bytes (1.1 GB, 1016 MiB) copied, 1.61177 s, 661 MB/s
dd: error reading 'ib_logfile0.bak': Input/output error
260108+2 records in
260110+0 records out
1065410560 bytes (1.1 GB, 1016 MiB) copied, 1.61249 s, 661 MB/s
dd: error reading 'ib_logfile0.bak': Input/output error
260108+3 records in
260111+0 records out
1065414656 bytes (1.1 GB, 1016 MiB) copied, 1.61309 s, 660 MB/s
262140+4 records in
262144+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.88976 s, 568 MB/s

Honestly, that looks pretty good: only a few read errors, and only a handful of zeroed-out blocks in the copy. Once I copied this file out and renamed things, the database fired right up. So now things are running, and crucially, the CKSUM counter is not incrementing anymore.

Now I need to figure out what needs to be done to get the zpool errors cleared entirely. Since from the host/ZFS perspective, the error exists inside a single VM image file, I may still need to copy the entire VM image file out or do a manual data migration to a new VM or something like that. But at least now the website is up and I have some time to consider my options without unhappy users. :slight_smile:
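
From what I’ve read, once every file referencing the bad blocks has been overwritten or deleted (and any snapshots still pinning those blocks are destroyed), the sequence should be roughly this; I still need to confirm it on this pool:

zpool clear VMSTOR         # reset the error counters
zpool scrub VMSTOR         # re-verify; entries for freed blocks should drop off
zpool status -v VMSTOR     # confirm the permanent error list is finally empty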
