Failed data corruption experiment

The goal: simulate data corruption on a virtual block device to demonstrate the data integrity features of ZFS.

The setup: Ubuntu 20.04 LTS with ZFS installed.

zpool create tank mirror disk0 disk1
echo "gibberish" > test.txt
md5sum test.txt
…note the checksum for later
zfs snap tank@snap
zdb -ddddd tank@snap
…reveals the DVA (vdev:offset:size) of the block where our test file lives, something like 0:1234567:8910
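A quick note on why the dd math below adds 0x400000: the first 4 MiB of each vdev hold the two front labels plus the boot block reserve, and DVA offsets count from just past that area. A sanity check of the sector arithmetic, using the placeholder offset above (the real value from zdb will differ):

# placeholder DVA offset + 4 MiB label/boot area, converted to a 512-byte sector number
echo $(( (0x1234567 / 512) + (0x400000 / 512) ))   # this is the number fed to dd's seek/skip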

dd if=/dev/disk0.img bs=512 skip=$(((0x1234567 / 512) + (0x400000 / 512))) count=1 | hexdump -C
…shows the cleartext of our test.txt file

…This should be the line that corrupts the file

dd if=/dev/urandom of=/dev/disk0.img bs=512 seek=$(((0x1234567 / 512) + (0x400000 / 512))) count=1

…looks like that did it; let's do it to the other disk

dd if=/dev/urandom of=/dev/disk1.img bs=512 seek=$(((0x1234567 / 512) + (0x400000 / 512))) count=1

…okay now we check the img files

dd if=/dev/disk0.img bs=512 skip=$(((0x1234567 / 512) + (0x400000 / 512))) count=1 | hexdump -C

… that did it! Let's run the checksum.
IT'S THE SAME!
HOW?
HOW DO YOU CORRUPT A SECTOR ON A DISK AND NOT DAMAGE THE DATA THAT'S ON IT!?

zpool scrub tank
zpool status tank

…DEGRADED
permanent errors on test.txt

cat test.txt
…reveals the original text of the file

md5sum test.txt
… checksum unchanged

Conclusion: either ZFS detected the corrupted data but fixed it before I could calculate the sum, or Linux is doing something in the background that I am not aware of, or I'm missing something obvious that I can't figure out. Can anybody with the appropriate experience point me in the right direction?

If you want to see the basis for what I'm doing, check out this awesome article by Bryan Erlich, "Causing ZFS corruption for fun and profit (and quality assurance purposes)", hosted by Datto engineering.

Did you reboot between all these steps? Maybe try the test again with a file larger than your system RAM? I'm thinking cache may have affected your results.
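If you go that route, something like this makes a test file too big to sit comfortably in cache (size and path are illustrative; scale the count to your RAM):

# create a test file larger than RAM so later reads have to come from disk
dd if=/dev/urandom of=/tank/bigfile.bin bs=1M count=40960   # ~40 GiB
md5sum /tank/bigfile.bin                                    # baseline checksum before corrupting anything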

Hi & welcome.

Link for anyone not wanting to search engine their way: https://datto.engineering/post/causing-zfs-corruption

ZFS is awesome and, in some regards, scary.
As an active user on FreeBSD who has read e.g. FreeBSD Mastery: Advanced ZFS and similar books, I can tell you: damn, there is so much one can learn.

What ZFS (same on Linux) does when you read a file (via md5sum or any other filesystem-level access): the kernel fetches the metadata it holds for the file and verifies that everything checks out before handing the data to you, silently fixing things in the background (silent for the program accessing it).
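If that silent self-healing is what happened here, it usually leaves a visible trace: the CKSUM column in zpool status ticks up on the device that held the bad copy. Worth checking right after the md5sum, before any scrub:

zpool status -v tank    # a non-zero CKSUM count on one mirror leg would mean a normal read
                        # already found and repaired a bad copy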

Maybe something in the background was accessing the file while you were doing your testing, or caching was involved, which allowed ZFS to fix it. I only saw you mention verifying with hexdump from disk0.img, not from disk1.img; that might be something to check as well.

What definitely helps is to export the pool before altering the raw data and re-import it afterwards, to emulate "offline" bit rot and keep the ZFS logic from touching the data while you work on it.
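A rough sketch of that workflow (the path and the DVA offset are placeholders; the offset math is the same as in the original post):

zpool export tank                            # nothing touches the vdevs while we tamper with them
SECTOR=$(( (0x1234567 + 0x400000) / 512 ))   # placeholder DVA offset; use the one zdb reported
dd if=/dev/urandom of=/path/to/disk0.img bs=512 seek=$SECTOR count=1 conv=notrunc
# conv=notrunc matters when the target is a regular file: without it, dd truncates the
# image right after the written block, leaving the vdev file far smaller than ZFS expects.
zpool import tank
zpool scrub tank
zpool status -v tank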


A quick update to my experiment: I tried attaching the image files to block devices via loop, but when I alter the block data and remount the drives, the image files shrink below the 64 MB minimum ZFS requires to import them. I'll keep working on it and let you know, but my brain is just mush at this point.

FYI, I am not a professional coder or systems administrator; I'm just a guy who likes to try and make computers do interesting things. Think non-criminal hacker.

TL;DR: I'm creating and describing a scenario where the computer says that your data is fine, but the human eye can plainly see that it's not.

So I finally figured out what I was doing wrong and have since updated my testing methodology. I'll be brief.

Step 1: create two 100 MB img files and attach them to two separate loopback devices.
Step 2: create a mirrored pool from the two loop devices (sketched below).
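One way to do those two steps (the file paths are illustrative, and losetup assigns whatever loop numbers are free; I'm using the variables to stay honest about that):

truncate -s 100M /tmp/disk0.img /tmp/disk1.img    # two sparse 100 MB backing files
LOOP0=$(losetup -f --show /tmp/disk0.img)         # e.g. /dev/loop12
LOOP1=$(losetup -f --show /tmp/disk1.img)         # e.g. /dev/loop13
zpool create pool mirror "$LOOP0" "$LOOP1"
echo "gibberish" > /pool/test.txt
md5sum /pool/test.txt                             # note the checksum for later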

# zdb -ddddd pool

This reveals the metadata for the pool. Extract the indirect block address (the DVA offset) to find where the file starts.

zpool export pool

This ensures that ZFS can't interfere while we are manipulating the block devices in the vdev.

dd if=/dev/urandom of=(location of the loop device) bs=512 seek=$(((0x(indirect block) / 512) + (0x400000 / 512))) count=1

This overwrites the block device at the location of the file we want to corrupt.
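Spelled out with a made-up DVA offset (0x40a000 here; use whatever zdb actually reports) and one of the loop devices from the setup sketch above:

DVA_OFFSET=0x40a000                            # hypothetical value, not from my pool
SECTOR=$(( (DVA_OFFSET + 0x400000) / 512 ))    # skip the 4 MiB label/boot area, convert to 512-byte sectors
dd if=/dev/urandom of=/dev/loop12 bs=512 seek=$SECTOR count=1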

dd if=(same loop device I broke on purpose) bs=512 skip=$(((0x(indirect block) / 512) + (0x400000 / 512))) count=1 | hexdump -C

that did it

Reading the other device shows the data is intact, as expected.

zpool import pool
zpool status -v pool

this shows that everything is fine

md5sum textfile

checksum matches

dd if=(same loop device I broke on purpose) bs=512 skip=$(((0x(indirect block) / 512) + (0x400000 / 512))) count=1 | hexdump -C

The data is fine.
ZFS recovered the bad sector the way it was supposed to.

Now, if we repeat this procedure but corrupt that same sector on BOTH disks…
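That repeat looks roughly like this (same placeholder sector as above, both loop devices this time; the pool is called tank here, matching the status output that follows):

zpool export tank
for dev in /dev/loop12 /dev/loop13; do
    dd if=/dev/urandom of=$dev bs=512 seek=$SECTOR count=1   # clobber the same sector on both mirror halves
done
zpool import tank
zpool scrub tank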

zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: {LINK REDACTED}
  scan: scrub repaired 0B in 0 days 00:00:00 with 1 errors on Sun Jan 24 11:14:40 2021
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            loop12  DEGRADED     0     0     4  too many errors
            loop13  DEGRADED     0     0     4  too many errors

errors: Permanent errors have been detected in the following files:

        /tank/test.txt

# md5sum /tank/test.txt
md5sum: /tank/test.txt: Input/output error

We did it, gang. We proved that if a sector becomes overwritten on every disk in a mirrored array, the pool is borked. At least ZFS knows it's borked, but that really doesn't help us.

What are the odds that the same sector is gonna go bad on two disks on the same mirror before you can resilver another drive? I’d call that a statistical no-no.

unless…

1. a 'bug' is introduced into a system-level process
2. the system process randomly writes corrupted data onto sectors of the system's block devices
3. the 'bug' goes unnoticed for years
4. the corruption reaches critical mass
5. we try to restore from backup
6. the sysadmin restores permanently corrupted data
7. tries an older backup
8. oh-crap-we're-screwed-dotcom

Wait, it didn't serve up data that doesn't match its checksum, did it?

If a file is bad, the pool should not serve it and should use a mirror/parity copy instead. If that is also bad, it should show an unrecoverable error and simply not serve the file.

Did it serve bad data, or no data?

I’d say the best way to test ZFS corruption detection/repair is to use a virtual machine, or machine with removable media so that you can corrupt the blocks off-line, then import the disk back and see what it does with it.

Whilst double-disk corruption in the same sector on two disks is rare, total double-disk failure, or controller failure, or cable dodginess is not :slight_smile:

You are correct. ZFS did not serve any data because the sector holding the start of the file was corrupted on both disks. If the data on the drive(s) doesn’t match the checksum, it will throw an error.


And during a scrub it should throw an error, reporting an unrecoverable file.
The errors can be a bit hard to clear if the file is referenced by several snapshots.
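A rough sequence for getting rid of the permanent-error entry once the damaged file has been dealt with (the snapshot name is a placeholder; every snapshot that still references the bad blocks has to go):

rm /tank/test.txt          # or restore a good copy from backup
zfs destroy tank@snap      # repeat for any snapshot still holding the damaged blocks
zpool scrub tank           # let ZFS re-verify; the error list usually clears after a scrub or two
zpool clear tank           # reset the READ/WRITE/CKSUM counters
zpool status -v tank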

Have you come across Jim Salter's Ars article about checksummed COW filesystems? Or the L1 videos about RAID being obsolete because of them?

Jim Salter's article about silent corruption:

And a video referencing one-sided file corruption (not double-sided): RAID Obsolete? Part 2: Failure Testing Linux's RAID: md, h/w & BTRFS - YouTube


I just finished the Ars Technica article that Trooper posted, and I have seen the video posted on the L1T YouTube channel. I'm not as skilled a programmer as Wendel, but I wanted to use the skills that I do have, combined with other research, to demonstrate ZFS data integrity features working properly in real time versus a simple RAID array.

Since we are sharing so much educational content, I would like to point you to a fantastic ZFS training video that helped me out a bunch when learning ZFS.
Open ZFS Boot Camp - Linda Kateley 1:42:26


She also split that up into 8 “easier” to consume parts

I completely agree; there are other risks to consider. I would be curious to figure out whether one can spread drives across different host bus adapters so that each vdev isn't dependent on a single HBA card, like so:

        HBA1    HBA2    HBA3
VDEV1   disk1   disk2   disk3
VDEV2   disk4   disk5   disk6
VDEV3   disk7   disk8   disk9

as opposed to

HBA1    HBA2    HBA3
VDEV1   VDEV2   VDEV3
disk1   disk4   disk7
disk2   disk5   disk8
disk3   disk6   disk9

In both scenarios we would stripe across the 3 vdevs, and each vdev would be a raidz1, but in the first scenario a failure of a single HBA cannot break the entire pool.
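As a concrete sketch of the first layout (device names are purely illustrative; in practice /dev/disk/by-path/ names tell you which controller a disk sits behind):

# each raidz1 vdev takes exactly one disk from each HBA, so losing one HBA
# only degrades every vdev by one disk instead of killing a whole vdev
zpool create tank \
  raidz1 hba1-disk1 hba2-disk2 hba3-disk3 \
  raidz1 hba1-disk4 hba2-disk5 hba3-disk6 \
  raidz1 hba1-disk7 hba2-disk8 hba3-disk9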

Thoughts?

That’s not a terrible idea in theory, and is used in the enterprise; SAS dual port drives are designed to be attached to two controllers in a similar fashion.

A bit overkill for me. If one of the HBAs on my system goes, it'll take out a whole array cleanly and degrade another until I replace the card. On second thought, I have one array on one card, and the other card runs a few arrays through a multiplier, so if either HBA goes, the arrays would be entirely removed but still stable once the HBA is replaced.

You can most definitely do that, and if you have enough controllers it is very much worth doing so.

For a relatively small deployment (e.g., home, small office) it's not as critical, as controller failures don't generally corrupt stuff; they just break the array/pool (in which case, assuming you're using drive GUIDs rather than device/enumeration identifiers to add drives to your pool, just plug them into another port and all is good). But if it is a mission-critical, business-facing storage system, spreading your vdevs so that a single controller death won't kill the pool is a good idea.

edit:
Actually, I think ZFS will find drives belonging to a pool however you plug them in or however they're labelled… so controller redundancy is more about availability than about protection from corruption.
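For what it's worth, even if a pool was built with bare /dev/sdX names and a controller swap shuffled them around, re-importing by stable IDs usually sorts it out (the pool name here is just an example):

zpool export tank
zpool import -d /dev/disk/by-id tank   # scan stable by-id paths instead of whatever /dev/sdX happens to be today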
