Injecting Errors on Purpose To Test RAID

Hi Everyone,

I am putting together the next iteration of my storage/media server for my home. I just watched Wendell’s “Hardware Raid is Dead and is a Bad Idea in 2022” video. I would like to test my Areca ARC-1226 RAID controller and see how it handles errors, but I’m not sure how to go about it. How do I “inject” an error into a drive? I cannot think of any way of testing this without the drive or array treating my “injection” as legitimate data written to the drive/array. Anyone have any thoughts?

Thanks.
-Brian


I wouldn’t test drives by injecting an error in the manner shown in the video; I think the case it reflects is a little contrived, but it did illustrate the point being made.

With AF (Advanced Format) drives it is very rare for the drive to ever hand wrong/bad data to the controller; if it comes down to it, the drive will simply refuse to read a given block. It is far more likely that EMI or a bad connector contact is the source of error (and software itself is an even more likely cause).
The way I’d test for this case is to get a near-field probe and drive it with an amplified function generator. Before applying it to the SAS/SATA cable I’d check the setup on an o’scope first to make sure we’re not producing too large a voltage. I’d probably pick a dithering waveform with a 150 MHz center frequency and only use short chirps.
You could also try jiggling the SAS/SATA connector around while the array is running, to see if you can induce a bad contact during writes.

Another common failure mode for arrays is power loss or a sudden panic, but this can be hard to test for since in that case the corruption can be caused by software rather than hardware.

Uhm… that seems backwards?
Just connect one of the drives to a standard AHCI controller and write random bytes to a few sectors.
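Something like this would do it – a minimal sketch, assuming the pulled drive shows up as /dev/sdX on the other controller (the device path and offsets are placeholders, and this is destructive, so triple-check the device node):

# Overwrite a few 4 KiB sectors with random data on a drive attached to a
# plain AHCI/HBA controller, NOT through the RAID card.
# /dev/sdX and OFFSETS are placeholders -- this is destructive!
import os

DEV = "/dev/sdX"
SECTOR = 4096
OFFSETS = [1 * 1024**3, 10 * 1024**3]      # byte offsets to corrupt

fd = os.open(DEV, os.O_WRONLY)
try:
    for offset in OFFSETS:
        os.lseek(fd, offset, os.SEEK_SET)
        os.write(fd, os.urandom(SECTOR))   # the drive sees a perfectly normal write
    os.fsync(fd)
finally:
    os.close(fd)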


What I am trying to test is how the Areca ARC-1226 card reacts to errors. I have been in contact with Areca but have not gotten any definitive answers about how the card handles different scenarios, so I am trying to figure things out for myself.

What concerns me about Wendell’s video is that if you purposely write random data to a drive, the drive will treat it as legitimate data and will not report any errors. I understand it is not correct data from a filesystem point of view, but it is from a hardware point of view. I can’t think of any way of writing bad data to a drive sector without the drive automatically updating that sector’s internal checksum/ECC.

@twin_savage
I don’t think I have any test equipment that can generate a strong enough signal to alter the SAS signalling via radiation. (SAS uses twisted pairs, and the protocol uses CRC and retransmission to combat transmission errors.)

The idea of wiggling the connector is valid, but I feel it would yield unrepeatable results. It is still an interesting idea, though, so I will try it just to see what happens.

@diizzy
I tried writing some random data to the middle of the drive using a separate HBA, but when I put the drive back in the RAID array, the controller marked it as failed instantly. I find it hard to believe that within one second it could have scanned its way to the middle of the drive and determined something was altered. I must have unintentionally made other alterations to the drive.

Thank you for the ideas.
-Brian

This is 100% true. However, for that exact Areca controller, I’ve found cases where a single 4 KiB sector write was apparently missed and the later drive consistency check made it worse. It could have been a connection or cable issue, but my hunch is a firmware bug.

Based on my Areca testing it does not seem to do read-verify, but a later patrol read may catch the error. You’d need RAID 6 to recover from that situation, and it is unclear to me whether the driver is written in such a way as to do that. The controller seems to handle the straightforward cases well: a drive reports an error, a drive times out, or a drive is missing entirely. In the edge cases I’ve dealt with… you have to be really unlucky for the “wrong” 4k sector corruption to lead to a cascade of failures.

Just the other day, in fact:

status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 132K in 05:52:47 with 0 errors on Sat Apr 20 20:01:45 2024
config:

        NAME                                   STATE     READ WRITE CKSUM
        ssdpool                                ONLINE       0     0     0
          raidz1-0                             ONLINE       0     0     0
            nvme-SAMSUNG_MZQLB7T6HMLA-1-part3  ONLINE       2     0     0
            nvme-SAMSUNG_MZQLB7T6HMLA-2-part3  ONLINE       0     0     1
            nvme-SAMSUNG_MZQLB7T6HMLA-3-part3  ONLINE       0     0     0

errors: No known data errors

This is 100% caused by the NVMe backplane and/or cables being the slightest bit sketchy, which has slowly been getting worse over the last few months. We have already rotated drives and picked up a hot spare. This type of “well, it’s probably the backplane” failure is rarely, if ever, handled well by hardware RAID controllers. Patrol read is supposed to catch this type of error the way that ZFS does; HOWEVER, the reality is that it doesn’t really do much to correct the error… it corrects the inconsistency, perhaps clobbering the data in the process. Linux MD assumes that the data is correct but the parity is not, too, fwiw. ZFS, by contrast, knows exactly which block is incorrect and fixes it.
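A toy illustration of that “which one is wrong?” problem – just XOR parity over a few random byte strings, nothing Areca- or ZFS-specific:

# Toy single-parity stripe: the parity can tell you the stripe is inconsistent,
# but not WHICH member went bad, so a naive "repair" just recomputes parity
# over the (corrupted) data and bakes the damage in.
import os
from functools import reduce

def xor(chunks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

data = [os.urandom(16) for _ in range(3)]   # three data chunks
parity = xor(data)                          # P = D0 ^ D1 ^ D2

data[1] = os.urandom(16)                    # silent corruption: the drive reports no error

print(xor(data) == parity)   # False: inconsistent, but D0/D1/D2/P are all equally suspect
parity = xor(data)           # a consistency "fix" that trusts the data and rewrites parity...
print(xor(data) == parity)   # True: consistent again, yet the data is still wrong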

If you happen to be using 520-byte sectors instead of 512-byte… your RAID card also knows where the inconsistency is, very much like ZFS, and can fix it that way. But with normal 4k or 512-byte sectors? You rely on the drive to know, 100% of the time.

And if you hit a software bug or inconsistency or, as in the above, just bad cables? Sorry about your data.

Testing

I would suggest zeroing the whole drive/array, creating a 1-megabyte file filled with a test pattern, and then scanning each drive stand-alone to see how the controller broke the test pattern across the drives. Then corrupt some part of it – no more than 4 KiB, even just zeroing ONE 4 KiB block on ONE drive – and see what happens when reading back that 1-megabyte file.
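A rough sketch of that workflow, assuming the array’s filesystem is mounted at /mnt/array and a member drive shows up stand-alone as /dev/sdX (both placeholders):

# Fill a 1 MiB file with a recognizable pattern, then scan a member drive
# stand-alone to see where the controller put the chunks.
import os

PATTERN = b"RAIDTESTPATTERN_" * 256        # 16 bytes * 256 = one 4 KiB block
MIB = 1024 * 1024

# 1. Put the pattern on the array.
with open("/mnt/array/testfile.bin", "wb") as f:
    for _ in range(MIB // len(PATTERN)):
        f.write(PATTERN)
    f.flush()
    os.fsync(f.fileno())

# 2. Later, with one member drive on a plain HBA, find where the chunks landed.
def find_pattern(dev, limit=64 * 1024**3):
    hits, offset = [], 0
    with open(dev, "rb") as d:
        while offset < limit:
            chunk = d.read(4 * MIB)
            if not chunk:
                break
            pos = chunk.find(PATTERN[:16])
            if pos != -1:
                hits.append(offset + pos)   # byte offset of a pattern chunk
            offset += len(chunk)
    return hits

# print(find_pattern("/dev/sdX"))
# 3. Zero ONE 4 KiB block at one of those offsets, reassemble the array,
#    and read the 1 MiB file back to see how the controller reacts.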


Well, things didn’t go as I hoped.

For my first test, I set up a standard RAID6 with a small partition and tossed a ZIP file into it that filled the entire partition. I then took one drive out, hooked it up to another system via a simple HBA controller, and overwrote one byte in the file. I then put the “modified” drive back in the RAID. At boot-up the RAID didn’t detect any issues, which is what I would expect. When I copied the entire file off of the RAID, no issues were detected. (Again, this is expected behavior because I didn’t think the RAID controller was checking reads against the parity info.) Extracting the ZIP failed because it had been corrupted. (Expected behavior, and one of the reasons I am using a ZIP file.) I then ran a scrub of the array, which detected the error and corrected it. I could now unzip the ZIP file completely.

Next test…
I reformatted the drives to 4104-byte sectors (4K-native drives) and created a new RAID. The drive status now indicates “T-10 Protection Information: Enabled (Type 1/ChkMask 7)”. I did the same setup as before: created a small partition and copied in a ZIP file that filled the entire partition. I then took one drive out and modified it on an HBA like before. (At this point, I am assuming that the extra T-10 info will not be updated – isn’t that the job of the RAID controller?) I overwrote one byte in the ZIP file and placed the drive back into the RAID array. I copied the file off the array and got no errors. (This isn’t what I expected.) Again the ZIP file would not extract, as it was corrupt. I then scrubbed the array, it corrected everything, and the ZIP file worked again after the scrub.

Now I feel dejected. I was hoping that by enabling T-10 PI, the RAID controller would at least notify me that something was wrong, but alas, that is not the reality. I guess T-10 PI is just another method of verification during scrubbing and is not used in real time.

Maybe I screwed up my testing methodology. If someone can point out where, I would be grateful.

I guess I have to think things over again. Every day you learn something new.

-Brian


Both of these cases are expected behavior.
The reason the second case is expected is that the HBA-connected drive being written to outside of the array is having the correct T10 PI/DIF data written to it during the “injection”, so there are no problems as far as the drive is concerned. The T-10 DIF info only helps verify that LBAs aren’t corrupted after they have been written; and since they were correctly written via the HBA, the RAID does not sense a problem during the initial read (a scrub would be required).
I think you were expecting LBAs to be consistent with each other to form the filesystem, but the DIF info only helps ensure each LBA is consistent with itself.
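To put a finer point on it: with Type 1 PI each sector carries 8 extra bytes of protection information – a 2-byte guard tag (a CRC-16 over the sector’s data), a 2-byte application tag, and a 4-byte reference tag derived from the LBA. A rough sketch of the guard-tag math (the T10-DIF CRC-16 polynomial is 0x8BB7) shows why it can’t catch your injection: the tag only ties a sector’s data to itself, not to the parity or the filesystem above it.

# Sketch of the T10 DIF guard-tag CRC (poly 0x8BB7, MSB-first, no reflection,
# zero init). It only proves a sector matches its own protection information;
# it cannot flag a sector that was cleanly rewritten with different data.
def crc16_t10_dif(data: bytes) -> int:
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

sector = b"\x00" * 4096          # one 4 KiB sector's worth of data
guard = crc16_t10_dif(sector)    # the 2-byte guard tag stored in the PI field
print(f"guard tag: 0x{guard:04X}")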

If you wanted to make the extra effort to check that files are consistent upon every read, you’d want a filesystem like ReFS/ZFS (I’m not going to recommend Butter-FS or even Better-FS for parity RAIDs), or maybe even dm-integrity.
These more advanced/complicated filesystems aren’t all pros though; they have cons as well, and depending on the situation they could be a bad fit.

Tangent, medium-warm take: T10 DIF fell out of favor when Advanced Format drives came out. DIF became redundant and an inefficient use of space because of its weaker error-correcting capabilities compared to the ECC baked into AF drives.

EDIT:
An interesting experiment you could do to prove PI/DIF is working in real time is to set your HBA to a different PI/DIF type than your RAID card is set to before doing the injection write. Then again, maybe the DIF type is encoded in the drive format itself; I can’t recall.

I guess I misunderstood how T-10 Protection really worked. I assumed the RAID card wrote an extra checksum after the data in the DIF. Then when that sector was accessed, the RAID card would verify the extra checksum and report if things did not agree. I also assumed that writing to the drive outside of the RAID controller would update the data but cause a checksum mismatch for that sector since the HBA didn’t support T-10 DIF. (Or so I thought.)

I have played around with ZFS but have been unable to get any kind of “reasonable” performance from it. That is one of the reasons I was looking into going back to hardware RAID.

-Brian

As an experiment, I thought I’d check out how my method of corrupting data would be handled by ZFS. I replaced my RAID controller with an HBA, swapped boot drives, and installed TrueNAS SCALE.

I did the same procedure as before. Created a ZFS RAID-Z2, put my test ZIP file on it, removed one drive from the array, corrupted the ZIP and then placed it back into the array.

When I booted up the array, it didn’t know anything was wrong. (As expected.) I copied the ZIP file off, and the array still said everything was okay. I extracted the ZIP file and that completed successfully. (I’m assuming that ZFS fixed the data on the fly.) However, ZFS still did not say that there were any problems. I then did a scrub of the pool, and ZFS now says the “Pool Is Not Healthy”. The only way to clear that error was to offline the disk I modified, replace it, and resilver the pool. (Is this expected behavior?)

-Brian

What would a zpool status show if you ran it right before the scrub?
It almost sounds like the pool was running in a degraded state and the “injected” drive was already in a bad state before the test.

I don’t think it’s the RAID layer that makes ZFS slow, so much as the advanced COW filesystem and feature set. ZFS is slow on a single drive, too, as is any comparably feature-rich filesystem, like BTRFS or bcachefs.

The status of the pool was determined through the TrueNAS GUI. I tried issuing the “zpool” command from the shell, but I just got a “command not found” error.

Like I said, the indicated pool status was normal before and after I modified the data on the drive. Only after doing a scrub did the pool status change to not healthy.

I decided to erase everything (including TrueNAS), install a different set of drives, and then reinstall everything again. (Fresh start.) I got the exact same results.

Not sure why ZFS is so slow; I just know it is. Using my Areca ARC-1226 with eight Seagate Exos X20 20TB SAS drives allows me to easily saturate my 10G network for both reads and writes (1.1GB/s).

I originally tried ZFS using TrueNAS SCALE. I couldn’t get read speeds over 450MB/s, and write speeds were below 200MB/s (NFS, async). I even tried Ubuntu Server and set up ZFS myself but got the exact same results.

For fun I built a pool of eight Optane P1600X drives as 4x mirrors. Even that could not get over 290MB/s writes.

Can you describe how you tested the throughput?

My first thoughts on this:

  • Software “RAID” has to use the CPU and memory on the system, whereas RAID cards usually have their own processor. Theoretically, on a loaded system or with a slower CPU you can see a difference. In practice, I can’t say I’ve seen this as an issue in a very long time, as compute power, bus speeds, etc. have grown significantly.

  • Did you monitor CPU/Memory when running the test?

  • Did you run the test using sequential or random I/O? Did you run it long enough to saturate the cache? Are there some optimizations on the card that do prefetching?

  • Did you run the test locally or remotely? I’m a big fan of running the tests locally.

  • What did you use to generate the I/O?
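For what it’s worth, here is roughly how I’d do a quick local sequential-write sanity check before blaming the network or the protocol – a crude sketch, with the path and size as placeholders; write well past the machine’s RAM/ARC so caching can’t flatter the number:

# Crude local sequential-write check: time a large streaming write, including
# the final fsync. /mnt/pool and SIZE are placeholders; pick a size well beyond
# the machine's RAM so the page cache / ARC can't dominate the result.
import os
import time

PATH = "/mnt/pool/throughput_test.bin"
SIZE = 64 * 1024**3      # 64 GiB total
BS = 1024**2             # 1 MiB per write

buf = os.urandom(BS)     # incompressible data, so compression can't inflate the number
start = time.monotonic()
with open(PATH, "wb") as f:
    for _ in range(SIZE // BS):
        f.write(buf)
    f.flush()
    os.fsync(f.fileno())
elapsed = time.monotonic() - start
print(f"{SIZE / elapsed / 1e6:.0f} MB/s sequential write")
os.remove(PATH)

Then repeat the same test over NFS/SMB from a client and compare; if the local number is already low, the network and sharing protocol are off the hook.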