# Video Idea: The Great RAID System Shootout

So I see a lot of misinformation and misunderstanding around RAID and “softraid” filesystems like ZFS and btrfs. I am no great authority – my largest arrays are probably only about 30 spindles – and we ourselves have been accused of not understanding either, but I don’t believe that’s really the case. So I have been experimenting in the background over the last few months.
It has been tedious and laborious.
These basic tests are what I’ve come up with, and the results are getting interesting. I could probably turn this into a video, but these kinds of things are unlikely to get much traffic. Still, it’s been interesting for my own learning.
The scientific method requires documented steps and a reproducible result, so I would begin again, going even slower, to document the experiments.
So I thought I’d outline what I plan to do, to see if anyone spots any trouble with my experiments. I want the simplest experiments that give the most concrete, direct results.

What’s being tested?

For me, the priorities of a RAID system are:
1. Don’t allow the data to experience bitrot and become corrupt. The data you put in should be the data you get back, even years later, even after a hardware failure.
2. Replacing a failed drive should work, and not corrupt data.
3. Corrupt data, ahead of outright failure, should be detected, reported, and corrected.
4. Perform reasonably for the number of spindles involved.

Hardware Permutations:

For each setup, we will use five 750 GB Seagate drives. Why these? It’s what I have on hand. They will be tested in the following configurations:
- RAIDZ1 on FreeNAS
- LSI hardware RAID controller w/ BBU (RAID 5)
- btrfs (RAID 5 metadata and data)
- md RAID 5 on Linux with an ext4 filesystem

The Tests:

  1. Pull the plug – in this test, a copy to the array is in progress when the power is pulled. What happens?
  2. Pull the RAM – a DIMM is removed while the system is on. (I may not do this one.) What happens?
  3. Pull a drive – a drive is removed while the system is on. What happens?
  4. A drive has gone “evil” – the systems are shut down, and one disk in the array has random “bad” data sprinkled throughout. Approximately 1 megabyte’s worth of 4-kilobyte blocks of zeros is written at random offsets across the evil drive (sketched below). This hopefully simulates the failure mode where a drive just can’t read, or returns corrupt information, but otherwise works.
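Roughly what I mean by “sprinkling bad data” in test #4 is a minimal Python sketch like the one below, with /dev/sdX as a placeholder for whichever disk gets to be the evil one. Only point it at a disk you intend to ruin.

```python
#!/usr/bin/env python3
"""Sketch of test #4: scatter ~1 MiB of 4 KiB zero blocks at random offsets
across the "evil" drive. The device path is a placeholder -- this destroys
data on whatever it is pointed at."""
import os
import random

DEVICE = "/dev/sdX"     # placeholder: the disk chosen to go "evil"
BLOCK_SIZE = 4096       # 4 KiB per corrupted block
BLOCK_COUNT = 256       # 256 * 4 KiB = 1 MiB of corruption in total

zeros = bytes(BLOCK_SIZE)

fd = os.open(DEVICE, os.O_WRONLY)
try:
    # Seeking to the end of the block device tells us its size,
    # so the random offsets stay in range.
    dev_size = os.lseek(fd, 0, os.SEEK_END)
    for _ in range(BLOCK_COUNT):
        # Pick a block-aligned offset and overwrite it with zeros.
        offset = random.randrange(dev_size // BLOCK_SIZE) * BLOCK_SIZE
        os.lseek(fd, offset, os.SEEK_SET)
        os.write(fd, zeros)
    os.fsync(fd)
finally:
    os.close(fd)
```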

Aftermath of each test:

A large collection of files is copied to each array. The file count, filenames, and directory layout are documented and kept off the array. A database of md5 hashes of all the files on the array is stored on another system. After each event, an automated script scans the filesystem using ordinary filesystem utilities and compares each file’s md5 sum against the one stored in the database, flagging any mismatch. Missing files are also flagged.
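The checking script is nothing exotic – roughly the sketch below, using a flat md5 manifest instead of a proper database just to keep the example short; the mount point and manifest path are placeholders. Run it in build mode before the abuse starts, keep the manifest on another machine, then run it in verify mode after each event.

```python
#!/usr/bin/env python3
"""Sketch of the integrity check: 'build' records an md5 for every file under
the array, 'verify' re-hashes everything and flags mismatches and missing
files. Paths are placeholders; a flat manifest stands in for the database."""
import hashlib
import os
import sys

ARRAY_ROOT = "/mnt/array"               # placeholder: array mount point
MANIFEST = "/srv/manifests/array.md5"   # placeholder: kept on another system

def md5_of(path, chunk=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def walk_files(root):
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

def build():
    with open(MANIFEST, "w") as out:
        for path in walk_files(ARRAY_ROOT):
            out.write(f"{md5_of(path)}  {os.path.relpath(path, ARRAY_ROOT)}\n")

def verify():
    problems = 0
    with open(MANIFEST) as manifest:
        for line in manifest:
            digest, rel = line.rstrip("\n").split("  ", 1)
            path = os.path.join(ARRAY_ROOT, rel)
            if not os.path.exists(path):
                print(f"MISSING  {rel}")
                problems += 1
            elif md5_of(path) != digest:
                print(f"MISMATCH {rel}")
                problems += 1
    print(f"{problems} problem(s) found")

if __name__ == "__main__":
    build() if sys.argv[1:] == ["build"] else verify()
```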
After the script completes, an operator manually scans the system logs for any indication of errors, which we note for the review video.

Some notes so far:

  • btrfs does not seem to have a ‘scrub’ command yet like ZFS does, though it “seems like” simply scanning all the files on the filesystem will cause it to repair corrupt info (something like the read-everything sketch after this list).
  • The LSI hardware controller and md on linux are hilariously bad at test #4 (for data integrity). I am surprised by this for the hardware controller.
  • It doesn’t seem fair to test btrfs RAID 5. It barely works, and seems to be a thorny quagmire of not-ready-yet code once you actually hit some sort of failure. Please, someone tell me they have experienced a drive failure on btrfs and recovered easily.
  • zfs machine sometimes hangs when pulling a live disk.
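For the “just read everything” idea in the btrfs note above, I mean something along these lines (the mount point is a placeholder). Whether btrfs really rewrites a bad copy when a read fails its checksum is exactly the thing I want to confirm.

```python
#!/usr/bin/env python3
"""Poor man's scrub: read every byte of every file so the filesystem is
forced to verify its checksums. The hope is that btrfs then repairs any
block that fails, where it has a good copy to repair from. Mount point is
a placeholder; files that cannot be read are reported."""
import os

MOUNT = "/mnt/btrfs"   # placeholder mount point

errors = 0
for dirpath, _dirs, files in os.walk(MOUNT):
    for name in files:
        path = os.path.join(dirpath, name)
        try:
            with open(path, "rb") as f:
                # Read and discard in 1 MiB chunks; we only care that it reads.
                while f.read(1 << 20):
                    pass
        except OSError as exc:
            errors += 1
            print(f"read error: {path}: {exc}")

print(f"done, {errors} unreadable file(s)")
```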

If there is any commentary, or ideas for adjusting the tests (or doing new tests) without creating ludicrous amounts of work for us, let us know.


Sounds like a great plan for a video, but I have an issue with #2, pulling the RAM while it's in operation.
Wouldn't pulling RAM while the system is running destroy the CPU, the motherboard, and the RAM in the memory channel that the pulled stick was part of?

The media library in my office is on a maximum 64-disk array, currently spread over four 71kQ series adapters hooked up to standard SAS backplanes, so a little older already – I think I bought them in 2012. Adaptec adapters have a really long lifespan in my experience; the Adaptec adapters I had before that were SCSI-III ones on Dell backplanes that I bought back in 2002 or so, so that's a decade of service, and I never had any problem with them. I would recommend such an arrangement to enterprise users. The time saved is enormous. It literally takes ten minutes for a non-specialized staff member to configure a RAID with up to 16 SAS devices, and that includes autonomous caching arrangements, which can easily be reinforced with SSDs without any extra time investment. I wouldn't even consider a softraid for the main media storage server, because there is just no way to run a softraid in such an application with the same degree of economy and safety. The performance is crazy good, and gets even better when integrating SSDs into the already present caching. An adapter like that costs less than 1000 EUR, while a staff member specialized in software RAID configs costs about 150 EUR per hour all in. Spread over a ten-year adapter lifespan, that's roughly 100 EUR per year, which buys about 40 minutes of that specialist's time – so the softraid would have to need less than 40 minutes per year of setup and maintenance just to break even... impossible.

As for large storage filesystems, we're using ZFS, XFS and btrfs in different arrangements on different machines. My personal photo library backup at home is kept on a 12-disk array that has run on btrfs for about a year now, and in that time two disks have given warnings of imminent failure (another big advantage of Linux, I guess, is the countdown that shows roughly how much time you have left on a disk before it fails catastrophically, and the fact that btrfs automatically compensates for any data corruption until catastrophic failure), and they were just hotswapped for new disks, and it striped perfectly. RAID 5 is not recommended in btrfs; even though in my experience it kind of works, there is no real benefit to it, and I would not advise anyone to go against the clear recommendations of the btrfs devs on this. I almost exclusively use btrfs on my workstations, standalones and portables, since FC20 I think, now all of them on OpenSuSE Factory, and I've never had a problem with it – all the advertised features just seem to work. Performance has improved enormously since the early btrfs versions, though; in the beginning it wasn't that good. Now I wouldn't even think of using anything else on anything that doesn't have a large hardware-controlled array. I find the performance since OpenSuSE 13.2 very good, and even better on Factory. Filesystem performance in general on Fedora is usually even a tad better, but not by much. I find the implementation of btrfs and ZFS on OpenSuSE to be the most finished, well-implemented and integrated since 13.2. On 13.1, the integration with many system packages was still pretty abysmal, but since 13.2 they've done a great job.

I've never had ZFS crash when a disk was hotswapped, but I've only tried it once, on a machine with a software RAID arrangement, and that was basically by mistake on a freshly installed system – nothing bad happened at all.

The "what raid" debate is an enterprise vs soho thingy in my opinion. I would go for a btrfs arrangement in a soho environment because it's basically free and chances are that the machines will be multi-functional anyway, and btrfs works really well in such an application. With the latest OpenSuSE and Fedora, btrfs also offers really good performance and stability, and all the tools just work as they should. Especially in conjunction with the new snapper on OpenSuSE, a lot of peace of mind and efficiency can be had for no money at all. On an enterprise level however, things are completely different, and machines are dedicated. There is no doubt in my mind that a good hardware controller based RAID solution with native autonomous caching is the way to go for enterprises, because the economic and efficiency benefits can't be beat with any software solution, you can't even come close.

I would be interested in this for sure...

This is very good info.

Is the btrfs that you are running on those 12 disks on top of an md array, or is btrfs managing the disks directly? I agree that the devs sort of pooh-pooh raid 5/6 – does that mean you are running raid10, if btrfs has native access to the disks with no md managing them?

I kind of want to try scenario #4 with an 'expensive' setup. I have some old EMC Windows Storage Server 2008 stuff I could test with, with some smallish SAS drives. I've seen (relatively minor) bitrot sneak into these EMC platforms, though. Not to get too far off topic, but I had this happen recently on an EMC SAN: a Hyper-V VM was running from an NTFS partition, and merging the snapshots back into one vmdk would fail with "invalid operation". MS Pro support said this was due to file corruption, not a problem with the underlying disk. Eventually I used Linux to copy the file, and Linux reported a hardware problem. Strangely, there was nothing in the event viewer on Windows, nor anything logged on the SAN. After I copied the file with Linux, it merged the snapshots no problem. MS Pro support had no idea what was going on, and the dude was incredulous that Linux fixed the problem. hahahahaha. BUT the underlying OS on that EMC SAN was not Windows Storage Server, so I don't know what to make of that.

That's a really roundabout way of saying I don't think even some of these enterprise things really know when a drive is lying. If the drive self-reports a problem, everything works fine. And it's true that this is a big difference between consumer and enterprise drives. That's also why the "RAID" editions of consumer hard drives have firmware tweaks to "give up" trying to read a lot sooner, I think.

btrfs on a standalone drive (or volume) does seem to work nicely.

XFS is still my favorite for web servers or any servers that have loads of teeny tiny files. I need to check this, but I think filesystems with built-in compression will also be stupid fast for web servers; I haven't tested that extensively. Informal testing of ZFS with SSD l2arc/ZIL and compression "feels" as stupid fast as XFS, though I haven't formally benchmarked it. I haven't been brave enough to test this workload with btrfs, given the problems with anything more complex than raid1 with btrfs on raw devices.

I'll also switch to openSUSE to re-test some things in btrfs. It's probably some braindead thing in debian-testing.

The "Pull the ram" test reminded me of a story a coworker told me.
Apparently a couple of teenagers that worked in his department pulling cable and the like pulled a bunch of ram sticks out of some running servers. They destroyed some data, and everyone's week. They were very fired.