RAID: Tech in Transition | Level1techs

We see a lot of misinformation and misunderstanding around RAID, “softraid,” and filesystems that are not exactly RAID but do something a little different (like ZFS and btrfs).

This video kicks off the introduction to the experiments we've been doing in the background for the last few months with Linux md, hardware RAID, btrfs, and ZFS.

It has been tedious and laborious.

The results are getting interesting.

Scientific method requires documenting one’s steps, and a reproducible result.

So I thought I’d outline the experiments I plan to do in this video series, in order to see if anyone spots any trouble with them. I want the simplest experiments that provide the most concrete, direct results.

What’s being tested?

For me, the priorities of a RAID system are:

1. Don’t allow the data to experience bitrot and become corrupt. The data you put in should be the data you get out, even years later, even after hardware failure.
2. Replacing a failed drive should work, and not corrupt data.
3. Corrupt data, ahead of outright failure, should be detected, reported, and corrected.
4. Perform reasonably for the number of spindles involved.

Hardware Permutations:

For each setup, we will use three 750 GB drives. Why? It’s what I have on hand. The configurations:

- RAIDZ1 on FreeNAS
- LSI hardware RAID controller w/ BBU (RAID 5)
- btrfs (RAID 5 metadata and data)*
- md RAID 5 on Linux w/ ext4 filesystem

The Tests:

  1. Pull the plug – in this test, a copy to the array is in progress when the power is pulled. What happens?

  2. Pull a drive – a drive is removed while the system is on. What happens?

  3. A Drive has gone “evil” – the systems are shut down, and one disk in the array has random “bad” data sprinkled throughout the drive. Approximately 1000 4 kilobyte blocks of ones are written randomly throughout the evil drive. This hopefully simulates a particular failure mode of a drive where it just can’t read, or is returning corrupt information, but otherwise works.
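That "evil drive" corruption step could be sketched roughly like this in Python, assuming the target is a scratch disk image file (the function name, seed handling, and block counts are illustrative, not the script actually used; for a raw block device you would need to supply the size yourself, since `os.path.getsize` won't report it):

```python
import os
import random

def sprinkle_bad_blocks(device_path, block_count=1000, block_size=4096, seed=None):
    """Overwrite random 4 KiB blocks of a disk image with ones.

    DESTRUCTIVE: only point this at a scratch image, or at a drive
    you intend to sacrifice for testing.
    """
    rng = random.Random(seed)
    # Works for image files; a raw block device needs an explicit size.
    size = os.path.getsize(device_path)
    total_blocks = size // block_size
    ones = b"\x01" * block_size
    with open(device_path, "r+b") as dev:
        for _ in range(block_count):
            block = rng.randrange(total_blocks)
            dev.seek(block * block_size)
            dev.write(ones)
```

Because the writes are block-aligned and within bounds, the file size is unchanged; only the chosen blocks differ from the original contents.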

Aftermath of each test:

A large collection of files is copied to each array. The count of files, filenames, and directory layout on the filesystem is documented off of the array. A database of md5 hashes of all of the files on the array is stored on another system. After each event, an automated script scans the filesystem using ordinary filesystem utilities and compares the md5 sum of the files to the md5 sum stored in the database and flags any mismatch. File not found is also flagged.
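The hash-and-verify pass described above could be sketched like this in Python (the function names, chunk size, and the in-memory dict standing in for the hash database are all illustrative; the original setup stores the hashes on a separate system):

```python
import hashlib
import os

def build_hash_db(root):
    """Walk the array's filesystem and record an md5 for every file."""
    db = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            h = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            db[rel] = h.hexdigest()
    return db

def verify(root, db):
    """Compare the current filesystem state to the stored hashes.

    Returns (mismatched, missing): relative paths whose content
    changed, and paths that are no longer found at all.
    """
    current = build_hash_db(root)
    mismatched = [p for p, h in db.items() if p in current and current[p] != h]
    missing = [p for p in db if p not in current]
    return mismatched, missing
```

Run `build_hash_db` once after the initial copy, persist the result elsewhere, then call `verify` after each fault-injection event.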

After the script is complete, the system logs are scanned manually by an operator for any indication of errors, and noted for our review video.

This is a companion discussion topic for the original entry at

Yay for the scientific method, rigorous notes, and the use of simple, astrophysics-style naming (aka the evil drive :P )

I take it the md5 hashes are created prior to any initial tests? Also, was this by any chance what you had running in the background of the inbox video...
(waits for video to process...)

Thanks for the video, Wendell. I've been giddy about it since it was announced, and I wasn't disappointed.


I have experienced a corrupt RAID 5 array at one of my work's clients.

Basically, there was a bad drive no one knew about that had failed a long time ago (the lights that show the error code were facing the wall, not in plain view; facepalm so hard it goes through the face to the back of the head).

Anyway, it turns out the cache battery for the RAID array was also dead. The power had a brownout, garbage data got written to the array, and BAM, it was completely gone and unrecoverable due to the dead battery and the bad drive on top of that. Yes, they had a UPS, but it was dead too. And to think this shop was proud of spending so little on IT over the years...

Thankfully we were able to get most of the data back via the tape backups they had, but WOW, those were a few long nights :(
AD had to be rebuilt.

Neither I nor my company was at fault for this neglect; we kind of inherited it, since we were brought in to replace the old server. The old server decided to die one week before we were scheduled to replace it... all it had to do was last one more week!

Anyway love the video, love your approach to this stuff. I plan on showing this video to spark discussion at my work as well :)

Great video, can't wait to see the next ones. I have been using ZFS for years and still learned a thing or two just from your overview. Keep it going!

That video about recovering by using dd and repairing the partition tables and the start of the partition? That's like the eleventh time I've had to do something like that for folks who were totally SOL. Probably could have done "micro surgery" on your array to get stuff back, lol. Another time it was a similar scenario: RAID 5, one drive had died but was "working". I could hear it being weird, but the controller (Dell PERC 5 w/ BBU) was reporting everything was just peachy. I shut down, ran drive diagnostics on each drive out of the array one at a time (well, okay, each drive in its own computer, because ain't nobody got time for that), identified the one I thought was most likely to have lost its marbles, booted up without that drive, and everything was 100%.
This was SAS, not SATA. Turns out the issue was something with the PCB. I happened to have another identical drive, swapped PCBs, and the issues with that drive went away. Hence my "sooner or later, you'll have a drive go evil", and that is the stuff of nightmares. Five Nights at Freddy's? More like five nights fighting RAID 5.


Thanks for replying! I plan on doing some testing and learning more with dd.
You've definitely inspired me to dig deeper into this stuff, considering this type of knowledge and skill is pretty valuable in the industry. Plus the satisfaction of doing it successfully is definitely a plus!

In my case, though, the decision was out of my hands... I'm just the junior resource. Instead of trying to rebuild the array, it was decided we would recover what we could from the known-good backups and rebuild from the ground up on new hardware.

I intend to upgrade (more like rebuild) my FreeNAS server with 3 more drives and several more GB of ECC RAM over the summer. I need to "upgrade" it since, when I built it last summer, it was mostly just an experiment to see if I could keep it running and check its temporary usefulness, rather than to ensure complete data integrity and insane amounts of storage for the next century. So I only have 2 drives in RAID 1 at the moment, which I'm guessing doesn't have the same protection against data corruption as RAID 6.

Anyway, just release that FreeNAS 2.0 video you guys teased in one of The Tek (or Inbox.exe) episodes before June/July starts, and rebuilding my server will be a lot quicker and easier. And of course I'm looking forward to this series, since although I have yet to lose data to a drive failure (mainly because I haven't had a drive fail on me yet), I still believe data integrity is rather important if I want to stay data-loss free.

Wendell and I must be on the same wavelength. Most of the first part of that video was what I have said in this thread over the last few days.

@ElfFriend RAID 5 and 6 (or RAIDZ1 and Z2) are mostly solutions for needing both capacity and redundancy. If you can get the required capacity with a mirror, then definitely do that. A mirror on ZFS also gives you a handy-dandy easy backup solution in the form of `zpool split`, which can split a mirror so you can take that disk to another system as a backup. Put a new disk in, attach it to your original array, and let it resilver. For real data-retention purposes you would want your primary array to be a triple mirror, though, so that even after the split you still have a two-disk mirror.
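That split-and-resilver backup cycle might look roughly like this at the shell (the pool name `tank` and the disk names are placeholders, not from any real system; check your own `zpool status` output before running anything like this):

```shell
# Assumes a mirrored pool "tank" on disks da1 and da2 (illustrative names).
zpool status tank

# Detach one side of the mirror as its own single-disk pool "tankbackup";
# that disk can then be imported on another machine as a backup.
zpool split tank tankbackup da2

# Put a fresh disk (da3) in and attach it to the remaining side,
# then let the mirror resilver.
zpool attach tank da1 da3
zpool status tank   # watch resilver progress
```

With a triple mirror, the same `zpool split` leaves you a still-redundant two-disk mirror, which is the point of the suggestion above.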

@wendell Perhaps some ZFS tests with a separate ZIL device and sync disabled might be useful.


Hey Wendell, thanks for this; very educational. I've been working in the enterprise space for quite some time now, but I'm on the DevOps side of the house. I learned a lot from this video. Looking forward to the next one.

Excellent post, Wendell! I work in an enterprise environment, and I did indeed pick up a couple of pointers. Once again, great video. I really do hope to see more!

I made these in Paint. They were a lesson I had to do in my Digital/Computer Forensics class.


Good stuff @wendell. Most of it was beyond my current level of knowledge, but it's nice to learn new things. Looking forward to more videos on this and the Linux channel.

I like the facial hair. It makes you look a lot less vulnerable. (You had a case of "fat face" in your office tour video.) No offense.

Loved that new video on Tek Enterprise! Well-made, informative, and down to earth. I love your videos Wendell.

Wendell, have you used HammerFS before?
If so, how would you say it compares to ZFS & BTRFS?

I've only moved HDDs in software RAID between machines once and it was relatively painless. How does transferring HDDs with BTRFS RAID between different machines compare with other Linux software RAID types?

Thanks for the video. I'm not in the enterprise server field, but every so often I get the bug to do a little research in the area. I learned a lot. The video was well planned and very informative. I'll be waiting for the next video, Professor Wendell.

I wonder what the chances are of a BTRFS or ZFS install running into bitrot trouble over the specific block that handles the filesystem sanity checks?

@wendell I had just set up my 2-disk 3TB RAID 1 array using md on my Linux server when this video was announced, so I was very concerned about what you were going to say. Having watched it, my fears were well founded. The conclusions I have drawn from this video are:
A: Get a UPS.
B: Set up some form of drive scrubbing, be it at the least a cron job, or preferably a program that does it while the array is not in use.

Whilst I have come to these conclusions, I have no idea how to go about completing B. I'm also well aware that this still wouldn't be optimal, but an optimal setup is also going to cost much more money, which I don't have.
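For what it's worth, on Linux md a scrub can be triggered through sysfs, so a basic version of B can be just a cron entry (the array name `md0` is an assumption here; check `/proc/mdstat` for yours, and note that Debian-based distros already ship an `mdadm` checkarray cron job that does this for you):

```shell
# One-off consistency check ("scrub") of an md array, run as root:
echo check > /sys/block/md0/md/sync_action

# Watch progress and results:
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt   # non-zero means inconsistencies were found

# To run it monthly, a root crontab entry
# (minute 0, hour 1, on the 1st of each month):
# 0 1 1 * * echo check > /sys/block/md0/md/sync_action
```

Note that a plain md check can detect parity mismatches but, unlike ZFS or btrfs, it has no per-block checksums to tell it which copy is the correct one.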