Rumor has it that iCloud data is split between Google Cloud and AWS.
Facebook, Google, and Microsoft all use homegrown, written-from-scratch, proprietary distributed filesystems.
(Warning: too much detail ahead.)
Facebook used to use HDFS, now mostly HACFS. Both are based on GFS and Colossus from Google in terms of design and architecture.
Google used GFS until around 2010-2012, and has been using Colossus since. These are not file systems in a POSIX "I will mount it on my workstation" kind of way; the best way to think about them is as "object stores", where objects have IDs that look like filenames. They're really good for archival storage - you can't do random writes.
The GFS design is simple: you take 5 of the biggest machines in the cluster, make them hold all the metadata, and keep them in sync using Paxos libraries (think etcd); then you take another 10,000 machines and make them chunkservers. You can:
- “open and create a new file with some attributes”, “write data sequentially”, “close a file”.
- “open an existing file”, “read data at any offset”, “close a file”.
When opening and creating a new file, the attributes include the replication/encoding; engineers who are the end users of GFS/Colossus choose their own encoding. In the GFS master's reply to the file creation, you'll get the IP addresses of some chunkservers you should be contacting.
You'd then contact a chunkserver, give it the data, tell it what file and what offset the data is for, tell it to replicate the data to two of its friends so there are 3 copies total, and give it a crc32 checksum.
r=3 replication was typical for GFS: data is stored 3 times in the same cluster in order to have good availability when reading (no RAID reconstruction is needed, just look at another server). The 3 machines holding the copies would be spread across different racks/failure domains.
When you close a file, you update the metadata on the GFS master with the checksums of all the chunks you've written so far.
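Roughly, the whole write path looks something like this. A sketch in Python, not the real client library - the master/chunkserver objects and their methods are made-up stand-ins:

```python
import zlib

CHUNK_SIZE = 64 * 1024 * 1024  # typical GFS chunk size, 64 MiB

# Hypothetical sketch of a GFS-style sequential write from the client's point
# of view; none of these object names or methods are the real API.
def write_file(master, path, data, replication=3):
    # "open and create a new file with some attributes": the master's reply
    # includes the chunkservers you should be contacting.
    handle, chunkservers = master.create(path, attrs={"replication": replication})

    chunk_checksums = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        crc = zlib.crc32(chunk)
        # Hand the chunk to one chunkserver, tell it which file/offset it
        # belongs to, and ask it to replicate to two of its friends so there
        # are `replication` copies total.
        chunkservers[0].write(handle, offset, chunk, crc,
                              replicate_to=chunkservers[1:replication])
        chunk_checksums.append((offset, crc))

    # "close a file": report the checksums of everything written so the
    # master's metadata matches what landed on the chunkservers.
    master.close(handle, chunk_checksums)
```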
Typical chunk size was 64M.
Chunkservers would store chunks as files on ext4, with each disk having its own filesystem. There'd be no RAID and no read caching; there'd be some write buffering.
Google started experimenting with Reed-Solomon on GFS; a popular encoding was 9.4: each 72M "stripe" of data was split into 9 x 8MiB chunks and 4 recovery chunks were generated. These 13 chunks were spread across 13 racks.
A rack going offline (e.g. the top-of-rack switch going away together with its chunkservers) therefore costs you at most one chunk of any given stripe.
Facebook likes using 10.4.
Google later switched to 6.3 (48M stripes: 6 x 8MiB data + 3 recovery chunks), so BigTable can save RAM and increase concurrency.
Azure used LRC(6,2,2) for a long time.
All of these distributed filesystems in use at these companies have limited semantics and pluggable encoding algorithms that you can change on a case-by-case basis by rewriting individual files with a different encoding.
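For a feel of what the different encodings cost in raw storage, here's some back-of-the-envelope arithmetic using the chunk counts mentioned above. Treat the LRC line as approximate - local-reconstruction codes don't recover from arbitrary failure patterns the way plain Reed-Solomon does:

```python
# (data chunks, redundancy chunks) per stripe for the schemes mentioned above.
schemes = {
    "r=3 replication": (1, 2),    # the same chunk stored three times
    "RS(9,4)":         (9, 4),    # 9 x 8 MiB data + 4 recovery = 72 MiB stripe
    "RS(10,4)":        (10, 4),
    "RS(6,3)":         (6, 3),    # 6 x 8 MiB data + 3 recovery = 48 MiB stripe
    "LRC(6,2,2)":      (6, 4),    # 6 data + 2 local + 2 global parity chunks
}

for name, (d, p) in schemes.items():
    overhead = (d + p) / d        # bytes stored per byte of user data
    print(f"{name:18s} {overhead:.2f}x storage, {p} of {d + p} chunks are redundancy")
```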
Colossus in 2010 was a huge improvement.
Metadata for a cluster was no longer stored on a single machine (… or 5, because of Paxos); it was stored in BigTable. Each row in one of the tables holds one file's metadata (name, attributes including encoding, and the list of chunks with checksums and hostnames for each chunk). BigTable itself is just a Log-Structured Merge (LSM) system for storing large sets of key-values. Data in BigTable (here, Colossus metadata) is sorted by key and split into 64M tablets, such that each tablet is responsible for a key range. So… each tablet is responsible for a small part of the overall filesystem metadata and doesn't have to live on the biggest machine in the cluster; it's also usefully persisted outside of RAM, with tablet data stored in a GFS cluster, disaggregated from the compute serving it. So you have two layers: a smaller GFS cell that houses metadata for a larger Colossus cell.
When writing or overwriting a row of data (Colossus metadata), you send the row to the BigTable tablet server responsible for that key range (filename) and it goes into two places: a hashmap in the tablet server's memory, and an append to an already-open GFS log file (crash recovery/persistence). Eventually, the in-memory hashmap for a tablet gets sorted into an overlay tablet. When reading a row, you contact the tablet server responsible for the row and it looks at the hashmap, then the overlay tablet, then the base tablet. Sometimes you end up with more than one overlay tablet; the tablet server merges them into the base in the background and garbage collects the old ones - they're sorted, so merging is trivial.
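A toy version of that LSM read/write path, just to make the layering concrete - plain Python dicts stand in for the sorted tablet files and the GFS log, so this is obviously not how a real tablet server is written:

```python
# Toy LSM tablet: a mutable in-memory hashmap, immutable overlay tablets, and
# a base tablet. Reads check newest to oldest; compaction merges sorted data.
class Tablet:
    def __init__(self):
        self.memtable = {}    # recent writes; in reality also appended to a GFS log first
        self.overlays = []    # immutable sorted overlay tablets, newest first
        self.base = {}        # the big sorted base tablet

    def write(self, row_key, row):
        self.memtable[row_key] = row

    def read(self, row_key):
        # hashmap first, then overlay tablets (newest first), then the base tablet
        if row_key in self.memtable:
            return self.memtable[row_key]
        for overlay in self.overlays:
            if row_key in overlay:
                return overlay[row_key]
        return self.base.get(row_key)

    def flush(self):
        # the in-memory hashmap gets sorted into a new overlay tablet
        self.overlays.insert(0, dict(sorted(self.memtable.items())))
        self.memtable = {}

    def compact(self):
        # background merge: everything is sorted, so merging overlays into the
        # base is trivial, and the old overlays get garbage collected
        for overlay in reversed(self.overlays):   # oldest first, so newer rows win
            self.base.update(overlay)
        self.overlays.clear()
```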
The first Colossus ran on chunkservers, but they were deemed to be overly smart and complicated and were replaced with D servers (think FTP servers). Chunkservers used to know how to replicate the data in GFS; in Colossus, data replication and distribution is something that the file library does, which allows for higher availability since clients can read/write data independently of any single D server.
What used to be r=3 on GFS is replaced with r=3.2 with 8MiB chunks in Colossus. You send 3 copies (3 x 8MiB) of the data down the network; the Write call in the API returns when any 2 of them work out, and the 3rd write can linger a bit and get there eventually… or not. Some kind of background curator will do a scrub and figure out if it needs to restore the chunk. Those two copies aren't even on disk yet at that point - they just made it to RAM and will make it to disk, probably, eventually.
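The "return after 2 of 3" part might look something like this - a hedged sketch, where the D servers and their put() method are hypothetical stand-ins:

```python
import concurrent.futures

# Push the chunk to three D servers, acknowledge the Write as soon as any two
# succeed, and leave the straggler to finish (or be repaired by a background
# curator) later. Not real Colossus client code.
def write_chunk(d_servers, chunk, want_acks=2):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(d_servers))
    futures = [pool.submit(s.put, chunk) for s in d_servers]
    acked = 0
    for fut in concurrent.futures.as_completed(futures):
        if fut.exception() is None:
            acked += 1
        if acked >= want_acks:
            break
    # Don't wait for the third copy; if it never lands, the scrubber will
    # notice and re-replicate the chunk in the background.
    pool.shutdown(wait=False)
    return acked >= want_acks
```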
With Reed-Solomon the CPU cost for writing and read-recovery is on the clients. Some reads where data chunks are damaged can be slow.
The key lesson from Colossus is that metadata is data, and you can layer systems to build bigger systems and de-amplify write rates and write volumes. For example, for every 8M of data written across 3 machines, you only store maybe 100 bytes of metadata, so to cause 8M of metadata to be written you'd need roughly 80,000 of those 8M data writes. Clusters are big. If your metadata is larger than, say, what fits on one machine, you can still store it on multiple machines… and if you then need to keep track of multiple machines' worth of metadata - which is itself just data - one machine is probably enough, and that's probably ok. One machine can do a lot these days, and that one machine is not in the critical path of some write happening 2 layers up.
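The arithmetic behind that, roughly:

```python
data_per_write = 8 * 2**20     # 8 MiB of user data per write
metadata_per_write = 100       # ~100 bytes of metadata per write

writes_needed = data_per_write // metadata_per_write  # writes to produce 8 MiB of metadata
print(writes_needed)                                  # ~84,000 (the "80,000" above)
print(writes_needed * data_per_write / 2**40)         # ~0.64 TiB of user data behind it
```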
With Google Cloud VM disks (aka Persistent Disk), the actual data is stored in Colossus files over the network. When you write a block to the device, a log record is appended to a file, and a cached in-memory map entry is modified to say which block is in which log file at which offset. Over time, these log files are compacted so that you end up with contiguous ranges of blocks, which reduces the size of the map and allows old data to be garbage collected, freeing up space. Not sure about these days, but writes were not guaranteed to be on stable storage when the VM kernel issued a sync - they were only guaranteed to be in multiple machines' RAM… but they'd get onto stable storage soon enough. Stable storage is Colossus on SSD, regardless of what you choose in the UI (there's just artificial IOPS rate limiting happening for fairness between different users of the storage).
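In spirit it's the same log-structured trick again, one layer up. A toy sketch of the block map - not Google's code, just the idea:

```python
# Log-structured block device: every block write is appended to the current
# "log file" and an in-memory map remembers where the latest version of each
# block lives. Lists stand in for log files here.
class LogStructuredDisk:
    def __init__(self):
        self.logs = [[]]       # each log file is a list of block payloads
        self.block_map = {}    # block number -> (log index, offset within that log)

    def write_block(self, block_num, data):
        log_idx = len(self.logs) - 1
        self.logs[log_idx].append(data)
        self.block_map[block_num] = (log_idx, len(self.logs[log_idx]) - 1)

    def read_block(self, block_num):
        log_idx, offset = self.block_map[block_num]
        return self.logs[log_idx][offset]

    def compact(self):
        # Rewrite live blocks in block order into a fresh log so ranges become
        # contiguous, the map shrinks, and old logs (stale block versions) can
        # be garbage collected to free up space.
        new_log, new_map = [], {}
        for block_num in sorted(self.block_map):
            new_log.append(self.read_block(block_num))
            new_map[block_num] = (0, len(new_log) - 1)
        self.logs = [new_log]
        self.block_map = new_map
```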
The machine running your VM has no local storage (except a small crappy 64G boot SSD; I'm not sure if it's USB 2.0 or some other interface).
Also, @alpha754293, since you mentioned par2 specifically, keep in mind that it's very slow, and you're probably not going to be uploading each one of your datasets across 100+ tapes anyway.
You may want to look at snapraid instead – it uses virtually no CPU (gigabytes per second per core) while doing 10d+4p-style RAID computation (any 10 of the 14 are enough to recover).
They have this table:
Do I need one or more parity disks?
As a rule of thumb you can stay with one parity disk (RAID5) with up to four data disks, and then using one parity disk for each group of seven data disks, like in the table:
| Parities | Data disks |
| --- | --- |
| 1 / Single Parity / RAID5 | 2 - 4 |
| 2 / Double Parity / RAID6 | 5 - 14 |
| 3 / Triple Parity | 15 - 21 |
| 4 / Quad Parity | 22 - 28 |
| 5 / Penta Parity | 29 - 35 |
| 6 / Hexa Parity | 36 - 42 |