Question about RAID5/6 and/or ZFS raidz1/2 theory

alpha754293 · December 27, 2019, 3:05am

For “conventional” RAID5/6 implementations and/or ZFS raidz1/2 implementations, what is the “level of redundancy” normally set at (where “level of redundancy” can be described as either a percentage of the total file size or based on the redundancy target size)?

I am trying to understand how the parity in those implementations works so that I can mimic it using par2cmdline.

Thanks.

The par2cmdline README file talks about how you can set the level of redundancy using one of those two methods described above, and/or either the total number of blocks or the block size (in bytes), and the combination of which will produce enough parity data so that it would be able to tolerate x number of errors totalling y bytes.

I’m trying to understand how the parity theory works so that I can create it manually using the par2cmdline tool. The reason is that I am creating 7-zipped archive files that are then written to LTO-8 tape, and I want to create parity for the original 7-zip file and also a second level of parity on the parity files themselves (e.g. “double parity”), so that instead of retaining triplicate copies of the backed-up data, I might be able to get away with only duplicates.

Thank you.

thro · December 27, 2019, 10:19am

raid 5 = one parity disk
raid 6 = 2 parity disks

you can lose up to 1 or 2 disks respectively without an array failure. if you do raid 5 with 3 disks you blow 1/3 of your space but you could do raid 5 with 7 disks and still only blow one disk on redundancy. how many disks you want in the same group is up to your comfort level.

raid-z is the same thing except that writes are always = full stripe and copy on write (and block hashes for data integrity checking on reads) to avoid an issue with traditional parity raid that can result from incomplete writes and/or bit-rot (silent data corruption)

Trooper_ish · December 27, 2019, 10:57am

And like some other raid systems, the parity and data are spread out, so there is no one “parity” drive, all disks are equal.

There is also checksumming to every chunk stored, so corruption is detected/automatically removed.
And number of copies - you can choose to have multiple file copies if you wish (not really used much, unless only 1 drive)

alpha754293 · January 4, 2020, 2:17am

But this is where I am a tad confused:

By the “conventional” understanding of RAID5 (n-1) parity, the redundancy as a percent of the total capacity, given by 1/n drives, varies with n.

Therefore, expressed at the filesystem level, each file is only 1/n redundant, which, if you have silent corruption in the parity AND the actual file/data, means you’re kind of screwed.

Again, this is what I am coming to understand from reading the par2cmdline README file, and I want to check whether my understanding of the parity theory is correct – in that if there is silent data corruption with the parity data and also with the actual file data, you’re screwed if you don’t have parity on top of the parity, or double parity, correct?

(And by virtue of that, that means that with each increasing layer of parity, you’re only reducing the probability of having an unrecoverable error in the data akin to O^n, correct? Where n is the number of levels of parity)

nx2l · January 4, 2020, 6:13am

maybe you should watch this series

Trooper_ish · January 4, 2020, 11:25am

Um, I presume the par2cmdline thing just makes a checksum/parity file for each data file? Does it also detect mismatches on read, write and scrub?

Raid5 and raidz1 only have a standard of 1bit of parity for each “file”
So by default, parity isn’t over n drives, it is x1 if standard number of copies, or xwhatever if multiple copies.
But each drive has a mix of data and parity, so any of the drives can fail completely, and array okay.

If any corruption, zfs will detect on a read, a write or a scrub, and automatically restore from parity, or rebuild parity from data. Or it will mark the “file” as unrecoverable, and you can source a replacement.

Raidz2 can have 2 drives die, so there are 2 bits of parity per data shared among the pool, like raid6.

Raidz3 is one more level, with 3 way parity, plus data across the pool.
Any 3 drives can die, and data still good (no actual raid equivalent)

But over all, every file is split into blocks, and the block is checksummed, and spread around the pool, so it’s not one solid data file, and one solid parity file, it’s distributed.

alpha754293 · January 6, 2020, 11:54pm

@nx2l
ZFS is not relevant/applicable to LTFS.

Please correct me if I am wrong, but to the best of my knowledge, you cannot format a LTO-8 tape using ZFS.

alpha754293 · January 7, 2020, 12:16am

par2 calculates the parity for a file or multiple files together, the latter being especially useful if you’ve broken up a large archive file into separate volumes, usually equal-sized volumes. That way, you can calculate the parity on the whole/sum of all of the data for all volumes which will be used to recombine to make the whole.

For checksumming, I am using a separate command for that - sha256sum.

I’ve taken to generating a SHA256 checksum for the original archive file as well (the archive being up to 4 TB in size, in a single file), as well as the first and second (e.g. “single” and “double”) parity data.

Again, this is where I am confused because if that were true, then ALL RAID5/raidz1 implementations should give you a 33% level of redundancy, regardless of the number of drives in the array/pool, given that with the bit level XOR computation, you have 2 operands and 1 result.
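For archives in the multi-terabyte range, the same digest that `sha256sum` prints can be computed by reading the file in chunks, so memory use stays flat. A minimal sketch, assuming Python's standard `hashlib`; the function name `sha256_file` and the 8 MiB chunk size are illustrative choices, not anything from the thread:

```python
import hashlib
import os
import tempfile

def sha256_file(path, chunk_bytes=8 * 1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it chunk by chunk."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_bytes), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo on a small temporary file; a real archive path would go here instead.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_digest = sha256_file(tmp.name)
os.remove(tmp.name)
print(demo_digest)
```

The same function can be pointed at the archive and at each .par2 file, and the digests stored alongside them on the tape.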

0 XOR 0 = 0
0 XOR 1 = 1
1 XOR 0 = 1
1 XOR 1 = 0

(Source: https://kb.netapp.com/app/answers/answer_view/a_id/1002948/~/how-is-raid-5-parity-calculated-using-the-xor-operation%3F-)

Therefore, it isn’t that your parity is over n drives, but your level of redundancy is.

If you have 3 drives in a RAID5/raidz1 array, 1 is for parity and 2 is for data.

Therefore, your number of parity drives divided by the total number of drives is 1/3 or 33%, which is your level of redundancy.

If you have 8 drives in a RAID5/raidz1 array, again, it’s 1 drive for parity and n-1 drives for data, or 8-1=7 drives for data. Therefore; again, the number of parity drives (1) divided by the total number of drives (8) will give you a level of redundancy of 1/8 or 12.5%.

If you have 100 drives in the RAID5/raidz1 array, it’s still 1 drive for parity, and 99 drives for data. Therefore, the number of drives for parity (1), divided by the total number of drives in the array (100), will give you a level of redundancy of 1/100 = 1%.

But this doesn’t help to explain why the requirement for RAID5/raidz1 ISN’T fixed at 33% of the array’s capacity, regardless of the number of drives you have in the array (min 3) – in other words, why doesn’t RAID5/raidz1 require you to have multiples of 3 drives in the array, so that it will always retain a 33% level of redundancy, if that statement is true?

Why is the formula for the total available storage capacity always n-1 drives as opposed to 2/3*n, where n must be an integer multiple of 3?
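One way to look at the n-1 question: the truth table above has two operands, but XOR is associative, so it can be chained across any number of blocks, and a single parity block can then cover n-1 data blocks. A minimal sketch, assuming plain byte-wise XOR parity and made-up block contents (the names `xor_parity` and `data` are hypothetical, not taken from any RAID implementation):

```python
from functools import reduce

def xor_parity(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Seven hypothetical data "drives" holding 4 bytes each, plus one parity block.
data = [bytes([i, i + 1, i + 2, i + 3]) for i in range(0, 70, 10)]
parity = xor_parity(data)

# Lose drive 3, then rebuild it from the six surviving data blocks plus parity.
survivors = data[:3] + data[4:]
rebuilt = xor_parity(survivors + [parity])
print(rebuilt == data[3])
```

The catch is that this only recovers one missing block per stripe, and only when it is known which block is missing. That is why the overhead can shrink to 1/n while the tolerance stays at exactly one failed drive.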

But that’s part of the question.

With par2, you can specify the number of blocks as well as the size of the blocks and the level of redundancy.

With RAID5/raidz1, the level of redundancy decreases as the total number of drives in the array increases, and no one seems to know how many blocks or what is the size of each block that is used for the purposes of calculating the parity.

Again, the goal is to understand the theory behind these RAID calculations (be it RAID5/raidz1 and/or RAID6/raidz2 and/or “RAID7”/raidz3) so that they can be mimicked using par2 for the purposes of writing the data to LTO-8 tape.

The whole point about ZFS is irrelevant with respect to LTO-8 tapes because you cannot format a LTO-8 tape as anything other than LTFS.

This last point is absolutely critical and crucial to the entire discussion.

Trooper_ish · January 7, 2020, 12:21am

does this help?
https://calomel.org/zfs_raid_speed_capacity.html

tape is separate. Don’t use ZFS to format a tape. you can send/receive to/from a tape.

alpha754293 · January 7, 2020, 12:35am

Not really, because part of my original question is specifically about writing the data to tape:

The entire question is predicated on this premise because this is what I am trying to use it for.

I want to use the par2/par2cmdline tool/program in order to calculate the parity on my data so that I can write both the data and its parity (or however many levels of parity, depending on the probability of failure/how paranoid I might be about said failure) to said tape.

The question is that par2 has some options that are part of the tool, and I am trying to understand how “conventional” RAID systems determine the level of redundancy, the number of blocks, and the size of the blocks so that I can then enter those as options to the par2 tool. That way I will generate/compute the respective parity data, like a “conventional” RAID5/6 system, and can then write my original data, along with any and all of the parity data, to said tape.

This is the goal all along.

This is why I am trying to understand the theory behind the RAID calculations so that I will know how to set the level of redundancy, the number of blocks, and the size of the blocks because those are options to the par2/par2cmdline tool.

Thank you.

Trooper_ish · January 7, 2020, 12:48am

Thanks, yeah, okay I gotcha. Sorry for the mis-understanding.

Good luck researching parity and raid.

nx2l · January 7, 2020, 1:44am

i doubt it,

but i thought you were looking for more info on how zfs works… my bad

oO.o · January 7, 2020, 2:39am

Looks like everyone skimmed your OP because based on your title it appeared that you were asking a very common question about fundamental raid/ZFS concepts. I would change your title to something about par2cli and ltfs.

How many tapes do you want to distribute data across?

Also, there is a verification option in the ltfs copy command that will checksum for you.

alpha754293 · January 11, 2020, 3:35am

Well…I think that it just goes to show, once again, that people only read the headlines and not the body of the text which contains the actual question and then make assumptions about what the question is, because they haven’t read it.

My philosophy is if I don’t write it, that’s my fault. If the audience doesn’t read it, it’s theirs.

In regards to your question “How many tapes do you want to distribute data across?”

Zero. Each tape must be self-sufficient and stand on its own. This is why the par2 command acts/works on the file level as opposed to the disk/drive/tape level or block level (for block array/pooled devices).

Yeah, I know that LTO tapes will write and generate a checksum per block on the tape itself, and I am not sure how it would actually correct itself, but I figured that the par2 command would be vastly more explicit and would presumably give me a more visible way of being able to detect and correct any errors with the data that the tape may have or develop over time.

And then I am using a father-son (duplicate) storage method only because grandfather-father-son (triplicate) is still a little too expensive of a pill for me to swallow, although, yes, I recognise that’s preferred.

oO.o · January 11, 2020, 6:12am

Ok, but parity only works in at least a 3-part storage system. You need 3 drives for raid5 — 2 data and 1 parity. There is no 1 data, 1 parity solution. If there were, a parity alternative to raid1 would exist. It doesn’t.

For tape, the best solution is to copy with checksum verification. If you need redundancy, then you should mirror to another tape.

If you’re concerned about “bit rot”, the whole point of tapes is that they are resistant to that sort of data corruption. You put them on the shelf and they shouldn’t degrade for 30 years…

nx2l · January 11, 2020, 7:22pm

given the ideal temperature and humidity

alpha754293 · January 27, 2020, 6:04pm

With par2/QuickPar, you CAN compute the parity data against itself.

It uses Reed-Solomon coding in order to compute the parity information so that it can then be used to heal/repair files that may have been damaged.

I strongly encourage readers to google the par2cmdline readme file and read it so that they can get an understanding of how par2 works. At the bottom of the readme is the reference/citation to Plank’s paper “A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems” (1996), which provides a further, in-depth, technical discussion of the mathematical details that are used by par2 in order to compute the parity information that can be used on a file, upon itself, in order to repair itself should the file become damaged in any way.

(You can actually take a file, and purposely inject a bunch of random garbage in the middle of the file and use par2/QuickPar to correct it.)

The intention of this isn’t only to protect against bit rot, but also to protect against other failure modes as well. Because the tapes are magnetic media (BaFe), if the tapes themselves aren’t protected from a magnetic field, you could, in theory, induce errors/damage to the data via that failure mode as well.

Checksums might be good to detect a problem, but I don’t know if checksums are good enough to fix or correct a problem. (Because if checksums were sufficient, then RAID systems wouldn’t need parity data.)

Yup.

oO.o · January 27, 2020, 6:43pm

github.com

Parchive/par2cmdline/blob/master/README.md

# par2cmdline

**par2cmdline** is a PAR 2.0 compatible file verification and repair tool.

To see the ongoing development see: <https://github.com/parchive/par2cmdline>

OpenMP multithreading was originally developed by Jussi Kansanen: <https://github.com/jkansanen/par2cmdline-mt>

The original development was done on Sourceforge but stalled.

For more information from the original authors see <http://parchive.sourceforge.net>

This is also the place for details on the PAR 2.0 specification and discussion of all things PAR.

## What exactly is par2cmdline?

par2cmdline is a program for creating and using PAR2 files to detect damage in data files and repair them if necessary. It can be used with any kind of file.

## Why is PAR 2.0 better than PAR 1.0?


The fundamental difference between RAID parity and par2 is this:

These PAR2 files will enable the recovery of up to 100 errors totalling 40 MB of lost or damaged data from the original

In RAID5, you have 3 components (2 data, 1 parity). You can lose any one of them, so either half the data (but a specific half) or the parity. This is designed around each piece being on a different drive. If a drive dies or is removed, it can be replaced. However, if a piece of data or parity becomes corrupted, you could rebuild it, but you wouldn’t necessarily know which is correct (parity or data).

In ZFS, this is solved by checksums, so it will know whether to trust the data or the parity in case of a mismatch. In the case of Linux’s MD raid, it will just report that there is a mismatch and you’ll have to figure it out from there. On hardware RAID controllers, I’ve seen settings like “always trust parity” or similar.
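The division of labour described above (checksums locate the bad block, parity rebuilds it) can be illustrated with a toy stripe. This is only a sketch under stated assumptions: per-block SHA-256 checksums and single XOR parity stand in for what ZFS actually does, and the names `xor_blocks`, `digest` and `stored` are hypothetical.

```python
import hashlib
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def digest(block):
    return hashlib.sha256(block).hexdigest()

# Hypothetical stripe: three data blocks plus XOR parity, each with a stored checksum.
data = [b"alpha---", b"bravo---", b"charlie-"]
parity = xor_blocks(data)
stored = data + [parity]
checksums = [digest(b) for b in stored]

# Silently corrupt one data block.
stored[1] = b"bravo-!-"

# Parity alone only says that something in the stripe no longer adds up.
stripe_consistent = xor_blocks(stored[:-1]) == stored[-1]

# Checksums say which block is wrong; parity then rebuilds that block.
bad = [i for i, b in enumerate(stored) if digest(b) != checksums[i]]
if len(bad) == 1:
    others = [b for i, b in enumerate(stored) if i != bad[0]]
    stored[bad[0]] = xor_blocks(others)

print(stripe_consistent, bad, stored[:3] == data)
```

Without the checksum list, the mismatch would still be detected, but there would be no way to tell whether the data or the parity should be trusted, which is the MD RAID situation described above.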

In RAID6, you have 2 data and 2 parity, so if there was a mismatch, you’d presumably have 2 matching and 1 mismatch. In that case, you could hypothetically trust 2 vs 1, but I’m not sure if/where that is implemented. Also possible that I am over simplifying it (or don’t fully understand it). I don’t think you can have MD raid do automatic data repair based on that though. This would certainly be the case in a RAID1 array with 3+ disks, but again, I’m not sure a majority rules repair exists in any RAID.

Based on the par2 readme, the use-case is more like a checksum with limited repair capability than RAID. In the example, it is only able to recover the file if 95% of it is intact. It is not equivalent to saving multiple copies as you describe here:

Assuming 5% is consistent, you’d need to spread the data across 20 tapes plus 1 tape for parity just to tolerate a single lost tape.

alpha754293 · January 27, 2020, 7:53pm

You can control that.

That’s my point.

One of the files that I computed the parity data for is 4.4 TB in total size.

10% of that file (as given by the command line argument -r10) will result in a total amount of parity data of 440 GB.

Further, you can either have a fixed total number of blocks or a fixed block size (and vary the total number of blocks), and the combination of at least two of the three elements determines how “tolerant” of a fault, damage, or error the file itself is.

(P.S./BTW – par2/QuickPar is a nifty little program because, unlike RAID systems, which depend on a minimum of 3 drives for RAID5/6 (which is a maximum of 33% to 67% fault tolerant), it works at the file level, so individual files can have varying levels of fault tolerance. You cannot do/control that with RAID systems unless you have physically separate hardware whereby the RAID arrays are managed separately (either physically or logically).)

The point about “level of redundancy” is related to the idea of physical RAID systems whereby the total capacity of the RAID5 array is n-1 drives * lowest capacity (and n-2 drives * lowest capacity for RAID6 systems), which also implies that the level of redundancy for RAID5 systems is 1/n and 2/n for RAID6 systems.
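How block count, block size and redundancy level interact can be estimated with a little arithmetic. A rough sketch, not par2cmdline's actual code: the function `par2_plan` is hypothetical, the rounding of block size to a multiple of 4 bytes and the simple floor on recovery blocks are assumptions, and the 800 MB file split into 2000 blocks at 5% is an assumed input chosen to reproduce the “100 errors totalling 40 MB” figure quoted from the README.

```python
import math

def par2_plan(file_bytes, block_count, redundancy_pct):
    """Estimate block size and recovery capacity for a par2-style layout."""
    block_size = math.ceil(file_bytes / block_count)
    block_size += -block_size % 4      # assumption: block size is a multiple of 4 bytes
    source_blocks = math.ceil(file_bytes / block_size)
    # Assumption: simple floor; the real tool may round differently.
    recovery_blocks = source_blocks * redundancy_pct // 100
    return {
        "block_size": block_size,
        "source_blocks": source_blocks,
        "recovery_blocks": recovery_blocks,
        "recoverable_bytes": recovery_blocks * block_size,
    }

plan = par2_plan(800 * 10**6, 2000, 5)
for key, value in plan.items():
    print(f"{key}: {value}")
```

The point to take away is that damage is repaired in whole blocks: each damaged region costs at least one recovery block no matter how few bytes were hit, so many small scattered errors use up recovery data faster than one contiguous error of the same total size.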

In other words, if you have the minimum 3 drives, RAID5 would have a level of redundancy of 33% and RAID6 would be 67%.

If you have 8 drives, it would be 12.5% and 25% respectively.

If you have 100 drives, it would be 1% and 2% respectively.
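The percentages above follow directly from the n-1 and n-2 capacity formulas. A small sketch that reproduces them, assuming a single RAID group of equal-sized drives (the helper names `parity_fraction` and `usable_tb` are illustrative only):

```python
def parity_fraction(drives, parity_drives):
    """Fraction of raw capacity spent on parity in a single RAID group."""
    if drives <= parity_drives:
        raise ValueError("need more drives than parity drives")
    return parity_drives / drives

def usable_tb(drives, parity_drives, drive_tb):
    """Usable capacity: (n - parity drives) * smallest drive size."""
    return (drives - parity_drives) * drive_tb

# RAID5 needs at least 3 drives; RAID6 needs at least 4.
for n in (4, 8, 100):
    r5 = parity_fraction(n, 1)
    r6 = parity_fraction(n, 2)
    print(f"{n:>3} drives: RAID5 {r5:.1%}, RAID6 {r6:.1%}, "
          f"RAID5 usable with 10 TB drives: {usable_tb(n, 1, 10)} TB")
```

Note that the fraction is a storage overhead, not a measure of how much damage can be tolerated: the tolerance stays fixed at one drive (RAID5) or two drives (RAID6) however large n gets.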

With the par2 command line argument “-r#”, you can specify the level of redundancy on a per-file basis. So if you have something that’s hyper critical, you can specify 75% level of redundancy ("-r75") where if the file is 100 MB, you can compute and write out 75 MB of parity data that can be used to recover that 100 MB file.

Conversely, if you have something that’s less critical, you can set it to be like 1% level of redundancy ("-r1") so that same 100 MB file would only have a total of 1 MB of parity data.

For RAID5/6, it isn’t necessarily always 3 components, i.e. the number of drives/devices permitted in an array isn’t required to be a multiple of 3.

If you have eight 10 TB drives in a RAID5 array, the usable capacity available for data will be 70 TB ((8-1) * 10 TB = 7 * 10 TB = 70 TB).

Note that there are still eight drives (“components”) to that. The minimum is 3 for RAID5/6, but you are not required to have a configuration that is only multiples of that minimum number.

(Edit: I just realised that I made a boo boo. In all spots where I wrote that the minimum number of disks for RAID6 is 3, that is where I am wrong. The minimum is four with a total available capacity of (n-2 drives)*lowest capacity. Sorry about that. My bad.)

I think that ZFS uses the checksum to quickly “spot check” whether there is a problem with the data (e.g. I tend to think of it as the “checking” part of “ECC”), but I don’t think that checksums are enough to actually do the “correcting” part. (Because if checksums were sufficient, then in theory, there shouldn’t be a need for parity data.)

It is my understanding that even with md RAID, that if you have RAID5, it will use your system’s main CPU to calculate the parity information and store it (which is really NOT that different than ZFS for the same reason/purpose) or hardware based RAID controllers.

The example they use in the README, I think talks about using 5% level of redundancy.

But you can actually specify that in the command line arguments for the par2 tool/command.

In theory, I might be able to specify -r99 (99% redundancy) so that it will be fault-tolerant to 99% of the file.

So if you have a 100 TB file, you can (in theory) replace 99 TB of that 100 TB file with garbage, and with the parity data that it generates/calculates, it should be able to recover from that.

In my experience with using par2/QuickPar, it is definitely more than just checksumming because as I mentioned, my understanding is that a checksum can only be used to detect problems, but it can’t correct them, and I’ve definitely used par2/QuickPar to fix/correct issues with them.

In at least two cases, out of a multi-volume archive, I was actually missing entire “chunks” of the archive (i.e. some of the volumes of the multi-volume archive were just outright missing), and using the parity data from par2/QuickPar, it was able to reconstruct what that missing volume was supposed to be, so it’s definitely more than just checksumming.

And no, even with x% level of redundancy, the data isn’t spread across 1/x number of tapes. When I write the data to LTO tape, I write my original file, plus the .par2 file and also my .par2.par2 file. (I’ve implemented double parity where the second level of parity currently is on the first level of parity. Not sure if that’s what I SHOULD be doing (hence the question here), but that’s how I’m doing it currently.)

As far as the tape is concerned, it just sees the .par2 files and the .par2.par2 files as just another file that’s being written to tape, and is practically indistinguishable from any other file that’s being written to said tape.

alpha754293 · January 27, 2020, 8:04pm

@oO.o

If you have a file that you don’t particularly care if it gets toasted or not (or you can make a backup of it and store it safely somewhere), and you are more proficient than I am at being able to inject/overwrite a portion of that file somewhere in the middle using some kind of shell scripting or something along those lines – I think that you might find the par2 tool perhaps interesting to you, in terms of what it can do.

To me, the biggest advantage is that rather than getting RAIT (redundant array of independent tapes) and having to make sure that groups of tapes are working in a synchronous manner for RAIT to work, I can work with the parity at the file level, and I can then write the resulting parity files/data to tape as just more data to be written to tape.

So with my LTO-8 tapes, they have a native 12 TB capacity. They recommend leaving about 10% of free space so that the tape itself can work more efficiently/effectively (which leaves about 10.8 TB of available space), and then I’m using a 10% level of redundancy, so I am writing about 9.612 TB worth of ACTUAL data onto said tape and about 1.188 TB of parity data (1.08 TB for first level of parity, and an additional 0.108 TB for double parity). But in total, I’m still writing around 10.8 TB to tape.

I would make the recommendation that you try and play with it because it’s a nifty little tool where you can compute parity on data, files, and systems that otherwise may not have parity protection (given that not all systems are behind a RAID array).
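The split of a tape budget between data and two layers of parity can be worked out as follows. This sketch assumes the redundancy percentage is taken relative to the data being protected (so the second layer is 10% of the first layer's parity) and ignores any par2 file overhead. Under that assumption the split comes out slightly differently from the figures above, which take 10% of the 10.8 TB total. The function name `tape_budget` is hypothetical.

```python
def tape_budget(usable_tb, redundancy_pct):
    """Split usable tape space into data, parity, and parity-of-parity.

    Solves data * (1 + r + r^2) = usable_tb, where r is the redundancy fraction.
    """
    r = redundancy_pct / 100
    data = usable_tb / (1 + r + r * r)
    first = data * r        # parity computed over the data
    second = first * r      # parity computed over the first parity set
    return data, first, second

data, first, second = tape_budget(10.8, 10)
print(f"data {data:.3f} TB, parity {first:.3f} TB, parity-of-parity {second:.3f} TB")
```

Either way the total written stays at about 10.8 TB; the difference is only in how much of it is payload.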

And then, like I said, if you know how to inject/overwrite some stuff in the middle with rand(), you can then test and find out how par2 works to recover the file that you’ve overwritten with random stuff in the middle.
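For anyone who wants to try that experiment, here is one possible way to overwrite a span in the middle of a file with random bytes. It is a sketch only: the helper `corrupt_middle` and the file names are made up for illustration, and the demo works on a throwaway file in a temporary directory so nothing important gets damaged.

```python
import os
import shutil
import tempfile

def corrupt_middle(path, length=4096):
    """Overwrite `length` bytes around the middle of `path` with random bytes, in place."""
    size = os.path.getsize(path)
    length = min(length, size)
    offset = (size - length) // 2
    with open(path, "r+b") as f:       # r+b: modify in place without truncating
        f.seek(offset)
        f.write(os.urandom(length))
    return offset, length

# Demo: make a 1 MB file of zeros, copy it, and damage the copy.
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "sample.bin")
damaged = os.path.join(workdir, "sample_damaged.bin")
with open(original, "wb") as f:
    f.write(bytes(1_000_000))
shutil.copyfile(original, damaged)
offset, length = corrupt_middle(damaged)
print(f"overwrote {length} bytes at offset {offset}")
```

In a real test the order would be: create the recovery files with par2 first, then damage the file, then run par2's verify and repair steps and compare a checksum of the repaired file against the original.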

Try it.

Sounds to me like you might have a little bit of an interest in it.

I use it (as you can see) for stuff that isn’t protected via some kind of RAID system (e.g. I can protect my boot drives without needing the 3-drive minimum for RAID5), nor do I use PXE to boot my systems. (PXE over GbE = too slow. And I don’t have iSER set up yet, which introduces other issues.)