Question about parallel and/or multithreaded 7-zip decompression

You can use mbuffer just like “cp” to copy from the disk to the tape mount, and back:
mbuffer -m 1.9G -i /disk/path/filename.tar.gz -o /tmp/lto8/filename.tar.gz
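
(Going the other way, tape back to disk, should just be the same command with -i and -o swapped, e.g.:
mbuffer -m 1.9G -i /tmp/lto8/filename.tar.gz -o /disk/path/filename.tar.gz)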

In my experience, rsync gives about HALF the I/O performance of other file copying tools, even non-buffering ones, so I would avoid it anywhere performance matters…

In that case, not a good fit for you. Of course you could split the file into parts, but that’s troublesome.

There are several different compression options, so there’s no simple answer.

mksquashfs is used just like “tar” to create the file system with the files you want on it. The squashfs volume is just a single file that you can copy to any media.
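
For example, roughly like this (the paths, the zstd compressor and the processor count are just placeholders, and -comp zstd needs a reasonably recent squashfs-tools):
mksquashfs /disk/path/dataset /disk/path/dataset.sqsh -comp zstd -processors 16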

My understanding, though, is that with rsync you are trading off some performance for the synchronisation checks that rsync does, correct?

(Please correct me if my understanding of how rsync works is inaccurate.)

But according to the documentation it has a maximum file size limit of ~1.75 TiB, which means that to handle files larger than that, I’d still need an alternative solution.

(In the back of my mind, I’m still trying to figure out how I would leverage SquashFS on LTO-8 tape. My thought process is that since it is a read-only filesystem, I can see how it would be useful for, say, CDs, but I am struggling a little to see how it helps with an LTO-8 tape, where the tape itself has a 12 TB uncompressed capacity.)

Yes, we’ve covered that.

SquashFS is just a replacement for TAR/7-zip in your process. It takes a bunch of uncompressed files and converts them into a single, compressed file. It doesn’t replace any other part of your LTO archiving process. You can put many, many SquashFS files on your LTO tape.
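
To make that concrete, once mksquashfs has produced, say, archive.sqsh, you copy it to the tape mount like any other file (same mbuffer trick as before); to read it later, you copy it back and loop-mount it. Paths here are just placeholders:
mbuffer -m 1.9G -i /disk/path/archive.sqsh -o /tmp/lto8/archive.sqsh
mount -t squashfs -o loop /disk/path/archive.sqsh /mnt/archive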

Not sure what you mean.

By default, rsync mostly just checks whether the size and datestamp of the source are the same as the destination’s. (You can do that with a little bit of scripting, without rsync.) It only does a thorough check if you add the -c option.

If the file looks the same, it is skipped. If it differs, then here comes a whole lot of overhead to calculate where the difference lies, which with local files can take longer than just copying the whole file again (hence the -W option).

rsync also checksums the source and destination of the files it does decide to copy, but it doesn’t flush caches to prove that what’s on disk is correct (versus in memory), nor does it write the checksum to disk where it could be used later, so the utility of that part is limited, too.
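
(As a rough illustration of the options mentioned above, with local paths as placeholders: the first form does the default quick size/timestamp comparison, the second adds full checksums and whole-file transfers:
rsync -a /disk/src/ /disk/dst/
rsync -a -c -W /disk/src/ /disk/dst/)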

Gotcha.

Yeah, it is interesting to learn about, but due to the file size limitation it is unlikely to get deployed on that account.

(I wonder, though: if I have a lot of tiny files, can the resulting SquashFS compressed file be larger than the 1.75 TiB limit, or is that the largest individual file it can take in? i.e. is that file size limit on the input or the output?)

Right.

But that’s what I was getting at: rsync does more than just a straight, “unintelligent” copy.

A straight copy works great if you are certain you want to replace the destination, but if the contents are the same and can be skipped, then rsync has the potential to be faster than a straight copy simply because it skips what it has figured out it can skip.

Hope that makes sense.

(It’s one of the reasons why I use rsync, despite knowing that in a pure race for speed, rsync would be slower than an “unintelligent” straight copy.)

I’m trying to decompress a 3.1 TiB .tpxz file that was compressed with pixz, and the error message that I am getting is:

Error decoding stream footer

Anybody know how I can get around this and “force” it to decompress my file to get my data back?

Thanks.

*edit
The specific command that I am using in my attempt to decompress the file is:
pixz -d "file with spaces in the filename.tpxz"

(The exact file name has been redacted/anonymised, but I can tell you that it only contains Latin alphabet characters and spaces: no numbers, no special ASCII characters, none of that.)

So the expected result is that it should decompress the file to “file with spaces in the filename.tar”.
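
(One thing I’m planning to try, on the theory that the .tpxz is a standard xz container underneath and that plain xz decodes the blocks as a stream instead of seeking to the index the way pixz does, is:
xz -dc "file with spaces in the filename.tpxz" | tar -xvf -
No idea yet whether that actually gets past the footer error, but the data blocks would at least be streamed out before xz hits it.)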

Thanks.

Quick question for you - I was doing some more research online and I am wondering why you suggested using zstd instead of pzstd.

I just want to get your perspective and insights on that.

I was doing some more research online and it seems that pzstd is faster and the resulting compressed file size is only maybe 1% or 2% larger (or less), so what I am trying to do now (besides figuring out how to get my data back from the failed pixz decompression run) is to find a replacement for said failed pixz.
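
(For concreteness, the two invocations I’ve been comparing look roughly like this; the thread counts and level 19 are just my test settings:
zstd -T0 -19 file.tar
pzstd -p 16 -19 file.tar)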

Thoughts?

Your help is greatly appreciated.

Thank you.

Update:
So, I think that I’ve found TWO files now where pixz produced invalid archives.

I ended up reaching out to the developer of pixz on GitHub and he’s given me a few things to try so that I can test the integrity of the .tpxz archives that were giving me problems. (It started with one, but I think I might have stumbled across a second file that has issues.)
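
(Independent of whatever he suggested, a generic sanity check I can run myself, assuming the .tpxz is a standard xz container underneath, is:
xz -t -v "file with spaces in the filename.tpxz")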

I think that one of them has already reported “unexpected EOF”, so I am not sure what happened there, because pixz didn’t report any problems or issues when the .tpxz file was being created.

Suffice it to say that with the issues I have encountered, any possible advantage or gain in speed is negated by the fact that I now have to run integrity checks on the archives that were produced, and, with any luck, I’ll hopefully be able to “force extract” said archives to get my data back.

sigh…