Question about parallel and/or multithreaded 7-zip decompression

After decompression, extracting tar files is really simple.

The file format is just an endlessly repeating sequence of file header and file data. It doesn’t really use any CPU whatsoever compared to compression/decompression or even just checksumming. It’s mostly just handing data to the filesystem.

If you’re decompressing onto a modern filesystem that supports checksumming and encryption, then the underlying filesystem might end up using some CPU, but it should do so in the background, and it shouldn’t noticeably slow down writing in any significant way.

Maybe if you have a bunch of small files, that might benefit from some kind of parallelism because of per-file OS overhead?

Or, if the filesystem or OS enforces some consistency guarantees, parallel/lazy closing of files from different threads might help, e.g. on Windows with Bitdefender antivirus. Not sure; generally it’s not really an issue, and there aren’t that many untarring utilities.

Thank you.

The polar opposite of a tar.gz is a SquashFS file:

Your data can remain as a single compressed file on disk, and you can loopback mount it immediately, read any files you want from it, and copy them elsewhere as needed.

Not sure if it’s useful in your circumstance, but it’s another option you now know about.
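
For anyone curious, a minimal sketch of that workflow (paths are placeholders, and -comp zstd assumes squashfs-tools and your kernel were built with zstd support):

Create: mksquashfs /data/project project.sqsh -comp zstd -Xcompression-level 19

Mount: sudo mount -o loop,ro project.sqsh /mnt/sqsh

Copy out what you need: cp -a /mnt/sqsh/some/subdir /somewhere/else && sudo umount /mnt/sqsh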

How come I don’t hear about this FS as much? Are there any problems you’ve encountered that you think are holding it back from mass adoption?

Background:

  • LTO-8 tape - should be able to get 300MB/s
  • 100G network - not a bottleneck
  • 7355 GB of data, call it 8T conservatively for overheads
  • “… it takes me about a day for me to pull down the data from tape, and then at least another week to 10 days for me to decompress the archive file”

Question: What is the bottleneck for the 10 days?

8 TB of data at 300 MB/s is just over 7 hours, uncompressed (roughly 8,000,000 MB ÷ 300 MB/s ≈ 26,700 s ≈ 7.4 hours).

Why not keep it simple and keep it uncompressed? Is the number of tapes an issue? Surely the cost of a tape is far less than the time spent even thinking about it? :slight_smile:

If the compression throughput is lower than the tape speed, you’re wasting time with compression.

zstd-1 on a single core has ample performance to keep up, but that assumes the reader of the tape is able to stream the data through zstd to decompress it before storage.

If the reader of the tape needs the final dataset as uncompressed files, you’ll want a method that supports streaming rather than creating unwieldy temporary files.

If you need to minimize final storage, squashfs is a great suggestion, but remember you’ll need to recreate it from scratch every time you need to modify/update/redact any files. If you need to update files, you could use a ZFS pool in a file (works fine; zstd-9+ would probably be suitable).
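
If it helps, a rough sketch of the pool-in-a-file idea (pool name, path, and size are made up; compression=zstd-9 needs OpenZFS 2.0 or newer):

  truncate -s 8T /archive/pool.img            # sparse backing file
  zpool create archivepool /archive/pool.img
  zfs set compression=zstd-9 archivepool
  # ...update files under /archivepool with rsync as needed, then:
  zpool export archivepool                    # pool.img is now the single file you send to tape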

If you need to minimize transfer time, assuming both sides can stream tar/zstd to/from tape, I’d just pipe tar through zstd -T0 -19.
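
Something like the following, assuming a standalone drive at /dev/nst0 and block sizes that suit your drive (all placeholders):

Write: tar -cf - /data/project | zstd -T0 -19 | dd of=/dev/nst0 bs=1M

Restore: dd if=/dev/nst0 bs=1M | zstd -d | tar -xf - -C /restore/target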


Uses zstd on all cores. Preserves file mode and permissions.

Archive: tar -I 'zstd -19 -T0' -cf ./archive.tzst .

Extract: tar -I 'zstd -T0' -xf ./archive.tzst .

You’re welcome.

Bonus: encryption using your GitHub SSH keys (encrypt to your public keys, decrypt with your private key), via age, written by the security lead of the Go programming language (https://github.com/FiloSottile/age).

Get age: curl -L https://github.com/FiloSottile/age/releases/download/v1.0.0/age-v1.0.0-linux-amd64.tar.gz | tar -xz age/age --strip-components=1

Encrypt: curl -L https://github.com/you.keys | ./age --encrypt -R - archive.tzst > archive.tzst.age

Decrypt: ./age --decrypt -i ./your_key ./archive.tzst.age > ./archive.tzst

That one’s on the house, gents.

Don’t use 7-zip for important shit.

@wendell Please educate the people about age if you can. It’s PGP/GPG but with normal SSH keys! Open, single all-in-one binary.


Pardon my ignorance, but wouldn’t a storage/file server be more applicable for this use case?

It would allow you to back up to the tape drive, let the client(s) work faster, and host the boot server and related files.

Or am I just looking at this the wrong way? My apologies in advance if I am.

I’ve been busy staring at code at work and never got around to implementing distributed / chunked compression and decompression client/server.

… but now that @thetazman mentioned it, imagine if both your storage … and your compute were distributed.

You could compress 8TB of data in 8 seconds with enough cores and HDD/SSD and a decent couple of network switches :sweat_smile:


SquashFS is commonly used in LiveCDs, AppImage files, some embedded systems, even a bit in Android. I assume it’s just not a common problem most people come across, and is just one of those quietly useful back-end utilities people don’t hear much about.


If the target is a random access device, using a file system with compression is more flexible and faster, and allows speedy updating, say with rsync or an incremental send/receive on ZFS or btrfs. For distributing software or data, tar is the incumbent, with half a century (ish) of inertia.
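
For example, an incremental ZFS update is roughly this (dataset and host names are placeholders):

  zfs snapshot tank/data@monday
  # only the blocks changed since @sunday cross the wire
  zfs send -i tank/data@sunday tank/data@monday | ssh backuphost zfs receive backup/data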


Can you use this on an LTO-8 tape?

I don’t think you can, but I am not confident in what I am thinking either.

The key/operative word here is SHOULD.

In practice, it REALLY depends on the file that it is working on, because I have seen my drive dip down into the tens of kB/s, and it usually tops out at around 200 MB/s.

Lack of parallelisation during the decompression stage, mostly.

It’s actually a non-trivial task to decompress a ~6.2 TB 7-zip archive to ~7.3 TB of data.

In my case, this is, unfortunately, not true.

For a professional IT department with a professional IT department budget, I would agree with you.

For home use, every penny counts. Disposable income is limited, and therefore proposed solutions have to assume the lowest, most minimal budget that you can squeeze the solution into. (In my experience speaking with IT professionals, this is an area where said IT professionals could do better.)

So, in my case, running the compression/decompression is entirely driven by trying to eke out and save space, which means that I don’t need to buy as many tapes if I can be more efficient with the storage of the data itself by using the compute power that’s available to me via my micro HPC cluster.

Like if you have a tape library that’s the size of a small room, with 8 robots running around grabbing tapes, and at least 8 tape drives, maybe even more, then sure, storing the raw, uncompressed files would be trivial, relatively speaking, and who cares how many tapes it takes.

But when you have to try and squeeze 240 TB onto 20 tapes, and because I’m still using a father-son tape rotation, which really means that you are trying to squeeze 240 TB onto 10 tapes, explicit compression (rather than depending on the tape drive’s compression algorithm) really makes a difference.

This statement would be true if and only if you are ignoring the space/sizing constraint of the problem.

Does it use all cores for decompression as well? I haven’t been able to find any documentation that explicitly states this.

How do the resulting file sizes compare against different compression tools/algorithms (that also support parallel decompression)?

Good to know.

It’s “moderate” in terms of importance (re: the data that I am sending to tape).

It’s just CAE results/project files/folders, so encryption isn’t hyper critical.

No worries.

I now have five servers, I guess, technically, that were hosting the data.

The data got sucked into/consolidated onto my cluster headnode and it is being prepped to be sent to tape (which is what this parallel compress/parallel decompression is all about).

Live servers (e.g. RAID) aren’t a backup.

Yeah, so the tapes are too slow to work with for live data because of their linear nature. Moving around on the tape (the equivalent of seeking on a hard drive) is ridiculously slow; tape isn’t made for that.

So a part of the idea with parallel compression/decompression of an archive file is so that the archive file can be written to the tape as a sequential transfer rather than a series of random reads/writes.

Yeah…they have that.

I’ve played around with Gluster for a little bit.

The problem with distributed computing is whether each node is to handle its own, distinctly separate and unique task, or whether it is doing distributed parallel processing.

(For example, I haven’t seen a MPI implementation of any of the aforementioned parallel decompression tools.)

From what I’ve seen/experienced, Gluster FS makes more sense for large(r) organisations/enterprises where you might not have one giant, centralised computer/server room that’s smack dab in the middle of your office, but maybe, perhaps, a bunch of distributed compute closets (so you’re not running kilometres and kilometres of wiring back to the central switch), so you have “leaves” in different areas of the building. So, to users and sysadmins, it looks like it’s one giant server. But in physical reality, it’s a bunch of smaller servers combined together.

Gluster does it. Lustre does it. Ceph does it.

And there’s parallel distributed computing (e.g. MPI) and/or an OpenMP/MPI hybrid, where it uses MPI for internode communications but OpenMP for intranode/intersocket communications, which I’ve also used before as well. (The hybrid approach really only pays off once you’re > 128 cores, because that’s about where the pure MPI overhead starts to really hit the scalability ratio.)

Yeah…if I were running NVMe over Fabrics, you could do that.

Actually, what you USED to be able to do with GlusterFS version 3.7 is create a ramdrive (tmpfs) on each node and then run a distributed striped Gluster volume on top of that, so you had a DISTRIBUTED RAM drive running over the 100 Gbps IB network. And because I was testing it with the Xeon E5-2690 (v1), said 100 Gbps IB network was actually FASTER than the socket-to-socket bandwidth within the same node (which is pretty crazy when you think about it). Sadly though, as of Gluster version 5, I think, that ability was deprecated, and it was completely removed in Gluster version 6.

:frowning:
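
For the curious, the 3.x-era setup was roughly this (hostnames, sizes, and brick paths are placeholders, and the striped volume type is exactly the feature that was later removed):

  # on each node: carve a brick out of RAM
  mount -t tmpfs -o size=64G tmpfs /bricks/ram && mkdir /bricks/ram/brick
  # on one node: tie the bricks together as a single striped volume over RDMA
  gluster volume create ramvol stripe 4 transport rdma node1:/bricks/ram/brick node2:/bricks/ram/brick node3:/bricks/ram/brick node4:/bricks/ram/brick
  gluster volume start ramvol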

It would be interesting to see how well and/or how poorly, this would work on LTO-8 tapes (if at all).

I think that it’s important for people to keep the LTO-8 tape in mind as a lot of the solutions may drop out of the running once you add this “layer” back into the picture.

Well, that sounds like a big problem; I’d look into why that is happening. I always use mbuffer to smooth out the hiccups and keep an LTO drive streaming at full speed.
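
In case it’s useful, this is roughly what that looks like when writing straight to the drive (device name and buffer size are just examples):

  # the 4 GiB buffer absorbs stalls on either side so the drive keeps streaming
  tar -cf - /data/project | zstd -T0 -19 | mbuffer -m 4G -P 80 -o /dev/nst0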

@alpha754293 do you program in any language?

Doing a “bag of tasks” where each task is a proxy for a piece of the tar file wouldn’t be “that hard”. Maybe I or someone else here can help you with it.

You can put any kind of file on an LTO-8 tape. I don’t expect mounting it directly from a tape would perform well, if that’s what you meant.

It doesn’t happen very often. There are times where it would pause, partially eject the tape, then suck the tape back into the drive and continue writing/reading. So I assume there might be some calibration thing going on, or, in layperson’s terms, “it got tired and needed a quick break before continuing to write the files onto said tape”.

I really have no idea, to be honest.

Pure write speed isn’t usually that big of an issue because the compressing/decompressing of the underlying data usually takes more time than it would take for me to write the resulting compressed file to said tape.

And when you google “parallel decompression”, a LOT of results come up for parallel compression, but for parallel decompression, not much comes up. (Hence why I’m here.)

If I am being honest, I had to quickly google mbuffer.

Right now, I am just formatting the tapes using the mkltfs command, then creating a mount point (e.g. /tmp/lto8), mounting the formatted tape to that mount point, and then using rsync to send the data to tape.

So…I’m not sure how I would integrate mbuffer with rsync for that purpose.

It would be nice to be able to keep the stream going so that it would be able to keep writing at its fastest speed, but again, the write speed is generally, less of an issue for me compared to the parallel compression, and now, my foray into parallel decompression methods.

Sadly, no.

My faculty advisor, in university, said that I was quote “completely useless to him as a mechanical engineer” on account that I can’t and I don’t program.

(I do some really like, simple, stupid shell scripting, which is really just a Linux version (in my mind) of writing a batch file (that I used to write for DOS/Windows).)

But anything more complicated than typing exactly the same commands I would type in a terminal/console, and dumping it into a batch file - that, I don’t do, unfortunately.

Yeah, I wasn’t sure what you meant, specifically, when you mentioned it; i.e. that I would format the LTO-8 tape using SquashFS as opposed to using LTFS.

(I wasn’t sure if this was something that you could do. You don’t know what you don’t know, and so, for all I know, I didn’t know if the proposal was to format the tape using SquashFS.)

I did quickly peruse through the documentation link that was provided though, and this caught my eye:

“The index cache allows Squashfs to handle large files (up to 1.75 TiB) while retaining a simple and space-efficient block list on disk. The cache is split into slots, caching up to eight 224 GiB files (128 KiB blocks). Larger files use multiple slots, with 1.75 TiB files using all 8 slots. The index cache is designed to be memory efficient, and by default uses 16 KiB.”

So in my case, where my uncompressed data is 7.3-ish TB, and if I compress it as a 7-zip file it’s about a 6.2 TB file, how would SquashFS handle large files like this?

(One of the files from the CAE/HPC application itself, when compressed into its own archive format, is currently about 5.5 TB itself.)

On the surface, it would appear that SquashFS won’t be able to handle a single file of that size, no? (It says in the table, max file size: ~2 TiB)

So, I’m not sure if this would work for me due to the fact that I have files that exceed those sizes.

I would be interested to see:

a) what kind of compression ratios it’s able to get, and

b) whether or not it can do parallel decompression.

And then, it also states that it needs other tools like mksquashfs to create a populated filesystem, so I’m not sure how that works and how that would interact with the LTO-8 tape.

(Like if I created the compressed SquashFS populated filesystem, how do I send that to tape then? And would it have to decompress it to write to tape or would it write the compressed state to tape?)

Hey, talking about useless, I’m a programmer who works on ads, and I don’t think I’ve ever placed an ad or rented website space for an ad to be shown. Mechanical engineering without programming sounds a lot more useful.

Ah, I’ll have to think about this a bit… (challenge accepted?-maybe…)

So you use gluster mounted storage?
Are you limited in the amount of temporary storage?

(e.g. is it ok to have one or two more copies of the data in temporary directories?)

How many machines or cores do you think you could use in total towards this goal? And how many HDDs do you think you have? (Spindles suck; reads/writes should be in big blocks and as sequential as possible for optimal aggregate throughput.)

Reason I’m asking is I think it might be possible to parallelize compression and decompression, maybe even untarring across multiple machines using a few simple shell scripts.

Maybe the best way to explain it would be like a “folding at home” setup, but you just have a shell script listing a tempdir every minute or so… when a file appears one of these 5 or 10 machines compresses it.

Since you can cat .xz files together, this would work just fine.
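
To make that concrete, a very rough sketch (the shared paths, chunk size, and the choice of zstd instead of xz are all assumptions; like .xz, concatenated .zst frames decompress as one stream):

  # producer: split the tar stream into fixed-size chunks in a shared directory
  tar -cf - /data/project | split -b 64G - /shared/incoming/chunk.
  # each worker loops over the shared directory, claiming and compressing chunks
  for c in /shared/incoming/chunk.*; do
      mkdir "$c.lock" 2>/dev/null || continue       # crude claim so two workers don't grab the same chunk
      zstd -T0 -19 --rm "$c" -o "/shared/outgoing/$(basename "$c").zst"
  done
  # collector: reassemble in order; the concatenated frames are one valid compressed stream
  cat /shared/outgoing/chunk.*.zst > project.tar.zst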

Does it use all cores for decompression as well? I haven’t been able to find any documentation that explicitly states this.

For compression, yes: -T0 uses all cores, and zstd was developed to do that well. Decompression in stock zstd runs on a single thread per stream, though; it’s very fast, but it won’t spread one archive across all cores. The pzstd tool (shipped alongside zstd on many distros) compresses into independent frames and can decompress those in parallel.
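
A hedged example of that, assuming pzstd is installed (Debian/Ubuntu ship it in the zstd package) and using tar’s -I the same way as above (-p is the worker thread count, a placeholder here):

Archive: tar -I 'pzstd -p 16 -19' -cf ./archive.tzst .

Extract: tar -I 'pzstd -p 16' -xf ./archive.tzst (GNU tar appends -d itself, and pzstd decompresses the independent frames in parallel)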

How do the resulting file sizes compare against different compression tools/algorithms (that also support parallel decompression)?

zstd was openly created and standardised by Facebook specifically to compress media/text/binary files very quickly on multicore systems, and to do it well. The Linux support is excellent too.

IMO it’s essentially the best in the industry currently.

7-zip, while being a step above something like RAR in 2022, is just far less auditable and far less actively maintained compared to zstd at this point, especially on Linux. p7zip (the Linux port) hasn’t been updated in forever, and the main Windows clients are prebuilt binaries hosted on SourceForge (read: the build process is NOT transparent).

Typical 7-Zip binaries are basically a black box now, and it’s not a good idea to rely on 7-Zip going forward, for these reasons.

No, the 7-zip archive is the problem. Using squashfs you wouldn’t do that; it does the compression itself, and zstd is supported as a compressor.

Varies.

I’ve been able to get by/get away with NOT knowing how to program, as a mechanical engineer.

(There are other people who DO know how to program (and LIKE to), that do a MUCH better job at it. I can write a MATLAB script in a pinch, but again, I make the distinction between scripting (automating myself) vs. programming (where you make the computer do stuff where I am NOT automating myself).)

No, I’ve played with it.

I don’t have Gluster deployed anymore.

(Turns out that with Gluster 3.7 and a distributed striped Gluster volume using RAM drives, it wasn’t getting that much faster than just writing to said RAM drive “straight up”. The only advantage that Gluster provided was that instead of having to have a monolithic system with like 512 GB of RAM (to create a 256 GB RAM drive), I was able to take four nodes, each with 128 GB of RAM, carve out 64 GB of RAM each for a RAM drive, and then tie it all together using Gluster v3.7 to create a “256 GB RAM drive”. So that was where the real advantage of using Gluster was. But since that feature/option was deprecated and removed in the newer versions of Gluster, I haven’t been able to redeploy that again, which is a shame, because I think that HPC clusters could really use it. Instead of using NVMe SSDs as your scratch disk and burning through the program/erase cycles of the NAND flash chips/modules, you could use the Gluster distributed striped RAM drive as your scratch disk (think AMD EPYC, each node with 4 TB of RAM, presenting as much of that as you can to the IB network, tied together with Gluster, and you’re using RAM as a scratch disk, with no finite number of program/erase cycles to worry about). That was my motivation for trying out Gluster.)

Yes and no.

Micro HPC cluster headnode now has ~70 TB of available storage (eight (8) HGST 10 TB SAS 12 Gbps 7200 rpm HDDs in RAID5), so I use that now as my scratch space.

Right now, what I am trying to do is actually manually de-duplicate some of the existing data that I have, so that I can condense some stuff back down to a single copy/instance of it, and then I’m going to be packing the data back up and sending it back to tape.

I am actually going in the opposite direction where I am consolidating/reducing the number of duplicate copies of data/files that I have of the exact same stuff.

I think that, at last count, I have something like 140 “cores” in terms of total compute power available to me. But the speed and performance of each core vary in a very significant manner (e.g. the ARM cores that are in my QNAP NAS units are significantly less powerful than the performance cores available to me on my 12900K). So a straight “core count” doesn’t really say much of anything, because I can have each system working on a task, and how fast one system completes the task compared to another varies significantly. (I wouldn’t want the ARM processors on the QNAP NAS units trying to do the compression task on 5 TB when I can put my 12900K on that.)

Total number of HDD spindles, I think, it’s something like 39 now, split up between the micro HPC cluster headnode and 4 servers.

Yeah…I get the idea.

And it works so long as there are n number of independent jobs for n number of independent systems.

This is where I am trying something a little different: instead of having a single, monolithic archive that contains all of the files, if there are sub-projects within a given project, I compress each of the sub-projects separately and then tie them together in a bigger compressed file. That way, if I want to extract only one of the sub-projects, I should be able to do that.

And part of the reason for doing it at the sub-project level is that, if I have a multi-TB project folder, I can parallelise the compression/decompression by treating the sub-projects as the independent tasks in the grand scheme of the larger task. So, I’m trying that now.
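
If it helps, that can be driven with a fairly dumb loop (the paths, the four-jobs-at-a-time figure, and the per-job thread count are placeholders):

  # compress each sub-project into its own archive, four at a time;
  # with several jobs in flight, cap zstd's threads so they don't fight over cores
  cd /data/project
  for d in */; do printf '%s\n' "${d%/}"; done | xargs -P 4 -I{} tar -I 'zstd -T4 -19' -cf /archive/{}.tzst {}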

But I get what you’re saying.

Alternatively, my PREFERRED method would be to use MPI and distributed parallel processing, but that’s a LOT more difficult because I don’t know how you would “mesh” (and partition, then distribute) a task like compression/decompression.

One theory is that you scan all of the data, take the total number of bytes, and divide it by k partitions (must be 2^n) so that you can split it up, and then each node/process compresses just its own chunk of the entire dataset.

But I can’t read code, so I’m not sure if this idea would even work. But if it can work in theory, this is one of those times when it’s a shame that I don’t know how to program, because if I did, this could be something useful.
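
That byte-partition idea would look something like this for one big file (the file name, slice size, and node number are all made up; each node reads only its own slice, and the zero-padded part numbers keep the final cat in order):

  FILE=/data/bigresult.dat
  SLICE=$((512 * 1024 * 1024 * 1024))       # 512 GiB per slice
  i=3                                       # this node's slice number
  dd if="$FILE" bs=64M skip=$((i * SLICE)) count=$SLICE iflag=skip_bytes,count_bytes status=none | zstd -T0 -19 -o "/shared/outgoing/bigresult.part$(printf '%03d' $i).zst"
  # once every slice is done: cat the parts in order and decompress to get the original bytes back
  # cat /shared/outgoing/bigresult.part*.zst | zstd -d > bigresult.dat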

Interesting. I did not know that.

Yeah, I originally was using 7-zip because of its parallel compression prowess, especially on Windows, where you don’t have some kind of Linux emulator for the other Linux and/or Linux-like tools mentioned above.

But as I got more and more into this, I started to realise that one of my biggest process bottlenecks is the lack of parallel decompression.

I’ll have to play with zstd to see how well or how poorly it works for my datasets.

Thank you.

As I mentioned, I do have a file that’s generated by the CAE/HPC application itself that is multi-TB in size (> 1.75 TiB).

According to the documentation linked above, SquashFS won’t be able to handle this large of a file, which, sadly, would take SquashFS out of the running for me.