Question about parallel and/or multithreaded 7-zip decompression

Does anybody know if there is a way to use parallelisation and/or multithreading for 7-zip decompression?

I know that I can use multithreading for compressing a 7-zip file (whether in Windows or in Linux).

Does anybody know if there is a way to use parallelisation and/or multithreading to DEcompress a 7-zip archive file?

Your help is greatly appreciated.

Thanks.

> .7z is an archive format; the files inside can be compressed using different algorithms.

Some algorithms can’t be efficiently parallelised internally, because each successive chunk of data ends up depending on the previous one, either when compressing or when decompressing.

A naive approach is to split the input into, e.g., 64M blocks, compress each of those into a variable-length buffer, and store them with some header in the resulting file. This works for everything except files smaller than 64M, but there’s overhead from not being able to rely on data from one block while encoding another.
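To make that concrete, here is a minimal Python sketch of that kind of custom block scheme (purely an illustration, not anything 7z itself does; the 64 MiB block size, the process pool, and the trivial length-prefixed container are all assumptions):

import lzma
import struct
from concurrent.futures import ProcessPoolExecutor

BLOCK = 64 * 1024 * 1024  # 64 MiB blocks, each compressed independently

def read_blocks(path):
    with open(path, "rb") as f:
        while chunk := f.read(BLOCK):
            yield chunk

def compress_file(src, dst, workers=8):
    with ProcessPoolExecutor(workers) as pool, open(dst, "wb") as out:
        # map() returns results in input order, so blocks land back in sequence
        # (note: map() also submits everything up front, so this sketch is memory-hungry)
        for blob in pool.map(lzma.compress, read_blocks(src)):
            out.write(struct.pack("<Q", len(blob)))  # per-block length header
            out.write(blob)

def decompress_file(src, dst, workers=8):
    def blobs():
        with open(src, "rb") as f:
            while hdr := f.read(8):
                (n,) = struct.unpack("<Q", hdr)
                yield f.read(n)
    with ProcessPoolExecutor(workers) as pool, open(dst, "wb") as out:
        for chunk in pool.map(lzma.decompress, blobs()):
            out.write(chunk)

Because every block is independent, decompression parallelises exactly as well as compression; the price is the per-block overhead mentioned above (every block starts from an empty dictionary, so cross-block redundancy is lost).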

Not a feature of 7z.

This custom scheme works well if you’re compressing data for yourself (not for distribution) and don’t care about standards.


There is 7z with zstd and brotli support here:

See Modern7z

Same caveat re: standardization applies.


@risk
Yeah… so I’m using the LZMA2 algorithm. I’m not sure that I fully understand how it can use parallelisation/multithreading during the compression/archive-creation part, but it appears that I CAN’T use parallelisation/multithreading when I am DEcompressing the same archive file.

Like, I get what you are saying in terms of how the thing that tends to prevent parallelisation is the dependency of the current chunk on the previous one, and I understand that.

What I don’t fully understand is: if it works for compression, then why wouldn’t it work for DEcompression?

I looked at the link for Modern7z and it looks a lot like the 7-zip tool for Windows, but it doesn’t really say whether it would be able to use parallelisation/multithreading for DEcompression.

Thanks.

It’s a plugin for 7z that lets you specify more stuff.

It’d allow you to swap lzma2 for Zstandard (aka. zstd or zst), which is dramatically faster at decompressing at similar compression ratios to xz.

If you’re looking for “gigabytes per second” decompression speeds, I think you need to look away from 7z and look at something like pzstd probably. (or pixz if you really still want lzma2 for some reason, not sure why you would).


@risk
Thank you.

The two reasons why I was looking at LZMA2 are that, for my files, it seemed to compress better, and that I haven’t tested the alternatives exhaustively or extensively due to the sheer total size of the files I am compressing (i.e. it results in a 7-zip archive file of around 7.2 TiB).

I might have to experiment with that more to find out whether pzstd is an option for me.

What’s weird/ironic is that a lot of benchmarking sites (e.g. Anandtech) will test 7-zip compression and decompression speeds, but when it comes to how I am using it (-m0=lzma2 -mx=9 on compression), the decompression speed is nowhere close to those numbers.

So clearly, there is room for improvement.

Thanks.

edit
It looks like, for pixz, if I want it to run in Windows, I would have to compile it myself, correct? (Not sure if you know of, or are aware of, pre-compiled binaries for Windows.)

Thanks.

edit #2
Also, for pixz, is there a way to have it decompress the archive file AND unpack it using tar in the same command line?

The man pages for it that I found online say that for compression, you can use this command, which will archive and compress using tar:
tar -Ipixz -cf output.tpxz directory
but I haven’t been able to find any examples of reversing this process in a single line.

I tried using pixz -x * < input.tpxz | tar x but it gave me an error instead.

It looks like, for Windows, it would have to be the Modern7z plugin that you cited earlier.

If what I have written here, or attempted to summarise, is not accurate, please correct me.

Your help is greatly appreciated.

Thank you.

Any particular reason why you need decompression to be that fast? We’ve got some bright people here who can really help you if you elaborate further on your problem: the more specific the use case, the better.

What I’m saying is, why is waiting a bit more so undesirable for you?

So…the current 7-zip archive that I am compressing using my 5950X (Asus X570 TUF Gaming Pro WiFi, 4x Crucial 32 GB DDR4-3200 unbuffered, non-ECC RAM) is at 94% done compressing after 2 days 1 hour.

The total volume of data being compressed is 7355 GiB. (It is the working directory for a CAE/HPC project.)

Therefore, if it takes over two days to compress the file, it will DEFINITELY take longer than two days to DEcompress the file when I need to do so.

So if there is an opportunity for me to speed up the decompression process (i.e. using multithreading where/when possible vs. not using it) and I don’t take it, then I am not utilising the system’s resources to their fullest potential.

(In total, 920 folders and 6,609 files are being compressed.)

Thus, in regard to your question: if there is a way to parallelise the decompression, or to decompress multiple files in parallel, which would reduce the total wall-clock time for the entire decompression operation, then it is a benefit that’s just waiting for me to employ, born out of necessity.

(An example scenario is that the client who asked me to perform this analysis for them wants me to do more work for them, and my response is “I can’t start for two weeks, because it takes me about a day to pull the data down from tape, and then at least another week to 10 days to decompress the archive file before I will have access to the previous work that I had done for you.” I can’t imagine a client being okay with an answer like that. But even if they were okay with it, if I can speed up the entire process, I don’t really see a reason not to. I understand and recognise that for most people, it probably doesn’t matter enough. But when I am dealing with and handling multiple TBs of data, what might not have mattered before suddenly matters a LOT.)

I know that there is a way to use, for example, bzip2 to compress each file individually, and I can use bash shell scripting for that. But what I typically do after the 7-zip archive is created is calculate and write the double-parity data using par2, and then, once that is all done, all of that data is written onto an LTO-8 tape.

So if there are opportunities to speed up the process on BOTH sides of the coin (compression AND decompression, creating the archive AND testing its integrity, creating AND testing/verifying the double-parity data), then at least I would be faster at performing these tasks, and/or more efficient than where I currently stand, and/or have a higher utilisation of the available system resources for these data-management tasks.

I appreciate all of the help that people are willing to offer and provide.

Thank you.

edit
The other use case that I have for this is that I am using a .tar.xz archive as a PXE boot image.

When one of my systems boots up over PXE, it will download the .tar.xz image, and then unpack it into a tmpfs/RAM drive, and then finish the booting process.

Periodically, I need to update a few things on the PXE boot image, so between decompressing the old image, updating it, compressing the new image, and then having the system that boots off PXE decompress the .tpxz file on boot, faster (de)compression can, again, help speed up the entire process.

Again, if I can get it to work with help from members here, then it’s performance that I am currently leaving on the table by not using parallelisation where there is an opportunity to do things faster, better, and more efficiently. (So, if it is there and available, why not use it?)

Does it have to be that large? Like, compress the bootable part, but store the static data on a NAS/SAN and make reliable backups?

Sending that much data over the wire is a little bit inefficient for a bootable system.

I think that you might be mixing the two use cases.

The CAE project is separate from the bootable PXE image.

The bootable PXE image is only (currently) around 518 MiB, so it only takes about a minute or two to send over the network.

The CAE project is that big because it has a lot of data from a time-dependent, transient solution and it would take me longer to re-run the simulation each time I need the data back vs. just storing the results.

Thanks for the clarification on that.

In that case could you not just use a BSD/GNU/Linux system to do the compression/decompression?

These are the parallel tools that come to mind.

You can use MinGW to compile a Windows-native binary, as long as you can meet all of the required dependencies.


Thank you.

My apologies - I actually edited my questions about getting some of those tools working in CentOS out of my replies.

Maybe I should have left it in there.

I would like to be able to do it in both, because sometimes there is a need for me to access the data via a Windows system (some of the graphics handling in the application doesn’t work as well in Linux as in Windows, and it’s bad enough that the vendor of said software application actually has a dialog box that pops up warning you about said graphics issues in Linux, while the same dialog box does NOT show up in Windows).

On the other hand, most of the compression/decompression tasks are already done in Linux (because Linux has NFSoRDMA, and I needed a Linux system to be able to run/compile the LTFS system that is needed by my LTO-8 tape drive).

In regard to the sources/references that you have provided, a lot of them talk about using parallelisation for compression, which is true, but if you read through some of the comments, there is one comment saying that bzip2 is really the only one that supports parallel DEcompression. (I am testing pixz now, but as I said, there are other caveats to that test, because I am not sure whether I can decompress a .tpxz file when booting a system over PXE, whereas with .tar.xz I know that I can.)

(And as also noted, I am running into the minor issue where I am trying to get tar to use pixz in a single command for decompression, the same way as in the compression example above, but I haven’t really been able to get that to work “backwards” during the decompression phase like I have during the compression phase.)

If I can be more efficient by issuing a single command that does both rather than using two commands, then again, more efficient is better.

Thank you for your help.

edit
Sidebar#1: I do have cygwin installed on my system.

Sidebar#2: I am still using Windows 7 SP1 x64 Pro. Otherwise, in theory, if I upgraded my systems to Windows 10 (I think at least 20H2), then I could use WSL and, through that, do the same thing in Linux whilst still in the Windows environment.

But I don’t have any plans for migration yet because Windows 7 still works faster (for me) than Windows 10 does. (My main desktop is still rocking a 10-year-old Intel Core i7-3930K processor, so there is little to no advantage for me in upgrading my hardware (or Windows) to anything newer. My 5950X system is actually one of my new HPC nodes; it’s replacing an older Supermicro server because my wife says the Supermicro server is too loud.)

I was going to suggest that or WSL2 (you mentioned GPU). In that case, I am out of suggestions because you have an extremely niche use case that has (seemingly) a lot of hard constraints.

Try pixz -d < input.tpxz | tar x .

For some reason pixz doesn’t support -c.

Normally you could do stuff like curl ... | gzip -dc | tar ... ;

… I guess with pixz you should also be able to curl ... | pixz -d | tar ... or build similar workflows.

Are you storing your archive on the network, or locally?

What kind of storage are you using, plain old nvme?

What kind of network do you have?

(I’m wondering if you could do compression/decompression more quickly using multiport machines somehow?)

Yeah, the only reason why I am still using Windows 7 is because for high performance computing applications and computer aided engineering applications, Windows 7 is still around 7% faster (on average) compared to Windows 10 doing exactly the same thing, on exactly the same hardware.

I will try that.

The archive file ultimately gets written to LTO-8 tape.

The data currently resides on the cluster headnode (because it has more scratch disk/storage space). (Four HGST 6 TB SATA 6 Gbps 7200 rpm HDDs in RAID0.)

The system itself, which is replacing my old HPC nodes (so I guess that makes it my new HPC node), reads the data over NFS over RDMA over 100 Gbps Infiniband, processes it, and then writes the results back onto the same storage RAID array that’s on the cluster head node.

From there, the data is then sent to LTO-8 tape (which is also read off the cluster headnode over NFS over RDMA over Infiniband).

The system that is running the LTO-8 tape drive only has what is now a 4-core processor (it’s the 3930K, but I think that I burned up at least one of the cores due to excessive, long-term overclocking, which results in system instabilities that I am able to resolve by just using 4 instead of all 6 cores).

And whilst that system could probably handle the data management and archiving tasks, my concern is that if it takes my 5950X more than 2 days to compress and archive about 7 TiB of data, I can only guess at how much longer a crippled 3930K would take to do the same.

> 100 Gbps Infiniband

I’m not entirely sure I fully understand what you mean by this.

This is a new concept to me.

connecting the dots:

so 7T of data in 7K files means that optimistically you have 1G or 2G files?

what I had in mind was a bit custom

option 1. what if you compressed each file, and then tarred or zipped them together into some kind of tape-able archive?


option 2. do a bit of coding and a more professional job of this: research the xz file format and what pixz is doing across multiple threads, treat this as a ridiculously parallelizable HPC problem, write some simple remote compression/decompression service, and do the same across multiple machines.


back of the envelope math:
200k seconds for ~7 TB of data across 32 Zen 3 threads tells me you can do roughly 1 MB/s per thread… and your job is worth ~7M core-seconds.

… so if you had, e.g., a few large-core-count Epycs, say 4 or 8 nodes with 256 threads each, this compression/decompression could maybe be done in 30 to 60 minutes whenever you need it, maybe faster if you had more nodes…
… this may be faster than it’s worth.
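For what it’s worth, here’s that back-of-the-envelope math written out as a tiny (purely illustrative) Python check, using the rough numbers from this thread:

data_mb    = 7_000_000                    # ~7 TB of input, in MB
wall_s     = 200_000                      # ~2 days and change of compression so far
threads    = 32                           # 5950X: 16 cores / 32 threads
per_thread = data_mb / wall_s / threads   # ~1.1 MB/s per thread
work       = wall_s * threads             # ~6.4M thread-seconds (the "~7M core-seconds" above)
cluster    = 8 * 256                      # e.g. 8 hypothetical 256-thread nodes
print(work / cluster / 60, "minutes")     # ~52 minutes, very roughly

That lands the hypothetical 8-node case right around the “30 to 60 minutes” figure, assuming the work actually scales linearly across nodes, which is exactly the handwavy part.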

The math above is very handwavy; you’d probably need to write down some options in a spreadsheet to compare.

It does seem like the xz / pixz file format allows for trivial parallelization of compression/decompression; it’s worth looking into.


Yeah, I don’t have a histogram of the distribution of the number of files w.r.t. file size, but on average, that would be about correct. (Although the reality is probably that there are a lot of really small files (maybe on the order of < 1 MiB each) and a large number of really BIG files as well.)

So rather than a Gaussian distribution, it might be more akin to a bimodal distribution of sorts.

Yeah…so this would be basically the parallel bzip2 option.

And in theory, that’s an option, because then I can store it all in a 7-zip archive using the Copy option (i.e. no compression), and that can do the trick.

My original reluctance with doing something like this is that I’d have to figure out a way to deploy said parallel bzip2 so that it snakes recursively through the subdirectories/folders and compresses all of the files it sees.
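For what it’s worth, the “snake through the directories” part doesn’t have to be complicated. Here’s a rough Python sketch (the path, worker count and the choice of bz2 are placeholders, and it reads whole files into memory, so treat it as an illustration only):

import bz2
from pathlib import Path
from concurrent.futures import ProcessPoolExecutor

SRC = Path("/scratch/project")  # hypothetical project directory

def compress_one(path):
    # compress a single file next to the original, then remove the original
    out = path.parent / (path.name + ".bz2")
    out.write_bytes(bz2.compress(path.read_bytes()))
    path.unlink()

if __name__ == "__main__":
    files = [p for p in SRC.rglob("*") if p.is_file()]
    with ProcessPoolExecutor(max_workers=16) as pool:
        list(pool.map(compress_one, files))  # one file per task

A decompression pass is just the mirror image (bz2.decompress over every .bz2 it finds), and because every file is an independent task, it parallelises for decompression exactly as well as for compression.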

7-zip with LZMA2 on max compression is more of a “lazy” way, but doesn’t offer parallel decompression.

pixz could work for my first use case/scenario. Not sure if I would be able to do the same for my second use case/scenario (PXE boot).

Unfortunately, coding is NOT my forte.

In fact, I avoid it like the plague.

My faculty advisor, when I was going to university for mechanical engineering, at one point, literally told me that I was useless to him as an engineer because I can’t program.

(It has been my experience that with engineering equations, each part of the equation is explained and taught as to why things are the way they are, so that you can understand and learn it. This is NOT necessarily true for programming. Like, so far, nobody has been able to explain to me why C (I think it’s C) ends each line with a semicolon. Or why, in Fortran 77, the write command has the syntax of write(*,*). Or what a Fortran unit actually means. Etc. Some of the programmers I know just try to google the answer to their assignments and then tweak it until it works. And if you were to ask them WHY it worked, they can’t tell you. They just know that it does.)

But yes, I would LOVE to be able to see a MPI version of this because if that were a possibility, then I can put my entire cluster to use for this task and that would be AWESOME!

(On that note though, if I go the parallel bzip2 route, then in theory I can manually partition the total task such that computenode1 handles directories A-D, computenode2 handles directories E-H, etc. A smarter approach might be to schedule each compression task as a separate job unto itself and submit the jobs to SLURM or something. That’s wayyy more complicated, but it would use the cluster somewhat more efficiently (because as compression tasks finish, each node would just pick up the next new task and work on that), though I fear that if you are compressing a lot of really small files, the overhead from SLURM would make that woefully inefficient.)
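If it helps, the SLURM version of the “one job per directory” idea can stay pretty small. A rough sketch (the paths, job names, and the tar | xz -T0 pipeline are assumptions, not a tested recipe):

import subprocess
from pathlib import Path

SRC = Path("/scratch/project")            # hypothetical project directory
OUT = Path("/scratch/project_archives")   # hypothetical output directory

for d in sorted(p for p in SRC.iterdir() if p.is_dir()):
    # each job tars one top-level directory and compresses it with multithreaded xz
    cmd = f"tar -C {SRC} -cf - {d.name} | xz -T0 -9 > {OUT / (d.name + '.tar.xz')}"
    subprocess.run(["sbatch", "--job-name", f"compress-{d.name}", "--wrap", cmd],
                   check=True)

Batching many small directories into a single job would be the obvious way to keep the SLURM overhead from dominating.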

Interestingly enough though, pxz achieves parallelisation with OpenMP, which unfortunately means that you CAN’T parallelise across nodes, which in turn means that you would need a monolithic system to handle that; and with OpenMP, I also know for a fact that you will never fully load all of the CPU cores in the same way that MPI can.

I am not as familiar with the pthreads approach that pixz uses, so I don’t know if and/or how well you would be able to use that for distributed parallel processing.

There are people who are DEFINITELY wayyyy smarter than I am, and the fact that it appears no one (so far) has written or converted some of these tools into MPI versions tells me that it’s not as easy as it might otherwise sound on the surface.

Like, my thought process would be that if MPI can do it, it would take an inventory of all the files and work that it has ahead of it, and then distribute the tasks to the nodes/cores/processes so that, ideally, all of the nodes/processes finish at about the same time as each other.

From the research (prompted by your suggestions) that I’ve done so far, it looks like pixz is one of the leading candidates.

I haven’t tried pzstd yet, so that’s why I don’t know about that one (pzstd at least has a Windows counterpart as a 7-zip plugin, whereas I haven’t seen the same for pixz yet).

However, once I figure it out for the first use case, the next question that comes immediately after that is “okay, now how do I deploy this for my PXE boot scenario?” where the PXE client would download the .tpxz file and decompress it and then continue booting from it (so that I can replace the .tar.xz format that I am currently using).

Most of the “computing infrastructure” humans have built to date is a series of engineering tradeoffs… same with semicolons that end statements (not lines). The alternatives were just more complex.

(Actually, the fact that you could have statements span multiple lines and only end with a semicolon was seen as a huge advancement at the time, back in the early ’70s when C appeared.)


> Like, my thought process would be that if MPI can do it, it would take an inventory of all the files and work that it has ahead of it, and then distribute the tasks to the nodes/cores/processes so that, ideally, all of the nodes/processes finish at about the same time as each other.

I was thinking of something easier, well easier from my own perspective.

I think what pixz does is take a chunk of input, e.g. 10 MB, compress it, write the xz block header and the lzma2-compressed data, take the next chunk… rinse, repeat.

and then there’s a bit at the end.


I haven’t done as much research into pixz, but this is what I had in mind:

I was thinking of something like making a small HTTP server in Python that would take data over HTTP and return compressed data over HTTP (e.g. a simple 50-line Python Flask app).

And then another thing in Python that reads 10 MB from the input, sends it over HTTP, … and once it receives the result, writes the xz block header and the result.

A few things would need to be done correctly, like ensuring stuff remains ordered.
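A very rough sketch of what that could look like (Flask on the worker side, a thread pool on the feeder side; the host names, port, endpoint and chunk size are all made up, and actually emitting a valid multi-block .xz container, with its block headers, index and footer, is deliberately left out because that’s the fiddly part):

# worker side: a tiny Flask service that compresses whatever chunk it is sent
import lzma
from flask import Flask, request, Response

app = Flask(__name__)

@app.route("/compress", methods=["POST"])
def compress():
    # request.data is one raw chunk (e.g. ~10 MB) posted by the feeder
    return Response(lzma.compress(request.data), mimetype="application/octet-stream")

# app.run(host="0.0.0.0", port=5000)

# feeder side: read fixed-size chunks, farm them out, write results back in order
import requests
from concurrent.futures import ThreadPoolExecutor

CHUNK = 10 * 1024 * 1024
WORKERS = ["http://node1:5000", "http://node2:5000"]  # hypothetical worker nodes

def send(job):
    i, data = job
    return requests.post(WORKERS[i % len(WORKERS)] + "/compress", data=data).content

def chunks(path):
    with open(path, "rb") as f:
        i = 0
        while data := f.read(CHUNK):
            yield i, data
            i += 1

def compress_file(src, dst):
    with ThreadPoolExecutor(8) as pool, open(dst, "wb") as out:
        # map() returns results in submission order, which is what keeps the blocks ordered
        for blob in pool.map(send, chunks(src)):
            out.write(blob)

The ordering problem is handled here simply by letting map() hand results back in submission order; a fancier version would stream and pipeline instead of buffering each block.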


IME zstd has good compression, not quite as good as the best of xz or lzma2 but close, and it is much faster, both to compress and to decompress. Decompression is often fast enough to be limited by the I/O, so multithreading wouldn’t help.

If your data has many files, zstd in dictionary mode can mean you don’t have to put the files into an archive to get good compression; a use case that wants random access can benefit from that.
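For reference, this is roughly what dictionary mode looks like with the python-zstandard bindings (a minimal sketch; the path, sample selection and dictionary size are arbitrary, and you’d want to save dictionary.as_bytes() alongside the data so it can be decompressed later):

import zstandard as zstd
from pathlib import Path

files = [p for p in Path("/data/project").rglob("*") if p.is_file()]  # hypothetical path

# train a shared dictionary on a sample of the files (ideally smaller ones)
samples = [p.read_bytes() for p in files[:200]]
dictionary = zstd.train_dictionary(110 * 1024, samples)  # ~110 KiB dictionary

# compress each file individually against that dictionary
cctx = zstd.ZstdCompressor(level=19, dict_data=dictionary)
for p in files:
    (p.parent / (p.name + ".zst")).write_bytes(cctx.compress(p.read_bytes()))

# random access later: decompress only the one file you need
dctx = zstd.ZstdDecompressor(dict_data=dictionary)
data = dctx.decompress((files[0].parent / (files[0].name + ".zst")).read_bytes())

Whether a single dictionary carries across very different data sets is something you’d have to test; dictionaries mostly pay off when the individual files are small and similar to each other.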

Did not know that. Interesting.

Yeah… I’m not sure how the parallelisation piece of that would work/come into play, given that, in theory, you can take a 320 MB file, break it up into 32 × 10 MB chunks, and then run the process that you’re talking about in parallel.

And like you also mention:

But I’m not sure how the system would know where the beginning and/or the end is (or if it doesn’t care, making the problem implicit in nature), i.e. it would just compress the stream.

I don’t know how fast it would be to read and write data over an HTTP stream, given that I am using RDMA over Infiniband.

I get the idea of what you are saying where you basically have a processor, and then a way of feeding in the data and a way of getting the output from the other end, and everything else is just passthrough.

Where I am less sure is the parallelisation aspect of it. I understand, for example, that an HTTP server can handle multiple requests at a time, but then you have the whole “ordering” problem that you mentioned at the end there; plus, on top of that, there is the other idea you had where you use a cluster to do this processing.

In more “traditional” MPI applications, care is taken to re-compose the distributed parallel results back into a single, “linear” solution, such that a serial result and a distributed parallel result, especially for implicit solutions, should be identical or nearly so.

This level of computer science though is WAYYY over my head.

I haven’t tested it with (p)zstd yet, but I figure that with Infiniband and multiple drives in RAID0, it would be interesting to see what the decompression speed would be with and without parallelisation.

I read about that, but you also have to train the dictionary, and I don’t know whether this would mean that I would need to train a dictionary for each independent computer-aided-engineering project that I work on, because the result files will be different from project to project.

It’s worthwhile for me to try/test it.

But again though, the PXE boot question still comes in right behind it; i.e. if I can get it to work for my CAE data, then I am going to want to try and deploy it/experiment with it for my PXE boot server.

edit

By some “freak” miracle, I actually got pixz working for PXE boot and I don’t really understand how it is able to understand the .tpxz archive format and how it knows to decompress it and unpack the resulting tarball, but I did get it to work.

So, looks like that pixz is now my leading candidate for both parallel compression and parallel DEcompression.

(For Windows, I am going to have to migrate to Windows 10 (at least 20H2), so that I can install WSL to be able to use pixz, so no GUI is available.)

Thanks all for your thoughts, comments, and feedback.

@risk
Are you aware of any tools for parallel extraction of tarball files?

(i.e. parallel tar extract?)

The reason why I am asking is that, now that I am able to use parallel decompression with pixz, the resulting output is still a tarball file.

My understanding is that the extraction of a tarball file is still only single threaded/single process.

Do you know of, and/or are you aware of, any means or mechanisms to speed up the extraction of a tarball file with some kind of parallelisation?

Your help is greatly appreciated.

Thank you.