What's the fastest processor for single threaded/single process of running sha256sum?

Why not use split to divide the file into chunks—say, 1 GiB per chunk—and record the hash for each one? Then, after transmitting them, you could hash each chunk and make sure that each checksum matches. That way, you can hash all the chunks in parallel. And as an added bonus, if one chunk doesn’t match, you don’t have to retry the entire file at once. Once they all check out, just use cat to re-assemble them in order.
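
A minimal sketch of that workflow, with illustrative file names and a small test file (1 MiB chunks here; you'd use 1 GiB in practice):

```shell
# Make a small stand-in file, split it into chunks, hash each chunk,
# verify, and reassemble. All names here are illustrative.
dd if=/dev/urandom of=bigfile bs=1M count=4 2>/dev/null
split -b 1M -d bigfile chunk.            # produces chunk.00, chunk.01, ...
sha256sum chunk.* > chunks.sha256        # one checksum line per chunk
sha256sum -c --quiet chunks.sha256       # re-run this on the receiving end
cat chunk.* > rejoined                   # reassemble in order
cmp bigfile rejoined && echo "reassembled copy matches"
```

Since split names the chunks in lexicographic order, `cat chunk.*` restores the original byte order. For thousands of chunks you'd want a wider suffix (e.g. `split -a 4 -d`) so the glob still sorts correctly.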

You can’t add two hashes together to get the same result as hashing the combined file. That would have serious security implications. It would only work if your hashing algorithm is to add up all the bytes, and keep the least significant n bits…so hopefully not. But, I can’t think of a reason why you can’t check the sums per-chunk, then add the files together.

I would add that all hashing algorithms with a finite-size result must have collisions, by the pigeonhole principle. MD5 collisions are a security/crypto concern, since you can pretty easily modify a file to have a desired sum using its known vulnerabilities. But the chance of a random modification causing a collision is 1/(2^128), or 0.000000000000000000000000000000000000293873587%. That’s beyond lottery odds, to say the least. Unless you’re concerned about deliberate tampering with the file in transit, I would go with whatever hashing algorithm executes fastest.

I’m not sure exactly how you would parallelize the hashing operations, but maybe somebody here knows?

Also, here's my sha256 speeds (i9-9900k at 4.9GHz):
$ openssl speed sha256
Doing sha256 for 3s on 16 size blocks: 21075457 sha256's in 3.00s
Doing sha256 for 3s on 64 size blocks: 11810309 sha256's in 3.00s
Doing sha256 for 3s on 256 size blocks: 5437878 sha256's in 3.00s
Doing sha256 for 3s on 1024 size blocks: 1701300 sha256's in 3.00s
Doing sha256 for 3s on 8192 size blocks: 230657 sha256's in 3.00s
Doing sha256 for 3s on 16384 size blocks: 115969 sha256's in 3.00s
OpenSSL 1.1.1f  31 Mar 2020
built on: Mon Apr 20 11:53:50 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-P_ODHM/openssl-1.1.1f=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_TLS_SECURITY_LEVEL=2 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256          112402.44k   251953.26k   464032.26k   580710.40k   629847.38k   633345.37k
And md5 for comparison: (~1.5x speedup)
$ openssl speed md5
Doing md5 for 3s on 16 size blocks: 33872370 md5's in 3.00s
Doing md5 for 3s on 64 size blocks: 19564317 md5's in 3.00s
Doing md5 for 3s on 256 size blocks: 8616480 md5's in 3.00s
Doing md5 for 3s on 1024 size blocks: 2660204 md5's in 3.00s
Doing md5 for 3s on 8192 size blocks: 356744 md5's in 3.00s
Doing md5 for 3s on 16384 size blocks: 179171 md5's in 3.00s
OpenSSL 1.1.1f  31 Mar 2020
built on: Mon Apr 20 11:53:50 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-P_ODHM/openssl-1.1.1f=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_TLS_SECURITY_LEVEL=2 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5             180652.64k   417372.10k   735272.96k   908016.30k   974148.95k   978512.55k

Surprisingly, I’m way behind compared to AMD. For 16384 bytes, I got 633345.37k, while thro got 1996559.70k. That’s more than 3x faster! I guess those specialized hashing operations are no joke. So my tirade about Intel IPC and clocks goes out the window, I guess :upside_down_face:

I wonder if Zen 2 (3000-series) improves on this at all, and if it’s only SHA or other algorithms as well.

Much appreciated.

Single-threaded performance isn’t as strong as, for example, the Ryzen 3950X’s.

But if someone has like a 3970X and they don’t mind running the openssl speed sha256 test for me, that would help and would be greatly appreciated.

I forget exactly why, but I think the other reason Threadripper was originally taken out of the running was that if I wanted to run 32 cores, I could run dual EPYCs (which would actually cost less) instead of a single Threadripper 3970X (and I would also get more PCIe lanes), but it might be worth revisiting.

Because for a 6.5 TB file, you’d end up with 6500 chunks.

Now you’d have to sequence those chunks to be hashed so that you don’t spawn 6500 hashing threads.

Even if I, say, were to set the chunk size to 1 TiB, I’d still end up with 7 chunks.

It’s adding an extra step, either way, and of course, as it is with anything, the more steps you add, the more points of failure you’re introducing into the process.

But that’s the thing though: if I get “in line” corruption during the transfer, the chance may be infinitesimally small, but it would still be there.

Yeah…Intel is supposed to have special SHA hashing instructions, but I haven’t been able to confirm if any pre-compiled/packaged openssl binaries actually make use of them.

Yeah…not sure.

I was originally looking at dual AMD EPYC 7282 ($650 ea. or $1300 total) vs. AMD Ryzen Threadripper 3970X ($1999).

With the dual EPYC, I would get 128 PCIe 4.0 lanes vs. 88 PCIe 4.0 lanes with the Threadripper.

The 3960X is the 24-core and costs $1,300 US.
The single core deficit has been a thing in the past for TR, as far as I know it isn’t a problem with the 3000 series anymore. But as I said before, I don’t know shit about hashing algorithms. So … apply salt as needed.

Those are 2.8 to 3.2GHz, right? Aren’t those… like … a bit slow for single core stuff?

How many lanes do you need, realistically? Because there aren’t many dual socket epyc motherboards that have many pcie slots available at retail.

If the problem is about source code not being “enabled” by default then using gentoo might be a solution?

@MrFigs
if the goal is just confirming that the file arrived in one piece on the other side, then CRC32 would be a better suggestion than md5, I think

You wouldn’t get the same answer… but you can do something like compute a crc32/adler32 for each (e.g. 64K or 1M) block on both sides and compare. This is trivial to parallelize even for a single large file, regardless of the algorithm (i.e. you’d be doing the same thing on both sides for each block). And the actual checksum can probably be weaker/shorter if you have less data per checksummed block…
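
A sketch of that per-block compare, using the POSIX cksum CRC and illustrative names for the two sides (the "received" copy here is just a local cp standing in for the transferred file):

```shell
# Create a "sent" copy and pretend it was transferred.
dd if=/dev/urandom of=sent bs=64K count=8 2>/dev/null
cp sent received
# Split each side into same-size blocks and CRC each block.
split -b 64K -d sent s.
split -b 64K -d received r.
# Keep only the CRC and byte count; drop the (differing) filenames.
for b in s.*; do cksum "$b"; done | awk '{print $1, $2}' > sums.sent
for b in r.*; do cksum "$b"; done | awk '{print $1, $2}' > sums.recv
diff sums.sent sums.recv && echo "all blocks match"
```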

by trivial I mean in a technical sense: I’d probably reach for a queue-like data structure holding pointers to buffers, alongside some kind of condition variable, in either go or Python… no idea how I’d do it in a shell… by using lots of small temporary files in /dev/shm perhaps

I imagined a script or something that spawns n hashing jobs for the first n chunks, where n equals the number of threads on the CPU. When a job returns, start the next one until there are none left. Redirect all the stdout to a text file, and there you go. To verify after transfer, just add the --check flag. I’ve never written a parallelized shell script, though, so this might be a bad way to implement it.
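
One way to sketch that “n jobs at a time” scheduler without writing it by hand is xargs -P (GNU/BSD xargs), which caps the number of concurrent processes. Chunk names here are illustrative:

```shell
# Make a few sample chunks to stand in for the split output.
for i in 0 1 2 3; do
    dd if=/dev/urandom of="chunk.$i" bs=1M count=1 2>/dev/null
done
# Hash up to $(nproc) chunks concurrently; -n 1 gives each job one file.
printf '%s\n' chunk.* | xargs -P "$(nproc)" -n 1 sha256sum > chunks.sha256
# After transfer, verification is just:
sha256sum -c --quiet chunks.sha256 && echo "all chunks OK"
```

Lines land in chunks.sha256 in completion order rather than name order, which is fine: sha256sum -c matches entries by filename, not position.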

True, but the step you’re adding is just disk I/O, not network activity. If you’re concerned about this failing, I would wonder how you plan to reliably read or edit the file in the future.

I think you misunderstand. With MD5, an attacker can append carefully crafted bytes to a file to steer the hash, because its collision resistance is broken. So, they can compromise the file, then “fix” the hash to make a counterfeit version. This is much harder to do with sha256, hence “secure”: you would just have to try random blocks of data until you get a hit (heat death of the universe inc).

However, any hashing algorithm, even those that are woefully out of date and insecure, will detect a single flipped bit, or a dropped byte. But for this type of application, CRC32 or a similar data-integrity checksum will also suffice, and will probably be faster still. Just so long as there is no cryptography angle to this problem.
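
For what it’s worth, even a one-character change flips the CRC, which is easy to check with the POSIX cksum tool:

```shell
printf 'hello world\n' > f1
printf 'hellp world\n' > f2      # one character changed
cksum f1 f2                      # the two CRCs differ
```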

Unfortunately openssl doesn’t have a crc32 speed tester, as expected, since it’s not really a crypto algorithm. So I can’t compare the speeds like-for-like.

Yeah, the 3960X is 24 core. 3970X would be 32 cores. Dual EPYC 7282 would also be 32-cores.

Again, it would be interesting to see if someone would be able to run the openssl speed sha256 command to see if we can get some data in this thread. If the Ryzen 2700 was able to get the results that @thro was able to get, I would definitely agree with you that there is or should be some level of expectation that the Zen 2/Ryzen 3rd gen processors would only do better, even if we might now be in marginal-gains territory.

Yes.

So…depending on the board.

I haven’t found an sTR4 board for the 3rd gen Threadripper CPUs that includes IPMI, and therefore I will need a discrete GPU for each system, even if it is just a very basic Nvidia Quadro 600 (PCIe x16 2.0), just to have SOMETHING as far as video output goes, because I will need it for the initial setup/configuration of the system.

After the base OS is installed, and I have openssh-server installed on it, then I can do pretty much just about everything else remotely, but it’s the initial setup where that comes in handy.

It’s too bad that there aren’t any sTR4 boards that I am aware of that have IPMI, because if such boards existed, then I could probably forego using a discrete graphics card.

So that technically puts it at x16 for the GPU, x16 for the IB NIC, and x16 for the 12 Gbps SAS RAID HBA, for a total minimum of 48 PCIe 3.0 lanes.

The idea with the dual EPYC and having 128 PCIe 4.0 lanes is that it will give me room to expand to U.2 NVMe SSDs once they are cost effective enough for me to deploy it at the storage capacity scale that I would need to deploy them.

I’m not sure if I follow in regards to how gentoo would factor into this.

I think that, technically, I can probably download the openssl source and recompile it myself to ensure that I enable the AVX1 SHA extensions that are currently available on my Intel Core i7 3930K processor. But I don’t see any benchmarks, even from Intel, showing whether that actually improves hashing performance; even their own publication only highlights improvements to AES, which suggests that the AVX1 SHA extensions don’t offer enough of a performance benefit for Intel’s marketing team to publish anything about them.

And again, given the data that @thro’s AMD 2700X is able to produce, it is clear that AMD definitely has an advantage in this regard, by a wide margin.

So now, it’s a competition between AMD 3950X, 3970X TR, or EPYC 7282.

Yeah I got those results on a 2700x. Mine isn’t even tuned and was running a bunch of shit in the background (couple of VMs, etc.). Mostly idle VMs, but still. It’s by no means a test of the absolute best the machine could do.

It runs a Corsair H115i, but I haven’t touched clocks; that’s just Precision Boost/XFR doing its thing. I think it maxes out around 4.2-4.3 GHz briefly on a single core. I did drop core voltage by 15 mV or so to reduce heat, but that’s it.

It’s running 64 GB of non-QVL DDR4 at DDR4-1866 (mismatched pairs; I didn’t realise that not all Geil Trident-Z DDR4-3000 16 GB sticks are the same) :smiley: so it’s by no means a rocket-ship configuration either. Given that Infinity Fabric speed is linked to memory clock… mine is kinda handicapped here…

I’ll pull out the slower pair of sticks when I upgrade my other home PC from Haswell, but for now I needed more capacity rather than speed for my work-from-home test lab. I didn’t tune memory at all beyond simply dropping clocks from the JEDEC profile until it stopped failing to POST (I’m lazy :D)

Yeah…no idea.

I don’t program, I am even MORE clueless. :smiley:

Yeah…neither have I.

Like for my CAE stuff, I usually run it with the job scheduler and resource manager that’s built into the CAE application (i.e. I don’t use SLURM for it). To do this properly, you’d effectively have to write a low-level version of that, and given that I don’t program, I wouldn’t have the slightest clue how to do it.

It’d be useful, like, if you have a LOT of shell commands to process, especially if you have a LOT of data and/or files to go through, and I can see how something like that can potentially be very useful, but I just have no clue how to do something like that.

But it isn’t just the disk I/O part of it.

The splitting of the file, and then the concatenation so that it’s stitched back together properly again, are still extra processes that need to take place. And I’m not sure if you would be able to parallelise those processes.

Ultimately, I think that because I am starting from a single, large 6.5 TB file, that file is going to be the bottleneck no matter what I do with it, unless you can parallelise the split and the concatenation when you stitch the file back together.

So yes, you can speed up the actual hashing portion, but as an end-to-end process you’ve now added the time to split the file and the time to concatenate it back together. And with each additional step you take (unless the sha256 algorithm itself is parallelised), you’re adding more ways the process can fail – i.e. what happens if there’s a failure when it splits, and/or what happens if there’s a failure when you try to concatenate the file back together?

(e.g. how can you verify that the concatenation was done correctly? You’re back at the same problem that you started off with.)

Yeah, I just started using sha256 as kind of the de facto standard, based on industry best practices.

It’s not the fastest, but it is more secure, as mentioned.

But that’s the thing, I mean, in reality, my system will likely have a bunch of shit running in the background as well since it’s the headnode to my cluster, so there will always be cluster management stuff that’d be running on my system.

I like the fact that it hasn’t been specifically tuned for it, so that it gives a really good “real world” feel for what one might reasonably be able to expect.

Your results are good!

Works for me.

Directionally, it actually achieves the objective of making a data-driven decision, so thank you for that!

hahaha…aren’t we all?

aren’t we all?

Yeah, I just figured I’d clarify the situation in case people were thinking my machine is some extreme on-edge overclocked example or something. It’s not; it’s actually a reasonably crippled 2700X at the moment :D. I doubt it will make a heap of difference, but it certainly won’t be working in its favour :slight_smile:

Can’t even remember what it was running, but likely a copy of pfsense acting as a gateway for some home lab VMs (likely an SCCM server, a client, and probably a few browser windows and steam in the background :smiley:)

I DO wonder, though, if there are people here who have access to an AMD EPYC 7282, or a Threadripper 3970X, who would be able to run the openssl speed sha256 benchmark for us, so that I can see what kind of numbers one might reasonably be able to get.

Having these extra data points would be able to help point me in the direction that I think I would need to go with this.

Thanks.

Right on!

Like I said, a nearly 2 GB/s sha256 hash rate on 8 kiB blocks, and that’s from a 2700X, which isn’t even the latest and greatest, is FANTASTIC!!!

May I ask you to do a favour for me:

Can you create this shell script on your system and run it?

call it like test.sh or something

and in it, it should have the following:

#!/bin/bash
# Spawn 8 openssl speed runs in parallel, each logging to its own file.
for i in $(seq 1 8); do
    openssl speed sha256 2>&1 | tee "$i.txt" &
done
wait    # don't exit until all 8 runs have finished

This should spawn 8 parallel test processes.

What I am hoping to test is whether the almost 2 GB/s sha256 8 kiB hash rate will be the same across all of those threads, or whether the sum of all of those threads will only roughly equal the single-threaded hash rate.

If you don’t mind running that for me, that would be greatly appreciated.

Thank you.

You can probably dump the results here, using whatever method you think would be best.

My dummy way would be to cat *.txt > results.txt and then copying-and-pasting the results here, but if you have a different method, that you think would work better, then use that.
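
To aggregate the numbers without adding them by hand, an awk one-liner over the result files could sum the final (16384-byte) column of each sha256 summary line. The two fabricated result files below just stand in for the real 1.txt…8.txt, assuming the summary-line format shown earlier in the thread:

```shell
# Fake two result files with the summary-line format openssl speed emits.
printf 'sha256 100.00k 200.00k 300.00k\n' > 1.txt
printf 'sha256 150.00k 250.00k 350.00k\n' > 2.txt
# Sum the last column (highest block size) across all N.txt files.
awk '/^sha256/ { sub(/k$/, "", $NF); total += $NF }
     END { printf "%.2fk total\n", total }' [0-9]*.txt
# prints 650.00k total
```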

Thank you.

Since you really want the PCIe lanes too, EPYC would be ideal.

You could probably pair a high-frequency EPYC with this motherboard: https://www.asrockrack.com/general/productdetail.asp?Model=ROMED8-2T#Specifications

Will have to wait until this evening when I get home, at work at the moment…

No rush.

Thank you.

Yeah, I was watching a video by Patrick over at STH, and he was going over the CCX architecture and how each of the chiplets connect (or don’t connect, depending on how many cores there are) to the central I/O die.

Also, if I remember correctly, I thought that the Threadrippers also DO NOT support ECC Reg. RAM, whereas, of course, the EPYCs do.

So that was another consideration as well.

And I think that it was also based on some testing that Wendell did here as well, which showed that going from a 3970X (32-core) to a 3990X (64-core) resulted in something like only a 31% gain in performance (I think it was for GROMACS), which tells me that either the software and/or the hardware has scalability issues.

So…I think that was another reason why I had originally ruled out the 3rd gen Threadripper: I was waiting for the processors to launch, and when the reviews started coming out, they showed less-than-stellar scaling beyond a certain core count due to various architectural differences between, say, the 3970X (vs. dual EPYC 7282) or the 3990X (vs. dual EPYC 7542).

Like I said, this is what makes this a bit of a complicated issue due to all of the various competing demands in terms of what I am looking for the headnode processor to be able to do.