Appropriate workloads for 16-core consumer CPUs with only 2 memory channels?

I come from a scientific computing background, where most of the time, memory bandwidth is the most important figure in predicting how fast an application (usually some sort of simulation) will go. This makes consumer chips with only 2 memory channels not that appealing for those workloads. But recently I've been thinking: there are several products out now that are straight-up consumer CPUs with 16 cores, like the top-end Ryzen 9 and the new Alder Lake i9 from Intel. What workloads actually benefit from all those cores (outside of just benchmarks)? They must be very compute-heavy and rather light on memory traffic. Reviews seem to focus on media production, but I'm curious what else is a good fit for these parts?

5 Likes

I've got pretty much the same background and don't understand this trend either. The Ryzen 9 parts especially seem just stupid to me: extremely powerful CPUs sitting on mainboards that are simply not 'good' enough for this class of CPU.

I think there's hardly any reason for CPUs with more than about 8 cores (at least high-performance ones) in the desktop segment anymore. I mean, it's not 2013 anymore; we've got extremely performant GPUs, and GPGPU is very well supported across pretty much all applications.
3D modeling, video editing, live streaming etc. were highly dependent on many cores back then, but today? Once you throw even a mid-range GPU into the mix, CPUs with lower core counts typically even outperform CPUs with higher core counts due to higher all-core boost frequencies.
LTT did a couple of videos on this topic but never came to the right conclusion, I think.
The only typical workload left that really hammers the CPU is code compiling.
But who actually compiles large projects regularly on their own machine?
First of all, most software can be built in seconds even with a single job (if it has to be compiled at all). Then there are pretty smart incremental build systems that only recompile the files that changed.
Moreover, most builds don't actually scale that well across many cores. In my experience, the gains beyond ~16 cores are pretty much negligible.
Virtualization is pretty demanding, but then again AM4 CPUs have pretty limited RAM capacity and I/O, which makes them way less appealing for that.

Reviews/tests of CPUs are rather bad in general. The problem is that most benchmarks are designed to run as short as possible and use as little RAM as possible, since reviewers want the comparison to be as "fair" as possible.
The problem is that nowadays these many-core CPUs are only really useful when you work with huge datasets or many instances. Both usually need lots of RAM and, more importantly, memory bandwidth.
Moreover, reviews always focus on CPU-only performance, which doesn't tell you much on its own nowadays.

I think reviews should focus more on system performance than component performance. The idea of having a ‘good’ (isolated) CPU doesn’t work anymore.

3 Likes

If the workload can fit within the L3 cache, then memory bandwidth is a much smaller issue, from what I understand. This is useful for things like profiling datasets or cryptography.
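A rough way to see the cliff for yourself is a working-set sweep: re-read a buffer at a few different sizes and watch the effective bandwidth drop once you spill out of L3. This is just a sketch, not a rigorous benchmark; the sizes, repeat count and plain summing loop are arbitrary choices:

```cpp
// Working-set sweep: re-read a buffer of increasing size and report the
// effective read bandwidth. Sizes and repeat count are arbitrary. Build with -O2.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    for (std::size_t mb : {1, 4, 16, 64, 256}) {                        // working-set size in MiB
        const std::size_t n = mb * 1024 * 1024 / sizeof(std::uint64_t);
        std::vector<std::uint64_t> buf(n, 1);
        volatile std::uint64_t sink = 0;                                // keep the sums from being optimized away
        const int reps = 20;
        const auto t0 = std::chrono::steady_clock::now();
        for (int r = 0; r < reps; ++r) {
            std::uint64_t s = 0;
            for (std::size_t i = 0; i < n; ++i) s += buf[i];
            sink = sink + s;
        }
        const auto t1 = std::chrono::steady_clock::now();
        const double secs = std::chrono::duration<double>(t1 - t0).count();
        const double gib = double(reps) * n * sizeof(std::uint64_t) / (1024.0 * 1024 * 1024);
        std::printf("%4zu MiB working set: %6.1f GiB/s\n", mb, gib / secs);
    }
}
```

Compiled with -O2, the in-cache sizes should report several times the DRAM figure.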

4 Likes

I think marketing tried to push “megatasking” as a thing, but that’s BS IMHO.

Also, DDR5 and giant caches (not just for memory) are becoming a thing now; while that doesn't help existing huge-thread-count CPUs much, things are moving in the right direction.

Also, what's the point of having a 256-thread server with only 16 memory channels, where half your threads need to jump to another socket to get to RAM? ¯\_(ツ)_/¯ :wink:


In theory yes; in practice it's either messy or there are tests to run as well, and even if you have caches and a build farm, coordinating a distributed build takes some CPU, and for builds RAM capacity often ends up as the limiting factor.


However, I think beyond dry performance it comes down more to how personal "productivity scaling" is just plain weird… and people with money/consumers will be the ones driving the design forwards.

An example might be a professional software developer / occasional open-source contributor who happens to have a significant other and kids… in addition to a well-paid job - a consumer.

They may have spent $3000-$5000 on a machine that's off Mon-Fri ($5000/year home computer "toys" budget) because they simply don't have the time to use it, and then… it's the weekend: round of updates, FUU… skip… later… if it turns out to be malware they'll whine a little and restore from backup… and finally it's hack-hack-hack time.

If going from 16 threads to 32 threads ($400 extra on a $3000+, 64 GB RAM desktop machine) reduces the build/test cycle from 12s to 8s (33%), then as a developer that has maybe an hour or two of "free time" on a good week, beyond work/kids/other, to spend on open-source projects… that measly "4s" right there is totally worth it in my book.

Going from 2s -> 1s is less important; it's not enough time to be distracted by something else and feel like you're eating time… Going from e.g. 20 minutes to 13 minutes is also less important: it's fully distracting, they'll step away entirely and maybe come back next weekend in some cases, or never, or they'll be reading docs and actually be working on code while the thing builds in the background.

Before they start, there's that first make test or equivalent right after the initial git clone, even before making any changes. That's another make-it-or-break-it moment: this is when they still don't really know what they're doing, so there's not much they can do in parallel with it other than read the docs (that maybe don't exist), and all the build caches are empty… time to crank up the CPU.


There are a few million people in the world like this, "aspiring" to hack on stuff in free time that they don't have, happy to spend money on the chance it might help. Not a huge market, but significant.

Those same few million people in their day job will run build/test cycles too on their Mon-Fri work machines, where they get to feel that 4s difference in building/testing/running an app in an emulator a lot more often… and their business is paying for it… this doubles the TAM (total addressable market).

Sum these up and that's a few bees ($B) every 2-5 years, spent on high-thread-count CPUs alone – probably worth having a SKU.


The irony is, if you replaced all their work machines with Raspberry Pi Zero class machines, developers would be forced to optimize, and the efficiency and performance of the rest of the world's few billion machines would probably skyrocket.

1 Like

A lot of media creation benefits from higher core counts, since it's similar data being processed every time (so it may fit into cache) and it can be processed in parallel: rendering video, 3D modelling (not of the CAD kind, but the "artsy" kind), photo editing, etc.


It really is, yeah.

3 Likes

It's very niche, but a prime example of a workload well suited to whatever memory bandwidth you happen to have is GIMPS prime hunting (of which Prime95 is one of the main ways to utilise a CPU). It is very bandwidth intensive and very compute intensive, but the memory footprint is relatively small, which makes it a good target for caching. Testing smaller primes requires smaller FFTs, so tests can be chosen to fit within a certain memory bandwidth, or L3 cache, or even the L2/L1 cache of any modern CPU. It's a shame Prime95 across a range of FFTs isn't widely used as a means of measuring the balance of compute/bandwidth/cache/efficiency; done properly it could be one doozy of a test.
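For a rough sense of scale, the working set of a double-precision FFT of length N is on the order of N × 8 bytes, so you can sketch which transform lengths stay inside which cache level. The cache sizes below are examples only (roughly a 512 KiB L2 per core and a 32 MiB L3 per Zen 3 CCD), not a statement about any particular chip:

```cpp
// Back-of-the-envelope FFT footprints vs example cache sizes; check the
// actual numbers for your own CPU.
#include <cstdio>

int main() {
    const double kib = 1024.0, mib = 1024.0 * 1024.0;
    const double l2 = 512 * kib, l3 = 32 * mib;            // example L2 / L3 sizes
    for (long n : {256L * 1024, 1L * 1024 * 1024, 4L * 1024 * 1024, 16L * 1024 * 1024}) {
        const double bytes = double(n) * sizeof(double);   // ~8 bytes per FFT element
        std::printf("%6ldK-length FFT ~ %7.1f MiB -> %s\n", n / 1024, bytes / mib,
                    bytes <= l2 ? "fits in L2" : bytes <= l3 ? "fits in L3" : "spills to DRAM");
    }
}
```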

1 Like

Also, DDR5 and giant caches (not just for memory) are becoming a thing now; while that doesn't help existing huge-thread-count CPUs much, things are moving in the right direction.

Well, DDR5 doesn't really make this whole situation better. We constantly need faster and faster RAM to make use of CPU improvements.
We've got 16-core CPUs with just 2 memory channels; that's 8 cores per channel. Running a Ryzen 9 with two channels is effectively the same as running two Ryzen 7s with one channel each (I know, I know, it doesn't actually work that way, but I'm talking about raw throughput), which everyone would call ridiculous, since a Ryzen 7 is already highly bottlenecked by a single memory channel. When you remember the performance improvements Zen 3 gets from higher memory speeds, it's obvious that DDR5 will probably not fix the memory problem.
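To put rough numbers on it (assuming DDR4-3200, which is about 25.6 GB/s per channel): two channels give you roughly 51 GB/s of peak bandwidth, so with all 16 cores streaming that's only around 3.2 GB/s per core, well below what a single core can chew through in a bandwidth-bound loop.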

In theory yes; in practice it's either messy or there are tests to run as well, and even if you have caches and a build farm, coordinating a distributed build takes some CPU, and for builds RAM capacity often ends up as the limiting factor.

I know. On my own projects I actually make use of it but that requires lots of knowledge of the build system and time. I just mentioned it because I think it’s something that should be a goal.
In my experience the even bigger problem is the users. I've seen so many developers do a 'make clean' before every single build because they think it's better to clean first.

But I strongly agree that RAM is a very limiting factor. 2 GB per job/thread is usually recommended as a minimum; 4 GB per job is usually a comfortable amount. But then we're limited by the platform again.

If going from 16 threads to 32 threads ($400 extra on a $3000+, 64 GB RAM desktop machine) reduces the build/test cycle from 12s to 8s (33%), then as a developer that has maybe an hour or two of "free time" on a good week, beyond work/kids/other, to spend on open-source projects… that measly "4s" right there is totally worth it in my book.

Going from 2s -> 1s is less important; it's not enough time to be distracted by something else and feel like you're eating time… Going from e.g. 20 minutes to 13 minutes is also less important: it's fully distracting, they'll step away entirely and maybe come back next weekend in some cases, or never, or they'll be reading docs and actually be working on code while the thing builds in the background.

Strongly agree with that. That's the reason why I personally always liked the way Intel boosts its CPUs' clock speeds for up to one minute. I think that ~25% (at >2x the power consumption) is much more noticeable than ~15% over longer build times. Anything longer than about one minute doesn't matter that much, because no one is going to sit there and just wait for minutes.

I think I wasn't as precise as I wanted to be. The first part of my comment was a bit exaggerated. I would not say that many-core CPUs are definitely useless in desktop/workstation types of applications. I just think that many-core CPUs are highly overrated and not as important anymore.

My point is that whenever you can actually make use of more than ~8 cores, you are probably more limited by platforms like AM4 or Z690. I mean, especially when working on large projects you probably want more RAM (in total) and, more importantly, more PCIe lanes.
Most important is networking (I wouldn't go with less than 25 Gb anymore on a system that I build for myself) and sometimes fast SSDs. I mean, a 25 Gb network sounds like overkill, but it definitely isn't. Once you put thousands of dollars into a workstation you probably have your own server too, and hosting your own git can drastically improve performance.
Using a good NIC with a TCP offload engine and RDMA gives you way faster transfers and, more importantly, better responsiveness. Cloning/pulling/pushing many small files over regular Ethernet is typically very slow. That's something that can easily save dozens of seconds too.
Having just 20 PCIe lanes is very limiting.

I wish we still had segmentation like Intel used to have.
Having a small but cheap desktop platform (like AM4 or LGA 1151/1200) with up to ~8-12 cores, two memory channels and 20 PCIe lanes; an intermediate one (like LGA 2066) with ~6-24 cores, four memory channels and say 40-48 PCIe lanes (enough for 2 GPUs, one 40 Gb NIC and at least two NVMe devices); and a high-end platform like Threadripper Pro with lots of cores, memory channels and PCIe would make a lot more sense than the current segmentation.

Right now you either get a limiting platform, or you pay extremely high prices for a platform like Threadripper Pro, because the mainboards are ridiculously expensive to manufacture with 128 PCIe lanes, 8 memory channels and VRMs that have to handle 280 W of power.

2 Likes

Thanks for the discussion. Very interesting points regarding platform limitations, and also regarding how some benchmarks are tuned to run out of cache, thereby bypassing the effect I'm trying to highlight. I'd be curious to hear what @wendell's thoughts are on this topic, as he's someone who actually evaluates CPUs and benchmarks them, but also really knows what he's talking about (no disrespect intended to other reviewers).

2 Likes

So this is actually more interesting if you think about it deeper.

If you just have ONE busy core, do you really think you can get all ~50 gigabytes/sec of memory bandwidth on DDR4? The answer is lol, no, unless you are on an M1 Mac.

Holy crap, how did we not know that made that much difference before?

I get it now. System designers had to make a really hard choice: more memory channels in parallel is going to add latency. Is dual or triple channel the sweet spot? I was sure it was triple channel, but no, it really is still dual channel.

DDR5 can currently do about 75 gigabytes/sec at roughly the same latencies as DDR4 (actually a bit lower if you use fewer DDR5 channels, though then throughput suffers). More importantly, single-core throughput to memory actually improves a fair bit.

Some of that is the Alder Lake P-core monstrosity, some of it is the smart cache, some of it is DDR5. But it adds up.

Memory bandwidth isn't all that important if the cache is good. The cache is quite good on Zen 3, and Alder Lake seems decent also. The latency goes up more than I expected when the ring gets a little loaded; that's a weakness of multi-core Alder Lake for sure.

You can answer your own question partially if you look at Zen 2 vs Zen 3 and how the cache behaves. 2x16 MB of L3 means that if both sets of cores are busy doing stuff in the same areas of memory, there are two sets of cache caching the same things, one in each segment of L3. But with 32 MB in one segment, it's just the one copy. So not only do you have more cache because 32 is bigger than 16, you have more usable cache because you cut down on how many dupes you have in cache.

So it turns out that for many, many things, the memory bandwidth is 'enough'. So much so that AMD eliminated half the bandwidth for writes (see AIDA64) and literally no one noticed or cared. That freed up silicon floor space for other stuff.

All that waxing poetic now lets me make the gross oversimplification: we haven't made an algorithm that both parallelizes well and operates on a memory segment so large that 20-30 MB of cache doesn't reasonably cover things and keep everything fed. At the same time, more cache is more better, so we'll see some performance benefits from bigger caches. Understand that this is mostly a latency benefit more than a bandwidth benefit, I think, in real-life scenarios.

I expected the M1 to have superiority in latency. It does not.

The place where we, the users, should be crying foul is single-thread memory performance. The Apple M1 does great here, and that's 80% of the reason it "feels" better. One core can totally monopolize all dat memory bandwidth. Which seems like bad design. And it would be for a server! But a desktop? Swanky, as it turns out.
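If you want to put numbers on the single-core point on your own box, a crude sketch of the kind of test I mean is below: sum a big buffer with 1, 2, 4, 8, 16 threads and watch where the curve flattens. The buffer size, thread counts and plain summing loop are arbitrary; something like STREAM is the proper tool, this just shows the shape of it.

```cpp
// Crude single- vs multi-thread streaming-read test. Not calibrated; it just
// shows where the bandwidth curve flattens out. Build with: g++ -O2 -pthread
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

static std::uint64_t sum_range(const std::uint64_t* p, std::size_t n) {
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

static double bandwidth_gib(const std::vector<std::uint64_t>& buf, unsigned threads) {
    std::vector<std::thread> pool;
    std::vector<std::uint64_t> sums(threads, 0);
    const std::size_t chunk = buf.size() / threads;           // divides evenly for the sizes used here
    const auto t0 = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&, t] { sums[t] = sum_range(buf.data() + t * chunk, chunk); });
    for (auto& th : pool) th.join();
    const auto t1 = std::chrono::steady_clock::now();
    const double secs = std::chrono::duration<double>(t1 - t0).count();
    volatile std::uint64_t sink = 0;
    for (auto s : sums) sink = sink + s;                      // keep the sums "used"
    (void)sink;
    return chunk * threads * sizeof(std::uint64_t) / secs / (1024.0 * 1024 * 1024);
}

int main() {
    // 512 MiB of data, much larger than any consumer L3.
    std::vector<std::uint64_t> buf(512ull * 1024 * 1024 / sizeof(std::uint64_t), 1);
    for (unsigned t : {1u, 2u, 4u, 8u, 16u})
        std::printf("%2u thread(s): %.1f GiB/s\n", t, bandwidth_gib(buf, t));
}
```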

8 Likes

Appropriate workloads for 16-core consumer CPUs with only 2 memory channels?

Mining Monero.
/thread

Jokes aside (even though mining crypto is a legitimate workload), consumer CPUs with many cores have been pushed mostly by game streamers and prosumer content production people. Streamers can't always use, or don't always want, GPU acceleration for encoding, but even if they do, it's very likely that they are running other software besides their streaming software and their games: software sound equalizers, companion software for expensive hardware like Stream Decks, RGB lighting software (all of which tend to be really big CPU hogs, probably because of how awfully that software is written), and maybe they have to capture multiple inputs, like a 1080p camera stream, and push that into the video encoder as well. All of these take a lot of CPU. And games nowadays can use at least 8 cores, so a 12-16 core CPU with just 2 memory channels should be OK for those needs.

Content creation also doesn't require a lot of memory bandwidth: things like video editing, photo editing and amateur animation can chug along just fine on a CPU with just 2 memory channels, but they can be pretty CPU intensive.

Then there are also the amateur programmers who may be running things like Visual Studio, Eclipse or IntelliJ IDEA or whatever other heavy IDE, and who are still compiling stuff on their own machine and not on a build server.

Linus Torvalds and Greg Kroah-Hartman, I guess… Hey, Wendell built Greg's PC (albeit a Threadripper); you should give that video a watch if you haven't already! :slight_smile:

The market is pretty well thought out TBH. If it wasn't for so many shortages, you would likely see the 6-core parts being the best-selling CPUs. Threadrippers and TR Pros are fine for people who need a lot of local expansion, like multi-GPU setups or other accelerators, plus lots of cores.

To be honest, I'm kinda sad you don't see an 8-core, really-high-frequency TR SKU anymore, because having fast single cores with lots of PCIe expansion makes a lot of sense. It would be even better if the cores were all on a single CCX, to speed things up, but I think if that were the case you might lose some memory channels, since I believe the CCX count is tied to the high memory channel count (I could be wrong about that though).

2 Likes

Interesting - I think I'm up against the limits of my knowledge regarding caches, then. It's funny: literally an hour ago I asked a question on an online computational fluid dynamics forum about why Milan-X shows such an outrageous improvement in finite volume method CFD programs, even though the overall bandwidth is the same as regular Milan and most of the changes 'only' have to do with cache… FVM CFD tends to have very low compute intensity; memory performance more or less completely dictates overall solver runtime. But the meshes are usually very large, which made me think cache wouldn't have a significant impact: you have to stream through all the points in your mesh at every solver iteration anyway. The conventional wisdom regarding CFD performance is that you basically stop seeing any benefit beyond 2-3 Zen 2 cores per memory channel. The 64-core Epyc Rome parts basically performed the same as the 32-core parts, and the speedup wasn't linear from 16 to 32; you were better off with dual 16-core sockets if budget was a concern. On the surface it felt like a textbook example of a bandwidth bottleneck.

I suppose my next question is, given a fixed amount of available bandwidth, what are the factors of cache “quality” that you’re talking about? Is it as simple as just how good it is at predicting what memory you’re going to want next?

1 Like

So what’s a prediction when you control time?

Suppose you use an RNG. What if the thing knows that the RNG feeds the thing that picks the next address? In a fixed-execution program you know what all the steps are going to be, in order, and the next address is either randomly generated (on the same silicon!) or deterministic.

If it's an RNG, you just have to hack the RNG so you know what it's going to produce ahead of time, and pre-cache that.

For other access patterns it can be surprisingly banal and easy to predict. Every 3rd page? Every 5th page? Some polynomial determines the next page?

There is some pretty elaborate machinery in there. The prediction part is very good, but understand that "prediction" here is a bit of a misnomer, because it's not realllllyyyy predictive in the conventional sense.

What causes the performance delta with more cache is that you can keep more stuff in the cache, which lowers the latency between "hey, we're probably going to need the stuff at XYZ address in memory" and actually having it on hand.

So for "large memory footprint" sparse access, more cache is more better than more bandwidth, because more bandwidth doesn't necessarily lower the latency. Sequential access is of course perfectly predictable.

The usage pattern that more cache doesn't help with is one that is truly random, in such a way that whatever predictor machinery is in there doesn't comprehend it. Can't best-fit a polynomial, can't best-fit linear, can't best-fit log… so what the heck?
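To make the sequential-versus-truly-random point concrete, here's a rough pointer-chasing sketch; the buffer size and hop count are arbitrary. The sequential chain is exactly the kind of pattern the prefetch machinery handles, while the shuffled chain defeats it and every hop pays something close to full memory latency:

```cpp
// Chase a chain of indices laid out sequentially vs shuffled. Each load
// depends on the previous one, so the shuffled chain exposes raw memory
// latency while the sequential one is prefetcher-friendly. Build with -O2.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

static double ns_per_hop(const std::vector<std::size_t>& next, std::size_t hops) {
    std::size_t i = 0;
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t h = 0; h < hops; ++h) i = next[i];      // dependent loads
    const auto t1 = std::chrono::steady_clock::now();
    volatile std::size_t sink = i;                           // keep the chain live
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / double(hops);
}

int main() {
    const std::size_t n = 64 * 1024 * 1024 / sizeof(std::size_t);   // ~64 MiB, bigger than L3
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});

    std::vector<std::size_t> seq(n), rnd(n);
    for (std::size_t i = 0; i < n; ++i) seq[order[i]] = order[(i + 1) % n];   // sequential cycle

    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
    for (std::size_t i = 0; i < n; ++i) rnd[order[i]] = order[(i + 1) % n];   // random cycle

    std::printf("sequential chain: %5.1f ns/hop\n", ns_per_hop(seq, 10000000));
    std::printf("random chain:     %5.1f ns/hop\n", ns_per_hop(rnd, 10000000));
}
```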

Here’s the fun part: The compiler that compiled your code knows that this is a sub-optimal situation. So good compilers will try to translate and “optimize” your code using mathematically provable transformation patterns that better fit with the available hardware.

8 Likes

Thanks for explaining, that all makes perfect sense.

We’re definitely getting way down into the weeds here, but suppose for a moment the access pattern is deterministic, but based on the value stored in another array. I’m thinking of an unstructured mesh PDE solver, with connectivity arrays that encode which cells are adjacent to which other cells… Do you know if modern cache designs would be “smart” enough to pick up on that? It might look organized, but it depends a lot on the mesh - it might look like a pretty whacky pattern at times, even though it’s totally deterministic.

(Edit: I guess that access pattern I just described is even simpler than what a game would have, for example, so I’m guessing it’s all just fine…)
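Concretely, the kind of inner loop I have in mind looks something like this. It's only a toy version: the explicit software prefetch is a GCC/Clang builtin, and the look-ahead distance of 16 is just a guess at what might help, not a tuned value.

```cpp
// Toy version of the gather an unstructured solver does: values are read
// through a connectivity array, so the address of each load comes from
// another load.
#include <cstddef>
#include <cstdio>
#include <vector>

double gather_sum(const std::vector<double>& cell_value,
                  const std::vector<std::size_t>& neighbour) {
    double sum = 0.0;
    const std::size_t n = neighbour.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 16 < n)
            __builtin_prefetch(&cell_value[neighbour[i + 16]]);  // hint: we'll want this soon
        sum += cell_value[neighbour[i]];                         // indirect, data-dependent load
    }
    return sum;
}

int main() {
    // Tiny fake "mesh": each entry pulls from some other cell, picked arbitrarily.
    std::vector<double> cell_value(1000, 1.0);
    std::vector<std::size_t> neighbour(1000);
    for (std::size_t i = 0; i < neighbour.size(); ++i)
        neighbour[i] = (i * 7 + 3) % cell_value.size();
    std::printf("sum = %f\n", gather_sum(cell_value, neighbour));
}
```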

In other words, it can do more preemptive caching because it has more spare room available?

I’ve been eagerly awaiting the rumoured Sapphire Rapids chips with integrated HBM memory, thinking that for my workloads it will crush everything else available… But if I recall, HBM actually has fairly poor latency, and if that’s the case, perhaps AMD’s V-Cache approach will still win out, unless the actual on-die caches on the Sapphire Rapids cores end up being extremely good…

It’ll probably depend on the compiler and workload. Predicting memory access based on transformations to an array is maybe not something the silicon would do but the compiler would.

The compiler knows a lot about how the silicon is built and will do things like pre-emptively issue a load to cache once the array index is computed. Or program designers can take the reins and work on optimizing for a specific chip.

The tooling for looking at where a program is stalling or bottlenecking is very well built. It's not black arts left to a secret cabal; just get your hands dirty, look into what your program is doing and when CPU execution stalls, then optimize it away.

HBM vs V-Cache is probably going to come down to which is easier/more obvious to optimize for.

I don’t know.

Parallel example: Optane persistent memory. With pmem, database architectural decisions change completely. There are industry standards here. No matter how good your machine is, PostgreSQL performance falls off exponentially, unless you can count on memory not losing data; then it scales linearly.

Is it a matter of just adding some pmem and off you go? Lol, no, it's a crazy set of patches to make it all work. It's not ubiquitous enough for the most optimal code to be easily accessible.

We have two companies doing their best engineering right now. We didn’t have that for almost a decade. The software is barely keeping up with the hardware because the hardware offers so much more potential now.

Another reason the M1 is good is that hardware and software can work together to find and nuke bottlenecks. And in a perfect world Apple can control Xcode to the nth degree to make sure everyone uses day-zero optimizations. The reality though is that Apple is all thumbs here, and literally anyone else with more engineers who does the hardware+software synergy will take over. The fact that OS X is kinda bleh on the inside will come back to bite them. But their mobile OS has been optimized to the nth degree, mostly; probably to the point that the optimizations open up a lot of sideways hardware holes.

6 Likes

I see. Just out of curiosity, what could the compiler do in this example, other than re-ordering the array elements or inserting explicit prefetch instructions?

1 Like

If the compiler sees that the reference to the array will result in a memory fetch, it can insert a read operation as soon as the address is computed, rather than when the value is needed.
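A hand-written illustration of that idea; the function and numbers are made up just to show the shape of it, and in practice the compiler (and the out-of-order hardware) does this scheduling for you when it can prove it's safe:

```cpp
#include <cstddef>
#include <cstdio>

// Issue the load as soon as the index is known, then do independent work
// while the fetch is in flight; the loaded value is only consumed at the end.
double process(const double* table, std::size_t idx, double a, double b) {
    double fetched = table[idx];            // load issued early, right after idx is available
    double busy = (a * b + a) / (b + 1.0);  // independent arithmetic overlaps the memory access
    return fetched * busy;                  // value only actually needed here
}

int main() {
    double table[4] = {1.0, 2.0, 3.0, 4.0};
    std::printf("%f\n", process(table, 2, 1.5, 2.5));
}
```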

2 Likes

I wouldn't agree on all points.

more memory channels in parallel is going to add latency. Is dual or triple channel the sweet spot? I was sure it was triple channel, but no, it really is still dual channel.

Not necessarily. In theory, more channels (or a wider memory interface) should not increase memory latencies. It typically does because of the added complexity, but there are some exceptions: AnandTech did some testing on Ice Lake vs Cascade Lake and found that going from 6 to 8 channels didn't change the overall latency.
Moreover, there are basically no direct DRAM memory accesses. Data will always be fetched (at least with Intel/AMD caching strategies) as a whole cache line (typically 64 bytes / 512 bits). A wider memory interface could even give better latencies, because the 64 bytes could be brought into cache in a single DRAM access across 8 memory channels. Obviously that usually doesn't work out, and more memory channels mostly add variance in latency.

Memory bandwidth isn't all that important if the cache is good. The cache is quite good on Zen 3, and Alder Lake seems decent also. The latency goes up more than I expected when the ring gets a little loaded; that's a weakness of multi-core Alder Lake for sure.

I would say it's actually the other way around. Caches nowadays aren't actually that fast in terms of bandwidth or throughput (they don't have to be) - at least not for a single core.
I mean, from cache to register you usually just get 64 bits / 8 bytes per load. Even if you fetched every cycle at 5 GHz, that would only be about 40 GB/s.
Caches are more about latencies: you're trying to improve the access times, not the overall bandwidth. Having large caches can be good, but it can actually make performance worse in some applications - I'm currently reviewing a paper that found exactly such cases.

I expected the M1 to have superiority in latency. It does not.

Did you do some testing on that? I haven't had the chance to dig deeper yet, but my first guess would be LPDDR5. At least for previous generations of DRAM, the LPDDR variants had much worse latencies than the non-LP variants. Could it be the LPDDR RAM that is causing the higher latencies?
I'm very excited about the M1 family and I'm looking forward to doing some testing of my own with these devices. But I have to work on Milan and Ampere first (could be worse, I know :slight_smile: )

1 Like

Very interesting discussion that I'll be following. In comparison to the scenarios listed above and to scientific usage, my use of the 5950X pales. I bought the 5950X mainly because I was coming off a decade-old i5-2500K and wanted an upgrade that should last me another 10 years or so.

I do a lot of transcoding for home media consumption and am now starting to get into the weeds of tweaking these processes. After jumping into the world of 4K media, transcoding a 50-60 GB MKV file has become quite a task. The 5950X can crush those files in about an hour or so, compared to 20+ hours on the previous i5.

I had the 2700X for a few months prior to getting the 5950X, and there's another small use case where I've found the 16 cores help: when using DaVinci Resolve, transcoding is much faster on the 16-core CPU, even though Resolve is a GPU-heavy program.

So while not so significant in the grand scheme of things, I am making use of my 16-core machine as much as I can. I also run a few VMs, so I'm sure that helps. Paired with 64 GB of RAM at 3200 CL16, I seem to be utilizing it a fair bit, though nothing near its potential, I suppose.

2 Likes
