Just curious if anybody here has gone with a consumer DDR5 system for a “workstation” style build? Particularly for AI/ML, or HPC/scientific computing work.
Both Intel and AMD support ECC on their high-end consumer chips (via W680 on Intel), and memory bandwidth is way higher these days than it used to be on dual-channel systems. Basically I’m curious if anybody is using high-end consumer chips as a mini-Threadripper, rather than shelling out for TR Pro or the new Xeon Ws?
If so, I’d love to hear about your setup and how you cope with the limitations, if you feel any.
But so are CPU and GPU speeds. DRAM scaling is very lackluster, and while DDR5 gives like +50%, it only makes DRAM less slow. If you need bandwidth, you still want 4, 8, or 12 channels, or HBM.
I don’t deal with scientific compute, but I ran some related applications and benchmarks to check my new machine with DDR5, and while it’s better than AM4, I see memory bottlenecks as soon as I hit 6-8 Zen 4 cores on some memory-intensive workloads. So depending on what you are doing, you get a performance regression, which might still be fine, but leaves a lot of compute on the table.
And going to 128GB+ means lowering memory clocks, which reduces bandwidth even further. Even the highest overclocks can’t compete against 4-channel DDR5-4800.
Dual-channel isn’t up to the task, but might still be the best bang for your buck in the end. Don’t expect to get 16 cores’ worth of compute.
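To put rough numbers on the dual-channel vs. 4-channel comparison above, here is a minimal sketch of the theoretical peak figure (channels x transfer rate x bytes per transfer). It assumes 64 data bits per channel and ignores real-world efficiency, so measured bandwidth will be lower. The function names and the DDR5-6000 "overclock" entry are my own illustrative choices, not figures from anyone's actual setup.

```python
def peak_bandwidth_gbs(channels: int, mts: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s.

    Assumes 64 data bits (8 bytes) per channel per transfer; real
    workloads will see noticeably less than this.
    """
    return channels * mts * 1e6 * bus_bytes / 1e9


def dual_channel_mts_to_match(channels: int, mts: int) -> float:
    """Transfer rate a 2-channel system would need to match the given config."""
    return channels * mts / 2


if __name__ == "__main__":
    configs = [
        ("2ch DDR5-4800", 2, 4800),
        ("2ch DDR5-6000 (illustrative overclock)", 2, 6000),
        ("4ch DDR5-4800", 4, 4800),
    ]
    for label, channels, mts in configs:
        print(f"{label}: {peak_bandwidth_gbs(channels, mts):.1f} GB/s peak")
    needed = dual_channel_mts_to_match(4, 4800)
    print(f"Dual-channel would need DDR5-{needed:.0f} to match 4ch DDR5-4800")
```

On paper, a dual-channel board would have to run its memory at twice the transfer rate of the 4-channel system just to break even.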
I think this is an interesting topic that, unfortunately, I have never had a great experience with myself.
I pulled up the output of dmidecode on an AWS EC2 instance that we use for scientific compute tasks; it looks like this:
Manufacturer: Amazon EC2
Product Name: r5.12xlarge
Socket Designation: CPU 0
Type: Central Processor
Manufacturer: Intel(R) Corporation
Version: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Voltage: 1.6 V
External Clock: 100 MHz
Max Speed: 3500 MHz
Current Speed: 2500 MHz
Upgrade: Socket LGA3647-1
Core Count: 24
Core Enabled: 24
Thread Count: 48
Total Width: 72 bits
Data Width: 64 bits
Size: 384 GB
Form Factor: DIMM
Speed: 2933 MT/s
So this can serve as a sort of reference for the type of system we are up against; note that many on-prem HPC systems get upgraded far less often, so they might be using even older nodes.
My experience has been that even if you can match or beat these types of specs on paper, the real-life results of trying to do scientific compute on consumer setups have been lackluster because of the lack of horizontal scaling.
We don’t use just one of these EC2 instances; we have dozens or potentially hundreds of them running at a time (if available).
So if one EC2 instance like this can finish an analysis on one sample in 6 hours, but a typical batch has 25 samples, well, that’s 150 hrs (a bit over 6 days) if you have to run them sequentially. If you can run all 25 in parallel across 25 EC2 instances (or HPC nodes), then your start-to-finish time is only 6 hrs. If you only have a single high-end workstation, it does not matter how many CPU cores and TB of memory and high-bandwidth interconnects you have; you’re still gonna be limited to the max capacity it can handle at any one time, and it won’t make much difference if you are using e.g. Zen 4 or the latest Intel arch for it. So from a hardware enthusiast’s perspective, it can be hard to justify splurging on hot tech if you’re still gonna have such limitations in your overall throughput.
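The batch arithmetic above can be sketched as a tiny helper. This assumes every sample takes the same time, samples are fully independent, and each node works on exactly one sample at a time; `batch_hours` is just a name I made up for illustration.

```python
import math


def batch_hours(samples: int, hours_per_sample: float, nodes: int) -> float:
    """Wall-clock time to finish a batch when each node runs one sample at a time.

    Samples are processed in "waves" of up to `nodes` samples each.
    """
    waves = math.ceil(samples / nodes)
    return waves * hours_per_sample


if __name__ == "__main__":
    for nodes in (1, 4, 25):
        hours = batch_hours(25, 6, nodes)
        print(f"{nodes:>2} node(s): {hours:.0f} h ({hours / 24:.2f} days)")
```

With the numbers from the post, one node gives 150 hours and 25 nodes give 6 hours. The 4-node case is an extra data point I added to show the in-between.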
It would be really interesting to hear about scientific workloads that people have found amenable to running within manageable timeframes on e.g. single home enthusiast workstation setups.
A GPU won’t do anything if the workload isn’t suitable for GPUs. GPUs do a limited amount of things very well, but CPUs can do everything, they’re jack of all trades. And often the work required to port a workload to a GPU affine algorithm is more expensive than just running it on CPU. Things like CUDA or OpenCL made this easier for scientists, but it still is extra work. A lot of stuff can’t be computed by GPUs at all…
yea my experience has been that in regards to making use of “exotic” components such as GPUs, in scientific computing you are heavily restricted by the actual software you have available and thus whatever capabilities the developers of that software were able to support
that is why in workloads such as I am used to, you might have one set of tasks that can scale out to dozens of CPU cores and hundreds of GB of memory, followed by subsequent tasks that are restricted to e.g. single thread processing because the specific software library you are required to use simply was not designed to support multithreading or other acceleration
heterogeneity of processing requirements also makes it more difficult to design a single system that is optimized for any possible task you might throw at it, you end up just trying to max out everything you can in the hopes that you don’t run into restrictions down the line
This is very true. Even if memory bandwidth went up by 500% overnight, it would still not be enough to saturate core execution on modern CPUs; they’d still be backend-bound for the problems I work on.
I could see the 12th/13th gen Intel processors being useful on some of the scientific problems that are more affected by Amdahl’s law and still fit into small memory spaces (<192GB).
Not all problems are vastly parallelizable, and because of this the less performant problems (on modern hardware) are not discussed in the common literature because people throw their hands up and say the problems are too hard.
I’d argue some of the most important problems don’t scale with core count and, because of this, are put on the sidelines since we don’t have a way to solve them, similar to how it took over a hundred years before we had the computing technology to prove the four color theorem. We’re sidelining many problems today because we lack the technology to solve them.
If we could get a breakthrough in virtual instruction set computing (VISC) architectures, we could be on the way to solving these classes of problems.
For example, the problems I’m currently working on take over 200 hours to solve using modern hardware on the most conservative meshes. I probably have hundreds of these I need to solve, but the results of one solution may influence design decisions on the next iterations, making this a serial process at times and ruling out scaling out the problem.
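For anyone who wants to see what Amdahl’s law does to problems like these, here is a minimal sketch. The parallel fractions in the loop are hypothetical values picked for illustration, not measurements of any solver mentioned in this thread.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's law: speedup = 1 / (serial + parallel / cores)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical parallel fractions, for illustration only.
    for p in (0.50, 0.90, 0.99):
        row = ", ".join(
            f"{n} cores: {amdahl_speedup(p, n):.2f}x" for n in (8, 16, 96)
        )
        ceiling = 1.0 / (1.0 - p)
        print(f"parallel={p:.2f} -> {row} (ceiling {ceiling:.0f}x)")
```

When the serial part dominates, adding cores barely moves the result, which is why a few fast cores can beat a many-core part on this class of problem.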
I’m using a 7950X for CFDs. It’s alright, but memory bandwidth is a major limitation. I’m currently doing [email protected]/s, which is much more performant than my previous [email protected]“6000MT/s”, which performed like absolute garbage despite reporting decent frequency and timings.
You really want at least four memory channels for this sort of thing.
This is a good observation, and in my own field I have seen that things like FPGAs are starting to make headway in these areas where traditional computer hardware has not been making advances fast enough to accelerate workloads
example here is the Illumina DRAGEN FPGA accelerator platform, which can complete some common genomics tasks roughly an order of magnitude faster than the traditional server using only CPU
this has implications in home server and workstation design, since a specialized $$proprietary$$ hardware component could potentially yield such massive speed gains as to make home-computing harder to justify if you don’t have access to hardware that could e.g. reduce execution time to 1/10th or less the usual time.
FPGAs could definitely be a possible solution too; they have the ability to physically solve the problems, the issue is there isn’t a good middleware layer to go from boundary conditions to synthesis/P&R.
The complexity of the problem is enormous and imo the industry isn’t cohesive/collaborative enough to come to a standard. I think that if middleware was developed, it would be one vendor, probably Ansys, that comes out with its own implementation and ecosystem for FPGA-accelerated solvers, and they wouldn’t share it with anyone.
Too bad GPUs don’t have more memory so they could be useful in simulation; most Fluent and OpenFOAM problems that will fit into GPU memory actually scale pretty well.
Just to make this more visible:
We moved from like DDR3 1866-2133 to DDR5 4800 in the last 10 years. 10 years ago we had dual-core and quad-core desktops, and high-end servers ran with 8 cores. Today it’s 16-24 cores on desktop and 96 on servers.
That’s a ~10x increase in core count compared to ~2.5x on DRAM speed.
And I didn’t count IPC or increased clock speeds into the mix which makes that probably 20x on the CPU side of things.
DRAM sucks, but it is still the best we have. Except for some boutique HBM products.
+500% as @twin_savage suggested isn’t some made-up number. This is what DRAM needs to keep up with the curve set by other components like CPUs, NVMe, or GPUs.
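A quick sketch of the ratios quoted above, using only the figures from this post (DDR3-1866/2133 to DDR5-4800, 8 to 96 server cores). It only compares transfer rate per channel and deliberately ignores changes in channel count, IPC, and clock speed, so treat it as back-of-the-envelope arithmetic rather than a measurement.

```python
def growth(old: float, new: float) -> float:
    """Simple growth factor between two values."""
    return new / old


if __name__ == "__main__":
    dram_low = growth(2133, 4800)   # starting from DDR3-2133
    dram_high = growth(1866, 4800)  # starting from DDR3-1866
    cores = growth(8, 96)           # high-end server core count

    print(f"DRAM transfer rate: {dram_low:.2f}x to {dram_high:.2f}x")
    print(f"Server core count:  {cores:.0f}x")
    # Transfer rate per core, relative to 10 years ago (same channel count assumed).
    print(f"Per-core share:     {dram_low / cores:.2f}x to {dram_high / cores:.2f}x")
```

Under those assumptions each core ends up with roughly a fifth of the transfer rate it had a decade ago.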
Part 1 (the “Serious” part): I think gc71 above had some good observations. The “how” you’re going to use it (standalone, part of a cluster, etc) may be more important.
Also, will this be the machine’s primary role? Or will it have to do more than that? And while I’m not really doing any “scientific” or “engineering” programming anymore, where a TR would arguably be more useful than my 5950X, I do regret not having more PCIe lanes. Maybe a different motherboard choice (I thought more was going through the chipset than it is) would have resolved that but heh, it’s still overpowered for what I’m doing…
Part 2: Thanks for making me realize I’ve moved into “Old Grouchy Grognard” territory when my first thought was what are you worried about?
I’m betting most of you are reading this and living in a building that was built after the year 2000…That has to comply with the national building fire codes I was doing the engineering work on with an original IBM XT (4.7MHz / 640 KB RAM) when I was at Brookhaven National Lab back in the 1990s. We would literally commandeer every computer in my division to run the simulations overnight and over the weekend. The geeks among us (who had 286 and 386 computers at home) would set them up to run at home during the day when we were at the office. Using floppy disk sneakernet to move files around.
And since you all “forced” me to go down this path, it turned out one of my professors was doing nuclear research for the Navy (Monte Carlo simulations for reactor design). Which I discovered when we ran into each other in the cafeteria. And he was telling me how he was doing all his work on an IBM XT as well. Fire it up Friday night and it’s done on Monday. They made him redo it for the Cray supercomputer (either at Sandia or Los Alamos, I don’t remember which) because it would “look better” in the final report for DOD. It took the Cray less than a blink to actually run it and get the results (same as the XT to a decimal point!) BUT it took them 6 weeks to actually get around to running it!
Thanks for everyone’s input, this generated some interesting discussion and anecdotes!
I guess the consumer systems that I was talking about really have 3 main limitations:
Low memory bandwidth
Low PCIe lane count
Low maximum RAM capacity
Technically, low core count may also be an issue, but IME, if you have that problem on a 2-channel system, you have a very compute-bound problem.
I’m using an Epyc 7302 (16-core Rome, 3.3GHz) which has 8 memory channels. GPUs do most of the heavy lifting in my case, but there are plenty of short CPU sections where we’re realizing that single-core performance genuinely does matter a lot. It’s mainly this latter point that had me wondering if a consumer system would be viable. 128 GB of RAM is enough in my case, so my main worry is whether using chipset lanes for GPUs would cause problems, and whether the dual-channel memory would cause a bottleneck on those short, bursty CPU sections. There’s no real way to know without trying (which I’m not likely to do), but this is what prompted my curiosity.
Some GPU jobs barely use PCIe bandwidth at all. Some use a lot. You should be able to test this by looking into nvtop or nvidia-smi.
Or by setting your PCIe generation from 4.0 to 2.0 in the BIOS to simulate the bandwidth drop from x16 to x4. This doesn’t account for sharing bandwidth with other devices on the chipset. NVMe and 10Gbit NICs are very noisy neighbors.
If the chipset isn’t overloaded and your GPU jobs don’t need much bandwidth, an x4 chipset-lane PCIe slot on a consumer board is totally fine.
Oh and although x4 can work, if that’s just a PCIe 3.0 slot with x4, you’re basically down to effective x2. Might be too much of a stretch even for bandwidth-friendly jobs.
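The equivalences used above (Gen 2 x16 standing in for Gen 4 x4, and Gen 3 x4 being roughly Gen 4 x2) can be checked with a short sketch. It uses the published per-lane signalling rates and line encodings only. Protocol overhead and chipset uplink sharing are not modelled, and the table and function names are my own.

```python
# Per-lane raw rate in GT/s and line-encoding efficiency for each PCIe generation.
PCIE_GEN = {
    1: (2.5, 8 / 10),      # 8b/10b encoding
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
}


def pcie_gbs(gen: int, lanes: int) -> float:
    """Approximate usable bandwidth in GB/s, per direction, before protocol overhead."""
    gts, efficiency = PCIE_GEN[gen]
    return gts * efficiency / 8 * lanes  # divide by 8: bits -> bytes


if __name__ == "__main__":
    for gen, lanes in [(4, 16), (2, 16), (4, 4), (3, 4), (4, 2)]:
        print(f"PCIe {gen}.0 x{lanes}: {pcie_gbs(gen, lanes):.2f} GB/s")
```

Gen 2 x16 and Gen 4 x4 land within a couple of percent of each other, which is why the BIOS trick is a reasonable stand-in for a chipset x4 slot.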
And if you have an EPYC, dropping down to consumer-level may give you better clocks on the CPU, but there is little else that consumer land has going for it. Ok, RGB and audio jacks, but that’s off-topic.