Cheapest CPU/mobo combo that can run the fastest memory possible?

Hello, I want to conduct a test. The test needs the fastest RAM bandwidth possible, because the results are going to depend on the latency and total throughput of the RAM.

But the budget is tight and I don't have the time to bin RAM + CPU + mobo combos.

So I would like you to advise me on a (used) Threadripper/Epyc/Xeon CPU + mobo + RAM combo capable of running octa-channel or quad-channel memory and reaching near 200 GB/s RAM reads/writes, if not a little higher (anything above 150 GB/s is acceptable if near 200 isn't possible for the budget).

The used parts need to cost less than 1000 USD (for CPU + mobo + RAM).

The RAM doesn't have to be a lot, but preferably at least 32 GB, if not more.

Cores can be as few as 8, and actually the higher the single-/few-core turbo and base frequency, the better.

If this is not possible in such a low price range, then I am after the cheapest CPU that can normally handle the highest XMP (or manual) profile possible, or at least has a very good chance (more than 50%) of working with it.

I am hoping for speeds at or above 100 GB/s in this subcase.
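
For context, here is the rough math I'm working from; these are theoretical peaks (real-world reads and writes will land lower) and the speeds are just examples, not a specific shopping list:

```python
# Theoretical peak bandwidth ≈ channels × transfer rate (MT/s) × 8 bytes per 64-bit transfer.
# Example DDR4 speeds; actual read/write benchmarks come in below these numbers.
def peak_gb_s(channels: int, mts: int) -> float:
    return channels * mts * 8 / 1000  # GB/s

print(peak_gb_s(4, 3200))  # quad-channel DDR4-3200 -> ~102.4 GB/s
print(peak_gb_s(8, 3200))  # octa-channel DDR4-3200 -> ~204.8 GB/s
```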

What exactly are you trying to do?
As I recently found out in the CFD thread, high bandwidth numbers don't necessarily translate to high performance. I'm getting 1,800 GB/s memory read bandwidth numbers, but performance was losing out to systems with ~250 GB/s until numactl was invoked to improve locality and cut latency.

A dual-socket Naples or Rome system might be able to satisfy the requirements, but it will have high inter-socket latency and Sandy Bridge levels of single-threaded performance.
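
To illustrate what I mean by locality, a hypothetical sketch of pinning a bandwidth test to one NUMA node with numactl (the benchmark binary name is just a placeholder):

```python
# Hypothetical sketch: keep threads and allocations on NUMA node 0 so cross-node
# hops don't inflate latency. "./stream_benchmark" is a placeholder binary.
import subprocess

subprocess.run([
    "numactl",
    "--cpunodebind=0",   # run threads only on the CPUs of node 0
    "--membind=0",       # allocate memory only from node 0
    "./stream_benchmark",
], check=True)
```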

Vega 64 GPUs have HBM2 memory with 483 GB/s of bandwidth. They also have a feature called HBCC which allows them to use system RAM (and I think also disk space) as an "L3 cache" of sorts.

I have seen performance increase significantly with this feature enabled, mostly in games that need more than 8 GB of VRAM (videos showcasing little to no difference are of games with settings that don't utilize more than 8 GB of VRAM in the first place, hence irrelevant), even though the system memory is significantly slower than the HBM2 on the GPU die (usually basic DDR4 speeds, so around 40 GB/s, which is 10+ times slower; in spite of that it still had a positive influence)…

I want to check if this feature can be utilized in AI applications that need GPUs with high VRAM capacity, e.g. more than 42 GB (so use the system RAM via HBCC as VRAM), which I think, given parity or close to it in speed between RAM and VRAM, would dramatically increase the speed of the GPU to the point that it may be half decent, especially with multiples of them, to make a cheap yet fast AI rig.

Vega 64 GPUs also use a PCIe 3.0 x16 interface that tops out at about 16 GB/s.
This will be the limiting factor even using slow DDR4 system memory.
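
For reference, a quick sketch of where that ~16 GB/s figure comes from:

```python
# PCIe 3.0 runs at 8 GT/s per lane, per direction, with 128b/130b line encoding.
gt_per_s = 8
encoding = 128 / 130   # usable fraction after encoding overhead
lanes = 16

gb_s_per_direction = gt_per_s * encoding / 8 * lanes  # divide by 8 bits per byte
print(round(gb_s_per_direction, 2))  # ~15.75 GB/s each way for an x16 slot
```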

Yup, but wouldn't CrossFire multiply that?

This might not be what you’re looking for, but (for example) G.Skill has a 2x24 GB DDR5-8400 kit that can apparently give you ~120 GB/s read and write speeds (measured with an i9-14900 in a Z790 board with XMP 3). Now, I don’t know if you can get all that for the price target you set, but Z790 boards have gotten cheaper, and will give you the 16 PCIe 5.0 lanes you’d probably want your dGPU plugged into. The new 5090 or 5080 might be especially interesting for what you want to try, but I don’t know if they can be made to “dip” into system RAM if their on-board VRAM isn’t enough. Wendell had those cards in for testing (new videos), so he might know.
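
Rough theoretical ceiling behind that measured ~120 GB/s figure, using the same channels × transfer rate × bus width arithmetic:

```python
# Dual-channel DDR5-8400: 2 channels × 8400 MT/s × 8 bytes per transfer.
print(2 * 8400 * 8 / 1000)  # ~134.4 GB/s theoretical peak, so ~120 GB/s measured is plausible
```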

Nope, unfortunately expanding VRAM this way is an AMD feature limited to the Vega architecture as far as I know, although it would be cool if it could reach newer GPUs too.

There are some AMD professional GPUs (maybe Instinct, but I am too bored to google it) that offer physical expansion slots for NVMe SSDs, I think, or something like that.

This might be of interest to you: https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9426-using-tensor-swapping-and-nvlink-to-overcome-gpu-memory-limits-with-tensorflow.pdf
And, at least in theory (:stuck_out_tongue_winking_eye:! ), even consumer dGPUs can access system RAM; the main obstacles and bottlenecks are drivers, the chipset, and the speed of the PCIe bus.

I’ve never really looked into CrossFire, but I think it was exclusively used to render the same scene on multiple GPUs by sharing resources over PCIe. In theory multiple GPUs could put more load on system memory bandwidth, but in practice I think most if not all CrossFire traffic was GPU to GPU.

Abstracting away the difference between system memory and GPU memory makes it seem like you can get all the benefits of both capacity and speed, but at the end of the day HBCC can’t circumvent the underlying hardware limitations.

Yeah, I used the term “CrossFire” loosely so that you get what I mean: using multiple GPUs in unison for compute (e.g. via ROCm’s HIP).

I believe that one way or another the bandwidth will stack with multiple GPUs, and most of the mobos that offer quad- and octa-channel RAM for Epyc/Xeon etc. CPUs usually have quite a few full-fledged x16 slots (also, PCIe 3.0 x16 is 16 GB/s each way, so 32 GB/s total across both directions).

Last but not least, I agree with your conclusion; that's why I just want to test it and don't want to spend much. That said, despite the PCIe 3.0 limitation there are signs that it could work, e.g. in gaming:

(embedded video comparing HBCC off vs. on in The Last of Us)

Note also the smoother frame times in the instance with HBCC enabled. Last but not least, notice the VRAM demand, which is important: if you don’t exceed 8 GB of VRAM with your game resolution and settings, HBCC won’t do anything at all, or may even be detrimental. That’s what I noticed in lots of videos underestimating HBCC; they play at 1080p and VRAM usage is noticeably less than 8 GB…

Also noteworthy is the fact that this rig runs a Zen 2 CPU (a 3900, I believe) and some very slow RAM.

Even if ROCm still supported Vega, each GPU would still suffer the same limitation.

Even 10 times the bandwidth still leaves you at levels below mid-range modern GPUs. Unfortunately it just doesn’t make sense to treat memory from multiple GPUs, or system RAM and VRAM, as a single pool.

HBCC reminds me of AGP texture acceleration, which was a similar idea that never really worked out, for similar reasons.

It would be nice to be able to put all that HBM to use but there are just too many obstacles, on both the hardware and software sides.

It does work with Vega; they just don’t do bug fixes and such, officially at least. And HIP is essentially a layer that allows the GPUs to work on the same workload, so “CrossFire”.

It does; 10 times would be 160 GB/s, and the Vega 64’s HBM2 speed is ~483 GB/s, so that is about 33% of its “native” speed.

Anyway, the bottom line is that from outside evidence I am compelled to believe that utilizing HBCC will affect the GPU’s performance positively.

Actually, I am almost certain of that; the question is (and the reason I want to conduct the tests myself) by how much.

I know that it won’t be mind-blowing but rather incremental, but again it depends on which side of the “bell curve” it sits on; maybe it is good enough to make financial sense compared to the cost of new hardware with similar performance.

Being bandwidth constrained, I’ll be surprised if AI performance using a Vega 64 with HBCC to run models larger than 42 GB would be any faster than using the CPU alone.

Let us know if you get anything working, because that would be an achievement on its own, and any results would be interesting one way or the other.

If it is there in games, it could be there too in AI; that was my entire point when showcasing the almost 2x performance uplift in The Last of Us, from 19 to 38 FPS. That means it took the GPU ~53 ms to process each frame, because it had to unload and load VRAM data since VRAM was full, and with HBCC enabled it needed ~26 ms to process the same frame workload.

Rendering a game and AI differ considerably, though, and I think HBCC has a lot more potential when it comes to games. Even under normal conditions, when VRAM is plentiful, games only use a relatively small subset of the data they load into VRAM to render any given frame, so there’s a lot of data that can potentially be swapped out to system memory without being missed. AI, on the other hand, just runs through gigabyte after gigabyte of parameters, layer by layer, in sequence.

But if you’re set on giving it a go… with the Vega 64 being limited to PCIe 3.0 x16, any DDR4 or DDR5 system will suffice as a test bed. Getting even a 16 GB model to run faster than the CPU on an 8 GB Vega 64 using HBCC would prove that it may in fact be viable.
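
If it helps, a minimal sketch of that kind of CPU-vs-GPU comparison, assuming a ROCm build of PyTorch (which exposes the Vega through the usual torch.cuda API); the matrix size and iteration count are placeholders, and the interesting case is a working set bigger than the card's 8 GB:

```python
# Minimal CPU-vs-GPU timing sketch (assumes a ROCm PyTorch build; sizes are illustrative).
import time
import torch

def bench(device: str, n: int = 8192, iters: int = 10) -> float:
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device != "cpu":
        torch.cuda.synchronize()   # finish setup before starting the clock
    start = time.time()
    for _ in range(iters):
        _ = a @ b                  # stand-in for a model's layer-by-layer matrix math
    if device != "cpu":
        torch.cuda.synchronize()   # wait for queued GPU work before stopping the clock
    return (time.time() - start) / iters

print("cpu:", bench("cpu"))
if torch.cuda.is_available():      # ROCm devices also show up through this API
    print("gpu:", bench("cuda"))
```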

In what way? I mean, the end result is different (on one side you have visual imagery, on the other you have text from a prompt), but both are just matrix calculations taking place; it is actually very similar as a workload. Hence I am very curious to see if it makes monetary sense (because surely there will be a performance uplift, I have no doubt about that).

That is true, but since it will not be an out-of-the-box experience, there will surely be many nooks and crannies for me to take care of every step of the way. So if I am going to spend so much time tinkering, I would prefer doing it with hardware that is worth keeping around as part of my homelab rig after I am finished with the test :stuck_out_tongue:

So I am looking for bang for buck, but also specced in a way that would make sense even if I am not going to use the Vegas that I have lying around (from my mining rigs).

And nobody is helping out :frowning:

Gaming and AI are vastly different workloads, which is why the designs of NPUs differ dramatically from those of GPUs.