Rusty PC Builder here, looking for advice on my first beast build in 10 years

The Creator board I link to supports 8x 8x 2x while the MSI board is 8x 8x 4x, you can also go for the cheaper MSI MPG X670E CARBON WIFI which doesn’t seem like much of a downgrade compared to the other MSI board.
Looking at NVIDIA GeForce RTX 4090 PCI-Express Scaling | TechPowerUp it seems like you would get away using a 4x slot without much penalty at all.
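For a rough sense of what each link width gives you, here is a minimal sketch of the theoretical one-way bandwidth per PCIe generation. It only accounts for the transfer rate and 128b/130b line encoding, so real throughput will be somewhat lower; the function name is made up for illustration.

```python
# Approximate one-way PCIe bandwidth by generation and link width.
# Only the per-lane transfer rate and 128b/130b line encoding are modelled;
# packet and protocol overhead are ignored, so real numbers are a bit lower.

GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}  # giga-transfers per second, per lane
ENCODING = 128 / 130                    # 128b/130b encoding (PCIe 3.0 and later)


def pcie_gb_per_s(gen: int, lanes: int) -> float:
    """Return the approximate one-way bandwidth of a link in GB/s."""
    gbit_per_s = GT_PER_S[gen] * ENCODING * lanes
    return gbit_per_s / 8  # bits to bytes


if __name__ == "__main__":
    for lanes in (16, 8, 4):
        print(f"PCIe 4.0 x{lanes}: {pcie_gb_per_s(4, lanes):.1f} GB/s")
```

A PCIe 4.0 card in an x4 slot still has roughly the bandwidth of a PCIe 3.0 x8 link, which fits with the small penalty seen in the scaling article.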

Aah, right. Well, then a watercooled EPYC/TR/Xeon build is unavoidable. I do not see how you can possibly fit three 3-slot GPUs in a single case, seeing as most cases have 8 slots, so at least one needs to have water cooling either way.
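To put numbers on the slot problem, here is a rough count. It assumes every card sits flush against its neighbour with no gap for airflow and that the motherboard's slot spacing allows it, neither of which is a given; the function names are made up for illustration.

```python
def slots_needed(card_widths):
    """Total expansion slots taken when cards sit directly next to each other."""
    return sum(card_widths)


def fits(card_widths, case_slots):
    """True if the cards fit within the case's expansion slot count."""
    return slots_needed(card_widths) <= case_slots


three_air_cooled = [3, 3, 3]          # three 3-slot cards
print(slots_needed(three_air_cooled))  # 9
print(fits(three_air_cooled, 8))       # False: one slot short in an 8-slot case
print(fits([3, 3, 2], 8))              # True: one card slimmed down to 2 slots
```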

An AM5 motherboard could theoretically run an x8 / x8 / x8 config, but in practice you want 4 lanes for an m.2. The B650/X670 chipsets are disappointing right now though - the X570 was extremely promising and had a x8 / x8 / x8 setup.

Now this is speculation and wishful thinking, so don’t take this as gospel, but from what I can see, a PCIe 5.0 x8 / x4 / x4 with eight x2 m.2 slots would probably be enough bandwidth to last to 2035, for all new builds. A 32 lane CPU could eliminate the need for chipsets and AM5 got 28 lanes… So close :cry:
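The lane arithmetic behind that wish can be sketched as below. The 28-lane figure is the one quoted above; the helper name is hypothetical and the layout is just the speculative one from this post.

```python
def lanes_needed(slot_widths, m2_count, lanes_per_m2):
    """Total CPU lanes for a set of PCIe slots plus a number of m.2 slots."""
    return sum(slot_widths) + m2_count * lanes_per_m2


AM5_CPU_LANES = 28  # figure quoted in the post above

# Speculative layout from the post: x8 / x4 / x4 slots plus eight x2 m.2 slots.
wishlist = lanes_needed(slot_widths=(8, 4, 4), m2_count=8, lanes_per_m2=2)
shortfall = wishlist - AM5_CPU_LANES

print(wishlist)   # 32
print(shortfall)  # 4
```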

Anyway, no use complaining over spilled milk here. As for PCIe lanes, for AI x1 lane could possibly be enough bandwidth for training, though it would make uploads and downloads of sets suffer. I have no data here though.

relevant video by ian cutress:

because at best you confused people there at worst you get them to think, that ddr5 has some kind of ecc, that actually matters.

on-die ecc got added to increase yields. that’s it, that is what it is for.

on-die ecc is meaningless nonsense. in fact it is harmful, because there is so much wrong information about this, that even highly technically aware people think, that ddr5 added REAL ECC to all systems now.

this is nonsense of course.

so let’s live in a magical world, where the on-die ecc will always correct the errors, that happened during the data sitting on the memory.
first off we DON’T KNOW. there are no logs, there is nothing. does it error or not? does it correct the errors or not? WE DON’T KNOW.

2.: there is no protection on the way to the memory and back from it.
we DON’T KNOW whether it errored or not.

real ecc protects the data on the way to the memory, while it sits on the memory and on the way back from the memory.
and it will log all errors for you to KNOW if there is an issue at all.
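for what it’s worth, on linux those logs show up as counters that the kernel’s EDAC subsystem exposes under /sys/devices/system/edac/mc. here is a minimal sketch for reading them, assuming an EDAC driver for your memory controller is loaded. the function name is made up for illustration.

```python
from pathlib import Path

# Linux EDAC sysfs location for memory controllers.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_ecc_counts(root: Path = EDAC_ROOT) -> dict:
    """Return {controller: (corrected, uncorrected)} from the EDAC counters.

    An empty dict means nothing is being reported (no EDAC driver loaded,
    or not Linux), so you cannot tell from it whether errors happened.
    """
    counts = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (corrected, uncorrected)
    return counts


if __name__ == "__main__":
    result = read_ecc_counts()
    if not result:
        print("no EDAC memory controllers reported")
    for name, (ce, ue) in result.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

if this prints nothing useful on a board that claims ecc support, that is the "corrects but doesn’t report" situation: you still don’t KNOW.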

you might be aware of all this and might have seen this video already, but honestly everyone here should watch this video.

people will run file servers without ecc, because they think, that “on-die ecc” is real ecc and they will see file corruption and won’t even think, that it could be the memory when troubleshooting at first.
this is all horrible.

the HORRIBLE thing is, that thanks to all of this “on-die ecc” is setting back ecc support on desktop platforms for years, maybe even a decade.

on am4 almost all motherboards had full ecc support (error correction + reporting)
on am5 ONLY asus, the WORST motherboard company officially claims support for ecc and from the testing i saw all others either don’t work, or they correct the error, but don’t report it.

reddit post from truenas subreddit on ddr5 ecc support on am5:
https://www.reddit.com/r/truenas/comments/10lqofy/ecc_support_for_am5_motherboards/

we might have to wait until ddr7 now to get REAL ECC in all desktop and laptop machines FINALLY.
something, that should have happened decades ago.

also during am4, there was 3600 mhz 16-19-19-39 real ecc memory.

so we had highspeed real ecc memory on am4.
that is gone now. there is nothing for now on am5. there is only high speed tight registered memory right now.

will be an exciting future when you got desktop systems with 256 GB of memory throwing out memory errors once a month, but no one notices and the tech industry still shows us the middle finger…

but yeah i’m rambling now.

PLEASE be absolutely clear when you talk about ddr5 ecc support, because we actually need to fight this i’d argue deliberately created confusion with the fake “on-die ecc” memory to protect users from data loss/corruption/stability issues and also to maybe in 5-10 years actually get to ecc being the standard for everything.

EDIT: ian cutress video mentions, that using ecc could result in slower performance as you’d have to run slower speeds. that video probably was made before docp/xmp ecc kits came out. so that problem does NOT exist. you can just run 3600 mhz cl16 ECC memory on am4 and you get the full performance with error correction.
of course the motherboard and cpu mem controller need to be able to run that speed, bla bla bla, but with error correction you can actually see when it fails or when it doesn’t in a freaking log! but that isn’t a problem for sweetspots on amd platforms usually at least…

also gamers would LOVE ecc memory, because you can run a memory overclock closer to the sun as the ecc would still correct an error popping up once a week during normal use for example. and finding errors when doing overclocking would also be vastly easier, because you can just check the logs…
so gamers would love ecc memory, the average person would love to have ecc memory everywhere. so everyone benefits.
the production cost difference is very small btw too.


Thanks @pega for that. I’ll definitely do my reading up when dealing with DDR5, it’s a shame that DDR5 doesn’t have Real ECC, but then again if my next build after this, would be for just performance, won’t need to worry about ECC… but this build is definitely going towards the TR route.

@wertigon Yes, it’s definitely looking like (for this build) a TR water cooled (AIO loop) setup is where I’m leaning right now.

As for the case, yes most cases only do 8 slots… but the Fractal Define 7 XL has 9 + 3 slots… so it’s looking good but I’m researching to see if the water cooling radiator will still fit:

Enermax Liqtech TR4 II 360 Addressable RGB AIO CPU Liquid Cooler
or
CORSAIR iCUE H170i ELITE CAPELLIX Liquid CPU Cooler

As for the original conundrum of PSUs, I’m still double checking using PCPartPicker; it seems that the Super Flower Leadex 2000 W 80+ Platinum has enough rails / juice to power this setup.

Here’s a PSU Testing Build I’ve made:
(Pls Ignore the CPU in the list, this list is just to check if the PSU will pose any issues)

i would avoid all enermax liquid coolers. they had quite a lot of issues with gunk up and i wouldn’t trust a new one from them if i HAD TO get an aio.
i mean you’d have to hold a gun to this head to get me to get one over an aircooler, but gosh darn it, i wouldn’t get the enermax one :smiley:

(gn video about gunk up problem)

and don’t remember if you already have it, but if you’re considering buying the samsung 980 pro new, i would advise against it.
samsung had a bunch of issues with their drives lately to the point where puget systems (workstation builders and stuff) dropped them almost entirely:

and sorry if i forgot that too, but if you plan on buying a new 4090, that is from gigabyte, DON’T.
gigabyte is having issues with a bunch of cracked pcbs near the slot and they don’t wanna take responsibility for it. granted a bunch of them could be HORRIBLY shipped, but there appears to be a spike of cracked gigabyte pcbs.
gigabyte is also generally one of the worst graphics card makers imo right now.
it could also be worth to check if you can get a 4090 with a 12v 2x6 connector, or a midway partial change at least.
the shorter sense pins on the card should at least hopefully reduce the risk of melting parts/fire a bit.
video about the revision:

and as a bunch of cards got already updated you could try to get one with the updated connector and make sure it isn’t gigabyte i guess :smiley:
that is what i would do if i had to buy a 4090 at least.

btw funny to see a 20k + pc partpicker for psu testing ideas :smiley:

no need to comment on the rest, because it is placeholder anyways right? maybe enermax cooler and gigabyte card and samsung ssd were already placeholder, but i guess worst case i gave you generally good information hopefully and didn’t waste my time :smiley:

Thank you for bringing me up to speed; I agree ECC will probably be more and more relevant, but as long as it costs me $300-$500 extra for a platform that supports ECC (CPU + MB + RAM), I cannot in good conscience recommend ECC for home NAS use or other places where data corruption is not a catastrophic failure. It just is too much money.

Then again charging an arm and a leg for ECC capabilities is about as despicable as charging extra for ABS and seat belts in a car, IMO. At least no one dies if your NAS ate your wedding photos due to bitflips…

actually, the price difference for a nas should be almost 0.

most people building a nas would get am4, because dirt cheap and most am4 motherboards and cpus support ecc memory.
and you can get a garbage speed 16 GB ecc 3200 mhz cl22 stick for 38 euros.
so the am4 platform cost difference was 0, it was only the memory.
the issue comes in, when you want FAST ecc memory …
i probably paid around double to have ecc at that memory speed.
getting 64 GB ecc 3600 mhz 16-19-19 unbuffered memory at the time for am4.
so 500 euros vs 250 euros if it wasn’t ecc i think roughly.

but yeah on am4 for a fileserver, you can grab yourself jedec sticks for very cheap and have at maximum 50 euro difference i’d say, depending on your memory size.

we don’t have another unbuffered memory set to compare those to though in regards to price premium.

it was VERY expensive, but cheaper than you thought is my point and for any home nas the price difference is 0 or almost 0, so those should always run ecc, because well it is free or almost free in cost difference.

that was or is the beauty of am4.
i deeply hope, that they will fix support on am5 or at least with am6, if am6 comes with zen6.

depends on the marriage you’re in maybe :wink: (couldn’t help myself)

and there is a good question to ask here, how many systems caused deaths, because the systems used non ecc memory and errored out?

and as you mentioned cars. are all the computers, that influence driving in modern cars built with ecc on all connections and memory?

so there might be quite some deaths due to systems not having ecc.
just not directly maybe very often.

also damn is this comment way too long that i wrote here now lol


I don’t really see cost being an issue these days

If budget is no worries, Wendell’s been putting together reviews of some very beefy Falcon Northwest systems.

You might want to check this builder, even if just to get some ideas -

https://builder.falcon-nw.com/templates/talon

The actual RAM cost itself is negligible.

However, you need both a motherboard with the physical traces for ECC, as well as a CPU that supports this.

For AM4, the 5600G does not support ECC, even if the motherboard does. You could buy a regular 5600 but then you also need to spend $30 extra on a GPU and $40 extra on at least a B550 board. That brings a $400 machine to $500 - at the very least.

For Intel, you can only get ECC if you buy a W680 board for something like $400+. At least it is possible on LGA1700 sockets as opposed to LGA1200.

ECC costs are not just buying the RAM sticks, would love to see this changed, then again I would love for Nvidia to open their damn driver source for Linux, too :pensive:

Thanks Everyone with the comments and advice - definitely useful information there, much appreciated - It’s like a refresher for someone like me who hasn’t been building in a while.

In so far as ECC Ram is concerned, I’m going down this path, along with threadripper on the ASRock board.

I’m doing final checking on the power portion because yes, power and melting have been at the forefront of my mind as I’m putting this together, cos there’s 3 cards here: the 2 x A6000 Adas use a PCIe Gen5 power connector adapter and the 4090 also uses a 16 pin monster of a connector.

I am re-reading everything and checking if I can swap the 4090 out with a 3090 Ti instead, as @pega previously pointed out.

I’ll put together a pre final parts list very soon, thanks everyone so far!

in regards to the overall powerbudget, psus and cases are relatively cheap compared to a6000 cards :wink: and you should in your case get some 20 euro good enough power meter, apply psu efficiency, run everything at max load and see if you can trip the psu and if it stays below its max sustained load.
at worst case the psu should just trip under full load, UNLESS
it is a gigabyte psu :wink:
https://www.youtube.com/watch?v=7JmPUr-BeEM (funny gigabyte speedrun explosion video, this failure can take out your graphics cards of course too, also funny video :smiley: )
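the "apply psu efficiency" step is just a multiplication: the meter at the wall shows what the psu draws, and the psu rating is about what it delivers to the parts. a minimal sketch, where the 0.92 efficiency and the 1800 watt reading are made-up example values (check the efficiency curve of your actual unit) and the function names are hypothetical:

```python
def dc_load_watts(wall_watts: float, efficiency: float) -> float:
    """Estimate the load on the psu output from a wall power meter reading."""
    return wall_watts * efficiency


def load_fraction(wall_watts: float, efficiency: float, rated_watts: float) -> float:
    """Estimated output load as a fraction of the psu's rated wattage."""
    return dc_load_watts(wall_watts, efficiency) / rated_watts


# Example values only: 1800 W at the wall, 92% efficiency, 2000 W rated psu.
reading = 1800.0
print(f"{dc_load_watts(reading, 0.92):.0f} W on the output side")
print(f"{load_fraction(reading, 0.92, 2000.0):.0%} of the rated load")
```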

and just some random psu load testing information.

psus can trip by having a sustained load hitting opp (over power protection), or a spike going high enough and long enough to trip it.

the opp might be set at 130%. (so on a 1000 watt psu, that is a 1300 watt OPP then)
so having a sustained load of 110%, that is very very low on spikes should not trip the psu, but you should NOT run a psu over its max sustained load.

so you can have a 1000 watt psu, that trips with a sustained load of 800 watts.
and you can have a 1000 watt psu, that does not trip with a sustained load of 1100 watts.
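the opp arithmetic as a small sketch. it only looks at sustained load, so it does not model the spike case, which is why a real 1000 watt psu can still trip at 800 watts sustained. the 130% is the example figure from above, not a value every psu uses, and the function names are made up:

```python
def opp_threshold_watts(rated_watts: float, opp_percent: float = 130.0) -> float:
    """Sustained load at which over power protection should trip."""
    return rated_watts * opp_percent / 100.0


def classify_sustained_load(load_watts: float, rated_watts: float,
                            opp_percent: float = 130.0) -> str:
    """Place a sustained load relative to the psu rating and its opp point.

    Transient spikes are NOT modelled here; they can trip a psu well
    below these sustained numbers.
    """
    if load_watts >= opp_threshold_watts(rated_watts, opp_percent):
        return "trips opp"
    if load_watts > rated_watts:
        return "over rating, below opp"
    return "within rating"


print(opp_threshold_watts(1000))            # 1300.0
print(classify_sustained_load(1100, 1000))  # over rating, below opp
print(classify_sustained_load(800, 1000))   # within rating
```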

i ran prime 95 with 2 cores left free, and used those for furmark.
so prime 95 + furmark, but obviously that won’t be the hardest load for a6000 cards i’d assume :smiley:

so the parts should all be safe with a proper psu, even when the psu gets overloaded/tripped.
and the psu itself should also be completely safe from that.
it is DESIGNED to trip and protect hardware and itself without issues.

the testing might just be your ai workloads, that run at PPT probably on all 3 cards and some prime95 taking up lots of system memory and using enough threads to saturate PPT of the cpu.

just random powersupply testing stuff once you have built things and a bit of background info and hopefully the video is entertaining for you :blush:


Thanks everyone again for all the input so far… Here’s my Pre Final List!
A lot of parts aren’t avail from PC Part Picker, so I started with something as close as possible, and then I modified it by hand.

The main thing I’m still trying to figure out is the Power Supply.
Does anyone have feedback on the Super Flower 2000W Model?

As for the rest, there’s some changes:

Motherboard: Ditching Asus and going with the ASRock Creator

Cooler: I’m kind of settling on the NZXT Cooler after reading and watching lots of reviews, but if anyone thinks it’s gonna go horribly wrong, pls let me know.

Casing: For the case, I’ve decided on the Fractal Meshify XL as it has the space for mounting 3 cards, as well as capability to handle the radiator, with even extra headroom in case I decide on a push pull config.

Memory: As for the RAM… I might go with crucial if I can get ECC Registered in Local Stock, but this is a placeholder, and it’s available from Amazon.

Storage: Looking at the MSI due to the cache, but I might throw in extra later on.

Complete Parts List:
Modified from the original PCPartPicker Part List

take a look at Wendell’s newest video…

looks like he made it just for you :wink:

pro-tip… use the promo code :wink:


I have built two systems so far with the Asus Dark Hero x570 motherboard and did not have any issues like this. Also did not have any issues with any of the several other Asus motherboards I have used (at this point, iirc 4-5 different B550 and X570 boards). That video you posted really does not mean much.

I also cannot help but wonder if Threadripper is really warranted for this purpose either. I run AI / ML stuff on my Ryzen 5950X + 128GB DDR4 + RTX 3090 + Asus Dark Hero setup.

also, if price really is not an issue, then it seems like you should just buy an Nvidia DGX Workstation;

It’s basically what you are trying to build yourself here, except higher quality all around.

Have you seen:

It looks like they do everything that you are looking for. ie their computers can accommodate 8 triple slot pcie gen4 cards connected to one motherboard.
They have some sample configs on their website, but if you contact their sales department they will do something custom for you.

good for you, that you had a positive experience with the asus x570 dark hero board.
mine was broken. and the asus x570 dark hero boards from countless others as seen in the again BIGGEST ASUS FORUM THREAD EVER in the support hardware category shows this clearly.
if you commented trying to imply, that:
“hey she must have just gotten unlucky.”
then you are clearly wrong.
there is a reason a post on a company forum about hardware support has 132 000 views and 638 comments.
because lots and lots of people are affected.
furthermore the response from asus to NOT tell the community what the actual cause is further shows bad behavior.
and we know, that it is NOT rare either, as lots of people got replacement boards and those boards show up having the issue again after a few months as reported several times in the forum thread.

so again i am happy for you to not have start-up issues, but there is a massive issue and it wasn’t HANDLED AT ALL by asus. well worse than not handled, because they replaced boards to have the same issue again. creating downtime, costs (if they didn’t pay shipping both ways) and work for the customer.

the issue also isn’t limited to just the dark hero btw as got mentioned in the forum thread.
maybe there would be more messages on the forum thread by now, but ASUS CLOSED THE THREAD…
and it wasn’t because the forum was silent or the problem got solved.
the last post was:

Hey guys I am about to RMA my motheroard. As everyone else I have the same startup prolems.

Any of you guys who RMA your motherboard lately received a working one?

Thanks in advance

so asus BLOCKED people from communicating their experience about a major issue on the forum at this point.

and the video by gamersnexus about asus being horrible is full of details and again horrible handling of issues created by asus.

again, if i just posted my personal experience and you countered with having a great experience on 5 boards, SURE whatever. but we are talking about a systemic issue in regards to motherboards at asus and NO solutions at all offered to lots of customers, that are affected.

you know what, let’s have a section of a collection of recent asus motherboard problems, because why not:

examples of other MAJOR asus motherboard problems:

this is for anyone thinking, that my experience with this one product is rare.

the 3 examples given include the biggest asus forum thread with the most views and comments ever (asus x570 dark hero not starting up) and straight up fire risk and thus risk of life, that also didn’t get properly addressed for over 8 months (asus z690 hero):

1 asus x570 dark hero NOT starting up at times:

Asus dark hero startup issue ? - Republic of Gamers Forum - 813987

record asus forum post with most responses and most views. asus did NOTHING to address this problem and was returning boards unchanged, or is exchanging boards and the replacement gets the same issue in a few months at best. asus refuses to share the result of their investigation. (speculation: to avoid a recall) this issue also affects other motherboards btw….

2 asus x670 boards:

https://www.youtube.com/watch?v=cbGfc-JBxlY&list=PLsuVSmND84Qv-dOEYb8uImC6Jc27gZG8x&index=2

sending 1.4 volts at the soc.

burning through chips and motherboards too or degrading the cpu permanently.

putting on WARRANTY VOID disclaimers on beta bioses, that were created to stop further degrading or burning through chips by limiting the soc voltage theoretically to 1.30 volts soc as per amd guidance.

it wasn’t actually fixed at the time with beta bioses as it was still far above 1.30 volts for soc voltage.

so asus set out a statement to void your warranty, when you update a bios to one, that should protect your hardware actually, but at the same time, that beta bios, that had ONE JOB! didn’t even do that!!

lots of manufacturers had too high soc voltages, not just asus, but asus was the highest and asus handled it BY FAR the worst way.

3 asus z690 hero fire and risk of life issue:

https://www.youtube.com/watch?v=p8ktO-WdlyQ

no full proper recall done for over 8 months.

for a FIRE RISK, you do a PROPER recall, because people could die.

asus did NOT do so.

to be clear here, PEOPLE COULD HAVE DIED!, because asus didn’t do a proper recall for this level of issue. that is the level that asus is at!

is the z690 FIRE HAZARD risk, that asus didn’t address properly for 8 months enough for you to get the idea, that asus isn’t the greatest company at least?

Personal ML infra is a really tough space. It’s seriously tough to get any meaningful oomph at home.

here’s a guy who tried to build a multi-3090 rig and settled for a 4xA6000 rig, then decided to put it in a server chassis and move it to a local DC to keep the heat and noise out of his apartment. This post covers a lot of the trials and tribulations of DIY.

If you run into 4090 fitment problems, consider the PNY Verto 4090 because it claims to be exactly three slots. If that’s still too big then there’s the MSI liquid cooled 4090 at two slots, but then it’s water cooled. By the looks of it you won’t run into an issue unless you need a NIC or HBA with that motherboard.

With this setup, one thing I don’t hear is how many circuits/breakers feed the room where the machine will go. If you have a 20A outlet you could get a 2000w PSU and maybe be future-proof for a more built-out system or skip a second PSU.
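As a rough check on the outlet question, here is a sketch of the arithmetic. It assumes a 120 V circuit, the common rule of thumb of keeping continuous loads to 80% of the breaker rating, and 92% PSU efficiency. All three are assumptions, not facts about this build (a 230 V region changes the picture completely), and the function names are made up.

```python
def circuit_continuous_watts(volts: float, amps: float, derate: float = 0.8) -> float:
    """Continuous power a circuit can supply under an 80% rule of thumb."""
    return volts * amps * derate


def wall_draw_watts(psu_output_watts: float, efficiency: float) -> float:
    """Power pulled from the wall for a given PSU output load."""
    return psu_output_watts / efficiency


# Assumed example values: 120 V / 20 A circuit, 2000 W PSU fully loaded at 92%.
budget = circuit_continuous_watts(120.0, 20.0)
draw = wall_draw_watts(2000.0, 0.92)

print(f"continuous budget: {budget:.0f} W")
print(f"wall draw at full PSU load: {draw:.0f} W")
print("fits" if draw <= budget else "over the continuous budget")
```

Under those assumptions a fully loaded 2000 W PSU would sit above the continuous budget of a single 20 A circuit, so it is worth knowing what else shares that breaker.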

Random fun below:

If you want an overview of what it looks like to go beyond 8 GPUs, I found this lambda labs explainer interesting: https://www.youtube.com/watch?v=rfu5FwncZ6s

If you’re a well-paid researcher type, the easiest thing would probably be to get a 4xH100PCIe rig from supermicro or something, with professional water cooling. I think those run 80-100k. Wild!

Another thing I see happening in the last month or so is people talking about model compilation and various efforts like ggml to make it easier to use any or mixed hardware. I can now run a 70b model inference at 4-bit on my laptop at acceptable speed and even offload layers to CPU.
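The reason 4-bit makes this possible is mostly weight size. A minimal sketch of the estimate, counting weights only: it ignores the KV cache, activations and runtime overhead, and real 4-bit formats store a little more than 4 bits per weight for scales. The function name is made up for illustration.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: {weight_gib(70, bits):.1f} GiB")
```

At 16-bit the weights alone are far beyond any single consumer GPU, while at 4-bit they land in the range of a well-specced laptop's unified memory or a CPU plus GPU split.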

IMO the best thing may be to just get something simple that works and just get cracking at the actual ML work if you are doing indie R&D. It’s such a moving target and things are moving so fast it pays to just have something that works in the next few weeks.


Here is a video of that crazy supermicro rig:

look at the size of that water radiator.

Here is the documentation for my genoa build. I just ran geekbench tonight.