Fun Intel Anecdote, Qs, AMD ECC mobos?

Hi Everybody,

In spite of watching a bunch of videos on the topic from Steve, Wendell, and Buildzoid, my suspicions have been insufficiently honed by paranoia. I only assembled this thing about a year ago. It’s got the 0x129 microcode. It’s not like I’m running Minecraft servers…

I’d been trying to compile QEMU the past few weeks, here on my i9-13900K computer system. It would keep segfaulting on this line in a configure script, some ninja meson setup etc. No matter what I did, it would crash on that line. I blamed myself. After all, when is PEBKAC ever the wrong answer? I tried reboots, different versions of the Linux kernel, different versions of QEMU, different versions of Meson… It’s some other guy’s build script, confirmed working. Everything else seems fine. How badly could I have broken my Linux installation?

Wait. Intel?

I cobbled together a Bash script so I could watch requested voltages while working. That was inconclusive. They’re mostly under 1.4 volts. Then, when I was trying to build virt-manager, it happened to fail on the analogous Meson line in a similar-enough configure script to the failing QEMU script. On a hunch, I changed the CPU scheduling governor to powersave. Then the build failed in a different spot. I changed it back to performance, its original value, and the build succeeded.
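For anyone wanting to reproduce that kind of monitoring, here is a minimal Python sketch. It is not the Bash script mentioned above, and it rests on assumptions: that the CPU reports its current core voltage in bits 47:32 of the IA32_PERF_STATUS MSR (0x198) in units of 1/8192 V, which is widely relied on for Intel Core parts but not something this thread confirms for the 13900K, and that the `msr` kernel module is loaded and the script runs as root. The function names are hypothetical.

```python
import glob
import os
import struct

# Assumption: bits 47:32 of this MSR hold the core voltage on many Intel Core CPUs.
IA32_PERF_STATUS = 0x198


def decode_core_voltage(msr_value: int) -> float:
    """Convert a raw IA32_PERF_STATUS value to volts (field is in 1/8192 V steps)."""
    return ((msr_value >> 32) & 0xFFFF) / 8192.0


def read_msr(cpu: int, register: int = IA32_PERF_STATUS) -> int:
    """Read one 64-bit MSR through /dev/cpu/N/msr (needs root and `modprobe msr`)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, register)  # the file offset selects the register
    finally:
        os.close(fd)
    return struct.unpack("<Q", raw)[0]


def read_governors(sysfs_root: str = "/sys/devices/system/cpu") -> dict:
    """Map each cpuN directory name to its current cpufreq scaling governor."""
    governors = {}
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    for path in sorted(glob.glob(pattern)):
        cpu_name = path.split(os.sep)[-3]  # .../cpuN/cpufreq/scaling_governor
        with open(path) as f:
            governors[cpu_name] = f.read().strip()
    return governors
```

Calling `decode_core_voltage(read_msr(0))` in a loop next to `read_governors()` would log voltage alongside the active governor, which is the pairing that mattered in the powersave/performance experiment.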

Intel!

The same trick didn’t work with QEMU though. I decided to run y-cruncher overnight, trying to catch my processor red-handeder. I saw that the voltages were lower than in fewer-core workloads, makes sense with probably some overall power or thermal limit, went to bed. About 10 hours later, I checked back in, and saw that the voltages had dropped quite a bit, down from 1.2-1.3V requested to like 1.17V. That’s weird. On a hunch, I tried to compile QEMU again, and this time, out of nowhere, it succeeded. It had been failing consistently for weeks.

Somehow y-cruncher healed the build.

Even if it is ultimately my fault, bad CPU installation a year ago, in the immortal words of Buildzoid, slightly edited,

If it was me, I’d sell the replacement CPU. And the motherboard.
And just like swear off Intel CPUs for like a few years.
You screw me over by selling me defective trash.
I’m not buying stuff from you anymore.

Therefore, I’m looking for CPU and especially motherboard recommendations. My hard requirements are:

  • AMD CPU
  • ECC RAM officially supported
  • Friendly to my 3 slot tall AMD 7900 XTX GPU
  • 8+ SATA ports (or free PCIe slots if needs be)

My currently owned non-Intel parts are:

  • Case: Fractal Design Define 7 XL
  • Power supply: be quiet! Dark Power 13 1000W ATX 3.0
  • RAM: 2x 32 GB DDR5 ECC UDIMM – Micron MTC20C2085S1EC48BA1R
  • Drives: 1 M.2 NVMe drive, 6 SATA HDDs (raidz2), 1 SATA optical drive
  • GPU: AMD 7900 XTX (3 slots tall)
  • CPU cooler: Noctua NH-D15 (might as well keep it)
  • IIRC 4x 12V PWM fans and 2x DC

I’m eyeballing a 7800X3D, because of course, but I am curious about alternatives, if they exist. Tentatively, as far as AM5 goes, it looks like either of the X670E or X870E chipsets are likely to give me a good time. X870, according to Wikipedia, only supports x8 PCIe gen 4 lanes. Would that bottleneck me? Or would it be completely irrelevant, in spite of my graphics card’s lack of support for PCIe 5? (or – maybe I’m just misreading a mislabeled table)
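For a sense of scale on the x8 question, here is a back-of-the-envelope sketch of raw link bandwidth. It only does the arithmetic from the published per-lane rates (16 GT/s for PCIe 4.0, 32 GT/s for 5.0, both with 128b/130b encoding); it does not verify what the Wikipedia table actually describes, and it ignores protocol overhead. The function name is made up for illustration.

```python
# Per-lane raw rate in GT/s; PCIe 3.0 and later use 128b/130b line encoding.
GT_PER_SEC = {3: 8.0, 4: 16.0, 5: 32.0}


def pcie_bandwidth_gb_s(gen: int, lanes: int) -> float:
    """Approximate one-direction link bandwidth in GB/s, before protocol overhead."""
    return GT_PER_SEC[gen] * (128 / 130) / 8 * lanes


for gen, lanes in [(4, 16), (4, 8), (5, 8)]:
    print(f"PCIe {gen}.0 x{lanes}: {pcie_bandwidth_gb_s(gen, lanes):.2f} GB/s")
```

A PCIe 4.0 card in an x8 link gets roughly half the raw bandwidth of the same card at x16, since it cannot use gen 5 signalling to make up the difference. Whether that halving is noticeable in games is a separate question the arithmetic does not answer.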

Workloads I’m planning on are, playing games, storing / locally hosting files, otherwise generally screwing around with anything that’ll run on it. Probably monkeying around with VMs more over time, whatever tiny AI models will fit, possibly inducing RAM upgrades, SSD upgrades. I keep computers a long time. I’m chasing this dream of having one box for all purposes. You can never cross the same river twice. You can never go home. But maybe you can consolidate on a single computer.

Linux only. Windows in a VM later, maybe, in screaming protest. My long term plan for this machine is to join a death cult so I can persist as an immortal lich over the millennia, long enough to see Microsoft go bankrupt.

Aside, how worried would you be, if you had a deeply convicted religious obsession with data integrity, about the probability of erroneous bits previously committed to disk over the past year?

Please and thank you.

so I think your CPU is possibly degraded. the problems show up when the voltage is lower and that’s consistent with your report. I had instability when the processor transitions from higher to lower voltages mainly.
seems like that’s why compile crashes.

compile is a very bursty workload

ecc support is pretty good since AGESA 1.2.0.0 on am5. been using 48gb udimms without issue. steel legend does it, pro art x670/b650 does it, etc.

It struck me as strange that the build so consistently failed on this meson setup command, across different versions of different packages, with me otherwise not noticing any compilation trouble. I felt like the failures should be spread out more. (I guess I did see occasional random compilation failures though, now that I’m thinking about it.)

Would a decent hypothesis be that y-cruncher hit some longer-term power limit that kept the voltage lower, preventing some crashy undervolt condition in what would’ve otherwise been a larger voltage drop?

(edit) Oh hey, it’s you! Thanks for the videos and the wiki in this forum. I don’t know what I’d have been thinking without them.

There are only four AM5 boards with eight SATA ports, ignoring the $1k and Biostar offerings. The X870 motherboards are reported to offer more than just four SATA ports, so you have better odds with them.

As for ECC support, it’s pretty hit-or-miss on AM5. There are people saying the ASUS X670 ProArt board has it, but I suggest you read the ECC threads in this forum. Hopefully the X800 boards will have better ECC support…

ASRock X670E Taichi and Taichi Carrara, Asus Prime B650M-A AX, what’s the fourth one?

Both Taichis get four of their ports from ASM1061s, so they’re not quite full rate, but that’s fine for the drives here. Asus doesn’t seem to say specifically but, as a single Promontory 21 has four SATAs, the B650M-A has to be doing something similar.

I’m not sure ASRock has any AM5 boards that don’t support ECC at this point. The B650M-A does as well.

Since this build needs seven SATA ports, the X670E Pro RS would minimally also work (the seventh is M.2 SATA, so would presumably need a cable adapter).

The Asus ProArt X670E-CREATOR WIFI is a well-proven motherboard with multiple users on the forums, some of them running ECC memory. It’s not the cheapest board, but it also supports two x8 PCIe slots, which might be something you’ll value. It might be worth mentioning that the 10Gbit NIC (Marvell) runs hot in general, so you might need additional cooling for it, but that’s not a motherboard-specific issue.

X3D CPUs are likely not a great option for your use case, the 7900 (non X) is a cool model (65W) and sports 12 cores which sounds like a better priority than cache size.

For SATA expansion you can either use a Broadcom HBA or a simple ASM1166 SATA card (I haven’t tested one yet on AM5 personally). Just make sure you get a PCIe x2 variant and put it in the PCIe x4 slot at the bottom of the motherboard.

The Asus Prime B650M-A AX 2, which is $35 less than the non-2. And yes, the 600 series only supports up to four, so any extra SATA ports are coming from a third-party chip.

Would be news to me, was it hard confirmed anywhere? I’m on an ASRock B650E Riptide myself and anticipating a Zen 6 upgrade, probably with a RAM upgrade. Getting to add ECC would be amazing.

ASRock indicates ECC support on every AM5 board I’ve looked at the memory specs for (mostly midrange and higher, including the B650E PG Riptide, but even the A620M Pro RS). Wendell confirmed Steel Legend up thread a few hours ago and I’ve come across a handful of third party fault injections confirming support on a few other AM4 and AM5 boards.

Since ECC’s apparently an AGESA feature presumably all ASRock has to do is allow it to be enabled. Seems pretty low risk.

Neither MSI nor Gigabyte does, so why are you pushing this assumption?

I see you want to run some VMs, play games, and have at least 6 disks + an nvme on a native ecc motherboard.

games = fast single threaded = AM5 or threadripper or epyc 9004.
epyc 8004 uses the efficiency cores which don’t give fast single threaded performance.

I am using an epyc 9124 as a desktop, with fewer limitations. The motherboard is about the same price as a high end AM5, but the cpu is +$600. Then you get 12 ram channels so can have much higher ram capacity and much faster ram throughput, but the burst cpu speed is 3.7 GHz instead of 5+ GHz. I think that threadripper is the better option.
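To put rough numbers on the throughput gap, here is a small sketch of theoretical peak memory bandwidth. It assumes DDR5-4800 on both platforms and a 64-bit (8-byte) data path per channel; neither speed is stated in this post, so treat the inputs as illustrative. The function name is hypothetical.

```python
def peak_memory_bandwidth_gb_s(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak: channels x bytes per transfer x mega-transfers per second."""
    return channels * bus_bytes * mt_per_s / 1000


# Assumed DDR5-4800 on both; real sustained throughput will be lower.
print(f"2 channels:  {peak_memory_bandwidth_gb_s(2, 4800):.1f} GB/s")
print(f"12 channels: {peak_memory_bandwidth_gb_s(12, 4800):.1f} GB/s")
```

Under those assumptions the 12-channel platform has six times the peak bandwidth of a dual-channel desktop, which is the trade being made against the lower burst clock.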

Supermicro and asrockrack both have epyc 4004 motherboards, but they all only do ecc udimms. ie The stack of ram on the board does ecc on its own chips, then returns the result. When I was working at a small company with a few racks of servers that would cover about 90% of the problems, but I found 2 cases of a bad motherboard in 4 years where memory problems could not be resolved by replacing dimms. rdimms were needed to diagnose these problems.

threadripper uses rdimms, so real ecc.

Is that how ecc udimms work? Are you not confusing on-die ecc (which is a requirement for all ddr5) with ecc udimms, which actually do have an extra chip and 8 extra bits for in-band ecc that is done on the CPU?

The chip is often a stack of 72 to 228 chips in one package, and a compute chip that does ECC transparently to the motherboard. But this means that the motherboard and CPU are still vulnerable to rowhammer and motherboard errors. If you have a corroded pin on your memory dimm, the ECC needs to be calculated on the cpu, not on the dimm to catch it.

Sure. So how would you describe the difference between these 3 dimms?

As far as I understand, you describe on-die ECC. Is the ECC in the second udimm not ‘real’ ECC? It does have 8 extra in-band bits, which get sent to the CPU?

Registered vs unbuffered and ECC vs non-ECC are orthogonal concepts. Unbuffered ECC DIMMs use “real” error correction.

From Wikipedia:

Registered, or buffered, memory is not the same as ECC; the technologies perform different functions. It is usual for memory used in servers to be both registered, to allow many memory modules to be used without electrical problems, and ECC, for data integrity. Memory used in desktop computers is usually neither, for economy. However, unbuffered (not-registered) ECC memory is available, and some non-server motherboards support ECC functionality of such modules when used with a CPU that supports ECC.

Now, DDR5 complicates things because even non-ECC DDR5 DIMMs use on-chip error correction (which is what you describe as “The stack of ram on the board does ecc on its own chips”), and DDR5 ECC UDIMMs apparently “only” support 72-bit wide ECC (EC4) (same as DDR4) while DDR5 RDIMMs can and often do support 80-bit wide ECC (EC8). Apparently, EC4 is enough to provide SECDED (as proven by DDR4 as well).
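The claim that a 72-bit word is enough for SECDED can be checked against the Hamming bound. The sketch below computes the minimum number of check bits for an extended Hamming code over a given number of data bits; it is a generic coding-theory calculation, not a description of what any particular memory controller implements, and the function name is made up for illustration.

```python
def secded_check_bits(data_bits: int) -> int:
    """Minimum check bits for an extended Hamming SECDED code over data_bits."""
    r = 1
    # Single-error correction needs 2**r >= data_bits + r + 1 (Hamming bound).
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One extra overall parity bit adds double-error detection.
    return r + 1


k = 64
c = secded_check_bits(k)
print(f"{k} data bits -> {c} check bits ({k + c}-bit word, {100 * c / k:.1f}% overhead)")
```

For 64 data bits this gives 8 check bits, so a 64+8 = 72-bit word is the smallest that supports SECDED at that width.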

I’m not sure where this misconception that buffering implies ECC and vice versa comes from. Perhaps simply because server memory is mostly both, and desktop memory is mostly neither?

Since on die ECC performs ecc on many more layers than just 4+1, more like 216+12, they can have more errors and still consistently return a valid result, with less expense of extra hardware, ie 5% vs 25%.

I specifically stated that on chip ecc covers most use cases, and allows vendors to use cheaper parts that throw errors, but let the on chip ecc handle the correction.

it doesn’t cover problems on the dimm slot, on the motherboard between the slot and the cpu socket, or on the cpu socket.

rdimms also cover problems on the dimm slot, on the cpu socket, and on the motherboard.

Also on chip ecc does not notify the motherboard about uncorrectable errors.

I have fixed 2 ddr4 ecc dimms by using a contact cleaner on the dimm contacts, about 8 years apart. During post these machines showed errors and needed manual intervention to proceed. On one of those machines I also reseated all of the connectors and replaced the thermal paste as it was overheating with a cool heatsink. On the other there was a pool of water on the top and water was running down the side. Part of my repair on that one was installing a shelf and placing the server above the leaking device.

It will always be a dream. Just like the dream of owning a car that is great on the track, off-road and can carry 7 people. In reality, you can either pick gaming performance OR ECC RAM, not both.

I think we have to be very precise with our words when it comes to ECC.
ECC Support == it does run
ECC Support != it makes actual use of ECC

Lots of boards can run ECC memory but will not make use of ECC.

Official support (and that is what OP asked for!) is, for desktop CPUs, only on PRO CPUs.
I also highly mistrust the mobo vendors to not screw this up on their gaming line.

:face_with_raised_eyebrow:


From 2016:

Basically, AMD (unlike Intel) does not artificially limit ECC support to server CPUs. AFAIK the mainboard does not have too much to do with supporting it.

you should watch the video I did recently. things actually work with rasdaemon now with new AGESA and many bioses surface a new ecc on/off setting. so all the plumbing is there.

when I say supported I don’t mean it posts I mean it works.
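One way to check the "it works" part on a running Linux box is to look at the kernel's EDAC counters. The sketch below reads them straight from sysfs instead of going through rasdaemon, so it is a swapped-in, simpler approach and not what the video shows. It assumes the platform's EDAC driver is loaded and exposes `ce_count` and `ue_count` under `/sys/devices/system/edac/mc/mcN`; the function name is hypothetical.

```python
import glob
import os


def read_edac_counts(edac_root: str = "/sys/devices/system/edac/mc") -> dict:
    """Return {controller: (corrected_errors, uncorrected_errors)} from EDAC sysfs."""
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(edac_root, "mc[0-9]*"))):
        values = []
        for name in ("ce_count", "ue_count"):
            with open(os.path.join(mc_dir, name)) as f:
                values.append(int(f.read().strip()))
        counts[os.path.basename(mc_dir)] = tuple(values)
    return counts
```

An empty result means no memory controller has registered with EDAC, which suggests ECC reporting is not active (or the driver is not loaded). A controller showing zero counts only says no errors have been logged yet, not that correction has been proven.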

More the opposite as most non-Pro, desktop(ish) CPUs support ECC, for the honest meaning of support inclusive of checking and correcting (rather than the proposed contrived marketing redefinition).

Zen 3, 4, and 5 parts                    ECC
non-Pro Ryzen APUs                       not supported
all other Ryzens (including Pro APUs)    supported
Threadripper                             supported
Threadripper Pro                         supported

(Not sure about ECC in Zen, Zen+, and Zen 2 offhand.)

I think you’re right. That looks like a very good option (Interestingly, on Amazon, the non-X 7900 is priced $30 higher than the X SKU). Though, some of the processors that MikeGrok’s alluding to… could have, more… uh. I may have a disease. I got a little more research to do.

If that’s an RDIMM, what’s a DDR5 “full” / proper / whatever ECC UDIMM, as distinct from the DDR5 on-die ECC UDIMMs that are the common variety?

I was trying to read https://www.memsys.io/wp-content/uploads/2020/10/p337-criss.pdf to get a sense of what kinds of techniques were likely being used, but I don’t really know much about data correction in general and I’d misplaced my RAM glossary, so I couldn’t get through the paper with high enough comprehension.

Haven’t you ever seen https://www.youtube.com/watch?v=nmt1wTciQII :wink:

Data integrity is far and away my priority. As an idiot desktop hobo, I can afford to be selectively less choosy. It’s not like the games won’t run, especially at 2560x1440.