ZFS with Zen2: overclock ECC RAM or don't use ECC?

igoodman · April 19, 2020, 1:14pm

Thank you all for your answers.
@Log & @risk - very insightful answers

(btw: risk & log … the very best combination of nicknames for the current topic)

I see that there are quite a few differing opinions on the topic. As all of you seem to have found your own individual way, I conclude that ZFS may be quite resilient to RAM errors and that the question is less important than I initially assumed.

So I will get the ECC RAM and try to overclock it. If it’s stable, I’ll keep it; otherwise I’ll exchange it for some non-ECC RAM that is clocked higher and learn to stop worrying.

And of course I will post my experience here.

Yet if someone has practical experience with overclocking this particular RAM stick, please share it.

What I want to do at home is fit pharmacokinetic models to the data on a voxel-by-voxel basis, with some Bayesian regularizers for sub-regions. All of that is CPU bound at the moment.

(I also do hardcore image reconstruction from the k-space (= Fourier space) raw data, which involves massive GPU-based calculations, but that will have to wait until the university is open again, as I will not try it on any real-life-sized batches at home: I would need my own power plant for that. But … I am a nuclear physicist by training … so maybe at some point I will get my own reactor running in the basement.)

This keeps me wondering whether it wouldn’t be a good idea to use ECC for the data integrity of my calculations anyhow. At first I dismissed that, because I have a lot of noise anyhow, and with millions of voxels, who cares. But if the error strikes at a particularly bad time during the regularisation, it may be a problem. (A dataset of a single subject takes me about 4 days to fit.)

It’s not so much the stability I am worried about (heck, it’s scientific code, it will crash at some point; after all, it’s partly written by me). It’s those subtle, hard-to-see errors with little consequence for the general system.

So I would be curious whether any of you knows the real-life (soft error?!) failure rate of DDR4 RAM (not in theory, but from your own experience) that has such very subtle effects on data (and hurts a scientist much more than a crash).

A crash now and then I can live with (good point by @risk), but subtle corruption on a large scale would be bad.

risk · April 19, 2020, 4:08pm

So, where I work we have lots of machines (without going into numbers: lots and lots). I’ve been meaning to run some tests on a batch of a few hundred EPYC Rome servers running CPUs that AMD made specifically for us; unfortunately that’s a very small batch, so I’m having some issues…

On a sample of about 20k Intel Skylake and Cascade Lake machines where we’re mostly memory latency/bandwidth bound (using about a quarter of theoretical bandwidth on average, all vector workloads), at 60-80 °C and about 768 GB of RAM per machine, we see ECC correcting about an error per week on average. Note that this RAM is not A-grade silicon, and it runs relatively hot. Dropping the temperature down to 50 °C yields 100x fewer errors, but we’d have to cut cluster utilization by more than half (our air intake is about 30 °C).

Don’t get me wrong, ECC is great, but if you keep it at 50 °C or lower, and once a year you need to throw away 4 days of work but get 5+% of performance… sounds OK… Also, it’s unlikely you’ll be able to saturate it as much with a 3700X, I’m thinking.
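A rough way to put numbers on this trade-off, as a sketch only: the snippet below assumes detected failures arrive as a Poisson process at the one-per-year rate mentioned above, that each failure costs one full 4-day fit, and that the machine is busy all year. The function names are made up for illustration, and silent corruption that goes unnoticed is not modelled at all.

```python
import math


def p_error_during_run(errors_per_year: float, run_days: float) -> float:
    """Chance of at least one error during a run, assuming Poisson arrivals."""
    expected_errors = errors_per_year * run_days / 365.0
    return 1.0 - math.exp(-expected_errors)


def days_lost_per_year(errors_per_year: float, run_days: float) -> float:
    """Pessimistic loss: every error forces one whole run to be repeated."""
    return errors_per_year * run_days


def days_saved_per_year(speedup: float, busy_days: float = 365.0) -> float:
    """Wall-clock days saved if the same work runs (1 + speedup) times faster."""
    return busy_days * (1.0 - 1.0 / (1.0 + speedup))


if __name__ == "__main__":
    # Assumed inputs from the post above: 1 error/year, 4-day fit, 5 % gain.
    print(f"P(error during one fit): {p_error_during_run(1.0, 4.0):.3%}")
    print(f"Days lost per year:      {days_lost_per_year(1.0, 4.0):.1f}")
    print(f"Days saved per year:     {days_saved_per_year(0.05):.1f}")
```

Under these assumptions the 5 % speed-up saves roughly 17 days a year against 4 days lost. The picture changes if an error corrupts results without being noticed, since that cost is not a simple rerun.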

bitcore · April 24, 2020, 1:41pm

I’ll echo risk’s comments.
We have >50TB of RAM in our cluster as well, and with our particular high-density chassis, keeping inlet temperature down and increasing the air flow rate through the chassis results in a dramatic difference in ECC correctable-error rates. These Gigabyte chassis need tighter fan curves, as they are effectively blowing air as hot as a hairdryer over the DIMMs from the amount of heat these Naples EPYC CPUs chuck out. We were able to reduce temperature without reducing throughput. (HACS + rack sealing + negative pressure delta + setpoint)

Before we brought temperatures down even further (we were already WELL within specs), we were seeing a node with correctable-error rates high enough to become an annoyance that we took action against about once every week, to the point where it became almost routine. If we didn’t have ECC, about 15% of our total quantity of cluster nodes would have caused corrupted data or crashes.

Keep your memory COOL! And actually verify whether any fancy heatspreaders on your DIMMs are A) actually contacting all of the chips’ entire surface and B) actually using a THERMAL pad rather than double-sided tape. Some commercial “consumer” DIMM heatspreaders actually do more harm than good.

Speaking of the OP’s system: considering you have a regular Ryzen CPU, you must get unbuffered ECC; you CANNOT use registered DIMMs. This is also true for past and current generations of Threadripper.
I would say that if your data is important to you and worth the cost: Go ECC.

I’ve had a personal order placed since January for QTY 4 of the only 32GB ECC UDIMM that I’m aware of, Samsung’s M391A4G43MB1-CTD (2666), and it’s been on back-order ever since. My most recent shipping update is now for August. Samsung has two faster bins, -CVF at 2933 and -CWE at 3200, but those are even more difficult to locate. I think all of their ECC UDIMMs are sample-only and they simply have not made another run of them. I’ll also say that I remember Buildzoid having great success overclocking the type of memory chip that these DIMMs have (apparently it’s M-die, going from memory), so you quite possibly could overclock it if you win the silicon lottery. Chances go UP the fewer DIMMs you have.

Would you kindly share what supplier you have found that has these in stock?

igoodman · April 26, 2020, 3:16pm

Thank you again for your insight @risk

Also to you, @bitcore! Coming back to your question about where I purchased the 32GB ECC UDIMM: I am in Austria, and here we mostly use the largest European price checker, geizhals.at (“Geizhals” means “scrooge” in German).
Here is the direct link to the RAM: https://geizhals.at/samsung-dimm-32gb-m391a4g43mb1-ctd-a2057207.html?hloc=at (I only buy from shops with a 4.5+ user rating; you can do the math where I got it.) My RAM will arrive on Tuesday; I’ll keep you posted.

So I decided to go with the ECC RAM and try to overclock it. After all, I am an experimental physicist, so I simply have to do it, especially considering the price difference is so small (10 EUR) compared to the non-ECC RAM.

You were mentioning RAM coolers. I will use the RAM on an ASUS WS-x570 ACE with a Noctua NH-D15 in a big tower (Nanoxia Deep Silence 6).

Do you think RAM coolers would bring me any benefit under the large tower cooler?

As a physicist I always thought of RAM coolers more as an unnecessary RGB gadget -- but could they be useful here?

Any recommendations?

(p.s.: I am thinking about inverting the flow direction of the top fans in the case, meaning having them blow filtered air in directly onto the RAM rather than sucking it out)

Pleytos · April 26, 2020, 5:44pm

There is also DDR4 3200MT/s ECC unbuffered from both Micron and Samsung. Only Micron seems to be directly available.

https://nl.mouser.com/ProductDetail/Micron/MTA18ADF2G72AZ-3G2E1?qs=sGAEpiMZZMv0kptKhOOd1HbSv2VZyx4%2Bk4YgXtp%2F8RMk%2BktPUOWYug%3D%3D

https://www.samsung.com/semiconductor/dram/module

igoodman · April 26, 2020, 6:34pm

@Pleytos
But I think those are only 16GB … or do you explicitly know of other 32GB dual-rank UDIMMs from Micron?

GigaBusterEXE · April 27, 2020, 6:36am

They could help with EM shielding. I can’t remember who it was (might have been Wendell or Jay), but he had a 12 V wire too close to the RAM module and it was causing errors. Maybe a block of metal in the way could help. I don’t know, I’m not a scientist; this is just speculation.

Pleytos · April 29, 2020, 5:53pm

They exist. I am also trying to hunt down some 3200 UB ECC memory, it ain’t easy. I am going for 8GB or 16GB.

https://www.connection.com/product/crucial-ddr4-udimm-std-32gb-1rx4-3200/mta18asf4g72az-3g2b1/38137686

https://www.digikey.be/products/nl/memory-cards-modules/memory-modules/505?k=&pkeyword=&sv=0&pv16=165233&sf=0&FV=-8|505%2C142|107954%2C142|188347%2C142|294117%2C143|172959%2C143|187288%2C149|337832&quantity=&ColumnSort=0&page=1&pageSize=25

https://www.digikey.be/product-detail/nl/micron-technology-inc/MTA18ADF4G72AZ-3G2B3/557-MTA18ADF4G72AZ-3G2B3-ND/11591641

https://www.digikey.be/product-detail/nl/micron-technology-inc/MTA18ASF4G72AZ-3G2B1/557-MTA18ASF4G72AZ-3G2B1-ND/11591653

igoodman · April 30, 2020, 11:59am

I’ve received the memory: 2x 32GB ECC UDIMM DDR4-2666MHz PC4-21300E 19-19-19 dual rank x8 M391A4G43MB1-CTD (for only 5% more than non-ECC DDR4-3600 would have been).

It works fine with stock speed.
3600MHz RAM clock did boot but was not stable (maybe one can tweak the timing and/or voltage settings later on).

I’ve now set the RAM speed to 3200MHz in the BIOS while leaving everything else on Auto, and it’s stable so far. memtest86 runs without errors, and edac-util reports none after boot either, but re-running edac-util after a memtester run is still pending.

Do you have additional advice on how to get the best RAM timing settings?

Does anyone know if there is a temperature sensor on the RAM that one could use for monitoring RAM temperature?
If so: how does one do that on Linux?

Thanks a lot!

ps: Tip for using it with the ASUS WS x570 ACE Pro: Make sure that you have enabled ECC in the BIOS (the Auto setting seems not to detect it!)

thro · April 30, 2020, 12:21pm

Zen2? I’d not bother with overclocking RAM, as you’re unlikely to see a huge difference in performance anyway due to zen2 having massive caches.

I’d run ECC at rated speed and be done with it.

Or non-ECC at rated speed, and again… be done with it.

Unless you like creating un-necessary problems for yourself.

“complex calculations” don’t tend to be massively memory bandwidth sensitive, so before assuming you need to overclock the living shit out of everything, at least benchmark/confirm first.

igoodman · April 30, 2020, 12:26pm

I respect your opinion @thro

thank you for your feedback.

Notwithstanding, I would be grateful for any feedback on my last questions.

I see it as a small research project for fun and will share all my results here.

I will do extensive benchmarks with my specific application tomorrow. A quick check showed about a 7% performance difference between 2666MHz and 3600MHz RAM speed. Not massive, but still noticeable.

Log · April 30, 2020, 5:14pm

Some ram does have temperature sensors. I could swear my ecc ram has them, but it’s been 2 years since I’ve checked.

Checking it likely involves lm_sensors. Good luck with that.

I highly recommend looking at front ends first.
https://wiki.archlinux.org/index.php/Lm_sensors
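
For reference, a minimal sketch of reading DIMM temperatures straight from the kernel’s hwmon sysfs tree, which is the same data lm_sensors reports. It assumes the modules carry a JEDEC JC-42.4 thermal sensor and that the `jc42` kernel driver has bound to it; neither is confirmed for the Samsung modules in this thread. The function name and the label format are made up for illustration.

```python
from pathlib import Path


def read_dimm_temps(hwmon_root="/sys/class/hwmon", driver_names=("jc42",)):
    """Return {"hwmonN/tempM_input": degrees Celsius} for matching hwmon devices.

    Assumption: DIMM sensors show up under the 'jc42' driver name.
    """
    temps = {}
    root = Path(hwmon_root)
    if not root.is_dir():
        return temps  # no hwmon tree here (not Linux, or sysfs not mounted)
    for dev in sorted(root.iterdir()):
        try:
            name = (dev / "name").read_text().strip()
        except OSError:
            continue
        if name not in driver_names:
            continue
        for temp_file in sorted(dev.glob("temp*_input")):
            try:
                millidegrees = int(temp_file.read_text().strip())
            except (OSError, ValueError):
                continue
            # hwmon reports temperatures in millidegrees Celsius
            temps[f"{dev.name}/{temp_file.name}"] = millidegrees / 1000.0
    return temps


if __name__ == "__main__":
    found = read_dimm_temps()
    if not found:
        print("No jc42 sensors found (no sensor on the DIMMs, or driver not loaded).")
    for label, celsius in found.items():
        print(f"{label}: {celsius:.1f} °C")
```

An empty result does not prove the DIMMs lack a sensor; the driver may simply not be loaded, or the SMBus may not be reachable on the board.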

marelooke · April 30, 2020, 7:01pm

Or through the BMC, if the motherboard has that feature.

Pleytos · May 2, 2020, 11:46am

My Crucial DDR4 2400 MT/s from 2017 has temperature sensors. HWMonitor can read them; I think I saw the temperature on Linux once too.

bitcore · May 26, 2020, 3:14pm

My order finally arrived last week.

Just like @igoodman, I was able to overclock the 4x sticks of M391A4G43MB1-CTD to 3200 @ stock voltages and CL19. Mine were also not stable at 3600, even if I loosened the timings a bit. Benchmarks didn’t really seem to suggest much of an improvement in performance for it to be worth it trying to get 3600 stable. Sure, it was faster, but not “Oh goodness this is much better” faster. I preferred not to be too greedy, and I really didn’t want to overvolt (I want a lower power, low heat server). I already had an “overclock” applied.

I did try the heat-gun trick during a stress test to try to trigger some ECC errors and see how stable they were. I got one bank of DIMMs (2 of them) to around 80 °C before I chickened out and aborted the test for fear of damaging them or other nearby motherboard components. No ECC errors were apparent, so these sticks appear to be rather strong units. Sorry, I didn’t get any pics with the thermal camera; I should have.

igoodman · May 27, 2020, 12:28pm

@bitcore Very good idea!

I would be very happy if you could share the test protocol for provoking and reading out the ECC errors.

Running edac-util -v yields

# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
edac-util: No errors to report.

That may imply that it’s working in the background (as anecdotally mentioned here (link)). Yet I do not see how many errors were corrected.

Has any of you managed to get that info with Zen 2?
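
One way to look at the raw numbers behind that output, as a sketch: the kernel’s EDAC subsystem exposes per-memory-controller counters as plain files under sysfs, and the snippet below simply reads them. It assumes an EDAC driver is loaded so that `mc0`, `mc1`, … directories exist; the function names are hypothetical. Taking a snapshot before and after a stress test shows whether any errors were counted in between.

```python
from pathlib import Path

COUNTERS = ("ce_count", "ue_count", "ce_noinfo_count", "ue_noinfo_count")


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {"mc0": {"ce_count": int or None, ...}, ...} from EDAC sysfs."""
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts  # EDAC not available on this system
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key in COUNTERS:
            try:
                entry[key] = int((mc / key).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None  # counter file missing or unreadable
        counts[mc.name] = entry
    return counts


def new_errors(before, after):
    """Per-controller increase of each counter between two snapshots."""
    delta = {}
    for mc, entry in after.items():
        for key, value in entry.items():
            old = before.get(mc, {}).get(key)
            if value is not None and old is not None and value > old:
                delta.setdefault(mc, {})[key] = value - old
    return delta


if __name__ == "__main__":
    snapshot = read_edac_counts()
    if not snapshot:
        print("No EDAC memory controllers found; zero counts would mean nothing here.")
    for mc, entry in snapshot.items():
        print(mc, entry)
```

If no memory controller directory shows up at all, a report of zero errors says nothing about whether ECC is active.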

Coming back specifically to the M391A4G43MB1-CTD modules:

I’ve noticed some weird random errors that seem to be loosely memory related (but no ECC errors caught) even at stock 2666MHz speed after 100% ok memtest.
I ordered a replacement RAM and will test the new modules.

kwinz · February 10, 2021, 5:58pm

Just FYI I think edac-utils has been deprecated in favor of rasdaemon.

$ ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No MCE errors.

$ systemctl status rasdaemon
● rasdaemon.service - RAS daemon to log the RAS events
Loaded: loaded (/lib/systemd/system/rasdaemon.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2020-08-05 02:26:12 CEST; 6 months 7 days ago
[…]

Log · February 10, 2021, 8:26pm

Interesting, this is first I’ve heard of rasdaemon, thanks for bringing it up.

igoodman · March 3, 2021, 7:58pm

rasdaemon fails to start on Ubuntu 20.10

systemctl status rasdaemon
● rasdaemon.service - RAS daemon to log the RAS events
Loaded: loaded (/lib/systemd/system/rasdaemon.service; enabled; vendor preset: enabled)
Active: failed (Result: resources)

Mär 03 18:12:15 ryzen-pro systemd[1]: rasdaemon.service: Failed to load environment files: No such file or directory
Mär 03 18:12:15 ryzen-pro systemd[1]: rasdaemon.service: Failed to run ‘start’ task: No such file or directory
Mär 03 18:12:15 ryzen-pro systemd[1]: rasdaemon.service: Failed with result ‘resources’.
Mär 03 18:12:15 ryzen-pro systemd[1]: Failed to start RAS daemon to log the RAS events.

Running apt shows that the package is up to date.

Do you have any idea what could be behind this? @kwinz

kwinz · March 9, 2021, 8:03am

There is a Debian bugreport that could fit this issue:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=966717
and upstream here:

with fix:

Maybe this fix wasn’t ported to Ubuntu 20.10 yet? (I am still on Ubuntu 18.04)
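
To check whether the “Failed to load environment files” message comes from a file the unit expects but that is not on disk, here is a small sketch. It assumes the unit lives at the path shown in the status output above and only handles plain `EnvironmentFile=` lines (no wildcards); the function name is made up. In systemd, a leading `-` marks the file as optional, so only entries without it can cause this failure.

```python
from pathlib import Path


def missing_environment_files(unit_text, exists=lambda p: Path(p).exists()):
    """List required EnvironmentFile= paths from a unit file that do not exist."""
    missing = []
    for raw_line in unit_text.splitlines():
        line = raw_line.strip()
        if not line.startswith("EnvironmentFile="):
            continue
        value = line.split("=", 1)[1].strip()
        if not value or value.startswith("-"):
            continue  # empty value resets the list; '-' prefix means optional
        if not exists(value):
            missing.append(value)
    return missing


if __name__ == "__main__":
    # Path taken from the 'Loaded:' line of the status output in this thread.
    unit = Path("/lib/systemd/system/rasdaemon.service")
    if unit.is_file():
        for path in missing_environment_files(unit.read_text()):
            print(f"required environment file is missing: {path}")
    else:
        print(f"{unit} not found on this system")
```

If this prints a path, that would be consistent with the error in the journal, but it does not confirm that the Debian bug above is the cause.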