I have the non-10G version of the MB,
but I did recently get an Aquantia 10G NIC.
The 10Gb MB version wasn't available when I ordered my x470d4u.
ZFS doesn’t care or do anything about hardware errors reported to the system/OS, as far as I know.
It checks the correctness of the data (all blocks are checksummed) every time it reads it from disk. That covers far more than just RAM errors, and it is what you need if a cable, disk, or disk controller goes bad.
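To illustrate the checksum-on-read idea described above, here is a minimal conceptual sketch. It is not how ZFS is implemented: the class `ChecksummedStore`, its methods, and the choice of SHA-256 are assumptions made purely for illustration.

```python
import hashlib


class ChecksummedStore:
    """Toy block store (hypothetical) that keeps a checksum next to every block."""

    def __init__(self):
        # block id -> (data, checksum recorded at write time)
        self._blocks = {}

    def write(self, block_id, data: bytes) -> None:
        # Record the checksum when the block is written.
        self._blocks[block_id] = (data, hashlib.sha256(data).digest())

    def read(self, block_id) -> bytes:
        # Recompute the checksum on every read and compare it with the stored one.
        data, expected = self._blocks[block_id]
        if hashlib.sha256(data).digest() != expected:
            raise IOError(f"checksum mismatch in block {block_id}")
        return data

    def corrupt(self, block_id) -> None:
        # Simulate a bad cable, disk, or controller: flip one bit of the
        # stored data (assumes a non-empty block) without touching the checksum.
        data, checksum = self._blocks[block_id]
        flipped = bytes([data[0] ^ 0x01]) + data[1:]
        self._blocks[block_id] = (flipped, checksum)
```

In this sketch the corruption is noticed at read time no matter which component caused it, which is the point made above. Note that it does not protect data that was already wrong in RAM before the checksum was computed.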
I’m curious and unfamiliar about that. How do you have that set up?
Mh, interesting… I always kinda assumed it did, because what would be the point otherwise? As far as I know, an uncorrectable error isn’t necessarily a reason for the kernel to halt, right? So you might still end up with broken data? Doesn’t really make sense to me.
@wendell probably knows?
Posting this here too:
Hi all,
I’ve recently bought the X470D4U2-2T for a FreeNAS build.
I’ve posted my first experiences with this board at the FreeNAS forum (search for “FreeNAS build with 10gbe and ryzen” on ixsystems.com).
But since this forum seems very active regarding this board and since many people are struggling, I thought it could be useful here as well:
For those having trouble with a similar build, here is how I “kinda” did it…
Please keep in mind that I didn’t log everything I did, so this is from memory and perhaps I’m forgetting something…
So, this is my first build with IPMI. I didn’t install a graphics card, but I was hoping it would work with just a network connection to the IPMI (and it did).
Also I found this reddit regarding how to test ECC RAM:
I might try it in the future…
Edit:
I also found the below article, which is even more detailed
After spending quite some time testing, I have a big update on this, but also still some questions / issues with it.
Windows
For the first command: 2 (unknown), 3 (none), 4 (parity), 5 (single-bit ECC), or 6 (multi-bit ECC). So that looks good!
For the second command: Also that looks good (TotalWidth is larger than DataWidth).
Also CPU-z, HWinfo64 and AIDA64 correctly recognize the ECC RAM and AIDA64 also reports that it is enabled.
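The two Windows checks above can be written down as a small helper. This is only a sketch: the function names are hypothetical, the code table contains just the five values listed above, and the widths used in the usage note are typical example values, not output from this system.

```python
# Meaning of the error-correction codes listed above (only these five values).
ECC_TYPES = {
    2: "unknown",
    3: "none",
    4: "parity",
    5: "single-bit ECC",
    6: "multi-bit ECC",
}


def describe_ecc(code: int) -> str:
    """Translate the value from the first command into a readable label."""
    return ECC_TYPES.get(code, f"not listed here ({code})")


def has_ecc_width(total_width: int, data_width: int) -> bool:
    """ECC modules carry extra bits, so TotalWidth is larger than DataWidth."""
    return total_width > data_width
```

For example, `describe_ecc(6)` gives "multi-bit ECC", and `has_ecc_width(72, 64)` is true for a typical ECC module with 64 data bits plus 8 check bits.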
Linux
But then the actual testing
For this I’ve overclocked the memory from 1333 MHz to 1500 MHz, keeping all other timings the same. At 1533 MHz or 1567 MHz the mobo no longer posts and requires a clear CMOS to recover.
These are my default settings (bottom right is the memory):
I suspect that ECC actually does work and corrects many errors, but it doesn’t report anything to any OS? (because just slightly increasing the frequency causes it to not post at all anymore).
In the IPMI I’m also not finding any errors being reported:
But also that failed. Even after disabling ECC, I get no error in Linux, Windows (didn’t check the IPMI in this scenario yet) and no crashes either.
I’ve used below BIOS settings for trying this (not sure if this is correct / sufficient though).
These settings are default (but show the BIOS maze I went through to get there):
I’ve tried to change ‘DRAM ECC Enable’ to ‘Disabled’ and after that also ‘DRAM UECC Retry’ to ‘Disabled’:
Can someone help me figure out how to fix my ECC error reporting to the OS (and/or the IPMI) or explain what I’m doing wrong?
Thanks!
Edit: issues with pictures solved (user error)
Maybe you can see them in preview because they are cached?
But I think their forum wants a login to view the images. Not sure 100% tbh.
But if I click one, like,
https://hardwarecanucks.com/forum/attachments/1571955813388-png.27275/
I get some error message about log in.
You need to be a logged-in user (this prevents them from paying for the bandwidth of others using the images). If you copy and paste, you can upload the images to the L1 forum.
Ah lol ok, now I understand what you mean…
I was thinking “but I am logged in” (on Level1Techs I mean)… But you meant “logged in on Hardware Canucks”, where the pictures are apparently still hosted. I (wrongly) assumed that when copying, the pictures were copied and hosted on Level1Techs automatically as well.
I have edited my post now with working pictures
Hi Wendell (and Steve?),
Awesome that you’re working with Steve on practically exactly the same hardware project as I am (this board with a Ryzen 3600)!!
I suppose that for ZFS on UnRAID, ECC is just as important as on FreeNAS? Will you have a look at the ECC aspect in this video series?
I started looking into this ECC aspect myself because a FreeNAS guru (jgreco) doubted that it was actually fully functional and tested (mainly whether errors are properly reported to the BMC or OS seems to be a question mark):
Would be nice if someone can confirm / properly test this! (or even if your contact at ASRock could confirm that this really really works?).
You seem to be the perfect team for figuring this out: you seem to know something about servers / ECC, and Steve has probably messed with memory timings / frequency at some point in his life.
Thanks!
edit:
Feel free to use my “ground-work” in your video series.
We’re running the “normal” non-10G version of this board and we’ve seen the ce_count increase after running the board for two months. Currently the counter is back at 0 because we restarted it a few days ago (I will let you know when it has increased). However, the IPMI isn’t logging these errors. Maybe it only shows uncorrectable errors?
You can have a look at this path for the current count.
cat /sys/devices/system/edac/mc/mc0/ce_count
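If you want to check all memory controllers at once instead of a single path, something like the following sketch could work. The function name `read_edac_counts` is made up for this example, and it assumes the EDAC driver is loaded and exposes `ce_count` and `ue_count` under each `mc*` directory.

```python
from pathlib import Path

# Standard EDAC sysfs location; the path above is the mc0 entry below this root.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_MC_ROOT) -> dict:
    """Return {controller: {"ce_count": n, "ue_count": n}} for every mc* directory."""
    counts = {}
    if not root.is_dir():
        # EDAC driver not loaded (or not a Linux system): nothing to report.
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for controller, entry in read_edac_counts().items():
        print(controller, entry)
```

Running this from cron and comparing against the previous value would be one way to notice corrected errors even though the IPMI log stays empty.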
I was kind of shocked that in the Gamers Nexus YouTube video they showed using “Gamer” memory without ECC and planning to set up ZFS. It sent a shiver down my spine, and I would have expected Wendell to protest that configuration later - maybe that comes in part 2.
I’ve run ZFS for a few years without ECC memory. It’s fine.
I guess this is the same info as what edac-util displays:
[root@localhost ~]# cat /sys/devices/system/edac/mc/mc0/ce_count
0
[root@localhost ~]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
[root@localhost ~]# ls -la /sys/devices/system/edac/mc/mc0/
total 0
drwxr-xr-x. 9 root root    0 Nov  1 02:30 .
drwxr-xr-x. 4 root root    0 Oct 31 21:41 ..
-r--r--r--. 1 root root 4096 Nov  1 02:30 ce_count
-r--r--r--. 1 root root 4096 Nov  1 02:30 ce_noinfo_count
drwxr-xr-x. 3 root root    0 Nov  1 02:30 csrow2
drwxr-xr-x. 3 root root    0 Nov  1 02:30 csrow3
-r--r--r--. 1 root root 4096 Nov  1 02:30 max_location
-r--r--r--. 1 root root 4096 Nov  1 02:30 mc_name
drwxr-xr-x. 2 root root    0 Nov  1 02:30 power
drwxr-xr-x. 3 root root    0 Nov  1 02:30 rank4
drwxr-xr-x. 3 root root    0 Nov  1 02:30 rank5
drwxr-xr-x. 3 root root    0 Nov  1 02:30 rank6
drwxr-xr-x. 3 root root    0 Nov  1 02:30 rank7
--w-------. 1 root root 4096 Nov  1 02:30 reset_counters
-rw-r--r--. 1 root root 4096 Nov  1 02:30 sdram_scrub_rate
-r--r--r--. 1 root root 4096 Nov  1 02:30 seconds_since_reset
-r--r--r--. 1 root root 4096 Nov  1 02:30 size_mb
-r--r--r--. 1 root root 4096 Nov  1 02:30 ue_count
-r--r--r--. 1 root root 4096 Nov  1 02:30 ue_noinfo_count
-rw-r--r--. 1 root root 4096 Nov  1 02:30 uevent
Btw: I know you can perfectly run ZFS without ECC (initially I was also planning to do that), however, those FreeNAS people do have a point when saying “If you spend sooo much money and effort on creating a NAS with a big focus on data integrity, then it’s a bit silly to save 20 euro on not buying ECC RAM”. And I kinda agree
Having skimmed the video, part 2 is definitely a “panicked attempt at data recovery” episode, with lack of ECC being least concern here.
I’m pretty sure this was a spooky video designed just for me.
Another thing I found curious was the used HBA. My current VM-NAS configuration is quite similar, but in a mini tower case and with QSFP+ instead of 10 GbE (and ESXi instead of unRAID).
I use a Broadcom 9400 8i8e HBA: the two internal SFF-8643 ports are used for two Optanes for mirrored ZIL/L2ARC, one external SFF-8644 is reserved for the possibility of a DIY drive shelf, and the other SFF-8644 is looped back to four internal mechanical HDDs for RAIDZ2.
The HBA shown in the video (seems to be a Broadcom HBA 9400-16i) only has four internal SFF-8643 ports (?) meaning they’ll have to use multiple adapters to get two of the internal ports to connect to the SAS expander(s) in the drive shelf…?
Well, you like to live dangerously, don’t you?
My very first DIY NAS half a lifetime ago (repurposed my old Slot-A Athlon 750 MHz gaming PC after an upgrade) had non-ECC memory that was defective and the defect was randomly sooo nice that the OS didn’t crash and since there weren’t any obvious issues I didn’t bother to check any logs or to verify the copied files after the process had completed…
(to be honest I didn’t know any better)
Then I used it for a backup and about half of the data copied to it had errors. Of course I only realized it after I had wiped the local drives for a clean setup.
Fool me once…