ASRock Rack X470D4U2-2T

I’m curious about that but unfamiliar with it. How do you have that set up?

https://help.ubuntu.com/community/SerialConsoleHowto

Mh interesting… I always kinda assumed it did because what would be the point otherwise :thinking: Because as far as I know an uncorrectable error isn’t necessarily a reason for the kernel to halt, right? So might still end up with broken data? Doesn’t really make sense to me :thinking:

@wendell probably knows? :stuck_out_tongue:

Posting this here too:

Hi all,

I’ve recently bought the X470D4U2-2T for a FreeNAS build.

I’ve posted my first experiences with this board at the FreeNAS forum (search for “FreeNAS build with 10gbe and ryzen” on ixsystems.com)

But since this forum seems very active regarding this board and since many people are struggling, I thought it could be useful here as well:

For those having trouble with a similar build, here is how I “kinda” did it…
Please keep in mind that I didn’t log everything I did, so this is from memory and perhaps I’m forgetting something…

So, this is my first build with IPMI. I didn’t install a graphics card, but I was hoping it would work with just a network connection to the IPMI (and it did :slight_smile: )

  • I installed all hardware except the HBA / HDDs in the case
  • I attached a monitor to the D-Sub connector of the mobo (I think this is of the IPMI?)
  • I attached a USB keyboard / mouse
  • I attached a network cable between the IPMI NIC and the router
  • In my router’s admin page I found the IP that the router’s DHCP assigned to the IPMI
  • I browsed to the IPMI web page and logged in with the default credentials (admin / admin)
  • I configured both router and IPMI to use a static IP and changed the admin password
  • I updated the IPMI to v1.40.00 using ‘maintenance’ - ‘firmware update’ in the IPMI
  • I updated the BIOS to v3.20 using ‘maintenance’ - ‘BIOS Update’ in the IPMI
    • I think I used the ‘Instant Flash’ download for this
    • Yes, this worked, even though my CPU wasn’t supported by the older BIOS that was pre-installed
    • No, this isn’t properly documented anywhere that I could find
  • I tried opening ‘Remote Control’ and booting, but didn’t get any picture, only many beeps / error codes
  • In the end, I think unplugging both the monitor and the USB keyboard / mouse allowed me to finally boot and get a picture in ‘Remote Control’
  • At one point I also tried disabling ‘Onboard Graphics’ in the BIOS (to save some power usage), but that also caused it to no longer boot. Will try fine tuning the BIOS a bit more later…
  • Then I installed the HBA / HDDs
  • The HDDs support the enterprise feature PWDIS (Power Disable). The PSU does not support this. So none of the HDDs would power up.
    • I fixed this by disconnecting the 3.3v from the modular SATA power cable on the PSU side, using the below method:
      (link was here, see my post ixsystems if you need the link)
    • For 1 of the 2 cables this succeeded without any damage to the disconnected cable; for the other cable, 1 of the 2 hooks got bent / forced (I do think I’ll manage to re-attach both without anyone being able to tell, in case I ever need to use the warranty)
  • The HBA was already in IT mode, but was running an old firmware. Upgrading it was a struggle…
    • Bootable (Free)DOS USB stick didn’t work as my BIOS is UEFI (“Failed to initialize PAL” error)
    • Updating from the EFI shell also refused to work, even though I was using the same old ‘Shell_Full.efi’ file that others used successfully (I kept getting the “InitShellApp: Application not started from Shell” error)
    • In the end I just installed Windows 10 and used the Windows x64 installer! :smiley:
  • Finally I installed FreeNAS 11.2 U6. No issues occurred during the install.
  • I installed both Windows 10 and FreeNAS using the ‘CD Image’ functionality inside the ‘Remote Control’ of the IPMI. It’s probably slower, but saves you from having to create and mess with USB sticks.

I also found this Reddit post on how to test ECC RAM:

I might try it in the future…

Edit:
I also found the article below, which is even more detailed

After spending quite some time testing, I have a big update on this, but also still some questions / issues with it.

Windows
[image]
For the first command, the possible values are 2 (unknown), 3 (none), 4 (parity), 5 (single-bit ECC), or 6 (multi-bit ECC). So that looks good!

For the second command: that also looks good (TotalWidth is larger than DataWidth).
[images]
CPU-Z, HWiNFO64 and AIDA64 also correctly recognize the ECC RAM, and AIDA64 also reports that it is enabled.
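The two Windows checks above boil down to a code lookup and a width comparison. Here is a minimal sketch of that logic as shell helpers. The function names are made up for illustration, only the codes 2 to 6 come from the post, and 72 vs 64 bits is just an example pair of widths.

```shell
#!/bin/sh
# Map an error-correction code to the meaning listed in the post.
# Only codes 2-6 come from the post; anything else is reported as "other".
ecc_type_name() {
    case "$1" in
        2) echo "unknown" ;;
        3) echo "none" ;;
        4) echo "parity" ;;
        5) echo "single-bit ECC" ;;
        6) echo "multi-bit ECC" ;;
        *) echo "other ($1)" ;;
    esac
}

# A module carries extra ECC bits when TotalWidth exceeds DataWidth.
# $1: total width in bits, $2: data width in bits
has_ecc_bits() {
    total_width=$1
    data_width=$2
    [ "$total_width" -gt "$data_width" ]
}
```

For example, `ecc_type_name 6` prints `multi-bit ECC`, and `has_ecc_bits 72 64 && echo "ECC bits present"` prints the message.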

Linux


Everything also looks OK in Linux: ‘DRAM ECC enabled’ and ‘using x16 syndromes’.
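A small sketch for pulling those messages out of a kernel log, assuming they come from the EDAC driver and show up in `dmesg` output. The helper names are hypothetical, and the only message text relied on is the ‘DRAM ECC enabled’ string quoted above.

```shell
#!/bin/sh
# Keep only EDAC-related lines from kernel log text on stdin.
edac_lines() {
    grep -i 'edac' || true
}

# Succeed if the log on stdin contains the "DRAM ECC enabled" message
# quoted in the post.
ecc_enabled_in_log() {
    grep -q 'DRAM ECC enabled'
}
```

Typical use would be `dmesg | edac_lines` to eyeball the driver messages, or `dmesg | ecc_enabled_in_log && echo "ECC reported as enabled"`.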

But then the actual testing

For this I’ve overclocked the memory from 1333 MHz to 1500 MHz, keeping all other timings the same. At 1533 MHz or 1567 MHz the mobo no longer POSTs and requires a clear CMOS to recover.

These are my default settings (the memory is at the bottom right)


And these are my overclocked settings

However, with the overclocked settings I’m failing to log any memory error at all on both Windows and Linux…
:(


memtester, memtest86+ and Prime95 Blend can all run for hours without error at this speed.
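One way to check whether a stress run logged anything on Linux is to compare the EDAC counters before and after it. This is a rough sketch with made-up helper names, assuming the counters live under `/sys/devices/system/edac/mc` as `mcN/ce_count` and `mcN/ue_count`.

```shell
#!/bin/sh
# Sum one counter file (ce_count or ue_count) over all memory controllers.
# $1: counter name, $2: EDAC base directory (defaults to the sysfs path)
edac_sum() {
    counter=$1
    base=${2:-/sys/devices/system/edac/mc}
    total=0
    for f in "$base"/mc*/"$counter"; do
        [ -r "$f" ] || continue      # skip when no controller is present
        total=$((total + $(cat "$f")))
    done
    echo "$total"
}

# Run a command and print how many corrected / uncorrected errors
# were counted while it ran.
# Usage: edac_delta BASE_DIR command [args...]
edac_delta() {
    base=$1
    shift
    ce_before=$(edac_sum ce_count "$base")
    ue_before=$(edac_sum ue_count "$base")
    "$@"
    status=$?
    ce_after=$(edac_sum ce_count "$base")
    ue_after=$(edac_sum ue_count "$base")
    echo "corrected: +$((ce_after - ce_before)) uncorrected: +$((ue_after - ue_before))"
    return "$status"
}
```

For example, `edac_delta /sys/devices/system/edac/mc memtester 1G 1` runs one memtester loop over 1 GB (size and loop count are just placeholders) and then prints the change in both counters.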

I suspect that ECC actually does work and corrects many errors, but doesn’t report anything to any OS? (because just slightly increasing the frequency causes it to not POST at all anymore).

In the IPMI I’m also not finding any errors being reported:
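Besides the web interface, the BMC event log can usually also be read from a running Linux system with `ipmitool sel elist`. That assumes ipmitool is installed and the IPMI kernel drivers are loaded, neither of which is mentioned in this thread. A small, hypothetical filter for memory-related entries:

```shell
#!/bin/sh
# Keep only memory / ECC related entries from event log text on stdin.
# The match is a plain case-insensitive search, so it may also catch
# unrelated entries that happen to mention "memory".
sel_memory_events() {
    grep -i -E 'ecc|memory' || true
}
```

Usage would be `ipmitool sel elist | sel_memory_events`; empty output means no matching entries were logged.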



I also tried disabling the ECC functionality to see whether I could make any of the stress test programs crash, or whether the OS would then receive reports of uncorrected errors (this would at least prove that my memory is “unstable” at this frequency).

But that failed too. Even after disabling ECC, I get no errors in Linux or Windows (I didn’t check the IPMI in this scenario yet) and no crashes either.

I used the BIOS settings below to try this (not sure if this is correct / sufficient though).

These settings are default (but show the BIOS maze I went through to get there ;) ):
[images]
I’ve tried to change ‘DRAM ECC Enable’ to ‘Disabled’ and after that also ‘DRAM UECC Retry’ to ‘Disabled’:
[images]
Can someone help me figure out how to fix my ECC error reporting to the OS (and/or the IPMI) or explain what I’m doing wrong?

Thanks!

Edit: issues with pictures solved (user error :wink: )

Maybe you can see them in preview because they are cached?

But I think their forum wants a login to view the images. Not sure 100% tbh.

But if I click one, like,

https://hardwarecanucks.com/forum/attachments/1571955813388-png.27275/

I get some error message about log in.

You need to be a logged-in user (this prevents them from paying for the bandwidth when others use the images). If you copy and paste, you can upload the images to the L1 Forum.

Ah lol ok, now I understand what you mean… :smiley:

I was thinking “but I am logged in” (on Level1Techs I mean)… But you meant “logged in on Hardware Canucks”, where the pictures are apparently still hosted :stuck_out_tongue: I (wrongly) assumed that when copying, the pictures were copied and hosted on Level1Techs automatically as well.

I have edited my post now with working pictures :partying_face:

Hi Wendell (and Steve?),

Awesome that you’re working with Steve on practically exactly the same hardware project as I am (this board with a Ryzen 3600)!!

I suppose that for ZFS on UnRAID, ECC is just as critical as on FreeNAS? Will you have a look at the ECC aspect in this video series?

I started looking into this ECC aspect myself because a FreeNAS guru (jgreco) doubted that it was actually fully functional and tested (mainly, proper reporting to the BMC (or OS) seems to be a question mark):

Would be nice if someone can confirm / properly test this! (or even if your contact at ASRock could confirm that this really really works?).
You seem to be the perfect team for figuring this out: you seem to know something about servers / ECC, and Steve has probably messed with memory timings / frequency at some point in his life :smiley:

Thanks!

edit:
Feel free to use my “ground-work” in your video series :wink:

We’re running the “normal” non-10G version of this board and we’ve seen the ce_count increase after running the board for two months. Currently the counter is back to 0 because we restarted it a few days ago (I will let you know when it has increased). However, the IPMI isn’t logging these errors. Maybe it only shows uncorrectable errors?

You can have a look at this path for the current count.
cat /sys/devices/system/edac/mc/mc0/ce_count
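Since the counter starts again from 0 after a restart, it helps to print it together with the time since the last reset. A minimal sketch, assuming every controller directory offers `ce_count`, `ue_count` and `seconds_since_reset` like `mc0` does; the function name is made up.

```shell
#!/bin/sh
# Print corrected / uncorrected error counts for every EDAC memory
# controller, plus how long the counters have been accumulating.
# $1: EDAC base directory (defaults to the standard sysfs location)
edac_report() {
    base=${1:-/sys/devices/system/edac/mc}
    for mc in "$base"/mc*; do
        [ -d "$mc" ] || continue     # no controllers found
        ce=$(cat "$mc/ce_count")
        ue=$(cat "$mc/ue_count")
        age=$(cat "$mc/seconds_since_reset")
        echo "$(basename "$mc"): ce=$ce ue=$ue seconds_since_reset=$age"
    done
}
```

Calling `edac_report` without arguments reads the live sysfs tree and prints one line per memory controller.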

I was kind of shocked that in the Gamers Nexus YouTube video they showed “Gamer” memory without ECC being used while planning to set up ZFS. It sent a shiver down my spine, and I would have expected Wendell to protest that configuration later - maybe that comes in part 2.

I’ve run ZFS for a few years without ECC memory. It’s fine.

I guess this is the same info as what edac-util displays:

[root@localhost ~]# cat /sys/devices/system/edac/mc/mc0/ce_count
0
[root@localhost ~]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
[root@localhost ~]# ls -la /sys/devices/system/edac/mc/mc0/
total 0
drwxr-xr-x. 9 root root    0 Nov  1 02:30 .
drwxr-xr-x. 4 root root    0 Oct 31 21:41 ..
-r--r--r--. 1 root root 4096 Nov  1 02:30 ce_count
-r--r--r--. 1 root root 4096 Nov  1 02:30 ce_noinfo_count
drwxr-xr-x. 3 root root    0 Nov  1 02:30 csrow2
drwxr-xr-x. 3 root root    0 Nov  1 02:30 csrow3
-r--r--r--. 1 root root 4096 Nov  1 02:30 max_location
-r--r--r--. 1 root root 4096 Nov  1 02:30 mc_name
drwxr-xr-x. 2 root root    0 Nov  1 02:30 power
drwxr-xr-x. 3 root root    0 Nov  1 02:30 rank4
drwxr-xr-x. 3 root root    0 Nov  1 02:30 rank5
drwxr-xr-x. 3 root root    0 Nov  1 02:30 rank6
drwxr-xr-x. 3 root root    0 Nov  1 02:30 rank7
--w-------. 1 root root 4096 Nov  1 02:30 reset_counters
-rw-r--r--. 1 root root 4096 Nov  1 02:30 sdram_scrub_rate
-r--r--r--. 1 root root 4096 Nov  1 02:30 seconds_since_reset
-r--r--r--. 1 root root 4096 Nov  1 02:30 size_mb
-r--r--r--. 1 root root 4096 Nov  1 02:30 ue_count
-r--r--r--. 1 root root 4096 Nov  1 02:30 ue_noinfo_count
-rw-r--r--. 1 root root 4096 Nov  1 02:30 uevent
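The per-row numbers that edac-util prints can also be read straight from the `csrow*` directories in that listing. This sketch assumes each `csrowN` directory holds its own `ce_count` and `ue_count` files (the listing above does not show what is inside them); the function name is hypothetical.

```shell
#!/bin/sh
# List corrected / uncorrected counts per chip-select row of one controller.
# Assumes csrowN/ce_count and csrowN/ue_count exist.
# $1: controller directory (defaults to mc0, as in the listing above)
csrow_report() {
    mc=${1:-/sys/devices/system/edac/mc/mc0}
    for row in "$mc"/csrow*; do
        [ -d "$row" ] || continue    # controller exposes no csrow dirs
        echo "$(basename "$row"): ce=$(cat "$row/ce_count") ue=$(cat "$row/ue_count")"
    done
}
```

On the system shown above, `csrow_report` would print one line each for csrow2 and csrow3.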

Btw: I know you can perfectly run ZFS without ECC (initially I was also planning to do that), however, those FreeNAS people do have a point when saying “If you spend sooo much money and effort on creating a NAS with a big focus on data integrity, then it’s a bit silly to save 20 euro on not buying ECC RAM”. And I kinda agree :wink:

Having skimmed the video, part 2 is definitely a “panicked attempt at data recovery” episode, with the lack of ECC being the least concern here.

I’m pretty sure this was a spooky video designed just for me.

Another thing I found curious was the HBA used. My current VM-NAS configuration is quite similar, but in a mini tower case and with QSFP+ instead of 10 GbE (and ESXi instead of unRAID).

I use a Broadcom 9400 8i8e HBA: the two internal SFF-8643 ports are used for two Optanes for mirrored ZIL/L2ARC, one external SFF-8644 is reserved for a possible DIY drive shelf, and the other SFF-8644 is looped back to four internal mechanical HDDs for RAIDZ2.

The HBA shown in the video (seems to be a Broadcom HBA 9400-16i) only has four internal SFF-8643 ports (?) meaning they’ll have to use multiple adapters to get two of the internal ports to connect to the SAS expander(s) in the drive shelf…?

Well, you like to live dangerously, don’t you? :wink:

My very first DIY NAS half a lifetime ago (repurposed my old Slot-A Athlon 750 MHz gaming PC after an upgrade) had non-ECC memory that was defective and the defect was randomly sooo nice that the OS didn’t crash and since there weren’t any obvious issues I didn’t bother to check any logs or to verify the copied files after the process had completed…
(to be honest I didn’t know any better)

Then I used it for a backup and about half of the data copied to it had errors :frowning:. Of course I only realized it after I had wiped the local drives for a clean setup.

Fool me once…

Does anyone know where in the BIOS I can lower the “main-memory-voltage” (the one that should be 1.2 V by default)?

I’d like to try that next to cause memory instability and hopefully see ECC functioning.

I’ve already bumped frequency to the max achievable and timings (at least the main ones) are also tightened as much as they can…

I think I just found it :slight_smile: