ASUS W790 Sage... Something happened? Restart after boot

I’m running a W9-3495 on a W790 Sage with 512GB of mixed RAM rated between 4800 and 5600MHz, an RTX 4090, and a couple of other accessories. Storage is 2x Corsair 4TB Gen5 SSDs in RAID0 with VROC.

Had a relatively stable build for about 3 months. All of a sudden, started getting BSODs saying “Bad System Config Info”. Recovery is working fine, so didn’t figure there was an issue with the SSD/RAID array. BIOS reported the RAID array healthy. When using recovery, I can’t get to the C: - but I’m assuming that might just be because the recovery is missing some VROC drivers.

I tried gparted… but it doesn’t boot, just restarts. I have a working Windows installation on another RAID array… as soon as Windows starts, the computer restarts. memtest stops (I assume) when the screen goes blank… an hour or so in. I had all RAM slots populated, then tried with 4 sticks at a time: same experience.

So a couple of goals here…

  1. I’d like to not lose my data assuming that the raid volume is still healthy.
  2. I’d like to understand where the instability comes from…

Are you getting CPU_CATERR or Blue Screen errors logged in IPMI?
If you did get a CPU_CATERR while using VROC I’d assume the partition is corrupted and run chkdsk.

Take a look at your memory temperatures when running memtest. I don’t think I was able to get through a run without my memory overheating until I got better cooling for it. Any temperature over 65-70C is of concern.
High-capacity RDIMMs put out way more heat than the more pedestrian memory capacities.
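
If you end up logging DIMM temperatures during a memtest run, a rough way to apply that threshold is sketched below. The sensor names and sample values are invented for illustration, and the 65C and 70C cut-offs are just the range mentioned above, not a vendor spec.

```python
# Hypothetical helper: classify DIMM temperature readings against the
# 65-70 C "of concern" range mentioned above. Sensor names are made up.
WARN_C = 65.0   # lower end of the range of concern (assumption from the post)
CRIT_C = 70.0   # upper end of the range of concern (assumption from the post)


def classify_dimm_temps(readings):
    """Map each sensor name to 'ok', 'warn' (>= 65 C) or 'critical' (>= 70 C)."""
    result = {}
    for name, temp_c in readings.items():
        if temp_c >= CRIT_C:
            result[name] = "critical"
        elif temp_c >= WARN_C:
            result[name] = "warn"
        else:
            result[name] = "ok"
    return result


if __name__ == "__main__":
    # Invented sample values, e.g. copied by hand from the IPMI sensor page.
    sample = {"DIMM_A1": 52.0, "DIMM_B1": 66.5, "DIMM_C1": 71.0}
    for name, status in classify_dimm_temps(sample).items():
        print(name, status)
```

Anything that lands in "warn" or "critical" partway through a run would point at cooling rather than timings.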

I’ll try to monitor temps more, but I have the mems watercooled using https://www.ekwb.com/shop/water-blocks/ram-blocks - although air cooling might be more effective tbh… how’d you cool yours?

I didn’t see any CPU_CATERR logged in ipmi. This is all I see since things have stopped working (I was gone for a couple of weeks).

Other odd things … IPMI has an out of date system inventory, not sure how to get that to refresh - not that it matters.

I also ended up watercooling mine with the bartx spreaders/blocks. Were you able to get good memory chip contact with the EKWB spreaders?

hmmm… I think these might be memory ECC failure errors.
This is what my log looked like when I was still trying to find a stable overclock:

My 18th event is the same as the ones you are getting. I got it when my memory was on the edge of stability where it would train and post just fine, but after days of a workload would crash the system. I ended up fixing by slowing the memory a tad.

The CATERR errors from my log are from setting the CPU multiplier too high.

The system inventory should update once you have one or two successful starts without a crash on the system.

Copy that, tried running a single stick of 64GB and got both a WinPE to start, and then my other copy of Windows. Seems to be a RAM issue overall… will post back after I restart to try the Intel VROC raid. By the way, was able to see the data from the other Windows - so happy about that.

What RAM are you running? Mind sharing timings / settings for your RAM?

Thanks for your help :)

EDIT: Seems to probably be a boot issue at this point, since data / partition are there and working fine - it could be something happened to MBR? Regular Win recovery isn’t fixing boot (probably because lack of drivers) - going to try with the WinPE from EaseUS Partition Magic. Any other ideas?

It might be as simple as the RAM thermal cycling through too-wide temperature swings, working its way loose in the socket, and making bad contact. I’ve seen this before.

I’m running 8 sticks of HMCG94AEBRA at 5880MHz 34-36-36-66 using 1.3v.

I’m fuzzier on this than I’d like, but I think if the UEFI boot entry, which is supposed to be stored in non-volatile memory on the motherboard, gets wiped out, the system will no longer boot automatically. Does force booting a specific boot media in BIOS help?

holy sh!t you got HMCG94AEBRA running at +1000MHz over stock clock?

Apparently the EFI partition lives on the disk itself, it’s one of the MBR partitions. Force booting on one Windows Boot Manager works fine (my other raid) but fails on my main one (on the intel vroc). I’m gonna have to see how to fix it likely…

Yeah, if it would have been a 4800MHz 32GB stick I think I could have gotten much higher, perhaps +2000MHz. The doubling of chips that the 64GB RDIMMs have really caps maximum memory performance; I could not push past 6000MHz on my 64GB sticks no matter how much voltage I gave it.
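
For a sense of scale, here is a quick sketch of the arithmetic behind those numbers. It assumes the stock speed is 4800 (implied by the "+1000MHz" comment, not confirmed here) and treats the quoted "MHz" figures as data rates in MT/s; the function names are made up for this example.

```python
def overclock_headroom(stock_mts, actual_mts):
    """Return (absolute gain in MT/s, gain as a percentage of stock)."""
    gain = actual_mts - stock_mts
    return gain, gain / stock_mts * 100.0


def cas_latency_ns(cl, data_rate_mts):
    """First-word CAS latency in ns.

    DDR transfers twice per clock, so one clock cycle lasts
    2000 / data_rate_mts nanoseconds.
    """
    return cl * 2000.0 / data_rate_mts


if __name__ == "__main__":
    gain, pct = overclock_headroom(4800, 5880)  # 4800 stock is an assumption
    print(f"gain: {gain} MT/s ({pct:.1f}%)")
    print(f"CL34 @ 5880: {cas_latency_ns(34, 5880):.2f} ns")
```

With those inputs the gain works out to 1080 MT/s, or 22.5% over 4800, and CL34 at 5880 comes to roughly 11.6 ns.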

oh, that is not encouraging. I know EFI boot partitions can be restored using a WinPE environment like Hiren’s BootCD, but I don’t know if VROC complicates that process.

Have you changed any hardware (including NVMe drives) recently? My system that came from a SI - late because of problems with the original W790 Sage in the build - is generally rock solid, except when I try to use PCIE 5.0 drives with subpar cables. (Narrator: all cables he tested have been subpar.)

Then I get one of the following:

  1. Drive just not recognized, system boots normally after a long POST
  2. Even longer POST, drive not recognized, system boots into safe BIOS mode with everything downclocked
  3. Drive recognized but the link is 16GT/s or 8GT/s, Linux eventually crashes.
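
On Linux, a downtrained link like the one in case 3 can be spotted by comparing the `current_link_speed` and `max_link_speed` attributes that the kernel exposes per PCI device under sysfs. The sketch below is a minimal, unofficial check; the function names are made up, and it assumes the attribute text starts with a number (for example `16.0 GT/s PCIe`), skipping anything it cannot parse.

```python
from pathlib import Path


def parse_gts(text):
    """Return the leading number from a string like '16.0 GT/s PCIe', else None."""
    try:
        return float(text.split()[0])
    except (IndexError, ValueError):
        return None


def find_downtrained_links(root="/sys/bus/pci/devices"):
    """List (device, current GT/s, max GT/s) for devices running below max speed."""
    slow = []
    base = Path(root)
    if not base.is_dir():
        return slow
    for dev in sorted(base.iterdir()):
        try:
            cur = parse_gts((dev / "current_link_speed").read_text())
            top = parse_gts((dev / "max_link_speed").read_text())
        except OSError:
            continue  # attribute missing or unreadable for this device
        if cur is not None and top is not None and cur < top:
            slow.append((dev.name, cur, top))
    return slow


if __name__ == "__main__":
    for name, cur, top in find_downtrained_links():
        print(f"{name}: running at {cur} GT/s, capable of {top} GT/s")
```

Some devices drop link speed at idle to save power, so a hit here is only a hint. A Gen5 NVMe drive that stays at 16 or 8 GT/s under load is the interesting case.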

Man… I don’t get it. Yesterday I had a stable boot into secondary Windows. Today… it boots and within 10s it restarts. No codes anywhere, and not enough time to see event viewer…

And I’m having trouble believing it’s RAM related now… BIOS temps appear well within thresholds. I tried auto (assuming it picks up a profile, or at least relaxed timings) and I tried your timings at 5200MHz: it POSTed just fine, with the same behavior once it hit Windows.

@martona I didn’t change any hardware. It just stopped working one day all of a sudden.

Are there any useful errors logged in the IPMI web portal? You can also monitor RAM temps in real time from the web portal.

What PSU are you running? A reboot from a DC power brownout could happen without leaving a trace, but it’s kind of odd to only be experiencing that now. Turning off some of the higher C-states in BIOS might actually help if this were the case; it can smooth out the slew rate of the power the motherboard asks for.


I am tuned in because I am wondering what RAM you got your system running with. I have a 2495X and a 3495X, and despite all my efforts the 3495X wouldn’t POST; it was unclear which memory would actually work.

You mentioned HMCG94AEBRA … it’s working well?

Welcome,

W790 seems to be the most tolerant platform with registered DDR5 so far. I haven’t found any RAM that will make my system fail to POST, assuming we’re running at reasonable speeds. That being said, I haven’t tried any non-binary RAM yet.
Yes, HMCG94AEBRA works well, but a lower-density RDIMM would work even better.

If you have an ES or QS CPU, you can have posting problems if you aren’t on an early enough BIOS.
