Hi @wendell ,
A suspiciously large number of server motherboards are hitting a 00 post code issue with no recovery so far. The mobo BMC still works but shows no related info in its logs, and parts report OK.
The board is the ASRock B650D4U. I've had two failures in the span of 2 months, same-spec systems but different firmware versions: 7950X at 105 W, 64 GB RAM, 1U servers from Rect servers in Germany.
Other users are reporting the same issue in much larger numbers. I've got quite a few more of these in the pipeline and I'm getting very worried, as they're production systems.
Asrock forum:
Have you heard of this? It is really concerning.
Thanks,
Liv.
We bought 8x 1U4LW-B650/2L2T RPSU.
2 seem dead and 1 seems to be dying.
- One was flashed to the newest stable BIOS using a 7600X. We had boot issues on that one, which is why we tried the BIOS update; now it's completely dead, showing 00.
- On the second I never got a picture at all. It shows 00 with both a 7600 and an EPYC 4124P. The sticker on the BIOS chip says 4.09.
- We have a third running a 4124P. That one needs ~30 s to move past 00, and it also shows voltage errors in its BMC event log.
It might very well be a BIOS version that’s bricking it.
What my wild imagination makes me think is that they might have built a BIOS meant to work with a different IC on the board than the one used in previous revisions, so the new BIOS isn't getting the response it expects from that IC and just halts everything.
I read through the forum post you linked, and someone there said they've gone through a lot of motherboards, all with the same issue. So it isn't unrealistic to think that someone messed up somewhere along the line and it's not a one-in-a-thousand defective board.
The 2nd was on the latest BIOS 4.10 / 10.15, running PVE 8.
So it doesn't seem to be tied to a specific version, from what I can see.
For servers going live now I'm not updating to the new BIOS, in the hope they'll survive, and I'm setting the CPU power to Eco 65 W. I've found servers arriving out of the box set to Eco 105 W and to Eco off, which I think is ~190 W. I don't think it's power related anyway.
I'm hoping @wendell will pull some connections magic out of his hat, draw attention, and maybe at least help isolate a working scenario until ASRock comes up with a fix. @wendell, I don't have logs at this time; I might be able to pull some from the SSDs I took out before the RMA. The BMC logs don't show any issues; the board just doesn't power on, code 00 on the mobo.
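(For anyone else trying to collect evidence before an RMA: the BMC's system event log can usually be dumped remotely with ipmitool. Below is just a rough sketch; the BMC address and credentials are placeholders, and it assumes ipmitool is installed and the BMC's LAN interface is reachable.)

```python
# Rough sketch: save the BMC system event log (SEL) to a file via ipmitool
# before shipping a board back. BMC_HOST / BMC_USER / BMC_PASS are placeholders.
import subprocess
from datetime import datetime

BMC_HOST = "192.168.1.120"   # placeholder: your BMC's IP
BMC_USER = "admin"           # placeholder credentials
BMC_PASS = "changeme"

def dump_sel(outfile: str) -> None:
    # "sel elist" prints each SEL entry with a human-readable timestamp
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", BMC_HOST,
         "-U", BMC_USER, "-P", BMC_PASS, "sel", "elist"],
        capture_output=True, text=True, check=True,
    )
    with open(outfile, "w") as f:
        f.write(result.stdout)

if __name__ == "__main__":
    dump_sel(f"sel-{datetime.now():%Y%m%d-%H%M%S}.txt")
```

Even when the board itself won't POST, the SEL sometimes captures voltage or power-fault events from before the failure, which could be useful to attach to an RMA or to this thread.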
The server load is very light in my case; the software wasn't deployed for full functionality yet. And it's not the RAM, I swapped it.
Thanks!
I also have an ASRock Rack B650D4U that has slowly progressed to the 00 boot error.
When I first got it, it ran for ~3 days on my desk before I put it in the rack. The first warning sign came when it would not boot at first; then, after it did boot following a couple of hard resets through the IPMI, it stayed on for 10 days and then randomly powered off.
It did this 4-5 times. And then, miraculously, it lasted 45 days on the last boot.
Until this morning, when it just sticks on 00 and does not POST. There is an error saying the power action failed. Everything has been reseated 3 times through this process. It had been on for 45 days as of last night, and nothing was changed.
Support told me to try the experimental BIOS version 20.01. That did not work either. Getting an RMA in tonight. Bought May 11th, dead in October. Judging from others, I am not the only one.
I would say BEWARE of this board until the hardware revisions are nailed down as safe or not. If your board is this revision, I would say you are at risk.
I heard a rumour that there might be a component out of spec on some motherboards, but I can't confirm this yet, and maybe never will; it might be bull$h1t. So far they are replacing my failed ones, but I'm really worried about the ones that haven't failed yet.
We are an MSP specializing in hyperconverged infrastructure for production environments, and we've deployed ASRock B650D4U boards in our systems. We've encountered significant issues with them.
Out of 22 units installed, over 25% have failed with the dreaded "Post Code 00" error, leaving us unable to recover or bring systems back online, despite parts showing no faults in the BMC logs. These units are all used in critical infrastructure, and with a failure rate like this, our hyperconverged, highly available setup has repeatedly failed to provide the reliability we promised to clients.
Given the gravity of this situation and the impact it has on production environments, we’re looking for concrete solutions or at least transparency from ASRock about the root cause. The high failure rate we and others have reported seriously impacts both our client satisfaction and reputation, and the reputation of this hardware in high-availability applications.
We have exactly the same issue, but luckily we were late enough that we had only started deploying these systems in a single HCI production environment.
We now sit on 12 systems that can't be deployed until this mess is cleaned up.
Of our 14 systems, 4 have failed: 3 while we were setting them up and one was DOA.
We were just starting to use them. I have a lot more projects on the way and would need many more of these systems, but for now everything is on hold.
I'm not even sure if I can rely on the replaced motherboards.
Edit: Just a side note. The ASRock barebones we use also tend to have multiple defective HDD LEDs on their InWin backplanes. That is clearly a soldering error and will hopefully be fixed in a few months; pressing on the LEDs makes them light up. ASRock does RMA those backplanes without issue. 3 of the 14 backplanes were affected.