Hi @wendell ,
A suspiciously large number of server motherboards are hitting a 00 post code issue with no recovery so far. The mobo BMC still works but shows no related info in its logs, and parts report OK.
The board is the ASRock B650D4U. I've had two failures in the span of 2 months, same-spec systems but different firmware versions: 7950X at 105 W, 64 GB RAM, 1U servers from Rect servers in Germany.
Other users are reporting the same issue in much larger numbers. I've got quite a few more of these in the pipeline and I'm getting very worried, as they're production systems.
Asrock forum:
Have you heard of this? It is really concerning.
Thanks,
Liv.
We bought 8x 1U4LW-B650/2L2T RPSU.
2 seem dead and 1 seems to be dying.
- One was flashed to the newest stable BIOS using a 7600X. We had boot issues on that one, which is why we tried the BIOS update; now it's completely dead, showing 00.
- On the second I never got a picture at all. It shows 00 with both a 7600 and an EPYC 4124P. The sticker on the BIOS chip says 4.09.
- We have a third running a 4124P. That one needs ~30 s to move past 00, and it also shows voltage errors in its BMC event log.
It might very well be a BIOS version that’s bricking it.
What my wild imagination makes me think is that they might have built a BIOS meant to work with a different IC on the board than the one used in previous revisions, so the new BIOS isn't getting the response it expects from that IC and just halts everything.
I read through the forum post you linked, and someone there said they've gone through a lot of motherboards, all with the same issue. So it isn't unrealistic to think that someone messed up somewhere along the line and it's not a one-in-a-thousand defective board.
The 2nd was on the latest BIOS 4.10 / 10.15, running PVE 8.
So it doesn't seem to be tied to a specific version, from what I can see.
For servers going live now I'm not updating to the new BIOS, in the hope they'll survive, and I'm setting the CPU power to Eco 65 W. I've found servers arriving out of the box set to Eco 105 W and to Eco off, which I think is ~190 W. I don't think it's power related anyway.
I'm hoping @wendell will pull some connections magic out of his hat, draw attention, and maybe at least help isolate a working scenario until ASRock comes up with a fix. @wendell, I don't have logs at this time; I might be able to pull some from the SSDs I took out before the RMA. The BMC logs don't show any issues; the board just doesn't power on, code 00 on the mobo.
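(For anyone else trying to collect evidence before an RMA: the BMC's system event log can usually be dumped remotely with ipmitool. Below is just a rough sketch; the BMC address and credentials are placeholders, and it assumes ipmitool is installed and the BMC's LAN interface is reachable.)

```python
# Rough sketch: save the BMC system event log (SEL) to a file via ipmitool
# before shipping a board back. BMC_HOST / BMC_USER / BMC_PASS are placeholders.
import subprocess
from datetime import datetime

BMC_HOST = "192.168.1.120"   # placeholder: your BMC's IP
BMC_USER = "admin"           # placeholder credentials
BMC_PASS = "changeme"

def dump_sel(outfile: str) -> None:
    # "sel elist" prints each SEL entry with a human-readable timestamp
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", BMC_HOST,
         "-U", BMC_USER, "-P", BMC_PASS, "sel", "elist"],
        capture_output=True, text=True, check=True,
    )
    with open(outfile, "w") as f:
        f.write(result.stdout)

if __name__ == "__main__":
    dump_sel(f"sel-{datetime.now():%Y%m%d-%H%M%S}.txt")
```

Even when the board itself won't POST, the SEL sometimes captures voltage or power-fault events from before the failure, which could be useful to attach to an RMA or to this thread.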
The server load is very light in my case; the software wasn't deployed for full functionality yet. And it's not the RAM, I swapped it.
Thanks!
I also have an ASRock Rack B650D4U that has slowly progressed to the 00 boot error.
When I first got it, it ran for ~3 days on my desk before I put it in the rack. The first warning sign came when it would not boot at first; then, after it did boot following a couple of hard resets through the IPMI, it stayed on for 10 days and then randomly powered off.
It did this 4-5 times. And then, miraculously, it lasted 45 days on the last boot.
Until this morning, when it just sticks on 00 and does not POST. There is an error saying the power action failed. Everything has been reseated 3 times through this process. It had been on for 45 days as of last night, and nothing was changed.
Support told me to try the experimental BIOS version 20.01. That did not work either. Getting an RMA in tonight. Bought May 11th, dead in October. Judging from others, I am not the only one.
I would say BEWARE of this board until the hardware revisions are nailed down as safe or not. If your board is this revision, I would say you are at risk.
I heard a rumour that there might be a component out of spec on some motherboards, but I can't confirm this yet, and maybe never will; it might be bull$h1t. So far they are replacing my failed ones, but I'm really worried about the ones that haven't failed yet.
We are an MSP specializing in hyperconverged infrastructure for production environments, and we've deployed ASRock B650D4U boards in our systems. We've encountered significant issues with them.
Out of 22 units installed, over 25% have failed with the dreaded "Post Code 00" error, leaving us unable to recover or bring systems back online, despite parts showing no faults in the BMC logs. These units are all used in critical infrastructure, and with a failure rate like this, our hyperconverged, highly available setup has repeatedly failed to provide the reliability we promised to clients.
Given the gravity of this situation and the impact it has on production environments, we’re looking for concrete solutions or at least transparency from ASRock about the root cause. The high failure rate we and others have reported seriously impacts both our client satisfaction and reputation, and the reputation of this hardware in high-availability applications.
We have exactly the same issue, but luckily we were late enough that we had only started deploying these systems in a single HCI production environment.
We now sit on 12 systems that can't be deployed until this mess is cleaned up.
Of our 14 systems, 4 have failed: 3 while we were setting them up and one was DOA.
We were just starting to use them. I have a lot more projects on the way and would need many more of these systems, but for now everything is on hold.
I'm not even sure if I can rely on the replaced motherboards.
Edit: Just a side note. The ASRock barebones we use also tend to have multiple defective HDD LEDs on their InWin backplanes. That is clearly a soldering error and will hopefully be fixed in a few months; pressing on the LEDs makes them light up. ASRock does RMA those backplanes without issue. 3 of the 14 backplanes were affected.