Supermicro watchdog broken?

My main home server is based around a Supermicro X11SAE board and it has been flawless since I built it several years ago. It is one of those workstation-style boards which have a regular ATX layout and Xeon E3 CPUs, but with some server features like an Aspeed AST2400 BMC for IPMI.

Today, I noticed my Home Assistant automations had stopped working and realised that the server was sitting at the POST screen, mid-boot. It came back up, but while I was in the middle of doing some checks it powered off again. Once it was up again, the logs showed that for about half an hour it had only been staying up for around 5 minutes at a time.

Everything looked normal - temperatures fine, voltages fine, battery fine. As far as Linux was concerned, it was a sudden unexpected reboot. The server is on a UPS and all other devices like my firewall remained up.

I tried to check the event log through the BMC, but it was full of 512 old entries moaning about nonsense like fans that aren’t connected, with nothing recent. I guess it doesn’t clean itself up when full, so anything new was never recorded.
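For anyone in the same spot, a full event log can be saved and emptied from the OS side with ipmitool. This is only a minimal sketch, assuming ipmitool is installed and the local IPMI driver is loaded; the helper names are made up for illustration, and I have not confirmed how this particular BMC behaves once the log is full.

```python
import subprocess

# Standard ipmitool SEL subcommands; run as root on the server itself.
SEL_INFO = ["ipmitool", "sel", "info"]    # entry count, free space, overflow flag
SEL_LIST = ["ipmitool", "sel", "elist"]   # entries with sensor names decoded
SEL_CLEAR = ["ipmitool", "sel", "clear"]  # erases the whole log


def run(cmd):
    """Run one ipmitool command and return its stdout as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def dump_and_clear(runner=run):
    """Save the current log contents, then clear it so new events have room."""
    saved = runner(SEL_LIST)
    runner(SEL_CLEAR)
    return saved
```

Calling `dump_and_clear()` returns the old entries as text, so they can be written to a file before they are gone for good.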

The watchdog situation is not particularly clear. I’ve always had the watchdog BIOS option set to Disabled, but there is also a jumper with two positions: the factory position resets the system, and the alternative position generates an NMI instead. I saw some people with other Supermicro boards remove the watchdog jumper completely. I never touched the jumper when I built the system, so it had always been in the reset position. I’ve now removed it, and the system is behaving itself.
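Independently of the jumper, the BMC can be asked whether its watchdog timer is actually armed. The sketch below is a rough illustration, assuming ipmitool is installed and that the output of `ipmitool mc watchdog get` is a series of `Key: value` lines with a `Watchdog Timer Is` field; the function names are hypothetical.

```python
import subprocess


def parse_watchdog(text):
    """Turn 'Key: value' lines from `ipmitool mc watchdog get` into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon
            fields[key.strip()] = value.strip()
    return fields


def watchdog_is_running(fields):
    """True if the BMC reports the timer as started (field name is assumed)."""
    return "running" in fields.get("Watchdog Timer Is", "").lower()


def read_watchdog():
    """Query the BMC; needs root and the local IPMI driver loaded."""
    out = subprocess.run(
        ["ipmitool", "mc", "watchdog", "get"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_watchdog(out)
```

If `watchdog_is_running(read_watchdog())` comes back true while nothing in the OS is feeding the timer, that would at least be consistent with a reset every few minutes, and `ipmitool mc watchdog off` should stop it.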

The thing is, I don’t like the idea that this has “fixed” the system (if it even has), because it has been running fine with the default jumper configuration for years. Surely something has broken hardware-wise, because I haven’t touched the OS or other software either!

Has anyone else here seen this with Supermicro (or Aspeed BMCs generally)?

The watchdog was a red herring - motherboard now completely dead.

Didn’t see your original post.
Was gonna say, this is typically the last warning sign before CPU or motherboard failure.
