Supermicro watchdog broken?

My main home server is based around a Supermicro X11SAE board and it has been flawless since I built it several years ago. It is one of those workstation-style boards which have a regular ATX layout and Xeon E3 CPUs, but with some server features like an Aspeed AST2400 BMC for IPMI.

Today I noticed my Home Assistant automations had stopped working and realised that the server was sitting at the POST screen, mid-boot. Once it came back up, I was in the middle of doing some checks when it powered off again. When it came back a second time, the logs showed it had been running for only around 5 minutes between reboots, and that this had been going on for about half an hour.
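To confirm a pattern like this from the logs, the gaps between boots can be worked out from a list of boot timestamps. The following is a minimal Python sketch, not the method used here: the function names are made up for illustration, the timestamps are hypothetical rather than taken from the real logs, and on a systemd machine the real values could come from something like `journalctl --list-boots` or `last -x reboot`. Note that the gap between two boots is uptime plus boot time, so it only approximates how long the system actually ran.

```python
from datetime import datetime

def boot_gaps_minutes(boot_times):
    """Return the minutes between consecutive boots, oldest first."""
    ordered = sorted(boot_times)
    return [
        (later - earlier).total_seconds() / 60
        for earlier, later in zip(ordered, ordered[1:])
    ]

def looks_like_reboot_loop(boot_times, max_gap_minutes=10, min_boots=3):
    """Heuristic: at least min_boots boots, each within max_gap_minutes of the previous one."""
    gaps = boot_gaps_minutes(boot_times)
    return len(boot_times) >= min_boots and all(g <= max_gap_minutes for g in gaps)

# Hypothetical boot times, roughly five minutes apart (not from the real logs).
example = [
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 10, 5),
    datetime(2024, 1, 1, 10, 11),
    datetime(2024, 1, 1, 10, 16),
]
print(boot_gaps_minutes(example))
print(looks_like_reboot_loop(example))
```

The thresholds (10 minutes, 3 boots) are arbitrary defaults chosen for the example and would need tuning for a machine that reboots regularly for legitimate reasons.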

Everything looked normal - temperatures fine, voltages fine, battery fine. As far as Linux was concerned, it was a sudden unexpected reboot. The server is on a UPS and all other devices like my firewall remained up.

I tried to check the event log through the BMC, but it was full: 512 old entries moaning about nonsense like fans that aren’t connected, and nothing recent. I guess it doesn’t clean itself up when full.
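A full event log can be saved and then cleared from the OS with `ipmitool`, so that any new events have somewhere to go. Below is a minimal Python sketch, assuming `ipmitool` is installed and can reach the local BMC (usually this needs root and the IPMI kernel driver loaded). The helper name `save_and_clear_sel` and the backup path are hypothetical; the three `sel` subcommands are standard `ipmitool` ones.

```python
import subprocess

SEL_INFO = ["ipmitool", "sel", "info"]    # summary of the event log
SEL_LIST = ["ipmitool", "sel", "list"]    # dump every entry
SEL_CLEAR = ["ipmitool", "sel", "clear"]  # erase the log

def save_and_clear_sel(backup_path, run=subprocess.run):
    """Write the current SEL entries to backup_path, then clear the log.

    The `run` parameter exists only so the function can be exercised
    without a real BMC; by default it is subprocess.run.
    """
    listing = run(SEL_LIST, capture_output=True, text=True, check=True)
    with open(backup_path, "w") as fh:
        fh.write(listing.stdout)
    # Only reached if the listing succeeded, so the old entries are not lost.
    run(SEL_CLEAR, check=True)
```

Calling `save_and_clear_sel("sel-backup.txt")` would keep a copy of the old entries before wiping them. Whether the BMC on this particular board stops logging or overwrites old entries when full is not something the sketch can tell you.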

The watchdog situation is not particularly clear. I’ve always had the watchdog BIOS option set to Disabled, but there is also a jumper with two positions: the factory position resets the system, and the alternative position generates an NMI. I saw some people with other Supermicro boards remove the watchdog jumper completely. I never touched the jumper when I built the system, so it has always been in the reset position. I’ve now removed it, and the system is behaving itself.
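Separately from the jumper, the BMC has its own IPMI watchdog timer, which may or may not be related to what the jumper controls on this board. Its state can be read with `ipmitool mc watchdog get`, and a running timer can be stopped with `ipmitool mc watchdog off`. Here is a minimal Python sketch for reading the "Key: value" output; the helper names are hypothetical, and the sample text is illustrative of the general shape of the output, not captured from this system.

```python
def parse_key_values(text):
    """Parse 'Key: value' lines into a dict, ignoring lines without a colon."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            result[key.strip()] = value.strip()
    return result

def watchdog_is_running(fields):
    """True if the 'Watchdog Timer Is' field reports a started timer."""
    return fields.get("Watchdog Timer Is", "").lower().startswith("started")

# Illustrative sample only (assumed format, not real output from this board).
SAMPLE = """\
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Initial Countdown:      300 sec
"""

fields = parse_key_values(SAMPLE)
print(fields["Watchdog Timer Actions"])
print(watchdog_is_running(fields))
```

If the timer shows as running with a reset action and nothing in the OS is feeding it, that would be one explanation for periodic resets; if it shows as stopped, the BMC watchdog is probably not the culprit.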

The thing is, I don’t like the idea that this has “fixed” the system (if it has), because it ran fine with the default jumper configuration for years. Surely something has broken hardware-wise, because I haven’t touched the OS or other software either!

Has anyone else here seen this with Supermicro (or Aspeed BMCs generally)?

The watchdog was a red herring - motherboard now completely dead.


Didn’t see your original post. Was gonna say, this is typically the last warning you get before CPU or motherboard failure.


Final follow-up. Moved everything (CPU, RAM and all expansion cards) over to an Intel S1200SP board and it’s been fine. So the Supermicro board had just died.

Annoyingly, a few weeks later, a Supermicro X10 board in a different server started showing ECC memory errors under heavy IO load during a SnapRAID sync (multi-bit errors, so instant restart) after I populated the second memory channel. It didn’t matter which sticks were in the slots, and forcing the lowest memory speed made no difference. Memtest86 ran dozens of passes without failure, but a few minutes into a SnapRAID sync job would trigger it every time. I swapped the CPU and took the time to carefully check all the socket pins (no issues), and the problem persisted. Eventually the board failed to detect memory in any slot, showing a “55” POST code. I saw that some people have found certain Supermicro boards semi-brick themselves with unexplained memory errors and need the BIOS re-flashed to fix them, but I couldn’t be bothered faffing around with buggy hardware, so I swapped it out for an ancient Intel S1200BT and an older Xeon with the same RAM and power supply, and it has run perfectly.
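On Linux, ECC error counts are exposed through the EDAC sysfs tree when an EDAC driver for the memory controller is loaded, which makes it possible to watch for corrected errors creeping up before things get worse. The following is a minimal Python sketch under that assumption; the function name is hypothetical, and the `root` parameter is only there so the code can be pointed at a test directory.

```python
from pathlib import Path

def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} for each mcN directory found."""
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc_dir / "ce_count").read_text().strip())
            ue = int((mc_dir / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters are missing or unreadable
        counts[mc_dir.name] = {"ce": ce, "ue": ue}
    return counts

# Prints nothing if no EDAC driver is loaded or the path does not exist.
for name, c in read_edac_counts().items():
    print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

One caveat: with multi-bit errors that reset the machine instantly, the uncorrected counter may never get the chance to increment before the restart, so a clean reading here does not rule out the kind of failure described above.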


I have an S1200BT and have been debating building my first TrueNAS setup with it. Your post is encouraging. Do you have any recommendations on whether I should go for it? My hesitation is how hard it would be to move all the drives to a different PC later. Any tips would be great. Thanks

I can’t speak for TrueNAS since my systems are just running Debian, but I’ve never had issues moving data drives between systems.

The only issue I had is that my Intel boards don’t have M.2 slots for NVMe drives, and my original Supermicro X11SAE system used an M.2 NVMe drive for the OS to save on drive bays. Prior to picking up the S1200SP, I temporarily went back to an S1200BT for my main server and had to transfer the contents of my NVMe drive over to a regular SATA SSD, because the S1200BT can’t boot from NVMe. It can see and use NVMe drives in a PCIe slot adapter, it just can’t boot from them, so it’s kind of useless for an OS drive (though with a fresh install you could boot from a USB stick in the internal USB port).

My biggest complaint with any of the Intel boards is their BMC/IPMI, which requires Java for the virtual console and a little RMM4 key daughterboard, whereas similarly aged Supermicro boards work natively in any modern web browser.

The S1200BT isn’t as efficient as a newer system in terms of performance-per-watt, but the idle power consumption isn’t terrible. It’s obviously nowhere near the level of something like an Intel N100, but the benefit of the S1200BT over that is that you get ECC memory support, and the memory is so cheap these days you can easily max it out to 32GB.


Thank you for your detailed reply. I apologize for never getting back to you.