How to debug random hardlocking?

Hi guys,

I am currently rocking a dual Xeon system on a Supermicro board, and the problem is that the system just hard locks, sporadically and randomly. It often happens while playing a game, but it also happens outside of games. Nothing happens for a while and then the system just restarts. One recurring factor: often after I start the computer, the (USB) keyboard doesn’t work. I’m not sure whether that’s a problem with the keyboard or the motherboard.

Is there any way to find out what’s going on? I am rocking Ubuntu LTS.

Cheers.

Could you please provide your system specs? Make, model, etc? Thanks!

Also, are there any messages in /var/log or dmesg from around the time of the error?
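If the log is large, a small filter can narrow it down. Below is a minimal Python sketch that keeps only lines matching a few patterns that often accompany hardware trouble. The pattern list and the sample lines are my own assumptions for illustration, not taken from the attached log, and the list is far from exhaustive.

```python
import re

# Patterns that often appear in kernel logs around hardware trouble.
# This list is an assumption for illustration; extend it as needed.
SUSPECT = re.compile(
    r"\bmce\b|machine check|hardware error|\bedac\b|\bnvrm\b|\bxid\b|i/o error",
    re.IGNORECASE,
)

def suspicious_lines(log_text):
    """Return the log lines that match any of the SUSPECT patterns."""
    return [line for line in log_text.splitlines() if SUSPECT.search(line)]

# Made-up sample lines, only to show the filter in action.
sample = "\n".join([
    "kernel: mce: [Hardware Error]: Machine check events logged",
    "systemd[1]: Started Session 2.",
    "kernel: usb 1-4: new low-speed USB device number 3 using xhci_hcd",
])

for line in suspicious_lines(sample):
    print(line)
```

To use it on real data, feed it the text of `journalctl -k -b -1` (kernel messages from the previous boot, which only works if the journal is persistent) or the output of `dmesg`.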

Have you run a memtest?

Also check the SMART report for your disk drive.
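For the SMART check, here is a rough Python sketch that picks out a few wear-related attributes from the attribute table printed by `smartctl -A` and reports the ones with a nonzero raw value. The watch list and the sample table are assumptions for illustration; attribute names vary by vendor, so adjust the list to whatever your drive actually reports.

```python
# Attributes worth a look when the raw value is not zero.
# The names are an assumption; vendors label these differently.
WATCH = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}

def nonzero_watch_attrs(smart_text):
    """Map watched attribute names to their raw value when it is not '0'."""
    found = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCH:
            raw = fields[9]
            if raw != "0":
                found[fields[1]] = raw
    return found

# Made-up sample in the layout of the smartctl attribute table.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       2
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       12345
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
"""

print(nonzero_watch_attrs(sample))
```

A small, stable reallocated count is usually not a cause of hard locks by itself; a count that keeps growing is the thing to watch.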

Specs:
SuperMicro X10DRL-i,
2x Intel Xeon E5-2630 v4,
32 GB 2133 MHz DDR4 memory,
Zotac GTX 1070 (NVIDIA drivers 435.21)
Crucial SSD (disk utility: 2 bad sectors, Disk is OK)

Anything specific that I should be looking for?

Here’s the journal log.
journalctl.log.txt (398.2 KB)

Possibly a power issue. What PSU are you using?

Try removing a RAM stick, or run memtest to see if it is a memory issue.

EVGA SuperNova 750W G2

Are you running the OS directly on the hardware? Also, are you using a USB add-in card, or is everything connected directly to the MB?

Yup, directly on the hardware. Everything is running directly off the motherboard. My computer locked up during the memtest, so I made sure that the RAM is properly seated in the sockets, and now I am running a second memtest. Testing 32 GB of memory is not fun.

If that is happening, I would test 1 stick in 1 slot with only CPU 1 installed, to keep each test run short. If it completes, move the stick to the next slot of CPU 1. After checking each slot for CPU 1, test the remaining sticks one at a time. If all your sticks test good, install one stick on CPU 1, reinstall CPU 2, and repeat the test for each of the CPU 2 slots. I am also wondering if this could be an overheating issue.
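To keep track of what has been tested, the procedure above can be written out as a checklist. This is only a sketch in Python: the stick count and slots per CPU are assumptions (check the board manual and your actual DIMM layout), and the third phase is my reading of the advice, with a second known-good stick placed in each CPU 2 slot.

```python
def memtest_plan(num_sticks, slots_per_cpu):
    """Build an ordered list of memtest runs, changing one thing at a time."""
    plan = []
    # Phase 1: only CPU 1 installed; stick 1 walks through every CPU 1 slot.
    for slot in range(1, slots_per_cpu + 1):
        plan.append(f"CPU1 only: stick 1 in CPU1 slot {slot}")
    # Phase 2: remaining sticks, one at a time, in a slot that passed phase 1.
    for stick in range(2, num_sticks + 1):
        plan.append(f"CPU1 only: stick {stick} in CPU1 slot 1")
    # Phase 3: CPU 2 reinstalled; one stick stays on CPU 1 while a second
    # known-good stick walks through the CPU 2 slots.
    for slot in range(1, slots_per_cpu + 1):
        plan.append(f"Both CPUs: stick 1 in CPU1 slot 1, stick 2 in CPU2 slot {slot}")
    return plan

# Assumed layout for illustration: 4 sticks, 4 slots per CPU.
for number, step in enumerate(memtest_plan(4, 4), start=1):
    print(f"{number:2d}. {step}")
```

Note down which run locks up. A failure that follows the stick points at the DIMM, while one that follows the slot points at the board or the CPU's memory controller.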

Unlikely during just a memtest. More likely a bad memory stick or faulty power delivery. The advice is sound: test each component separately and see where you land. Good luck.

How about a BIOS update?
And what do the IPMI event logs say?

It is rare for an SM board to just crash on bad RAM.
Usually, it just tells you that there is an issue.
Especially since you are running Reg ECC.

Watch out, only Memtest86 V8 and above can detect corrected errors.
So don’t use the blue V5 one, that thing is shit.

Thank you all for the advice. After 4 hours of running memtest, it seems to be stable. Reseating was good enough. If the instability comes back, I will proceed with the more rigorous testing.

When it comes to IPMI, I am not sure how to get to that data. (This is my first second-hand rig).

Thanks.

Two options. Either there is a dedicated LAN port on the board for IPMI, or the board will set up an address that you can access from your normal LAN. Check the manual on the manufacturer’s site.
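Once you can reach the IPMI interface, the event log can also be read from the OS with `ipmitool sel list` (this needs the IPMI kernel modules loaded and root). Here is a minimal Python sketch that filters that kind of pipe-separated output for memory or ECC entries. The sample lines are made up, and the exact column layout is an assumption that may differ between ipmitool versions and boards.

```python
def parse_sel(text):
    """Split pipe-separated event log lines into lists of stripped fields."""
    events = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 4:  # skip blank or malformed lines
            events.append(fields)
    return events

def memory_events(events):
    """Keep events where any field mentions memory or ECC."""
    return [
        event for event in events
        if any("memory" in f.lower() or "ecc" in f.lower() for f in event)
    ]

# Made-up sample entries, only to show the filtering.
sample = """\
 1 | 01/15/2020 | 20:14:03 | Memory #0x01 | Correctable ECC | Asserted
 2 | 01/15/2020 | 20:31:47 | Power Supply #0x02 | Failure detected | Asserted
"""

for event in memory_events(parse_sel(sample)):
    print(" | ".join(event))
```

Entries for power supply, temperature, or voltage sensors near the time of a lockup are worth the same attention as the memory ones.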