Detecting Faulty Hardware

SesameStreetThug · February 16, 2019, 1:24am

At work today I was having problems with VMs when they were on a specific compute host. General speed was very slow and on one VM I saw this error in the console:

NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s
NMI watchdog: BUG: soft lockup - CPU#8 stuck for 41s

Interestingly it was limited to just the VMs on the machine, and the compute host itself seemed to respond fine. All nodes are running the same hardware and kernel, and the problem hasn’t occurred on the others. Additionally the problem inside the VM disappears after migrating it to a new host.
In the meantime I’ve disabled the host and migrated all load off of it. I’m curious what tools (or any methods) are recommended for detecting faulty components. Anything that runs on Linux is great. Normally I would use IPDT to check, but these systems are AMD Opterons, so the Intel tool does not work.

Thoughts?

hem · February 16, 2019, 2:36pm

My initial idea would be to write a small script that runs some computation for a set amount of time, with its affinity set to each individual core in turn, and with some form of logging output while it runs.

There ought to be some tools out there that could do it, but I think the main thing would be to provoke an error to occur.

Would have to think more about it to have a more specific idea.
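The per-core idea above could be sketched roughly like this in Python. It is only a sketch under assumptions: the function names, the SHA-256 workload, and the iteration counts are made up for illustration, and os.sched_setaffinity is Linux-only. It pins the process to one core at a time, repeats a deterministic computation, and logs how many results differ from a reference value. A clean run does not prove the CPU is healthy, and a marginal core may lock up rather than miscompute.

```python
import hashlib
import os
import time


def burn(iterations):
    """Deterministic CPU-bound work: chain SHA-256 starting from a fixed seed."""
    digest = b"seed"
    for _ in range(iterations):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def check_cores(seconds_per_core=1.0, iterations=20000):
    """Pin to each allowed core in turn and compare results to a reference.

    Returns {cpu: (runs, mismatches)}. The reference is computed once,
    before pinning, on whichever core the scheduler picked.
    """
    original = os.sched_getaffinity(0)  # cores this process may run on
    reference = burn(iterations)
    report = {}
    try:
        for cpu in sorted(original):
            os.sched_setaffinity(0, {cpu})  # restrict this process to one core
            runs = 0
            mismatches = 0
            deadline = time.monotonic() + seconds_per_core
            while True:
                if burn(iterations) != reference:
                    mismatches += 1
                runs += 1
                if time.monotonic() >= deadline:
                    break
            report[cpu] = (runs, mismatches)
            # Flush so the last line is visible if the machine hangs.
            print(f"cpu {cpu}: {runs} runs, {mismatches} mismatches", flush=True)
    finally:
        os.sched_setaffinity(0, original)  # restore the original affinity
    return report
```

Calling check_cores(seconds_per_core=60) would spend about a minute on each core. Any non-zero mismatch count, or a hang on one particular core, would be worth a closer look.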

SesameStreetThug · February 16, 2019, 2:39pm

This is the big thing: Linux already has MCE logging, but I need a way to actually get the machine to log an error. (It’s unlikely that it will while just running at idle, with all of the VMs moved off it.)
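For checking whether a stress run actually produced anything, a small filter over the kernel log might help. This is a sketch only: the "soft lockup" pattern comes from the console output earlier in this thread, while the other patterns are my assumptions about how machine-check events are usually worded, and suspicious_lines is a made-up helper name.

```python
import re

# "soft lockup" is taken from the console output quoted in this thread.
# The remaining alternatives are assumptions about typical kernel wording.
PATTERN = re.compile(
    r"soft lockup|\bmce:|machine check|hardware error",
    re.IGNORECASE,
)


def suspicious_lines(lines):
    """Return the log lines that look like lockup or machine-check reports."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]
```

It takes any iterable of lines, so it can be fed an open file holding saved dmesg output, for example.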

hem · February 16, 2019, 3:42pm

It shouldn’t give any errors at all if it’s not being used.

Below is what I have from suse.com’s knowledge base.
After checking around for a bit while waiting for the missus in the car, I found an explanation. From what I’ve been able to find, a soft lockup is reported when the kernel has been looping for more than 20 seconds without allowing other tasks to run. It may go away if the load is decreased.

One way to keep this error from firing would be to increase the sysctl parameter kernel.watchdog_thresh. The default value should be 10 (the soft lockup threshold is twice that, hence the 20 seconds), so you could double it.
— end of what I got from suse.com
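To make the arithmetic concrete, here is a small Python sketch. It assumes the setting is exposed at /proc/sys/kernel/watchdog_thresh, which is where sysctl parameters normally appear, and the function names are mine. It only reads the value and does not change anything.

```python
from pathlib import Path

# Assumed location of the kernel.watchdog_thresh sysctl under /proc.
WATCHDOG_THRESH = Path("/proc/sys/kernel/watchdog_thresh")


def soft_lockup_seconds(watchdog_thresh):
    """The soft lockup warning threshold is twice watchdog_thresh."""
    return 2 * watchdog_thresh


def read_watchdog_thresh(path=WATCHDOG_THRESH):
    """Return the current setting as an int, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None
```

With the default of 10, soft_lockup_seconds(10) gives 20. Doubling the parameter to 20 would move the warning out to 40 seconds.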

If you want a log entry for this, I’d think it would have to be done with some form of kernel hook. Whether or how this would be possible within reason, I do not know; I am only guessing. Maybe someone knows how this could be achieved?

SesameStreetThug · February 16, 2019, 3:49pm

This is a production system, and none of the other nodes have the issue. This is what leads me to believe that it’s a hardware issue. I’d rather just replace the CPUs than put a band-aid on it in software. But I want to verify the issue with some repeatable test.

hem · February 16, 2019, 3:51pm

Is it possible for you to replicate the workload and the situation?

Was there a stack trace or a kernel dump from when it happened? They should shed some light on where the problem might be. If it is hardware, it can be in several places, and to make sure you catch an error like that, you’d have to monitor for many situations.