Help with VMware core dump analysis (CPU disabled by OS)

Hey, so this is a loooong shot, but I’m looking for any input I can get.

Last week, VMs running SLES 15 on VMware ESXi started crashing. These are very important VMs running SAP, and this is the 4th occurrence in 10 days. The VM just freezes and needs to be force reset. The last message in the VMware events is:

The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.

We were able to generate .vmem and .vmss files from the VM in frozen state and in theory this gives us a core dump to analyze.

Nobody has any idea so far what to do with this dump or how to interpret it. VMware support is shrugging and directing us to SUSE support, and SUSE… well, it’s SUSE…

I myself have hardly any experience analyzing a dump. From what I could gather so far, you gotta run gdb with a debug kernel matching the VM’s kernel and then feed in the dump.
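One cheap sanity check before feeding anything into gdb is to confirm that the dump really contains the kernel the debug symbols were built for. Below is a rough sketch, not a definitive method. It assumes GNU grep and that the kernel's "Linux version ..." banner sits in the memory image as plain text (it normally does). `kernel_banner` and `banner_matches` are hypothetical helper names made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the first "Linux version <release>" string
# found in a raw memory image or core file.
kernel_banner() {
    # -a treats the binary as text, -o prints only the matched part,
    # -m 1 stops at the first matching line; LC_ALL=C avoids multibyte
    # decoding problems on binary data.
    LC_ALL=C grep -a -o -m 1 'Linux version [^ ]*' "$1" | head -n 1
}

# Hypothetical helper: succeed only if the banner contains the expected release.
banner_matches() {
    kernel_banner "$1" | grep -qF -- "$2"
}
```

With the file names from this thread, usage would look like `banner_matches vmss.core 5.14.21-150500.55.68-default && echo "kernel matches"`. Scanning a multi-gigabyte image this way can take a while, and a match only shows that the release string agrees, not that the debug file is from the identical build.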

I’m getting a SIGSEGV at an address gdb can’t resolve to a symbol, but nothing more so far:

[xxx@yyyyyy zzz]$ gdb vmlinux-5.14.21-150500.55.68-default.debug vmss.core
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-20.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from vmlinux-5.14.21-150500.55.68-default.debug...done.

warning: core file may not match specified executable file.
[New LWP 12345]
Core was generated by `GuestVM'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0xffffffffb604c55d in ?? ()
(gdb)

Any ideas or tips on how I could analyze further or why there is no backtrace available in this dump?

Even without all their political and financial issues that will likely affect their support team, VMware is just not the correct team to analyze a crash inside a VM, beyond checking the ESXi logs to see if they provide any pointers.

Not sure what “well, it’s SUSE” is supposed to mean… but if you are paying for support, contact your support account person, and light a fire under their ass until they escalate the issue inside SUSE so it gets looked at by someone qualified.

Although even that might not help unless they have the symbols for the exact kernel your crashing VMs run.

I’d suggest investigating how to configure your VMs with a virtual serial port, connect the kernel console to that port, redirect that port’s output to a file, and check the file when the VM crashes.
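To make the serial console idea a bit more concrete, here is a minimal sketch that prints the settings involved. It is not a verified recipe. The `serial0.*` keys are, as far as I know, what the vSphere client writes when you add a serial port with "Output to file", so check them against the VMware documentation before relying on them. The helper names and the `console.log` file name are hypothetical, and 115200 is just a common baud rate.

```shell
#!/bin/sh
# Hypothetical helper: print .vmx lines for a file-backed virtual serial port.
# $1 = port name (e.g. serial0), $2 = output file on the datastore.
vmx_serial_lines() {
    printf '%s.present = "TRUE"\n' "$1"
    printf '%s.fileType = "file"\n' "$1"
    printf '%s.fileName = "%s"\n' "$1" "$2"
}

# Hypothetical helper: kernel parameters that send console output to the
# first serial port (ttyS0) in addition to the normal VGA console (tty0).
# $1 = baud rate.
kernel_console_args() {
    printf 'console=tty0 console=ttyS0,%s\n' "$1"
}

vmx_serial_lines serial0 console.log
kernel_console_args 115200
```

The first three lines of output belong in the VM's .vmx file (edit it while the VM is powered off, or add the port through the VM settings instead). The last line goes on the guest's kernel command line. On SLES that usually means adding it to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and regenerating the config with `grub2-mkconfig -o /boot/grub2/grub.cfg`. After the next freeze, any panic or oops text the kernel managed to print should be in the log file.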
