AMD TR 2950x with qemu hard locks the system

Hi!

I bought an AMD TR2 system to run KVM and virtualize a couple of machines, one of which is Windows Server 2016. About once a day the kernel hard locks, and not even the Magic SysRq keys work. The system is as follows:

CPU: AMD ThreadRipper 2950x
MB: Gigabyte X399 AORUS PRO (BIOS version: F2g, Update AGESA 1.1.0.2)
NAND: 2x Samsung 970 EVO Plus - 500G
OS: Arch Linux
Kernels: 5.2.3-arch1-1-ARCH, 4.19.61-1 (LTS)

Before this, I had two no-name NAND disks that crashed the kernel in the same way every time I copied from one to the other. I cannot reproduce that issue with the new Samsung drives, but the system still hard locks once a day. The only way to recover is a hard reset.

Reading through various forums I tried the following:

  1. Disable/enable IOMMU in BIOS (various kernel params: amd_iommu=on, amd_iommu=pt, amd_iommu=soft)
  2. Kernel param for NVMe APST (power state transition) issues: nvme_core.default_ps_max_latency_us=0
  3. Tried latest kernel and linux-lts
  4. Compiled kernel with IOMMU debugging options, pci debugging, etc. Enabled panic for all OOPS to be able to catch the defect
  5. Enabled kernel dump for OOPS
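
When cycling through several boot configurations like the ones above, it is easy to end up testing a kernel that did not actually pick up the intended parameter. Below is a minimal sketch, not a definitive tool, that parses a kernel command line and reports which expected parameters are missing or set to a different value. The function names and the EXPECTED table are hypothetical and only illustrate the idea; it also assumes no parameter value contains spaces.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}.

    Bare flags such as 'rw' map to None. Assumes values contain no spaces.
    """
    params = {}
    for token in cmdline.split():
        # Split on the first '=' only, so 'root=UUID=abc' keeps 'UUID=abc'.
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


# Hypothetical expectation table: the parameter from item 2 of the list.
EXPECTED = {"nvme_core.default_ps_max_latency_us": "0"}


def missing_params(cmdline: str, expected: dict = EXPECTED) -> list:
    """Return the expected parameters that are absent or set differently."""
    active = parse_cmdline(cmdline)
    return [name for name, value in expected.items()
            if active.get(name) != value]


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with (Linux only)."""
    with open(path) as f:
        return f.read()
```

On the affected machine, `missing_params(read_cmdline())` returning an empty list would indicate that the running kernel was booted with the parameter as intended.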

No matter the configuration, the hang is always the same: no Magic SysRq, no logs, no dump.
I am a developer but not a kernel developer, so I would appreciate any pointers on how to capture this hard lock and find out whether it is a kernel bug or a hardware issue.

Thank you!

Did you run memtest86? The last time file I/O caused kernel panics on my system, it was due to faulty memory modules. After I replaced the modules, everything was fine.

Also try re-seating your CPU. Sometimes not all pins make proper contact and that tends to cause “weird” issues.

Note: there are some issues with KVM virtualization on the early 5.2.x kernels - I had to downgrade to 5.1.17, as otherwise all my VMs became unstable: the Windows 10 VM would BSOD on boot and the Linux VM would start segfaulting. Some initial reports seem to indicate that 5.2.5 may have at least some of the issues patched, but I’m still testing that for myself.
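
To tell quickly whether a given host is running one of the early 5.2.x kernels mentioned above, a rough sketch like the following could be used. It treats 5.2.0 up to (but not including) 5.2.5 as the suspect range, which is only an assumption based on the unconfirmed reports above, and the function names are made up for illustration.

```python
import re


def kernel_version(release: str) -> tuple:
    """Extract (major, minor, patch) from a string like '5.2.3-arch1-1-ARCH'."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not m:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # A missing patch level (e.g. '5.2-rc1') is treated as 0.
    return (int(m.group(1)), int(m.group(2)), int(m.group(3) or 0))


def in_suspect_range(release: str,
                     first: tuple = (5, 2, 0),
                     fixed: tuple = (5, 2, 5)) -> bool:
    """True if the release falls in [first, fixed).

    The default bounds are an assumption, not a confirmed fix boundary.
    """
    return first <= kernel_version(release) < fixed
```

On a live system the release string can be obtained with `platform.release()` from the Python standard library, so `in_suspect_range(platform.release())` checks the running kernel.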