I bought an AMD TR2 system to run a couple of virtual machines under KVM, one of which is Windows Server 2016. Once a day the kernel hard locks, and not even the Magic SysRq keys work. The system is as follows:
CPU: AMD ThreadRipper 2950x
MB: Gigabyte X399 AORUS PRO (BIOS version: F2g, Update AGESA 126.96.36.199)
NAND: 2x Samsung 970 EVO Plus - 500G
OS: Arch Linux
Kernels: 5.2.3-arch1-1-ARCH, 4.19.61-1 (LTS)
Previously I had two other no-name NAND disks that crashed the kernel in the same way every time I copied from one to the other. With the new Samsung disks I can no longer reproduce that issue, but the system still hard locks once a day. The only way to recover is a hard reset.
Reading through various forums I tried the following:
- Disabled/enabled IOMMU in BIOS (with various kernel params: amd_iommu=on, amd_iommu=pt, amd_iommu=soft)
- Kernel param for NVMe APST issues: nvme_core.default_ps_max_latency_us=0
- Tried latest kernel and linux-lts
- Compiled the kernel with IOMMU debugging options, PCI debugging, etc. Enabled panic on every oops to try to catch the defect
- Enabled kernel dumps on oops
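To rule out a setting that silently did not apply, the list above can be cross-checked against the running kernel. The following is only a rough sketch, assuming a Linux system with /proc mounted; the helper names (parse_cmdline, read_proc, report) and the TRIED_PARAMS tuple are made up for illustration and are not part of any tool.

```python
from pathlib import Path

# Kernel parameters from the list above (illustrative selection).
TRIED_PARAMS = ("amd_iommu", "nvme_core.default_ps_max_latency_us")


def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' map to None. Quoted values that contain
    spaces are not handled by this simple splitter.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_proc(path):
    """Return the stripped contents of a /proc file, or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def report():
    """Print the parameters and sysctls the running kernel actually sees."""
    params = parse_cmdline(read_proc("/proc/cmdline") or "")
    for name in TRIED_PARAMS:
        state = params[name] if name in params else "(not set)"
        print(f"{name} = {state}")
    # Standard sysctl files for panic-on-oops and Magic SysRq.
    for sysctl in ("panic_on_oops", "sysrq"):
        print(f"kernel.{sysctl} = {read_proc('/proc/sys/kernel/' + sysctl)}")


if __name__ == "__main__":
    report()
```

Running it after each reboot shows whether the boot loader really passed the parameter and whether the oops and SysRq settings are active before the next lockup.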
No matter the configuration, the hang is always the same: no magic sysrq, no logs, no dump.
I am a developer, but not a kernel developer, so I would like to ask whether there is any way to catch this hard lock, so that I can understand what the bug or the hardware issue is.