System crashes completely when VM tries to initialize Intel 82599 NIC SR-IOV VF

I am having a serious problem.

I have a Threadripper desktop PC with SR-IOV, IOMMU, and virtualization enabled, running Arch Linux with kernel 5.0.4.

I run KVM/QEMU and libvirt for VMs.

I recently got an Intel X520-DA2 82599ES NIC in order to use SR-IOV for my VMs.

I can successfully pass through a GPU.

For testing, I have set 1 VF per PF and I am trying to pass it through to VMs. ixgbevf is blacklisted on the host and the vfio driver is set to handle the VFs.
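For anyone trying to reproduce this, here is a rough sketch of how the VF-to-driver binding described above can be double-checked on the host before starting a VM. It only reads the standard PCI sysfs layout (`virtfnN` and `driver` symlinks); the function name `vf_drivers` and the PCI address in the usage note are hypothetical placeholders, not values from my system.

```python
import os
from pathlib import Path


def vf_drivers(pf_bdf, sysfs_root="/sys/bus/pci/devices"):
    """Map each VF of the given PF to the name of its bound driver (or None).

    pf_bdf is the PF's PCI address, e.g. "0000:41:00.0" (hypothetical).
    sysfs_root is a parameter only so the function can be pointed at a
    test directory instead of the real sysfs.
    """
    pf_dir = Path(sysfs_root) / pf_bdf
    result = {}
    # Each VF shows up under its PF as a symlink named virtfn0, virtfn1, ...
    for link in sorted(pf_dir.glob("virtfn*")):
        vf_dir = link.resolve()  # real directory, named after the VF's address
        driver_link = vf_dir / "driver"
        if driver_link.is_symlink():
            # The driver symlink points at .../drivers/<driver-name>
            driver = os.path.basename(os.readlink(driver_link))
        else:
            driver = None  # no driver currently bound to this VF
        result[vf_dir.name] = driver
    return result
```

Calling something like `vf_drivers("0000:41:00.0")` with the real PF address should list every VF with the vfio driver next to it; an entry showing `ixgbevf` or `None` would mean the blacklist or the vfio binding did not take effect.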

No matter which VM I run (Ubuntu 18.04, Fedora 29, Windows 10), the whole host crashes and becomes unresponsive to any input, including SSH, apparently as soon as the VF driver starts loading in the guest.

Windows 10 did boot with the VF passed through, but that was before installing the VF driver. As soon as I installed the VF driver, it showed the same behavior as the Linux VMs.

Is this a problem with the kernel ixgbe driver? Does anyone have any idea?

This sounds like a @wendell grade of question.

Is it in a CPU PCIe slot or the chipset slot? (It must be in a CPU PCIe slot.)

I have the X399 Taichi.

It is in the second physical x16 slot, which is an x8 slot wired directly to the CPU.

There is one other person who is experiencing the same problem with kernels >= 4.20 on other platforms. We also have Reddit and Intel forum threads.

https://forums.intel.com/s/question/0D50P00004HoctbSAB/host-system-hangs-and-becomes-unresponsive-when-guest-vm-tries-to-initialize-sriov-driver-with-82599es-nic?language=en_US

Trying the same with a Mellanox ConnectX-3 card in the same slots works as expected.

EDIT:

I tried with Fedora 29 and kernel versions 4.18, 4.19, 4.20 and 5.0.7.

4.18 and 4.19 worked as expected. The VM had a virtual Ethernet interface and full access to the internet.

4.20 and 5.0.7 exhibited the problem. The host completely locks up (does not respond to ssh) when the VM starts booting.

@wendell Maybe some pointers on how to diagnose the lock-up/gather some more info? Running dmesg -wH through ssh from another machine does not show anything useful.
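One thing that might be worth trying, since dmesg over SSH dies together with the host: the kernel's netconsole module sends kernel log lines over UDP to another machine, so anything printed just before the lock-up has a chance to arrive there. Below is a minimal sketch of a receiver for that second machine. It assumes netconsole on the host is configured to target this machine on UDP port 6666 (the module's default target port); the function names are my own, and a hard lock-up may still leave nothing to capture.

```python
import socket
import time


def open_listener(port=6666, bind_addr="0.0.0.0"):
    """Open a UDP socket on the port netconsole is assumed to target."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def receive_kernel_log(sock, out, max_packets=None):
    """Write each received datagram to out, one timestamped line per packet.

    The output is flushed after every packet so that the last lines before
    a crash of the sender are not lost in a buffer. max_packets=None means
    run until interrupted.
    """
    count = 0
    while max_packets is None or count < max_packets:
        data, (addr, _port) = sock.recvfrom(65535)
        stamp = time.strftime("%H:%M:%S")
        text = data.decode("utf-8", "replace").rstrip()
        out.write(f"{stamp} {addr} {text}\n")
        out.flush()
        count += 1
    return count
```

On the receiving machine this could be run as `receive_kernel_log(open_listener(), open("host-kernel.log", "a"))` (the log file name is arbitrary) before starting the VM on the host.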