Threadripper / Vega Reset Bug

reutefleut · June 9, 2018, 10:37pm

Did you confirm that you got a “virgin” copy of the vBIOS with:

diff virgin.rom kernel_copied.rom?

As far as I know, the kernel exposes a “shadow copy” of the vBIOS that has lost its virginity! I could be mistaken, though, given recent developments in this part of the kernel.

Edit: how would the kernel see a vBIOS when started from pure UEFI or the UEFI operating in “former BIOS” compatibility mode?

Edit2: the kernel executes “acpi” code served by the BIOS to enumerate devices, essentially MB functionality. UEFI CSM or “pure” UEFI are different code-paths.

magicthighs · June 10, 2018, 1:20am

The method I described reads the ROM directly from the card, exactly like it works in Windows, actually. How do you think those download sites get their BIOS files?

I don’t have a clue what you’re on about regarding shadow copies or virginity, or how ACPI would alter the contents of the BIOS chip. That’s just not how it works.

diff is also pretty useless when it comes to binary files, like a video BIOS. You’d be better off using a checksum, like md5. And yes, writing a BIOS file this way, then reading it back, gets you a file with the exact same checksum.
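A minimal sketch of that checksum comparison in Python, assuming two dump files on disk. The file names reuse the ones from the earlier post, and the helper names `file_digest` and `same_contents` are made up for illustration:

```python
import hashlib

def file_digest(path, algorithm="md5", chunk_size=65536):
    """Return the hex digest of a file, read in chunks so large ROMs are fine."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def same_contents(path_a, path_b, algorithm="md5"):
    """True if both files hash to the same digest."""
    return file_digest(path_a, algorithm) == file_digest(path_b, algorithm)

if __name__ == "__main__":
    # Hypothetical file names, taken from the diff command quoted earlier.
    print(same_contents("virgin.rom", "kernel_copied.rom"))
```

Passing `algorithm="sha256"` works the same way if md5 is not available on your system.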

Elitist-Neckbeard · June 17, 2018, 6:03pm

CONFIG_IRQ_REMAP requires a CPU that supports x2apic mode, but it seems the Threadripper doesn’t support that, since lscpu | grep x2apic turns up nothing. My VM seems to work fine without that. The only issue is if I turn it off and leave it off.
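For anyone who wants to repeat that check without grep, here is a rough Python sketch that looks for a flag in the `flags` line of /proc/cpuinfo. The function names are hypothetical, and it only reports what the kernel lists, which says nothing about why a flag is missing:

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()

def has_flag(flag, cpuinfo_path="/proc/cpuinfo"):
    """True if the named flag is listed for the first CPU in cpuinfo."""
    with open(cpuinfo_path) as f:
        return flag in cpu_flags(f.read())

if __name__ == "__main__":
    print("x2apic listed:", has_flag("x2apic"))
```

Unlike a plain grep, this matches whole flag names only, so a flag that merely contains the search string does not count.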

Does anyone have some good reading material for x2apic? Is it deprecated, or is the expensive Threadripper missing a feature that other modern CPUs have?

When the GPU fan speeds up after extended disuse, I notice that dmesg says something about a header for the device being incorrect. I’ll try to capture that next time it happens.

Elitist-Neckbeard · July 12, 2018, 2:25am

Just before a crash, I managed to dump this from dmesg. It looks like the GPU cannot be “unmapped” when the VM is shut down.

[13578.281075] Modules linked in: nvidia_modeset(PO) ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nvidia(PO) nvidia_drm(PO) efivarfs
[13578.281079] CPU: 2 PID: 6790 Comm: qemu-system-x86 Tainted: P        W  O     4.16.9-gentoo #3
[13578.281079] Hardware name: Micro-Star International Co., Ltd. MS-7B09/X399 GAMING PRO CARBON AC (MS-7B09), BIOS 1.70 12/18/2017
[13578.281080] RIP: 0010:__domain_flush_pages+0x1bc/0x1d0
[13578.281081] RSP: 0018:ffffb01ec3c53c30 EFLAGS: 00010282
[13578.281082] RAX: 00000000fffffffb RBX: 00000000fffff000 RCX: ffffffff8924fe18
[13578.281082] RDX: 0000000000000000 RSI: 0000000000000282 RDI: ffffa27f4a7e9814
[13578.281083] RBP: ffffb01ec3c53c98 R08: 0000000000000001 R09: 000000000000712b
[13578.281083] R10: 0000000000000002 R11: 0000000000000000 R12: 00000000fffffffb
[13578.281083] R13: 000000007fffffff R14: ffffa27f05c14a10 R15: ffffa27f05c14a10
[13578.281084] FS:  00007f453377bf80(0000) GS:ffffa27f6f080000(0000) knlGS:0000000000000000
[13578.281085] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13578.281085] CR2: 0000188cf014c000 CR3: 000000080938a000 CR4: 00000000003406e0
[13578.281085] Call Trace:
[13578.281087]  amd_iommu_unmap+0x70/0xa0
[13578.281088]  __iommu_unmap+0xeb/0x180
[13578.281089]  vfio_unmap_unpin+0xfe/0x1d0
[13578.281090]  vfio_remove_dma+0x1d/0x50
[13578.281091]  vfio_iommu_type1_ioctl+0x6a3/0x920
[13578.281092]  ? kvm_vm_ioctl+0x4b9/0x880
[13578.281093]  ? __handle_mm_fault+0x962/0xa20
[13578.281094]  do_vfs_ioctl+0xab/0x600
[13578.281095]  ? handle_mm_fault+0xf4/0x210
[13578.281096]  SyS_ioctl+0x91/0xa0
[13578.281096]  do_syscall_64+0x55/0x100
[13578.281097]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[13578.281098] RIP: 0033:0x7f452ecf1887
[13578.281098] RSP: 002b:00007ffd47f91bb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[13578.281099] RAX: ffffffffffffffda RBX: 00000000c4000000 RCX: 00007f452ecf1887
[13578.281099] RDX: 00007ffd47f91bc0 RSI: 0000000000003b72 RDI: 0000000000000021
[13578.281100] RBP: 00007ffd47f91c90 R08: 0000000000000001 R09: 00000000c7ffffff
[13578.281100] R10: 0000000000000000 R11: 0000000000000246 R12: 000055615c14deb0
[13578.281100] R13: 0000000004000000 R14: 0000000000000000 R15: 000055615c14dea0
[13578.281101] Code: 30 48 c7 44 24 10 00 00 00 00 48 81 e2 00 f0 ff ff 89 44 24 14 c6 44 24 0f 00 89 54 24 18 48 c1 ea 20 89 54 24 1c e9 b8 fe ff ff <0f> 0b eb ab e8 fb b8 bd ff 90 66 2e 0f 1f 84 00 00 00 00 00 53 
[13578.281114] ---[ end trace 40aaac061620138d ]---
[13578.430295] AMD-Vi: Command buffer timeout
[13578.579405] AMD-Vi: Command buffer timeout
[13578.728480] AMD-Vi: Command buffer timeout
[13578.728491] WARNING: CPU: 3 PID: 6790 at drivers/iommu/amd_iommu.c:1255 __domain_flush_pages+0x1bc/0x1d0
[13578.728492] Modules linked in: nvidia_modeset(PO) ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nvidia(PO) nvidia_drm(PO) efivarfs
[13578.728496] CPU: 3 PID: 6790 Comm: qemu-system-x86 Tainted: P        W  O     4.16.9-gentoo #3
[13578.728497] Hardware name: Micro-Star International Co., Ltd. MS-7B09/X399 GAMING PRO CARBON AC (MS-7B09), BIOS 1.70 12/18/2017
[13578.728498] RIP: 0010:__domain_flush_pages+0x1bc/0x1d0
[13578.728499] RSP: 0018:ffffb01ec3c53c30 EFLAGS: 00010282
[13578.728499] RAX: 00000000fffffffb RBX: 00000000fffff000 RCX: ffffffff8924fe18
[13578.728500] RDX: 0000000000000000 RSI: 0000000000000282 RDI: ffffa27f4a7e9814
[13578.728500] RBP: ffffb01ec3c53c98 R08: 0000000000000001 R09: 0000000000007152
[13578.728501] R10: 0000000000000002 R11: 0000000000000000 R12: 00000000fffffffb
[13578.728501] R13: 000000007fffffff R14: ffffa27f05c14a10 R15: ffffa27f05c14a10
[13578.728502] FS:  00007f453377bf80(0000) GS:ffffa27f6f0c0000(0000) knlGS:0000000000000000
[13578.728502] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13578.728503] CR2: 00000e3cc98e7000 CR3: 000000080938a000 CR4: 00000000003406e0
[13578.728503] Call Trace:
[13578.728505]  amd_iommu_unmap+0x70/0xa0
[13578.728506]  __iommu_unmap+0xeb/0x180
[13578.728509]  vfio_unmap_unpin+0xfe/0x1d0
[13578.728510]  vfio_remove_dma+0x1d/0x50
[13578.728511]  vfio_iommu_type1_ioctl+0x6a3/0x920
[13578.728512]  ? kvm_vm_ioctl+0x4b9/0x880
[13578.728513]  ? __handle_mm_fault+0x962/0xa20
[13578.728515]  do_vfs_ioctl+0xab/0x600
[13578.728516]  ? handle_mm_fault+0xf4/0x210
[13578.728516]  SyS_ioctl+0x91/0xa0
[13578.728518]  do_syscall_64+0x55/0x100
[13578.728519]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[13578.728520] RIP: 0033:0x7f452ecf1887
[13578.728520] RSP: 002b:00007ffd47f91bb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[13578.728521] RAX: ffffffffffffffda RBX: 00000000c4000000 RCX: 00007f452ecf1887
[13578.728521] RDX: 00007ffd47f91bc0 RSI: 0000000000003b72 RDI: 0000000000000021
[13578.728522] RBP: 00007ffd47f91c90 R08: 0000000000000001 R09: 00000000c7ffffff
[13578.728522] R10: 0000000000000000 R11: 0000000000000246 R12: 000055615c14deb0
[13578.728522] R13: 0000000004000000 R14: 0000000000000000 R15: 000055615c14dea0
[13578.728523] Code: 30 48 c7 44 24 10 00 00 00 00 48 81 e2 00 f0 ff ff 89 44 24 14 c6 44 24 0f 00 89 54 24 18 48 c1 ea 20 89 54 24 1c e9 b8 fe ff ff <0f> 0b eb ab e8 fb b8 bd ff 90 66 2e 0f 1f 84 00 00 00 00 00 53 
[13578.728536] ---[ end trace 40aaac061620138e ]---
[13578.877746] AMD-Vi: Command buffer timeout
[13579.026934] AMD-Vi: Command buffer timeout
[13579.176122] AMD-Vi: Command buffer timeout
[13579.176131] WARNING: CPU: 3 PID: 6790 at drivers/iommu/amd_iommu.c:1255 __domain_flush_pages+0x1bc/0x1d0
[13579.176131] Modules linked in: nvidia_modeset(PO) ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nvidia(PO) nvidia_drm(PO) efivarfs
[13579.176134] CPU: 3 PID: 6790 Comm: qemu-system-x86 Tainted: P        W  O     4.16.9-gentoo #3
[13579.176135] Hardware name: Micro-Star International Co., Ltd. MS-7B09/X399 GAMING PRO CARBON AC (MS-7B09), BIOS 1.70 12/18/2017
[13579.176135] RIP: 0010:__domain_flush_pages+0x1bc/0x1d0
[13579.176136] RSP: 0018:ffffb01ec3c53c30 EFLAGS: 00010282
[13579.176136] RAX: 00000000fffffffb RBX: 00000000fffff000 RCX: ffffffff8924fe18
[13579.176137] RDX: 0000000000000000 RSI: 0000000000000282 RDI: ffffa27f4a7e9814
[13579.176137] RBP: ffffb01ec3c53c98 R08: 0000000000000001 R09: 0000000000007179
[13579.176137] R10: 0000000000000002 R11: 0000000000000000 R12: 00000000fffffffb
[13579.176138] R13: 000000007fffffff R14: ffffa27f05c14a10 R15: ffffa27f05c14a10
[13579.176138] FS:  00007f453377bf80(0000) GS:ffffa27f6f0c0000(0000) knlGS:0000000000000000
[13579.176139] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13579.176139] CR2: 00000e3cc98e7000 CR3: 000000080938a000 CR4: 00000000003406e0
[13579.176139] Call Trace:
[13579.176141]  amd_iommu_unmap+0x70/0xa0
[13579.176142]  __iommu_unmap+0xeb/0x180
[13579.176143]  vfio_unmap_unpin+0xfe/0x1d0
[13579.176144]  vfio_remove_dma+0x1d/0x50
[13579.176145]  vfio_iommu_type1_ioctl+0x6a3/0x920
[13579.176145]  ? kvm_vm_ioctl+0x4b9/0x880
[13579.176146]  ? __handle_mm_fault+0x962/0xa20
[13579.176147]  do_vfs_ioctl+0xab/0x600
[13579.176148]  ? handle_mm_fault+0xf4/0x210
[13579.176149]  SyS_ioctl+0x91/0xa0
[13579.176149]  do_syscall_64+0x55/0x100
[13579.176150]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[13579.176151] RIP: 0033:0x7f452ecf1887
[13579.176151] RSP: 002b:00007ffd47f91bb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[13579.176152] RAX: ffffffffffffffda RBX: 00000000c4000000 RCX: 00007f452ecf1887
[13579.176152] RDX: 00007ffd47f91bc0 RSI: 0000000000003b72 RDI: 0000000000000021
[13579.176152] RBP: 00007ffd47f91c90 R08: 0000000000000001 R09: 00000000c7ffffff
[13579.176153] R10: 0000000000000000 R11: 0000000000000246 R12: 000055615c14deb0
[13579.176153] R13: 0000000004000000 R14: 0000000000000000 R15: 000055615c14dea0
[13579.176154] Code: 30 48 c7 44 24 10 00 00 00 00 48 81 e2 00 f0 ff ff 89 44 24 14 c6 44 24 0f 00 89 54 24 18 48 c1 ea 20 89 54 24 1c e9 b8 fe ff ff <0f> 0b eb ab e8 fb b8 bd ff 90 66 2e 0f 1f 84 00 00 00 00 00 53 
[13579.176166] ---[ end trace 40aaac061620138f ]---
[13579.325360] AMD-Vi: Command buffer timeout
[13579.474521] AMD-Vi: Command buffer timeout
[13579.623713] AMD-Vi: Command buffer timeout
[13579.623719] WARNING: CPU: 3 PID: 6790 at drivers/iommu/amd_iommu.c:1255 __domain_flush_pages+0x1bc/0x1d0

I wasn’t able to get a dump of lspci, but the offending device did have this under it: !!! Unknown header type 7f. I’ve read that this means the card wasn’t reset correctly.
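A sketch of where the 7f comes from, for what it’s worth. A PCI device that doesn’t respond reads back as all ones (0xff) in config space, and lspci masks off the top (multi-function) bit of the header type byte at offset 0x0e, which leaves 0x7f. The helper names and the example device address below are hypothetical; the sysfs `config` file is the standard place to read config space:

```python
PCI_HEADER_TYPE_OFFSET = 0x0E  # header type byte in PCI config space

def header_type(config_bytes):
    """Header type with the multi-function bit (0x80) masked off."""
    return config_bytes[PCI_HEADER_TYPE_OFFSET] & 0x7F

def looks_unresponsive(config_bytes):
    """True if the first 64 bytes of config space read back as all 0xff."""
    return len(config_bytes) >= 64 and all(b == 0xFF for b in config_bytes[:64])

def read_config(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Read the standard 64-byte header for a device such as '0000:0a:00.0'."""
    with open(f"{sysfs_root}/{bdf}/config", "rb") as f:
        return f.read(64)

if __name__ == "__main__":
    cfg = read_config("0000:0a:00.0")  # hypothetical address, use your GPU's
    print(hex(header_type(cfg)), looks_unresponsive(cfg))
```

If `looks_unresponsive` returns True, the device is not answering config reads at all, which fits the card being stuck after a failed reset.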

FurryJackman · July 12, 2018, 12:13pm

Your first post has a minor error:

The 1900X is Threadripper and runs on X399…

You need to suppress PCI-E Errors on Threadripper as the upstream kernel is still in the early stages of solving Threadripper issues with PCI-E. You need to update your AGESA to solve SOME of the issues with that.

Threadripper & PCIe Bus Errors Linux

Having lots of fun with Threadripper the last few weeks. I haven’t seen much traffic on this, so I’m making a post here in hopes that people searching for it will find this thread on the Level1 forum. I intend to link to other places on the internet where people might be discussing and/or resolving this issue.

This seems really similar to a problem that Intel’s X99 had on launch that was fixed with a later software update. PCIe Bus errors occur in certain circumstances (“by default”) on popular distros such as Fedora and Ubuntu. I have tested as far as 4.13-git and the errors still occur.

As I mentioned on my VFIO live stream the other day, you can mitigate the issue by disabling message signaled interrupts and/or memory mapped IO. I’ve had some conflicting reports that disabling mmio also disables msi, but I’m not too sure about that because of what I’m seeing in load testing.

Here are the errors Threadripper users are seeing:

[ 641.339624] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[ 641.339641] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[ 641.339646] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[ 641.339649] pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000
[ 641.339652] pcieport 0000:00:01.1: [ 6] Bad TLP
[ 643.385764] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[ 643.385783] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[ 643.385787] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[ 643.385791] pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000
[ 643.385794] pcieport 0000:00:01.1: [ 6] Bad TLP
[ 648.193351] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[ 648.193372] pcieport 0000:00:01.1: AER: Multiple Corrected…

Elitist-Neckbeard · July 13, 2018, 9:20pm

I did have those errors, and following derpology’s advice solved them for now. The alternative to turning pcie_aspm off is disabling PCIe error messages, which seems to cover up the problem rather than solve it.
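If you want to see which of the two workarounds a running system booted with, a small sketch like the one below can parse /proc/cmdline. I’m assuming the usual spellings here, `pcie_aspm=off` for the first and `pci=noaer` for the second; the exact arguments derpology suggested aren’t quoted in this thread, and the function names are made up:

```python
def parse_cmdline(cmdline_text):
    """Map each kernel parameter name to the list of values it was given."""
    params = {}
    for token in cmdline_text.split():  # naive split, ignores quoted values
        key, sep, value = token.partition("=")
        params.setdefault(key, []).append(value if sep else None)
    return params

def pcie_workarounds(cmdline_text):
    """Report whether ASPM is turned off and whether AER reporting is disabled."""
    params = parse_cmdline(cmdline_text)
    pci_opts = set()
    for value in params.get("pci", []):
        if value:
            pci_opts.update(value.split(","))  # pci= takes a comma-separated list
    return {
        "aspm_off": "off" in params.get("pcie_aspm", []),
        "aer_disabled": "noaer" in pci_opts,
    }

if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        print(pcie_workarounds(f.read()))
```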

I had that issue Wendell mentioned all the time before I set that kernel argument. This amd_iommu_unmap problem I’m having only occurs when I shut down the virtual machine.

It seems like the Threadripper, X399, and Vega each have a variety of unrelated problems.

elixir77 · December 14, 2018, 4:19pm

Sorry, very dumb question here, but where in the XML is this added, and does it need to point directly to the GPU with the domain, bus, and slot? Or does this not matter and the VM will figure it out itself?