Threadripper / Vega Reset Bug

Regarding kernel compilation: as a Debian user, I can’t imagine that “apps”, which is all a distribution is, would depend on a particular kernel. You would still be well advised to compile your own kernel.

Wonders happen sometimes, don’t ask me why! I had two Gigabyte GA-AX370-Gaming K7 motherboards die, probably from VRM failure. Windows wouldn’t activate. With the exact same VM config on the newer-generation X470 AORUS GAMING 7 WIFI motherboard, Windows activated!

Yeah. It all just seems like magic at some point. I feel lucky I got it working this well so far.

That lucky feeling, wonders happening, security by obscurity: that is the real Windows experience! If you have the source code, the magic disappears :slight_smile:

Are you saying Windows is like a box of chocolates?

Dunno. Ever had real Belgian pralines?

The chocolate encapsulates the deeper experience. It still amazes me to see all the devices Linux runs on.

Using this BIOS didn’t seem to make any difference at all.

I know it’s working, because of the following qemu command that libvirt executes.

LC_ALL=C PATH=/bin:/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/usr/x86_64-pc-linux-gnu/gcc-bin/6.4.0:/usr/lib/llvm/5/bin:/opt/bin HOME=/root USER=root QEMU_AUDIO_DRV=spice /usr/bin/qemu-system-x86_64 -name guest=win8.1-gamer_1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-win8.1-gamer_1/master-key.aes -machine pc-i440fx-2.11,accel=kvm,usb=off,vmport=off,dump-guest-core=off -cpu EPYC,tsc-deadline=on,hypervisor=on,tsc_adjust=on,cmp_legacy=on,topoext=on,monitor=off,x2apic=off,hv_time,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff -drive file=/usr/share/edk2-ovmf/OVMF_CODE.fd,if=pflash,format=raw,unit=0,readonly=on -drive file=/var/lib/libvirt/qemu/nvram/win8.1-gamer_1_VARS.fd,if=pflash,format=raw,unit=1 -m 8000 -realtime mlock=off -smp 4,sockets=1,cores=4,threads=1 -uuid c6b8fef9-4df6-4d7f-afd8-c674c2a8d786 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-win8.1-gamer_1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot menu=off,strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x5.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x5 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x5.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x5.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive file=/var/lib/libvirt/images/win8.1-gamer_1.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=20,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:0d:17:d3,bus=pci.0,addr=0x2 -device usb-tablet,id=input2,bus=usb.0,port=1 -spice 
port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=2,bus=pci.0,addr=0xa -device intel-hda,id=sound0,bus=pci.0,addr=0xb -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device vfio-pci,host=0c:00.0,id=hostdev0,bus=pci.0,addr=0x4,rombar=1,romfile=/var/lib/libvirt/images/MSI.RXVega56.8176.171101.rom -device vfio-pci,host=0c:00.1,id=hostdev1,bus=pci.0,addr=0x7 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on

I’m only passing the BIOS to 0c:00.0. The HDMI audio is 0c:00.1, which is on the same device. Should I pass it to both?

The audio device doesn’t need the vBIOS file.

You should check that interrupt remapping, the virtual APIC (vapic), and the AMD IOMMUv2 driver are enabled in the kernel.

Do a “dmesg | fgrep -i -e amd” :

[ 0.255868] AMD-Vi: IOMMU performance counters supported
[ 0.258601] AMD-Vi: Found IOMMU at 0000:00:00.2 cap 0x40
[ 0.258603] AMD-Vi: Extended features (0xf77ef22294ada):
[ 0.258607] AMD-Vi: Interrupt remapping enabled
[ 0.258608] AMD-Vi: virtual APIC enabled
[ 0.258699] AMD-Vi: Lazy IO/TLB flushing enabled
[ 0.259597] amd_uncore: AMD NB counters detected
[ 0.259603] amd_uncore: AMD LLC counters detected
[ 0.259793] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[ 0.448873] AMD IOMMUv2 driver by Joerg Roedel [email protected]
[ 0.828228] QUIRK: Enable AMD PLL fix
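If you don’t want to eyeball the output, here is a minimal sketch that looks for the three lines that matter in the log above. The function name `check_iommu_log` is made up for this example, and the message strings are copied from the dmesg output in this post, so they may be worded differently on other kernel versions.

```shell
#!/bin/sh
# check_iommu_log: hypothetical helper, not part of any package.
# Reads kernel log text on stdin and reports whether each expected
# message (taken from the dmesg output quoted above) is present.
check_iommu_log() {
    log=$(cat)
    rc=0
    for pattern in \
        "AMD-Vi: Interrupt remapping enabled" \
        "AMD-Vi: virtual APIC enabled" \
        "AMD IOMMUv2 driver"
    do
        if printf '%s\n' "$log" | grep -qF -- "$pattern"; then
            echo "found:   $pattern"
        else
            echo "missing: $pattern"
            rc=1
        fi
    done
    return $rc
}

# Typical use (dmesg may need root on some systems):
#   dmesg | check_iommu_log
```

It returns non-zero if any of the three messages is missing, so it can also be used in a script.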

That’s not in my dmesg. I assume it’s from enabling CONFIG_IRQ_REMAP in the kernel. When I was first compiling my kernel for KVM, I found that my computer wouldn’t boot. Through a process of elimination, I found that it was CONFIG_IRQ_REMAP that was causing problems. I’m surprised QEMU works at all without it.

Could you post your kernel config? Here’s mine.
config.txt (113.7 KB)

For future reference, you can easily extract the video BIOS of your card in Linux like this:

Check the pci ID of the card:

[me@tiny ~]$ lspci|grep -i vga
45:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon RX Vega 64] (rev c1)

Then find that card’s video bios file like this:

[me@tiny ~]$ find /sys/devices -name rom
/sys/devices/pci0000:00/0000:00:01.1/0000:01:00.1/rom
/sys/devices/pci0000:40/0000:40:03.1/0000:43:00.0/0000:44:00.0/0000:45:00.0/rom

The one you need has the pci ID we just found (45:00.0) in the pathname, it’s /sys/devices/pci0000:40/0000:40:03.1/0000:43:00.0/0000:44:00.0/0000:45:00.0/rom in this case.

Now do this:

[me@tiny ~]$ echo 1|sudo tee /sys/devices/pci0000:40/0000:40:03.1/0000:43:00.0/0000:44:00.0/0000:45:00.0/rom
[me@tiny ~]$ sudo cat /sys/devices/pci0000:40/0000:40:03.1/0000:43:00.0/0000:44:00.0/0000:45:00.0/rom > ~/vbios

You can use dd instead of cat, doesn’t make a difference.
And you’re pretty much done. The first command makes it so you can actually read the rom file, the second one writes it to a file called vbios in your home directory.

After this, you can do

[me@tiny ~]$ echo 0|sudo tee /sys/devices/pci0000:40/0000:40:03.1/0000:43:00.0/0000:44:00.0/0000:45:00.0/rom

to restore the read and write protection on the rom file.
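The three steps above can be wrapped into one small function. This is only a sketch: `dump_vbios` is a name made up for this example, and it assumes you run it as root, since the redirections write to sysfs directly instead of going through `sudo tee`.

```shell
#!/bin/sh
# dump_vbios ROM_PATH OUT_FILE   (hypothetical helper)
# ROM_PATH is the sysfs "rom" file of the card, for example the path
# found with `find /sys/devices -name rom`. Run as root for a real card.
dump_vbios() {
    rom=$1
    out=$2
    [ -e "$rom" ] || { echo "no such rom file: $rom" >&2; return 1; }
    echo 1 > "$rom" || return 1   # step 1: make the ROM readable
    cat "$rom" > "$out"           # step 2: copy the ROM image
    rc=$?
    echo 0 > "$rom"               # step 3: lock it again, even if the copy failed
    return $rc
}

# Example call, using the card from this post:
#   dump_vbios /sys/bus/pci/devices/0000:45:00.0/rom /root/vbios
```

The example call uses /sys/bus/pci/devices/0000:45:00.0/rom, which is a symlink to the same device directory as the long path above, so you can skip the find step if you already know the PCI address.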

It doesn’t make a difference if I copy the ROM using Windows or Linux? I was under the impression that it gets modified by the OS somehow.

No, it doesn’t matter at all whether you do it on Windows or Linux. The ROM contains the BIOS that’s on your graphics card, and the OS shouldn’t alter it in any way.

I just thought I’d mention a way to do it on linux, that way you don’t have to have a windows installation running to extract the bios, or trying to find it on a random website, just so you can run windows in a VM on linux with passthrough.

Edit: changed wording to make it less confusing

If you insist I will share my kernel configuration, nothing secret in there :slight_smile: It’s my opinion though that it will be confusing as I use a Ryzen 1700 CPU with the latest motherboard from Gigabyte and a RX580 GPU.

Then there are the Vega 56 GPU and the Threadripper 1900X CPU, both rather expensive parts, that you use in your system! Part of your problem may simply be the MB/CPU/GPU/driver interaction. Market penetration is abysmal for these parts, especially the GPU, due to coin mining. Small user base -> less testing and fewer bug reports -> early adopter problems.

A closer look at your OP shows:

Unknown Interrupt pin routed! I assume your BIOS is up to date? With a “new” platform, I can’t emphasize enough that you should keep the UEFI BIOS up to date and compile the most recent kernel. The kernel developers are doing their best.

Specific kernel settings can f*ck up the data-structures in the UEFI NVRAM.

CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE:

Saying Y here will disable the use of efivars as a storage
backend for pstore by default. This setting can be overridden
using the efivars module's pstore_disable parameter.

The pstore (kernel persistent storage) can happily exhaust all the available NVRAM, to the point that the UEFI BIOS can’t update its data. With my previous Gigabyte MB (dual BIOS), the only remedy was a fresh install of the BIOS.

“Persistence of Misery” is a fundamental property of the Universe, a fundamental Law you neglect at your own peril :slight_smile:

Have you confirmed that you got a “virgin” copy of the vBIOS, with something like:

diff virgin.rom kernel_copied.rom?

As far as my knowledge goes, the kernel exposes a “shadow copy” of the vBIOS that has lost its virginity! I could be mistaken, though, given recent developments in this part of the kernel.

Edit: how would the kernel see a vBIOS when started from pure UEFI or the UEFI operating in “former BIOS” compatibility mode?

Edit2: the kernel executes “acpi” code served by the BIOS to enumerate devices, essentially MB functionality. UEFI CSM or “pure” UEFI are different code-paths.

The method I described reads the ROM directly from the card, exactly like how it works in windows, actually. How do you think those download sites get their BIOS files?

I don’t have a clue what you’re on about regarding shadow copies or virginity, or how ACPI would alter the contents of the BIOS chip. That’s just not how it works.

diff is also pretty useless when it comes to binary files, like a video BIOS. You’d be better off using a checksum, like md5. And yes, writing a BIOS file this way, then reading it back, gets you a file with the exact same checksum.
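For anyone who wants to check this themselves, here is a rough sketch of the comparison. The function names are made up for this example. I’m using `cmp` and `sha256sum` from coreutils here; `md5sum` works the same way if you prefer md5. (To be fair, `diff` will at least tell you that two binary files differ, it just can’t show you anything useful beyond that.)

```shell
#!/bin/sh
# same_rom FILE_A FILE_B   (hypothetical helper)
# Succeeds only if both files are byte-for-byte identical.
same_rom() {
    cmp -s "$1" "$2"
}

# print_checksums FILE...   (hypothetical helper)
# Prints one SHA-256 line per file, handy for posting in a thread.
print_checksums() {
    sha256sum "$@"
}

# Example, using the file names from the post above:
#   same_rom virgin.rom kernel_copied.rom && echo "identical" || echo "different"
#   print_checksums virgin.rom kernel_copied.rom
```

If the two checksums match, the file read through sysfs is the same image as the one you compared it against.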

CONFIG_IRQ_REMAP requires a CPU that supports x2apic mode, but it seems the Threadripper doesn’t support that, since lscpu | grep x2apic turns up nothing. My VM seems to work fine without that. The only issue is if I turn it off and leave it off.
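In case someone wants to run the same check, this is a small sketch that does an exact match on the CPU flags instead of a plain grep. `has_cpu_flag` is a made-up name, and it assumes the usual /proc/cpuinfo layout with a line starting with "flags". It only tells you what the kernel reports, not why a flag is missing.

```shell
#!/bin/sh
# has_cpu_flag FLAG [CPUINFO_FILE]   (hypothetical helper)
# Succeeds if FLAG appears as a whole word on the first "flags" line.
# CPUINFO_FILE defaults to /proc/cpuinfo.
has_cpu_flag() {
    flag=$1
    cpuinfo=${2:-/proc/cpuinfo}
    awk -v flag="$flag" '
        /^flags/ {
            for (i = 1; i <= NF; i++) if ($i == flag) found = 1
            exit                      # one flags line is enough
        }
        END { exit !found }
    ' "$cpuinfo"
}

# Example:
#   has_cpu_flag x2apic && echo "x2apic reported" || echo "x2apic not reported"
```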

Does anyone have some good reading material for x2apic? Is it deprecated, or is the expensive Threadripper missing a feature that other modern CPUs have?


When the GPU fan speeds up after extended disuse, I notice that dmesg says something about a header for the device being incorrect. I’ll try to capture that next time it happens.

Just before a crash, I managed to dump this from dmesg. It looks like the GPU can’t be “unmapped” when the VM is shut down.

[13578.281075] Modules linked in: nvidia_modeset(PO) ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nvidia(PO) nvidia_drm(PO) efivarfs
[13578.281079] CPU: 2 PID: 6790 Comm: qemu-system-x86 Tainted: P        W  O     4.16.9-gentoo #3
[13578.281079] Hardware name: Micro-Star International Co., Ltd. MS-7B09/X399 GAMING PRO CARBON AC (MS-7B09), BIOS 1.70 12/18/2017
[13578.281080] RIP: 0010:__domain_flush_pages+0x1bc/0x1d0
[13578.281081] RSP: 0018:ffffb01ec3c53c30 EFLAGS: 00010282
[13578.281082] RAX: 00000000fffffffb RBX: 00000000fffff000 RCX: ffffffff8924fe18
[13578.281082] RDX: 0000000000000000 RSI: 0000000000000282 RDI: ffffa27f4a7e9814
[13578.281083] RBP: ffffb01ec3c53c98 R08: 0000000000000001 R09: 000000000000712b
[13578.281083] R10: 0000000000000002 R11: 0000000000000000 R12: 00000000fffffffb
[13578.281083] R13: 000000007fffffff R14: ffffa27f05c14a10 R15: ffffa27f05c14a10
[13578.281084] FS:  00007f453377bf80(0000) GS:ffffa27f6f080000(0000) knlGS:0000000000000000
[13578.281085] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13578.281085] CR2: 0000188cf014c000 CR3: 000000080938a000 CR4: 00000000003406e0
[13578.281085] Call Trace:
[13578.281087]  amd_iommu_unmap+0x70/0xa0
[13578.281088]  __iommu_unmap+0xeb/0x180
[13578.281089]  vfio_unmap_unpin+0xfe/0x1d0
[13578.281090]  vfio_remove_dma+0x1d/0x50
[13578.281091]  vfio_iommu_type1_ioctl+0x6a3/0x920
[13578.281092]  ? kvm_vm_ioctl+0x4b9/0x880
[13578.281093]  ? __handle_mm_fault+0x962/0xa20
[13578.281094]  do_vfs_ioctl+0xab/0x600
[13578.281095]  ? handle_mm_fault+0xf4/0x210
[13578.281096]  SyS_ioctl+0x91/0xa0
[13578.281096]  do_syscall_64+0x55/0x100
[13578.281097]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[13578.281098] RIP: 0033:0x7f452ecf1887
[13578.281098] RSP: 002b:00007ffd47f91bb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[13578.281099] RAX: ffffffffffffffda RBX: 00000000c4000000 RCX: 00007f452ecf1887
[13578.281099] RDX: 00007ffd47f91bc0 RSI: 0000000000003b72 RDI: 0000000000000021
[13578.281100] RBP: 00007ffd47f91c90 R08: 0000000000000001 R09: 00000000c7ffffff
[13578.281100] R10: 0000000000000000 R11: 0000000000000246 R12: 000055615c14deb0
[13578.281100] R13: 0000000004000000 R14: 0000000000000000 R15: 000055615c14dea0
[13578.281101] Code: 30 48 c7 44 24 10 00 00 00 00 48 81 e2 00 f0 ff ff 89 44 24 14 c6 44 24 0f 00 89 54 24 18 48 c1 ea 20 89 54 24 1c e9 b8 fe ff ff <0f> 0b eb ab e8 fb b8 bd ff 90 66 2e 0f 1f 84 00 00 00 00 00 53 
[13578.281114] ---[ end trace 40aaac061620138d ]---
[13578.430295] AMD-Vi: Command buffer timeout
[13578.579405] AMD-Vi: Command buffer timeout
[13578.728480] AMD-Vi: Command buffer timeout
[13578.728491] WARNING: CPU: 3 PID: 6790 at drivers/iommu/amd_iommu.c:1255 __domain_flush_pages+0x1bc/0x1d0
[13578.728492] Modules linked in: nvidia_modeset(PO) ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nvidia(PO) nvidia_drm(PO) efivarfs
[13578.728496] CPU: 3 PID: 6790 Comm: qemu-system-x86 Tainted: P        W  O     4.16.9-gentoo #3
[13578.728497] Hardware name: Micro-Star International Co., Ltd. MS-7B09/X399 GAMING PRO CARBON AC (MS-7B09), BIOS 1.70 12/18/2017
[13578.728498] RIP: 0010:__domain_flush_pages+0x1bc/0x1d0
[13578.728499] RSP: 0018:ffffb01ec3c53c30 EFLAGS: 00010282
[13578.728499] RAX: 00000000fffffffb RBX: 00000000fffff000 RCX: ffffffff8924fe18
[13578.728500] RDX: 0000000000000000 RSI: 0000000000000282 RDI: ffffa27f4a7e9814
[13578.728500] RBP: ffffb01ec3c53c98 R08: 0000000000000001 R09: 0000000000007152
[13578.728501] R10: 0000000000000002 R11: 0000000000000000 R12: 00000000fffffffb
[13578.728501] R13: 000000007fffffff R14: ffffa27f05c14a10 R15: ffffa27f05c14a10
[13578.728502] FS:  00007f453377bf80(0000) GS:ffffa27f6f0c0000(0000) knlGS:0000000000000000
[13578.728502] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13578.728503] CR2: 00000e3cc98e7000 CR3: 000000080938a000 CR4: 00000000003406e0
[13578.728503] Call Trace:
[13578.728505]  amd_iommu_unmap+0x70/0xa0
[13578.728506]  __iommu_unmap+0xeb/0x180
[13578.728509]  vfio_unmap_unpin+0xfe/0x1d0
[13578.728510]  vfio_remove_dma+0x1d/0x50
[13578.728511]  vfio_iommu_type1_ioctl+0x6a3/0x920
[13578.728512]  ? kvm_vm_ioctl+0x4b9/0x880
[13578.728513]  ? __handle_mm_fault+0x962/0xa20
[13578.728515]  do_vfs_ioctl+0xab/0x600
[13578.728516]  ? handle_mm_fault+0xf4/0x210
[13578.728516]  SyS_ioctl+0x91/0xa0
[13578.728518]  do_syscall_64+0x55/0x100
[13578.728519]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[13578.728520] RIP: 0033:0x7f452ecf1887
[13578.728520] RSP: 002b:00007ffd47f91bb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[13578.728521] RAX: ffffffffffffffda RBX: 00000000c4000000 RCX: 00007f452ecf1887
[13578.728521] RDX: 00007ffd47f91bc0 RSI: 0000000000003b72 RDI: 0000000000000021
[13578.728522] RBP: 00007ffd47f91c90 R08: 0000000000000001 R09: 00000000c7ffffff
[13578.728522] R10: 0000000000000000 R11: 0000000000000246 R12: 000055615c14deb0
[13578.728522] R13: 0000000004000000 R14: 0000000000000000 R15: 000055615c14dea0
[13578.728523] Code: 30 48 c7 44 24 10 00 00 00 00 48 81 e2 00 f0 ff ff 89 44 24 14 c6 44 24 0f 00 89 54 24 18 48 c1 ea 20 89 54 24 1c e9 b8 fe ff ff <0f> 0b eb ab e8 fb b8 bd ff 90 66 2e 0f 1f 84 00 00 00 00 00 53 
[13578.728536] ---[ end trace 40aaac061620138e ]---
[13578.877746] AMD-Vi: Command buffer timeout
[13579.026934] AMD-Vi: Command buffer timeout
[13579.176122] AMD-Vi: Command buffer timeout
[13579.176131] WARNING: CPU: 3 PID: 6790 at drivers/iommu/amd_iommu.c:1255 __domain_flush_pages+0x1bc/0x1d0
[13579.176131] Modules linked in: nvidia_modeset(PO) ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nvidia(PO) nvidia_drm(PO) efivarfs
[13579.176134] CPU: 3 PID: 6790 Comm: qemu-system-x86 Tainted: P        W  O     4.16.9-gentoo #3
[13579.176135] Hardware name: Micro-Star International Co., Ltd. MS-7B09/X399 GAMING PRO CARBON AC (MS-7B09), BIOS 1.70 12/18/2017
[13579.176135] RIP: 0010:__domain_flush_pages+0x1bc/0x1d0
[13579.176136] RSP: 0018:ffffb01ec3c53c30 EFLAGS: 00010282
[13579.176136] RAX: 00000000fffffffb RBX: 00000000fffff000 RCX: ffffffff8924fe18
[13579.176137] RDX: 0000000000000000 RSI: 0000000000000282 RDI: ffffa27f4a7e9814
[13579.176137] RBP: ffffb01ec3c53c98 R08: 0000000000000001 R09: 0000000000007179
[13579.176137] R10: 0000000000000002 R11: 0000000000000000 R12: 00000000fffffffb
[13579.176138] R13: 000000007fffffff R14: ffffa27f05c14a10 R15: ffffa27f05c14a10
[13579.176138] FS:  00007f453377bf80(0000) GS:ffffa27f6f0c0000(0000) knlGS:0000000000000000
[13579.176139] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13579.176139] CR2: 00000e3cc98e7000 CR3: 000000080938a000 CR4: 00000000003406e0
[13579.176139] Call Trace:
[13579.176141]  amd_iommu_unmap+0x70/0xa0
[13579.176142]  __iommu_unmap+0xeb/0x180
[13579.176143]  vfio_unmap_unpin+0xfe/0x1d0
[13579.176144]  vfio_remove_dma+0x1d/0x50
[13579.176145]  vfio_iommu_type1_ioctl+0x6a3/0x920
[13579.176145]  ? kvm_vm_ioctl+0x4b9/0x880
[13579.176146]  ? __handle_mm_fault+0x962/0xa20
[13579.176147]  do_vfs_ioctl+0xab/0x600
[13579.176148]  ? handle_mm_fault+0xf4/0x210
[13579.176149]  SyS_ioctl+0x91/0xa0
[13579.176149]  do_syscall_64+0x55/0x100
[13579.176150]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[13579.176151] RIP: 0033:0x7f452ecf1887
[13579.176151] RSP: 002b:00007ffd47f91bb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[13579.176152] RAX: ffffffffffffffda RBX: 00000000c4000000 RCX: 00007f452ecf1887
[13579.176152] RDX: 00007ffd47f91bc0 RSI: 0000000000003b72 RDI: 0000000000000021
[13579.176152] RBP: 00007ffd47f91c90 R08: 0000000000000001 R09: 00000000c7ffffff
[13579.176153] R10: 0000000000000000 R11: 0000000000000246 R12: 000055615c14deb0
[13579.176153] R13: 0000000004000000 R14: 0000000000000000 R15: 000055615c14dea0
[13579.176154] Code: 30 48 c7 44 24 10 00 00 00 00 48 81 e2 00 f0 ff ff 89 44 24 14 c6 44 24 0f 00 89 54 24 18 48 c1 ea 20 89 54 24 1c e9 b8 fe ff ff <0f> 0b eb ab e8 fb b8 bd ff 90 66 2e 0f 1f 84 00 00 00 00 00 53 
[13579.176166] ---[ end trace 40aaac061620138f ]---
[13579.325360] AMD-Vi: Command buffer timeout
[13579.474521] AMD-Vi: Command buffer timeout
[13579.623713] AMD-Vi: Command buffer timeout
[13579.623719] WARNING: CPU: 3 PID: 6790 at drivers/iommu/amd_iommu.c:1255 __domain_flush_pages+0x1bc/0x1d0

I wasn’t able to get a dump of lspci, but the offending device did have this under it: !!! Unknown header type 7f. I’ve read that this means the card wasn’t reset correctly.
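For next time, something like this could pick the affected devices out of the lspci output automatically. It’s only a sketch: `find_unreset_devices` is a made-up name, and it assumes the usual `lspci -v` layout where each device starts on an unindented line and the "Unknown header type 7f" marker is indented under it.

```shell
#!/bin/sh
# find_unreset_devices   (hypothetical helper)
# Reads `lspci -v` style output on stdin and prints the PCI address of
# every device that has the "Unknown header type 7f" marker under it.
find_unreset_devices() {
    awk '
        /^[0-9a-f]/ { dev = $1 }                 # unindented line starts a device entry
        /Unknown header type 7f/ { print dev }   # marker belongs to the last device seen
    '
}

# Example:
#   lspci -v | find_unreset_devices
```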

Your first post has a minor error:

The 1900X is Threadripper and runs on X399…

You need to suppress PCI-E Errors on Threadripper as the upstream kernel is still in the early stages of solving Threadripper issues with PCI-E. You need to update your AGESA to solve SOME of the issues with that.

I did have those errors, and following derpology’s advice solved them for now. The alternative to turning pcie_aspm off is disabling PCIe error messages, which seems to cover up the problem rather than solve it.
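To confirm the argument actually made it onto the kernel command line after a reboot, a quick check along these lines should do. `has_kernel_arg` is a made-up name for this sketch, and `pcie_aspm=off` is the argument discussed above.

```shell
#!/bin/sh
# has_kernel_arg ARG [CMDLINE_FILE]   (hypothetical helper)
# Succeeds if ARG appears as a whole word on the kernel command line.
# CMDLINE_FILE defaults to /proc/cmdline.
has_kernel_arg() {
    arg=$1
    cmdline=${2:-/proc/cmdline}
    awk -v arg="$arg" '
        { for (i = 1; i <= NF; i++) if ($i == arg) found = 1 }
        END { exit !found }
    ' "$cmdline"
}

# Example:
#   has_kernel_arg pcie_aspm=off && echo "ASPM disabled at boot"
```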

I had that issue Wendell mentioned all the time before I set that kernel argument. This amd_iommu_unmap problem I’m having only occurs when I shut down the virtual machine.

It seems like the Threadripper, X399, and Vega each have a variety of unrelated problems.

Sorry, very dumb question here, but where in the XML is this added, and does it need to point directly to the GPU with the domain, bus, and slot? Or does this not matter and the VM will figure it out itself?