Navi Reset Kernel Patch

Now it did something even weirder… This happened when I tried to start a macOS VM after shutting down the Q35 Windows 10 VM normally. That always crashes the card, but this time it printed something different to the logs and I had to force a shutdown of the whole PC! Here we go, hope this helps:

[48504.498914] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x7fb573b20]
[48504.665666] AMD-Vi: Completion-Wait loop timed out
[48504.832874] AMD-Vi: Completion-Wait loop timed out
[48505.032100] AMD-Vi: Command buffer timeout
[48505.231066] AMD-Vi: Command buffer timeout
[48505.231096] ------------[ cut here ]------------
[48505.231115] WARNING: CPU: 10 PID: 26679 at drivers/iommu/amd_iommu.c:1272 __domain_flush_pages.cold.44+0xc/0x13
[48505.231116] Modules linked in: hid_generic(E) usbhid(E) hid(E) xt_nat(E) xt_tcpudp(E) veth(E) xt_conntrack(E) xt_MASQUERADE(E) nf_conntrack_netlink(E) xfrm_user(E) xfrm_algo(E) nft_counter(E) xt_addrtype(E) nft_compat(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) br_netfilter(E) bridge(E) stp(E) llc(E) vhost_net(E) tun(E) vhost(E) macvtap(E) macvlan(E) tap(E) softdog(E) nf_tables(E) nfnetlink(E) overlay(E) cpufreq_conservative(E) cpufreq_userspace(E) cpufreq_powersave(E) btusb(E) edac_mce_amd(E) btrtl(E) btbcm(E) kvm_amd(E) btintel(E) iwlmvm(E) kvm(E) mac80211(E) bluetooth(E) libarc4(E) crct10dif_pclmul(E) nls_ascii(E) drbg(E) crc32_pclmul(E) nls_cp437(E) iwlwifi(E) ansi_cprng(E) ghash_clmulni_intel(E) vfat(E) efi_pstore(E) fat(E) snd_pcm(E) ecdh_generic(E) aesni_intel(E) snd_timer(E) cfg80211(E) sp5100_tco(E) ecc(E) snd(E) aes_x86_64(E) crypto_simd(E) soundcore(E) cryptd(E) glue_helper(E) efivars(E) evdev(E) wmi_bmof(E) pcspkr(E) k10temp(E) watchdog(E)
[48505.231156]  rfkill(E) sg(E) ccp(E) rng_core(E) button(E) acpi_cpufreq(E) sunrpc(E) efivarfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) btrfs(E) zstd_decompress(E) zstd_compress(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) raid0(E) multipath(E) linear(E) vfio_pci(E) irqbypass(E) vfio_virqfd(E) vfio_iommu_type1(E) vfio(E) raid1(E) md_mod(E) sd_mod(E) crc32c_intel(E) ahci(E) xhci_pci(E) libahci(E) r8169(E) i2c_piix4(E) xhci_hcd(E) realtek(E) libata(E) libphy(E) usbcore(E) scsi_mod(E) wmi(E) gpio_amdpt(E) gpio_generic(E)
[48505.231183] CPU: 10 PID: 26679 Comm: qemu-system-x86 Tainted: G        W   E     5.3.7-navi #1
[48505.231184] Hardware name: Gigabyte Technology Co., Ltd. AB350N-Gaming WIFI/AB350N-Gaming WIFI-CF, BIOS F25 01/16/2019
[48505.231187] RIP: 0010:__domain_flush_pages.cold.44+0xc/0x13
[48505.231190] Code: ff b8 fb ff ff ff e9 7e ab ff ff 48 c7 c7 d8 39 26 b2 e8 f8 9b ba ff 0f 0b e9 8f b4 ff ff 48 c7 c7 d8 39 26 b2 e8 e5 9b ba ff <0f> 0b e9 ab c8 ff ff 48 89 1c 24 48 89 d3 48 c7 c7 30 c5 2e b2 e8
[48505.231191] RSP: 0018:ffffb5f840a1fc70 EFLAGS: 00010246
[48505.231193] RAX: 0000000000000024 RBX: ffff9c1c3e79fa10 RCX: 0000000000000000
[48505.231194] RDX: 0000000000000000 RSI: ffff9c1f7ec97688 RDI: ffff9c1f7ec97688
[48505.231195] RBP: ffffb5f840a1fcd8 R08: 0000000000002647 R09: 0000000000000004
[48505.231196] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9c1f7b412000
[48505.231197] R13: 00000000fffffffb R14: ffff9c1c3e79fa10 R15: 7fffffffffffffff
[48505.231199] FS:  00007f30ecf2d3c0(0000) GS:ffff9c1f7ec80000(0000) knlGS:0000000000000000
[48505.231201] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[48505.231202] CR2: 000056520fe7b3c0 CR3: 000000000a1da000 CR4: 00000000003406e0
[48505.231203] Call Trace:
[48505.231208]  amd_iommu_flush_iotlb_all+0x23/0x30
[48505.231212]  vfio_sync_unpin.isra.18+0x2f/0xd0 [vfio_iommu_type1]
[48505.231216]  vfio_unmap_unpin+0x315/0x350 [vfio_iommu_type1]
[48505.231220]  vfio_remove_dma+0x19/0x60 [vfio_iommu_type1]
[48505.231223]  vfio_iommu_type1_ioctl+0x624/0x7bb [vfio_iommu_type1]
[48505.231227]  do_vfs_ioctl+0xa4/0x630
[48505.231231]  ksys_ioctl+0x60/0x90
[48505.231233]  __x64_sys_ioctl+0x16/0x20
[48505.231236]  do_syscall_64+0x55/0x110
[48505.231240]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[48505.231242] RIP: 0033:0x7f30eef6c427
[48505.231244] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48
[48505.231245] RSP: 002b:00007fffa9f6b848 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[48505.231247] RAX: ffffffffffffffda RBX: 00007fffa9f6b930 RCX: 00007f30eef6c427
[48505.231248] RDX: 00007fffa9f6b850 RSI: 0000000000003b72 RDI: 0000000000000017
[48505.231249] RBP: 0000000000000000 R08: 0000000080000000 R09: 000000007fffffff
[48505.231250] R10: 0000000000000000 R11: 0000000000000246 R12: 00005581576d2ab0
[48505.231251] R13: 0000000080000000 R14: 0000000080000000 R15: 0000000000000000
[48505.231254] ---[ end trace 3532ad8f65694501 ]---
[48505.434060] AMD-Vi: Command buffer timeout
[48505.434112] ------------[ cut here ]------------
[48505.434136] WARNING: CPU: 0 PID: 0 at drivers/iommu/amd_iommu.c:1272 __domain_flush_pages.cold.44+0xc/0x13
[48505.434141] Modules linked in: hid_generic(E) usbhid(E) hid(E) xt_nat(E) xt_tcpudp(E) veth(E) xt_conntrack(E) xt_MASQUERADE(E) nf_conntrack_netlink(E) xfrm_user(E) xfrm_algo(E) nft_counter(E) xt_addrtype(E) nft_compat(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) br_netfilter(E) bridge(E) stp(E) llc(E) vhost_net(E) tun(E) vhost(E) macvtap(E) macvlan(E) tap(E) softdog(E) nf_tables(E) nfnetlink(E) overlay(E) cpufreq_conservative(E) cpufreq_userspace(E) cpufreq_powersave(E) btusb(E) edac_mce_amd(E) btrtl(E) btbcm(E) kvm_amd(E) btintel(E) iwlmvm(E) kvm(E) mac80211(E) bluetooth(E) libarc4(E) crct10dif_pclmul(E) nls_ascii(E) drbg(E) crc32_pclmul(E) nls_cp437(E) iwlwifi(E) ansi_cprng(E) ghash_clmulni_intel(E) vfat(E) efi_pstore(E) fat(E) snd_pcm(E) ecdh_generic(E) aesni_intel(E) snd_timer(E) cfg80211(E) sp5100_tco(E) ecc(E) snd(E) aes_x86_64(E) crypto_simd(E) soundcore(E) cryptd(E) glue_helper(E) efivars(E) evdev(E) wmi_bmof(E) pcspkr(E) k10temp(E) watchdog(E)
[48505.434193]  rfkill(E) sg(E) ccp(E) rng_core(E) button(E) acpi_cpufreq(E) sunrpc(E) efivarfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) btrfs(E) zstd_decompress(E) zstd_compress(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) raid0(E) multipath(E) linear(E) vfio_pci(E) irqbypass(E) vfio_virqfd(E) vfio_iommu_type1(E) vfio(E) raid1(E) md_mod(E) sd_mod(E) crc32c_intel(E) ahci(E) xhci_pci(E) libahci(E) r8169(E) i2c_piix4(E) xhci_hcd(E) realtek(E) libata(E) libphy(E) usbcore(E) scsi_mod(E) wmi(E) gpio_amdpt(E) gpio_generic(E)
[48505.434232] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W   E     5.3.7-navi #1
[48505.434233] Hardware name: Gigabyte Technology Co., Ltd. AB350N-Gaming WIFI/AB350N-Gaming WIFI-CF, BIOS F25 01/16/2019
[48505.434237] RIP: 0010:__domain_flush_pages.cold.44+0xc/0x13
[48505.434240] Code: ff b8 fb ff ff ff e9 7e ab ff ff 48 c7 c7 d8 39 26 b2 e8 f8 9b ba ff 0f 0b e9 8f b4 ff ff 48 c7 c7 d8 39 26 b2 e8 e5 9b ba ff <0f> 0b e9 ab c8 ff ff 48 89 1c 24 48 89 d3 48 c7 c7 30 c5 2e b2 e8
[48505.434242] RSP: 0018:ffffb5f840003df0 EFLAGS: 00010246
[48505.434245] RAX: 0000000000000024 RBX: ffff9c1f70525010 RCX: 0000000000000000
[48505.434246] RDX: 0000000000000000 RSI: ffff9c1f7ea17688 RDI: ffff9c1f7ea17688
[48505.434247] RBP: ffffb5f840003e58 R08: 000000000000266d R09: 0000000000000004
[48505.434248] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9c1f70525860
[48505.434250] R13: 00000000fffffffb R14: ffff9c1f70525010 R15: 7fffffffffffffff
[48505.434252] FS:  0000000000000000(0000) GS:ffff9c1f7ea00000(0000) knlGS:0000000000000000
[48505.434253] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[48505.434255] CR2: 00007f7a4fffdff8 CR3: 000000043fb6a000 CR4: 00000000003406f0
[48505.434256] Call Trace:
[48505.434259]  <IRQ>
[48505.434266]  ? cpumask_next_and+0x19/0x20
[48505.434270]  ? load_balance+0x144/0xb20
[48505.434273]  ? fq_ring_free+0xd0/0xd0
[48505.434276]  iova_domain_flush_tlb+0x23/0x30
[48505.434280]  iova_domain_flush+0x1a/0x30
[48505.434283]  fq_flush_timeout+0x2d/0x90
[48505.434287]  ? fq_ring_free+0xd0/0xd0
[48505.434293]  call_timer_fn+0x2d/0x130
[48505.434296]  run_timer_softirq+0x19e/0x410
[48505.434298]  ? __hrtimer_run_queues+0x130/0x280
[48505.434301]  ? ktime_get+0x3a/0xa0
[48505.434305]  __do_softirq+0xdf/0x2e5
[48505.434310]  irq_exit+0xa3/0xb0
[48505.434313]  smp_apic_timer_interrupt+0x74/0x130
[48505.434315]  apic_timer_interrupt+0xf/0x20
[48505.434317]  </IRQ>
[48505.434321] RIP: 0010:cpuidle_enter_state+0xbc/0x450
[48505.434324] Code: e8 79 6f ae ff 80 7c 24 13 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 67 03 00 00 31 ff e8 2b 92 b4 ff fb 66 0f 1f 44 00 00 <45> 85 e4 0f 89 d1 01 00 00 c7 45 10 00 00 00 00 48 83 c4 18 44 89
[48505.434325] RSP: 0018:ffffffffb2403e60 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[48505.434328] RAX: ffff9c1f7ea2a580 RBX: ffffffffb24bc680 RCX: 000000000000001f
[48505.434329] RDX: 00002c1d7c356ba1 RSI: 0000000028133e14 RDI: 0000000000000000
[48505.434330] RBP: ffff9c1f6fcc4000 R08: 0000000000000002 R09: 0000000000029e00
[48505.434331] R10: 00008cfcfa1870a0 R11: ffff9c1f7ea294e4 R12: 0000000000000002
[48505.434332] R13: ffffffffb24bc758 R14: 0000000000000002 R15: 00000000d6d2f8aa
[48505.434338]  cpuidle_enter+0x29/0x40
[48505.434342]  do_idle+0x228/0x270
[48505.434345]  cpu_startup_entry+0x19/0x20
[48505.434350]  start_kernel+0x558/0x576
[48505.434355]  secondary_startup_64+0xa4/0xb0
[48505.434359] ---[ end trace 3532ad8f65694502 ]---
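
For anyone comparing logs like the one above, here is a minimal sketch that pulls out which PCI device the IOMMU named in its IOTLB_INV_TIMEOUT events. It assumes the kernel log is piped in as plain text; the function name is made up for illustration.

```shell
# Print the unique PCI addresses that logged an AMD-Vi IOTLB_INV_TIMEOUT event.
# Reads kernel log text on stdin, e.g.:  dmesg | iotlb_timeout_devices
iotlb_timeout_devices() {
    # Keep only lines carrying the event and cut out the value of "device=".
    sed -n 's/.*IOTLB_INV_TIMEOUT device=\([0-9a-fA-F:.]*\).*/\1/p' | sort -u
}
```

The printed address can then be compared with the lspci output to check that it really is the passed-through GPU.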

EDIT: Great, now I can’t even boot Debian anymore. Hope I can fix it in a chroot :confused:

Since I switched to Fedora 31, I made a small, dirty Dockerfile to compile the Fedora kernel with the Navi patch applied:

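FROM fedora:31

# Kernel build to fetch from Koji
ENV KVER kernel-5.3.7-301.fc31

# Packaging and build tooling
RUN sudo dnf -y install fedpkg fedora-packager rpmdevtools ncurses-devel pesign
# Create ~/rpmbuild and download the kernel source RPM
RUN rpmdev-setuptree && cd ~ && koji download-build --arch=src $KVER.src.rpm
# Unpack the source RPM and place the Navi reset patch next to the other sources
RUN useradd -s /sbin/nologin mockbuild && rpm -Uvh ~/$KVER.src.rpm && cd ~/rpmbuild/SOURCES && curl -o navi.patch https://pastebin.com/raw/aKkLytWP
# Install the build dependencies listed in the spec file
RUN cd ~/rpmbuild/SPECS/ && sudo dnf -y builddep kernel.spec
# Set the build ID to .navi and register the patch in the spec file
RUN cd ~/rpmbuild/SPECS/ && sed -i -e 's/# define buildid .local/%define buildid .navi/g' kernel.spec && sed -i "s/# END OF PATCH DEFINITIONS/# END OF PATCH DEFINITIONS\nPatch9001: navi.patch/" kernel.spec
# Build the binary RPMs without the debug kernels
RUN cd ~/rpmbuild/SPECS/ && rpmbuild -bb --without debug --target=x86_64 kernel.spec

# Runs when a container is started from the image: copy the built RPMs to /output
ENTRYPOINT cp ~/rpmbuild/RPMS/x86_64/kernel* /output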

Works fine with Buildah too! Just use podman build --tag sth/fedora-navi-sth -v $PWD/out:/output and let it run in a screen or whatever :wink:
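
A minimal sketch of the whole round trip, in case it helps. It assumes the Dockerfile sits in the current directory and that a separate run step is wanted: an ENTRYPOINT only executes when a container is started, not during the build, so the RPMs are copied out by `podman run`. The `run` helper, the `DRY_RUN` switch, the function name and the default output directory are made up for illustration; the tag is the one from the image listing in this post.

```shell
# Hypothetical defaults: adjust the tag and the output directory to taste.
IMAGE_TAG="${IMAGE_TAG:-local/kernel-patch}"
OUT_DIR="${OUT_DIR:-$PWD/out}"

# Print each command; execute it only when DRY_RUN is not 1.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

build_navi_kernel() {
    run mkdir -p "$OUT_DIR" || return 1
    # Build the image from the Dockerfile in the current directory.
    run podman build --tag "$IMAGE_TAG" . || return 1
    # Starting a container runs the ENTRYPOINT, which copies the RPMs to /output.
    run podman run --rm -v "$OUT_DIR:/output" "$IMAGE_TAG" || return 1
    # The finished image is large; remove it once the RPMs are copied out.
    run podman rmi "$IMAGE_TAG" || return 1
}
```

Calling `DRY_RUN=1 build_navi_kernel` first only prints the commands. On a host with SELinux enforcing, the volume may need a `:Z` suffix before the copy is allowed.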

EDIT: And don’t forget to remove the old image :joy: :

REPOSITORY                     TAG      IMAGE ID       CREATED          SIZE
localhost/local/kernel-patch   latest   299b74aaaae3   21 minutes ago   23.5 GB
docker.io/library/fedora       31       f0858ad3febd   6 days ago       201 MB

Does this mean the patch works for you on Fedora 31? I’m also using Fedora 31, and on every kernel I tried, up to 5.4-rc6, I get the same error when starting a VM for the second time.
There is even more amiss here, as I can’t unbind the device from the amdgpu driver either.

No, sorry, I haven’t even tried it yet. I just compiled the kernel and didn’t have time for the rest…

Have you blacklisted the driver?

No, and that’s not the point. The card binds to vfio-pci during boot just fine, and I can then either start a VM once or bind the card to the amdgpu driver and use it with PRIME offloading to do the heavy lifting on the bare-metal Fedora 31 machine. But I can neither start the VM a second time nor unbind the card from the amdgpu driver once it is bound. I also can’t cleanly bind the card to the amdgpu driver after I shut down the VM: I can do it, but bad things start to happen. I haven’t had the time to look into the exact problem yet.
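
For reference, moving a device between vfio-pci and amdgpu at runtime can be sketched with the kernel's generic PCI sysfs files (`driver/unbind`, `driver_override`, `drivers_probe`). This is a sketch under assumptions, not the script that produced the log in this post: the function name and the `SYSFS_ROOT` override are made up for illustration, and on a real system it needs root.

```shell
# SYSFS_ROOT is a hypothetical override so the function can be tried against
# a scratch directory; it defaults to the real /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# rebind_pci <pci-address> <driver>
# Example: rebind_pci 0000:1f:00.0 amdgpu
rebind_pci() {
    dev="$SYSFS_ROOT/bus/pci/devices/$1"
    [ -d "$dev" ] || { echo "no such device: $1" >&2; return 1; }

    # Detach the current driver, if one is bound.
    if [ -e "$dev/driver/unbind" ]; then
        echo "$1" > "$dev/driver/unbind" || return 1
    fi

    # Restrict matching to the requested driver, then ask the PCI core to probe.
    echo "$2" > "$dev/driver_override" || return 1
    echo "$1" > "$SYSFS_ROOT/bus/pci/drivers_probe" || return 1
}
```

With the reset bug in play, the unbind step is exactly where things can hang, so this only shows the mechanism and is not a workaround.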

EDIT:
Here is the dmesg log from trying to bind the RX 5700 to amdgpu after using it in a VM. Funnily enough, the next attempt to bind the card does succeed. Too bad I can’t unbind it from amdgpu; I’d love to see whether I can start the VM at that point.

vfio-pci 0000:1f:00.0: Navi10: SOL 0x74cd892c
vfio-pci 0000:1f:00.0: Navi10: performing BACO reset
vfio-pci 0000:1f:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
amdgpu 0000:1f:00.0: remove_conflicting_pci_framebuffers: bar 0: 0xe0000000 -> 0xefffffff
amdgpu 0000:1f:00.0: remove_conflicting_pci_framebuffers: bar 2: 0xf0000000 -> 0xf01fffff
amdgpu 0000:1f:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xfcc00000 -> 0xfcc7ffff
amdgpu 0000:1f:00.0: enabling device (0400 -> 0403)
[drm] initializing kernel modesetting (NAVI10 0x1002:0x731F 0x148C:0x2399 0xC4).
[drm] register mmio base: 0xFCC00000
[drm] register mmio size: 524288
[drm:amdgpu_discovery_init [amdgpu]] *ERROR* invalid ip discovery binary signature
amdgpu 0000:1f:00.0: amdgpu_discovery_init failed
amdgpu 0000:1f:00.0: Fatal error during GPU init
[drm] amdgpu: finishing device.
------------[ cut here ]------------
sysfs group 'fw_version' not found for kobject '0000:1f:00.0'
WARNING: CPU: 5 PID: 11086 at fs/sysfs/group.c:278 sysfs_remove_group+0x74/0x80
Modules linked in: uinput macvtap macvlan vhost_net vhost tap rfcomm xt_CHECKSUM nf_nat_tftp nf_conntrack_tftp tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter cmac bnep nct6775 hwmon_vid sunrpc vfat fat snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio edac_mce_amd snd_hda_codec_hdmi snd_hda_intel btusb snd_intel_nhlt kvm_amd btrtl snd_usb_audio snd_hda_codec btbcm kvm btintel snd_usbmidi_lib snd_hda_core snd_rawmidi snd_hwdep bluetooth snd_seq crct10dif_pclmul snd_seq_device crc32_pclmul snd_pcm snd_timer ecdh_generic rfkill ecc mc wmi_bmof pcspkr ghash_clmulni_intel k10temp snd joydev sp5100_tco
 i2c_piix4 ccp soundcore gpio_amdpt gpio_generic acpi_cpufreq binfmt_misc ip_tables xfs libcrc32c amdgpu amd_iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper drm r8169 crc32c_intel nvme nvme_core wmi pinctrl_amd vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio fuse
CPU: 5 PID: 11086 Comm: rebind-amdgpu.s Not tainted 5.4.0-0.rc6.git0.1.navi_reset.fc31.x86_64 #1
Hardware name: Micro-Star International Co., Ltd. MS-7B79/X470 GAMING PLUS (MS-7B79), BIOS A.70 01/23/2019
RIP: 0010:sysfs_remove_group+0x74/0x80
Code: ff 5b 48 89 ef 5d 41 5c e9 99 bc ff ff 48 89 ef e8 b1 b9 ff ff eb cc 49 8b 14 24 48 8b 33 48 c7 c7 b8 5c 36 ad e8 6a 95 d3 ff <0f> 0b 5b 5d 41 5c c3 0f 1f 44 00 00 0f 1f 44 00 00 48 85 f6 74 31
RSP: 0018:ffffbfa20944fc90 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffffffffc0972d20 RCX: 0000000000000006
RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff98485ed57900
RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000000005da
R10: 0000000000020c8c R11: 0000000000000003 R12: ffff98485a23a0b0
R13: ffff98477c914d68 R14: ffff9842519ec230 R15: 0000000000000000
FS:  00007f962eb6f740(0000) GS:ffff98485ed40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2c6f9be000 CR3: 00000001a58e0000 CR4: 00000000003406e0
Call Trace:
 amdgpu_device_fini+0x441/0x475 [amdgpu]
 amdgpu_driver_unload_kms+0x4a/0x90 [amdgpu]
 amdgpu_driver_load_kms.cold+0x39/0x5b [amdgpu]
 drm_dev_register+0x111/0x150 [drm]
 amdgpu_pci_probe+0xee/0x150 [amdgpu]
 ? __pm_runtime_resume+0x58/0x80
 local_pci_probe+0x42/0x80
 pci_device_probe+0x107/0x1a0
 really_probe+0x147/0x3c0
 driver_probe_device+0xb6/0x100
 device_driver_attach+0x53/0x60
 bind_store+0xd1/0x110
 kernfs_fop_write+0x10e/0x190
 vfs_write+0xb6/0x1a0
 ksys_write+0x5f/0xe0
 do_syscall_64+0x5b/0x180
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f962ec64467
Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007ffcd718ba48 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007f962ec64467
RDX: 000000000000000d RSI: 0000563674256960 RDI: 0000000000000001
RBP: 0000563674256960 R08: 000000000000000a R09: 000000000000000c
R10: 0000563674256720 R11: 0000000000000246 R12: 000000000000000d
R13: 00007f962ed35500 R14: 000000000000000d R15: 00007f962ed35700
---[ end trace 4c2228038f46333e ]---
amdgpu: probe of 0000:1f:00.0 failed with error -22
snd_hda_intel 0000:1f:00.1: Handle vga_switcheroo audio client
input: HD-Audio Generic HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:1d:00.0/0000:1e:00.0/0000:1f:00.1/sound/card3/input25
input: HD-Audio Generic HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:1d:00.0/0000:1e:00.0/0000:1f:00.1/sound/card3/input26
input: HD-Audio Generic HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:1d:00.0/0000:1e:00.0/0000:1f:00.1/sound/card3/input27
input: HD-Audio Generic HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:1d:00.0/0000:1e:00.0/0000:1f:00.1/sound/card3/input28
input: HD-Audio Generic HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:03.1/0000:1d:00.0/0000:1e:00.0/0000:1f:00.1/sound/card3/input29
input: HD-Audio Generic HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:03.1/0000:1d:00.0/0000:1e:00.0/0000:1f:00.1/sound/card3/input30

Sorry, I didn’t get that… But I guess that’s the reset bug and the fix just doesn’t work correctly… Dunno about the unbinding problem, though.

It’s such a nice card, but the damn reset bug is driving me nuts :confused:

Just wanted to let you guys know that the Navi Reset Kernel Patch is now part of Manjaro’s 5.3 stable kernel. So if you use Manjaro, just install the 5.3 kernel and you’re ready to go with your Navi card. Big thanks to gnif and Manjaro’s philm.

Edit: Manjaro’s 5.4 LTS kernel also includes the reset patch.

I realize I’m late to the party but I just wanted to drop by and confirm that this works! Excellent, fantastic work! I’m wiring you a few shekels over paypal as a token of my gratitude.

Just get the Linux source, apply the patch (along with any other patches you may need), compile, install, boot. Easy, it just works.

Again, fantastic work and infinite thanks to you.

Hey, what card are you using, XT or non-XT?

I am running the XT model.

Upstreaming the patch has been delayed, as there have been reports from the Unraid community that the patch causes working systems to hang on reboot. Further investigation is required.

Navi introduces a feature to discover which addresses the different IP blocks live at inside the GPU. Vega also had this feature, but it wasn’t used (according to my sources).

This information is downloaded from the GPU. Since it’s clearly failing to download it, I’d say the BACO reset failed. I will do some digging and see if I can come up with a reason why.

Can you please elaborate on the specifics of the hardware in your system so I have more information to pass on to my AMD contact should the need arise? (CPU, RAM, are you overclocking?, etc.)

What is the minimum kernel version required for this patch to work? My Navi VM host is running a Debianized 4.19 kernel.

I use Debian and run kernel 4.19.82, which I fetched from kernel.org. Just copy the config from Debian’s /boot partition, apply the patch, compile the kernel, install the kernel and modules, and update GRUB. That’s all there is to it.
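
Spelled out as commands, those steps might look roughly like the sketch below. It is written under assumptions: the patch is taken to apply with `-p1` from the top of the source tree, the patch path has to be absolute because `patch -d` changes directory first, and the `run` helper, `DRY_RUN` switch and function name are made up for illustration.

```shell
# Print each command; execute it only when DRY_RUN is not 1.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# build_patched_kernel <kernel-source-dir> <absolute-path-to-patch>
build_patched_kernel() {
    src="$1"; patch_file="$2"
    # Start from the running Debian kernel's configuration.
    run cp "/boot/config-$(uname -r)" "$src/.config" || return 1
    # Apply the reset patch (assumed to be a -p1 style diff).
    run patch -d "$src" -p1 -i "$patch_file" || return 1
    # Accept defaults for options the copied config does not know about.
    run make -C "$src" olddefconfig || return 1
    run make -C "$src" -j"$(nproc)" || return 1
    # Install modules and kernel, then refresh the boot menu.
    run sudo make -C "$src" modules_install || return 1
    run sudo make -C "$src" install || return 1
    run sudo update-grub || return 1
}
```

Running `DRY_RUN=1 build_patched_kernel /usr/src/linux-4.19.82 /tmp/navi.patch` prints the plan without touching anything; both paths are examples only.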

Here is my detailed hardware information (lspci, uname -a, …). The BACO reset is failing for me too.
Xeon E5-2696 v3 (no overclock at all), 64 GB ECC RAM, ASRock X99 Taichi motherboard, 2 video cards: Sapphire RX 5700 XT (PCIe x16, blower cooler) and Nvidia GTX 960 2 GB (primary, PCIe x16), plus an M5015 SAS/SATA controller (PCIe x8)

Linux: Proxmox 6
Patched Kernel: https://github.com/rogeriomm/proxmox-pve-kernel-amd-reset/tree/amd-reset

RX 5700 XT PCIe passthrough VMs:
Windows 10 running OK
Ubuntu 18 running OK
macOS untested

uname -a 
Linux world1l8 5.0.0-32-generic #34~18.04.2-Ubuntu SMP Thu Oct 10 10:36:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
lspci
00:00.0 Host bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2 (rev 02)
00:01.0 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI Express Root Port 1 (rev 02)
00:01.1 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI Express Root Port 1 (rev 02)
00:02.0 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI Express Root Port 2 (rev 02)
00:02.2 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI Express Root Port 2 (rev 02)
00:03.0 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI Express Root Port 3 (rev 02)
00:05.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Address Map, VTd_Misc, System Management (rev 02)
00:05.1 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Hot Plug (rev 02)
00:05.2 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 RAS, Control Status and Global Errors (rev 02)
00:05.4 PIC: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 I/O APIC (rev 02)
00:11.0 Unassigned class [ff00]: Intel Corporation C610/X99 series chipset SPSR (rev 05)
00:14.0 USB controller: Intel Corporation C610/X99 series chipset USB xHCI Host Controller (rev 05)
00:16.0 Communication controller: Intel Corporation C610/X99 series chipset MEI Controller #1 (rev 05)
00:19.0 Ethernet controller: Intel Corporation Ethernet Connection (2) I218-V (rev 05)
00:1a.0 USB controller: Intel Corporation C610/X99 series chipset USB Enhanced Host Controller #2 (rev 05)
00:1b.0 Audio device: Intel Corporation C610/X99 series chipset HD Audio Controller (rev 05)
00:1c.0 PCI bridge: Intel Corporation C610/X99 series chipset PCI Express Root Port #1 (rev d5)
00:1c.2 PCI bridge: Intel Corporation C610/X99 series chipset PCI Express Root Port #3 (rev d5)
00:1c.3 PCI bridge: Intel Corporation C610/X99 series chipset PCI Express Root Port #4 (rev d5)
00:1c.4 PCI bridge: Intel Corporation C610/X99 series chipset PCI Express Root Port #5 (rev d5)
00:1d.0 USB controller: Intel Corporation C610/X99 series chipset USB Enhanced Host Controller #1 (rev 05)
00:1f.0 ISA bridge: Intel Corporation C610/X99 series chipset LPC Controller (rev 05)
00:1f.3 SMBus: Intel Corporation C610/X99 series chipset SMBus Controller (rev 05)
03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)
04:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 1478 (rev c1)
05:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 1479
06:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 (rev c1)
06:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
07:00.0 VGA compatible controller: NVIDIA Corporation GM206 [GeForce GTX 960] (rev a1)
07:00.1 Audio device: NVIDIA Corporation GM206 High Definition Audio Controller (rev a1)
09:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
0a:00.0 Network controller: Intel Corporation Wireless 3160 (rev 83)
0b:00.0 USB controller: ASMedia Technology Inc. ASM1142 USB 3.1 Host Controller
ff:0b.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 R3 QPI Link 0 & 1 Monitoring (rev 02)
ff:0b.1 Performance counters: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 R3 QPI Link 0 & 1 Monitoring (rev 02)
ff:0b.2 Performance counters: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 R3 QPI Link 0 & 1 Monitoring (rev 02)
ff:0c.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers (rev 02)
ff:0c.1 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers (rev 02)
ff:0c.2 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers (rev 02)
ff:0c.3 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers (rev 02)
ff:0c.4 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers (rev 02)
ff:0c.5 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers (rev 02)
ff:0c.6 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers (rev 02)
ff:0c.7 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers (rev 02)
ff:0d.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers (rev 02)
ff:0d.1 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers (rev 02)
ff:0d.2 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers (rev 02)
ff:0d.3 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers (rev 02)
ff:0d.4 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers (rev 02)
ff:0d.5 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers (rev 02)
ff:0d.6 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers (rev 02)
ff:0d.7 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers (rev 02)
ff:0e.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers (rev 02)
ff:0e.1 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Unicast Registers (rev 02)
ff:0f.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Buffered Ring Agent (rev 02)
ff:0f.1 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Buffered Ring Agent (rev 02)
ff:0f.2 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Buffered Ring Agent (rev 02)
ff:0f.3 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Buffered Ring Agent (rev 02)
ff:0f.4 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 System Address Decoder & Broadcast Registers (rev 02)
ff:0f.5 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 System Address Decoder & Broadcast Registers (rev 02)
ff:0f.6 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 System Address Decoder & Broadcast Registers (rev 02)
ff:10.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCIe Ring Interface (rev 02)
ff:10.1 Performance counters: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCIe Ring Interface (rev 02)
ff:10.5 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 02)
ff:10.6 Performance counters: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 02)
ff:10.7 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 02)
ff:12.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Home Agent 0 (rev 02)
ff:12.1 Performance counters: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Home Agent 0 (rev 02)
ff:12.4 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Home Agent 1 (rev 02)
ff:12.5 Performance counters: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Home Agent 1 (rev 02)
ff:13.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Target Address, Thermal & RAS Registers (rev 02)
ff:13.1 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Target Address, Thermal & RAS Registers (rev 02)
ff:13.2 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Channel Target Address Decoder (rev 02)
ff:13.3 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Channel Target Address Decoder (rev 02)
ff:13.6 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO Channel 0/1 Broadcast (rev 02)
ff:13.7 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO Global Broadcast (rev 02)
ff:14.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Channel 0 Thermal Control (rev 02)
ff:14.1 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Channel 1 Thermal Control (rev 02)
ff:14.2 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Channel 0 ERROR Registers (rev 02)
ff:14.3 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Channel 1 ERROR Registers (rev 02)
ff:14.4 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 0 & 1 (rev 02)
ff:14.5 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 0 & 1 (rev 02)
ff:14.6 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 0 & 1 (rev 02)
ff:14.7 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 0 & 1 (rev 02)
ff:16.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Target Address, Thermal & RAS Registers (rev 02)
ff:16.1 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Target Address, Thermal & RAS Registers (rev 02)
ff:16.2 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Channel Target Address Decoder (rev 02)
ff:16.3 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Channel Target Address Decoder (rev 02)
ff:16.6 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO Channel 2/3 Broadcast (rev 02)
ff:16.7 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO Global Broadcast (rev 02)
ff:17.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Channel 0 Thermal Control (rev 02)
ff:17.1 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Channel 1 Thermal Control (rev 02)
ff:17.2 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Channel 0 ERROR Registers (rev 02)
ff:17.3 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 1 Channel 1 ERROR Registers (rev 02)
ff:17.4 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 2 & 3 (rev 02)
ff:17.5 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 2 & 3 (rev 02)
ff:17.6 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 2 & 3 (rev 02)
ff:17.7 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DDRIO (VMSE) 2 & 3 (rev 02)
ff:1e.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Power Control Unit (rev 02)
ff:1e.1 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Power Control Unit (rev 02)
ff:1e.2 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Power Control Unit (rev 02)
ff:1e.3 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Power Control Unit (rev 02)
ff:1e.4 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Power Control Unit (rev 02)
ff:1f.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU (rev 02)
ff:1f.2 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU (rev 02)
Sapphire RX 5700XT 
rogermm@pve:~$ sudo lspci -vvv -s 06:00
06:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 (rev c1) (prog-if 00 [VGA controller])
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 11
	NUMA node: 0
	Region 0: Memory at e0000000 (64-bit, prefetchable) [disabled] [size=256M]
	Region 2: Memory at f0000000 (64-bit, prefetchable) [disabled] [size=2M]
	Region 4: I/O ports at d000 [disabled] [size=256]
	Region 5: Memory at fb100000 (32-bit, non-prefetchable) [disabled] [size=512K]
	Expansion ROM at fb180000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
		Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed unknown, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed unknown, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [200 v1] #15
	Capabilities: [240 v1] Power Budgeting <?>
	Capabilities: [270 v1] #19
	Capabilities: [2a0 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Capabilities: [2b0 v1] Address Translation Service (ATS)
		ATSCap:	Invalidate Queue Depth: 00
		ATSCtl:	Enable-, Smallest Translation Unit: 00
	Capabilities: [2c0 v1] Page Request Interface (PRI)
		PRICtl: Enable- Reset-
		PRISta: RF- UPRGI- Stopped+
		Page Request Capacity: 00000100, Page Request Allocation: 00000000
	Capabilities: [2d0 v1] Process Address Space ID (PASID)
		PASIDCap: Exec+ Priv+, Max PASID Width: 10
		PASIDCtl: Enable- Exec- Priv-
	Capabilities: [320 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [400 v1] #25
	Capabilities: [410 v1] #26
	Capabilities: [440 v1] #27
	Kernel driver in use: vfio-pci

06:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin B routed to IRQ 10
	NUMA node: 0
	Region 0: Memory at fb1a0000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
		Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed unknown, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed unknown, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [2a0 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Kernel driver in use: vfio-pci
	Kernel modules: snd_hda_intel
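
For anyone comparing dumps like the one above, the fields that matter most for reset problems are the power state (`Status: D3` here), the `FLReset` flag (`FLReset-` means the function does not advertise Function Level Reset), and the bound driver. Below is a minimal Python sketch that pulls those three fields out of the `lspci -vvv` text for a single function. The helper name `summarize_function` is made up for illustration, and it assumes the text format matches the dump above.

```python
import re

def summarize_function(lspci_text):
    """Extract power state, FLR flag and bound driver from the
    `lspci -vvv` text of one PCI function (hypothetical helper)."""
    # Power Management status line, e.g. "Status: D3 NoSoftRst+ ..."
    state = re.search(r"Status: (D[0-3])\b", lspci_text)
    # DevCap flag, e.g. "FLReset-" (not supported) or "FLReset+"
    flr = re.search(r"FLReset([+-])", lspci_text)
    # Bound driver, e.g. "Kernel driver in use: vfio-pci"
    driver = re.search(r"Kernel driver in use: (\S+)", lspci_text)
    return {
        "power_state": state.group(1) if state else None,
        "flr_supported": (flr.group(1) == "+") if flr else None,
        "driver": driver.group(1) if driver else None,
    }
```

Feeding it the text for 06:00.0 above should report D3, no FLR, and vfio-pci as the driver.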

Navi 10 and Vega 20 use the same ‘smu_v11_0_baco_enter/exit’ functions in AMD’s latest code (~agd5f’s baco branch), so it seems likely that a Vega 20 reset could be implemented with a patch very similar to this one.

Has anyone tried this yet?

edit: A suspicion about instability: the SMU firmware was updated by AMD before adding the BACO reset code.
It might be necessary to have the newer SMU firmware loaded to use this without problems?
As long as all guests have this it should be fine, but if a reset was necessary and a guest had loaded the older firmware, could that cause problems?
vfio-pci has no knowledge of what version of vega20_smc.bin has been loaded.

Last I looked, the Vega 20 required an almost identical reset sequence to the Vega 10 series. I will be working on this today, so while I am there I will check whether things have changed. However, without a Vega 20 I cannot verify anything I create for this platform.

I can confirm that the instability still exists even with the latest available firmware. I believe there is an issue with some chipsets when the Navi resets: my test Ryzen 2600 system works 100% of the time, but my Threadripper platform is unstable with the same GPU.

Thanks for all the work you’ve done on this. It’s such a key problem to work on.
Did anyone donate a Radeon VII, @gnif? I’m keen on seeing that solution and I’m willing to donate some cash to see it fixed.

I am unfortunately away again for a few days so I can’t work on this at the moment, but when I last left it I had managed to reliably replicate the reset failure with the patch. At this point it looks more like an issue with the kernel’s IOMMU implementation: I can reset the GPU hundreds of times without issue, but as soon as vfio is involved we get IOTLB wait timeout errors at random when we perform the reset.
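
For anyone who wants to try a similar repeated-reset test, here is a rough Python sketch. It uses the standard PCI sysfs `reset` attribute, which may not be how the testing described above was actually done. The function name `reset_repeatedly`, the delay, and the `sysfs_root` parameter are assumptions for illustration. Run it as root, and only against a device that no VM or driver is actively using, since a failed reset can hang the host.

```python
import os
import time

def reset_repeatedly(bdf, count, delay_s=1.0,
                     sysfs_root="/sys/bus/pci/devices"):
    """Trigger `count` resets of PCI function `bdf` (e.g. "0000:06:00.0")
    by writing "1" to its sysfs reset attribute. Returns resets issued."""
    reset_path = os.path.join(sysfs_root, bdf, "reset")
    # The attribute only exists if the kernel has a reset method for the device.
    if not os.path.exists(reset_path):
        raise FileNotFoundError("no reset attribute for " + bdf)
    issued = 0
    for _ in range(count):
        with open(reset_path, "w") as f:
            f.write("1")  # a failed reset raises OSError here
        issued += 1
        time.sleep(delay_s)  # give the device time to settle
    return issued
```

Something like `reset_repeatedly("0000:06:00.0", 100)` would then issue a hundred resets in a row. Watching `dmesg` in another terminal shows whether any of them produce errors.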

I believe the new reset code is performing the reset before some pending IOMMU message is properly processed and we need some way to sync before attempting to reset.
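
To correlate these timeouts with reset attempts, it can help to pull the events out of the kernel log automatically. The Python sketch below scans log text for `IOTLB_INV_TIMEOUT` events and returns the timestamp and device of each one. The line format is taken from the log posted earlier in this thread, and other kernel versions may print it differently. The function name `find_iotlb_timeouts` is made up for illustration.

```python
import re

# Matches lines such as:
# [48504.498914] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x7fb573b20]
EVENT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s.*AMD-Vi: Event logged \[IOTLB_INV_TIMEOUT "
    r"device=(?P<dev>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
)

def find_iotlb_timeouts(log_text):
    """Return a list of (timestamp, device) pairs, one per
    IOTLB_INV_TIMEOUT event found in the given kernel log text."""
    hits = []
    for line in log_text.splitlines():
        m = EVENT_RE.search(line)
        if m:
            hits.append((float(m.group("ts")), m.group("dev")))
    return hits
```

The input can be the saved output of `dmesg`. Comparing the timestamps with the times at which resets were issued shows whether the timeouts follow the resets directly.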

Hi @JCB and welcome to the forum :).

No, I do not yet have a Radeon VII to work with, unfortunately. I could set up a GoFundMe for this, but in the past trying to get funding for pro-grade cards has failed, as few in the community have or even care about these GPUs. If someone wants to pool funds and sponsor one, though, I would be happy to use it to work on the issue.

Thanks for taking a look. Sorry for the slightly late response.

The system in question:

  • CPU Ryzen 2700
  • RAM 2x16G DDR4-3000 DIMM CL16-18-18-38
  • MSI X470 GAMING PLUS mainboard

Memory is running at 2933 MHz. I will try lowering the RAM speed to see if this makes any difference.