Radeon VII not initialising (PSP fails)

Hi,
I’m trying to get a small Radeon VII compute farm running, upgrading from a single-card AM4 desktop (which worked flawlessly) to a many-card mining motherboard with a Kaby Lake processor (which has been “fun” to get even partially working, to say the least). I’ve gone round the houses a few times, progressing the problem from no cards being detected, to firmware issues with Vega20, to potential upstream/downstream/pro driver issues, and now to the point where the PSP fails to load. I thought it might have had something to do with cryptsetup not being set up properly, as I had been upgrading the kernel with the debs from https://kernel.ubuntu.com/~kernel-ppa/mainline/, so avoiding those was the last-ditch effort before this post. I have tried kernels 4.15, 4.18, 4.20, 5.0 and 5.1.6 at various points during testing on either 18.04 or 19.04, but some of those were while diagnosing the prior issues.

The current state of play is Ubuntu 18.04.2 server with the default updated 4.15 kernel. CSM is enabled. I’m on the latest ROCm 2.4, not using the upstream drivers. A Vega 56 works fine in the same slot, which rules out the slot, and I’ve tried two Radeon VIIs, which rules out a hardware problem with the card. Here’s the output of some things.

francis@francis:~$ /opt/rocm/bin/rocminfo
hsa api call failure at line 900, file: /data/jenkins_workspace/compute-rocm-rel-2.4/rocminfo/rocminfo.cc. Call returned 4104

francis@francis:~$ /opt/rocm/opencl/bin/x86_64/clinfo
ERROR: clGetPlatformIDs(-1001)

francis@francis:~$ /opt/rocm/bin/rocm-smi
========================ROCm System Management Interface========================

GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap SCLK OD MCLK OD GPU%

==============================End of ROCm SMI Log ==============================

francis@francis:~$ lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 06)

01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 07)
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 14a0 (rev c1)
03:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 14a1
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev c1)
04:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device ab20

This seems the most interesting of dmesg:

[ 2.966903] [drm:psp_hw_start [amdgpu]] ERROR PSP load sysdrv failed!
[ 2.966947] [drm:psp_hw_init [amdgpu]] ERROR PSP firmware loading failed
[ 2.966982] [drm:amdgpu_device_fw_loading [amdgpu]] ERROR hw_init of IP block failed -22
[ 2.966986] amdgpu 0000:04:00.0: amdgpu_device_ip_init failed
[ 2.966989] amdgpu 0000:04:00.0: Fatal error during GPU init
[ 2.967027] [drm] amdgpu: finishing device.

But here’s more in case I’m wrong:

francis@francis:~$ dmesg | grep "amdgpu"
[ 1.641479] [drm] amdgpu kernel modesetting enabled.
[ 1.641483] [drm] amdgpu version: 5.0.19.20.14
[ 1.643246] amdgpu 0000:04:00.0: enabling device (0000 -> 0003)
[ 2.183747] amdgpu 0000:04:00.0: BAR 2: releasing [mem 0x2ff0000000-0x2ff01fffff 64bit pref]
[ 2.183750] amdgpu 0000:04:00.0: BAR 0: releasing [mem 0x2fe0000000-0x2fefffffff 64bit pref]
[ 2.183843] amdgpu 0000:04:00.0: BAR 0: assigned [mem 0x2000000000-0x23ffffffff 64bit pref]
[ 2.183856] amdgpu 0000:04:00.0: BAR 2: assigned [mem 0x2400000000-0x24001fffff 64bit pref]
[ 2.184158] amdgpu 0000:04:00.0: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
[ 2.184162] amdgpu 0000:04:00.0: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 2.184164] amdgpu 0000:04:00.0: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[ 2.184254] [drm] amdgpu: 16368M of VRAM memory ready
[ 2.184257] [drm] amdgpu: 16368M of GTT memory ready.
[ 2.966903] [drm:psp_hw_start [amdgpu]] ERROR PSP load sysdrv failed!
[ 2.966947] [drm:psp_hw_init [amdgpu]] ERROR PSP firmware loading failed
[ 2.966982] [drm:amdgpu_device_fw_loading [amdgpu]] ERROR hw_init of IP block failed -22
[ 2.966986] amdgpu 0000:04:00.0: amdgpu_device_ip_init failed
[ 2.966989] amdgpu 0000:04:00.0: Fatal error during GPU init
[ 2.967027] [drm] amdgpu: finishing device.
[ 3.333613] WARNING: CPU: 1 PID: 153 at /var/lib/dkms/amdgpu/2.4-25/build/amd/amdgpu/amdgpu_object.c:960 amdgpu_bo_unpin+0xe8/0x110 [amdgpu]
[ 3.333618] Modules linked in: amdgpu(OE+) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc amdttm(OE) amd_sched(OE) i915(+) hid_generic aesni_intel amdkcl(OE) i2c_algo_bit amd_iommu_v2 aes_x86_64 crypto_simd drm_kms_helper glue_helper cryptd syscopyarea sysfillrect sysimgblt psmouse fb_sys_fops drm r8169 mii ahci libahci wmi video usbhid hid
[ 3.333676] RIP: 0010:amdgpu_bo_unpin+0xe8/0x110 [amdgpu]
[ 3.333732] ? amdgpu_vm_del_from_lru_notify+0x12/0x70 [amdgpu]
[ 3.333764] amdgpu_bo_free_kernel+0x90/0x150 [amdgpu]
[ 3.333806] gfx_v9_0_sw_fini+0x117/0x290 [amdgpu]
[ 3.333841] amdgpu_device_fini+0x390/0x4f0 [amdgpu]
[ 3.333871] amdgpu_driver_unload_kms+0xb4/0x150 [amdgpu]
[ 3.333898] amdgpu_driver_load_kms+0x158/0x350 [amdgpu]
[ 3.333939] amdgpu_pci_probe+0x136/0x1f0 [amdgpu]
[ 3.334013] amdgpu_init+0xab/0xba [amdgpu]
[ 3.334094] amdgpu 0000:04:00.0: (ptrval) unpin not necessary
[ 3.334251] Modules linked in: amdgpu(OE+) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc amdttm(OE) amd_sched(OE) i915(+) hid_generic aesni_intel amdkcl(OE) i2c_algo_bit amd_iommu_v2 aes_x86_64 crypto_simd drm_kms_helper glue_helper cryptd syscopyarea sysfillrect sysimgblt psmouse fb_sys_fops drm r8169 mii ahci libahci wmi video usbhid hid
[ 3.334335] amdgpu_vram_mgr_fini+0x31/0xb0 [amdgpu]
[ 3.334372] amdgpu_ttm_fini+0x8d/0x180 [amdgpu]
[ 3.334403] amdgpu_bo_fini+0x12/0x40 [amdgpu]
[ 3.334441] gmc_v9_0_sw_fini+0x6a/0x120 [amdgpu]
[ 3.334475] ? amdgpu_bo_free_kernel+0xbc/0x150 [amdgpu]
[ 3.334505] amdgpu_device_fini+0x390/0x4f0 [amdgpu]
[ 3.334534] amdgpu_driver_unload_kms+0xb4/0x150 [amdgpu]
[ 3.334562] amdgpu_driver_load_kms+0x158/0x350 [amdgpu]
[ 3.334598] amdgpu_pci_probe+0x136/0x1f0 [amdgpu]
[ 3.334671] amdgpu_init+0xab/0xba [amdgpu]
[ 3.334769] Modules linked in: amdgpu(OE+) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc amdttm(OE) amd_sched(OE) i915(+) hid_generic aesni_intel amdkcl(OE) i2c_algo_bit amd_iommu_v2 aes_x86_64 crypto_simd drm_kms_helper glue_helper cryptd syscopyarea sysfillrect sysimgblt psmouse fb_sys_fops drm r8169 mii ahci libahci wmi video usbhid hid
[ 3.334848] amdgpu_gtt_mgr_fini+0x31/0x80 [amdgpu]
[ 3.334883] amdgpu_ttm_fini+0x9a/0x180 [amdgpu]
[ 3.334913] amdgpu_bo_fini+0x12/0x40 [amdgpu]
[ 3.334952] gmc_v9_0_sw_fini+0x6a/0x120 [amdgpu]
[ 3.334985] ? amdgpu_bo_free_kernel+0xbc/0x150 [amdgpu]
[ 3.335015] amdgpu_device_fini+0x390/0x4f0 [amdgpu]
[ 3.335044] amdgpu_driver_unload_kms+0xb4/0x150 [amdgpu]
[ 3.335071] amdgpu_driver_load_kms+0x158/0x350 [amdgpu]
[ 3.335109] amdgpu_pci_probe+0x136/0x1f0 [amdgpu]
[ 3.335181] amdgpu_init+0xab/0xba [amdgpu]
[ 3.335316] [drm] amdgpu: ttm finalized
[ 3.335551] amdgpu: probe of 0000:04:00.0 failed with error -22

I’m really hoping it’s as simple as a BIOS setting or kernel parameter to get this working. The board has 12 PCIe 1x slots, which are physically USB connectors for risers. The ROCm repo only indicates it should work with 16x, 8x and 4x slots, but I have had Radeon VIIs working via these 1x risers on the other PC. Fiddling with BIOS settings I’m not well versed in sounds like a fantastic idea, so that’s what I’m doing next unless someone has a magic bullet (even if it’s just to shoot me in the back of the head with at this point).
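For reference, the link width a card has actually negotiated on one of these risers can be read from the verbose lspci output. This is only a rough sketch: pcie_link is a made-up helper name, and 04:00.0 is the Radeon VII address from the lspci listing above, so adjust it for other slots.

```shell
#!/bin/sh
# pcie_link is a hypothetical helper: it reads `lspci -vv` text on stdin and
# keeps only the link capability (LnkCap) and link status (LnkSta) lines.
pcie_link() {
    grep -E 'LnkCap:|LnkSta:'
}

# Usage (root is needed to read the PCIe capability registers):
#   sudo lspci -vv -s 04:00.0 | pcie_link
```

LnkCap shows what the device supports and LnkSta shows what was negotiated, so on a 1x riser the LnkSta line should report a width of x1.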

I’ve tried going back to the old PC and it no longer works there either, with the same PSP failure symptoms. Could cutting the power when I had no other way to turn off the new PC have screwed the PSP firmware on both Radeon VIIs simultaneously somehow, and could updating the firmware fix it? AMD released an updated BIOS as an exe, which is super unhelpful. I’ll download Windows 10 if I have to, but I’m on 4G going on 2G so it could take a while, and the system has already been down well over a week. If firmware is the issue, then AMD cheaping out by not having a backup BIOS to switch to has screwed me right up. Does a firmware issue sound likely?

If the cards work fine on Windows after installing the AMD drivers, then the cards are fine.

Otherwise, you might have to reflash the VBIOS.

I noticed you have the AMD IOMMU driver loaded. There are some serious issues with the AMD IOMMU driver combined with AMD cards running at 1x. Try running your kernel with iommu=soft to see if that solves the issue. This is one thing that will be different in Linux vs Windows.
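In case it helps, here is a minimal sketch of one way to add iommu=soft on Ubuntu, assuming the stock GRUB setup where kernel parameters live in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub. add_kernel_param is a hypothetical helper name, not part of any tool, and it relies on GNU sed's -i option.

```shell
#!/bin/sh
# add_kernel_param FILE PARAM (hypothetical helper)
# Appends PARAM inside the quotes of GRUB_CMDLINE_LINUX_DEFAULT in FILE,
# unless it is already there. PARAM must not contain "/" (sed delimiter).
add_kernel_param() {
    file=$1
    param=$2
    # Skip the edit if the parameter is already on the command line.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qwF -- "$param"; then
        return 0
    fi
    # Insert " PARAM" just before the closing quote of the line.
    # An empty "" value ends up with a leading space, which is harmless.
    sed -i "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\".*\)\"/\1 ${param}\"/" "$file"
}

# Usage, run as root after defining the function (back up the file first):
#   cp /etc/default/grub /etc/default/grub.bak
#   add_kernel_param /etc/default/grub iommu=soft
#   update-grub
#   reboot
# After the reboot, `cat /proc/cmdline` should include iommu=soft.
```

For a one-off test without touching any files, the same parameter can be typed onto the linux line from the GRUB boot menu editor instead.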