I had a bit of a play and learned a bit more about the early POST/init of the Vega series. There is a register that resets the ASIC when written, and that works, but afterwards the card needs posting, which involves a fair amount of code. I doubt that much code would ever be accepted as a PCI quirk.
It does seem possible, but really AMD should fix the VBIOS to properly support FLR, especially since it’s so involved to reset the card properly.
Edit: I have been put into contact with a group of developers at AMD and have started digging deeper into the problem. After shutting down the guest and unloading vfio-pci, even the amdgpu module cannot re-init the card, giving the following errors in dmesg:
[15555.608910] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 5secs aborting
[15555.608956] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing D780 (len 279, WS 16, PS 4) @ 0xD884
[15555.608996] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing A7F0 (len 219, WS 8, PS 4) @ 0xA8BB
[15555.609034] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing 9BAC (len 381, WS 0, PS 8) @ 0x9C31
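As a rough aid for reading these lines, here is a minimal sketch that extracts the numbers and subtracts the first hex value from the address after the "@". It assumes (this is not confirmed anywhere above) that the first value is the start of the table in the ROM and the second is where the interpreter gave up, so the difference is an offset inside that table. The names STUCK_RE and stuck_offsets are made up for this example.

```python
import re

# Pattern for the "stuck executing" messages quoted above.
# Assumption: group 1 is the table's start offset, group 5 the failing address.
STUCK_RE = re.compile(
    r"atombios stuck executing ([0-9A-Fa-f]{4}) "
    r"\(len (\d+), WS (\d+), PS (\d+)\) @ 0x([0-9A-Fa-f]{4})"
)

def stuck_offsets(log_text):
    """Return (table_base, length, offset_within_table) for each message."""
    results = []
    for match in STUCK_RE.finditer(log_text):
        base = int(match.group(1), 16)
        length = int(match.group(2))
        address = int(match.group(5), 16)
        results.append((base, length, address - base))
    return results

LOG = """
*ERROR* atombios stuck executing D780 (len 279, WS 16, PS 4) @ 0xD884
*ERROR* atombios stuck executing A7F0 (len 219, WS 8, PS 4) @ 0xA8BB
*ERROR* atombios stuck executing 9BAC (len 381, WS 0, PS 8) @ 0x9C31
"""

for base, length, offset in stuck_offsets(LOG):
    print(f"table {base:04X} (len {length}): offset {offset:#06x}")
# table D780 (len 279): offset 0x0104
# table A7F0 (len 219): offset 0x00cb
# table 9BAC (len 381): offset 0x0085
```

Under that assumption every offset falls inside the reported length, and the three lines read like a call chain with the innermost table (D780) listed first.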
Decompiling the AtomBIOS, I have determined that this is occurring during “MemoryTraining”, which is stuck in the following loop:
0104: 01050000585c0100 MOVE reg[0000] [XXXX] <- 00015c58
010c: 3c61010000 COMP reg[0004] [..X.] <- param[00] [...X]
0111: 490401 JUMP_NotEqual 0104
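To make the shape of that loop easier to see, here is a minimal Python model of it. It is only a sketch: write_reg and read_reg are hypothetical callables standing in for register access (not a real driver API), the byte masks assume the usual disassembler convention where [...X] is the lowest byte and [..X.] the next one up, and the timeout mirrors the 5 second abort reported in dmesg rather than anything in the table itself.

```python
import time

def poll_until_equal(write_reg, read_reg, index_value, expected, timeout_s=5.0):
    """Model of the three-instruction loop, plus a wall-clock abort.

    write_reg/read_reg are hypothetical stand-ins for register access.
    Returns True if the compared byte matched, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        write_reg(0x0000, index_value)            # MOVE reg[0000] [XXXX] <- 00015c58
        status = (read_reg(0x0004) >> 8) & 0xFF   # COMP reg[0004] [..X.] ...
        if status == (expected & 0xFF):           # ... against param[00] [...X]
            return True
        if time.monotonic() >= deadline:          # otherwise JUMP_NotEqual 0104
            return False

class FakeRegs:
    """Toy register file: reports 'done' after ready_after reads, or never."""
    def __init__(self, ready_after):
        self.ready_after = ready_after
        self.reads = 0
        self.last_index = None

    def write(self, reg, value):
        if reg == 0x0000:
            self.last_index = value

    def read(self, reg):
        self.reads += 1
        if self.ready_after is not None and self.reads >= self.ready_after:
            return 0x0100  # byte 1 set to 0x01
        return 0x0000

healthy = FakeRegs(ready_after=3)
stuck = FakeRegs(ready_after=None)
print(poll_until_equal(healthy.write, healthy.read, 0x00015C58, 0x01))
print(poll_until_equal(stuck.write, stuck.read, 0x00015C58, 0x01, timeout_s=0.05))
```

The second call models what the card appears to be doing here: the value being polled never matches, so only the timeout ends the loop.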
Hopefully AMD can comment on what exactly is going wrong and why the ASIC will not re-init.
Edit 2: Further digging reveals a change in the ASIC_Init AtomBIOS routine in later versions. It seems a block of init code is now jumped over, rendering it completely inaccessible. Here is the intro to ASIC_Init in my VBIOS:
0006: 370000 SET_ATI_PORT 0000 (INDIRECT_IO_MM)
0009: 4be50004 TEST param[00] [X...] <- 04
000d: 496601 JUMP_NotEqual 0166
0010: 4be50002 TEST param[00] [X...] <- 02
0014: 441e00 JUMP_Equal 001e
0017: 4be50040 TEST param[00] [X...] <- 40
001b: 49da00 JUMP_NotEqual 00da
001e: 4a65530002 TEST reg[014c] [..X.] <- 02
0023: 49c000 JUMP_NotEqual 00c0
And in later versions:
0006: 370000 SET_ATI_PORT 0000 (INDIRECT_IO_MM)
0009: 4be50004 TEST param[00] [X...] <- 04
000d: 495901 JUMP_NotEqual 0159
0010: 4be50002 TEST param[00] [X...] <- 02
0014: 442100 JUMP_Equal 0021
0017: 4be50040 TEST param[00] [X...] <- 40
001b: 49cb00 JUMP_NotEqual 00cb
001e: 435301 JUMP 0153
Note the additional unconditional JUMP at the end! This bypasses all the early init calls, including MemoryInitialization, which in turn calls MemoryTraining. This makes me wonder how these are being set up in the first place if the calls are completely skipped. Perhaps it relies on pre-POST configuration done at power on instead.
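For anyone comparing their own dump, the jump encodings can be read straight off the raw bytes in the listings: one opcode byte followed by a 16-bit little-endian target. Here is a minimal sketch of a decoder for just the three jump opcodes that appear above. The opcode-to-name mapping is taken from these listings only, and decode_jump is a made-up helper, not part of any existing tool.

```python
# Opcode bytes as they appear in the listings above.
JUMP_OPCODES = {
    0x43: "JUMP",
    0x44: "JUMP_Equal",
    0x49: "JUMP_NotEqual",
}

def decode_jump(hex_bytes):
    """Decode a 3-byte jump such as '435301' into (mnemonic, target)."""
    raw = bytes.fromhex(hex_bytes)
    if len(raw) != 3 or raw[0] not in JUMP_OPCODES:
        raise ValueError(f"not a known 3-byte jump: {hex_bytes!r}")
    target = int.from_bytes(raw[1:3], "little")  # target is stored low byte first
    return JUMP_OPCODES[raw[0]], target

for encoded in ("496601", "441e00", "49c000", "435301"):
    name, target = decode_jump(encoded)
    print(f"{encoded}  {name} {target:04x}")
# 496601  JUMP_NotEqual 0166
# 441e00  JUMP_Equal 001e
# 49c000  JUMP_NotEqual 00c0
# 435301  JUMP 0153
```

Searching a dump for the byte 0x43 alone would give false positives, since the same value can occur inside operands, so this is only useful on bytes already known to be an instruction.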
Edit 3: Interestingly, the AtomBIOS is run by an interpreter in the kernel; the card doesn’t actually execute any of this itself. It would be entirely possible to replace the AtomBIOS with a version on disk instead of executing the one in the ROM, or even to re-implement it directly in the kernel module.
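If a VBIOS image were loaded from disk, a basic sanity check before handing it to an interpreter would be sensible. The sketch below checks only two signatures. The offsets and magic values are as I recall them from the kernel's atom.h (ATOM_BIOS_MAGIC, ATOM_ROM_TABLE_PTR, ATOM_ROM_MAGIC_PTR, ATOM_ROM_MAGIC) and should be verified against the source. looks_like_atombios is a hypothetical helper, and the image built at the end is synthetic, not a real ROM.

```python
import struct

# Values recalled from the kernel's atom.h; treat them as assumptions to verify.
ATOM_BIOS_MAGIC = 0xAA55     # 16-bit signature at offset 0 (bytes 55 AA)
ATOM_ROM_TABLE_PTR = 0x48    # offset of the 16-bit pointer to the ROM header
ATOM_ROM_MAGIC_PTR = 4       # offset of the magic within the ROM header
ATOM_ROM_MAGIC = b"ATOM"

def looks_like_atombios(image):
    """Cheap signature check; it does not validate tables or checksums."""
    if len(image) < ATOM_ROM_TABLE_PTR + 2:
        return False
    (signature,) = struct.unpack_from("<H", image, 0)
    if signature != ATOM_BIOS_MAGIC:
        return False
    (header,) = struct.unpack_from("<H", image, ATOM_ROM_TABLE_PTR)
    start = header + ATOM_ROM_MAGIC_PTR
    return bytes(image[start:start + len(ATOM_ROM_MAGIC)]) == ATOM_ROM_MAGIC

# Synthetic image for demonstration only.
fake = bytearray(0x100)
fake[0:2] = b"\x55\xaa"
struct.pack_into("<H", fake, ATOM_ROM_TABLE_PTR, 0x80)
fake[0x84:0x88] = ATOM_ROM_MAGIC
print(looks_like_atombios(fake))          # True
print(looks_like_atombios(bytes(0x100)))  # False
```

On a real file the bytes would come from something like open(path, "rb").read(), with the path being whatever location the dumped or patched image was saved to.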