Navi Reset Kernel Patch

Any idea how to get this working in ESXi? Running 6.7.

Hello all,

first post on the forum, be gentle :S

I have tried to apply the patch to the kernel, but I am having some problems, as gcc cannot find the ioremap_nocache function. I found an xtensa_ioremap_nocache function but cannot get it to compile into the kernel.

I am using the “mainline” 5.6-rc4 kernel as of 2020-03-01.
I only patched the quirks.c file from drivers/pci
I used the config from my current kernel: 5.5.1-050501-generic

I am running an Asus X470-F Gaming motherboard with an AMD R2700X, an Asus Strix Vega 64 as main GPU and a Sapphire Pulse RX5700X as pass-through GPU.

I cannot use the Vega as pass-through because the motherboard will only use the first PCI Express slot during boot.
And I cannot swap the cards, as the Vega card is 2.7 slots wide and I cannot place it at the bottom (no space).

The error:

drivers/pci/quirks.c: In function ‘reset_amd_navi10’:
drivers/pci/quirks.c:3930:9: error: implicit declaration of function ‘ioremap_nocache’; did you mean ‘ioremap_cache’? [-Werror=implicit-function-declaration]
 3930 |  mmio = ioremap_nocache(mmio_base, mmio_size);
      |         ^~~~~~~~~~~~~~~
      |         ioremap_cache
drivers/pci/quirks.c:3930:7: warning: assignment to ‘uint32_t *’ {aka ‘unsigned int *’} from ‘int’ makes pointer from integer without a cast [-Wint-conversion]
 3930 |  mmio = ioremap_nocache(mmio_base, mmio_size);
      |       ^

Use ioremap directly instead of ioremap_nocache; the latter was removed because the former does exactly the same thing.
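
For anyone hitting the same build error, the fix can be sketched roughly as follows. This is only a sketch and not part of the posted patch: it assumes the failing call sits in reset_amd_navi10() exactly as in the compiler output above, and the version-guarded macro is a hypothetical convenience for keeping one patched quirks.c that builds both before and after 5.6, where ioremap_nocache was dropped.

```c
#include <linux/version.h>   /* LINUX_VERSION_CODE, KERNEL_VERSION */

/*
 * Simplest fix: edit the failing line in reset_amd_navi10() directly:
 *
 *     mmio = ioremap(mmio_base, mmio_size);
 *
 * Alternative (hypothetical shim, not in the original patch): map the old
 * name onto ioremap() on kernels where ioremap_nocache() no longer exists
 * (5.6 and later), so the patch text itself can stay unchanged.
 */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 6, 0)
#define ioremap_nocache(addr, size) ioremap((addr), (size))
#endif
```

If the shim route is taken, the macro would presumably go near the top of drivers/pci/quirks.c, after the existing includes. Editing the one call is less intrusive if only a single kernel version is being built.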

Thanks, that was it, it compiles now. I just had a problem with the kernel config, so I couldn’t test it out yet.

Is there anything that could be done in a guest’s shutdown script to make resetting Navi more reliable?
I have both a Linux and a Windows guest I can test with.

What can be done is what this patch already does, that’s it. I have also found that it’s random: sometimes shutting down the guest completely breaks the reset, while a hard reset of the guest (QEMU monitor command system_reset) will often work, provided the guest had not crashed.

I have a machine with 2 identical Navi 10 cards, and I have 2 VMs, each with one Navi passed through.

When I try to reboot one VM without the patch, the host throws an error and only the VM that attempted the reboot is disabled until the host reboots.

When I try this with the patch, the VM that is not trying to reboot at all gets killed and disabled along with the VM that is trying to reboot.

Is this the expected behavior of the patch? Or is there some PEBCAK error in my system that’s causing this?

Please read through this thread. The patch is incomplete and, for unknown reasons, still doesn’t function correctly in some situations/configurations.

Out of curiosity: what makes AMDGPU able to BACO reset Navi10, but VFIO unable to?
What’s stopping VFIO from just using the same BACO reset procedure that AMDGPU uses?

That’s exactly what this patch does… but even the amdgpu driver doesn’t always reset/recover the GPU. It also does a ton of extra stuff that would never be accepted as a PCI quirk such as loading firmware into the SOC and booting it, etc.

I am following this thread to stay up to date on this.
@gnif: Out of curiosity: you say AMD is cooperative at your end, and yet the “ton of extra stuff” may make AMD’s intervention unavoidable.
Does something like a post-sales VBIOS upgrade “made by AMD” actually happen in reality? I’ve only ever seen those ripped from newer cards.
Thanks

Some of AMD’s staff stepped up and started to help with this but their hands are tied by the NDAs they are under. AMD as a company don’t seem to care about fixing this properly with a firmware and/or silicon update.

I see two possible kinds of “not caring”: a) not wanting to throw money at it, and b) not wanting to let it “fix itself” simply by allowing it.
Allowing somebody who is bound by an NDA to disclose something is relatively cheap in money terms.
Is there a way the public can do something?
I bet it’s not hammering the support lines, because that will just hit the wrong wall.

To me it looks very strange. AMD is said to embrace Linux, but is miles behind Nvidia in anything SR-IOV. I think virglrenderer could possibly change the game to AMD’s advantage in the far future, but if they “don’t care about people doing it” (see Nvidia Err 43), they could just as well grab the opportunity and push all GPUs into guests. It’s a missed chance, given there are hardly any AMD cards that support VMs.

Will the new AMD source leak be of help if the entirety is published?

No, it will hinder things, as anyone who obtains information by looking at this code will be tainted and can never again contribute code pertaining to AMD to any open source project.

Even if they do it in secret and nobody ever finds out?

Sometimes people have to do horrible things for the good of all mankind.

How would one explain then to AMD how they managed to decode and understand secret register details that are completely undocumented that can not be known without the equipment and resources to reverse engineer the silicon of one of the most complex devices currently available in the industry? Literally millions of dollars worth of equipment and time would be required.

The odds would be that the code is tainted, and even the slight risk of tainting the entire project by including such code in the Linux kernel is enough for the powers that be (Linus, etc.) to refuse it without some reasonable explanation of how the information was obtained.

I am sorry, but it’s not even worth discussing this here, as this is a topic all on its own and is derailing this thread.

I have this patch applied on an Unraid setup but still can’t pass my 5700 XT successfully to macOS. I reach Clover with no issue and start booting (OVMF resolution matches Clover), but the boot freezes halfway and I get the D3 power state in the logs.

I’m fairly confident it’s an issue with my board (X570 Taichi) or the BIOS revision (latest as of writing, 2.10 I believe). Hoping someone has similar hardware and has worked through this. I have no issues with Nvidia.

Finally bit the bullet and replaced my 290x with a 5700XT :smiley:

Glad to say the patch is working perfectly for me. Been up and running for a few days now. The card seems to be resetting just fine, and I haven’t run into any issues.

I really hope AMD pull their finger out and address this. In the meantime I very much appreciate that we have people like @gnif

Cheers!

gnif already explained why he won’t use the stolen code. Please respect that.
