Vega 10 and 12 reset application

Hi gnif, we would welcome you to hang around in the Unraid forums as well :smiling_face_with_three_hearts:. Though Unraid sounds like a storage-focused OS (that part pretty much works), it also tries to be a seamless GPU pass-through/virtualization (QEMU)/gaming platform… and that’s where the challenges are.

Maybe this thread could be your starting point? (Your L1T post was featured there before.) :slightly_smiling_face:

Does anyone else here use the Radeon VII? I’d be interested in contributing to raise funds to get support for it.

I’m using the Vega VII, and I’m more than happy to donate to get the reset bug fixed.

Welcome @WhalingWizard and @SiwatSirichai to the forums! :smiley:

I am currently away but will return in a couple of days. When I do, I will set up a GoFundMe for this GPU, as I have also had interest from others via PM in providing funding this way.

Thanks @gnif! Happy to be here. Let us know once you are ready, we’re happy to help / donate.

To keep this thread on topic I have created a new one over here:

I pass through a Radeon VII too and am happy to help/donate!

I’m passing a VII through too - and experience the reset bug - and am also happy to help.

Thanks, please see the thread below where this is being discussed:

https://forum.level1techs.com/t/hardware-for-amd-reset-bug-fixes

@gnif is AMD doing anything itself (since you are in contact with them) with regard to a full fix? It’s been over 6 months since your fix and a LOT longer since this problem was identified…

AMD are not doing anything directly. I have been continuing to work with AMD on this, but a patch for Vega has not yet been made available, as it’s far more complex to reset than Navi, and Navi still has issues that are also present on the Vega generation. Once these issues with Navi are resolved, I will move back to working on the Vega cards.

So Vega owners are left aside, sigh. Before kernel 5.4 in a guest I did not experience the Vega reset bug, for whatever reason, go figure, but since 5.4.0 I have it. Oddly enough, a Windows 10 guest doesn’t show the issue.

The reset application doesn’t work for me, “Failed to exit BACO”, quirk applied or not.

Recent kernel activity which looks related seems to focus on Navi as well:

commit 210b3b3c7563df391bd81d49c51af303b928de4a upstream.

This patch fixes 2nd baco reset failure with gfxoff enabled on navi1x.
Clear state buffer (resides in vram) is corrupted after 1st baco reset,
upon gfxoff exit, CPF gets garbage header in CSIB and hangs.

I hope AMD won’t forget about Vega users, it’s been quite some time since https://twitter.com/tekwendell/status/1156727245387522048 but at the end of the day the issue is still around.

Thanks for that info. While it targets Navi, it might also be portable to Vega. I will look into it as soon as I can.

I am also getting “Failed to exit BACO” with the quirk applied.

3970x
MSI Creator
2x Vega 64, with the second GPU successfully passed through
quirk.c applied to kernel 5.3.13

~# lspci -nn | grep 4d
4d:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon RX Vega 64] [1002:687f] (rev c1)
4d:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:aaf8]

I ran and then stopped the VM that has the GPU passed through, using virsh destroy if it crashed. In all cases:

~# ./reset-test 0000:4d:00.0

Attempting Vega 10 reset
CMD_READMODIFYWRITE 0x00000e2b
CMD_DELAY_MS
CMD_READMODIFYWRITE 0x0001667c
CMD_READMODIFYWRITE 0x0001667c
CMD_READMODIFYWRITE 0x0001667c
CMD_READMODIFYWRITE 0x0001667c
CMD_READMODIFYWRITE 0x0001667c
CMD_READMODIFYWRITE 0x0001667c
CMD_READMODIFYWRITE 0x0001667c
CMD_READMODIFYWRITE 0x0001667c
CMD_READMODIFYWRITE 0x0001667c
CMD_WAITFOR 0x0001667c
Wait for timed out.
Failed to exit BACO

Hi,
I just signed up to share my experience with this reset tool. I tried the patch with kernel versions 5.3.18 and 5.4.2, but sadly it doesn’t work for me either:

    # ./reset-test 0000:43:00.0
============================================================================

AMD Vega 10/12 Reset Application (Version: 1.0)
Copyright (c) 2019 Geoffrey McRae <[email protected]>

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.

This tool is intended as an interim workaround while I port this into the
kernel driver. If you like my work and want to support it you can contribute
using the following methods:

* Ko-Fi   - https://ko-fi.com/lookingglass
* Patreon - https://www.patreon.com/gnif
* BTC     - 14ZFcYjsKPiVreHqcaekvHGL846u3ZuT13

============================================================================

Attempting Vega 10 reset
CMD_READMODIFYWRITE  0x00000e2b
CMD_DELAY_MS
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_WAITFOR          0x0001667c
Wait for timed out.
Failed to exit BACO

These are my PC specs:

  • AMD Ryzen Threadripper 1950x
  • Gigabyte X399 Aorus Gaming 7, BIOS v.F12
  • ROG Strix RX Vega 64 OC Ed.

I have a few questions (for @gnif mostly) that I can’t find answered in this thread:

  • what’s the kernel version you tried the patch and reset tool with?
  • is there a way to check if the kernel has been correctly patched?
  • @gnif if I remember correctly, you asked someone in the Navi thread whether they had a Threadripper. Is this processor problematic with this patch?

Thanks

It has zero bearing on the reset; that tool pokes the GPU directly, bypassing the kernel entirely.

I have been trying to isolate whether the PCIe bus reset issues the TR was plagued with early on are still playing a part in some reset failures.

So no clues on why it shouldn’t work with my configuration? Are there any tests or something I could do to try and help with this problem?
Thanks for your work btw!

Not really. At this point we are sitting and waiting for AMD to hopefully give us more information.

Edit: you did patch your kernel to avoid the bus resets, didn’t you?

Yes, I patched the kernel already. That’s why I asked which kernel version you worked with, and how to confirm it has been patched correctly (maybe a check somewhere?). When you patch it, does it give you errors on some hunks?

If you got errors, it did not patch. The patch simply adds the GPU to the list of quirked devices to prevent a bus reset. It is very simple to apply by hand to drivers/pci/quirks.c.
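
One way to sanity-check the source side of this is to search quirks.c for the fixup line. The following is only a rough Python sketch: `quirk_present` is a hypothetical helper name, not part of the patch or the reset tool, and it assumes the added line uses `PCI_VENDOR_ID_ATI` and `quirk_no_bus_reset` in a `DECLARE_PCI_FIXUP_HEADER` entry. It only tells you the source was modified, not that the running kernel was actually built from it.

```python
import re


def quirk_present(quirks_source, device_id):
    """Return True if the text contains a no-bus-reset fixup for this ATI device ID.

    quirks_source: contents of drivers/pci/quirks.c as a string.
    device_id: hex device ID, with or without the "0x" prefix (e.g. "0x687f").
    """
    dev = device_id.lower()
    if dev.startswith("0x"):
        dev = dev[2:]
    # Tolerate arbitrary spacing; IGNORECASE lets the hex digits match in either case.
    pattern = re.compile(
        r"DECLARE_PCI_FIXUP_HEADER\(\s*PCI_VENDOR_ID_ATI\s*,\s*0x"
        + re.escape(dev)
        + r"\s*,\s*quirk_no_bus_reset\s*\)",
        re.IGNORECASE,
    )
    return bool(pattern.search(quirks_source))
```

You would feed it the contents of drivers/pci/quirks.c from the tree you built, together with the device ID reported by lspci -nn.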

Also make sure that it covers your GPU; you might need to add a line for yours. Check your GPU’s PCI device ID using lspci -nn.

DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATI, 0xDEVICE_ID_HERE, quirk_no_bus_reset);

For example:

lspci -nn
...
01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f] (rev c3)
...

The ID would be 0x687f
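
As a rough illustration of that lookup, this Python sketch pulls the device ID out of an `lspci -nn` line and formats the matching quirk line. `fixup_line` is a hypothetical helper name, and the sketch assumes the `[vendor:device]` pair is the last bracketed hex pair on the line, as in the lspci output shown in this thread.

```python
import re

# Matches the "[vendor:device]" pair printed by `lspci -nn`, e.g. "[1002:687f]".
ID_PAIR = re.compile(r"\[([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\]")

AMD_VENDOR_ID = "1002"  # vendor ID shown for AMD/ATI devices in the lspci output


def fixup_line(lspci_line):
    """Return the quirk line for an AMD/ATI VGA device, or None for any other line."""
    if "VGA compatible controller" not in lspci_line:
        return None
    pairs = ID_PAIR.findall(lspci_line)
    if not pairs:
        return None
    vendor, device = pairs[-1]  # assumption: the ID pair is the last match on the line
    if vendor.lower() != AMD_VENDOR_ID:
        return None
    return (
        "DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATI, 0x%s, quirk_no_bus_reset);"
        % device.lower()
    )
```

Passing it the Vega 56/64 line from the lspci output gives the quirk line with 0x687f filled in; audio function lines and non-AMD devices return None.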