
Vega 10 and 12 reset application

amd
vfio
#1

Hi All,

As some of you may be aware I have been working to find either a workaround or fix to the AMD Vega reset bug. Last week I posted to AMD’s reddit a cry for help to fix this issue in an attempt to show AMD how much demand there is for this. As a result, an AMD Engineer got in touch and has guided me to a possible solution to the problem.

Over the weekend I spent considerable time implementing what seems to be a working reset for Vega 10 and 12. Initial testing by a few people confirms that it works on Vega 10, but it needs further testing.

You must apply this patch to your kernel to prevent vfio-pci from attempting to reset the GPU incorrectly.

Please note that this application is intended as an interim workaround while I work on implementing this in the kernel for vfio.

Download reset-test.tar.gz

Usage is simple. The GPU must not be in use at the time, and it should be bound to vfio-pci.
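If you are not sure whether the card is bound to vfio-pci, something like the following rough sketch can check it. It only reads the standard sysfs `driver` symlink for the device. The function names `bound_driver` and `ready_for_reset` are made up for illustration and are not part of the reset tool, and the `sysfs_root` parameter exists only so the check can be pointed at a test directory.

```python
import os

def bound_driver(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to a PCI device, or None.

    `bdf` is the full PCI address, e.g. "0000:24:00.0".
    (Hypothetical helper, not part of reset-test.)
    """
    link = os.path.join(sysfs_root, bdf, "driver")
    if not os.path.islink(link):
        # No driver is bound, or the device does not exist.
        return None
    # The symlink points at .../drivers/<name>; keep only the last component.
    return os.path.basename(os.readlink(link))

def ready_for_reset(bdf, sysfs_root="/sys/bus/pci/devices"):
    """True only when the device is currently bound to vfio-pci."""
    return bound_driver(bdf, sysfs_root) == "vfio-pci"
```

Calling `ready_for_reset("0000:24:00.0")` should return `True` when the card is bound to vfio-pci. If `bound_driver` returns something else, such as `amdgpu`, the card is still held by another driver. With the binding confirmed, the tool itself is invoked like this: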

./reset-test 0000:24:00.0

The expected output is:

============================================================================

AMD Vega 10/12 Reset Application (Version: 1.0)
Copyright (c) 2019 Geoffrey McRae <[email protected]>

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.

This tool is intended as an interim workaround while I port this into the
kernel driver. If you like my work and want to support it you can contribute
using the following methods:

* Ko-Fi   - https://ko-fi.com/lookingglass
* Patreon - https://www.patreon.com/gnif
* BTC     - 14ZFcYjsKPiVreHqcaekvHGL846u3ZuT13

============================================================================

Attempting Vega 10 reset
CMD_READMODIFYWRITE  0x00000e1c
CMD_WRITE            0x00000e1f
CMD_READMODIFYWRITE  0x00000e2b
CMD_READMODIFYWRITE  0x00000e2b
CMD_WAITFOR          0x0001667c
CMD_READMODIFYWRITE  0x00000e2b
CMD_READMODIFYWRITE  0x00000e2b
CMD_READMODIFYWRITE  0x00000e2b
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x00000e2b
CMD_DELAY_MS
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_WAITFOR          0x00000e2b
CMD_READMODIFYWRITE  0x00000e2b
CMD_DELAY_MS
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_WAITFOR          0x0001667c
CMD_READMODIFYWRITE  0x0001667c
CMD_READMODIFYWRITE  0x00000e2b
CMD_READMODIFYWRITE  0x00000e2b
CMD_READMODIFYWRITE  0x00000e2b
CMD_WAITFOR          0x00000e2b
CMD_WRITE            0x00000052
CMD_WRITE            0x00000053

At this point the GPU should successfully post inside a VM, even after a dirty shutdown or VM crash.
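For anyone wondering what the command names in that log mean, here is a toy sketch of how a sequence of register commands like this is typically interpreted. This is not the actual reset sequence and not the tool's code. The `run_sequence` function, the tuple layout, and every mask and value below are assumptions made up for illustration. Only the command names and the two register offsets appear in the log above, and the "registers" here are just a Python dict, not real hardware.

```python
import time

def run_sequence(regs, commands, poll_limit=1000):
    """Run (op, *args) tuples against `regs`, a dict standing in for registers.

    Hypothetical illustration only; the real tool's internals are not shown
    in this thread.
    """
    for op, *args in commands:
        if op == "WRITE":
            reg, value = args
            regs[reg] = value & 0xFFFFFFFF
        elif op == "READMODIFYWRITE":
            # Change only the bits selected by `mask`, keep the rest.
            reg, mask, value = args
            old = regs.get(reg, 0)
            regs[reg] = ((old & ~mask) | (value & mask)) & 0xFFFFFFFF
        elif op == "WAITFOR":
            # Poll until the masked bits equal `expected`, or give up.
            reg, mask, expected = args
            for _ in range(poll_limit):
                if (regs.get(reg, 0) & mask) == expected:
                    break
            else:
                raise TimeoutError(f"register {reg:#010x} never matched")
        elif op == "DELAY_MS":
            (ms,) = args
            time.sleep(ms / 1000)
        else:
            raise ValueError(f"unknown command {op!r}")
    return regs

# Made-up masks and values, applied to a fake register file.
regs = {0x00000e1c: 0x000000ff}
run_sequence(regs, [
    ("READMODIFYWRITE", 0x00000e1c, 0x0000000f, 0x00000005),
    ("WRITE",           0x00000e1f, 0x00000001),
    ("DELAY_MS",        1),
    ("WAITFOR",         0x00000e1c, 0x0000000f, 0x00000005),
])
print(hex(regs[0x00000e1c]))  # low nibble replaced, upper bits untouched
```

The point of read-modify-write is that only the masked bits change, and the point of wait-for is that the sequence stalls until the hardware reports the expected state. On real hardware the correct masks and values come from the vendor documentation, which is why guessing at them is risky.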

A reset for Vega 20 and Navi is possible, but as I do not have these devices to develop against I cannot safely implement it. Poking blindly at the wrong registers is dangerous and can destroy the GPU.

If you would like to see Navi also supported you can contribute to the cost to purchase a suitable card below:

Edit: Funding is complete! Thank you everyone for your support!

31 Likes

#2

In the past you did a gofundme. I wonder if people would be willing to fund another one for you to get a Navi GPU.

2 Likes

#3

Good idea, done:

Edit: Funding is complete! Thank you everyone for your support!

3 Likes

#4

Good work

3 Likes

#5

I have a Radeon 7 I’m more than happy to risk for science. If it would be at all helpful I’m more than willing to throw it in a non production machine and set you up a tunnel to it. Assuming you can find out what you need without physical access.

1 Like

#6

Thank you kindly for the offer. However, while I am writing this, the GPU ends up in a hung state that often crashes the entire machine, requiring a physical power down. I also need to be able to see the physical GPU output, hear the fan spin up and down, and so on, to validate what is going on, as there is no other form of debug output available from the GPU.

1 Like

#7

Edited the OP, but will post this here as well to be sure it’s not missed.

You must apply this patch to your kernel to prevent vfio-pci from attempting to reset the GPU incorrectly.

0 Likes

#8

Not currently using VFIO, but I just want to say thanks for working on this, as I may in the future. I just haven’t gotten around to it because I’m time-poor at home…

Flicked you a small BTC donation 🙂

1 Like

#9

Wanted to thank you for doing what AMD wouldn’t do. Hopefully a similar solution would apply to the other GPUs including Polaris and Fiji.
I remember Linus from LTT struggled with the Fury Nanos in his 7 gamers rig precisely because of this bug.
Quick question though; what is Vega 12 as far as graphics cards go?

0 Likes

#10

Vega 12 is the Vega Pro Workstation cards.

2 Likes

#11

It is funny that a gofundme fundraiser is needed to sort out AMD issues. At this point they really do start to look like the old saying - AMD for the poor people. I guess the only path forward is for AMD marketing to just send you a full line-up of gpu cards - two of each and call it a day.

I sent a small donation - thank you for doing this!

3 Likes

#12

Noice. Even if you don’t get Radeon 7 fixed that’s OK. I just use my R7 for host and get a navi GPU for my gaming VM.

2 Likes

#13

I have an R7 that I want to passthrough to a VM, let me know how I can help to test.

2 Likes

#14

Thanks for the offer but in order to add support for R7 I will need one on hand.

0 Likes

#15

Well let me know. I would like to dive into more of the PCIe structure of my system and I know some of the ins and outs of the linux kernel.

0 Likes

#16

I am still working with AMD on getting the bugs and specifics of this reset worked out. Other generations will also require one-on-one time with an engineer for the very same reason. It’s not about poking around in the kernel, but about having the literal datasheet for the Vega/Navi GPUs.

3 Likes

#17

Would be really cool if someone on the AMD side could help test your code.

I’m sure AMD has a plentiful supply of AMD GPUs, and this additional support for their cards under Linux is going to cost them basically nothing in terms of R&D, seeing as you’re doing it. Judging by the Linux kernel submissions they make, they clearly already have Linux developers for Polaris, Vega and Navi cards…

Has anyone you’ve spoken with at AMD indicated whether or not this may be possible?

0 Likes

#18

Thank you for working on this issue. Do you know if the application works equally well with a macOS guest? A fast, modern GPU working with a macOS VM would be phenomenal!

0 Likes

#19

Ordered a 5700xt. It’ll be here Aug 1st. Once it’s installed my Radeon 7 is totally free for testing. I’m more than willing to cover shipping both ways, and sign a piece of paper stating I won’t hold you responsible if it dies or falls into the ocean, blah blah. Is that something that would be helpful? I do need the card back but not for a while, and if it dies or gets lost I’ll just get another 5700xt when I need it.

6 Likes

#20

It certainly would be helpful! PM me when you’re ready and I will pass on the shipping details. Thanks!

2 Likes