Vega 10 and 12 reset application

A huge thank you to everyone who has contributed to the cost of the AMD RX 5700!

The GoFundMe is now fully funded. Once the funds are available to me I will order this card and do my best not to disappoint!

10 Likes

I’m working on hardware that switches off the external 8-pin connector power (a PCIe hot swap that resets the GPU), controlled by an Arduino (prototype) and a host USB connection. Hardware: Sapphire RX 580 8G Nitro. It’s working!! A software-only solution would be nice!!

2 Likes

R7 works well in macOS.

You also don’t really need to reboot your host once you shut down the VM. You can reset the R7 by writing to a few /sys/bus/pci/… files and performing a suspend/resume. (That’s what I’m doing.)
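A minimal sketch of that sequence, assuming the R7 sits at PCI address 0000:0a:00.0 (a made-up address; substitute your own from `lspci`). It prints the sysfs writes as a dry run by default rather than performing them, since they require root:

```python
from pathlib import Path

# Hypothetical PCI address of the R7 -- substitute your own from `lspci`.
GPU = "0000:0a:00.0"

def sysfs_write(path: str, value: str, dry_run: bool = True) -> str:
    """Write `value` to a sysfs file, or just return the equivalent shell command."""
    cmd = f"echo {value} > {path}"
    if not dry_run:
        Path(path).write_text(value)  # requires root on a real system
    return cmd

def reset_sequence(dry_run: bool = True) -> list[str]:
    return [
        # 1. Detach the GPU so nothing touches it during suspend.
        sysfs_write(f"/sys/bus/pci/devices/{GPU}/remove", "1", dry_run),
        # 2. Suspend to RAM; firmware re-initialises the card on resume.
        sysfs_write("/sys/power/state", "mem", dry_run),
        # 3. Rescan the bus so the freshly reset GPU re-enumerates.
        sysfs_write("/sys/bus/pci/rescan", "1", dry_run),
    ]

for cmd in reset_sequence():
    print(cmd)
```

Set `dry_run=False` (as root) to actually perform the writes; the ordering matters, since removing the device before suspending keeps the driver from touching it mid-reset.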

I wish you had mentioned that your campaign was in AUD, because I was totally willing to spend USD 50 but only spent ~30 in the end :slight_smile:

Funny. I have the RX580 8G Pulse and that one doesn’t require any reset hack. Resets itself just fine. Debugging this issue must be “fun”.

Have you tried it with a Linux VM? My RX580 has the reset bug with a Linux VM, but not with a Windows VM. See this: Secondary gpu passthrough of amd rx580 to Linux vm problem

This is not about the RX580; Vega 10/12 only! (i.e. Vega 56/64/FE)

yep - I have two RX580s; when both are used they have some issues.

Anyone know how to apply this patch in arch?

I seem to be too incompetent to figure it out myself.

Otherwise how long for this to be upstreamed?

I’ve fixed the AMD reset bug with a hardware hack. It works by powering off the GPU (8-pin power plus a small video-board modification), using an Arduino to do this and software on Linux (prototype source code). It’s a prototype for now, but I’m designing the final printed circuit board; it will be open source, can control 3 GPUs, and several boards can be cascaded (AMD Linus 7 Gamers, 1 CPU?). I’ve tested this on a Sapphire RX580 Nitro+ 8G with Linux/Windows/macOS VMs on Proxmox, but it could possibly work on other GPUs: RX470, RX480, Navi, … and other nasty future GPUs. Current prototype Reset GPU in action

1 Like

There is no patch yet, it’s coming though. I am now just waiting on a Navi card to arrive in the post to target it, so that when the patch is released it covers both generations.

Why? We have a proper solution without any special hardware.

This is very dangerous: the inrush current when re-connecting the GPU’s power can pull down the power rail, which is why true hot-swap motherboards have ICs that pre-charge the device with current limiting before asserting the power good signal and enabling the device logic.

The surprise removal of the GPU is also not well supported on all chipsets (usually only enterprise grade devices) as it has to be programmed to support it and renegotiate the link. The OS also needs to be aware of the removal/insertion so that bus configuration can take place correctly, you’re likely to halt the system, cause it to become unstable, or worse, blow up the driving gates in the PCIe transceivers.

If you really want to go down this path, the far simpler and safer solution is to isolate the power good signal on the bus and cycle it. This is the same signal that is asserted during a warm reboot, and it could be done with a PCIe riser instead of needing to modify a GPU or motherboard.

1 Like

Why indeed?

I will just put it out there that:

A: I have some GPUs that actively panic when you pull the PCIe 8-pin without first disconnecting the PCIe slot power.

B: I am an electronics engineer and will go on the record to say gnif is correct. Playing rough with a processor’s power is how you end up with blown VRMs and dead silicon. (Consider yourself warned.)

I am after a patch (no matter how dirty) that I can apply to my kernel to proof-of-concept this with my pair of Vega 64s; you stated in your video that it is working with Vega 10 now.

3 Likes

It certainly is, but I don’t want to release the patch until Navi is included; otherwise attention from AMD may be diminished and I may not get the support from them that I need to get Navi working.

1 Like

What are the odds that Vega 20 / the Radeon VII might be included?

I‘d be interested as well. :pray: Radeon VII :pray:

Radeon VII should be possible but again, I need the device to do this. At the moment Navi and Vega are the main targets, but if I can get a VII I will see about implementing it there too.

One of my AMD contacts is pushing to get some sponsorship from AMD and he has proposed to them to provide one of these cards amongst some other things, if this happens I will certainly do my best to include support for it.

3 Likes

There is no patch yet, it’s coming though. I am now just waiting on a Navi card to arrive in the post to target it, so that when the patch is released it covers both generations.

I’m glad to hear that Polaris GPUs will be fixed!! :grinning:

Why? We have a proper solution without any special hardware.

My hardware is a plan B: if all fails (AMD’s goodwill runs out), I have a GPU reset fix that can run a macOS Mojave VM. It’s outside the PCIe specification, with no proper hotplug, but it’s working. My plan C is to buy 3 Vega 64s (very expensive, especially here in Brazil), which you have already fixed.
My PCB design has 3 GPU power controllers (reset), four hard disk power controllers (HD hot swap), 8 high-power PWM cooler controllers, 3 addressable RGB LED strip controllers, 2 general-purpose I2C/SPI headers to connect displays, and other spare I/O pins. It can be easily programmed using the Arduino (STM32) IDE. It will help fix the AMD reset bug but has other nice functions.

This is very dangerous: the inrush current when re-connecting the GPU’s power can pull down the power rail, which is why true hot-swap motherboards have ICs that pre-charge the device with current limiting before asserting the power good signal and enabling the device logic.

I precharge the graphics card’s capacitors to control the inrush current at power-up. I don’t have control of the power good signal to limit the inrush current into the GPU, but I measured the GPU current at power-up (indirectly, via the VRM PWM pulse width). I measured several times and I can assure you it’s OK. PROBABLY (I don’t have schematics, AMD datasheets, …) my specific graphics cards have some circuit that prevents an improper power good signal by monitoring the 12 V. :pray: I didn’t blow up my expensive GPU (here in Brazil it’s really expensive).

The surprise removal of the GPU is also not well supported on all chipsets (usually only enterprise grade devices) as it has to be programmed to support it and renegotiate the link. The OS also needs to be aware of the removal/insertion so that bus configuration can take place correctly, you’re likely to halt the system, cause it to become unstable,

Before powering off the GPU I disconnect the GPU device and its PCIe bridge from Linux. The GPU power is cycled, then the devices show up again by rescanning the bus. The PCIe configuration is OK after the scan; no kernel error messages (dmesg).
PCIe ASPM must be disabled (kernel parameter pcie_aspm=off) and the GPU must be able to enter the D3hot state (this feature can be disabled in the BIOS).
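The remove / power-cycle / rescan flow above could be sketched roughly as follows. The PCI addresses and the `arduino-cycle-gpu-power` command are hypothetical placeholders, and the sequence is printed as a dry run rather than executed:

```python
# Hypothetical addresses -- find the GPU and its upstream bridge with `lspci -t`.
GPU = "0000:03:00.0"
BRIDGE = "0000:00:03.0"

def reset_steps(power_cycle_cmd: str = "arduino-cycle-gpu-power") -> list[str]:
    """Return the command sequence as strings (run each as root on a real system)."""
    return [
        # Detach the GPU first, then the bridge it hangs off.
        f"echo 1 > /sys/bus/pci/devices/{GPU}/remove",
        f"echo 1 > /sys/bus/pci/devices/{BRIDGE}/remove",
        # External hardware cuts and restores the 8-pin power here
        # (placeholder command for the Arduino controller).
        power_cycle_cmd,
        # Re-enumerate; both devices should reappear with a clean config.
        "echo 1 > /sys/bus/pci/rescan",
    ]

for step in reset_steps():
    print(step)
```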

GPU reset in action
Hardware: Motherboard ASROCK X99 Taichi, XEON E5-26296 V3, 2x Sapphire RX580 Nitro+ 8G, Proxmox 6.0.6

FPGA PCIe device development without restarting the computer:

or worse, blow up the driving gates in the PCIe transceivers.

Why? From the PCIe specification:
"4.3.5.2. Short Circuit Requirements
All Transmitters and Receivers must support surprise hot insertion/removal without damage to the component. The Transmitter and Receiver must be capable of withstanding sustained short circuit to ground of D+ and D-."

https://www.globalspec.com/reference/66556/203279/surprise-insertion-removal

"PCI Express physical architecture is designed with ease of use in mind. To support this concept PCI Express has built in the ability to handle surprise insertion and removal of PCI Express devices. All transmitters and receivers must support surprise insertion/removal without damage to the device. The transmitter and receiver must also be capable of withstanding sustained short circuit to ground of the differential inputs/outputs TX/RX+ and TX/RX-."

B: I am an electronics engineer and will go on the record to say gnif is correct. Playing rough with a processor’s power is how you end up with blown VRMs and dead silicon. (Consider yourself warned.)

I’m aware of the risks, but gnif could also blow up the GPU by messing around with the PCIe registers! Of course his is the proper way. My way of doing this was the MYTHBUSTERS way (sometimes they blow things up), by breaking the (engineering) rules. :grinning:

I see where you’re coming from but please note that because you’re using the chipset in a way that was not intended by the manufacturer of your motherboard it may not actually conform to the PCI specification in these niche cases.

How? These are very fast transients that would only be visible on a DSO.

Your GPU might, but what about the rest of your system when the power rail suddenly dips and the PSU tries to keep up with the sudden extreme high demand put upon it? Will the voltage spike to try to correct for the sudden voltage drop? It depends on the PSU’s implementation and how it handles such things.

http://www.ti.com/lit/an/slva670a/slva670a.pdf

There are two key concerns associated with inrush current. The first is exceeding the absolute maximum current ratings of the traces and components on a PCB. All connectors and terminal blocks have a specific current rating which, if exceeded, could cause damage to these parts. Likewise, all PCB traces are designed with a certain current carrying capability in mind and are also at risk to damage. When designing the PCB traces and selecting connectors, not taking the inrush current peak into account can damage the power path and lead to system failure; however, appropriately designing for a large inrush current peak will lead to thicker PCB traces and more durable connectors which can increase the size and cost of the overall design.

The second problem occurs when a capacitive load switches onto an already stable voltage rail. If the power supply cannot handle the amount of inrush current needed to charge that capacitor, then the voltage on that rail will be pulled down. Figure 4 is an example of a 100 µF capacitance being applied to a voltage supply without any slew rate control. The capacitance generates 6.88 A of inrush current and forces the voltage rail to drop from 3.3 V down to 960 mV.

If other modules are connected to this power rail and the voltage drops, then these modules may reset themselves and put the rest of the system into an undesired state. If the voltage regulator is unable to supply enough current at turn-on, the voltage rail could collapse completely leading to system failure
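For a sense of scale, the app note’s figure follows from I = C·dV/dt. The ~48 µs ramp time below is inferred from the quoted numbers rather than stated in the note, and the GPU figures are illustrative guesses:

```python
# Back-of-the-envelope inrush current for a capacitive load: I = C * dV/dt.
def inrush_current(c_farads: float, dv_volts: float, dt_seconds: float) -> float:
    return c_farads * dv_volts / dt_seconds

# The app note's Figure 4: 100 uF stepped onto a 3.3 V rail draws 6.88 A.
# Working backwards, that corresponds to a voltage ramp of roughly 48 us
# (the ramp time is inferred here, not stated in the note).
print(f"{inrush_current(100e-6, 3.3, 48e-6):.2f} A")  # ~6.88 A

# A GPU's bulk capacitance is far larger: even a 1 ms uncontrolled ramp on,
# say, 2000 uF at 12 V is a 24 A spike drawn from the shared rail.
print(f"{inrush_current(2000e-6, 12.0, 1e-3):.0f} A")
```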

While the pre-charge is an extremely short transient, the inrush current can affect your entire system, not just the GPU. If you have HDD activity occurring, for example, you might end up with undetected data corruption, or bit flips in RAM causing undefined system behaviour. The VRMs on your motherboard may overshoot when trying to regulate, burning out or damaging your CPU. The list goes on.

According to 6.7.2.9. Port Capabilities and Slot Information Registers, a slot capable of surprise removal and hot-plug must set the associated bits in its PCI configuration space:

  • Hot-Plug Capable (Slot Capabilities) – When Set, this bit indicates this slot is capable of supporting hot-plug.
  • Hot-Plug Surprise (Slot Capabilities) – When Set, this bit indicates that adapter removal from the system without any prior notification is permitted for the associated form factor.
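As a rough illustration, those two bits live in the Slot Capabilities register (offset 0x14 into the PCI Express capability, readable on Linux with e.g. `setpci -s 00:03.0 CAP_EXP+14.L`). The register values below are made up:

```python
# Decode the two hot-plug bits from a PCIe Slot Capabilities register value.
HOT_PLUG_SURPRISE = 1 << 5  # bit 5: surprise removal permitted
HOT_PLUG_CAPABLE = 1 << 6   # bit 6: slot supports hot-plug

def decode_slot_caps(slot_caps: int) -> dict[str, bool]:
    return {
        "hot_plug_capable": bool(slot_caps & HOT_PLUG_CAPABLE),
        "hot_plug_surprise": bool(slot_caps & HOT_PLUG_SURPRISE),
    }

# Hypothetical value with both bits set:
print(decode_slot_caps(0x0060))
# Typical consumer x16 slot: neither bit set.
print(decode_slot_caps(0x0000))
```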

It is quite common for PCI implementations to not fully support the specification even when technically required to do so. Short circuit protection also doesn’t cover you for cases where D+ or D- spike to voltages outside of specification due to VRM overshoot during startup, which is more the concern here. It may work 10 or 100 times, but pushing the transceivers out of spec will eventually cause them to fail.

Firstly, how? I don’t see anything on your prototype board that could do such a thing and your code doesn’t show any form of pre-charge capability, you’re simply turning on a MOSFET.

void setGpu(bool value)
{
  digitalWrite(MOSFET, value ? HIGH : LOW);
  gpuPower = value;
}

Without controlling the PERST# signal you could damage your GPU. If your GPU logic tries to start up during this power-starved period it could behave in undefined ways, causing all sorts of erroneous behaviour. Computers and complex logic use a power good signal of sorts for a very good reason and you should not ignore it.

Further to this, disconnecting the GPU power is overkill when you could simply just assert PERST#.

6.6.1. Conventional Reset
In all form factors and system hardware configurations, there must, at some level, be a hardware mechanism for setting or returning all Port states to the initial conditions specified in this document - this mechanism is called “Fundamental Reset.” This mechanism can take the form of an auxiliary signal provided by the system to a component or adapter card, in which case the signal must be called PERST#, and must conform to the rules specified in Section 4.2.4.8.1

My main concern with this is that you’re suggesting it as a “fix”, which, while it seems to work with your hardware configuration, means that if you provide a printed PCB and people start to use this “fix” we may start to see widespread motherboard and GPU failures due to people trying to use it on platforms that don’t take so kindly to such abuse of the bus.

Certainly it’s possible, which is why several AMD engineers are helping me implement this in a way that won’t cause any harm to the masses’ hardware.

2 Likes

I see where you’re coming from but please note that because you’re using the chipset in a way that was not intended by the manufacturer of your motherboard it may not actually conform to the PCI specification in these niche cases.

I don’t pretend this is the right way.

How? These are very fast transients that would only be visible on a DSO.

I agree. Instantaneous current can’t be measured this way (pulse width)!

Your GPU might, but what about the rest of your system when the power rail suddenly dips and the PSU tries to keep up with the sudden extreme high demand put upon it? Will the voltage spike to try to correct for the sudden voltage drop? It depends on the PSU’s implementation and how it handles such things.

I precharge the graphics card capacitors; this isn’t an issue!! I measured the GPU VRM pulse width and I can assure you that the PSU is working properly. An oscilloscope on the 12 V rail can show up issues (high-frequency spikes) too.

According to 6.7.2.9. Port Capabilities and Slot Information Registers, a slot capable of surprise removal and hot-plug must set the associated bits in its PCI configuration space:

I don’t claim that my solution complies with the PCIe specification.

It is quite common for PCI implementations to not fully support the specification even when technically required to do so.

I have researched this. The modern i7/i9/Xeon Haswell/Broadwell families fully comply with the PCIe specification. Since the PCIe controller is in the processor, I think the PCIe lanes have short circuit protection.

AMD GPU PCIe short circuit protection: one can ask the AMD engineers.

Short circuit protection also doesn’t cover you for cases where D+ or D- spike to voltages outside of specification due to VRM overshoot during startup, which is more the concern here. It may work 10 or 100 times, but pushing the transceivers out of spec will eventually cause them to fail.

I can work to fix this (we don’t know that it is happening) by asserting the PCIe reset properly.

Firstly, how? I don’t see anything on your prototype board that could do such a thing and your code doesn’t show any form of pre-charge capability, you’re simply turning on a MOSFET.

The prototype photo shows only some of the electronic components. Below the board there is a P-channel MOSFET and an analog circuit (a resistor) that puts the MOSFET in its linear region to precharge the graphics card’s capacitors; the Arduino doesn’t need to handle this. In fact this particular MOSFET cannot handle the 300 W my GPU requires; it’s only a prototype. In the final version a PWM output on the microcontroller will control a low-power voltage-controlled boost DC/DC converter to drive the gate of an N-channel MOSFET. The power MOSFET operates in its linear region for only a few milliseconds at power-up (to prevent damage), just long enough to precharge the capacitors. Another solution is a resistor and a low-power on/off MOSFET just to precharge the capacitors.
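Roughly, a resistive precharge path bounds the peak inrush at V/R and charges the bulk capacitance with time constant R·C. The component values below are illustrative guesses, not taken from the actual board:

```python
# Precharge through a series resistance R into bulk capacitance C:
# peak inrush is V/R, and the capacitor reaches ~99% of the rail
# voltage after five time constants (5 * R * C).
def precharge(v_rail: float, r_ohms: float, c_farads: float) -> tuple[float, float]:
    peak_a = v_rail / r_ohms
    t99_s = 5 * r_ohms * c_farads
    return peak_a, t99_s

# Hypothetical values: 12 V rail, 10 ohm precharge path, 2000 uF of GPU bulk caps.
peak, t99 = precharge(12.0, 10.0, 2000e-6)
print(f"peak {peak:.1f} A, ~99% charged after {t99 * 1e3:.0f} ms")
```

Compare the 1.2 A bounded by the resistor with the tens of amps an uncontrolled hard switch can draw from the same rail.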

Without controlling the PERST# signal you could damage your GPU. If your GPU logic tries to start up during this power-starved period it could behave in undefined ways, causing all sorts of erroneous behaviour. Computers and complex logic use a power good signal of sorts for a very good reason and you should not ignore it.

I’ll fix this!!! A small 0.5 in² board glued onto the GPU board can handle this!

Further to this, disconnecting the GPU power is overkill when you could simply just assert PERST#.

I’ll test this. If it works I’ll abandon my current solution!!
The PERST# signal (and others) can be rerouted to the SENSE pin on the GPU’s 8-pin power connector using a small 0.5 in² PCB glued onto the GPU board.
I can put a small 8-pin microcontroller on this small board and use the SENSE pin (on the GPU’s 8-pin connector) to exchange data (asynchronous half-duplex serial using only one pin) with an external board connected to the host USB.
This brainstorming is valuable!!

My main concern with this is that you’re suggesting it as a “fix”, which, while it seems to work with your hardware configuration, means that if you provide a printed PCB and people start to use this “fix” we may start to see widespread motherboard and GPU failures due to people trying to use it on platforms that don’t take so kindly to such abuse of the bus.

This is the beauty of open source: someone more capable than me can (possibly) fix the issues. Anyway, no one will use my hardware once the AMD Polaris reset is fixed in software!!! :pray: As I said, this is a plan B. I’d like to be a beta tester of the VFIO reset fix on the RX-580.

Certainly it’s possible, which is why several AMD engineers are helping me implement this in a way that won’t cause any harm to the masses’ hardware.

I agree. I worked as an electronics engineer several years ago, but now I’m just a software developer. Believe me, I blew up several devices, high-power PSUs too; one time a large capacitor blew up in my face! Even engineers (even good ones) can fail!