Letter to AMD: Ongoing AMD hardware/software/firmware problems

Let’s try to get AMD to fix things again!

https://www.reddit.com/r/Amd/comments/1bsjm5a/letter_to_amd_ongoing_amd/

13 Likes

What’s the tldr?

Or does it need people to go there and sign-up/vote/comment?

2 Likes

Telling people to upvote a thread on reddit can get the thread removed, so no, I am not telling anyone to vote on it.

The TLDR, though, is the issues with GPU recovery after a crash, for both VFIO/VM passthrough and ROCm use cases.

That, and the failure on AMD’s part to see the importance of these issues, even after 5+ years of trying to make them aware of the problem.

Fixing this problem would benefit not only VFIO/ROCm use cases but everyone, as AMD GPUs would be more recoverable after a fault, even under gaming workloads.

Thanks man, is it worth bumping a couple of the vfio guide/discuss threads with the link & call to action for non regulars who might be watching them?

Or does that needlessly spam?

2 Likes

It’s entirely up to you, I do not want to advise either way because, obviously, I am biased and want my letter to succeed.

I’m all for it. I wish AMD would get their act together, but this post by George Hotz (and others in the thread) dampens my hopes of that happening in any reasonable time frame.

I have had multiple replies from Lisa Su on both Twitter and by e-mail. It doesn’t help.

AMD is structurally incapable of fixing these issues. They don’t even have a 7900XTX in CI. Crashing the firmware is so trivial there’s no way anyone fuzzed anything.

They debug by the application, adding mitigations at many layers of the stack to make it work. While this strategy is fine for the 20 mainstream games that come out per year, unless you root cause issues and have a good CI, you’ll never build a stable GPU for general compute.

6 Likes

To be honest, I feel much the same. It was a completely random coincidence that I posted this today, as someone pointed out to me (after I had posted this) that George had offered a bounty if I could fix the reset on the 7900xt.

Spent most of the day trying to repro the crash geohot has so I can get started on the issue (not for the bounty though, but that is a nice gesture).

On a more positive note, I managed to get my hands on an AMD MI100 and have been testing and adding support for this GPU to vendor-reset.

4 Likes

Well, that’s a rather damning statement from the mouths of babes (figuratively). Also, this snippet is amazing commentary:

From a business side, tiny corp is selling boxes with both AMD and NVIDIA, we’ll let the user choose how much they value money vs pain.

I am not a Looking Glass user, but I bought a 7900xtx last month for raw performance and better Linux support.

Is the lack of bus reset why the system acts oddly after driver crashes? It happened to me just now: video decoding had an oopsie moment, and browser hw acceleration did not recover correctly after the resulting driver restart.

Also wtf, how the hell can that happen on the newest, freshly updated drivers?

1 Like

For the most part, yes. It’s called a bus reset because it’s triggered by the bus; it is a way to signal the GPU to perform a total and complete reset of its internal state machines, clear memory, etc.

When a bad driver, a bug, or whatever else crashes the device, all bets are off on its stability unless the driver knows exactly what went wrong inside the device and exactly how to recover it to a working state.

If AMD did the sane thing and just reset the PCI device entirely when these things happened, not only would the code be far simpler, but the GPU would also return to a 100% functional state.

I have seen my 7900xt do this in a VM when it fails to load properly, again due to failure to properly reset.
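For anyone who wants to poke at the reset path described above: on Linux, the kernel exposes a per-device `reset` file in sysfs when it has some reset method for that PCI function, and writing `1` to it asks the kernel to perform the reset. Below is a minimal sketch, assuming a Linux host and root privileges. The PCI address `0000:03:00.0` is only a placeholder, and the helper names (`reset_node`, `can_reset`, `reset_device`) are made up for illustration. Which method the kernel actually uses depends on the device and kernel version, so this is not guaranteed to be a true bus reset, and it will not by itself fix a GPU whose reset is broken.

```python
from pathlib import Path

# Standard location of PCI devices in Linux sysfs.
SYSFS_PCI = Path("/sys/bus/pci/devices")


def reset_node(bdf: str, root: Path = SYSFS_PCI) -> Path:
    """Return the sysfs 'reset' file for a PCI address such as 0000:03:00.0."""
    return root / bdf / "reset"


def can_reset(bdf: str, root: Path = SYSFS_PCI) -> bool:
    """True if the kernel exposes a reset file for this device."""
    return reset_node(bdf, root).exists()


def reset_device(bdf: str, root: Path = SYSFS_PCI) -> None:
    """Ask the kernel to reset the device by writing '1' to its reset file.

    Needs root on a real system. The device should not be in active use.
    """
    node = reset_node(bdf, root)
    if not node.exists():
        raise RuntimeError(f"no reset method exposed for {bdf}")
    node.write_text("1")


if __name__ == "__main__":
    gpu = "0000:03:00.0"  # placeholder address, find yours with lspci -D
    print(gpu, "resettable:", can_reset(gpu))
```

The `root` parameter exists only so the helpers can be pointed at a fake directory tree for testing instead of the real sysfs.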

3 Likes

I have never done any actual troubleshooting into this, but I always assumed it was an issue with the firmware initialization that AMD uses. It is possible for an AMD GPU to fail at boot on real hardware and then not only fail to boot but also fail to reset until the machine is hard powered off and on again.

I am a hobby miner, so again, I have never delved into the reset aspect specifically, but I have seen a LOT of different AMD cards and have had to do some odd things to use some of them reliably.

1 Like

And what will get amd’s attention?

The enterprise.

Would be great if Wendell did a video on the main channel covering this error.

3 Likes

It’s wild to me that they don’t have the card in a CI setup. That can’t possibly be right; surely it’s just the CEO not knowing what’s going on in the trenches, right?

I’m in the same camp, though. I’m all for competition and supporting open source and the underdog, but I think it’s probably going to be Intel who’s actually competitive in the coming years. Tired of AMD, a billion dollar company, using the open source community as a QA department. My Nvidia cards on Linux never crashed my whole system, and I get a fault at least once a month with a 6900xt…

2 Likes

So it seems to have stirred some people at AMD. I have been invited to the AMD Vanguard Program and sent the following:

Thank you so much. I’m syncing with several of my colleagues familiar with this issue (John Bridgman included).

I’ve been aware of this problem since before I was an employee at AMD. I’ll be honest, this particular area is out of my remit, though I fully agree with you that this reset behaviour is unacceptable. I can’t make any guarantees at this point on resolution but I’d like to do whatever I can to help get this addressed.

It’ll take me a little while to get everybody up to speed. In the interim, I’ll have an invitation to the Vanguard program delivered to you via email. If you run into any issues at all, please reach out.

10 Likes

I was just about to post the geohot rants but I see you already mentioned it in the letter.

Hopefully the vanguard program works out!

1 Like

I am slightly hopeful about this. Given that many of AMD’s recent innovations came from the trenches, at least the company seems to be permeable from the bottom up. It’s great this caught a few employees’ attention, and I hope that this time more comes from contacts of this kind.

1 Like

I feel like this is somewhat relevant, as it is a “problem” (nothing is broken, but performance is severely affected) with AMD CPUs:

Whatever the case is, AMD’s poor handling of super-alignment plagues a lot of applications that naturally align to powers-of-two. This is the case for applications that rely heavily on FFTs (as is the case for y-cruncher). y-cruncher already goes to somewhat extreme lengths to avoid super-alignment hazards, but they still keep popping up on a regular basis - and only on AMD chips. Rarely is it this bad though. I’m not sure what Intel does that’s so much better.

While I haven’t tested on anything other than Zen4 and Intel Skylake, it’s likely that this hazard occurs on all AMD processors going all the way back to Bulldozer since I’ve seen similar behavior on AMD Piledriver.

from News (2023)

I know CPU design is hard, especially dealing with the specifics of cache aliasing (honestly, cache design is very hard in general, with problems of coherency, snooping, latency, etc.), and thus I understand this is a very non-trivial thing to fix. It would still be nice if it were fixed, as Intel not having the problem indicates it is definitely solvable.
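To illustrate why power-of-two alignment can be a hazard at all, here is a toy model of a set-associative cache. It is a sketch under assumed parameters (64-byte lines, 64 sets, 8 ways, i.e. a 32 KiB cache), not a description of any specific AMD or Intel design, and the quoted post does not say which structure is actually responsible. The function names are made up for illustration. In this model, addresses that are 4096 bytes apart all land in the same set, so more than 8 of them evict each other, while a slightly padded stride spreads them out.

```python
# Assumed cache geometry for illustration only.
LINE_BYTES = 64   # bytes per cache line
NUM_SETS = 64     # 32 KiB / (64 bytes * 8 ways)
WAYS = 8          # lines that fit in one set


def set_index(addr: int) -> int:
    """Set that an address maps to in this simplified model."""
    return (addr // LINE_BYTES) % NUM_SETS


def sets_touched(base: int, stride: int, count: int) -> set[int]:
    """Distinct sets hit by `count` accesses spaced `stride` bytes apart."""
    return {set_index(base + i * stride) for i in range(count)}


if __name__ == "__main__":
    n = 16
    aligned = sets_touched(0, 4096, n)               # power-of-two stride
    padded = sets_touched(0, 4096 + LINE_BYTES, n)   # padded by one line
    print(f"stride 4096: {n} accesses share {len(aligned)} set(s), ways = {WAYS}")
    print(f"stride 4160: {n} accesses spread over {len(padded)} set(s)")
```

Padding the stride like this is one general way software can sidestep such collisions in a model like this one; whether it matches what y-cruncher does is not stated in the quote.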

2 Likes

SMH, the one thing AMD needs to do is improve its software stack. The Big Nerds in AI, LLMs and others are trying to move to AMD since Nvidia has gone hostile towards those with multiple vendors. AMD needs to hurry up or just open source what they can from the firmware.

3 Likes

I thought nvidia was no longer a graphics company? i.e. I thought they are repositioning as creating hardware for AI, which just incidentally works for gaming as well.

I have a 7900XTX currently and have a couple games that cause this GPU reset consistently. I hope this gets actually looked at at AMD as it led me to discover how ubiquitous it is.

I had Nvidia cards only for the last 15 years or so, the last 2 years running Linux day to day and my last nvidia card was rock solid. I’m disappointed as everyone said AMD was so much better on Linux, and at the time had much better wayland support. I hope they get their act together. Tighter competition with Nvidia would be a good thing for all of us. Intel has a lot of ground to cover to catch up but they are making some decent strides on the GPU side.

2 Likes

The thing about Intel is that they have a bunch of engineers and developers on staff writing the things that make it all tick. OneAPI will gain ground quickly, and with the GPUs they have and how accessible they are, do not be surprised if they catch up to AMD. They are also open source, and OneAPI is just the icing on the cake.