Letter to AMD: Ongoing AMD hardware/software/firmware problems

235SAS · April 2, 2024, 1:50pm

They also have Gaudi, which is a separate architecture from the GPU/Xe stuff, and is already somewhat decent and holds a lot of promise IMO. Hopefully there is more competition in both the GPU space and the AI space, and it seems like Intel will have offerings for both.

We anticipate that with further optimization, Gaudi 2 will soon outperform A100s on this model. In earlier tests on our SDXL model with base PyTorch, Gaudi 2 generates a 1024x1024 image in 30 steps in 3.2 seconds, versus 3.6 seconds for PyTorch on A100s and 2.7 seconds for a generation with TensorRT on an A100.
The higher memory and fast interconnect of Gaudi 2, plus other design considerations, make it competitive to run the Diffusion Transformer architecture that underpins this next generation of media models.

I do agree on them doing a LOT for open source, but I am also wary because of their history, like their compiler, which they were court mandated to say may not perform as well on AMD processors, among many other things. Honestly though, NVIDIA sets the bar very low with just how greedy they are (case in point: the whole VDI and vGPU licensing situation, etc.).

gnif · April 2, 2024, 5:03pm

Intel also actually publicly document their hardware so we can debug/fix/improve on their work.

https://www.intel.com/content/www/us/en/docs/graphics-for-linux/developer-reference/1-0/overview.html

On this front alone Intel is already winning.

NukeDukem · April 2, 2024, 6:33pm

I have no face and I must scream.

Not because this is news to me, but because it has been going on for such a long time that I have a really hard time understanding why. In my opinion the only reasons left are:

a braindead empty suit(s) in the middle management that clearly needs to go
they don’t give a fuck and only pretend for PR
backroom deal with NVIDIA

They are a billion dollar company that wants to design, produce and sell high-end compute solutions, and yet they fail on such a basic level. I wished in the past that Jim Keller would be hired at AMD again, this time in RTG, to do some house cleaning, but sadly he is deep into the Tenstorrent game now.

I’ve watched how they rolled out ROCm/HIP support in Blender and how broken and/or unstable it is to this day, especially on Linux.

Also @gnif could you quote or screenshot the original reddit post? Some people behind VPNs or with niche browsers cannot access reddit at all.

235SAS · April 2, 2024, 6:44pm

Also feel like this is worth mentioning: AMD Confirms Vapor Chamber to Blame for Radeon 790... - AMD Community. I know nothing about vapor chamber manufacturing, but I know plenty of products use them and this is the only time I’ve seen this issue being reported.

NukeDukem · April 2, 2024, 6:48pm

Though this is a serious fuckup, it is not even close to fucking up multiple years in a row and not upgrading CI/CD and automated hardware-software testing for their flagship products.
The vapor chamber is a one-time thing that can be blamed entirely on a rushed launch. On the other hand, ROCm/driver problems date back years if not decades.

235SAS · April 2, 2024, 6:50pm

Well, my point about that was that the issue is very quick and easy to replicate on affected cards. It is just another indication that they don’t do thorough and sufficient quality assurance. They apparently only tested cards in a single orientation, even though both horizontally and vertically mounted GPUs are very common; otherwise this issue would have been caught.

rrubberr · April 2, 2024, 9:24pm

I’ve been following this topic over the past few weeks, and epiphanies from @gnif, George Hotz et al. characterize AMD products as a broken experience on every level. Software, firmware, and hardware have all been implicated in this thread alone.

There is clearly a lot of angst and anger surrounding the state of AMD GPUs, and much of it seems to be coming from developers and fans of the brand.

As a full-time Linux user who doesn’t use AMD, my question is:

Is the (sometimes rabid) promotion of AMD hardware in open-source communities harmful to newcomers? And are these issues impactful to less-savvy non-power users?

wendell · April 2, 2024, 9:34pm

On the whole? No. In that scale, it’s a bit of a tempest in a teacup.

So from my point of view, I’ve had a bit better luck with 7000 series GPUs… they are more temperamental and picky than their older 6000 series cousins. What’s frustrating is that they are so good, and so close to being perfect, that it’s just a couple of warts that will drive you insane. It’s like having a beautiful hardwood floor that occasionally gives you splinters.

That you also aren’t allowed to fix.

I, personally, have a better experience on Linux with AMD than with Nvidia, even for the VFIO use case. The exceptions to that have been vGPU unlock, which came at the end of the product cycle for those cards and is not universal (ho hum), and the license bypass for older-ish Tesla-type cards. That’s kind of a lot of fun I suppose.

Arc gpus on linux have been awesome, and also, frustrating, in different ways. If anything Arc and their progress really underscores how nearly-at-parity AMD and nvidia are for a lot of things (and how far Arc has yet to go).

AMD’s hardware approach seems good, warts aside, and what they lack in optimization they make up for in brute force.

Arc has an absolutely top notch software team behind it, but maybe suffers from some weird executive dysfunction or hyperfocus on the wrong things? It’s so bizarre. Ponte Vecchio is legit bananas… but probably we didn’t see much more about it because it’s crazy expensive to build? Would be my guess? Meanwhile it looks like AMD is going to be able to do fast GPUs without GDDR7 because those crazy clever mofos?

It is an interesting time in that the hardware is so complex that… in the past what would have been trivially reverse-engineered… is harder to get at. Hence all this angst. Nvidia, for their part, has been pretty explicit from day one that No, No Soup For You. And somehow that is a better outcome than AMD’s half-measure of… well we don’t want to help you but if you figure it out… okay then.

y’know? It’s weird.

GigaBusterEXE · April 2, 2024, 10:22pm

Doing My Part GIFs - Find & Share on GIPHY

aBav.Normie-Pleb · April 3, 2024, 7:49am

For me the first sign that there was something not quite right at AMD was the Ryzen 3000 launch in 2019:

With early AGESA versions the random number generator (RDRAND) was broken, causing, for example, ESXi and some Linux distributions (like Fedora) to always (!) crash during the boot process.

That means that during all of their testing and development phases, no one at AMD even once tried to boot these operating systems before launching the parts.
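For illustration, a sanity check of the kind that would flag a stuck hardware generator can be sketched in a few lines of Python. This is only a sketch under assumptions: `looks_stuck`, `broken_rng` and `healthy_rng` are hypothetical names, the two sources stand in for a real RDRAND read, and the constant all-ones value mirrors how the failure was widely reported rather than anything stated in this post.

```python
import random
from typing import Callable


def looks_stuck(read_u32: Callable[[], int], samples: int = 16) -> bool:
    """Return True if every one of `samples` reads yields the same value."""
    first = read_u32()
    return all(read_u32() == first for _ in range(samples - 1))


# Hypothetical stand-in for a broken generator: always the same 32-bit value.
def broken_rng() -> int:
    return 0xFFFFFFFF


# Hypothetical stand-in for a working generator (seeded so runs are repeatable).
_prng = random.Random(0)


def healthy_rng() -> int:
    return _prng.getrandbits(32)


if __name__ == "__main__":
    print("broken source stuck: ", looks_stuck(broken_rng))
    print("healthy source stuck:", looks_stuck(healthy_rng))
```

A check this small would run in well under a second on every boot, which is the point being made above: the failure was not subtle.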

Another story is that I’m still a bit salty about the “HDMI 2.1” false advertising for the Ryzen 4000G and 5000G APUs (you can’t go to UHD/60 Hz with more than 8 bits per channel of color without chroma subsampling). That forced me to get rid of already built 2D graphics/video systems hooked up to LG OLED TVs, since they didn’t actually fulfill their purpose.
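The arithmetic behind that limit can be sketched as follows, assuming the port is effectively capped at the 18 Gbit/s TMDS rate of HDMI 2.0 (an assumption for illustration, not something stated in the post) and using the standard CTA-861 timing for 3840x2160 at 60 Hz. The function name `tmds_gbps` is hypothetical, and only 4:4:4 and 4:2:0 are modelled.

```python
HDMI20_MAX_TMDS_GBPS = 18.0  # assumed cap: HDMI 2.0 TMDS, 3 lanes x 6 Gbit/s

# CTA-861 timing for 3840x2160 @ 60 Hz, including blanking intervals
H_TOTAL, V_TOTAL, REFRESH_HZ = 4400, 2250, 60


def tmds_gbps(bits_per_channel: int, chroma: str = "4:4:4") -> float:
    """Approximate TMDS wire rate in Gbit/s for UHD at 60 Hz."""
    if chroma not in ("4:4:4", "4:2:0"):
        raise ValueError("only 4:4:4 and 4:2:0 are modelled in this sketch")
    pixel_clock_hz = H_TOTAL * V_TOTAL * REFRESH_HZ  # 594 MHz
    # Deep colour scales the TMDS clock by bpc/8; 4:2:0 halves it.
    tmds_clock_hz = pixel_clock_hz * bits_per_channel / 8
    if chroma == "4:2:0":
        tmds_clock_hz /= 2
    # Three data lanes, 10 bits on the wire per 8-bit character.
    return tmds_clock_hz * 3 * 10 / 1e9


if __name__ == "__main__":
    for bpc, chroma in [(8, "4:4:4"), (10, "4:4:4"), (10, "4:2:0")]:
        rate = tmds_gbps(bpc, chroma)
        verdict = "fits" if rate <= HDMI20_MAX_TMDS_GBPS else "does not fit"
        print(f"{bpc} bpc {chroma}: {rate:.2f} Gbit/s, {verdict}")
```

Under these assumptions 8-bit 4:4:4 just squeezes in, 10-bit 4:4:4 does not, and 10-bit only works once chroma is subsampled, which matches the behaviour described above.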

foppe · April 3, 2024, 9:27pm

Huh. https://twitter.com/amdradeon/status/1775261152987271614

As community interest grows in ROCm on Radeon, we’ve created a tracker to capture feedback and provide updates.

Coming soon: Open sourcing additional portions of our software stack and more hardware documentation.

This almost sounds hopeful…

gnif · April 4, 2024, 7:55am

Just to be clear, I am not angry with AMD. I fully understand that these are issues that were likely not even envisioned; as such, no standardised testing was devised to ensure these features existed, or worked as we need them to.

It’s only because of ROCm that these issues are becoming more important. If one accelerator in a compute node needs a reset for any reason, the entire node has to be restarted, halting all the other work it was doing.
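As a rough illustration of what "resettable" means from the host side on Linux, the sketch below looks at the PCI sysfs attributes `reset` and `reset_method` for one device. The helper name `pci_reset_info` and the example address `0000:03:00.0` are hypothetical, and `reset_method` only exists on newer kernels. Note that the kernel advertising a reset method says nothing about whether the device actually recovers after it, which is the problem being discussed here.

```python
from pathlib import Path


def pci_reset_info(bdf: str, sysfs_root: str = "/sys/bus/pci/devices") -> dict:
    """Report which reset hooks the kernel exposes for one PCI device."""
    dev = Path(sysfs_root) / bdf
    reset_method = dev / "reset_method"
    # reset_method lists the methods the kernel would try, e.g. "flr bus".
    methods = reset_method.read_text().split() if reset_method.is_file() else []
    return {
        "present": dev.is_dir(),
        "resettable": (dev / "reset").exists(),
        "methods": methods,
    }


if __name__ == "__main__":
    # Hypothetical address; substitute the GPU's real domain:bus:device.function.
    print(pci_reset_info("0000:03:00.0"))
```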

If these problems are fixed for VFIO and ROCm usage, they will also be fixed for your cheaper everyday GPUs, improving the experience for everyone.

While it’s important that the GPUs never crash, it’s also important to understand that they will; there will always be some edge case that was not thought of that can crash them out. I.e., what if a malicious actor rents a compute VPS and crashes out the GPU in the host in a way that allows a VM escape?

0xDE57 · April 4, 2024, 8:23am

sounds like the correct direction.

I hope amd invites geohot to the vanguard program too.

foppe · April 4, 2024, 4:50pm

I dunno, but I never got the impression AMD did much with that, and it doesn’t even sound that relevant given that they’re only talking about drivers and not firmware (which they probably can’t directly publish for their partners anyway?). Anyway, I mostly hope that this leads to AMD kicking the Radeon group until it ups its game when it comes to continuously testing its own code and hardware.

0xDE57 · April 5, 2024, 4:48am

Iron_Bound · April 5, 2024, 10:42am

I think the issue is management making time to fix issues. Unless you feel the pain, why would you allocate time to that instead of feature/optimization tasks?

Overall I want AMD to do well and us consumers to have competition in the market. Otherwise we’ll see Team Green’s next gen of gaming cards be more expensive, with the same amount of VRAM and fewer features, e.g. no P2P transfer.

PaintChips · April 5, 2024, 1:51pm

There are tons of developers who know Intel CPUs/IGPs better because that hardware is better documented. AMD’s GPU issues go back to ATI: many modern Radeon API calls date back to the early 2000s, and some of those low-level hardware APIs are still unknown. The Radeon IGP used in APUs layers its API differently across desktop and laptop chips; it may use the same “core”, but voltage management is a mess, which is the reason the Ryzen 8000G series is in driver hell.
Things aren’t that great on Team Green either; lots of the API is still heavily closed source.

From a driver angle both sides are equally bad in their own unique way. It took Nvidia until the RTX era to offer “GeForce Studio” as the “stable branch” and keep “GeForce” drivers as bleeding edge. AMD hasn’t done this; if you prefer a “stable branch”, you are forced onto a Radeon Pro GPU.

system · January 4, 2025, 7:51am

This topic was automatically closed 273 days after the last reply. New replies are no longer allowed.