[Discuss] ZLUDA and CUDA

Hello world,

So what is the actual deal with CUDA? How can it be so important for so many workloads, yet apparently so insignificant that Intel and AMD are not interested in it?

What role does ZLUDA play, and is it going to improve overall platform performance on Linux?

What do you think about ZLUDA? Is it better to emulate than to offer competition?

Some notable links:

also we need an ‘intelgpu’ tag

In my mind I go: finally. I see this as a big reason why so many people keep wasting money on expensive nVidia cards, and maybe if this makes a comeback, people could finally buy the competition and split the market more evenly.
I have never really used, or rather never needed to use, CUDA for anything. I hear it is quite good for rendering shadows and light, LLMs and ML. I haven't really done much with any of those locally.
I've wanted to set up my own ML system for many reasons, but being in a poor financial situation, I haven't been able to.
Could ZLUDA help AMD in LLM and ML workloads?

1 Like

AMD (and Intel) are doing non-CUDA CUDA (= ROCm, OpenCL, Vulkan compute) to stay independent from Nvidia. They have to appear not to care, so as not to undermine their own offerings.

Since ZLUDA is not officially an AMD solution, they are probably skirting some legal BS. Having the option to (seamlessly) run one on the other's hardware is nice, but I would prefer one OPEN, INDEPENDENT standard.
[image: XKCD 927, “Standards”]

1 Like

Done

1 Like

Thinking about trying it to see how it does for F@H…

1 Like

One open standard is still one monopoly. I like that on Linux I can choose whether I want sudo or doas, X11 or Wayland, PipeWire or PulseAudio. There should always be at least one other option that does things a bit differently.

ZLUDA is a shim, not an independent product, so it will always be playing catch-up and dealing with inconsistencies introduced by nVidia (which of course would never introduce inconsistencies…).

  • Is it useless? No
  • Does it probably have a market? Yes
  • Does it give more leverage to nVidia? Yes
  • Does it provide a compelling place for Intel or AMD to invest? No

I’m sure Intel and AMD are more interested in spending their time improving performance with TensorFlow and PyTorch.

1 Like

Because CUDA just works. On any system I tried, no matter how obscure (Windows, Linux, bare metal, virtualized), as long as I had a working envy card, I could run CUDA on it. That isn't my experience with any of the competing products.

ROCm on Windows? Forget it. On consumer GPUs? Uhh, if you’re lucky. On Instinct? How can I even make a case for buying an Instinct card if I can’t test the software stack on consumer cards first?


Besides, half of the ROCm stack looks as if AMD engineers just took NVIDIA code and ran sed over it, cf. GitHub - ROCm/rccl: ROCm Communication Collectives Library (RCCL). (FFS, the code even has NVIDIA copyright disclaimers all over it.)

4 Likes

Parallel computing!

Intel has been trying to get into the CUDA space for a while; they've just failed at getting GPUs to where they would compete for compute. Knights Landing was essentially a cash fire, and now with Arc… not enough traction yet. Too soon to tell.

AMD has made attempts, but AMD is a hardware company “first” that in recent history (nearly) crawled out of bankruptcy, sold off its fabrication facilities, and focused on working for its customers: Sony PlayStation more so than Xbox, mining (geological, not crypto), and other more niche solutions.

So for 15 years, Nvidia has rammed CUDA down the throats of scientists, college students, creatives (video, audio, 3D modeling), and finance nerds, which has fostered solutions that are very popular in the space.
PyTorch, SciPy, the Adobe suite, and AutoCAD have all benefited from being “Works with CUDA” or “Accelerated with CUDA”, while competing platforms like OpenCL suffered from bad decisions: OpenCL 2.x flopped, and the Khronos Group essentially scrapped it, returning to OpenCL 1.2 as the reflection point for OpenCL 3.0.

So it's not that AMD and Intel are not interested in parallel computing; they are. It's just that they are approaching it differently, and the open standard that should have competed was a mess.

AMD introduced HSA into their APUs with Kaveri. It worked great, but no one used it. The problem was that AMD was in a bad place financially and not popular in any space. It was/is a great technology at a bad time for the company. BTW, LibreOffice with hardware acceleration could fly through large spreadsheets because it could leverage the GPU to do math as well. Now with ROCm, AMD is really showing it can do these tasks. Time will tell.

This is not a Linux-specific thing. Will it improve things? Of course it will, if it is used. OpenCL making a comeback has a far larger role to play: I look at Darktable, Inkscape, GIMP, and Krita as the main users of what a great OpenCL implementation can provide on Linux. GIMP just spent a couple of years converting their NDE (non-destructive editing) extensions and filters to work with OpenCL. Inkscape could leverage OpenCL better for LPEs (Live Path Effects), along with OpenGL for redraws and handling images with tons of paths. We all know Blender needs OpenCL/ROCm/HSA for Cycles rendering.

Well, getting PyTorch working on AMD will be great, and it'll get more people to buy AMD, but there's still a lot of work to be done. Having teams working on code for application-specific optimizations is expensive. AMD is not “rolling in cash”; although they are better off now than they were 10 years ago, they are clearly far behind. Open standards and open collaboration are how AMD can compete.

Weeeeeell… not really, seeing as the X11 team are the developers of Wayland, and X11 is getting nothing beyond security patches at this point. And PipeWire is not a replacement for PulseAudio or JACK but a middleman/broker for both.


You have some valid points. I still do not understand the business model for Instinct cards. They should be selling them to consumers; creatives would jump on them if they were available, IMO.
ROCm on Linux is readily available now, but the caveat is that without a properly supported card it takes some hoops to get going. I have been on other forums recently, helping people get ROCm <5.x working on APUs and GPUs. Any AMD card before gfx9xx is a dice roll: ROCm 3.x works, but you need to first find it on the web and compile it yourself… Look at what this guy did with an APU

2 Likes

HIP is the switch between CUDA and ROCm: the same HIP source compiles for either backend.

We do still have to fix things, but no programmer is going to move over if they have to rewrite all their code.
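
For what it's worth, PyTorch itself illustrates this: its ROCm builds are hipified, so the torch.cuda namespace is identical on both backends and CUDA-targeting scripts run unchanged. A minimal sketch (hypothetical file name, assuming either a ROCm or a CUDA build of PyTorch):

backend-check.py
import torch

# 'cuda' means "the GPU" even on a ROCm build, because the HIP backend
# is exposed through the torch.cuda namespace.
assert torch.cuda.is_available()

# torch.version.hip is a version string on ROCm builds and None on CUDA builds.
if torch.version.hip:
    print('backend: HIP/ROCm', torch.version.hip)
else:
    print('backend: CUDA', torch.version.cuda)
print('device :', torch.cuda.get_device_name(0))

# Unmodified "CUDA" code path:
x = torch.rand(1024, device='cuda')
print((x * 2).sum().item())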

They should do what Xilinx did: offer versions of the fanless cards, except with a fan integrated, for workstation use.

Offering compute at “not Ngreedia expensive” prices, even if it meant either ZLUDA or re-learning how to write code, would surely gain market share quickly.

Depends on budget. If the budget for a “new big cluster” is whatever it takes, then CUDA is the choice. When “big cluster for 80% of the cost” sounds right, then “stop crying and make it work” will be acceptable.

I have been trying to get PyTorch with ROCm to compute at least a single convolution on gfx1010 (RX 5700 XT). So far I've nailed it down to a specific docker image, with I think ROCm 5.3 and PyTorch ~1.12, that seems to work with the gfx override hack; the hack no longer works on current builds (I guess they started using features that are present on gfx1030 and not on 1010). I might try slightly newer revisions, but every trial equals a roughly 30 GB docker image download. FFS, compare that with the CUDA images.

Are you on Linux? What distro are you using?

Of course on Linux; ROCm only works on Linux (or at least it only did when I first did my research).

Tested Ubuntu 22.04 and Arch.

1 Like

FTR, I’m testing this very simple script:

torch-test.py
import os
import torch

assert torch.cuda.is_available()

print('HSA_OVERRIDE_GFX_VERSION =', os.environ.get('HSA_OVERRIDE_GFX_VERSION', None))

device = torch.device('cuda')

input_ = torch.rand(1024, device=device)
print("input:", input_.cpu())

# first model: plain Linear/ReLU stack (exercises the GEMM path)
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 2048),
    torch.nn.ReLU(),
    torch.nn.Linear(2048, 16)
).to(device)
print("model-lin:", model)

print(model(input_).cpu())

# second model: 1-D conv stack (the layer where the hangs show up)
model = torch.nn.Sequential(
    torch.nn.Unflatten(0, (4, 256)),
    torch.nn.Conv1d(4, 8, 17),
    torch.nn.MaxPool1d(8),
    torch.nn.ReLU(),
    torch.nn.Flatten(0, -1),
    torch.nn.LazyLinear(16)
).to(device)
print("model-conv:", model)

print(model(input_).cpu())

The run command is:
docker run --rm -it \
  --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  --device=/dev/kfd --device=/dev/dri --group-add video \
  --ipc=host --shm-size 8G \
  -v ~/code/torch-test.py:/torch-test.py \
  --env HSA_OVERRIDE_GFX_VERSION=10.3.0 \
  rocm/pytorch:$TAG python3 /torch-test.py

And the recent results (remember, 30G download for each image!) are as follows:

rocm/pytorch:latest - crash & kernel panic
rocm/pytorch:rocm5.7_ubuntu20.04_py3.9_pytorch_1.12.1 - HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION
rocm/pytorch:rocm5.6_ubuntu20.04_py3.8_pytorch_1.12.1 - hangs
rocm/pytorch:rocm5.5_ubuntu20.04_py3.8_pytorch_2.0.0_preview - hangs
rocm/pytorch:rocm5.5_ubuntu20.04_py3.8_pytorch_1.12.1 - works
rocm/pytorch:rocm5.4.2_ubuntu20.04_py3.8_pytorch_2.0.0_preview - hangs
rocm/pytorch:rocm5.4.1_ubuntu20.04_py3.7_pytorch_1.12.1 - works
rocm/pytorch:rocm5.3.2_ubuntu20.04_py3.7_pytorch_1.12.1 - works

works - the script exits with “a” tensor (I haven't even checked it for correctness)
hangs - the script doesn't exit within 1 minute, CPU at 100%
If it hangs, it always stops at executing the Conv1d layer.
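
To narrow it down further, a minimal repro of just that layer would be something like this (a sketch with a hypothetical file name, same docker setup as above, untested beyond the runs listed):

conv-repro.py
import torch

device = torch.device('cuda')

# Batched (N, C, L) input with the same shapes the test script uses.
x = torch.rand(1, 4, 256, device=device)
conv = torch.nn.Conv1d(4, 8, 17).to(device)

y = conv(x)                # on the broken images this is where it stalls
torch.cuda.synchronize()   # force the conv kernel (MIOpen) to actually run
print('conv output shape:', tuple(y.shape))   # expect (1, 8, 240)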

$ uname -a
Linux [hostname] 6.4.3-060403-generic #202307110536 SMP PREEMPT_DYNAMIC Tue Jul 11 05:43:58 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$ apt info amdgpu-dkms
Package: amdgpu-dkms
Version: 1:6.3.6.60000-1697589.22.04
Priority: optional
Section: misc
Maintainer: Advanced Micro Devices (AMD) <[email protected]>
Installed-Size: 469 MB
Provides: rock-dkms
Depends: dkms (>= 1.95), libc-dev | libc6-dev, autoconf, automake, initramfs-tools, shim-signed, amdgpu-dkms-firmware (= 1:6.3.6.60000-1697589.22.04)
Conflicts: rock-dkms (<< 1:6.3.6.60000-1697589.22.04)
Breaks: rock-dkms (<< 1:6.3.6.60000-1697589.22.04)
Replaces: rock-dkms (<< 1:6.3.6.60000-1697589.22.04)
Download-Size: 10.7 MB
APT-Manual-Installed: yes
APT-Sources: https://repo.radeon.com/amdgpu/latest/ubuntu jammy/main amd64 Packages
Description: amdgpu driver in DKMS format.
$ apt info rocm
Package: rocm
Version: 6.0.0.60000-91~22.04
Priority: optional
Section: devel
Maintainer: ROCm dev support <[email protected]>
Installed-Size: 13.3 kB
Depends: rocm-utils (= 6.0.0.60000-91~22.04), rocm-developer-tools (= 6.0.0.60000-91~22.04), rocm-openmp-sdk (= 6.0.0.60000-91~22.04), rocm-opencl-sdk (= 6.0.0.60000-91~22.04), rocm-ml-sdk (= 6.0.0.60000-91~22.04), mivisionx (= 2.5.0.60000-91~22.04), migraphx (= 2.8.0.60000-91~22.04), rpp (= 1.4.0.60000-91~22.04), rocm-core (= 6.0.0.60000-91~22.04), migraphx-dev (= 2.8.0.60000-91~22.04)
Homepage: https://github.com/RadeonOpenCompute/ROCm
Download-Size: 858 B
APT-Manual-Installed: yes
APT-Sources: https://repo.radeon.com/rocm/apt/debian jammy/main amd64 Packages
Description: Radeon Open Compute (ROCm) software stack meta package

Host ROCm stack version shouldn't matter with dockerized PyTorch, though, as the container only relies on the kernel driver.
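
To double-check what a given container is actually built against (a quick sketch with a hypothetical file name; torch.version.hip is the ROCm/HIP userland baked into the wheel, independent of the host stack):

version-probe.py
import torch

print('torch:', torch.__version__)
print('hip  :', torch.version.hip)   # ROCm/HIP version the wheel was built with
print('gpu  :', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'none')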

Something seems off… Gimme a bit to look this over. :thinking:

Keep in mind that gfx1010 was never officially supported and was always kept in the gray area:

But neither were gfx8xx and gfx9xx (fully, partially, etc.), yet there have been ways to get them to work. Even the APUs (gfx9xx+) are working; it needs some tweaking, of course. If you are able to, building PyTorch from source with all the necessary flags and ROCm info might be an option.

:joy: I’ve even tried building the whole ROCm stack but that is also not officially supported and, let’s just say, there were many roadblocks.


… okay, not only not supported, but also very undocumented.

IIRC a torch-only rebuild wasn't worth it, because the missing parts were mostly in rocBLAS/Tensile and MIOpen.

gfx9xx were better supported because the Instinct MI cards are GCN/CDNA, i.e. gfx906/gfx908/gfx90a, cf. GPU Support and OS Compatibility (Linux) — ROCm 5.6.0 Documentation Home.
No RDNA (1) card was ever on the support list, because CDNA and RDNA diverged pretty heavily.
RDNA2 cards (RX 6xxx) were unofficially supported because we got RDNA2-based PRO cards (V620 and W6800).

1 Like

AMD could sell a ton just as content-creator “render boxes”; instead it's just easier to build a GeForce/Quadro render box for a specific task.
It's doubtful ZLUDA can hold up the shim translation layer; the next version(s) of CUDA could rely exclusively on Tensor cores, and AMD doesn't have those on consumer GPUs (yet).