MI100 pls help

Greetings all,

I recently got a heavy discount on an MI100 that was being excised. I am trying, and so far failing, to get it working as… well, anything at the moment.


Before I get into the annoying sob story and beg for assistance:

I modeled up and printed a cooling shroud for the unit that uses dual 40mm server fans.
Also uploaded to Printables.com
shroudV4.zip (6.6 MB)


Anyone can use them if they would like.

Now, down to business. Currently I am trying to use this card without vflashing it the way the MI25 thread does.
I am running into plenty of trouble:

  • the amdgpu driver probe fails to initialize the card with a -12 (ENOMEM) fatal error
  • passed through to a VM, it still fails to initialize
  • ROCm installs but does not detect any GPU, let alone a render device
  • A1111 (and other PyTorch builds) sees that there is an AMD device present and installs/builds for ROCm
  • many other misc. issues (man, I need to switch to Proxmox or TrueNAS SCALE)
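
For anyone hitting the same wall, the failures above surface with checks along these lines (just a sketch; 1002 is simply the AMD PCI vendor ID):

# Kernel log from the amdgpu probe - this is where the -12 error lands
dmesg | grep -i amdgpu

# Confirm the card at least enumerates on the bus and see which driver, if any, is bound
lspci -nnk -d 1002: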

Here are some interesting snapshots of the errors I am collecting. I can make syslogs available to anyone who would like them.


Unraid (the base system) just errors out, so I changed to passing the card to an Ubuntu VM instead.
I can see something, though:
[screenshot: chrome_2023-08-02_09-03-14]

Here is where I am at in the VM:




I have tried the last two ROCm builds and neither works. I also thought maybe the kernel was giving me guff, so I tried several other kernels (5.2.4, 6.2.9, 6.4.6) with no luck either.

SO… maybe someone can help me get this working? I am open to trying most anything, though I am not the code monkey I once was.


The reason for your discount is probably why you’re struggling with the card right now.

Assuming you have this card on a test-bed system, try plain Debian or SystemRescueCd (used to be Gentoo-based, now Arch IIRC) to see if any unRAID or Ubuntu special sauce fscks you up here. I’m not a coder and I don’t have any of the enterprise-grade GPU cards anyway, so I’m afraid that’s all the help I can give you. :heart_hands:

Okay,

Been a few LONG days. But progress!

After moving the card to a benchtop test system that matches the server hardware, I tried Ubuntu 22.04 LTS (same behavior), Debian 12.1.0 (same behavior), SLE15-SP4 (same behavior), and Ubuntu 20.04 LTS (same behavior).

At this point I was not feeling great about the card. I thought that maybe a portion of it was just bad and I would have to try to get it replaced.

Then something I had randomly read a few days before teased my brain. TL;DR: on AMD, IOMMU + SR-IOV have some special issues.
This is the advisory: IOMMU and AMD Instinct

So I started poking in the BIOS (should probably have done this days ago).

Hidden away in the CSM menu, I found something I had set a long time ago to get my SBA going as a boot device.

The original GPU (a Tesla P100) was booting in legacy mode, same as the SBA. It did not seem to care about that, and I was even able to use the card in the base system, in Docker containers, and in VMs without binding it to VFIO. Neat stuff.

Sadly, the Instinct card just won’t do that, it seems. So I tested that theory and set the card to UEFI mode.

Reboot… and well… no more errors!

Well this got me a bit excited. “The card was not bad” I said.

Sure enough, in 20.04 I could get ROCm working for the first time! So I thought I’d just install A1111 and it would just work. Right?

Kinda? While it certainly installed with ROCm, it just crashed constantly when trying to get any output. More on that later.

I thought it would be worth, at this point, moving the card back to the server and fixing my mistake in the CSM, then trying a few things with the Docker containers and VMs that I knew worked before the swap.

Boom, zero errors, and whoa: lots of data showing the card was alive now.
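
For anyone wanting to verify the same "card is alive" state, a couple of quick checks (a sketch; the ROCm tools assume ROCm is installed at the default /opt/rocm path):

# ROCm should now list the MI100 as a gfx908 agent
/opt/rocm/bin/rocminfo | grep -i gfx

# rocm-smi should report temperature, clocks, and VRAM usage for the card
/opt/rocm/bin/rocm-smi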

So I tested my Jellyfin Docker with an encode test. After a few minor docker-compose changes… boom, it worked, and I could use radeontop to watch the card transcoding.
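
The compose changes boiled down to handing the container the GPU’s device nodes. A rough docker run equivalent of what goes into the compose file (the image tag and the render-group lookup are assumptions on my part; in compose these become devices: and group_add: entries):

# Expose the DRM render nodes to the Jellyfin container for hardware transcoding
docker run -d --name jellyfin \
  --device /dev/dri:/dev/dri \
  --group-add "$(getent group render | cut -d: -f3)" \
  jellyfin/jellyfin:latest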

Okay… the card is working. Now to test via VMs and try to get a docker setup for A1111. But first… sleep… zzzzzzzzzzzz


You need to disable CSM completely.
Enable Above 4G Decoding.
It might be named MMIO.


So interesting info overall.

Thanks for the info GigaBusterEXE!

I tried to completely disable CSM for everything, but I was not able to boot back into Unraid since it was set up with a non-UEFI boot. So that was not an option for me, sadly.

Good news though: I don’t need to have CSM disabled completely. I just need to disable legacy (option ROM) booting for the GPU.

I am still testing VMs with passthrough.
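
Since passthrough is the next step, the usual sanity check is to see which IOMMU group the MI100 landed in and what shares it (the loop below is the common sysfs walk, nothing MI100-specific):

# Print every PCI device grouped by IOMMU group; ideally the MI100 sits in its own group
for d in /sys/kernel/iommu_groups/*/devices/*; do
  g=${d#*/iommu_groups/}; g=${g%%/*}
  printf 'IOMMU group %s: ' "$g"
  lspci -nns "${d##*/}"
done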

Docker images, however, I have tested in many different ways, and they seem to Work Well™.

The interesting things I found are:

  • Use the ROCm Docker image that AMD provides (rocm/rocm-terminal:latest); an example run command is just after this list
  • If using A1111 (either HID or ML), make sure to use the Navi 3 settings, as Navi 1 will not work!
  • The ML version of A1111 (here) is at near performance parity with the standard A1111 repository
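
For reference, the flags that matter when launching that image are the usual ROCm container set (a sketch of roughly what AMD’s docs show; adjust if your setup differs):

# Expose the ROCm compute (kfd) and render (dri) device nodes to the container
docker run -it --rm \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --security-opt seccomp=unconfined \
  rocm/rocm-terminal:latest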

I am working on getting SDXL training to work but I am not there yet.
I am also trying out a more efficiently coded SD renderer (VoltaML) that appears promising.

Here are some images from the HID and ML A1111 rendering I did earlier this week.




Here are some outputs from the testing





I also tested upscaling, and the AMD cards do way better at upscaling defect reduction than Nvidia.

If anyone has gotten ROCm training working please let me know so I can play with that.

Back to the saltmines for me. Time to test the VMs more and get openCUDAML working for some interesting projects.

Feel free to ask me any questions, as this is not really a guide and I went through many other issues not outlined here while trying to get Navi 1 code working.

I forgot to upload my start script, as I could not use the main one that is often used for A1111.

Here it is:

sudo TORCH_COMMAND="pip install --pre torch==2.1.0.dev-20230614+rocm5.5 torchvision==0.16.0.dev-20230614+rocm5.5 --index-url https://download.pytorch.org/whl/nightly/rocm5.5" python3 launch.py --precision full --no-half --listen

I just tossed it into a .sh file, made it executable and off to the races.
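
Concretely (start-a1111.sh is just the name I picked):

# save the launch command above as start-a1111.sh, then:
chmod +x start-a1111.sh
./start-a1111.sh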
