Mi25, Stable Diffusion's $100 hidden beast

I see. Hopefully shared memory support will get added eventually; HIP/CUDA does support it, but there is no support in AUTOMATIC1111/torch. If you haven’t tried them yet, you can use the --medvram and --lowvram flags. Both carry a performance hit but will reduce memory usage.

Also, I discovered that you can use the HIP allocator instead of the torch one. It saves a few hundred MB with no or negligible performance penalty.

It can be enabled with:

export PYTORCH_CUDA_ALLOC_CONF="backend:cudaMallocAsync"

This allows me to do batches of fourteen 512x512 images instead of thirteen.
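
If you launch from a Python script rather than a shell, the same setting can probably be applied from Python. A minimal sketch, assuming the variable has to be in place before torch is first imported (the helper name enable_async_allocator is made up for illustration and is not part of torch or the webui):

```python
import os

ALLOC_VAR = "PYTORCH_CUDA_ALLOC_CONF"


def enable_async_allocator(env=None):
    """Select the cudaMallocAsync backend (hypothetical helper).

    `env` defaults to os.environ; passing a plain dict is handy for testing.
    Any previous value of the variable is overwritten.
    """
    env = os.environ if env is None else env
    env[ALLOC_VAR] = "backend:cudaMallocAsync"
    return env[ALLOC_VAR]


# Assumption: this runs before `import torch`, so the allocator setting
# is already in the environment when torch initializes.
enable_async_allocator()
# import torch  # import only after the variable is set
```

Whether this gains you an extra image per batch will depend on your card and resolution, so treat the numbers above as one data point.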

I tried this method and it worked, thanks! Now I just need to get a fan shroud adapter set up to get this into full-time use.

Does anyone have the following VBIOS? I think it would be ideal for those MI25 crossflashes:

techpowerup.com/gpu-specs/radeon-pro-ssg.c2998

Has anyone experienced “Memory access fault by GPU node-1 (Agent handle: 0x8b3e3c0) on address (nil). Reason: Page not present or supervisor privilege.
Aborted (core dumped)” with vladmandic’s fork? Base AUTOMATIC1111 also seems to halt for me, though without an error (it doesn’t close, it just hangs).

I’m running NixOS unstable, but I don’t think that should majorly affect it. I also have another GPU for my video output; I’ve tried swapping them, to no avail.

Edit: A1111 complains about not finding limits, then later crashes…

As an artist who has been using Stable Diffusion since its inception, I recently spent some time using a Mi25 and faced a few challenges.

One significant hiccup with Torch + ROCm 5.2 is that it limits VRAM usage to just 10GB. This is a problem, especially considering that larger batches offer speed benefits.

This issue seems to be addressed with ROCm 5.5. For these GPUs, you can obtain a pre-compiled version of torch and torchvision by pulling the Docker image with docker pull rocm/pytorch. Make sure you have the ROCm 5.5 driver and the appropriate kernel installed before you begin.
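
To sanity-check what torch sees after switching ROCm versions, something like the following might help. It is only a sketch: the function name describe_gpu is made up, it reports the VRAM total that torch exposes for device 0, and it does not prove that the full amount can actually be allocated.

```python
def describe_gpu():
    """Report what the installed torch build sees (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # torch.version.hip is set on ROCm builds and None on CUDA builds
        "hip_version": getattr(torch.version, "hip", None),
        # ROCm builds expose the GPU through the torch.cuda namespace
        "gpu_available": torch.cuda.is_available(),
    }
    if info["gpu_available"]:
        props = torch.cuda.get_device_properties(0)
        info["device_name"] = props.name
        info["total_vram_gib"] = round(props.total_memory / 2**30, 2)
    return info


if __name__ == "__main__":
    for key, value in describe_gpu().items():
        print(f"{key}: {value}")
```

Run it inside the container. If hip_version comes back as None, the torch build in use is not a ROCm build.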

I’ve found Doggettx’s cross-attention optimization to be the best with this setup in terms of VRAM usage and speed.

In terms of performance, the Mi25 delivers approximately 40% of a 3060’s performance (with the WX9100 BIOS). Despite being good for its price, it has high power consumption, and I’ve struggled to get Torch+ROCm to work in motherboard slots other than the main x16 slot on CPU lanes. It seems I am not alone. If this were resolved, I’d certainly use one of these in the third slot of my system. On particular upscalers like GANs and LDSR, it competes with the speed of a 3060, suggesting that Stable Diffusion possibly relies more heavily on tensor cores, as opposed to being more FP16-focused like some upscalers. I think this because the newer Radeon cards with tensor cores are showing much better results.

For those more serious about Stable Diffusion but not wanting to spend 7900 XT money, the 3060 is a strong recommendation. If you’re considering used options, the 2080 Ti has serious FP16 performance. In my personal testing, the 2080 Ti was just 12% slower than a 3080 on FP16 models, while consuming significantly less power. The 2060 12GB is also a good pick.