Progress! Got Triton flash attention to compile for the older Mi50 cards, and I was able to see tensor parallelism working briefly with 2 cards. I’m now running up against the limitations of my testbench hardware: not enough cooling for these passive cards and not enough power from my test PSU. So now I need to move everything into my server chassis and (hopefully) see what 4 of them can do with adequate power and ventilation.
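For anyone curious, "tensor parallelism working with 2 cards" amounts to something like the sketch below. The model name and sampling settings are placeholders rather than what I actually ran, and it assumes the ROCm vLLM build described in the notes that follow.

```python
# Rough sketch of a 2-way tensor-parallel run on the Mi50s (placeholder model).
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-6.7b",   # placeholder; any small FP16 model will do
    tensor_parallel_size=2,      # one shard per Mi50
    dtype="float16",             # gfx906 has no bfloat16, so force fp16
)

outputs = llm.generate(
    ["Tell me about tensor parallelism."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```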
Edited to add: Just in case anyone wants to follow what I did to get the Mi50s working with vLLM: I installed the ROCm drivers as described on AMD’s site. After a reboot and checking that Linux sees the card(s), I moved on to the vLLM install. I built everything from source based on these instructions. A couple notes:
- The instructions are not explicit, but you do need to clone the vLLM repository (duh, but I didn’t on my first attempt).
- The PyTorch URL for ROCm 6.3 in Step 0 is incorrect. The actual URL is
https://download.pytorch.org/whl/nightly/rocm6.3
- I used the Triton repository here:
https://github.com/triton-lang/triton.git
- After cloning the Triton repository, you need to modify two files in the source as described here. These changes add support for the Mi50 and Mi60 (gfx906). Then compile Triton as described in the vLLM instructions. I did not use CK flash attention.
- vLLM should now compile as described in the instructions, and you do not need to disable flash attention before running vLLM.
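One extra sanity check that isn’t part of the linked instructions, just something worth doing before launching vLLM: confirm that the ROCm PyTorch build actually sees the gfx906 cards. On ROCm, PyTorch still exposes the devices through the torch.cuda API.

```python
# Quick check that the ROCm PyTorch nightly sees the Mi50s (gfx906).
import torch

print("HIP/ROCm version:", torch.version.hip)         # None would mean a CUDA-only build
print("Devices visible:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  card {i}: {torch.cuda.get_device_name(i)}")
```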