Progress! Got Triton flash attention to compile for the older Mi50 cards, and I was able to see tensor parallelism working briefly with 2 cards. I’m now running up against the limitations of my testbench hardware: not enough cooling for these passive cards and not enough power from my test PSU. So now I need to move everything into my server chassis and (hopefully) see what 4 of them can do with adequate power and ventilation.
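For anyone curious, "tensor parallelism working with 2 cards" amounts to something like the sketch below. The model name and sampling settings are placeholders rather than what I actually ran, and it assumes the ROCm vLLM build described in the notes that follow.

```python
# Rough sketch of a 2-way tensor-parallel run on the Mi50s (placeholder model).
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-6.7b",   # placeholder; any small FP16 model will do
    tensor_parallel_size=2,      # one shard per Mi50
    dtype="float16",             # gfx906 has no bfloat16, so force fp16
)

outputs = llm.generate(
    ["Tell me about tensor parallelism."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```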
Edited to add: Just in case anyone wants to follow what I did to get the Mi50s working with vLLM: I installed the ROCm drivers as described on AMD’s site. After a reboot and checking that Linux sees the card(s), I moved on to the vLLM install. I built everything from source based on these instructions. A couple notes:
- The instructions are not explicit, but you do need to clone the vLLM repository (duh, but I didn’t on my first attempt).
- The PyTorch URL for ROCm 6.3 in Step 0 is incorrect. The actual URL is
https://download.pytorch.org/whl/nightly/rocm6.3
- I used the Triton repository here:
https://github.com/triton-lang/triton.git
- After cloning the Triton repository, you need to modify two files in the source as described here. These changes add support for the Mi50 and Mi60 (gfx906). Then compile Triton as described in the vLLM instructions. I did not use CK flash attention.
- vLLM should now compile as described in the instructions, and you do not need to disable flash attention before running vLLM.
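One extra sanity check that isn’t part of the linked instructions, just something worth doing before launching vLLM: confirm that the ROCm PyTorch build actually sees the gfx906 cards. On ROCm, PyTorch still exposes the devices through the torch.cuda API.

```python
# Quick check that the ROCm PyTorch nightly sees the Mi50s (gfx906).
import torch

print("HIP/ROCm version:", torch.version.hip)         # None would mean a CUDA-only build
print("Devices visible:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  card {i}: {torch.cuda.get_device_name(i)}")
```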