ROCm support for llama.cpp merged

Don’t want to hijack another thread so I’m creating this one.

It seems SlyEcho’s fork of llama.cpp is about to get merged into the main project. It has been approved by Ggerganov and the others, and was in fact merged a minute ago!

I’ve been using his fork for a while, along with some forks of koboldcpp that make use of it. It’s said to still be somewhat flawed, but even then it’s better than what we had with OpenCL.

Llama.cpp has OpenCL support for matrix operations that works on AMD cards, but it’s not as fast as CUDA. With this change AMD cards should be able to achieve competitive performance. It might not be bumping shoulders with Nvidia for now, but hey, a 7900 costs half as much as a 4090, right?

You’ll probably need to set the CC and CXX variables to the LLVM compilers provided in the ROCm runtime and run make with LLAMA_HIPBLAS=1.

Something like this:

export CC=/opt/rocm/llvm/bin/clang
export CXX=/opt/rocm/llvm/bin/clang++
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_HIPBLAS=1 -j
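
Once it builds, offloading layers to the GPU is the usual -ngl flag. A minimal sketch (the model path is just a placeholder, point it at whatever quantized model you have):

./main -m ./models/model.gguf -ngl 32 -p "Hello"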

Merging into the main project will make it easier to use ROCm in derivative projects like ooba’s webui or langchain, which rely on the Python lib. Also great timing, since Zuck just released his code-focused llama-2 model.
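
For the Python lib that should eventually be just rebuilding llama-cpp-python with the hipBLAS flag turned on, something along these lines (untested on my side; the CMake option name is assumed to match the llama.cpp one):

CMAKE_ARGS="-DLLAMA_HIPBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --no-cache-dir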


Just got a 7900 and have been testing this and pytorch.

Thankfully they have containers with the platform already set up, as I’ve failed to build ROCm 5.7 so far.
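
If anyone else wants the container route, AMD’s prebuilt images can be started with something along these lines (device flags per the ROCm docs; the image tag here is just an example):

docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --security-opt seccomp=unconfined rocm/pytorch:latest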


Cool. I’ve been sticking to 5.4 since it’s the version that gives me the fewest issues with both torch and llama. My GPU is RDNA2.
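
In case it helps anyone, the 5.4 torch wheels install straight from the PyTorch index, roughly like this (the exact index URL depends on the point release you want):

pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm5.4.2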

I’ve written a guide here for SD and Llama. It uses distrobox, so it should be distro agnostic.

https://habla.news/a/naddr1qqxnzd3ex56rwvfexvurxwfjqgsfam9gjjew3qcwqhkgdax3r80yzx3d6w4uke2jtkmfcjr0ftl93qsrqsqqqa28vfwv5f

Funny: while writing the guide I noticed that GNU make produces a binary that segfaults, while the cmake version somehow works fine.
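
For anyone who wants to try the same, the CMake build looks roughly like this (option name assumed to match the make flag; adjust the compiler paths to your ROCm install):

CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake -B build -DLLAMA_HIPBLAS=ON
cmake --build build -j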


It was a kick to train a model on Shakespeare. I’m learning about training for now, but will move on to something bigger when I go pro.

I’m also glad llama.cpp has training, as I hit an issue when trying nanoGPT & pytorch.
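
For reference, the training example in the repo is train-text-from-scratch; the invocation is roughly like this (flag names shift between versions, so check --help, and the file names here are just placeholders):

./train-text-from-scratch \
  --vocab-model models/ggml-vocab.bin \
  --train-data shakespeare.txt \
  --checkpoint-in chk-shakespeare.bin \
  --checkpoint-out chk-shakespeare.bin \
  --model-out ggml-shakespeare-f32.bin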


God damn I’m jelly. I make do with a 6800M with 12GB of VRAM, sometimes offloading to normal RAM, which I have plenty of. Never attempted to train anything.
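
(Side note for anyone else on an RDNA2 mobile chip: if ROCm doesn’t recognize the GPU, the usual workaround is overriding the gfx target as below. Whether the 6800M actually needs it depends on your setup.)

export HSA_OVERRIDE_GFX_VERSION=10.3.0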

In any case, things are improving very fast. We can now get CUDA-like features and speed on LLMs, and just today I tried Stable Diffusion with fp16 support.

I can’t say I used that much RAM to create the Shakespeare model. It’ll take more than 10 minutes without hardware matrix acceleration, but give it a go and see how it fares.
