Recently I stumbled on a YouTube video about running a LLaMA model on a Raspberry Pi with llama.cpp.
While it runs fine on the CPU, I was interested in running it on the GPU. However, since the V3D does not support OpenCL and is not well documented, I was having trouble finding any established projects around it.
So I decided to learn how to program the V3D myself and then try to write the “framework” to run LLaMA (or whatever other LLM) on the V3D.
By searching online, reading the DRM drivers for the VC4 and the V3D, and watching a series about RPi bare-metal programming, I managed to gain some understanding of how the V3D works.
My goal for now is to write a program that:
Creates a buffer on the GPU (buffer “a”)
Fills buffer “a” with values
Creates a second buffer on the GPU (buffer “b”)
Tells the V3D to copy the values from buffer “a” into buffer “b”
Prints both buffers to the screen
But I was having trouble understanding how to create the DRM buffer objects (“BOs”), map those BOs into userland with mmap, and do I/O on them. The same goes for the buffer that would hold the QPU machine code to execute on the GPU.
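From what I can piece together from the uapi header (include/uapi/drm/v3d_drm.h), the flow seems to be: open the DRM node, DRM_IOCTL_V3D_CREATE_BO to get a GEM handle, DRM_IOCTL_V3D_MMAP_BO to get a fake mmap offset, then mmap through the DRM fd. Below is a rough, untested sketch of the buffer-creation and print steps of my list; the device path and buffer size are just assumptions for my setup, and the actual GPU-side copy would need a compute job submitted with DRM_IOCTL_V3D_SUBMIT_CSD, which I haven’t figured out yet:

```c
/*
 * Rough sketch, pieced together from include/uapi/drm/v3d_drm.h -- untested.
 * The device path and the 4 KiB size are assumptions for my setup; on the
 * RPi 4 the V3D render node is usually /dev/dri/renderD128 (the card nodes
 * belong to the vc4 display driver). Build with something like:
 *   gcc bo_test.c -o bo_test $(pkg-config --cflags --libs libdrm)
 */
#define _FILE_OFFSET_BITS 64   /* 32-bit RPi OS: keep the 64-bit fake offset */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include <xf86drm.h>       /* drmIoctl() */
#include <drm/v3d_drm.h>   /* V3D uapi; adjust path to where your distro puts it */

#define BO_SIZE 4096

/* Create one BO and mmap it into userland through the DRM fd. */
static uint32_t *create_and_map_bo(int fd, uint32_t *handle)
{
    /* 1. DRM_IOCTL_V3D_CREATE_BO returns a GEM handle for a new BO. */
    struct drm_v3d_create_bo create = { .size = BO_SIZE };
    if (drmIoctl(fd, DRM_IOCTL_V3D_CREATE_BO, &create))
        return NULL;
    *handle = create.handle;

    /* 2. DRM_IOCTL_V3D_MMAP_BO hands back a fake offset for mmap. */
    struct drm_v3d_mmap_bo map = { .handle = create.handle };
    if (drmIoctl(fd, DRM_IOCTL_V3D_MMAP_BO, &map))
        return NULL;

    /* 3. mmap through the DRM fd at that offset gives a CPU pointer. */
    void *ptr = mmap(NULL, BO_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
                     fd, map.offset);
    return ptr == MAP_FAILED ? NULL : ptr;
}

int main(void)
{
    int fd = open("/dev/dri/renderD128", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    uint32_t ha, hb;
    uint32_t *a = create_and_map_bo(fd, &ha);  /* buffer "a" */
    uint32_t *b = create_and_map_bo(fd, &hb);  /* buffer "b" */
    if (!a || !b) { perror("bo"); return 1; }

    /* Fill "a" from the CPU. The GPU-side a->b copy would be a compute job
     * submitted with DRM_IOCTL_V3D_SUBMIT_CSD -- the part I'm stuck on. */
    for (uint32_t i = 0; i < BO_SIZE / sizeof(uint32_t); i++)
        a[i] = i;

    printf("a[0..3] = %u %u %u %u, b[0] = %u\n", a[0], a[1], a[2], a[3], b[0]);
    close(fd);
    return 0;
}
```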
So I was hoping to find someone here who is able to help guide me, and who might be interested in working on the original idea of running an LLM on the V3D.
I don’t think that running an LLM on the V3D is going to improve token generation times, but I also don’t have much knowledge in this area. However, I believe it is a great opportunity to learn how hardware components work with each other, how to interface with the Linux DRM subsystem, and low-level GPGPU in general.
Well, technically the GPU inside the BCM2711 (the SoC on the RPi 4) is called VideoCore VI (or VideoCore 6). However, it’s mostly referred to as “V3D 4.2”; even the Linux driver for it is called “v3d” (in the kernel source under drivers/gpu/drm/v3d). So I don’t know exactly what it wants to be called.
Thanks to your suggestion I checked llama.cpp again and noticed that they have a Vulkan backend to do the compute. So I am currently trying to see if I can get it to work with Vulkan on the RPi 4, since the V3D does support Vulkan 1.1.
The poor thing has been compiling the Vulkan SDK for hours now.
Once the Vulkan SDK is built I can build llama.cpp with the Vulkan backend and hope it works!
However, I was really hoping to gain the knowledge of how to write programs that do compute on the GPU at a low level. But I guess it’s best if I learn CPU assembly first before trying some edge-case GPU.
Nevertheless, thanks for your help, and I will let you know if I have any success.
After several attempts I still get several Vulkan errors.
The errors seem to be related to memory and memory allocation. I tested it with both the 4GB and 8GB versions of the Raspberry Pi 4 but still got the same error. On my 4GB Raspberry Pi I also expanded the swap file to 10GB just to be sure. I have no idea what the problem is here; my best guess is that the LLM I am using is too large (2.9GB) to fit into some “queue buffer thingy” within the GPU. I also don’t know of any smaller LLMs I could test.
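In case it helps anyone dig further, here is a small sketch (untested, just my guess at where to look) that creates a Vulkan instance and prints the memory heaps the driver reports. If the device-local heap, or VkPhysicalDeviceMaintenance3Properties::maxMemoryAllocationSize, turns out to be smaller than the 2.9GB model, that would explain the allocation failures. Assumes the Vulkan loader and Mesa’s v3dv driver are installed; link with -lvulkan:

```c
/* Minimal sketch: print the Vulkan memory heaps each device exposes, to see
 * whether a 2.9GB model can fit in a single heap at all. Untested guess. */
#include <stdio.h>
#include <vulkan/vulkan.h>

int main(void)
{
    VkApplicationInfo app = {
        .sType = VK_STRUCTURE_TYPE_APPLICATION_INFO,
        .apiVersion = VK_API_VERSION_1_1,        /* V3D advertises Vulkan 1.1 */
    };
    VkInstanceCreateInfo ici = {
        .sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
        .pApplicationInfo = &app,
    };
    VkInstance inst;
    if (vkCreateInstance(&ici, NULL, &inst) != VK_SUCCESS)
        return 1;

    uint32_t n = 0;
    vkEnumeratePhysicalDevices(inst, &n, NULL);
    VkPhysicalDevice devs[8];
    if (n > 8) n = 8;
    vkEnumeratePhysicalDevices(inst, &n, devs);

    for (uint32_t d = 0; d < n; d++) {
        VkPhysicalDeviceMemoryProperties mem;
        vkGetPhysicalDeviceMemoryProperties(devs[d], &mem);
        for (uint32_t i = 0; i < mem.memoryHeapCount; i++)
            printf("device %u heap %u: %llu MiB%s\n", d, i,
                   (unsigned long long)(mem.memoryHeaps[i].size >> 20),
                   (mem.memoryHeaps[i].flags & VK_MEMORY_HEAP_DEVICE_LOCAL_BIT)
                       ? " (device local)" : "");
    }
    vkDestroyInstance(inst, NULL);
    return 0;
}
```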
I am not going to open an issue with the maintainers of the llama.cpp repo, since the Raspberry Pi is such edge-case hardware… also, I don’t want to get banned from the Internet.
If anyone knows what is causing the issue and how to work around it, please share with us.
But for now, I guess there is no point in sinking more hours into this topic.
The next block details the invocation and the result: