(Need Help) Running LLaMA on the V3D

Hello L1T Community!

Recently I stumbled on a YouTube video about running a LLaMA model on a Raspberry Pi with llama.cpp.
While it runs fine on the CPU, I was interested in running it on the GPU. However, since the V3D does not support OpenCL and is not well documented, I had trouble finding any established projects around it.
So I decided to learn how to program the V3D myself and then try to write the “framework” to run LLaMA (or whatever other LLM) on the V3D.
By searching online, reading the DRM drivers for the VC4 and the V3D, and watching a series about RPi bare-metal programming, I managed to gain some understanding of how the V3D works.

My goal for now is to write a program that:

  1. Creates a buffer on the GPU (buffer “a”)
  2. Fills buffer “a” with values
  3. Creates a second buffer on the GPU (buffer “b”)
  4. Tells the V3D to copy the values from buffer “a” into buffer “b”
  5. Prints both buffers to the screen

But I was having trouble understanding how to create the “DRM buffer objects” (BOs), map those BOs into userland with mmap, and do I/O on them. The same goes for the buffer that would hold the QPU machine code to execute on the GPU.
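For reference, here is a rough, untested sketch of what the BO part (steps 1 to 3 above) could look like against the v3d uapi, assuming the header is available as <drm/v3d_drm.h> (from the kernel headers or libdrm) and that the v3d render node is /dev/dri/renderD128 (both may differ per setup):

```c
/* bo_demo.c: untested sketch of creating a V3D BO, mmapping it and filling it.
 * Build with something like: gcc bo_demo.c -o bo_demo
 * (an extra -I may be needed depending on where the drm headers live).
 */
#define _FILE_OFFSET_BITS 64   /* mmap offsets are 64-bit on 32-bit userland */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <drm/drm.h>
#include <drm/v3d_drm.h>

int main(void)
{
    /* The v3d render node; the exact node name may differ on your board. */
    int fd = open("/dev/dri/renderD128", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* 1. Ask the v3d driver to create a buffer object. The kernel returns a
     *    GEM handle plus the BO's offset inside the V3D address space. */
    struct drm_v3d_create_bo create = { .size = 4096 };
    if (ioctl(fd, DRM_IOCTL_V3D_CREATE_BO, &create)) { perror("create_bo"); return 1; }
    printf("handle=%u v3d_offset=0x%x\n", create.handle, create.offset);

    /* 2. Ask for a fake mmap offset for that handle, then mmap the BO into
     *    userland through the DRM file descriptor. */
    struct drm_v3d_mmap_bo map = { .handle = create.handle };
    if (ioctl(fd, DRM_IOCTL_V3D_MMAP_BO, &map)) { perror("mmap_bo"); return 1; }

    uint32_t *ptr = mmap(NULL, create.size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, map.offset);
    if (ptr == MAP_FAILED) { perror("mmap"); return 1; }

    /* 3. Fill buffer "a" with values and read one back. */
    for (uint32_t i = 0; i < create.size / sizeof(uint32_t); i++)
        ptr[i] = i;
    printf("ptr[42] = %u\n", ptr[42]);

    munmap(ptr, create.size);

    /* Release the BO again via the generic GEM close ioctl. */
    struct drm_gem_close gem_close = { .handle = create.handle };
    ioctl(fd, DRM_IOCTL_GEM_CLOSE, &gem_close);
    close(fd);
    return 0;
}
```

Actually dispatching compute work against such BOs would then go through a separate submit ioctl (the v3d driver exposes DRM_V3D_SUBMIT_CSD for compute shader dispatch), which is a much bigger topic on its own.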

So I was hoping to find someone here who is able to help guide me and maybe is interested in working on the original idea of running an LLM on the V3D.

I don’t think that running an LLM on the V3D is going to improve token generation times, but I also don’t have much knowledge in this area. However, I believe it is a great opportunity to learn how hardware components work with each other, how to interface with the Linux DRM subsystem, and how low-level GPGPU works in general.

Thanks for reading this far!


I’ve never worked with or heard of the V3D before. Is it a GPU driver for a Raspberry Pi?

Also, did you see that llama.cpp merged GPU support?

I haven’t deployed it before, but these might be some good resources for working with llama.

Thanks @progressEdd for your reply!

Well, technically the GPU inside the BCM2711 (the SoC on the RPi 4) is called VideoCore VI or VideoCore 6. However, it’s mostly referred to as “V3D 4.2”; even the Linux driver for it is called “V3D” (in the kernel source under drivers/gpu/drm/v3d). So I don’t know exactly what it wants to be called :sweat_smile:

Thanks to your suggestion, I checked llama.cpp again and noticed that they have a Vulkan backend for compute. So I am currently trying to see if I can get it to work with Vulkan on the RPi 4, since the V3D does support Vulkan 1.1.
The poor thing has been compiling the Vulkan SDK for hours now.
Once the Vulkan SDK is built, I can build llama.cpp with the Vulkan backend and hope it works!

However, I was really hoping to learn how to write programs that do compute on the GPU at a low level. But I guess it’s best if I learn CPU assembly first before trying some edge-case GPU :thinking: :joy:

Nevertheless, thanks for your help, and I will let you know if I have any success.


So, here is a little update.

After several attempts I still get several Vulkan errors.
The errors seem to be related to memory and memory allocation. I tested it on both the 4 GB and 8 GB versions of the Raspberry Pi 4 but got the same error. On my 4 GB Raspberry Pi I also expanded the swap file to 10 GB just to be sure. I have no idea what the problem is here; my best guess is that the LLM I am using is too large (2.9 GB) to fit into some “queue buffer thingy” within the GPU. I also don’t know of any smaller LLMs I could test.
I am not going to open an issue with the maintainers of the llama.cpp repo, since the Raspberry Pi is such edge-case hardware… also I don’t want to get banned from the Internet :sweat_smile:

If anyone knows what is causing the issue and how to work around it, please share with us.
But for now, I guess there is no point in sinking more hours into this topic.

The next block shows the invocation and the result:

pi@rpi:~/Repos/llama.cpp $ ./build_vulkan/bin/main -m ./models/llama-2-7b-chat.Q2_K.gguf -p "what is the answer to life, the universe and everything?" -s 1708062597
Log start
main: build = 2150 (594fca3f)
main: built with cc (Debian 12.2.0-14) 12.2.0 for aarch64-linux-gnu
main: seed = 1708062597
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: V3D 4.2.14 | uma: 1 | fp16: 0 | warp size: 16
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ./models/llama-2-7b-chat.Q2_K.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 10
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<…
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000…
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, …
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q2_K: 65 tensors
llama_model_loader: - type q3_K: 160 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q2_K - Medium
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 2.63 GiB (3.35 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 2694.32 MiB

llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_vulkan: Failed to allocate pinned memory.
ggml_vulkan: No suitable memory type found: ErrorOutOfDeviceMemory
llama_kv_cache_init: CPU KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
ggml_vulkan: Failed to allocate pinned memory.
ggml_vulkan: No suitable memory type found: ErrorOutOfDeviceMemory
llama_new_context_with_model: CPU input buffer size = 10.01 MiB
ggml_vulkan: Failed to allocate pinned memory.
ggml_vulkan: No suitable memory type found: ErrorOutOfDeviceMemory
llama_new_context_with_model: Vulkan_Host compute buffer size = 70.50 MiB
llama_new_context_with_model: graph splits (measure): 1
ggml_vulkan: Memory allocation of size 262176768 failed.
ggml_vulkan: No suitable memory type found: ErrorOutOfDeviceMemory
terminate called after throwing an instance of ‘vk::SystemError’
what(): No suitable memory type found: ErrorOutOfDeviceMemory
Aborted
pi@rpi:~/Repos/llama.cpp $
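For anyone who wants to dig into this later: the failing allocation above is roughly 250 MiB, so it might be worth checking what the driver actually reports for its memory heaps and for the largest single allocation it allows. Here is a rough, untested probe, assuming Vulkan 1.1 headers and loader are installed (built with something like gcc vkmemprobe.c -o vkmemprobe -lvulkan; the file name is just an example):

```c
/* vkmemprobe.c: untested sketch that prints each Vulkan device's memory heaps
 * and its maxMemoryAllocationSize, to compare against the ~250 MiB allocation
 * that fails in the log above. */
#include <stdio.h>
#include <vulkan/vulkan.h>

int main(void)
{
    VkApplicationInfo app = {
        .sType = VK_STRUCTURE_TYPE_APPLICATION_INFO,
        .apiVersion = VK_API_VERSION_1_1,
    };
    VkInstanceCreateInfo ici = {
        .sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
        .pApplicationInfo = &app,
    };
    VkInstance inst;
    if (vkCreateInstance(&ici, NULL, &inst) != VK_SUCCESS) {
        fprintf(stderr, "vkCreateInstance failed\n");
        return 1;
    }

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(inst, &count, NULL);
    VkPhysicalDevice devs[8];
    if (count > 8) count = 8;
    vkEnumeratePhysicalDevices(inst, &count, devs);

    for (uint32_t i = 0; i < count; i++) {
        /* maxMemoryAllocationSize comes from the maintenance3 properties,
         * which are core in Vulkan 1.1. */
        VkPhysicalDeviceMaintenance3Properties maint3 = {
            .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MAINTENANCE_3_PROPERTIES,
        };
        VkPhysicalDeviceProperties2 props2 = {
            .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2,
            .pNext = &maint3,
        };
        vkGetPhysicalDeviceProperties2(devs[i], &props2);
        printf("%s: maxMemoryAllocationSize = %llu MiB\n",
               props2.properties.deviceName,
               (unsigned long long)(maint3.maxMemoryAllocationSize >> 20));

        /* List every memory heap and its size. */
        VkPhysicalDeviceMemoryProperties mem;
        vkGetPhysicalDeviceMemoryProperties(devs[i], &mem);
        for (uint32_t h = 0; h < mem.memoryHeapCount; h++)
            printf("  heap %u: %llu MiB%s\n", h,
                   (unsigned long long)(mem.memoryHeaps[h].size >> 20),
                   (mem.memoryHeaps[h].flags & VK_MEMORY_HEAP_DEVICE_LOCAL_BIT)
                       ? " (device local)" : "");
    }

    vkDestroyInstance(inst, NULL);
    return 0;
}
```

If the reported heap or maximum allocation size turns out to be smaller than what llama.cpp asks for in one go, that would at least explain the ErrorOutOfDeviceMemory messages, even if it doesn't fix them.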