Hi all, I beg upon your collective wisdom to aid me. I have two Intel Arc A770 cards and a Threadripper 2920X in my machine, and I want to run Ollama on it. I tried numerous premade images but none of them worked. I read some articles, looked at some code examples, and decided to try building my own Docker image with Ollama, all the Intel drivers, and IPEX. When I run the image and connect into it I can see both Intel cards with lspci, but I get the error below. Does anyone have any idea how I can fix my image, or know a way to use an existing one?
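For reference, a minimal sketch of the kind of docker run invocation involved (the image name is just a placeholder; the important part is mapping /dev/dri, since seeing the cards in lspci doesn't mean the compute runtime can actually open them):

```sh
# Placeholder image name; the key bit is --device=/dev/dri so the
# SYCL/Level Zero runtime inside the container can reach the render nodes.
docker run -d \
  --device=/dev/dri \
  -v ~/ollama-models:/root/.ollama \
  -p 11434:11434 \
  my-ipex-ollama:latest
```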
It works even in a multi-layer virtualized environment, and it has both Ollama and llama.cpp integrated. If you edit the start-ollama.sh file in the container you can choose how many layers to offload to the GPU, and you can add settings like the ones described in the Intel IPEX docs.
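As a rough sketch of what start-ollama.sh can carry (the variable names are the ones from Intel's IPEX-LLM Ollama docs, so double-check them against the script actually shipped in the container):

```sh
# Illustrative settings for start-ollama.sh; adjust to your setup.
export OLLAMA_NUM_GPU=999        # 999 = offload all layers; lower it to keep some on the CPU
export ZES_ENABLE_SYSMAN=1
export SYCL_CACHE_PERSISTENT=1
./ollama serve
```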
@wardtj and @MetalizeYourBrain: Thanks! Regarding a Xe iGPU: apparently it's possible to manually allocate a bit more than the usual 50% of total system RAM to the GPU. If that also works with this, it could be especially interesting for anyone with 64 or 96 GB of RAM.
Also, if any of you (or anyone else) has tried a Xe iGPU, I'd be curious how its performance compares to running a smaller distill on the CPU cores. Thanks!
I think the container solution I shared with you has that option set up.
I tried that, and six Alder Lake P-cores are about the same speed as the Iris Xe 96-EU iGPU with Llama 3.2 3B on Ollama. But that only seems to hold for the first runs; after that, for some reason, the model starts to hallucinate pretty badly. About 14-15 tokens/s.
llama.cpp is faster than Ollama on CPU (18 tokens/s) but suffers from the same issue for some reason.
Mixed CPU and GPU is slightly slower than iGPU-only or CPU-only.
Keep in mind I'm running everything in a Proxmox VM with a virtualized iGPU, so maybe I'm losing some performance there, but it shouldn't be much worse.
IBM's Granite 3.1 Dense 2B runs at 24 tokens/s on the iGPU, which isn't too bad. Granite 3.1 MoE 1B throws an error I haven't looked into yet, but I think it would have been a lot better.
I can do more testing if you have more questions about it.
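If anyone wants to reproduce the numbers, the simplest way I know of is Ollama's --verbose flag, which prints the eval rate (tokens/s) after each answer; the model tags below may differ from the ones in your registry:

```sh
ollama run llama3.2:3b --verbose           # prints prompt eval rate and eval rate
ollama run granite3.1-dense:2b --verbose
```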
@MetalizeYourBrain: Thanks for your response, I really appreciate it! I was a bit surprised that the Xe iGPU was only about as fast as six Alder Lake P-cores (I have one of those CPUs sitting on my desk). I also have a larger Xe-based GPU here (an A770, I know), and I hope to give that one a go when I get the chance. It has a lot more Xe cores and 16 GB of VRAM going for it, and Intel has OpenVINO tools available for it.
Maybe there's something flawed in my testing. It seems weird to me too, but that's what I was able to test. My mistake was also not running any tests on a bare Linux host, with no virtualization.
If you have six P-cores, could you run some tests to validate mine?
Have you taken a look at intel_gpu_top? While the LLM is running you can see what the iGPU is doing, similar to nvtop. You'll see it's just using the CPU if the clock speed doesn't show. On Alder Lake it should go up to 1445 MHz; if it says 0, it's CPU-only.
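If you haven't used it before, it comes with the intel-gpu-tools package; the -d filter is only needed if there's more than one card:

```sh
sudo apt install intel-gpu-tools
sudo intel_gpu_top -d drm:/dev/dri/card0   # watch the busy % and clock while the LLM runs
```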
Also, I recommend the modded i915 driver rather than Xe on the host. I use it, and it creates /dev/dri/card0 and card1, which you can see in intel_gpu_top.
I did not; I used htop to check whether it's using the CPU or not. When all the layers run on the GPU, only one core is loaded. When some layers are offloaded to the CPU, all the cores are used.
I'm using the i915 driver on the host with the 7-way split. I didn't blacklist the Xe driver in GRUB, so it throws a warning in dmesg, but nothing that compromises the system (it's something about the device ID not being recognized).
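For anyone reproducing this, the host kernel command line for the SR-IOV-modded i915 driver with a 7-way split usually looks roughly like the sketch below; the exact parameters depend on the driver build you use, so treat it as an illustration rather than a drop-in line.

```sh
# /etc/default/grub on the host; run update-grub and reboot afterwards.
# Appending module_blacklist=xe is the "blacklist Xe in GRUB" step mentioned above.
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt i915.enable_guc=3 i915.max_vfs=7"
```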
Does it work with the i915 driver on the guest? Or should I use it on the host?
In Windows, GPU-Z shows 0 MHz, but it's loading the GPU correctly.
You should be able to see the activity on both the host and in the container. If it's showing 0, it's likely running CPU-only. I fussed quite a bit with patches and kernel params to get iGPU offload working on the host side so it would work in a container. Not sure about Windows, but if intel_gpu_top shows 0 MHz while wattage is being drawn, it's using the CPU. When it's running properly, very little CPU is consumed if all layers are sent to the iGPU.
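A quick sanity check from inside the container (assuming the usual /dev/dri layout):

```sh
ls -l /dev/dri     # expect card*/renderD* nodes to be visible inside the container
id                 # the user running ollama should be in the group that owns renderD*
```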
I tried again and remembered that it throws the PMU error and requires kernel 4.16 or above inside the VM. No segmentation fault at the end. I didn't find a convincing solution for this issue, so I gave up looking for an answer.
I'd rather not touch Proxmox to avoid issues, so I'm not going to install the Intel GPU Tools on it.