Trying to get ollama to work with intel arc

Hi all, I beg your collective wisdom to aid me. I have two Intel Arc A770 cards with a Threadripper 2920X in my machine, and I want to run Ollama on it. I tried numerous premade images, but none of them worked. After reading some articles and looking at code examples, I decided to try building my own Docker image with Ollama built in, along with all the Intel drivers and IPEX. When I run the image and connect into it, I can see both Intel cards using lspci, but I get the error below. Does anyone have ideas on how I can fix my image, or know a way to use an existing image?

ollama | terminate called after throwing an instance of 'sycl::_V1::exception'
ollama | what(): No device of requested type available. Please check https://software.intel.com/content/www/us/en/develop/articles/intel-oneapi-dpcpp-system-requirements.html
ollama | SIGABRT: abort
ollama | PC=0x7efd1149eb2c m=0 sigcode=18446744073709551610
ollama | signal arrived during cgo execution
ollama |
ollama | goroutine 1 gp=0xc0000061c0 m=0 mp=0x559a645a1480 [syscall]:
ollama | runtime.cgocall(0x559a638587b0, 0xc000515960)
ollama | runtime/cgocall.go:167 +0x4b fp=0xc000515938 sp=0xc000515900 pc=0x559a62cb752b
ollama | ollama/llama/llamafile._Cfunc_llama_print_system_info()
ollama | _cgo_gotypes.go:838 +0x4c fp=0xc000515960 sp=0xc000515938 pc=0x559a6307a9cc
ollama | ollama/llama/llamafile.PrintSystemInfo()
ollama | ollama/llama/llamafile/llama.go:70 +0x79 fp=0xc0005159a8 sp=0xc000515960 pc=0x559a6307ba59
ollama | ollama/cmd.NewCLI()
ollama | ollama/cmd/cmd.go:1427 +0xc08 fp=0xc000515f30 sp=0xc0005159a8 pc=0x559a63850d68
ollama | main.main()
ollama | ollama/main.go:12 +0x13 fp=0xc000515f50 sp=0xc000515f30 pc=0x559a63857bf3
ollama | runtime.main()
ollama | runtime/proc.go:272 +0x29d fp=0xc000515fe0 sp=0xc000515f50 pc=0x559a62c88f3d
ollama | runtime.goexit({})
ollama | runtime/asm_amd64.s:1700 +0x1 fp=0xc000515fe8 sp=0xc000515fe0 pc=0x559a62cc6001
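That "No device of requested type available" SYCL exception usually means the runtime can't enumerate the GPUs, even when lspci sees them on the PCI bus. A quick diagnostic you could run inside the container might look like this sketch (the status messages and structure are mine, not from any image; `sycl-ls` ships with the oneAPI runtime):

```shell
# Hedged diagnostic sketch: check whether the SYCL runtime has any chance
# of finding the Arc cards inside the container.

# 1. The container only sees the GPUs if /dev/dri was passed through
#    (e.g. `--device /dev/dri`, or a `devices:` entry in compose).
if ls /dev/dri/renderD* >/dev/null 2>&1; then
    dri_status="present"
    ls -l /dev/dri
else
    dri_status="missing"
    echo "no /dev/dri render nodes: pass the host GPU devices into the container"
fi

# 2. sycl-ls lists the devices the SYCL runtime can actually enumerate;
#    if no Level Zero GPU entries appear, the backend aborts like the trace above.
if command -v sycl-ls >/dev/null 2>&1; then
    sycl-ls
else
    echo "sycl-ls not found: the oneAPI / Level Zero runtime may be missing from the image"
fi

echo "dri nodes: $dri_status"
```

If the render nodes are missing, the fix is on the docker run / compose side; if sycl-ls comes up empty despite the nodes being present, the image's runtime libraries are the suspect.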

Take a look at this

That looks like what I need, thanks. I'll try it tomorrow.

This solution worked like a charm for me: ipex-llm/docs/mddocs/DockerGuides/docker_cpp_xpu_quickstart.md at main · intel/ipex-llm · GitHub

It works even in a multi-layer virtualized environment, and it has both Ollama and llama.cpp integrated. If you change the start-ollama.sh file in the container, you can choose how many layers to offload to the GPU, and you can add settings like those specified in the Intel docs about IPEX.

Worked like a charm. If anyone wants a Docker Compose file instead of Podman, here it is:

services:
  intel-llm:
    image: docker.io/intelanalytics/ipex-llm-inference-cpp-xpu:latest
    container_name: intel-llm
    devices:
      - /dev/dri
    volumes:
      - models:/root/.ollama/models
    environment:
      - no_proxy=localhost,127.0.0.1
      - OLLAMA_HOST=0.0.0.0
      - DEVICE=Arc
      - OLLAMA_INTEL_GPU=true
      - HOSTNAME=intel-llm
      - OLLAMA_NUM_GPU=999
      - ZES_ENABLE_SYSMAN=1
    restart: unless-stopped
    command: sh -c 'mkdir -p /llm/ollama && cd /llm/ollama && init-ollama && exec ./ollama serve'

  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: openwebui
    volumes:
      - open-webui:/app/backend/data
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://intel-llm:11434
    restart: unless-stopped

volumes:
  models:
  open-webui:
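For anyone new to compose, bringing the stack up and pulling a model might look like the sketch below. The /llm/ollama path comes from the `command:` line above, where init-ollama links the ollama binary; the guard just makes the snippet safe to paste somewhere docker or the compose file isn't present:

```shell
# Usage sketch for the compose file above (saved as docker-compose.yml).
if command -v docker >/dev/null 2>&1 && [ -f docker-compose.yml ]; then
    docker compose up -d
    # Port 11434 is not published, so pull models through the container:
    docker exec intel-llm /llm/ollama/ollama pull llama3.2
    status="stack started"
else
    status="skipped (no docker or no docker-compose.yml here)"
fi
echo "$status"
```

Open WebUI is then reachable at http://localhost:3000 and talks to the intel-llm container over the internal compose network, which is why only port 3000 needs to be published.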

And… if you use DEVICE=iGPU, then you can take advantage of the built-in GPU on Alder Lake and newer CPUs as well.

It’s not fast, but it works!

@wardtj and @MetalizeYourBrain : Thanks! Regarding a Xe iGPU: apparently, it’s possible to manually allocate a bit more than the usual 50% of the total system RAM to the GPU. If that also works with this, it could be especially interesting for anyone with 64 or 96 GB RAM.
Also, if any of you (or anyone else) has tried to use a Xe iGPU, I’d be curious to know how that performance compares to running a smaller distill on the CPU cores? Thanks!

I think the container solution I shared has this option set up.

I tried that, and 6 Alder Lake P-cores are about the same speed as the Iris Xe 96EU GPU with Llama 3.2 3B on Ollama. But this seems to be the case only for the first runs; after that, for some reason, the model starts to hallucinate pretty badly. About 14-15 tokens/s.
llama.cpp is faster (18 tokens/s) than Ollama on CPU but suffers from the same issue for some reason.

Mixed CPU and GPU is slightly slower than just iGPU or just CPU.

Keep in mind I'm running everything in a Proxmox VM with a virtualized iGPU, so maybe I'm losing some performance there, but it shouldn't be much worse.

IBM's Granite 3.1 Dense 2B runs at 24 tokens/s on the iGPU, which isn't too bad. Granite 3.1 MoE 1B throws an error I haven't looked into yet, but I think it would've been a lot better.

I can do more testing if you have more questions about it.

Did anyone run into this error?

./llm/scripts/start-llama-cpp.sh: line 8: 2279 Bus error (core dumped) ./llama-cli -m $model -n 32 --prompt "What is AI?" -t 8 -e -ngl 999 --color

If you use Docker Compose, replace the entrypoint line and give this a try:

entrypoint: bash /models/startmeup.sh

This works consistently for me. The other settings, for some reason, do not work for me.
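For reference, a startmeup.sh along these lines is roughly what I'd expect (a sketch only: the init-ollama helper, the /llm layout, and the env values are assumptions taken from the ipex-llm image and the compose file earlier in the thread, not guaranteed):

```shell
#!/bin/bash
# Hypothetical /models/startmeup.sh, modeled on the image's default command.
set -e

export OLLAMA_HOST=0.0.0.0    # listen on all interfaces
export OLLAMA_NUM_GPU=999     # offload as many layers as possible
export ZES_ENABLE_SYSMAN=1    # let Level Zero report device/memory info

if command -v init-ollama >/dev/null 2>&1; then
    mkdir -p /llm/ollama
    cd /llm/ollama
    init-ollama               # symlinks the ollama binary into this directory
    exec ./ollama serve
else
    echo "init-ollama not found: not running inside the ipex-llm image"
fi
```

Mount the script into the container (the compose file above already mounts volumes, so a bind mount at /models works) and point the entrypoint at it as shown.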

@MetalizeYourBrain : Thanks for your response, I really appreciate it! I was a bit surprised that the Xe iGPU was only about as fast as six Alder Lake P cores (I have one of those CPUs sitting on my desk). I also have a larger Xe-based GPU here (A770, :man_facepalming:t2: I know :stuck_out_tongue_winking_eye:), and hope to give that one a go when I have the chance. It does have a lot more Xe cores and 16 GB VRAM going for it, and Intel has openVino tools for it available.

1 Like

Maybe there's something flawed in my testing. It seems weird to me too, but that's what I can test. My mistake was also not running some tests on a plain Linux host, with no virtualization.
If you have 6 P-cores, could you run some tests to validate mine?

Have you taken a look at intel_gpu_top? When the LLM is running, you can see what the iGPU is doing, similar to nvtop. You'll see it's just using the CPU if the MHz reading doesn't show. On Alder Lake it should go to 1445 MHz; if it says 0, it's just using the CPU.

Also, I recommend the modded i915 driver on the host, not Xe. That's what I use; it creates /dev/dri/card0 and card1, which you can see in intel_gpu_top.
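Concretely, the host-side check might look like this sketch (assumptions: intel_gpu_top from the intel-gpu-tools package, with -L to enumerate devices and -d to pick one; the status messages are mine):

```shell
# Host-side sketch: confirm the driver created the card nodes, then watch
# GPU frequency with intel_gpu_top while a model is running.
ls /dev/dri/ 2>/dev/null || echo "no /dev/dri nodes: GPU driver not loaded"

if command -v intel_gpu_top >/dev/null 2>&1; then
    intel_gpu_top -L              # enumerate cards and exit
    # While inference runs, watch the Freq column on a specific card:
    # intel_gpu_top -d drm:/dev/dri/card0
    tool_status="available"
else
    tool_status="missing (install intel-gpu-tools)"
fi
echo "intel_gpu_top: $tool_status"
```

If the frequency stays at 0 while a model generates tokens, the work is almost certainly landing on the CPU.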

I did not; I used htop to check whether it's using the CPU or not. When running all the layers on the GPU, only one core is loaded up. When some layers are offloaded to the CPU, all the cores are used.

I'm using the i915 drivers on the host with the 7 split. I did not blacklist the Xe driver in GRUB, so it's throwing a warning in dmesg, but nothing that compromises the system (it's something about the device ID not being recognized).
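For what it's worth, the xe driver can also be kept off the device without touching GRUB, using a modprobe config fragment like this (the file name is my own choice; this is a sketch, not something the thread's images require):

```
# /etc/modprobe.d/blacklist-xe.conf  (hypothetical file name)
# Keep the experimental xe driver from binding the GPU so i915 takes it.
blacklist xe
```

The equivalent on the kernel command line in GRUB would be `modprobe.blacklist=xe`. Either way, rebuild the initramfs afterwards if your distro bakes modprobe config into it.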

Does it work with the i915 driver on the guest, or should I use it on the host?
In Windows, GPU-Z shows 0 MHz, but it's loading the GPU correctly.

You should be able to see the activity on both the host and in the container. If it's showing 0, it's likely doing CPU only. I fussed quite a bit with patches and kernel params to get iGPU offload working on the host side so it would work in a container. Not sure about Windows, but if intel_gpu_top shows 0 MHz while wattage is being drawn, it's using the CPU. When it's running properly, very little CPU is consumed if all layers are sent to the iGPU.

I tried again and remembered that it throws the PMU error and requires kernel 4.16 or above inside the VM. No segmentation fault at the end. I didn't find a convincing solution for this issue, so I gave up looking for an answer.
I'd rather not touch Proxmox to avoid issues, so I'm not going to install the Intel GPU Tools on Proxmox.

Try using AI Playground 2 by Intel. It will work out of the box.