RTX 4000 SFF Ada - Docker AI / Plex / Stable Diffusion (Automatic1111) Quick Start

Background

The best things about this card: it’s TINY. It’s shockingly low power. It has 20GB of VRAM. It has 4 DisplayPort outputs and Quadro features like Frame Sync. It does NOT require any PCIe power connector.

It has 2 NVENC encoders and 2 NVDEC decoders, and they support AV1.

There is one real con for this card: cost. Also, double-slot half-height slots aren’t always a thing. “I’ll add this to my NAS!” – two problems with that idea: most NAS units are designed for up to 25W PCIe cards, not 75W, and most do not physically have room for a double-slot card, even though it is tiny.

Our Platform

Part of the theme of this build is low power usage – both idle and peak. The CPU is an AMD Ryzen 9 7900, a 12-core part, and it pairs well with that goal.

The ASRock motherboard does support ECC DDR5 UDIMMs (not RDIMMs, of course), and with some cajoling on newer kernel versions I can get kernel messages about single bits having been corrected… however, support for this should be regarded as iffy on Ubuntu 22.04 LTS.
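If you want to see whether corrected errors are actually showing up on your own box, grepping the kernel log for EDAC/ECC messages is a quick check (a rough sketch; the exact messages vary by kernel version and platform):

sudo dmesg | grep -iE 'edac|ecc'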

Install Ubuntu 22.04 LTS

Even though I’m having better luck on Arch Linux these days, most of the tutorials and other writeups one can still find with Google searches are centered around Ubuntu. As cohesive as NVIDIA’s ecosystem is, there can be unexpected and deep pitfalls when mixing versions.

The general strategy I am using to steer clear of these is a combination of Docker (for containers and container orchestration) and Ubuntu 22.04 LTS. If you know what you’re doing, feel free to translate these instructions to Arch Linux.

Why Docker? Generally I have an easier time lifting-and-shifting what I have done onto new software versions (which can often improve performance – I have personally seen an over 100% performance uplift going from CUDA ~7 to CUDA ~12 on certain projects).

When installing Ubuntu, the installer will give you the option to install the NVIDIA driver (if the GPU is installed) and Docker. Select the NVIDIA driver but, paradoxically, don’t select Docker: the installer’s Docker option doesn’t install docker-ce, which is what we want here.

*There is some disturbance in the state of things re: Docker of late; Podman, from Red Hat, has been heralded as a more open alternative, with the Docker folks backpedaling on some enshittification as a result of Podman stepping up. There are more open variants of Docker you get by default in Ubuntu, but for things of this scale you’re probably fine running Docker proper.*

Once installed, reboot and install all updates:

sudo apt update && sudo apt upgrade -y

What else does Docker do for us? Well, with this containerization system we can easily deploy Plex Media Server, or Jellyfin, and have access both to our AMD iGPU’s encoders and to NVENC/NVDEC.

Installing the Docker Ecosystem

This guide is accurate, and there aren’t too many landmines. One thing, though… once you’ve got Docker up and running, run a command for the user you created for yourself in the installer:

sudo usermod -a -G docker your_username

to add your regular user to the docker group. Log out and back in (or reboot) for the group change to take effect.
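A quick sanity check that the daemon is up and your user can reach it without sudo (hello-world is Docker’s stock test image):

docker run hello-world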

Installing the NVIDIA Ecosystem

The guide here is pretty good:

You might recognize that sudo apt-get install -y nvidia-driver-550-open was already run by the Ubuntu installer.

The second part is enabling functionality in docker, documented here:

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

Be sure not to miss that step.
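The repository-setup commands on that page change from time to time, so follow the linked guide for that part; once the repo is configured, the rest boils down to roughly this (a sketch based on the current toolkit docs):

sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# quick check that containers can see the GPU
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi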

Next steps

Reboot. Seriously.

sudo apt install nvtop

Then run nvidia-smi:

w@powersippy:~$ nvidia-smi
Sun Mar 31 16:20:13 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 4000 SFF Ada ...    Off |   00000000:01:00.0 Off |                  Off |
| 30%   39C    P8             12W /   70W |     332MiB /  20475MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2523      C   python                                        326MiB |
+-----------------------------------------------------------------------------------------+

If this is the output you see, you’re good to go for the next step.

Install Ollama

Ollama doesn’t really depend on Docker, kinda-sorta-mostly. The web GUI does, but the service itself does not.

" Get up and running with LLMs, Locally " – and it does what it says.

The instructions are pretty clear:
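At the time of writing, the Linux install is a one-liner from their site (as always, eyeball a piped-to-shell script before running it):

curl -fsSL https://ollama.com/install.sh | sh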

By default Ollama only listens on localhost; let’s expose it to your local network (DANGER, if you don’t understand the implications and/or if your LAN is not safe.)

Edit /etc/systemd/system/ollama.service and add the second Environment line:

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"
Environment="OLLAMA_HOST=0.0.0.0"

[Install]
WantedBy=default.target

That’s identical to their howto, except for the one OLLAMA_HOST entry.

Use systemctl to start/restart Ollama.
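Something like this should do it (the daemon-reload picks up the edited unit file):

sudo systemctl daemon-reload
sudo systemctl restart ollama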

Here’s the web gui:

Ollama Web Gui

And this command works as advertised (with the above change):

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main

At this point you should be able to get to a GUI similar to the one in the video (on port 3000). The sign-up process is entirely local. The first account to sign up becomes the admin of this instance, fwiw, so set a username and password immediately.

From the Ollama CLI, it is possible to download models:

ollama pull llama2:13b 

and any models pulled will show up in the GUI automatically, assuming the web GUI is connected to the instance (the gear icon, as seen in the video, is where you configure that).

As we saw in the video, you can run 70B-parameter models, but it helps to have a lot more RAM and CPU in the host because the model overruns the 20GB of available VRAM on this card. The 13B-parameter model works just fine, though.
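You can also chat with a pulled model straight from the terminal, without the web GUI, if you just want to kick the tires:

ollama run llama2:13b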

Automatic1111

Automatic1111, the web GUI for Stable Diffusion, depends on having CUDA and the NVIDIA container toolkit installed on the host (even though we can run it from Docker).

For this video, I found this variant of the front end, which has some nice quality-of-life improvements out of the box:

… as you get farther down this rabbit hole, be sure to check out ComfyUI.
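Whichever front end you land on, the Docker invocation has the same general shape. The image name and mount points below are placeholders only – substitute whatever the project you pick actually documents (7860 is Automatic1111’s default port):

# placeholder image name and paths – check the docs of the web UI build you choose
docker run -d --gpus all \
  -p 7860:7860 \
  -v /path/to/models:/models \
  --name sd-webui \
  some-sd-webui-image:latest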

Plex / Jellyfin

With a fully working Docker + NVIDIA stack, the sky is the limit. In the video I stopped the Ollama service to free some VRAM for other things, but many tasks like encode/decode don’t use much VRAM. I wish this card had more encoders/decoders unlocked, but this is reasonable for a ~70W card imho.

Technically, NVIDIA does not limit the number of sessions on the encoder/decoder anymore – one just gets diminished performance stacking many sessions on the encoders/decoders.
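For reference, here’s a minimal sketch of handing NVENC/NVDEC to a Jellyfin container using the official jellyfin/jellyfin image – the paths are placeholders, you still have to enable NVENC in Jellyfin’s transcoding settings, and Plex is analogous with its own image and ports:

docker run -d \
  --name jellyfin \
  --gpus all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -p 8096:8096 \
  -v /path/to/config:/config \
  -v /path/to/media:/media \
  jellyfin/jellyfin:latest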