Open Assistant: 12 Billion Parameters ought to be enough for anyone! (Quick Setup Guide)

Background

This guide is written for someone who has only just heard of Open Assistant and is going in completely blind. You want to be able to just run a local model, or at least try to run a local model, but are not quite sure about the details of this particular project.

You may have cloned the git repo, found that the default testing model is GPT2, and are looking for something more powerful?

A lot of folks have seen the hype around Open Assistant, become very excited, downloaded everything, run it, and found that the default configuration is only GPT2. Not to worry! I will show you how to get the 12 billion parameter model up and going while we wait for an open 30+ billion parameter model to be built or made available.

At the Hugging Face link below you will find the 12 billion parameter model I am talking about, among Open Assistant's models on HF. You will need a working CUDA/Docker installation, though (this guide has some hints about that as well).

If you cannot run the NVIDIA CUDA hello-world programs, you're likely to run into other issues.

I have only tested this guide on an RTX A6000 sent to me by an anonymous benefactor.

Thank you, anonymous benefactor. <3

Links:

The 30 billion parameter LLaMA model they're talking about XORing on the HF repo is non-free. In their words:

"Due to license-issues around llama-based models, we are working furiously to bring these to you in the form of XORed files."

While we wait for the licensing issues to be sorted out, the 12 billion parameter model will do for now to get everything up and running.

From Zero to Running the 12 billion parameter model

First make sure that everything with Docker AND Docker's connection to your GPU is working. Next you'll want to clone the Open Assistant GitHub project and follow their documentation to run docker compose and get the GPT2 model up and going.

I made some changes from there to get the 12 billion parameter model running.

Here’s the diff of everything I changed:


diff --git a/docker-compose.yaml b/docker-compose.yaml
index ec69cc0f..cc5e25cb 100644
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -217,6 +217,7 @@ services:
       context: .
       target: dev
     image: oasst-inference-server:dev
+
     environment:
       PORT: 8000
       REDIS_HOST: inference-redis
@@ -246,15 +247,23 @@ services:
     image: oasst-inference-worker:dev
     environment:
       API_KEY: "0000"
-      MODEL_CONFIG_NAME: distilgpt2
+      #MODEL_CONFIG_NAME: distilgpt2
+      MODEL_CONFIG_NAME: custom
       BACKEND_URL: "ws://inference-server:8000"
       PARALLELISM: 2
     volumes:
       - "./oasst-shared:/opt/inference/lib/oasst-shared"
       - "./inference/worker:/opt/inference/worker"
+    profiles: ["inference"]
+    privileged: true
     deploy:
       replicas: 1
-    profiles: ["inference"]
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: 1
+              capabilities: [gpu]

   inference-safety:
     build:
diff --git a/docker/inference/Dockerfile.worker-full b/docker/inference/Dockerfile.worker-full
index f7908611..1c2ac599 100644
--- a/docker/inference/Dockerfile.worker-full
+++ b/docker/inference/Dockerfile.worker-full
@@ -22,7 +22,8 @@ RUN /opt/miniconda/envs/worker/bin/pip install -r requirements.txt
 COPY ./${APP_RELATIVE_PATH}/*.py .
 COPY ./${APP_RELATIVE_PATH}/worker_full_main.sh /entrypoint.sh

-ENV MODEL_CONFIG_NAME="distilgpt2"
+ENV MODEL_CONFIG_NAME="oasst1"
+#ENV MODEL_CONFIG_NAME="distilgpt2"
 ENV NUM_SHARDS="1"

 # These are upper bounds for the inference server.
diff --git a/oasst-shared/oasst_shared/model_configs.py b/oasst-shared/oasst_shared/model_configs.py
index 12d7764f..e5200307 100644
--- a/oasst-shared/oasst_shared/model_configs.py
+++ b/oasst-shared/oasst_shared/model_configs.py
@@ -96,6 +96,11 @@ MODEL_CONFIGS = {
         max_input_length=1024,
         max_total_length=1792,  # seeing OOMs on 2048 on an A100 80GB
     ),
+    "custom": ModelConfig(
+        model_id="OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5",
+        max_input_length=1024,
+        max_total_length=1792,  # seeing OOMs on 2048 on an A100 80GB
+    ),
     "OA_SFT_Llama_30Bq_6": ModelConfig(
         model_id="OpenAssistant/oasst-sft-6-llama-30b",
         max_input_length=1024,


The prose walkthrough: set MODEL_CONFIG_NAME to custom for the inference worker in docker-compose.yaml, then add a matching "custom" entry to oasst-shared/oasst_shared/model_configs.py, which, curiously, didn't seem to include the 12 billion parameter model.
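To make the lookup concrete, here is a minimal sketch of how a config name such as "custom" maps to a model. The ModelConfig dataclass below is a simplified, hypothetical stand-in for the real class in model_configs.py (which may carry more fields), and resolve_model is an illustrative helper, not part of Open Assistant.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the ModelConfig class in
# oasst_shared/model_configs.py; the real class may have more fields.
@dataclass(frozen=True)
class ModelConfig:
    model_id: str
    max_input_length: int
    max_total_length: int


# Only the entry added in the diff above is reproduced here.
MODEL_CONFIGS = {
    "custom": ModelConfig(
        model_id="OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5",
        max_input_length=1024,
        max_total_length=1792,
    ),
}


def resolve_model(config_name: str) -> ModelConfig:
    """Look up a config by name and fail loudly on a typo (illustrative helper)."""
    try:
        return MODEL_CONFIGS[config_name]
    except KeyError:
        known = ", ".join(sorted(MODEL_CONFIGS))
        raise ValueError(
            f"unknown MODEL_CONFIG_NAME {config_name!r}; known: {known}"
        ) from None


print(resolve_model("custom").model_id)
```

The point is simply that the name in docker-compose.yaml and the dictionary key have to match exactly; a mismatch leaves the worker without a model to load.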

From there,

 docker compose --privileged --profile frontend-dev --profile ci --profile inference  up --build --attach-dependencies

seemed to work properly. Privileged was a quick and dirty way to give Docker access to the GPUs, as I am on Pop!_OS; Ubuntu 22.04 LTS users shouldn't need this step. It is just an artifact of the NVIDIA CUDA setup not being exactly the same between Pop and Ubuntu, which means jumping through a couple of extra steps (Pop is usually more usable/sane, tbh).

This is probably something I’ll correct in a future version of this how-to.

An easy way to know whether your Docker is set up NOT to need special permissions to access your GPUs is to run NVIDIA's CUDA test.

docker run --privileged --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

Tue Apr 18 15:22:46 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:01:00.0 Off |                  Off |
| 33%   62C    P2    85W / 300W |  29170MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    Off  | 00000000:41:00.0 Off |                  Off |
| 35%   64C    P2    88W / 300W |  27310MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

If the docker command above works WITHOUT the --privileged flag, then you won't need it for your docker compose definitions either.
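If you would rather script that check, the following is a rough sketch that builds the same test command with and without --privileged and reports whether it succeeded. The function names are made up for illustration; the image tag and flags are copied from the command above.

```python
import shutil
import subprocess

# Image tag copied from the test command in this guide.
CUDA_IMAGE = "nvidia/cuda:11.6.2-base-ubuntu20.04"


def gpu_test_command(privileged: bool = False) -> list[str]:
    """Build the nvidia-smi test command, optionally with --privileged."""
    cmd = ["docker", "run", "--rm", "--runtime=nvidia", "--gpus", "all"]
    if privileged:
        cmd.append("--privileged")
    return cmd + [CUDA_IMAGE, "nvidia-smi"]


def gpu_visible(privileged: bool = False) -> bool:
    """Return True if the container can run nvidia-smi successfully."""
    if shutil.which("docker") is None:
        return False  # docker is not installed or not on PATH
    result = subprocess.run(
        gpu_test_command(privileged), capture_output=True, text=True
    )
    return result.returncode == 0
```

Calling gpu_visible() first and gpu_visible(privileged=True) second tells you which of the two cases you are in: if only the second returns True, keep the privileged setting in your compose file.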

Up and Running

Note that none of this is really necessary: you can play with this model in their chat (which is better for future training anyway). It really IS running locally though, which is mind blowing!! :smiley:

Once you're up and running you will have to go to http://localhost:3000 and "login as debug user." In chat, choose the 'custom' model you defined earlier.

Once you have the model up and running it is a lot of fun.

Happy hacking! More prose later…

One will likely also need to do the following (I’m on Fedora 37). This is what worked for me:

Create the daemon.json file and add the nvidia runtime.

sudo dnf install -y nvidia-container-runtime

/etc/docker/daemon.json

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

and then issue

sudo systemctl daemon-reload
sudo systemctl restart docker
sudo docker info | grep -i runtime
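If /etc/docker/daemon.json already exists with other settings, you will want to merge the runtime in rather than overwrite the file. A small sketch of that merge, assuming the file has been loaded into a dict; with_nvidia_runtime is a made-up helper name and the "log-driver" entry is only a hypothetical example of a pre-existing setting.

```python
import json

# Runtime entry copied from the daemon.json shown above.
NVIDIA_RUNTIME = {"path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": []}


def with_nvidia_runtime(daemon_config: dict) -> dict:
    """Return a copy of the config with the nvidia runtime added."""
    merged = dict(daemon_config)
    runtimes = dict(merged.get("runtimes", {}))  # keep any existing runtimes
    runtimes["nvidia"] = NVIDIA_RUNTIME
    merged["runtimes"] = runtimes
    return merged


existing = {"log-driver": "journald"}  # hypothetical pre-existing setting
print(json.dumps(with_nvidia_runtime(existing), indent=4))
```

The input dict is left untouched, so you can compare before and after prior to writing anything back to disk.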

Hmm, seems that this crashes my computer.

The OOM killer gets invoked for the oasst-inference-worker:dev container as soon as I ask a question.

I am using an R9 5900 and a 3060 Ti with 32 GiB RAM, but memory usage just shoots up instantly.

It uses about 24 GB of memory on a 48 GB A6000 most of the time, and sometimes shoots higher.

ahh that would explain it.

Your anonymous benefactor has Great Expectations. And so do we. :slightly_smiling_face:

This is extremely cool! I’m not able to try right this minute, but has anybody been able to get this to run on a 24GB card?

I will note here that you can trade speed for memory and run 4-bit quantized weights. This lowers the VRAM requirements to 7.46 GB plus some working memory, so 10-12GB cards should be OK here.
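A back-of-the-envelope sketch of where numbers like these come from: the raw weights take roughly (parameter count × bits per weight ÷ 8) bytes. This assumes a nominal 12 billion parameters and counts only the weights, not activations, caches, or quantization metadata, so real usage will be higher (which would be consistent with the 7.46 GB figure above sitting over the raw 4-bit estimate).

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory for the raw weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal "12 billion"; the model's exact parameter count differs slightly.
N_PARAMS = 12e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

At 16 bits per weight this lands on the roughly 24 GB reported earlier in the thread for the A6000, and it also suggests why an 8 GB card runs out of memory on the unquantized model.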

Working on a red eye graphic and wav file containing “I’m sorry Dave. I’m afraid I can’t do that.”

Due to license-issues around llama-based models, we are working furiously to bring these to you in the form of XORed files.

While I can make a good guess at what that means, that's a hilarious sentence to write in all seriousness.

Is there a way to connect this to the home assistant and/or voice assistant? I want to be able to talk to the AI :open_mouth:

yeah, we’re getting there. AI is moving so fast it’s about a year per week.

The 12 billion parameter model isn’t quittteeeee there

but there was another project to do these LLMs via webgl and boy that was something else… you can load an LLM via a web page?? yeah, seems like that’s going to be a thing soon.

I feel like it’s gradually accelerating too. I wonder if this means that devs use their own AI to help code AI now. Or could it just be an explosive interest in this that accelerated development by so much.

Now for the most important question: will your own bot that buys stuff on eBay be using AI in the future? :stuck_out_tongue:

Access to a professional GPU with 24GB of memory is only $1/hr away using the G5 instances from AWS.

Multi-GPU instances with up to 192GB memory are readily available and obscenely expensive.

Long term it might be cheaper to get a second-hand 3090.
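For anyone weighing the two, the break-even point is just the card price divided by the hourly rate. In the sketch below the $1/hr rate comes from the post above, while CARD_PRICE is purely a placeholder to replace with whatever a used card actually costs you; electricity, the rest of the machine, and resale value are ignored.

```python
def break_even_hours(card_price: float, hourly_rate: float) -> float:
    """Hours of cloud rental that cost as much as buying the card outright."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return card_price / hourly_rate


HOURLY_RATE = 1.0   # USD per hour, the G5 figure quoted above
CARD_PRICE = 800.0  # USD, placeholder only; substitute a real quote

hours = break_even_hours(CARD_PRICE, HOURLY_RATE)
print(f"Break-even after about {hours:.0f} rental hours "
      f"({hours / 24:.0f} days of continuous use)")
```

For a couple of hours or days of experimenting, renting stays far below that threshold; buying only pays off with sustained use.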

Or multiple slower GPUs. Some AIs are capable of using multiple GPUs at the same time.

Of course. But at this point we have a ton of folks who really want to try out large models and have no way to get their hands on 24GB GPUs.
Using AWS is quite reasonable for a couple of days. For the cost of a high-end GPU you can go quite a long way :slight_smile:

Spoken as someone who doesn’t expect to spend more than a couple of hours on this in the short term.

You don't really need 24GB of RAM for that. As mentioned by someone above, there is a way to drastically reduce memory usage at the cost of speed and accuracy.

Yep and someone said…

:slight_smile:

Not arguing, just providing short-term options for folks that can’t wait.

Very exciting.

I'm running this, and everything seems to be up and running, but after I enter a prompt, nothing seems to happen.

There is a message on top of the prompt line, "Your message is queued, you are at position 0 in the queue." and a spinning fiddle at the right of the prompt line, but no text is produced by the model.

No CPU or GPU activities are detected.

No significant error messages are present, just a complaint about a missing Python module in open-assistant-backend-worker-beat-1 and open-assistant-backend-worker-1: "ModuleNotFoundError: No module named 'requests'"

after a while:

:frowning:

I'm not using --privileged in the compose command as it is not recognized, but containers seem to be able to access the GPUs with no problem when running the NVIDIA test.

Some details:

AMD Ryzen Threadripper 3970X 32-Core Processor
2x NVIDIA TITAN RTX
Pop!_OS 22.04 LTS
Docker version 23.0.5, build bc4487a

Has anybody tried to get this working with an AMD GPU?