Ubuntu 22.04 - From Zero to 70b Llama (with BOTH Nvidia and AMD 7xxx series GPUs)

wendell · April 10, 2024, 3:03am

That’s awesome!! What kind of perf are you seeing? Tokens/sec on mistral 70b, for example?

Iron_Bound · April 11, 2024, 7:05pm

Would be good to get TFlop numbers for flash-attention

Dao-AILab/flash-attention/blob/main/benchmarks/benchmark_flash_attention.py

# Install the newest triton version with
# pip install "git+https://github.com/openai/triton.git#egg=triton&subdirectory=python"
import pickle
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

from einops import rearrange, repeat

from flash_attn.utils.benchmark import benchmark_all, benchmark_forward, benchmark_backward
from flash_attn.utils.benchmark import benchmark_fwd_bwd, benchmark_combined

from flash_attn import flash_attn_qkvpacked_func

try:
    from triton.ops.flash_attention import attention as attention_triton
except ImportError:
    attention_triton = None

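For anyone wanting to turn a measured kernel time into a TFLOPs/s figure, here is a minimal sketch. It assumes the usual FLOP count for forward attention (two matmuls, each multiply-add counted as 2 FLOPs, halved under a causal mask); the function names are illustrative and not taken from the benchmark script.

```python
def attention_forward_flops(batch, seqlen, headdim, nheads, causal=False):
    """Approximate FLOPs for one forward attention pass."""
    # Two matmuls (Q @ K^T and P @ V), each doing
    # batch * nheads * seqlen * seqlen * headdim multiply-adds,
    # with one multiply-add counted as 2 FLOPs.
    flops = 4 * batch * seqlen**2 * nheads * headdim
    # A causal mask skips roughly half of the score matrix.
    return flops // 2 if causal else flops


def tflops_per_second(flops, seconds):
    """Convert a FLOP count and a measured wall time into TFLOPs/s."""
    return flops / seconds / 1e12
```

Feed it the time reported by whatever benchmark you run; the backward pass needs a larger multiplier, which this sketch leaves out.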

cros13 · April 12, 2024, 9:10am

For the last few months I have had almost exactly the same setup as in the video waiting to replace my home server hardware (A2000/5700/B450D4U):

RTX 4000 Ada
Ryzen 7900
ASrockRack B650D4U-2L2T motherboard

Added an 80+ Titanium power supply (one with specifically high idle efficiency, as this is not really included in the 80+ rating)
A Coral Dual Edge TPU for running models for Frigate
A Crucial T705 4TB PCIe 5 SSD (only one m.2 slot on the board)
And a 1.5TB Optane drive for the write-heavy stuff

Poor thing is gathering dust waiting for me to find time

Domrockt · April 12, 2024, 2:56pm

I get an eval rate of around 6.21 tokens/s on my RTX A4500 with a wizard-vicuna-uncensored:30b model running on Unraid. Is that good or not? I mean, it feels runnable.
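The "eval rate" line looks like Ollama's verbose output. As a rough sketch, the same number can be derived from an API response, assuming the `eval_count` (tokens generated) and `eval_duration` (nanoseconds) fields that Ollama documents for its responses. The sample values are made up for illustration.

```python
def eval_rate(response: dict) -> float:
    """Tokens generated per second, from an Ollama-style response dict."""
    # eval_count is the number of generated tokens;
    # eval_duration is the generation time in nanoseconds.
    return response["eval_count"] / (response["eval_duration"] / 1e9)


# Made-up numbers, purely illustrative: 500 tokens in 50 seconds.
sample = {"eval_count": 500, "eval_duration": 50_000_000_000}
print(f"{eval_rate(sample):.2f} tokens/s")
```

This only covers generation speed; prompt processing is reported separately and is usually much faster.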

wendell · April 12, 2024, 3:24pm

That’s around the speed you get CPU-only on a 7900, give or take.

Domrockt · April 12, 2024, 4:03pm

Hm, OK. I tested with llama2:13b-chat-q6_K --verbose, which went fully into VRAM, and I get about 34.38 tokens/s. So the bigger the model, the more of it swaps out to my DDR4 and the more it uses the CPU, I see.

I am downloading some 30b models in the sub-20GB range and will test them.
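A back-of-the-envelope way to guess whether a model will stay entirely in VRAM is sketched below. The bits-per-weight values and the `overhead_gib` allowance for KV cache and runtime buffers are rough assumptions, not measured figures, and the function names are hypothetical.

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized model weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


def fits_in_vram(params_billion, bits_per_weight, vram_gib, overhead_gib=2.0):
    """True if weights plus an assumed overhead fit in the given VRAM."""
    # overhead_gib is a hypothetical allowance for KV cache and buffers;
    # real usage depends on context length and the runtime.
    needed = weights_gib(params_billion, bits_per_weight) + overhead_gib
    return needed <= vram_gib


# Assumed ~6.5 bits/weight for a 6-bit quant, ~4.5 for a 4-bit quant.
print(fits_in_vram(13, 6.5, vram_gib=20))  # 13b, 6-bit, 20 GiB card
print(fits_in_vram(30, 4.5, vram_gib=20))  # 30b, 4-bit, 20 GiB card
print(fits_in_vram(70, 4.5, vram_gib=6))   # 70b, 4-bit, 6 GiB card
```

Once the estimate exceeds the card, part of the model ends up in system RAM and the CPU does more of the work, which matches the slowdown described above.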

alpha754293 · April 14, 2024, 6:23am

It looks like, with only an RTX A2000 6 GB to play with, I am nowhere CLOSE to having enough VRAM or GPU power to play with the 70b model.

Bummer.

Edit:
In the latest video, @wendell mentioned using a web UI for AUTOMATIC1111 (GitHub - AbdBarho/stable-diffusion-webui-docker: Easy Docker setup for Stable Diffusion with user-friendly UI), but this is the error message I am getting:

ubuntu@nvidia-ai:~/stable-diffusion-webui-docker$ sudo docker compose --profile download up --build
[sudo] password for ubuntu:
WARN[0000] /home/ubuntu/stable-diffusion-webui-docker/docker-compose.yml: `version` is obsolete
[+] Building 0.8s (6/8)                                                                                            docker:default
 => [download internal] load build definition from Dockerfile                                                                0.0s
 => => transferring dockerfile: 185B                                                                                         0.0s
 => [download internal] load metadata for docker.io/library/bash:alpine3.19                                                  0.4s
 => [download internal] load .dockerignore                                                                                   0.0s
 => => transferring context: 2B                                                                                              0.0s
 => CACHED [download 1/4] FROM docker.io/library/bash:alpine3.19@sha256:5353512b79d2963e92a2b97d9cb52df72d32f94661aa825fcfa  0.0s
 => [download internal] load build context                                                                                   0.0s
 => => transferring context: 128B                                                                                            0.0s
 => ERROR [download 2/4] RUN apk update && apk add parallel aria2                                                            0.4s
------
 > [download 2/4] RUN apk update && apk add parallel aria2:
0.248 runc run failed: unable to start container process: error during container init: unable to apply apparmor profile: apparmor failed to apply profile: write /proc/self/attr/apparmor/exec: no such file or directory
------
failed to solve: process "/bin/sh -c apk update && apk add parallel aria2" did not complete successfully: exit code: 1

AppArmor for my Ubuntu 22.04 LTS privileged LXC container is already set to unconfined in my <<CTID>>.conf in Proxmox 7.4-17.
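A quick diagnostic that can be run inside the container is sketched below. It only checks whether the AppArmor interfaces are visible: the `/proc` path comes from the error message, and the `/sys` path is the usual location of the kernel's AppArmor "enabled" flag. If the flag is visible but the `/proc` file is missing, that mismatch may be consistent with the runc error, though this is an assumption rather than a confirmed diagnosis.

```python
from pathlib import Path

# First path: usual sysfs location of the AppArmor "enabled" flag.
# Second path: taken verbatim from the runc error message.
APPARMOR_PATHS = (
    "/sys/module/apparmor/parameters/enabled",
    "/proc/self/attr/apparmor/exec",
)


def apparmor_visibility(paths=APPARMOR_PATHS):
    """Return a {path: exists} mapping for the AppArmor interfaces."""
    return {p: Path(p).exists() for p in paths}


if __name__ == "__main__":
    for path, present in apparmor_visibility().items():
        print(f"{'present' if present else 'missing'}: {path}")
```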

My RTX A2000 6 GB has been successfully passed through to the LXC container, the Nvidia Container Toolkit is installed, and the sample nvidia-smi workload ran successfully.

Cakepans · April 15, 2024, 8:01pm

I used LM Studio and just picked the model that had the most parameters that my GPU was capable of running.

I have a 7900 XT 20GB. Other than hanging sometimes at the start, it's quick.

It glitches out pretty badly sometimes and just starts putting out trash, but otherwise it's for sure chatting away.