Notes on what I did:
Manjaro, full proprietary driver install
2080Ti + Tesla V100
Prerequisites
sudo pacman -S python python-pip tensorflow-cuda cudnn python-tensorflow-cuda
sudo pacman -S python-pytorch-cuda cuda
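After the packages are in, a quick sanity check that both frameworks import and can see a CUDA device (a sketch; it prints False / an empty list, or fails outright, if only the CPU path is available):

```shell
# Confirm torch and tensorflow import and report a CUDA device.
python - <<'PY' || echo "CUDA check failed (see errors above)"
import torch
import tensorflow as tf
print("torch CUDA available:", torch.cuda.is_available())
print("tf visible GPUs:", tf.config.list_physical_devices("GPU"))
PY
```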
Playground git repo
https://github.com/saharmor/dalle-playground
- Clone or fork this repository
- Create a virtual environment:
cd backend && python3 -m venv ENV_NAME
- Install requirements:
pip install -r requirements.txt
- Make sure you have PyTorch and its dependencies installed (see the PyTorch installation guide)
- Run the web server:
python app.py --port 8080 --model_version mini
(you can change 8080 to your own port)
- In a different terminal, install the frontend's modules:
cd interface && npm install
and run it:
npm start
- Copy the backend's URL from the web-server step and paste it into the backend URL input within the web app
What if I get ptxas errors and it falls back to using the CPU?
Even if you don't have a CUDA device, it can still run on the CPU. It was decently fast on a 32-core Threadripper system.
2022-06-16 06:46:21.595653: I external/org_tensorflow/tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2022-06-16 06:46:21.596077: I external/org_tensorflow/tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2022-06-16 06:46:21.596087: W external/org_tensorflow/tensorflow/stream_executor/gpu/asm_compiler.cc:80] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas --version
2022-06-16 06:46:21.596665: I external/org_tensorflow/tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2022-06-16 06:46:21.596695: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:460] ptxas returned an error during compilation of ptx to sass: 'INTERNAL: Failed to launch ptxas' If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
This error was related both to jax having been installed with a plain pip install jax and to /opt/cuda/bin not being on the PATH (XLA invokes the ptxas binary from there). I corrected it with
declare -x PATH=$PATH:/opt/cuda/bin
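That only fixes the current shell. To make it survive new sessions, the same line can go in ~/.bashrc or equivalent (a sketch, assuming the pacman cuda package's /opt/cuda install prefix):

```shell
# Put the CUDA toolkit binaries (ptxas among them) on PATH permanently.
export PATH="$PATH:/opt/cuda/bin"

# Verify that ptxas is now resolvable; XLA shells out to it at compile time.
command -v ptxas || echo "ptxas still not on PATH"
```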
What do I do if it grabs the wrong GPU, or I get GPU errors about it?
2022-06-16 06:49:58.770162: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2141] Execution of replica 1 failed: INVALID_ARGUMENT: executable is built for device CUDA:0 of type "Tesla V100-PCIE-32GB"; cannot run it on device CUDA:1 of type "NVIDIA GeForce RTX 2080 Ti"
In my case I have both a 2080 Ti and a V100 in this Threadripper system. I wanted it to use the V100 with its 32 GB of VRAM, as shown in the video. This error is a bit obtuse.
The fix was:
TF_CPP_MIN_LOG_LEVEL=0 CUDA_VISIBLE_DEVICES=0 python3 app.py --port 8080 --model_version mega_full
Device 0 was the V100, device 1 was the 2080 Ti. Your system may have different indices for devices; note that CUDA's default device ordering (fastest card first) does not have to match nvidia-smi's PCI bus ordering, which is why the V100 shows up as GPU 1 in the nvidia-smi output below but as CUDA device 0 here.
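The same pinning can also be done from inside Python, as long as it happens before the first framework import (a sketch; the index "0" assumes the V100 as on my system):

```python
import os

# Must be set before tensorflow or jax is first imported, or the
# frameworks will have already enumerated every device and the
# setting has no effect. "0" was the V100 here; yours may differ.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# From this point on, any framework imported below sees only that device.
print(os.environ["CUDA_VISIBLE_DEVICES"])
```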
You can troubleshoot CUDA and GPUs further in Python with commands like:
import tensorflow as tf
print(tf.test.gpu_device_name())
How do I know what the GPU is doing and/or that the GPU is busy?
nvidia-smi
Output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 0% 35C P0 74W / 300W | 578MiB / 11264MiB | 64% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:21:00.0 Off | Off |
| N/A 32C P0 35W / 250W | 11577MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1387 G /usr/lib/Xorg 255MiB |
| 0 N/A N/A 1478 G /usr/bin/gnome-shell 66MiB |
| 0 N/A N/A 2508 G /usr/lib/firefox/firefox 169MiB |
| 0 N/A N/A 5603 G /usr/bin/gjs 7MiB |
| 0 N/A N/A 48157 G obs 72MiB |
| 1 N/A N/A 1387 G /usr/lib/Xorg 4MiB |
| 1 N/A N/A 45112 C python3 11537MiB |
+-----------------------------------------------------------------------------+
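For a scriptable summary instead of the full table, nvidia-smi's query mode is handy (the flags below are standard nvidia-smi options; the watch line is an optional live view):

```shell
# CSV summary of per-GPU load and memory, one line per device.
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total \
  --format=csv || echo "nvidia-smi not available"

# Live view of the full table, refreshed every second:
# watch -n 1 nvidia-smi
```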
Errors I ran into and workarounds
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)