Steps I took
# first fully update 22.04 LTS
apt update && apt upgrade -y
# reboot ... you probably got a newer kernel...
# ensure remote access
Since we are updating the video driver, and you likely have only one GPU in the system, make sure you can ```ssh``` into it from another machine. This is useful both for setup and for troubleshooting, should something go wrong.
# nvidia part 1
We need the proprietary NVIDIA GPU driver first. If the only GPU in the system is an NVIDIA card and it is currently using the nouveau driver, nouveau must be blacklisted before the NVIDIA driver can claim the card. Blacklist nouveau, install the NVIDIA driver, and then reboot.
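Blacklisting nouveau is usually done with a modprobe config file plus an initramfs rebuild; a minimal sketch (the file name is my choice, anything under /etc/modprobe.d/ works):

```shell
# write a modprobe config that prevents nouveau from loading
sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF

# rebuild the initramfs so the blacklist applies at early boot
sudo update-initramfs -u
```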
Run ```lsmod``` and check the output to confirm the ```nvidia``` module is loaded; then check ```dmesg``` to be sure you do NOT see messages like:
[ 1044.501389] NVRM: The NVIDIA probe routine was not called for 1 device(s).
... this message indicates "something else" has claimed your nvidia card (most likely nouveau).
Once the ```nvidia``` module is loaded and ```dmesg``` is free of errors suggesting the NVIDIA driver failed to claim the card, you can move on to installing CUDA.
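Concretely, the checks I mean look something like this (the grep patterns are my own, adjust to taste):

```shell
# the nvidia module should appear in the loaded-module list
lsmod | grep '^nvidia'

# search the kernel log for NVRM probe complaints; ideally this prints the fallback message
sudo dmesg | grep 'NVRM: .*probe' || echo "no NVRM probe errors found"
```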
# nvidia part 2
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install python3-pip
The next part is deciding… what CUDA version do I need?
Start Locally | PyTorch
This page helps make that decision for us.
apt search cuda
shows CUDA 11-(many versions) as well as 12.1 and 12.2; if we want the “stable” PyTorch build, it makes sense to install CUDA 12.1 to match, and to reduce the headaches we have to deal with.
sudo apt install cuda-12-1
… this version made the most sense, based on the information on the pytorch website.
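As a sanity check after the install, the 12.1 toolkit should land under /usr/local/cuda-12.1 (path assumed from NVIDIA's usual packaging layout):

```shell
# the versioned toolkit directory and its compiler should exist
ls /usr/local/cuda-12.1/bin/nvcc

# nvcc reports the toolkit version; extract the "release X.Y" part
/usr/local/cuda-12.1/bin/nvcc --version | grep -o 'release [0-9.]*'
```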
Longer Explanation
If you aren’t familiar with Python, especially version 3, Python supports running multiple virtual environments and managing package versions separately.
The analogous facility on Linux is probably… Docker? (Saying that is a bit heretical if you already know these tools, but Docker is a convenient containerization system that abstracts away some of this complexity. It is also possible to set up Docker and let containers interface with the CUDA hardware directly. If you need to run CUDA 11.8, CUDA 12.1, and CUDA 12.2 on the same box without a lot of headache, I think this is the best approach… or at least, I haven’t seen another approach with better tradeoffs.)
For the purposes of this demo/guide we are installing CUDA 12.1 because that’s all we need. Perhaps I can link a separate guide expanding on The Docker Way here in the future.
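For reference, The Docker Way looks roughly like this once the NVIDIA Container Toolkit is installed (the image tag is an assumption; pick whichever CUDA version you need):

```shell
# each container carries its own CUDA toolkit; the host only needs the driver
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
```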
Status Check
To be confident you are at the right part of the process, ```nvidia-smi``` should be present on the system AND produce reasonable output such as:
# this command
sudo nvidia-smi
# outputs this:
Tue Jan 30 01:58:37 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:41:00.0 Off |                  Off |
| 30%   57C    P8              26W / 300W |      3MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
… *don’t worry if it says 12.3 or some other CUDA version you didn’t pick here; that’s okay* (the driver reports the maximum CUDA version it supports, not the toolkit version you installed).
Next we can actually run the command recommended by the PyTorch installer website; in my case that was
pip3 install torch torchvision torchaudio
and that should look like
$ pip3 install torch torchvision torchaudio
Defaulting to user installation because normal site-packages is not writeable
Collecting torch
Downloading torch-2.1.2-cp310-cp310-manylinux1_x86_64.whl (670.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━━━━━ 502.6/670.2 MB 116.7 MB/s eta 0:00:02
... (lot of the downloading and installing happening...)
Test that CUDA is okay now:
python3 -c "import torch; print(torch.cuda.device_count())"
The output should be 1, or however many CUDA devices you actually have.
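A slightly more defensive version of that check (this exact script is my own sketch; it degrades gracefully when torch isn't importable):

```python
import importlib.util

# probe for torch without raising ImportError if it's absent
if importlib.util.find_spec("torch") is None:
    print("torch is not installed in this environment")
else:
    import torch
    print("CUDA available:", torch.cuda.is_available())
    print("CUDA device count:", torch.cuda.device_count())
```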