Deep learning on a gaming GPU!?
Beware the Leopard
Licensing – okay, so buried in the license is an agreement that you won’t use these gaming cards in a data center context.
NV is keenly aware of very positive case studies like the Deep Dream app, where the team went from paying $40k/mo for expensive “commercial grade” GPUs at Amazon to running a bunch of gaming GPUs out of their garage, and a whole string of interesting business developments fell out of just that one project. Business lines, verticals, and segmentation must be fiercely maintained.
Just be aware that if you have a large project and you need to scale, There May Be Licensing Issues with these Team Green GPUs. And those issues may not exist if you use a competitor.
Getting Started
Nvidia, and many others, have spent a king’s ransom working on user-friendliness. It is super easy to get started with CUDA and spin up machine learning for any kind of experiment you might want to undertake.
Our main goal is to test ResNet 50, but other benchmarks could include Inception (v3/v4) or ResNet 152, at a few different batch sizes to see how they do.
Prerequisites
This guide is for Ubuntu (18.04)!
You’ll need to start by blacklisting nouveau in /etc/modprobe.d/blacklist.conf
sudo vi /etc/modprobe.d/blacklist.conf
Add this anywhere in the file, on its own line:
blacklist nouveau
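If you’d rather skip the editor, this one-liner does the same thing. (On some systems nouveau also lives in the initramfs, so regenerating it can’t hurt.)
echo "blacklist nouveau" | sudo tee -a /etc/modprobe.d/blacklist.conf
sudo update-initramfs -u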
Now reboot. If you’re running a UI, it will not start; you’ll be dropped to a terminal.
Make sure ‘build-essential’ is installed
sudo apt install build-essential -y
At this point you’ll need to install Nvidia’s CUDA Toolkit.
Now reboot again.
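Before going further, it’s a good idea to confirm the driver actually loaded. nvidia-smi should print a table with your card and driver version:
nvidia-smi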
Setting up the Environment
Getting things ready is pretty easy now.
Docker
You’ll need two things - docker and nvidia-docker2
sudo apt install docker.io
sudo apt install curl
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt install nvidia-docker2
sudo systemctl restart docker
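Quick sanity check: this should print the same nvidia-smi table from inside a container. (The nvidia/cuda:11.0-base tag is just an example image from Docker Hub; tags get retired, so substitute a current one if it 404s.)
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi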
Baremetal/Without Docker
Add NVIDIA package repositories
(This section taken from https://www.tensorflow.org/install/gpu)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
sudo apt-get update
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt-get update
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
sudo apt install ./libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
sudo apt-get update
sudo apt-get install --no-install-recommends \
    cuda-11-0 \
    libcudnn8=8.0.4.30-1+cuda11.0 \
    libcudnn8-dev=8.0.4.30-1+cuda11.0
sudo apt-get install -y --no-install-recommends \
    libnvinfer7=7.1.3-1+cuda11.0 \
    libnvinfer-dev=7.1.3-1+cuda11.0 \
    libnvinfer-plugin7=7.1.3-1+cuda11.0
sudo apt install python3-pip
pip3 install --upgrade pip
pip3 install tensorflow
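Before benchmarking anything, it’s worth a quick check that TensorFlow can actually see the GPU; an empty list here means something in the driver/CUDA/cuDNN chain is broken:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"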
Now Grab TF Benchmarks
wget https://github.com/tensorflow/benchmarks/archive/master.zip
Have a look around the TensorFlow GitHub, but this is where we’re going to start.
In a larger sense, you should also be aware that a lot of the highest-end machine learning has already moved on from General Purpose GPU (GP-GPU) computing; TensorFlow has its own dedicated hardware now (Google’s TPUs). Still, this is useful as a learning exercise.
unzip master.zip
cd benchmarks-master/scripts/tf_cnn_benchmarks/
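The script has a ton of knobs; you can list them all before running anything:
python tf_cnn_benchmarks.py --help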
Run some preliminary tests:
The tests are located in benchmarks-master/scripts/tf_cnn_benchmarks/ (adjust the path to wherever you unzipped the archive).
FP32 - Only a 256 batch size will fit into VRAM. Note XLA off in the first command and on in the second.
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server --data_format=NCHW
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server --data_format=NCHW --xla_compile=true
FP16 - A 512 batch size will fit into VRAM. XLA on and off here as well.
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=512 --model=resnet50 --variable_update=parameter_server --data_format=NCHW --use_fp16 --xla_compile=true
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=512 --model=resnet50 --variable_update=parameter_server --data_format=NCHW --use_fp16
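As an aside, for scripts that don’t expose an --xla_compile flag, stock TensorFlow can enable XLA auto-clustering through an environment variable (my_training_script.py here is just a stand-in for whatever you’re running, and behavior varies by TensorFlow version):
TF_XLA_FLAGS=--tf_xla_auto_jit=2 python3 my_training_script.py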
Now, we can run some nvidia examples:
cd ../nvidia-examples/cnn/
python resnet.py
Finally:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=512 --model=resnet50 --variable_update=parameter_server --data_format=NCHW --xla_compile=true
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server --data_format=NCHW --xla_compile=true
python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=280 --num_batches=100 --model=resnet50 --optimizer=momentum --variable_update=replicated --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=1 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --local_parameter_device=gpu --num_gpus=1 --display_every=10 --xla_compile=true
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server --data_format=NCHW
With Docker (Easy button)
TensorFlow 2.3 has a lot of performance enhancements and general improvements. Let’s take it for a spin, via Docker.
docker run --gpus all --shm-size=48g -it --rm nvcr.io/nvidia/tensorflow:20.12-tf2-py3
This is the NVIDIA example without tuning, so XLA won’t matter here. The script isn’t tuned, but it’s a quick test to make sure everything is running.
cd /nvidia-examples/cnn/
python resnet.py
You can change precision and batch size - edit resnet.py (e.g. vi resnet.py) to tune performance.
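Depending on the container version, resnet.py may also accept flags directly instead of editing the file. These flag names are a guess on my part, so check python resnet.py --help first:
python resnet.py --precision=fp16 --batch_size=512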
Our Results!
FP16 batch 512:
global_step: 10 images_per_sec: 175.2
global_step: 20 images_per_sec: 980.5
global_step: 30 images_per_sec: 973.9
global_step: 40 images_per_sec: 974.1
global_step: 50 images_per_sec: 976.4
global_step: 60 images_per_sec: 972.5
global_step: 70 images_per_sec: 973.3
global_step: 80 images_per_sec: 974.5
global_step: 90 images_per_sec: 972.7
global_step: 100 images_per_sec: 969.9
global_step: 110 images_per_sec: 974.0
global_step: 120 images_per_sec: 968.7
global_step: 130 images_per_sec: 972.6
global_step: 140 images_per_sec: 971.1
global_step: 150 images_per_sec: 970.8
global_step: 160 images_per_sec: 976.2
global_step: 170 images_per_sec: 972.5
global_step: 180 images_per_sec: 976.9
global_step: 190 images_per_sec: 974.8
global_step: 200 images_per_sec: 979.5
global_step: 210 images_per_sec: 978.5
global_step: 220 images_per_sec: 972.2
global_step: 230 images_per_sec: 973.1
global_step: 240 images_per_sec: 973.3
global_step: 250 images_per_sec: 970.7
global_step: 260 images_per_sec: 971.3
global_step: 270 images_per_sec: 969.8
global_step: 280 images_per_sec: 976.0
global_step: 290 images_per_sec: 971.1
global_step: 300 images_per_sec: 971.0
epoch: 0 time_taken: 181.7
300/300 - 164s - loss: 9.1375 - top1: 0.8283 - top5: 0.8686
FP32 batch 256:
global_step: 10 images_per_sec: 151.1
global_step: 20 images_per_sec: 465.7
global_step: 30 images_per_sec: 463.0
global_step: 40 images_per_sec: 465.8
global_step: 50 images_per_sec: 461.6
global_step: 60 images_per_sec: 464.1
global_step: 70 images_per_sec: 462.8
global_step: 80 images_per_sec: 462.2
global_step: 90 images_per_sec: 464.7
global_step: 100 images_per_sec: 464.3
global_step: 110 images_per_sec: 466.8
global_step: 120 images_per_sec: 466.4
global_step: 130 images_per_sec: 466.3
global_step: 140 images_per_sec: 464.4
global_step: 150 images_per_sec: 466.5
global_step: 160 images_per_sec: 467.7
global_step: 170 images_per_sec: 468.4
global_step: 180 images_per_sec: 466.3
global_step: 190 images_per_sec: 465.5
global_step: 200 images_per_sec: 465.9
global_step: 210 images_per_sec: 465.7
global_step: 220 images_per_sec: 466.7
global_step: 230 images_per_sec: 468.0
global_step: 240 images_per_sec: 467.6
global_step: 250 images_per_sec: 466.9
global_step: 260 images_per_sec: 467.2
global_step: 270 images_per_sec: 466.7
global_step: 280 images_per_sec: 466.9
global_step: 290 images_per_sec: 468.5
global_step: 300 images_per_sec: 466.2
epoch: 0 time_taken: 176.3
300/300 - 166s - loss: 8.5827 - top1: 0.8423 - top5: 0.8661
tf_cnn_benchmarks with much tweaking
We’re giv’n’er all she’s got, Captain!
wget https://github.com/tensorflow/benchmarks/archive/master.zip
unzip master.zip
cd benchmarks-master/scripts/tf_cnn_benchmarks/
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=512 --model=resnet50 --variable_update=parameter_server --data_format=NCHW --use_fp16 --xla_compile=true
And here is the output – note a significant performance uplift: almost 1,500 cats per second in FP16!
FP16 batch 512 XLA enabled (Somehow XLA is running?)
Done warm up
Step Img/sec total_loss
1 images/sec: 1471.4 +/- 0.0 (jitter = 0.0) 7.888
10 images/sec: 1464.8 +/- 2.5 (jitter = 11.0) 7.943
20 images/sec: 1463.3 +/- 1.5 (jitter = 4.9) 7.795
30 images/sec: 1463.4 +/- 1.3 (jitter = 4.6) 7.840
40 images/sec: 1462.3 +/- 1.1 (jitter = 3.4) 7.759
50 images/sec: 1462.7 +/- 1.0 (jitter = 3.7) 7.866
60 images/sec: 1462.3 +/- 0.9 (jitter = 4.5) 7.802
70 images/sec: 1462.3 +/- 0.9 (jitter = 4.9) 7.714
80 images/sec: 1461.5 +/- 0.8 (jitter = 3.8) 7.730
90 images/sec: 1460.5 +/- 0.8 (jitter = 4.0) 7.696
100 images/sec: 1460.4 +/- 0.8 (jitter = 4.2) 7.675
----------------------------------------------------------------
total images/sec: 1460.16
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server --data_format=NCHW --xla_compile=true
FP32 Batch 256 XLA Enabled
Step Img/sec total_loss
1 images/sec: 617.3 +/- 0.0 (jitter = 0.0) 7.884
10 images/sec: 614.8 +/- 0.6 (jitter = 3.0) 7.957
20 images/sec: 614.6 +/- 0.5 (jitter = 3.3) 7.875
30 images/sec: 614.7 +/- 0.4 (jitter = 3.1) 7.918
40 images/sec: 614.7 +/- 0.4 (jitter = 2.0) 7.802
50 images/sec: 614.6 +/- 0.3 (jitter = 2.0) 7.883
60 images/sec: 614.3 +/- 0.3 (jitter = 3.8) 7.891
70 images/sec: 613.8 +/- 0.3 (jitter = 3.4) 7.860
80 images/sec: 613.7 +/- 0.3 (jitter = 3.3) 7.826
90 images/sec: 613.6 +/- 0.3 (jitter = 3.3) 7.765
100 images/sec: 613.5 +/- 0.3 (jitter = 3.3) 7.900
----------------------------------------------------------------
total images/sec: 613.42
----------------------------------------------------------------
Takeaway
In some configurations, you would only see 600-700 images/sec at FP16. Reject this! 1200-1400 images/sec is easily possible with newer TensorFlow and some performance tuning. XLA tuning can be important depending on the scenario, though less so here, since almost everything has “caught up” to a reasonable level of optimization.
Of course, the A100 can “pack” two FP16 operations inside one FP32 operation, which is mostly a software thing, so it can nearly (but not quite) double the FP16 performance of a 3090; the 3090’s software stack does not transparently pack two FP16 operations into FP32 at a time.
3080 performance is similar, assuming your job/project can fit into 10GB of VRAM. The extra 14GB of VRAM on the 3090 is definitely handy for larger batch sizes or datasets.
I suppose I shouldn’t be surprised, but the 3090 can offer about 2x performance of my Tesla V100, a $7000 compute card, in many compute scenarios.
And remember, this is ONLY Resnet50 testing!
Overall, since launch, we’ve seen performance go from ~462 images/sec on ResNet50/FP32 to ~600 (woo!) and from ~900 images/sec to ~1450 images/sec on fp16.
These are really killer numbers especially for the not-$7000 RTX 3090.
Some of this is attributable to TensorFlow improvements, but a fair bit of it is improvements up and down the rest of the software stack outside the TensorFlow project.
(TUNED) ResNet 50 Results (images/sec)

GPU          FP32    FP16
A100          850    2300
V100          383    1120
RTX 3090      617    1400
Titan RTX     400    1200