How to RTX 3090 and TensorFlow like a Pro

Deep learning on a gaming GPU!?


Beware the Leopard

Licensing – okay, so buried in the license is an agreement that you won’t use these gaming cards in a data center context.

NV is keenly aware of very positive case studies like the Deep Dream app, whose developers went from paying $40k/mo for expensive “commercial grade” GPUs at Amazon to running a bunch of gaming GPUs out of their garage; all the interesting related business opportunities fell out of that one project. Business lines, verticals and segmentation must be fiercely maintained.

Just be aware that if you have a large project and you need to scale, There May Be Licensing Issues with these Team Green GPUs. And those issues may not exist if you use a competitor.

Getting Started

Nvidia, and many others, have spent a king’s ransom working on user-friendliness. It is super easy to get started with CUDA and spin up machine learning for any kind of experiment you might want to undertake.

Our main goal is to test ResNet 50, but other benchmarks could include Inception (v3/v4) or ResNet 152, at some different batch sizes to see how they do.


This guide is for Ubuntu!

You’ll need to start by blacklisting nouveau in /etc/modprobe.d/blacklist.conf

sudo vi /etc/modprobe.d/blacklist.conf

Add this on any empty line:

blacklist nouveau

Now reboot. If you’re running a UI, it will not start; you will get a terminal only.
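If you want to double-check the blacklist step after the reboot, here is a minimal sketch in Python. It assumes the blacklist lives in /etc/modprobe.d/blacklist.conf as above and that /proc/modules lists loaded modules one per line with the module name first; the helper names `is_blacklisted` and `is_loaded` are made up for this example.

```python
from pathlib import Path


def is_blacklisted(conf_text, module="nouveau"):
    """Return True if conf_text has an active 'blacklist <module>' line."""
    for raw in conf_text.splitlines():
        # Drop anything after a '#', then compare the remaining tokens.
        tokens = raw.split("#", 1)[0].split()
        if tokens == ["blacklist", module]:
            return True
    return False


def is_loaded(modules_text, module="nouveau"):
    """Return True if module appears in /proc/modules-style text."""
    return any(line.split()[0] == module
               for line in modules_text.splitlines() if line.strip())


if __name__ == "__main__":
    conf = Path("/etc/modprobe.d/blacklist.conf")
    modules = Path("/proc/modules")
    if conf.exists():
        print("blacklisted:", is_blacklisted(conf.read_text()))
    if modules.exists():
        print("still loaded:", is_loaded(modules.read_text()))
```

If the second line still reports the module as loaded, the blacklist probably has not taken effect yet.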

Make sure ‘build-essential’ is installed

sudo apt install build-essential -y

At this point you’ll need to install Nvidia’s CUDA Toolkit.

Now reboot again.

Setting up the Environment

Getting things ready is pretty easy now.


You’ll need two things: Docker and nvidia-docker2

sudo apt install
sudo apt install curl

curl -s -L | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt install nvidia-docker2
sudo systemctl restart docker

Baremetal/Without Docker

Add NVIDIA package repositories

(This section taken from


sudo mv /etc/apt/preferences.d/cuda-repository-pin-600

sudo apt-key adv --fetch-keys

sudo add-apt-repository "deb /"

sudo apt-get update

wget

sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

sudo apt-get update


sudo apt install ./libnvinfer7_7.1.3-1+cuda11.0_amd64.deb

sudo apt-get update

sudo apt-get install --no-install-recommends \
    cuda-11-0 \
    libcudnn8= \
    libcudnn8-dev=

sudo apt-get install -y --no-install-recommends \
    libnvinfer7=7.1.3-1+cuda11.0 \
    libnvinfer-dev=7.1.3-1+cuda11.0 \
    libnvinfer-plugin7=7.1.3-1+cuda11.0

sudo apt install python3-pip
pip3 install --upgrade pip
pip3 install tensorflow
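Before grabbing the benchmarks, it is worth confirming that TensorFlow can actually see the card. This is a quick sanity-check sketch, assuming a TensorFlow 2.x install (it relies on `tf.config.list_physical_devices`); the `visible_gpus` helper is just a name picked for this example.

```python
def visible_gpus():
    """Return the GPU device names TensorFlow can see, or None if
    TensorFlow itself cannot be imported."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [dev.name for dev in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment")
elif not gpus:
    print("TensorFlow imported, but no GPU is visible")
else:
    print("GPUs visible to TensorFlow:", gpus)
```

An empty list usually points at a driver, CUDA or cuDNN mismatch rather than at TensorFlow itself.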

Now Grab TF Benchmarks


Look around the TensorFlow GitHub, but we’re going to get started there.

In a larger sense, you should also be aware that a lot of the highest-end machine learning has already moved on from general-purpose graphics processing units (GPGPU). TensorFlow has its own hardware now :slight_smile: But still, this is useful as a learning exercise.

cd benchmarks-master/scripts/tf_cnn_benchmarks/

Run some preliminary tests:

The tests are located in /benchmarks-master/scripts/tf_cnn_benchmarks/ (check path please)

FP32 - only a batch size of 256 will fit into VRAM. Note the runs with XLA off and on.

 python --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server --data_format=NCHW

  python --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server --data_format=NCHW --xla_compile=true

FP16 - a batch size of 512 will fit into VRAM. XLA on and off here as well.

  python --num_gpus=1 --batch_size=512 --model=resnet50 --variable_update=parameter_server --data_format=NCHW --use_fp16 --xla_compile=true

  python --num_gpus=1 --batch_size=512 --model=resnet50 --variable_update=parameter_server --data_format=NCHW --use_fp16
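The four runs above (FP32 and FP16, each with XLA off and on) are easy to mistype by hand, so here is a small sketch that generates them. The flags are the ones used in the commands above; the script name `tf_cnn_benchmarks.py` is an assumption based on the directory name, and `build_command` and `sweep` are hypothetical helpers.

```python
import shlex

# Assumption: the benchmark entry point is named after its directory.
SCRIPT = "tf_cnn_benchmarks.py"


def build_command(batch_size, fp16=False, xla=False, model="resnet50"):
    """Build the argument list for one single-GPU benchmark run."""
    args = [
        "python", SCRIPT,
        "--num_gpus=1",
        f"--batch_size={batch_size}",
        f"--model={model}",
        "--variable_update=parameter_server",
        "--data_format=NCHW",
    ]
    if fp16:
        args.append("--use_fp16")
    if xla:
        args.append("--xla_compile=true")
    return args


def sweep():
    """FP32 at batch 256 and FP16 at batch 512, each with XLA off and on."""
    return [build_command(batch, fp16, xla)
            for fp16, batch in ((False, 256), (True, 512))
            for xla in (False, True)]


for cmd in sweep():
    print(shlex.join(cmd))
```

Each list can be handed straight to `subprocess.run` from inside the benchmark directory if you would rather launch the runs from Python.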

Now we can run some NVIDIA examples:

cd ../nvidia-examples/cnn/


 python --num_gpus=1 --batch_size=512 --model=resnet50 --variable_update=parameter_server --data_format=NCHW --xla_compile=true

 python --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server --data_format=NCHW --xla_compile=true

python --data_format=NCHW --batch_size=280 --num_batches=100 --model=resnet50 --optimizer=momentum --variable_update=replicated --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=1 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --local_parameter_device=gpu --num_gpus=1 --display_every=10 --xla_compile=true
python --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server --data_format=NCHW

With Docker (Easy button)

TensorFlow 2.3 has a lot of performance enhancements and general improvements. Let’s take it for a spin via Docker.

docker run --gpus all --shm-size=48g -it --rm -v cri

NVIDIA example without tuning; XLA won’t matter. This script isn’t tuned, but it is a quick test to make sure everything is running.

cd /nvidia-examples/cnn/

You can change precision and batch size - vi to tune performance.

Our Results!

FP16 batch 512 :

global_step: 10 images_per_sec: 175.2
global_step: 20 images_per_sec: 980.5
global_step: 30 images_per_sec: 973.9
global_step: 40 images_per_sec: 974.1
global_step: 50 images_per_sec: 976.4
global_step: 60 images_per_sec: 972.5
global_step: 70 images_per_sec: 973.3
global_step: 80 images_per_sec: 974.5
global_step: 90 images_per_sec: 972.7
global_step: 100 images_per_sec: 969.9
global_step: 110 images_per_sec: 974.0
global_step: 120 images_per_sec: 968.7
global_step: 130 images_per_sec: 972.6
global_step: 140 images_per_sec: 971.1
global_step: 150 images_per_sec: 970.8
global_step: 160 images_per_sec: 976.2
global_step: 170 images_per_sec: 972.5
global_step: 180 images_per_sec: 976.9
global_step: 190 images_per_sec: 974.8
global_step: 200 images_per_sec: 979.5
global_step: 210 images_per_sec: 978.5
global_step: 220 images_per_sec: 972.2
global_step: 230 images_per_sec: 973.1
global_step: 240 images_per_sec: 973.3
global_step: 250 images_per_sec: 970.7
global_step: 260 images_per_sec: 971.3
global_step: 270 images_per_sec: 969.8
global_step: 280 images_per_sec: 976.0
global_step: 290 images_per_sec: 971.1
global_step: 300 images_per_sec: 971.0
epoch: 0 time_taken: 181.7
300/300 - 164s - loss: 9.1375 - top1: 0.8283 - top5: 0.8686

FP32 batch 256 :

global_step: 10 images_per_sec: 151.1
global_step: 20 images_per_sec: 465.7
global_step: 30 images_per_sec: 463.0
global_step: 40 images_per_sec: 465.8
global_step: 50 images_per_sec: 461.6
global_step: 60 images_per_sec: 464.1
global_step: 70 images_per_sec: 462.8
global_step: 80 images_per_sec: 462.2
global_step: 90 images_per_sec: 464.7
global_step: 100 images_per_sec: 464.3
global_step: 110 images_per_sec: 466.8
global_step: 120 images_per_sec: 466.4
global_step: 130 images_per_sec: 466.3
global_step: 140 images_per_sec: 464.4
global_step: 150 images_per_sec: 466.5
global_step: 160 images_per_sec: 467.7
global_step: 170 images_per_sec: 468.4
global_step: 180 images_per_sec: 466.3
global_step: 190 images_per_sec: 465.5
global_step: 200 images_per_sec: 465.9
global_step: 210 images_per_sec: 465.7
global_step: 220 images_per_sec: 466.7
global_step: 230 images_per_sec: 468.0
global_step: 240 images_per_sec: 467.6
global_step: 250 images_per_sec: 466.9
global_step: 260 images_per_sec: 467.2
global_step: 270 images_per_sec: 466.7
global_step: 280 images_per_sec: 466.9
global_step: 290 images_per_sec: 468.5
global_step: 300 images_per_sec: 466.2
epoch: 0 time_taken: 176.3
300/300 - 166s - loss: 8.5827 - top1: 0.8423 - top5: 0.8661
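The first reported step in both logs is a warm-up outlier, so averaging everything understates the real rate. Here is a minimal sketch for pulling a steady-state number out of logs in this format; the regular expression and the `steady_state_rate` helper are written for this post, not taken from the NVIDIA scripts, and the sample lines are the first four from the FP32 run above.

```python
import re
from statistics import mean

# Matches lines such as "global_step: 20 images_per_sec: 465.7".
STEP_RE = re.compile(r"global_step:\s*(\d+)\s+images_per_sec:\s*([\d.]+)")


def steady_state_rate(log_text, warmup_entries=1):
    """Mean images/sec after dropping the first warmup_entries readings."""
    rates = [float(m.group(2)) for m in STEP_RE.finditer(log_text)]
    if len(rates) <= warmup_entries:
        raise ValueError("not enough readings after warm-up")
    return mean(rates[warmup_entries:])


sample = """\
global_step: 10 images_per_sec: 151.1
global_step: 20 images_per_sec: 465.7
global_step: 30 images_per_sec: 463.0
global_step: 40 images_per_sec: 465.8
"""
print(f"{steady_state_rate(sample):.1f} images/sec")
```

Paste a full log into `sample` (or read it from a file) to get the average for a whole run.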

tf_cnn_benchmark with much tweaking

We’re giv’n’er all she’s got, Captain!

cd benchmarks-master/scripts/tf_cnn_benchmarks/

python --num_gpus=1 --batch_size=512 --model=resnet50 --variable_update=parameter_server --data_format=NCHW --use_fp16 --xla_compile=true --data_format=NCHW

And here is the output – note the significant performance uplift. Almost 1500 cats per second in FP16!

FP16 batch 512 XLA enabled (Somehow XLA is running?)

Done warm up
Step	Img/sec	total_loss
1	images/sec: 1471.4 +/- 0.0 (jitter = 0.0)	7.888
10	images/sec: 1464.8 +/- 2.5 (jitter = 11.0)	7.943
20	images/sec: 1463.3 +/- 1.5 (jitter = 4.9)	7.795
30	images/sec: 1463.4 +/- 1.3 (jitter = 4.6)	7.840
40	images/sec: 1462.3 +/- 1.1 (jitter = 3.4)	7.759
50	images/sec: 1462.7 +/- 1.0 (jitter = 3.7)	7.866
60	images/sec: 1462.3 +/- 0.9 (jitter = 4.5)	7.802
70	images/sec: 1462.3 +/- 0.9 (jitter = 4.9)	7.714
80	images/sec: 1461.5 +/- 0.8 (jitter = 3.8)	7.730
90	images/sec: 1460.5 +/- 0.8 (jitter = 4.0)	7.696
100	images/sec: 1460.4 +/- 0.8 (jitter = 4.2)	7.675
total images/sec: 1460.16

python --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server --data_format=NCHW --xla_compile=true --data_format=NCHW

FP32 Batch 256 XLA Enabled

Step	Img/sec	total_loss
1	images/sec: 617.3 +/- 0.0 (jitter = 0.0)	7.884
10	images/sec: 614.8 +/- 0.6 (jitter = 3.0)	7.957
20	images/sec: 614.6 +/- 0.5 (jitter = 3.3)	7.875
30	images/sec: 614.7 +/- 0.4 (jitter = 3.1)	7.918
40	images/sec: 614.7 +/- 0.4 (jitter = 2.0)	7.802
50	images/sec: 614.6 +/- 0.3 (jitter = 2.0)	7.883
60	images/sec: 614.3 +/- 0.3 (jitter = 3.8)	7.891
70	images/sec: 613.8 +/- 0.3 (jitter = 3.4)	7.860
80	images/sec: 613.7 +/- 0.3 (jitter = 3.3)	7.826
90	images/sec: 613.6 +/- 0.3 (jitter = 3.3)	7.765
100	images/sec: 613.5 +/- 0.3 (jitter = 3.3)	7.900
total images/sec: 613.42


In some configurations, you would only see 600-700 images/sec @ fp16. Reject this! 1200-1400 images/sec is easily possible with newer TensorFlow and some performance tuning. XLA tuning can be important depending on the scenario, but not really here because almost everything has “caught up” to a reasonable level of optimization.

Of course the A100 can “pack” FP16 inside FP32 operations, which is mostly a software thing, so it can nearly (but not quite) double the performance of a 3090 with fp16 since the 3090 software stack is not packing two fp16 operations at a time into fp32 transparently.

3080 performance is similar, assuming your job/project can fit into 10 GB of VRAM. The extra 14 GB of VRAM on the 3090 is definitely handy for larger batch sizes or datasets.

I suppose I shouldn’t be surprised, but the 3090 can offer about 2x performance of my Tesla V100, a $7000 compute card, in many compute scenarios.

And remember, this is ONLY Resnet50 testing!

Overall, since launch, we’ve seen performance go from ~462 images/sec on ResNet50/FP32 to ~600 (woo!) and from ~900 images/sec to ~1450 images/sec on fp16.
These are really killer numbers especially for the not-$7000 RTX 3090.

Some of this is attributable to TensorFlow improvements, but a fair bit of it is improvements up and down the rest of the software stack outside the TensorFlow project.

(TUNED) ResNet 50 Results

GPU         FP32 (images/sec)   FP16 (images/sec)
A100        850                 2300
V100        383                 1120
RTX 3090    617                 1400
Titan RTX   400                 1200
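To compare the cards more directly, the tuned numbers above can be turned into ratios. This is only a sketch of the arithmetic: the figures are copied from the tuned results, while the `fp16_speedup` and `relative_to` helpers are names invented for this example.

```python
# Tuned ResNet 50 results from above: (FP32, FP16) in images/sec.
RESULTS = {
    "A100": (850, 2300),
    "V100": (383, 1120),
    "RTX 3090": (617, 1400),
    "Titan RTX": (400, 1200),
}


def fp16_speedup(results):
    """FP16 throughput divided by FP32 throughput, per card."""
    return {gpu: fp16 / fp32 for gpu, (fp32, fp16) in results.items()}


def relative_to(results, baseline):
    """(FP32 ratio, FP16 ratio) of every card against the baseline card."""
    base_fp32, base_fp16 = results[baseline]
    return {gpu: (fp32 / base_fp32, fp16 / base_fp16)
            for gpu, (fp32, fp16) in results.items()}


for gpu, ratio in fp16_speedup(RESULTS).items():
    print(f"{gpu}: FP16 is {ratio:.2f}x FP32")

for gpu, (r32, r16) in relative_to(RESULTS, "V100").items():
    print(f"{gpu} vs V100: {r32:.2f}x (FP32), {r16:.2f}x (FP16)")
```

Swap in your own measurements to see where your card lands against these.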

Huge thanks to Mark M for helping out with this one. :slight_smile:


Hi Wendell

Thanks for this guide. I recently got a 3090, and while running the tests above, my scores were between 5 and 10% lower than yours.

I’m just wondering what could be different. Are you using an overclocked 3090?

I’m on an ASUS TUF Gaming NVIDIA GeForce RTX 3090 OC, clocked at 1770 MHz.

My system is a TR 3690x, Fedora 33, running everything under Gnome.


SuprimX 3090 :slight_smile:

Could also be your kernel version or power plan. Set performance cpu governor and re-run?

Also, is it the ONLY GPU in the system, or does it share with the host desktop? In my case the GPU was dedicated only to ML tasks, with another GPU being the primary desktop GPU.
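A quick way to check the governor before re-running is to read it from sysfs. This is a minimal sketch, assuming the usual Linux cpufreq layout under /sys/devices/system/cpu; `read_governors` and `not_in_performance` are hypothetical helper names.

```python
from pathlib import Path


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Map each CPU directory name (cpu0, cpu1, ...) to its governor."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(cpu_root).glob(pattern)):
        cpu_name = gov_file.parent.parent.name
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


def not_in_performance(governors):
    """CPUs whose governor is anything other than 'performance'."""
    return sorted(cpu for cpu, gov in governors.items()
                  if gov != "performance")


current = read_governors()
print("governors:", current)
print("not in performance mode:", not_in_performance(current))
```

An empty dictionary means the cpufreq files were not found, which can happen inside containers or VMs.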

Hi wendell

Setting the cpu governor to performance helped a bit.

I think most of the difference can be explained by the max clock (1875 MHz vs 1770 MHz), but thanks again for the article!

Intriguing post; unfortunately, I immediately ran into some problems I could not make sense of:

python --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server --data_format=NCHW --precision=fp16
2020-12-28 15:18:27.893772: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory
2020-12-28 15:18:27.893789: I tensorflow/stream_executor/cuda/] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
WARNING:tensorflow:From /home/gerd/.local/lib/python3.8/site-packages/tensorflow/python/compat/ disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
FATAL Flags parsing error: Unknown command line flag 'precision'. Did you mean: partitioned_graph_file_prefix ?
Pass --helpshort or --helpfull to see help on flags.

Can anyone shed some light on this?


$ neofetch 
            .-/+oossssoo+/-.               gerd@zenmasterl 
        `:+ssssssssssssssssss+:`           --------------- 
      -+ssssssssssssssssssyyssss+-         OS: Ubuntu 20.04.1 LTS x86_64 
    .ossssssssssssssssssdMMMNysssso.       Host: X570 AORUS MASTER -CF 
   /ssssssssssshdmmNNmmyNMMMMhssssss/      Kernel: 5.4.0-58-lowlatency 
  +ssssssssshmydMMMMMMMNddddyssssssss+     Uptime: 5 hours, 6 mins 
 /sssssssshNMMMyhhyyyyhmNMMMNhssssssss/    Packages: 2066 (dpkg), 6 (snap) 
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Shell: bash 5.0.17 
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   Resolution: 3440x1440 
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   DE: GNOME 
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   WM: Mutter 
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   WM Theme: Adwaita 
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Theme: Yaru [GTK2/3] 
 /sssssssshNMMMyhhyyyyhdNMMMNhssssssss/    Icons: Yaru [GTK2/3] 
  +sssssssssdmydMMMMMMMMddddyssssssss+     Terminal: terminator 
   /ssssssssssshdmNNNNmyNMMMMhssssss/      CPU: AMD Ryzen 9 5950X (32) @ 3.400GHz 
    .ossssssssssssssssssdMMMNysssso.       GPU: NVIDIA 0c:00.0 NVIDIA Corporation Device 2204 
      -+sssssssssssssssssyyyssss+-         Memory: 8723MiB / 64327MiB 

That unknown NVIDIA GPU is a Zotac 3090 Trinity.

hmm, try the docker container?

The usual advice for a missing library is to build from source, but the Docker container may be a faster route in this case.

Docker does nothing but confuse me :-/ but, I guess I’ll take the plunge.

Kinda off-topic: Don’t people hate TensorFlow, and aren’t they ditching it for PyTorch?

Sorry man, there are probably some typos in here. I have made some quick edits and will dive deeper later.

The reason we use tensorflow’s tf_cnn_benchmark is because it’s standard and quick. We can compare our benchmarks to many others to be confident in the performance we’re getting. MLCommons (formerly mlperf) is another set of standard benchmark runs that I use in order to further gauge performance. This only covers some really easy and fun ways to test your hardware configurations. :slight_smile:

Hi folks, I did a flurry of edits to flesh this out some more. Please drop comments somewhere for formatting and any confusing parts.

TensorFlow is a bit of a mess, but AFAIK that talking point was mostly about TensorFlow 1.x. There you had to declaratively define the compute graph in a really weird way and then later feed variables into it.

PyTorch, on the other hand, has a fairly normal imperative model where the compute graph is traced during execution, and as a user you basically just write NumPy-style code, but it’s fast.

However, with TensorFlow 2.x that mostly changes, and you can write the same imperative style of code in TensorFlow. You also fully control when things are jitted to run fast, which is nice because you can turn it off for debugging. No idea how this is in PyTorch, though.

There are also (at least) two older APIs for TensorFlow: Keras and Estimators. I think Estimators’ only use case was to make TensorFlow 1.x bearable. Keras is inherited from another library, which makes for a nice API on the surface but breaks in extremely frustrating and impossible-to-debug ways when you try wandering off the beaten path.

Hi Wendell,

I watched the video to the end, and have both a slightly off topic and an on topic remark.

First on topic and related, I have been considering buying an NVidia Jetson TX2 dev kit, and was wondering if you have any experience on how they compare to workstation/desktop AI for trained nets. Are there any L1 videos or forums that currently discuss the Jetson that I may have overlooked or missed?

Second, slightly off topic but still related to the video. Although your stove backsplash may be easier for you to clean than Jensen’s would be, I suspect you may have forgotten how easy it is for Jensen to have his cleaned. As Nvidia’s CEO, Jensen is quite rich, so he has several options not affordable to mere mortals, which would make it much easier for him to ‘clean’ his:

  1. Have construction workers come in and replace the back splash with a new carving/design on a periodic basis. [Least likely]
  2. Have workers unmount and power wash the backsplash on a periodic basis.
  3. Just have the cleaning maids scrub it regularly… [Most likely, and some-what affordable]

It would be cool to motorize it. Oh, you’re cooking? It flips away the decorative backsplash and replaces it with metal.

I like functional things more than things that are opulent just to be opulent :slight_smile:

New member having found my way here from your ML/3090 YouTube vid.

As stated in vid comments, great to see you covering 3090 in respect to ML. Up to now been doing ML with a 1080, but recently purchased 3090 FE (still waiting for PSU to be delivered!). As a Masters A.I. student/part time consultant, doing experiments and prototyping on home PC, I wouldn’t go the route of multiple GPUs and 3090 solely dedicated to ML.

I typically set up my development environment as a Docker, using the MiniConda image with Ubuntu 18.04, so that I have a consistent dev environment (Pycharm pointing to image as Python interpreter) on local PC and in cloud to where I deploy. On local lab PC the container runs with Docker for Windows (shock horror, but what can I say, easy setup and access to MS Office).

Will def be comparing my benchmarks with your forum-published ones in next weeks once new PC built (upgrading from 5820K/1080 to 5950X/3090). Be interesting to see how my particular docker env compares with yours (i.e. how much perf I may be losing). Thanks again for content that really interests us ML enthusiasts (ex gamer, well maybe only MSFS ).

Hey Wendell, remember me? I’m that guy that built this dual rtx titan rig dedicated for ML / Deep Learning one month before the release of the 3090 (Build log 3970x).

Straight off the bat - love the video, thanks for doing it. I tried petitioning GN to do something on deep learning, which never materialized. I should have known you’d come through on this!

Please build on this. This niche was in dire need of the testing and analysis you bring to the table, just look at the daily posts on the topic over on r/machinelearning.

You might be surprised, but I think the market for this content is not who you might expect it to be. I’m quant finance to pay the bills, part time grad student & kaggler at night. The need for computational horsepower at a workstation level is primarily for my kaggle escapades where I need to run lots of experiments and test ideas QUICKLY. Non-big tech funded academia is typically small scale tinkering and can usually be achieved on modest hardware (though google cloud and nvidia do offer us free compute).

Feedback -

  • ResNet50 appears like a good benchmark on the surface, but it’s not very relevant anymore for CNN classification - EfficientNet b0-7 is now industry standard (

  • I’m a tad confused by the test runs, were you using exclusively FP32 vs FP16 for all stages of backprop/training? Some aspects of the computational graph don’t need the extra precision but some do, so we often use mixed precision ( For the most part I believe it’s really only those big beastly NLP transformer models that like FP32 exclusively.

  • I’m a hardware guy at heart, you are too and you’re definitely right re the benchmarks that appeared out of the gate, they were half baked at best. However drop in performance sans tinkering or optimization does matter. In terms of the research community, if you’re in academia you’re maybe trying a new approach to a layer, testing an optimization method etc, you’re unlikely to be testing anything at scale and tinkering with different compilers is probably not a good use of time. So mixed precision numbers are good indicators of what performance people will likely see.
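To illustrate the point about some parts of the graph needing the extra precision, here is a toy sketch in plain Python. It is not TensorFlow’s mixed-precision API; it only uses the `struct` module’s half-precision format to round values to FP16, and the wider accumulator is an ordinary Python float standing in for FP32. The function names are made up for this example.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sum_in_fp16(values):
    """Accumulate entirely in FP16, rounding after every addition."""
    total = 0.0
    for v in values:
        total = to_fp16(total + to_fp16(v))
    return total


def sum_with_wide_accumulator(values):
    """FP16 inputs, but the running total is kept in a wider float."""
    total = 0.0
    for v in values:
        total += to_fp16(v)
    return total


ones = [1.0] * 5000
print("pure FP16 sum:     ", sum_in_fp16(ones))
print("wide accumulator:  ", sum_with_wide_accumulator(ones))
```

The pure FP16 total stalls once it is large enough that adding 1.0 rounds away, which is the kind of failure that keeping reductions and weight updates in higher precision is meant to avoid.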

Research in the big stuff has been led by Google, NVIDIA or Facebook for a reason… you can throw whatever you have at TPUs/DGXs and not care. That’s largely why NVIDIA researchers have led the way with image GAN research.

Forgive me, this is rambly, but here’s what I’d love to see:

  • EfficientNet benchmarks/comparison with mixed precision.
  • Wall times! Sometimes those Images/sec look awesome but if my wall time is filled with some inefficiency that’s been introduced then I’m back where I started.
  • Dual 3090s over NVLink, this is the presumptive step from dual titans in a single workstation.

Something I keep in mind (in my day job and in deep learning) is the useful compute time threshold: If compute time to evaluate a model architecture or train my model improves from 10 hours to 8 hours, the speed up has no utility to me because I’m still going to be running that model overnight. I think this factors into a lot of the decision making process for small start up shops. If your current VRAM satisfies your requirements, you need to see some dramatic improvements in runtime before it makes sense to increase your compute capacity if you maintain your own hardware.

I have an accumulation of thrown together notebooks from various kaggle competitions / projects etc so feel free to reach out if you’d like to try something more current.

I’d also love to know how the 3090 can handle NVAE (


Sounds good. I’m going to need more redpilling on this. My experience comes from helping older PhD researchers that have a machine or two, and I’m left to fill in the gaps myself on what we are trying to do.

I’ll work on this in the interim. I have burned a lot of time behind the scenes on this general area, and I’m excited about it, but I get non-deterministic results. For specific example problem sets I can often hand-tune things and get enough of a performance delta that it throws off any meaningful comparison between generations of cards. ResNet50 was also slower when the V100 launched, but now? Much faster. I don’t see an easy way to do apples-to-apples testing comparisons without long-winded disclaimers.

There are also some DIY solutions to pack your own math to get around some of the gimping between pro and consumer parts for ML/AI now too, which muddies the water.

It’s very frustrating to try to condense into video talking points.

I’ve spent so much time with the v100s on this it’s a bit shameful I don’t have more content to show for it.


I also bought the Suprim X and so far it’s working great. I would love to see some language model benchmarks next time as well like GPT-2 (probably the 700m parameter model since that was all I could fit in 24GB) or BERT. Thanks for the great content as always.

Pytorch and Flax benchmarks would also be cool.


Not me. I am used to TensorFlow, use it in my Jupyter notebook all the time, and prefer using tf.keras most of the time.

Have you tried the same under a WSL2 instance? I am planning to switch over to that since it enables me to write shell scripts in a Jupyter notebook cell. You need a DirectML driver, though.