How to RTX 3090 and TensorFlow like a Pro

Hi wendell

Setting the CPU governor to performance helped a bit.

I think most of the difference can be explained by the max clock (1875 MHz vs 1770 MHz), but thanks again for the article!

Intriguing post! Unfortunately I immediately ran into some problems I could not make sense of:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server --data_format=NCHW --precision=fp16
2020-12-28 15:18:27.893772: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2020-12-28 15:18:27.893789: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
WARNING:tensorflow:From /home/gerd/.local/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
FATAL Flags parsing error: Unknown command line flag 'precision'. Did you mean: partitioned_graph_file_prefix ?
Pass --helpshort or --helpfull to see help on flags.

Can anyone shed some light on this?

System:

$ neofetch 
            .-/+oossssoo+/-.               gerd@zenmasterl 
        `:+ssssssssssssssssss+:`           --------------- 
      -+ssssssssssssssssssyyssss+-         OS: Ubuntu 20.04.1 LTS x86_64 
    .ossssssssssssssssssdMMMNysssso.       Host: X570 AORUS MASTER -CF 
   /ssssssssssshdmmNNmmyNMMMMhssssss/      Kernel: 5.4.0-58-lowlatency 
  +ssssssssshmydMMMMMMMNddddyssssssss+     Uptime: 5 hours, 6 mins 
 /sssssssshNMMMyhhyyyyhmNMMMNhssssssss/    Packages: 2066 (dpkg), 6 (snap) 
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Shell: bash 5.0.17 
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   Resolution: 3440x1440 
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   DE: GNOME 
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   WM: Mutter 
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   WM Theme: Adwaita 
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Theme: Yaru [GTK2/3] 
 /sssssssshNMMMyhhyyyyhdNMMMNhssssssss/    Icons: Yaru [GTK2/3] 
  +sssssssssdmydMMMMMMMMddddyssssssss+     Terminal: terminator 
   /ssssssssssshdmNNNNmyNMMMMhssssss/      CPU: AMD Ryzen 9 5950X (32) @ 3.400GHz 
    .ossssssssssssssssssdMMMNysssso.       GPU: NVIDIA 0c:00.0 NVIDIA Corporation Device 2204 
      -+sssssssssssssssssyyyssss+-         Memory: 8723MiB / 64327MiB 
        `:+ssssssssssssssssss+:`
            .-/+oossssoo+/-.                                       
                                                                   

That unknown NVIDIA GPU is a Zotac 3090 Trinity.

Hmm, try the docker container?

https://www.tensorflow.org/install/source

For the missing libcudart.so.11.0, the usual advice is to build from source, but the docker container may be a faster route in this case.
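
Something along these lines is what I mean, assuming Docker and the NVIDIA container toolkit are already installed (the exact image tag may differ):

docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu bash

That should drop you into a shell with CUDA and cuDNN already baked in, so the libcudart error goes away inside the container.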

Docker does nothing but confuse me :-/ but I guess I’ll take the plunge.

Kinda off-topic: don’t people hate TensorFlow, and aren’t they ditching it for PyTorch?

Sorry man, there are probably some typos in here. I have made some quick edits and will dive deeper later.

We use TensorFlow’s tf_cnn_benchmarks because it’s standard and quick. We can compare our numbers against many others to be confident in the performance we’re getting. MLCommons (formerly MLPerf) is another set of standard benchmark runs that I use to further gauge performance. This only covers some really easy and fun ways to test your hardware configurations. :slight_smile:

1 Like

Hi folks, I did a flurry of edits to flesh this out some more. Please drop comments here about formatting and any confusing parts.

1 Like

TensorFlow is a bit of a mess, but AFAIK that talking point was mostly about TensorFlow 1.x. There you had to declaratively define the compute graph in a really weird way and then later feed variables into it.

PyTorch, on the other hand, has a fairly normal imperative model where the compute graph is traced during execution; as a user you basically just write numpy-style code, but it’s fast.

However, with TensorFlow 2.x that mostly changed, and you can write the same imperative style of code in TensorFlow. You also fully control when things are JIT-compiled to run fast, which is nice because you can turn it off for debugging. No idea how this works in PyTorch though.
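
To make that concrete, here’s roughly what I mean, as a minimal TF 2.x sketch (IIRC on older 2.x releases the last call is tf.config.experimental_run_functions_eagerly):

import tensorflow as tf

# plain eager/imperative code, numpy-style
w = tf.Variable(tf.random.normal([3, 2]))

@tf.function  # opt in to graph compilation for speed
def forward(x):
    return tf.nn.relu(tf.matmul(x, w))

y = forward(tf.random.normal([4, 3]))      # traced and compiled on first call

tf.config.run_functions_eagerly(True)      # turn compilation off while debugging
y_dbg = forward(tf.random.normal([4, 3]))  # now runs eagerly, easy to step through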

There are also (at least) two older APIs for TensorFlow: Keras and Estimators. I think Estimators’ only use case was to make TensorFlow 1.x bearable. Keras was inherited from a separate library, which makes for a nice API on the surface but breaks in extremely frustrating and impossible-to-debug ways when you wander off the beaten path.
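
The “nice on the surface” part looks something like this (a toy sketch with random data, nothing more):

import numpy as np
import tensorflow as tf

# a few lines to define, compile and train a model...
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(16,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(np.random.rand(256, 16), np.random.rand(256, 1), epochs=2)

# ...the frustration starts once you need custom training loops, stateful
# custom layers, or anything the fit() abstraction hides from you.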

1 Like

Hi Wendell,

I watched the video to the end, and have both a slightly off topic and an on topic remark.

First, on topic and related: I have been considering buying an NVIDIA Jetson TX2 dev kit and was wondering if you have any experience with how they compare to workstation/desktop hardware for running trained nets. Are there any L1 videos or forum threads that currently discuss the Jetson that I may have missed?

Second, slightly off topic but still related to the video. Although your stove backsplash may be easier for you to clean than Jensen’s would be, I suspect you may have forgotten how easy it is for Jensen to have his cleaned. As NVIDIA’s CEO, Jensen is quite rich; he has several options not affordable to the mere mortal, which would make it much easier for him to ‘clean’ his by:

  1. Have construction workers come in and replace the backsplash with a new carving/design on a periodic basis. [Least likely]
  2. Have workers unmount and power wash the backsplash on a periodic basis.
  3. Just have the cleaning maids scrub it regularly… [Most likely, and somewhat affordable]

It would be cool to motorize it. Oh, you’re cooking? It flips away the decorative backsplash and replaces it with metal.

I like functional things more than things that are opulent just to be opulent :slight_smile:

New member, having found my way here from your ML/3090 YouTube vid.

As stated in the vid comments, great to see you covering the 3090 with respect to ML. Up to now I’ve been doing ML with a 1080, but I recently purchased a 3090 FE (still waiting for the PSU to be delivered!). As a Masters A.I. student/part-time consultant doing experiments and prototyping on a home PC, I wouldn’t go the route of multiple GPUs with a 3090 solely dedicated to ML.

I typically set up my development environment as a Docker container, using the Miniconda image with Ubuntu 18.04, so that I have a consistent dev environment (PyCharm pointing to the image as the Python interpreter) on my local PC and in the cloud where I deploy. On the local lab PC the container runs with Docker for Windows (shock horror, but what can I say: easy setup and access to MS Office).

Will definitely be comparing my benchmarks with your forum-published ones in the next few weeks once the new PC is built (upgrading from 5820K/1080 to 5950X/3090). It’ll be interesting to see how my particular docker env compares with yours (i.e. how much perf I may be losing). Thanks again for content that really interests us ML enthusiasts (ex-gamer, well, maybe only MSFS).

1 Like

Hey Wendell, remember me? I’m the guy who built a dual RTX Titan rig dedicated to ML/deep learning one month before the release of the 3090 (Build log 3970x).

Straight off the bat: love the video, thanks for doing it. I tried petitioning GN to do something on deep learning, which never materialized. I should have known you’d come through on this!

Please build on this. This niche was in dire need of the testing and analysis you bring to the table, just look at the daily posts on the topic over on r/machinelearning.

You might be surprised, but I think the market for this content is not who you might expect it to be. I’m in quant finance to pay the bills, and a part-time grad student & kaggler at night. My need for computational horsepower at the workstation level is primarily for my kaggle escapades, where I need to run lots of experiments and test ideas QUICKLY. Academia that isn’t big-tech funded is typically small-scale tinkering and can usually get by on modest hardware (though Google Cloud and NVIDIA do offer us free compute).

Feedback:

  • ResNet50 looks like a good benchmark on the surface, but it’s not very relevant anymore for CNN classification; EfficientNet B0-B7 is now the industry standard (https://www.tensorflow.org/api_docs/python/tf/keras/applications/efficientnet).

  • I’m a tad confused by the test runs: were you using exclusively FP32 vs FP16 for all stages of backprop/training? Some parts of the computational graph don’t need the extra precision but some do, so we often use mixed precision (https://www.tensorflow.org/guide/mixed_precision); see the sketch after this list. For the most part I believe it’s really only those big beastly NLP transformer models that want FP32 exclusively.

  • I’m a hardware guy at heart (you are too), and you’re definitely right re the benchmarks that appeared out of the gate: they were half-baked at best. However, drop-in performance, sans tinkering or optimization, does matter. In terms of the research community, if you’re in academia you’re maybe trying a new approach to a layer, testing an optimization method, etc.; you’re unlikely to be testing anything at scale, and tinkering with different compilers is probably not a good use of time. So mixed-precision numbers are good indicators of the performance people will likely see.
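
To illustrate the mixed-precision (and EfficientNet) point, here’s a minimal tf.keras sketch of the kind of run I’d benchmark, assuming TF 2.4+ (the API lived under tf.keras.mixed_precision.experimental in earlier releases), with train_ds as a placeholder dataset:

import tensorflow as tf
from tensorflow.keras import mixed_precision

# fp16 compute with fp32 master weights; Keras wraps the optimizer in a
# loss-scaling optimizer automatically under this policy.
mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.applications.EfficientNetB0(weights=None, classes=1000)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=["accuracy"],
)
# model.fit(train_ds, epochs=1)  # train_ds: an ImageNet-style tf.data pipeline
# per the mixed-precision guide you would normally keep the final softmax in float32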

Research in the big stuff has been led by Google, NVIDIA or Facebook for a reason… you can throw whatever you have at TPUs/DGXs and not care. That’s largely why NVIDIA researchers have led the way with image GAN research.

Forgive me, this is rambly, but here’s what I’d love to see:

  • EfficientNet benchmarks/comparison with mixed precision.
  • Wall times! Sometimes those Images/sec look awesome but if my wall time is filled with some inefficiency that’s been introduced then I’m back where I started.
  • Dual 3090s over NVLink; this is the presumptive step up from dual Titans in a single workstation.

Something I keep in mind (in my day job and in deep learning) is the useful compute time threshold: if the compute time to evaluate a model architecture or train my model improves from 10 hours to 8 hours, the speed-up has no utility to me because I’m still going to be running that model overnight. I think this factors into a lot of the decision-making process for small start-up shops. If you maintain your own hardware and your current VRAM satisfies your requirements, you need to see some dramatic improvements in runtime before it makes sense to increase your compute capacity.

I have an accumulation of thrown-together notebooks from various kaggle competitions/projects etc., so feel free to reach out if you’d like to try something more current.

I’d also love to know how the 3090 can handle NVAE (https://github.com/NVlabs/NVAE)

6 Likes

Sounds good. I’m going to need more redpilling on this. My experience comes from helping older PhD researchers who have a machine or two, and I’m left to fill in the gaps myself on what we are trying to do.

I’ll work on this in the interim. I have burned a lot of time behind the scenes on this general area, and I’m excited about it, but I get non-deterministic results. For specific example problem sets I can often hand-tune things and get enough of a performance delta that it throws off any meaningful comparison between generations of cards. ResNet50 was also slower when the V100 launched, but now? Much faster. I don’t see an easy way to do apples-to-apples testing comparisons without long-winded disclaimers.

There are also some DIY solutions to pack your own math to get around some of the gimping between pro and consumer parts for ML/AI now, which clouds the water.

It’s very frustrating to try to condense into video talking points.

I’ve spent so much time with the V100s on this that it’s a bit shameful I don’t have more content to show for it.

2 Likes

I also bought the Suprim X and so far it’s working great. I would love to see some language model benchmarks next time as well, like GPT-2 (probably the 700M parameter model, since that was all I could fit in 24GB) or BERT. Thanks for the great content as always.

PyTorch and Flax benchmarks would also be cool.

2 Likes

Not me. I am used to TensorFlow and use it in my Jupyter notebooks all the time, and I prefer tf.keras most of the time.

Have you tried the same under a WSL2 instance? I am planning to switch over to that, since it lets me write shell scripts in a Jupyter notebook cell. You need a DirectML driver though.

I had the same questions and you beat me to it. I think it is FP16 for all stages, which is really not how I have seen anyone train their models. I would at least use TF32 to handle those vanishing/exploding gradients better.
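
(For what it’s worth, on Ampere recent TF builds already use TF32 for float32 matmuls and convolutions by default; I believe the explicit toggle looks like this in TF 2.4+, though I haven’t verified it on a 3090 myself:)

import tensorflow as tf

# TF32 is enabled by default on Ampere in TF 2.4; pass False to force full FP32 math
tf.config.experimental.enable_tensor_float_32_execution(True)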

One more question I had: what was the input stage like? How performant is the storage, and where are the images stored? I have started using tf.data and it has helped me optimize my input pipeline and improve speeds.
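
For reference, the kind of pipeline I mean, as a rough sketch (image_paths, the image size and the batch size are placeholders):

import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # tf.data.experimental.AUTOTUNE on older TF 2.x

def load_image(path):
    img = tf.io.read_file(path)               # reads from (hopefully fast) local storage
    img = tf.io.decode_jpeg(img, channels=3)
    return tf.image.resize(img, [224, 224])

# image_paths: a placeholder list of JPEG file paths
ds = (tf.data.Dataset.from_tensor_slices(image_paths)
        .map(load_image, num_parallel_calls=AUTOTUNE)  # parallel decode/resize on CPU
        .batch(256)
        .prefetch(AUTOTUNE))                           # overlap input prep with GPU compute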

Interesting. I’m also doing a lot of work with BERT variants at the moment, hence why I went for the 3090 with 24GB, as I was exhausting my VRAM on the 1080, as you can imagine.

2 Likes

The TX2 is the old Pascal architecture and does not compare favorably at all with desktop or server parts. The target use case of the Jetson family is robotics, so the comparison never really makes sense anyway. If you are going to go for a TX2-type product, I would suggest the Xavier NX: a much better cost-to-performance ratio than the TX2 from a pure inference POV.