Nvidia RTX 5090 has Launched!

Video

5000 Series Architecture - So Many Tensor Cores

Session 2 - Blackwell Architecture slides (pages 06, 07, 11)

AI Comparison

Procyon AI text score
Procyon AI text tokens

AI text generation

Productivity Comparison

Geekbench 6
Puget

5090 - Geekbench 6
NVIDIA RTX 5090 Founder’s Edition - Result

4090 - Geekbench 6
NVIDIA RTX 4090 Founder’s Edition - Result

Artificial Comparison

NVIDIA DLSS test
Port Royal

FPS Comparison

4K average FPS, no upscaling
4K average FPS, DLSS Performance, no frame generation
Average FPS, frame generation comparison

Press Deck

Architecture

Session 2 - Blackwell Architecture slides (pages 01-19)

AI Overview

Session 3 - RTX AI PC slides (pages 01-18)

AI In Gaming

Session 4 - Generative AI For Games slides (pages 01-39)

Nvidia’s Testing

Session 7 - RTX Benchmarking slides (pages 01-39)

7 Likes

Seems like DLSS 4 will never be back-ported to the 40/30/20 series; it would make the 50 series much less relevant in terms of price/performance. Which raises the long-term concern that this proprietary technology creates intense vendor lock-in, further cementing an Nvidia monopoly (and a Windows gaming monopoly, if you can’t get these features in Linux/Proton?).

1 Like

A lot to digest here, and I’ll continue to watch other reviews as they come out. Having a 3070 has been sufficient for me, so if I see something crazy that might help out my system, I might go for the 5000 series.

The only thing I want at the moment is higher-end HDMI connections lol

1 Like

In the video, Wendell said the Linux-native CUDA wasn’t ready. Did you try the WSL version of Ollama? I’m curious about the AI stuff, even if it’s WSL-based.

1 Like

I am so impressed with the RTX 5090, I can hardly contain myself… :yawning_face: (Not meant as a slight against the reviewers, but against the product and particularly its price.)

I wouldn’t call it vendor lock-in, but rather artificial segmentation by software lockout. However, that isn’t just a future concern; you see it these days in many products. Just to name a few that come to mind:

  1. A networking switch that requires an extra license key so its SFP ports become SFP+ ports and work at 10 Gbit/s.

  2. An oscilloscope that requires an extra license key to unlock certain features.

  3. An electric car that can unlock extra battery capacity from the existing battery by purchasing and thus unlocking that feature.

While some aspects of DLSS 4 might be backported, I also doubt that multi-frame generation will be, for the very reason that you express. If you have seen the Hardware Unboxed video, Steve called this card the RTX 4090 Ti multiple times. A joke of course, but - like any good joke - it contained a deeper truth.

1 Like

I thought I read that some DLSS 4 features will be back-ported to previous generations. If true, I’m curious how this will change performance on the 30-series cards. Cyberpunk 2077 apparently ships the DLLs with its update that came out today; someone seems to have already shared them.

Over the next week, I’m going to attempt benchmarking some things with a 3060 Ti to compare DLSS 3.x to 4 differences. Does anyone have any requests for what they’d like to see? I’ll see if there’s a feasible way to add it; this will be a new endeavor for me.

1 Like

Hardware-wise, this is very disappointing. I have a little experience with the RTX 6000 Ada (effectively a 4090 Ti Super for servers), and its pure throughput performance feels very similar to the 5090’s. If that’s Nvidia’s product-segmentation goal, then the 5090 is effectively a discount from USD 7000 to USD 2000. I think some in the audience would be interested in a comparison video between the RTX 6000 Ada and the 5090.

On the topic of frame generation: there is no point in measuring performance with frame generation if the majority of frames we see have no relation to any inputs or game-state updates. This problem isn’t entirely new; even without frame generation, some rendering engines de-sync input and state updates from frame-buffer updates. Reviewers would benefit from finding a new way to measure performance that takes state updates into account.
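One crude way to express that concern as a number. This is a sketch, not an established metric; it assumes every generated frame is pure interpolation that samples no input:

```python
def responsive_fps(displayed_fps: float, framegen_multiplier: int) -> float:
    """FPS that actually reflects fresh input/game-state updates, assuming
    generated frames are interpolated with no input sampling."""
    if framegen_multiplier < 1:
        raise ValueError("multiplier must be >= 1")
    return displayed_fps / framegen_multiplier

# 240 FPS displayed with 4x multi-frame generation -> 60 input-responsive FPS
print(responsive_fps(240, 4))
```

Even this naive division makes framegen-inflated bar charts easier to compare; proper latency tooling would get at it more directly.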

The ML and neural-rendering features are far more interesting. DLSS’s Transformer model looks a lot better in motion, and thankfully it’s available for all RTX-branded cards. All the neural-rendering tech is very interesting; the Half-Life 2 showcase really impressed me.

AMD’s delay is even more mind-boggling; from all the leaks, it really seems they have a good mid-range product on their hands.

PS: Nvidia’s reviewer material talks about DLSS without frame generation. That 80% number is purely upscaling.

PPS: During the Star Wars Outlaws segment, Wendell talks about DLSS 4 but says FSR 4 instead.

2 Likes

While I am loving the new cooler design from Nvidia, I can’t help but notice that no one is talking about how this is exactly the kind of thing EVGA was always trying to do, and Nvidia would tell them no. Looking at all the other vendor cards at CES, that may still be the case. This makes me cautious about the direction Nvidia is going.

3 Likes

So, skimming all the reviews/coverage (including L1T’s), I noticed there hasn’t been any good testing of even the most basic home gen-AI workloads (LLM, image/video generation): llama-bench, vLLM’s benchmark_serving.py, torchtune, Stable Diffusion, etc. I assume this is partly a lack of expertise, maybe partly time… Just wondering if you have anything more planned, or whether a guide is needed? (Maybe I’ll write one if there’s a real need.)

1 Like

That seems to be due to the lack of Linux drivers with an updated CUDA toolkit.
It’ll supposedly be available by Jan 29 or 30; Phoronix had a post about it.

2 Likes

CUDA 12.8 was released last month with Blackwell support; see the CUDA 12.8 Release Notes.

The last Phoronix article on the 5090, from a few days ago, said: “So I can confirm the GeForce RTX 5090 is at Phoronix for Linux testing, but that’s about it for today”, which I take to mean that testing is underway but numbers aren’t published due to embargo, not that support is lacking; unless there’s something else I missed?

1 Like

Yesterday’s update is “the NVIDIA GeForce RTX 5090 Linux benchmarks will begin in the days ahead once having a Linux driver build”.

1 Like

Can confirm we didn’t have Linux drivers; that’s why we only tested Procyon AI on Windows. There will be a Level1Linux video going in depth on AI and the like, though, soon™

6 Likes

Good to know. Did CUDA not work in WSL2, or was CUDA 12.8 not compatible on the Windows side? (Just curious.)

BTW, if you guys are going to do simple testing, a regular llama.cpp CUDA build testing a llama or llama2-7b q4_0 quant with llama-bench would be a good pipecleaner (it’s basically a “hello world” benchmark that’s fast to run and lets you compare against all Macs and other devices pretty easily).
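A minimal sketch of that pipecleaner in Python (the binary and model paths are placeholders; llama-bench prints its results as a markdown table, which is easy to parse):

```python
import re
import subprocess

def parse_llama_bench(md: str) -> dict:
    """Parse llama-bench's markdown result table into {test_name: tokens/s}."""
    results = {}
    for line in md.splitlines():
        cells = [c.strip() for c in line.strip("|").split("|")]
        # data rows end with a "t/s" value formatted like "1234.56 ± 7.89"
        if len(cells) >= 2 and re.match(r"^[\d.]+\s*±", cells[-1]):
            results[cells[-2]] = float(cells[-1].split("±")[0])
    return results

def run_llama_bench(model_path: str, bench_bin: str = "./llama-bench") -> dict:
    # pp512 prompt processing + tg128 token generation (llama-bench defaults)
    out = subprocess.run(
        [bench_bin, "-m", model_path, "-p", "512", "-n", "128"],
        capture_output=True, text=True, check=True)
    return parse_llama_bench(out.stdout)
```

The returned dict makes it easy to diff runs across drivers, builds, or power limits.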

llama.cpp is a good engine to test with since it’s probably the most-used local inference engine (Ollama, LM Studio, Jan, Mozilla, and others all tend to build on it). If you’re looking for something to stretch its legs, or for more rigorous testing, here’s an example of me using vLLM’s benchmark_serving to test Qwen2.5-Coder-32B’s speculative-decoding performance between a W7900 and a 3090.

If you are going to do more involved testing on the LLM side, I think vLLM and sglang are both good targets, with vLLM’s benchmark_serving.py or sglang.bench_serving. Here’s some recent testing I was doing on throughput vs. TTFT in a concurrency sweep, with mean and p50 (median) through p99:

Note: for this kind of testing it’s probably best to check everything into a repo for reproducibility; it’s very easy to misconfigure or otherwise screw up. You’ll also want to dump system info and all library versions, as performance is still changing dramatically across versions. llama.cpp numbers aren’t very useful without a build number, nor are results without full version numbers for PyTorch and vLLM/sglang (also Python, CUDA, drivers, etc.).
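A sketch of that kind of version dump (the package list is just an example; swap in whatever you actually benchmark with):

```python
import json
import platform
import subprocess
from importlib import metadata

def environment_report(packages=("torch", "vllm", "transformers")) -> dict:
    """Collect version info that affects benchmark reproducibility."""
    report = {
        "python": platform.python_version(),
        "os": platform.platform(),
        "driver": None,
    }
    for pkg in packages:
        try:
            report[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            report[pkg] = None  # not installed in this environment
    try:
        # GPU driver version as reported by the NVIDIA driver itself
        smi = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True)
        report["driver"] = smi.stdout.strip()
    except (FileNotFoundError, subprocess.CalledProcessError):
        pass  # no NVIDIA driver/tooling present

    return report

print(json.dumps(environment_report(), indent=2))
```

Dumping this JSON next to every result file makes later comparisons actually meaningful.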

Also, I highly recommend mamba for managing your environment; it can manage not just Python but also CUDA and other system-level libs. Alternatively, you could target a specific Docker container and maybe save yourself some headache.

For training, I think something simple like a basic pytorch-lightning, trl, or torchtune script would be easy.

BTW, if you want to test efficiency, you can set the power limit (PL) and do a simple inferencing sweep pretty quickly, e.g. testing a 3090:

Here’s all my code for the sweep and charts (as well as some percent and percent delta charts).

1 Like

Out of curiosity, did you also monitor actual consumption at the same time? Power limit is not the same as consumption, and from the graph I would not be surprised if the ‘tg128’ curve was actually running below the power limit on the left side of the graph.

I see this mistake made all the time: people thinking that lower-TDP components are ‘more efficient’, which is not always true and is entirely workload-dependent…
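For anyone who wants the actual number, measured draw samples can be integrated into energy and divided by tokens produced. A minimal sketch; samples are assumed to be (seconds, watts) pairs from whatever power monitor you use:

```python
def energy_joules(samples):
    """Integrate (time_s, watts) samples into joules via the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += (t1 - t0) * (p0 + p1) / 2.0
    return total

def joules_per_token(samples, tokens_generated: int) -> float:
    """Efficiency in J/token, independent of whatever PL was configured."""
    return energy_joules(samples) / tokens_generated
```

Comparing J/token across the sweep, rather than the configured PL, is what actually answers the efficiency question.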

1 Like

Yeah, the CUDA toolkit does support the new ISA; however, you need an R570-or-newer driver, which has not been released yet. See:

Running any NVIDIA CUDA workload on NVIDIA Blackwell requires a compatible driver (R570 or higher).

Their latest beta version is still at 565.57.01:

2 Likes

Michael ‘Phoronix’ Larabel seems to have the R570 driver already: on average a 42% uplift over the 4090 in compute. There’s not much AI/LLM testing in there, but I’d presume the scaling would be similar? Maybe q4 models get a bigger uplift, since there’s dedicated INT4 hardware on Blackwell?

1 Like

Yes, the efficiency part mostly relates to pp, although tg is somewhat relevant too. As the chart title (“GPU Performance (pp512 and tg128) at Different Power Limits”) suggests, the focus of that test is just on finding the optimal PL setting for performance, not on measuring joules/token or anything like that.

That can be done, however, with both nvidia-smi and rocm-smi; see the discussion here for how to do it.
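On the nvidia-smi side, a sketch of a background sampler (the query fields are standard nvidia-smi ones; polling granularity is coarse, so treat the numbers as approximate):

```python
import subprocess
import threading
import time

def sample_power(samples, stop, interval=0.5, gpu=0):
    """Append (monotonic_time_s, watts) tuples until `stop` is set."""
    while not stop.is_set():
        try:
            out = subprocess.run(
                ["nvidia-smi", "-i", str(gpu),
                 "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
                capture_output=True, text=True, check=True)
            samples.append((time.monotonic(), float(out.stdout.strip())))
        except (FileNotFoundError, subprocess.CalledProcessError, ValueError):
            break  # no NVIDIA tooling available; give up quietly
        stop.wait(interval)

# usage: start the sampler, run the benchmark, then stop it
samples, stop = [], threading.Event()
sampler = threading.Thread(target=sample_power, args=(samples, stop))
sampler.start()
# ... run llama-bench / benchmark_serving here ...
stop.set()
sampler.join()
```

The collected samples can then be integrated into total energy for the run.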

1 Like

Those compute results are cool, but they probably don’t say much about AI/ML perf, as those tests all look like CUDA-core rather than Tensor-core workloads.

I’ve seen a lot of confusion in forums/posts, but Q4 has very little to do with INT4. Traditionally, most Q4 quants are W4A16, and the compute actually happens in FP16. llama.cpp brings these into INT8 for its CUDA backend (from at least Ampere on, Nvidia’s INT8 TOPS performance crushes its Tensor FP16). Marlin also has optimized INT8/INT4/mixed-precision kernels (these can be used with PTQ quants like GPTQ and AWQ at least, but don’t apply to llama.cpp k-quants, which march to their own tune).
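To make the W4A16 point concrete, here is a toy sketch of what a Q4-style quant stores and why the matmul still runs in FP16 (the group size and symmetric scaling are illustrative choices, not any specific format’s spec):

```python
import numpy as np

def quantize_w4(w: np.ndarray, group: int = 32):
    """Store weights as int4-range codes plus one FP16 scale per group
    (the 'W4' half of W4A16)."""
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map max |w| to ±7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_w4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Expand back to FP16 before the matmul: activations and compute stay
    16-bit, so 'Q4' alone implies nothing about INT4 math."""
    return (q.astype(np.float16) * scale).reshape(-1)
```

Only the storage is 4-bit; whether any INT4 hardware gets used depends entirely on the kernel, not the quant format.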

From my testing, W8A8 with Marlin kernels has the best TTFT, total throughput, and quality (it can actually be better than FP16 with the right calibration set and other settings), but for bs=1, if you’re willing to trade off a bit of accuracy, W4A16 throughput is usually on top.

2 Likes

I guess one nice thing on the AMD side is that gfx12 has been in amdgpu and LLVM for a while now. You might need to build the world, but if you had one of the RX 9070s sitting in the stockroom atm, it’s actually feasible to get ROCm/HIP building from source.