5080 16GB vs 3090TI 24GB Generative AI benchmarking!

tl;dr;

The 5080 16GB is about 20% faster at prompt processing and 10-15% faster at token generation than the 3090TI FE 24GB for the case where both the LLM weights and 8k of context fit entirely in 16GB VRAM.

The 5080 16GB is about 10% faster at generating images with Stable Diffusion than the 3090TI FE 24GB.

However, the extra 8GB of VRAM on the 3090TI FE 24GB offers the flexibility to run larger models, longer context, and larger batch sizes, as well as general quality of life for driving a desktop rig running xwindows, a browser, etc.

Motivation

I’ve been enjoying the recent GPU review content on the 5060TI 16GB and today’s video on the ASUS 5080 16GB, and I’ve been wondering how these new 16GB cards compare to older cards for common home lab AI workloads like LLM inferencing and Stable Diffusion image generation.

Rambling Background

I don’t have a lot of data points yet beyond Wendell’s Procyon AI Text Generation Benchmark, which uses ONNX Runtime. That’s still a useful comparison, but in my understanding ONNX Runtime packages a model so it can run on a variety of systems, e.g. Linux, Windows, CPUs, GPUs, etc. However, that portability can come at a performance cost compared to a runtime built with optimizations for the target hardware.

Just recently, with Wendell’s hardware support, I was able to release an experimental quantization of Google’s gemma-3-27b-it-qat-GGUF LLM model on huggingface, designed to work with the ik_llama.cpp fork. These quants are perfect for playing around with the best quality gemma-3 in under 16GB VRAM (on CUDA GPUs anyway :sweat_smile:).

Unfortunately, these models will NOT run on mainline llama.cpp, ollama, lm studio, koboldcpp, etc. (If you want to run on those engines, check out bartowski/google_gemma-3-27b-it-qat-GGUF. I’ve actually been in touch with bartowski; he’s doing some great stuff to support AI home lab enthusiasts!)

LLM Benchmark

Interestingly, redditor u/Maxious just posted benchmarks running the ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27b-it-qat-mix-iq3_k.gguf with 8k context on an Inno3D RTX 5080 X3 OC 16GB! So I ran the exact same command on my 3090TI 24GB, and here are the results.

Higher numbers are better for both graphs.

The graph on the left is prompt processing (sometimes called “prefill”), which is roughly how fast the LLM can read what you give it. You can see that the more input you give the model, the slower it gets, since there are more computations to do.

The graph on the right is token generation, or how fast the LLM can write in reply. Similarly, the more text it has already processed, the slower it gets: each new token it generates depends on every token that came before it, so longer generations slow down.

Just for comparison, I ran a couple of other configurations on my 3090TI 24GB to show that if you have more than 16GB VRAM you can skip compressing the kv-cache and leave it at f16, which is slightly faster than q4_0 for GPU inferencing.

I also ran a configuration that only takes 12GB VRAM by offloading just the attention and kv-cache onto CPU RAM, letting my 9950X with 96GB of DDR5-6400 and overclocked infinity fabric handle those calculations. This lets someone with low VRAM use a much larger context, at the cost of slower performance.
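For reference, here’s a minimal sketch of that kind of low-VRAM run, assuming ik_llama.cpp keeps mainline llama.cpp’s -nkvo / --no-kv-offload flag (my exact flags may have differed). It’s the same llama-sweep-bench command as in the Run LLM Benchmark post below, just with the kv-cache kept in system RAM:

# Low-VRAM sketch (~12GB VRAM): keep the quantized kv-cache in system RAM
# so the attention work involving it runs on the CPU instead of the GPU
CUDA_VISIBLE_DEVICES=0 \
./build/bin/llama-sweep-bench \
    --model gemma-3-27b-it-qat-mix-iq3_k.gguf \
    -ctk q4_0 -ctv q4_0 \
    -fa \
    -amb 512 \
    -fmoe -c 8192 \
    -ub 512 \
    -ngl 99 \
    -nkvo \
    --threads 8  # a few more threads since the CPU is doing the attention math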

Finally, my 3090TI 24GB can fit up to 32k context with this model, which allows processing longer text, longer code generation, etc. So keep that in mind when comparing.
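And a rough sketch of the 24GB-only configuration (again, not necessarily my exact invocation): the same command with the kv-cache left at the f16 default and the context bumped to 32k:

# 24GB sketch: uncompressed f16 kv-cache and 32k context, everything on GPU
CUDA_VISIBLE_DEVICES=0 \
./build/bin/llama-sweep-bench \
    --model gemma-3-27b-it-qat-mix-iq3_k.gguf \
    -ctk f16 -ctv f16 \
    -fa \
    -amb 512 \
    -fmoe -c 32768 \
    -ub 512 \
    -ngl 99 \
    --threads 4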

Personally, I’d probably still choose a used 3090TI 24GB over a 5080 16GB if local LLM inferencing is an important part of your daily activities. While it may be about 10% slower for jobs that fit into 8k context, keeping everything in VRAM at up to 32k context will outperform the 5080 16GB, which would have to offload to CPU RAM… Also, having a little extra VRAM to run the xwindows display and firefox gives some quality of life :pinched_fingers:

Image Diffusion

The other big home lab use case for GPUs is running ComfyUI for Stable Diffusion image generation. I have run SD1.5, SDXL 1.0, PonyXL, FLUX.1-dev, Illustrious, the newest HiDream, and even Wan2.1-I2V-14B to generate short video clips from a starting image. That whole scene is really impressive, with a lot of optimizations allowing complete workflows to fit in under 16GB VRAM. You can find most of those models and workflows on civit.ai and huggingface as well.

I don’t have any direct comparisons myself, but I just saw a new open benchmark project pop up today, posted on reddit by u/yachty66.

It runs Stable Diffusion generations for 5 minutes and measures the number of images generated, average temperature, and eventually power. It seems to take a fixed 4GB of VRAM, and on my card it kept utilization pegged near 100% with power maxed out just below the 450 watt limit.

==================================================
BENCHMARK RESULTS:
Images Generated: 141
Max GPU Temperature: 79°C
Avg GPU Temperature: 73.9°C
==================================================

Okie, if anyone is interested, holler and I can give the exact commands if you want to try any of this on your rig to compare! I would love to see how the 5060TI 16GB results stack up!


Run LLM Benchmark

Steps to build and run the ik_llama.cpp fork with the above gemma-3 LLM quant.

# Install build dependencies and cuda toolkit as needed
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
# Configure CUDA+CPU Backend
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF
# Build
cmake --build ./build --config Release -j $(nproc)
# Confirm
./build/bin/llama-server --version
# version: 3640 (93cd77b6)
# built with cc (GCC) 14.2.1 20250128 for x86_64-pc-linux-gnu
# Download 13.7GB model
wget https://huggingface.co/ubergarm/gemma-3-27b-it-qat-GGUF/resolve/main/gemma-3-27b-it-qat-mix-iq3_k.gguf
# Benchmark
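# Flag notes: -fa = flash attention, -ctk/-ctv q4_0 = quantized kv-cache,
# -c 8192 = 8k context, -ub 512 = ubatch size, -ngl 99 = offload all layers to GPU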
CUDA_VISIBLE_DEVICES=0 \
./build/bin/llama-sweep-bench \
    --model gemma-3-27b-it-qat-mix-iq3_k.gguf \
    -ctk q4_0 -ctv q4_0 \
    -fa \
    -amb 512 \
    -fmoe -c 8192 \
    -ub 512 \
    -ngl 99 \
    --threads 4

Run Stable Diffusion Benchmark

This downloads about 6GB into ~/.cache/huggingface/ and requires you to log in with a huggingface token (settings → create new token).

# 1. Create huggingface account and token with above link
mkdir gpu-benchmark
cd gpu-benchmark/
# install uv (better python pip) locally into ~/.cache/uv
# https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv ./venv --python 3.12 --python-preference=only-managed
source venv/bin/activate
uv pip install gpu-benchmark
uv pip install "huggingface_hub[cli]"
huggingface-cli login
# benchmark takes 5 minutes and 4GB VRAM
gpu-benchmark
# cleanup everything when complete
cd ..
rm -rf gpu-benchmark
huggingface-cli delete-cache
# select sd-legacy/stable-diffusion-v1-5 5.5GB, hit space, enter confirm Yes
# delete huggingface token if desired
rm ~/.cache/huggingface/*tok*

Thanks for those benchmarks!

I guess the 3090 is still pretty competitive, especially considering that you can get it used for reasonable prices (depending on your region).

I think I’ll be keeping the couple I have for a long time still, or at least until I can get a pair of reasonably-priced 32GB GPUs, which I doubt will happen anytime soon haha
