As usual, synthetic benchmarks show Nvidia way out in front.
For AI, a 2x R9700 setup can be quite fast on Windows with the Vulkan backend:
I did not have to do much special for ComfyUI and ROCm thanks to changes in the last week or so. Our guide here: Minisforum MS S1 Max Comfy UI Guide can work for you too, on a 9070XT or R9700 Pro.
Over 150 tokens/sec with all the tuning knobs and stops pulled out! (It was about ~120 out of the box.) That’s using every single byte of VRAM on both cards in our two-card setup, haha!
Very curious how you’re able to use shared VRAM between two cards over the PCIe bus (specifically on Linux). How does this work, and what configuration quirks did you have to set to enable this?
For most models, it’s just standard stuff. In vLLM the layers are assigned to different GPUs.
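A minimal sketch of what that looks like with the vLLM Python API; the model name and the choice of parallelism mode here are assumptions for illustration, not necessarily the exact configuration used in these benchmarks:

```python
from vllm import LLM, SamplingParams

# Split the model across both cards. tensor_parallel_size=2 shards every layer
# across the two GPUs; pipeline_parallel_size=2 would instead assign whole
# layers to each GPU. Model name is just an example.
llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",
    tensor_parallel_size=2,
)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Find the first four perfect numbers."], params)
print(outputs[0].outputs[0].text)
```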
Our custom benchmark script sweeps over a small set of prompts ranging from 20 to 1k input tokens. The results are an average, but, for example, one of the tests generates 8192 tokens. The total run is about 13k input tokens. Depending on the model, the answer to the perfect-number query, for example, is potentially one of those thousands-of-tokens responses.
There’s also a warmup/prefill step. The one thing we have to be careful about is the coherence of the output, especially with quantized models. One thing we ran into with llama.cpp was that the output was “worse” than vLLM (while also being faster, but incoherent-worse).
Relatively small context windows overall, though.
LM Studio was either at defaults or had the context window doubled.
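Not the actual script, but a rough sketch of that kind of sweep against an OpenAI-compatible local endpoint (the URL, model name, and prompts are placeholders):

```python
import time
from openai import OpenAI

# Point at whatever local server is under test (vLLM, llama.cpp, LM Studio, ...).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
MODEL = "local-model"  # placeholder

# Prompts spanning roughly 20 to ~1k input tokens, per the description above.
prompts = [
    "List the first four perfect numbers and explain how you found them.",
    "Summarize the plot of Moby Dick. " * 40,  # crude way to pad input length
]

speeds = []
for prompt in prompts:
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=8192,
    )
    dt = time.perf_counter() - t0
    tps = resp.usage.completion_tokens / dt
    speeds.append(tps)
    # Eyeball coherence too, not just speed, especially with quantized models.
    print(f"{resp.usage.completion_tokens} tok in {dt:.1f}s -> {tps:.1f} tok/s")

print(f"average: {sum(speeds) / len(speeds):.1f} tok/s")
```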
When/if you have time, it would be interesting to see two other comparisons against the 2x R9700 Pro:
1. Test 2x R9700 on an AM5 platform with (bifurcated) 8x Gen 5 CPU lanes to each card.
1.1 How much impact do these workloads see from the reduced P2P and system RAM bandwidth?
2. Test 2x 9070 XT on the same Threadripper platform.
2.1 What speed differences are there vs. the “Pro” version? The Pro is 2x the price for effectively a few more GDDR6 chips.
2.2 It’s expected some tests would not be runnable due to the reduced VRAM.
2.3 Yeah, it’s clearly a different PCB with probably a higher-quality BOM. But the point stands.
I would love to start seeing some benchmarks on how long each epoch takes when training a YOLO model.
My main professional use case is training YOLO models on custom datasets, so something like using Ultralytics’ packages to train a YOLOv11 model on the COCO dataset would be a good proxy for my workloads and use cases.
I’ve been moving mountains to keep using my 4070 TS for as long as I can, but 16GB is too much of a bottleneck and I’m running up against max parameters in our training set.
I’d love to know how long each epoch takes for something like: `yolo train model=yolo11x.pt data=coco2017 batch=-1 imgsz=1024`
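For anyone who wants per-epoch timings, here is a rough sketch using Ultralytics’ callback hooks; the model/dataset choices mirror the command above, and the `_epoch_t0` attribute is something we attach ourselves, not an Ultralytics field:

```python
import time
from ultralytics import YOLO

epoch_times = []

def start_timer(trainer):
    # Stash the start time on the trainer object (our own attribute).
    trainer._epoch_t0 = time.perf_counter()

def stop_timer(trainer):
    elapsed = time.perf_counter() - trainer._epoch_t0
    epoch_times.append(elapsed)
    print(f"epoch {trainer.epoch}: {elapsed:.1f}s")

model = YOLO("yolo11x.pt")
model.add_callback("on_train_epoch_start", start_timer)
model.add_callback("on_train_epoch_end", stop_timer)

# coco.yaml pulls the full COCO 2017 dataset; batch=-1 lets Ultralytics
# auto-size the batch to the available VRAM.
model.train(data="coco.yaml", epochs=3, imgsz=1024, batch=-1)
```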
Seems like this card might be a great cost-effective way to expand the capabilities of one of the many AMD AI MAX+ 395 PCs as an eGPU. Cost, performance, and efficiency would be much better than an RTX 5090, and it would cost way less than the 4-GPU Threadripper teased in the video.
I’d love to see how much of a performance bump inference and image generation workloads might achieve with that kind of setup.
This is it right here. Wendell had shown this in the Framework videos, and it seems like a test like that with this card would be the perfect follow-up. Ever since I saw that video (I have a Framework 128 on order) I’ve been thinking that an eGPU would be a perfect way to pump up the volume on that platform.
Hello @wendell, thanks for the effort!
Are there plans to benchmark larger models at lower precision?
Say Qwen3-30B-A3B (MoE), Seed OSS 36B (dense), Llama 70B, etc.?
Results from such tests might be more useful, although I guess we can kinda interpolate from your existing results to an extent.
Other video ideas:
- Comparison with Strix Halo
- Larger MoE models on 2x R9700 Pro for a local LLM, similar to the existing video for DeepSeek
These cards are good value for money if you don’t need CUDA, but I would’ve liked to see 48GB of VRAM to run GLM-4.5 Air. A four-card setup can prove a bit expensive in isolation.
This is really good news. If only they had something in between Strix Halo and this card, like double the memory, I might not even need more compute. I’d buy 3 or 4 right now.
I guess one of the downsides of programs using Vulkan as a compute backend is that they have to manage the dual-GPU nature of the setup themselves, providing work to each GPU separately.
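llama.cpp-based frontends, for instance, leave that split up to the user via split-mode and tensor-split options. A minimal llama-cpp-python sketch of layer-wise splitting across two cards (the GGUF path and the 50/50 ratio are assumptions, and it is worth double-checking how a given Vulkan build honors these settings):

```python
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="./models/some-model-q4_k_m.gguf",   # placeholder path
    n_gpu_layers=-1,                                # offload every layer
    split_mode=llama_cpp.LLAMA_SPLIT_MODE_LAYER,    # assign whole layers per GPU
    tensor_split=[0.5, 0.5],                        # share of layers per card
    n_ctx=8192,
)

out = llm("What is the fourth perfect number?", max_tokens=128)
print(out["choices"][0]["text"])
```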
@wendell Is it possible to get the vulkanCapsViewer profile submitted to the database? Nobody seems to have submitted an R9700 yet on Windows or Linux, and it’d be nice to see how it looks to the system.