NVIDIA's DGX Spark Review and First Impressions

Check Out the Video

The Bottom Line

The DGX Spark represents NVIDIA’s latest experiment in making an “AI lab in a box.” While its marketing headline boasts one petaflop of FP4 compute, I was only able to achieve that in 500 ms bursts due to bottlenecks like memory bandwidth (haha).

The real story is the platform’s design philosophy: “AI in a box” rather than raw inference throughput.

This system isn’t built to win single-model benchmark contests, and that’s okay. The hardware and software stack are already heavily optimized, and yet some bottlenecks remain when compared to dedicated inference hardware. The real magic lies in its NVFP4 precision format, a novel 4-bit floating-point representation developed specifically for NVIDIA’s next-generation inference pipeline. There’s even an academic paper dedicated to it (arxiv.org/abs/2509.25149). When models are trained or fine-tuned directly for NVFP4, the performance is nothing short of stellar—unique, even—but otherwise, traditional model inference performance is relatively modest.
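For intuition on what NVFP4 actually stores, here is a minimal NumPy sketch of block-scaled 4-bit float quantization, my own illustration rather than NVIDIA’s kernels: 4-bit E2M1 values with a scale per 16-element micro-block (the real format also keeps the block scales in FP8 and applies a second per-tensor scale).

```python
# Illustrative sketch of NVFP4-style block quantization (not NVIDIA's implementation).
import numpy as np

# The representable magnitudes of an E2M1 element (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # NVFP4 applies a scale per 16-element micro-block

def quantize_block(x):
    """Quantize one 16-element block to E2M1 magnitudes plus a per-block scale."""
    m = np.abs(x).max()
    scale = m / E2M1_GRID[-1] if m > 0 else 1.0   # map the block maximum to the largest code
    scaled = np.abs(x) / scale
    idx = np.abs(scaled[:, None] - E2M1_GRID[None, :]).argmin(axis=1)  # nearest code
    return np.sign(x), E2M1_GRID[idx], scale       # real NVFP4 stores the scale as FP8 (E4M3)

def dequantize_block(sign, mag, scale):
    return sign * mag * scale

x = np.random.randn(BLOCK).astype(np.float32)
sign, mag, scale = quantize_block(x)
print("max abs error:", np.abs(x - dequantize_block(sign, mag, scale)).max())
```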

What DGX Spark does excel at is being a complete agentic AI sandbox. It’s a self-contained environment where developers can orchestrate the full modern AI stack—model serving, retrieval, reasoning, coordination, and deployment—all running locally across multiple containers. Its vast memory capacity makes it possible to run several distinct AI models at once, each handling a part of a larger agentic workflow. If you’ve been out of the loop on “the AI thing” and need to catch up quickly, basically everything is here: you can onboard into the NVIDIA ecosystem fast, and just about anything you build or learn here scales up to the cloud.
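As a sketch of what “several models, one workflow” can look like in practice, here is a hedged example: two locally served, OpenAI-compatible endpoints (the ports and model names are my own placeholders, not anything DGX Spark ships with) where a small planner model delegates to a larger worker model.

```python
# Hedged sketch of a two-model agentic loop against local OpenAI-compatible servers.
from openai import OpenAI

planner = OpenAI(base_url="http://localhost:8001/v1", api_key="none")  # e.g. a small NVFP4 model
worker  = OpenAI(base_url="http://localhost:8002/v1", api_key="none")  # e.g. a larger MoE model

def run_task(task: str) -> str:
    # Small model breaks the task into steps...
    plan = planner.chat.completions.create(
        model="planner-model",  # placeholder name
        messages=[{"role": "user", "content": f"Break this into 3 concrete steps: {task}"}],
    ).choices[0].message.content

    # ...larger model executes the plan.
    return worker.chat.completions.create(
        model="worker-model",  # placeholder name
        messages=[{"role": "user", "content": f"Execute this plan and report results:\n{plan}"}],
    ).choices[0].message.content

if __name__ == "__main__":
    print(run_task("Summarize the NVFP4 paper and list its key techniques."))
```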

I have no doubt that this device will be adopted into AI and CS educational programs the world over, and that it will help train many generations of future AI scientists. This platform makes it almost too easy to get up and running quickly in just about every corner of what’s exciting in AI today.

Networking is another standout: dual 100 Gb ConnectX-7 NICs make it ideal for distributed training experiments, NCCL tuning, and RDMA-enabled cluster interconnects. Everything you learn scaling workloads to multiple boxes here translates directly to NVIDIA’s higher-tier GB300 systems.
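As an example of the kind of two-box experiment those NICs invite, here is a minimal PyTorch/NCCL all-reduce sketch (the address, port, and interface name are assumptions for illustration, not a documented DGX Spark recipe); launch it with torchrun on each node.

```python
# allreduce_test.py -- minimal two-node NCCL sanity check (illustrative sketch).
# Launch on each Spark, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0|1> \
#            --master_addr=192.168.100.1 --master_port=29500 allreduce_test.py
import os
import torch
import torch.distributed as dist

def main():
    # Steer NCCL onto the 100 Gb links; the interface name here is a placeholder.
    os.environ.setdefault("NCCL_SOCKET_IFNAME", "enp1s0f0")
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(0)

    x = torch.ones(64 * 1024 * 1024, device="cuda")  # 64M fp32 elements, ~256 MB
    dist.all_reduce(x)                               # sum across both nodes over NCCL
    torch.cuda.synchronize()
    print(f"rank {rank}: all-reduce ok, value = {x[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```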

In short, DGX Spark isn’t a product you benchmark—it’s one you build with. Sure, I would have loved to see at least 2x the inference performance, but NVFP4 gets close. Spark is, through and through, a modern and complete AI development instrument designed to help engineers understand and experiment with the connective tissue between models, containers, and data systems that will define the next phase of AI computing.


Pros and Cons

| Pros | Cons |
| --- | --- |
| NVFP4 precision delivers exceptional inference performance for NVIDIA-optimized models | Inference performance for non-NVFP4 models is relatively weak |
| Complete “AI lab in a box” — ideal for learning, development, and experimentation | Some thermal and bandwidth bottlenecks limit sustained performance |
| Excellent memory capacity enables multi-model and agentic workflows | High cost |
| Dual 100 Gb ConnectX-7 networking supports high-speed NCCL and RDMA workloads | Third-party kernel optimizations needed |
| Exceptionally well-organized documentation and examples | |
| GB10 “works the same” as GB300 | |

Tell me more about NVFP4

[2509.25149] Pretraining Large Language Models with NVFP4

In this study, we introduce a novel approach for stable and accurate training of large language models (LLMs) using the NVFP4 format. Our method integrates Random Hadamard transforms (RHT) to bound block-level outliers, employs a two-dimensional quantization scheme for consistent representations across both the forward and backward passes, utilizes stochastic rounding for unbiased gradient estimation, and incorporates selective high-precision layers. We validate our approach by training a 12-billion-parameter model on 10 trillion tokens – the longest publicly documented training run in 4-bit precision to date. Our results show that the model trained with our NVFP4-based pretraining technique achieves training loss and downstream task accuracies comparable to an FP8 baseline. These findings highlight that NVFP4, when combined with our training approach, represents a major step forward in narrow-precision LLM training algorithms.
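Of the techniques listed there, stochastic rounding is the easiest to illustrate in a few lines. The toy sketch below (my own, not the paper’s kernels) shows why it matters: rounding up or down with probability proportional to distance makes the quantized value correct in expectation, which keeps low-precision gradient estimates unbiased.

```python
# Toy illustration of stochastic rounding (not the paper's implementation).
import numpy as np

def stochastic_round(x, step, rng):
    """Round x to multiples of `step`, probabilistically, so E[result] == x."""
    lo = np.floor(x / step) * step
    p_up = (x - lo) / step                       # probability of rounding up
    return lo + step * (rng.random(x.shape) < p_up)

rng = np.random.default_rng(0)
g = np.full(100_000, 0.1)                        # a "gradient" a 0.5-step grid cannot represent
print("round-to-nearest mean:", (np.round(g / 0.5) * 0.5).mean())      # 0.0 -> biased
print("stochastic mean:      ", stochastic_round(g, 0.5, rng).mean())  # ~0.1 -> unbiased
```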

Check out the NVFP4-optimized models on Hugging Face:

GB10 device query

FP4, and especially NVFP4 best-case-scenario performance

| Model | Precision | Backend | TTFT (ms) | E2E Throughput (tokens/sec) |
| --- | --- | --- | --- | --- |
| Llama 3.3 70B-Instruct | NVFP4 | TRT-LLM | 195.06 | 5.39 |
| Qwen3 14B | NVFP4 | TRT-LLM | 51.99 | 22.59 |
| GPT-OSS-20B | MXFP4 | llama.cpp | 608.3 | 51.9 |
| GPT-OSS-120B | MXFP4 | llama.cpp | 1594.20 | 27.50 |
| Llama 3.1 8B | NVFP4 | TRT-LLM | 32.86 | 39.01 |
| Qwen2.5-VL-7B-Instruct | NVFP4 | TRT-LLM | 366.48 | 38.08 |
| Qwen3 235B (only on dual DGX Spark) | NVFP4 | TRT-LLM | 87.23 | 11.73 |
| Llama-3.1-405B-Instruct (only on dual DGX Spark) | INT4 | vLLM | 733.75 | 1.76 |

FP8 and bfloat16 Performance

Competitors - AI MAX 395+

AMD’s Ryzen AI MAX 395+ can absolutely be faster in some FP8 and bfloat16 inference workloads, provided the software stack is fully optimized. On paper, its peak throughput in mixed-precision math looks competitive with everything except NVIDIA’s NVFP4 path, and a well-tuned Vulkan or DirectML inference pipeline can get it there. But the similarities stop there.

The Strix Halo platform is an APU, not a discrete accelerator. Its integrated RDNA 3.5 compute units share memory and bandwidth with the CPU, which changes everything about how workloads scale. It’s fundamentally a graphics-class RDNA implementation, not a compute-class one, and certainly not CDNA—AMD’s datacenter architecture. While CDNA parts like the MI300 expose mature ROCm, HIP, and RCCL stacks for high-performance compute and distributed training, Strix Halo does not.

In fact, HIP support on this APU has been inconsistent at best. Ironically, Vulkan backends often outperform HIP on Strix Halo simply because the graphics driver path is better optimized and more stable. This means developers trying to prototype agentic or distributed AI workloads on the Ryzen AI MAX will find themselves in a dead-end ecosystem: the performance may be decent locally, but the code, configuration, and tuning knowledge don’t translate to AMD’s CDNA or ROCm-based servers.

So while Ryzen AI MAX 395+ can flash impressive FP8 numbers, it’s a different lineage entirely—a fast APU, not a scalable compute node. Or, as the saying goes, you can’t get there from here. Networking/RCCL? Not there yet. I’d love to see AMD’s Pollara on Strix Halo!

Look at this: DGX Spark | Try NVIDIA NIM APIs – look how polished that is. Now, imagine AMD building that, where you start on Strix Halo but end up at MI300/CDNA… not everything really translates. It’s getting better, but it’s still a problem. A relatively small number of enthusiasts are entirely carrying the AI water on AMD’s non-CDNA platforms (imho). You can do some amazing AI stuff on Strix Halo, and run agentic AI… but it’s kind of a weird adjacent experience, and the experience focuses mostly on inferencing. Not so much fine-tuning, or even some of the crazy fun edge cases we’ve explored, like running DeepSeek R1 0528 in Q1 quants in ~140 GB of platform memory (hard to do that on an APU/RDNA).

6 Likes

@GigaBusterEXE has the 3d printable stand we featured in the video

3 Likes

Seems like this could be a nifty toy. Were you able to play around with facial/object recognition for cameras?

I don’t have time to watch the video until later tonight.

yeah, the models run pretty fast. but you can do that on about anything these days. yolo runs on a toaster.

1 Like

What was the prompt processing tokens/sec on Llama 3.3 70B-Instruct? This should be big enough to run Q8, right?

This is the main elephant in the room. FP4 != Q4 and FP8 != Q8.

This is why it’s confusing for consumers. Most people using quantized models are mostly using GGUF (Q8, Q4, etc …), or AWQ or EXL2/EXL3.

Hardly anyone uses FP8 or FP4; even the models are rarely available. Nvidia put themselves in a corner by releasing a device that is “great” for FP4 and building a great stack around it, but practically, most enthusiasts don’t care, because it is not optimized for GGUF.
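Part of the confusion is that these formats land at similar bit budgets but are encoded very differently (hardware-native block-scaled floats vs. GGUF’s integer block quants). A rough back-of-envelope, based on the public block layouts as I understand them:

```python
# Approximate effective bits-per-weight for formats people conflate (illustrative only).
def bpw(elem_bits, scale_bits, block_size):
    """Element bits plus the per-block scale amortized over the block."""
    return elem_bits + scale_bits / block_size

formats = {
    "NVFP4 (4-bit E2M1, FP8 scale per 16)":      bpw(4, 8, 16),   # ~4.50
    "MXFP4 (4-bit E2M1, shared exp per 32)":     bpw(4, 8, 32),   # ~4.25
    "GGUF Q4_0 (4-bit int, fp16 scale per 32)":  bpw(4, 16, 32),  # ~4.50
    "GGUF Q8_0 (8-bit int, fp16 scale per 32)":  bpw(8, 16, 32),  # ~8.50
}
for name, bits in formats.items():
    print(f"{name:44s} ≈ {bits:.2f} bits/weight")
```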

1 Like

STH included this nugget. Big oof on prompt processing speed at Q4:

2 Likes

The Spark is an (IMHO) impressive little AI engine that can (could), under the right circumstances and expectations. However, I am wondering about some choices for the connectivity, maybe @wendell can comment: yes, the 100 Gbit/s networking capability is impressive, but it would have been nice if Nvidia had (also) included OCuLink or TB5, at least as options. Also because that might have made it easier for us pesky users to attempt boosting compute performance by pairing the Spark with, for example, a 4090 or 5090 attached over an eGPU.
Or maybe that’s exactly what Nvidia was trying to avoid :thinking:?

In that regard, does the Spark use PCI-E 5 at least internally?

Lastly, is there still hope that Intel might, just might, deliver on Project Battlematrix? It probably wouldn’t be anywhere near as cute and polished as the Spark, but it could - in theory (!) - deliver a lot of AI oomph per dollar.

So with the CX7 it is RDMA (low CPU overhead) at 200 Gb/s, or about 20 gigabytes/sec: roughly 1/3 of PCIe 5.0 x16 bandwidth, and far, far beyond even Thunderbolt 5.

NCCL does exactly what you’re talking about. Pick up a CX7, pop it in your workstation with an RTX PRO, and Bob’s your uncle. Technically the 5090 doesn’t support peer-to-peer anymore, but NCCL means that vLLM can/should be able to use the remote GPU somewhat asymmetrically. I bet over the next year we see unhinged projects like that.

1 Like

I really hope something like Burn takes off as the cornerstone of an ecosystem (with backends like WebGPU and Vulkan), so the peasants can get into ML and AI without paying the Nvidia toll. Or maybe new architectures will make CPUs relevant again.

Okay 14.48 tokens/s on a 120B model is impressive.

Any chance @wendell could suspend his feelings of disgust for a few days and pit the Spark not only against the Strix Halo but also the M4 Max? Because it seems the latter still wins on memory bandwidth.

What’s odd for me is that, at least according to the llama.cpp benchmarks, the DGX Spark performs alright compared to the Ryzen AI 395+ (which you can find for less than half its price). Ollama is just the UI on top of llama.cpp and is less flexible and often outdated (yet somehow always takes the credit).

In any case, Nvidia also released the Jetson AGX Thor, which is supposedly twice as fast as the DGX Spark at around 3.5k USD. It has a weaker CPU; nonetheless, it is hard to place the DGX Spark in a winning spot against the market: Ryzen AI is cheaper with similar performance, Apple’s M3 Ultra has both higher memory bandwidth (thus speed) and higher capacity, and Nvidia’s AGX Thor is twice as fast while being (somehow) cheaper and comes with CUDA support.

No wonder the release was delayed so much, with very vague marketing (as in not including the number of CUDA and Tensor cores in the DETAILED spec sheet; I triple-checked as of today); it’s nearly impossible to recommend it over the alternatives. If you want CUDA, the AGX Thor at roughly 2 PFLOPS (FP4 sparse) is better value for money; at least in their spec sheets Nvidia lists 2,560 GPU cores and 96 Tensor cores.

1 Like

The only place that actually lists the official numbers is The Register, in their article reviewing the DGX Spark yesterday. So it seems the DGX Spark on paper packs more CUDA/Tensor cores, yet FP4 throughput is lower for some reason? Either the Thor has higher tensor throughput (boost clock), or (tin foil hat on) Nvidia is doing some black-magic random number generation for those PFLOPS.

| Feature | DGX Spark | Jetson AGX Thor |
| --- | --- | --- |
| GPU Architecture | Blackwell (GB10) | Blackwell (GB200 Deriv. / GB10B) |
| CUDA Cores | 6,144 | 2,560 |
| Tensor Cores | 192 (5th-gen) | 96 (5th-gen) |
| RT Cores | 48 (4th-gen) | N/A |
| AI Throughput | ≈ 1,000 AI TOPS (FP4 sparse) | ≈ 2,070 TFLOPS FP4 (sparse) / ≈ 1,035 TFLOPS FP8 (dense) |
| Memory Type / Bus / BW | 128 GB LPDDR5X / 256-bit / 273 GB/s | 128 GB LPDDR5X / 256-bit / 273 GB/s |

The device info table was in my video also. For inference perf it’s a bit low, but fine-tuning and other tasks are very fast. I think multi-agent stuff feels faster too, but it’s hard to benchmark that quantitatively.

If we’re talking about gpt-oss-120b, which has only ~5.1B active MXFP4 parameters per forward pass, 14.5 tok/s is much lower than you’d expect for the MBW that the Spark has. ggerganov (who knows a thing or two about llama.cpp :joy:) published the best numbers for the Spark so far just yesterday: Performance of llama.cpp on NVIDIA DGX Spark · ggml-org/llama.cpp · Discussion #16578 · GitHub

Actually, it looks like there’s been a big bump up from yesterday, more along the lines you’d expect (it was like 38 yesterday, now 52+ tok/s) for the ggml-org quant of gpt-oss-120b (this is Q8/MXFP4 - you can read more about the differences in the GGUF quants of gpt-oss here). That basically exactly matches Strix Halo w/ the Vulkan AMDVLK backend at zero context (but with 2.5X the pp2048).

The DGX Spark isn’t particularly meant as a pure inferencing box (and if you’re using it as that, you’re probably overpaying for it). I do think that for most people, if you’re not particularly space or power constrained, you’re probably better off with other local setups. If you have specific work that can be done in the cloud, that’ll likely be way more cost-effective, and if you have any serious work you need to do locally, you should probably go with PRO 6000(s).

For edge inferencing, I’m curious to see how the M5 Mac Mini/Studios perform now that they have tensor cores (I saw the new M5 Macbook Pros just launched), so maybe we’ll see the other new machines soon.

5 Likes

I want to do a video on this, and you and @ubergarm know what you’re doing.

Some of the “cheap” 12-memory-channel Epycs plus a “cheap” GPU are a way to get fantastic local inference that beats all of these solutions. vLLM is pretty worthless here for mixing CPU+GPU, but so many people are hyperfocused on tokens/sec. I think another interesting aspect of this is the 5,000-foot view of the software, and how different software stacks give you different kinds of flexibility in where the compute happens, but with different perf tradeoffs. Even before quants enter the chat.
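As one concrete example of that flexibility, here is a hedged sketch of llama.cpp-style layer offload via the llama-cpp-python bindings (the model path and layer split below are placeholders; the right split depends entirely on your VRAM and the model):

```python
# Sketch of splitting an LLM between GPU and system RAM with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="models/gpt-oss-120b-mxfp4.gguf",  # hypothetical local GGUF file
    n_gpu_layers=24,   # keep this many layers on the GPU; the rest run from system RAM
    n_ctx=8192,        # context window
    n_threads=32,      # CPU threads for the layers left on the host
)

out = llm("Explain in one paragraph why prompt processing is compute-bound "
          "while token generation is bandwidth-bound.", max_tokens=200)
print(out["choices"][0]["text"])
```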

One variable is… can I add a $1k CX7 from eBay and use NCCL with the Spark? That would also be really cool.

3 Likes

For GPU/CPU offload I think GitHub - kvcache-ai/ktransformers: A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations is still SOTA, but vLLM has been doing a bunch of work on disaggregated inference and V1 I believe supports both kvcache and weight offload at this point (vLLM simply is not the fastest at c=1/bs=1 though, although vLLM/SGLang can be the right answer for some things).

But yeah, for home use, I think in general a combination of some GPUs for doing prefill and for loading shared experts, plus any sort of workstation with decent memory bandwidth (let’s be real, 256 GB/s MBW is not hard to beat!) - for $2-4K you could be running way bigger models at the same speed or faster than these little mini boxes. You can probably get more useful FLOPS as well (at the end of the day a Strix Halo has the compute of a potato (7600XT?) and even the Spark is basically just like a 5070)…

There are a lot of accelerator options if you only care about LLM inference as well. Anything from old MI50s to new B60 Duals or R9700s. Heck, there are even those Ascend 310 cards (I haven’t tested those yet, but I have benchmarked a 910B node and poked at the CANN stack). You’re really just drawing a matrix of price, memory, memory bandwidth, and compute (and software support).

I think as fun toys/tinkering all of these mini boxes are neat to play with, but beyond academic/hobbyist use, I just don’t see many professional/industrial justifications for spending $4K+ on a Spark when you could rent an H100 for $1.5/h, buy a PRO 6000 for $8K, or (for just poking around) use a $500 consumer GPU.

I do think in general it’d be useful to better outline the different kinds of workloads and tradeoffs someone might be thinking about for home AI. I think right now people have barely wrapped their heads around prefill and generation but don’t really understand how different architectures and quants affect throughput, where the different theoretical vs real world bottlenecks are, where the optimizations/pain points are, and how they differ across different use cases.

5 Likes

Thanks for linking GG’s updated benchmark results! Right, with 5.1B active parameters per token, attn_(q|k|v|o) at 8 bpw (q8_0), ffn_(down|gate|up)_exps routed experts at mxfp4 4 bpw, and 4 active experts per token, the size of the active weights is roughly 3.2 GB.

Assuming 273 GB/s max memory bandwidth through the GPU, the theoretical max token-gen speed would be 273/3.2 = 85 tok/sec, and the latest benchmark suggests around 53 tok/sec, so it’s saturating maybe 60-70% of the theoretical max, which is pretty good.
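Spelled out as a quick calculation (same numbers as above; token generation is roughly memory-bandwidth-bound, so the ceiling is bandwidth divided by active bytes per token):

```python
# Back-of-envelope decode ceiling for gpt-oss-120b on DGX Spark (numbers from this thread).
bandwidth_gb_s    = 273    # LPDDR5X bandwidth
active_weights_gb = 3.2    # ~5.1B active params/token at mixed q8_0 / mxfp4

ceiling  = bandwidth_gb_s / active_weights_gb    # ~85 tok/s theoretical max
measured = 53                                    # latest llama.cpp benchmark
print(f"ceiling ~{ceiling:.0f} tok/s, measured {measured} tok/s "
      f"-> {measured / ceiling:.0%} of theoretical")   # ~62%
```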

On larger servers configured NPS1, even within a single socket it’s hard to saturate more than 60-70% of memory bandwidth.

For some quant types on my AMD 9950X running CPU-only, I can hit 90+% of theoretical max speeds given my 2x48 GB DDR5 @ 6400 MT/s.

Makes me curious to compare a similar-bpw gpt-oss-120b using the latest hotness, iq4_kss 4.0 bpw on ik_llama.cpp, which runs fine on CPUs with no need for special fp4 support, in terms of speed and perplexity quality…

1 Like

The DGX Spark does 1 PFLOP FP4 sparse, so 500 TFLOPS FP4 dense, which means 250/125 TFLOPS FP8/FP16 dense.
The AGX Thor (allegedly) does 2070 TFLOPS FP4 / 1035 TFLOPS FP8 sparse, so 1 PFLOP FP4 dense and ~500 TFLOPS FP8 dense.
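The halving logic in those two lines, as a quick script (just reproducing the poster’s arithmetic, not an official spec derivation):

```python
# Derive dense throughput from the headline sparse FP4 numbers by repeated halving:
# 2:4 structured sparsity doubles the headline figure, and each precision step up halves it.
def dense_tflops(sparse_fp4_tflops):
    fp4 = sparse_fp4_tflops / 2
    return {"FP4": fp4, "FP8": fp4 / 2, "FP16": fp4 / 4}

print("DGX Spark:", dense_tflops(1000))   # {'FP4': 500.0, 'FP8': 250.0, 'FP16': 125.0}
print("AGX Thor :", dense_tflops(2070))   # {'FP4': 1035.0, 'FP8': 517.5, 'FP16': 258.75}
```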

Those numbers don’t make sense; how can Thor be twice as fast as the GB10 while having only 2,560 CUDA cores and half the tensor cores under the same GPU architecture?

I think the claims for Thor may just be wrong, or there may be some other misleading thing going on to inflate those numbers.

EDIT: found a likely explanation in the llama.cpp thread linked above:

Quick tldr:
Thor is sm_110 (formerly sm_101) with the datacenter-style tensor cores - including tensor memory. And no raytracing cores. While Spark is sm_121 with the full consumer Blackwell feature set.
Thor and Spark have relatively similar memory bandwidth. The Thor CPU is much slower.
Vector throughput on Thor is 1/3rd of the one on DGX Spark but you get twice the matrix throughput.
Thor has 4 cursed Synopsys 25GbE NICs (set to 10GbE by default, see Enable 25 Gigabit Ethernet on QSFP Port — NVIDIA Jetson Linux Developer Guide, as it doesn’t have auto-negotiation of the link rate) exposed via a QSFP connector providing 4x25GbE, while Spark systems have regular ConnectX-7.
Thor uses a downstream L4T stack instead of regular NVIDIA drivers unlike Spark. But at least the CUDA SDK is the same unlike prior Tegras. Oh and you get less other IO too.
Side note: might be better to also consider GB10 systems from OEMs. Those are available for cheaper than AGX Thor devkits too.

3 Likes