NVIDIA's DGX Spark Review and First Impressions

I think the point of the DGX Spark was always what Jensen advertised it as when he introduced the machine as Project DIGITS on stage at CES, and how Nvidia positioned it afterwards: a testbed/prototyping workstation for an NVIDIA DGX B200, so the B200 itself doesn't get tied up with that work and can keep doing other things. I think a lot of people jumped the gun on the specs and assumed it was primarily for local AI, because they saw Nvidia's name and a big pool of memory at a price that was high but not outrageous for the market at the time.

Which, sure, it can do and it can fill that niche, but I feel like Nvidia missed the boat on local AI with this one, given the timing and how much wafer capacity they need to commit to everything else they want to do. It felt like this chip never got the priority it needed: even though it was announced at CES at the same time as AMD's Strix Halo, AMD managed to get that chip to market almost a full quarter faster than Nvidia did. That's not to discount the unique features like its networking and FP4 support, but in my opinion those are much less of a reason to buy it today, when most setups will only ever use one machine, FP4 is still in its infancy, and competitors have more compelling options (especially on price) if you give up those advantages. We'll see how it turns out, but my gut feeling is that what I said above is accurate, and I wouldn't get a DGX Spark unless I had a DGX GB200 I had to work with.


@wendell are there any benchmarks out yet for a dual DGX Spark setup?

no dual benchmarks yet. I might be able to get a second one to test, which, imho, is more interesting than inference.


I got a little curious about whether NVFP4 on Blackwell actually matters. I don’t have a Spark, so I just tested on a PRO 6000, and it doesn’t look like it.

For those interested in digging in, results and replication data here: speed-benchmarking/nvfp4 at main · AUGMXNT/speed-benchmarking · GitHub

(NVFP4 may be useful for training, but based on my testing, benefits for inference are questionable except maybe at larger/non-interactive batch sizes)


Running GLM-4.5-Air at 4-bit on the DGX Spark via llama.cpp seems to get 16-17 t/s decode and 431 t/s prompt processing. Not too bad.
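
For anyone who wants a rough, repeatable number, a minimal llama-cpp-python sketch like this is enough to get a ballpark tokens/sec (the GGUF filename and parameters below are placeholders, not my exact setup):

```python
# Rough timing sketch with llama-cpp-python; this measures end-to-end speed
# (prompt processing + decode together), so it won't exactly match separate
# decode / prompt-processing numbers.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="GLM-4.5-Air-Q4_K_M.gguf",  # placeholder 4-bit GGUF filename
    n_gpu_layers=-1,                       # offload every layer to the GPU
    n_ctx=8192,
)

prompt = "Summarize what the DGX Spark is in one paragraph."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"~{generated / elapsed:.1f} tok/s end-to-end")
```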

Alex Ziskind was also hyping up your stand in his DGX Spark video.


The standard LLM inference kernel is effectively ‘mx + b’; the NVFP4 kernel is ‘amx + b’, which is significantly more work for the GPU, yet it runs at roughly the same speed as fp8, with half the VRAM and the same quality. It is a huge improvement, just not in the area you seem to care about.
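
To make the ‘amx + b’ point concrete, here is a toy numpy sketch of block-scaled 4-bit weights. It only simulates the extra per-block scale multiply; it is not the real NVFP4 (E2M1) encoding:

```python
# Plain kernel: y = m @ x + b. Block-scaled 4-bit: the weight has to be rebuilt
# as a * q (per-block scale times quantized value) first, i.e. 'amx + b'.
import numpy as np

def quantize_blockwise_int4(w, block=16):
    """Symmetric round-to-nearest 4-bit quantization, one scale per block."""
    w_blocks = w.reshape(-1, block)
    scales = np.abs(w_blocks).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w_blocks / scales), -8, 7)
    return q, scales

def dequantize(q, scales, shape):
    return (q * scales).reshape(shape)      # the extra 'a *' multiply

rng = np.random.default_rng(0)
m = rng.standard_normal((64, 64)).astype(np.float32)
x = rng.standard_normal(64).astype(np.float32)
b = rng.standard_normal(64).astype(np.float32)

q, scales = quantize_blockwise_int4(m)
m_hat = dequantize(q, scales, m.shape)

print("full precision:", (m @ x + b)[:3])
print("block 4-bit   :", (m_hat @ x + b)[:3])   # same answer modulo quant error
```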

Right now the NVFP4 kernels are not very well optimized for cuda 12 devices, since most of them were written for cuda 10. vLLM has a massive stall in it right now, only managing 50% utilization, so I expect it should end up being faster once the kernels get optimized.


The NVFP4 was tested with TensorRT (trt-llm), not vLLM.

As for quant quality, there can be huge differences between quants of the same precision based on calibration and the specific quantization technique (e.g. SpinQuant with online Hadamard rotations vs RTN or SmoothQuant). People should eval their models against their own downstream use cases for quality, but the argument that there are somehow huge performance gains people are leaving on the table right now by not using FP4? Not from my testing.
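
If you want a concrete starting point for that kind of downstream check, a minimal sketch could look like this (the GGUF filenames and the QA pairs are placeholders; swap in your real workload):

```python
# Score two quants of the same model on your own prompts and compare.
from llama_cpp import Llama

cases = [
    ("What is 17 * 23? Answer with just the number.", "391"),
    ("What is the capital of Australia? Answer with just the city.", "Canberra"),
    # ...your real downstream prompts go here
]

def score(model_path: str) -> float:
    llm = Llama(model_path=model_path, n_gpu_layers=-1, verbose=False)
    hits = 0
    for prompt, expected in cases:
        text = llm(prompt, max_tokens=32)["choices"][0]["text"]
        hits += expected.lower() in text.lower()
    return hits / len(cases)

for path in ["model-Q4_K_M.gguf", "model-Q8_0.gguf"]:  # placeholder quant files
    print(path, score(path))
```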

Whether there’s more juice to squeeze in the future, only time will tell (but you’re right, I dgaf about any theoretical advantages if they’re not reflected in actual real world performance).


trt-llm was/is by far the most promising on spark. all my testing showed that.

I kinda wanna do a follow-up on Spark and where we are. It remains to be seen if NIM is going to be able to demo NCCL via Spark, but I'm working on it.

Thanks! I'm an amateur 3D designer; I really should look further into guides past baby's first donut from Blender Guru LOL


This being said (and with my inexperience being obviously Obvious), which CX7 should I pick up and pop into my Supermicro H13SAE-MF EPYC 4005 system?

I was originally going to go RTX 6000 in my build, but I decided to get the DGX Spark instead.

I don’t need a Royal Flush, but I would really enjoy a seat at The Table :wink:


Wendell, have you benchmarked 2x Spark, and have you explored eGPU on the Spark? :slight_smile: Or even NCCL?


I am working on that very thing right now. It’s good as a trainer/lab… works the same as gb200 but that’s about all I can say right now

working on the eGPU or dual Spark?

dual Spark. You "might" also be able to do NCCL via CX7 to an RTX Pro 6000 'other computer'.
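
Something along these lines is the kind of two-box NCCL smoke test I have in mind; interface names, addresses, and launch details are assumptions, so adjust for your setup:

```python
# Run one rank on each machine over the CX7 link, e.g.:
#   MASTER_ADDR=192.168.100.1 MASTER_PORT=29500 RANK=0 WORLD_SIZE=2 python nccl_test.py
#   MASTER_ADDR=192.168.100.1 MASTER_PORT=29500 RANK=1 WORLD_SIZE=2 python nccl_test.py
import os
import torch
import torch.distributed as dist

# Point NCCL at the ConnectX-7 port (names vary; check `ip link` / ibdev2netdev).
os.environ.setdefault("NCCL_SOCKET_IFNAME", "enp1s0f0np0")  # assumed interface name
os.environ.setdefault("NCCL_IB_HCA", "mlx5_0")              # assumed HCA name

dist.init_process_group(
    backend="nccl",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)

x = torch.ones(64 * 1024 * 1024, device="cuda")  # 256 MB of fp32
dist.all_reduce(x)                               # sums across both GPUs
torch.cuda.synchronize()
print(f"rank {dist.get_rank()}: all_reduce ok, x[0] = {x[0].item()}")
dist.destroy_process_group()
```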


Very interesting. It would be funny if that actually worked well and Nvidia accidentally created a much better product than intended.


I got my ASUS ASCENT GX10 and have been using it for work (training YOLO models).
So far it's about 2-5x slower than my RTX 4070 Ti SUPER 16 GB, but it can do things I was VRAM-limited on before. For example, part of my "magic" is to train the model for a few epochs at different image sizes on the entire dataset (this works a little better than amp=True). I was never able to train larger than imgsz=1280 with batch=1, which resulted in ~2-hour epochs. On my "Spark", I'm training for 5 epochs at imgsz=2048 with batch=4, but each epoch takes 21.5 hours. Doing so uses about 70 GB of VRAM (ymmv depending on your dataset).
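
For reference, that multi-resolution pass maps onto roughly this kind of Ultralytics loop (dataset config, starting weights, and the exact size/batch schedule below are placeholders, not my actual recipe):

```python
# Train a short pass at each image size on the full dataset, largest last.
from ultralytics import YOLO

model = YOLO("yolo11m.pt")  # placeholder starting checkpoint

for imgsz, batch in [(1280, 8), (1600, 6), (2048, 4)]:
    # If weights don't carry over between train() calls in your version,
    # reload the previous stage's last.pt before starting the next stage.
    model.train(
        data="my_dataset.yaml",  # placeholder dataset config
        epochs=5,
        imgsz=imgsz,
        batch=batch,
    )
```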

If, however, you overshoot the amount of RAM/VRAM, the device just hard-halts and you have to power it down by hand.

I should’ve just bought a 5090, or kept saving for the RTX Pro 6000.