Looking for Advice on High-Performance Workstation Build for AI/ML and Virtualization

Hey everyone,

I’m planning to build a high-performance workstation primarily for AI/ML workloads, virtualization, and some occasional gaming. I’ve been doing some research, but I’d love to get input from this community before finalizing my build.

Use Case:

  • AI/ML Workloads: Training models with TensorFlow/PyTorch, experimenting with LLMs, and running inference locally.
  • Virtualization: Running multiple VMs with different OS environments for testing and development. Considering PCIe passthrough for GPU acceleration.
  • Gaming: Not the top priority, but I’d like the option to play at 1440p with good performance.

Proposed Build:

  • CPU: AMD Threadripper 7960X / Intel Xeon W-2400 (leaning toward AMD for core count and PCIe lanes)
  • GPU: RTX 4090 / AMD Radeon Pro W7900 (considering CUDA support vs. open-source drivers)
  • RAM: 128GB DDR5 ECC (if AMD) or non-ECC (if Intel)
  • Storage: 2TB NVMe Gen4 (OS & applications) + 4TB NVMe (ML datasets & VMs)
  • Motherboard: ASUS Pro WS WRX80E-SAGE (for Threadripper) / ASUS Pro WS W790 (for Xeon)
  • PSU: 1200W Platinum-rated (for stability under heavy load)
  • Cooling: Custom loop or high-end AIO for CPU/GPU

Questions:

  1. CPU Choice – Would Threadripper be the better long-term investment, or would a high-end Xeon offer better reliability for virtualization?
  2. GPU Consideration – Should I stick with NVIDIA for AI workloads, or is AMD’s ROCm support good enough for ML tasks?
  3. Motherboard Recommendations – Any better options for PCIe expandability and reliability?
  4. Cooling – Would air cooling be sufficient, or should I go full custom loop?

Any advice or recommendations would be greatly appreciated! Thanks in advance.

Regards
Martyana

For question 2: to use an AMD GPU with PyTorch, as far as I know it’s only supported on Linux, not Windows. You can see that here: Start Locally | PyTorch. As for TensorFlow, I’m not sure. So if you also want to game, you’ll have to manage that with something like dual booting.

Also, ROCm itself supports both Linux and Windows.

I don’t personally own an AMD card, so I can’t say for sure what the state of ML on them is, but I do see posts online from people doing ML work with them (including LLM use).
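If you do go the ROCm route, a quick way to confirm the install actually sees the card is a check like this — a minimal sketch, assuming a ROCm build of PyTorch on Linux (ROCm builds reuse the `torch.cuda` API, and `torch.version.hip` is set instead of `torch.version.cuda`):

```python
import torch

# Which backend this PyTorch build was compiled against
print("build:", "ROCm" if torch.version.hip else ("CUDA" if torch.version.cuda else "CPU-only"))
print("GPU visible:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    # Tiny matmul to confirm kernels actually run on the card
    x = torch.randn(1024, 1024, device="cuda")
    print("matmul ok:", (x @ x).shape)
```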

For question 4: for air cooling, honestly you just need good PC fans with decent static pressure and airflow.

Don’t forget to factor in the case you’ll be using: the 4090 is a huge card if you do decide to go with it, especially if you want to expand the setup with another GPU later.

Ampere and newer have no support for GPU splitting (vGPU), so sharing one card across multiple VMs won’t be possible even with hacked drivers.
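So the realistic route is passing a whole GPU through to one VM at a time, which makes IOMMU grouping the thing to verify on whatever board you pick. A minimal sketch for listing the groups on Linux — standard sysfs paths, but run it on your own box to confirm:

```python
from pathlib import Path

# Each subdirectory of iommu_groups is one group; devices in the same
# group must generally be passed through to a VM together.
groups = Path("/sys/kernel/iommu_groups")
if not groups.exists():
    print("No IOMMU groups found: enable IOMMU in BIOS and on the kernel cmdline first")
else:
    for group in sorted(groups.iterdir(), key=lambda p: int(p.name)):
        devices = sorted(d.name for d in (group / "devices").iterdir())
        print(f"group {group.name}: {' '.join(devices)}")
```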

One GPU and two x4 NVMes is 24 lanes, which is what AM5 and LGA1851 provide from the CPU. The 7960X is also not much of a perf step up from a 9950X, and likely a regression from a 7950X3D or 9950X3D for gaming. So it looks like high marginal cost for possibly negative return.
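Just to make the lane math explicit (numbers straight from the comment above, nothing measured):

```python
# Lane budget: one x16 GPU plus two x4 NVMe drives, vs. the ~24 usable
# CPU lanes on AM5 / LGA1851 desktop parts.
devices = {"GPU": 16, "NVMe 1": 4, "NVMe 2": 4}
desktop_cpu_lanes = 24

need = sum(devices.values())
verdict = "fits" if need <= desktop_cpu_lanes else "spills onto chipset lanes"
print(f"need {need} CPU lanes, desktop provides {desktop_cpu_lanes}: {verdict}")
```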

The WRX80E-SAGE is the wrong socket (sWRX8) and the wrong Threadripper line (Pro); the non-Pro 7960X uses TRX50. See also all the threads here on ASUS mobo issues. The typical default for a 7960X would be the ASRock TRX50 WS.

Threadripper pricing rarely makes it a good investment when workloads are shardable or fit on desktop parts, but Puget’s data is favorable on reliability.

Depends on your noise targets. Air-cooled 4090 measurements are readily available; an air-cooled SP5-class cooler is going to be loud. Not that a 360 AIO will be quiet, but it’ll be quieter.

Did you build this machine? I just got a W7900 and it works great for AI/ML inference. I haven’t tried training yet, with PyTorch or anything else.
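When I do get around to training, I’ll probably start with a tiny smoke test along these lines — a minimal sketch with hypothetical shapes, just to exercise the backward pass on the card:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"  # ROCm builds show up as "cuda"

# Throwaway model and random batch; the point is only to confirm that
# forward, backward, and optimizer steps all run on the GPU.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(64, 512, device=device)
y = torch.randint(0, 10, (64,), device=device)

for step in range(5):
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"step {step}: loss {loss.item():.4f}")
```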