Advice for an EPYC workstation build for a DAW

So, got some preliminary news from Peter at Scan UK. If you’re not familiar with them, they’re a specialist AV retailer that has a division dedicated to building and selling high-performance music and audio workstations.

They’ve started testing Threadrippers for DAW use - they’ll have a fuller report soon, and have EPYCs on the roadmap to test as well - but it is a little concerning. He shared this with me:

[…] I tested the Threadrippers yesterday and they failed hard for a number of reasons that I think are going to affect the EPYC chips too. […] High core count: the only sequencer I know that claims to do above 64 threads smoothly is Reaper. Yesterday I sent them many screenshots of Reaper failing to address anything over 64 threads, the developer is looking into it but seems a bit perplexed at the moment […] Regarding suitability for audio, I don’t expect EPYC to work for us at this point. The reason Threadripper doesn’t work for us is that it’s simply two chips crammed into one and the interaction time between the dual memory controllers is crippling the data flow and causing the ASIO buffer to overload and collapse […] EPYC has 1 memory controller per core and it looks like 4 cores per chip and then in twin chip builds you’re talking about 8 memory controllers trying to talk to each other […] It’s an unworkable mess with two controllers, I really don’t expect 8 of them in one place to improve the situation I’m afraid.

Granted, this is just one organization’s benchmarks, and until I see more audio-specific benchmarks for Ryzen / TR / EPYC I may reserve judgment, but it’s tilting me towards waiting for Ice Lake Xeons later this year.

I’m also wondering if there’s something that can be done to optimize ASIO for Zen architecture going forward? If that problem could be solved, it’d put my mind more at ease.

3 Likes

This is interesting. Granted, I don’t know much about DAWs, but having worked with both high-bandwidth streaming for digital signal processing and now datacenter workloads, I can say that in my workloads memory bandwidth has never been an issue (read: crazy fast) with EPYC CPUs. No idea about Threadrippers. My money is on the Windows scheduler being terrible (with many cores), combined with the software on top not being optimized well for Zen 2. It would be interesting to see how a macOS VM does with PCIe passthrough for comparison. Is latency what audio folks are concerned with? At the end of the day, get what works best for your workload for the money. If Xeons do better for the money, I’d pick that :slight_smile:

1 Like

Yeah, this sounds like horribly optimized software, if that is the case at all.
And don’t EPYC 7002 and Threadripper 3 just have one large I/O die?

Edit: If I were you, I’d beg someone with new TR3/epyc 7002 systems to run some benchmarks. I wouldn’t trust a retailer, especially in the scenario where a certain company’s products are severely over-priced and criticized and the retailer is trying to move some units :slight_smile:

2 Likes

Yes, the issue is when you have small buffer sizes for live tracking. Maybe this is less of an issue if you’re doing non-realtime, all in-the-box composing and mixing, streaming audio / samples from SSDs, and making good use of track freezing?
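For intuition, the buffer/latency tradeoff mentioned here is simple arithmetic; a quick sketch in Python (48 kHz is assumed as the sample rate, and this counts only one buffer of the round trip):

```python
def buffer_latency_ms(buffer_frames: int, sample_rate: int = 48000) -> float:
    """Latency contributed by one audio buffer, in milliseconds."""
    return buffer_frames / sample_rate * 1000.0

# Small live-tracking buffers leave the CPU only a millisecond or two per
# callback; large mixing buffers are far more forgiving of scheduling hiccups.
for frames in (64, 128, 256, 1024):
    print(f"{frames:5d} frames @ 48 kHz -> {buffer_latency_ms(frames):.2f} ms")
```

At 64 frames the DAW has roughly 1.3 ms to produce each buffer, which is why small-buffer live tracking is so much more sensitive to stalls than frozen-track, in-the-box mixing.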

In doing further reading, Steinberg, who invented ASIO, had to introduce a feature in Cubase & Nuendo called ASIO-Guard to keep systems balanced and prevent CPU spikes, essentially shifting some tracks / plugins & resources from realtime to non-realtime processing, if I understand it correctly. Apparently it was particularly bad on macOS, but Windows benefits as well.

I agree that, looking deeper, it really seems like there is a weird confluence of low-level drivers, OS issues and hardware design. I don’t know if ASIO needs to be rewritten from scratch (maybe both Intel and AMD could fund some common basic research or OSS projects) or if it’s something that, like the old Amiga or Atari computers, needs custom ASICs on the mainboard?

I was daydreaming today that maybe the solution is a CPU / OS combination designed explicitly for audio from the ground up, rather than repurposing general-purpose chips.

Analog Devices makes the SHARC DSP chips, which, funnily enough, come in versions with an embedded ARM core. Imagine an SoC that combined a workstation-class ARM processor (like the 64-core Ampere eMAG) with lots of DSP capability.

2 Likes

Single rank is also fairly key for the kind of performance you’re looking for as well. 16 x 8GB modules still might not be enough, though. Scan UK didn’t happen to mention their hardware scheme, did they?

I’m thinking I agree with the above posters. Software needs to begin catching up but probably won’t until the hardware becomes a little more mainstream and accessible. Which is too bad.

Windows splits threads into groups of 64, and applications need to specifically provide support for navigating the different groups. EPYC, going into 128-thread territory, will certainly cause problems as well. The issue might not even be about memory management at all; memory may be only part of the story.
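The group arithmetic itself is easy to see. Here's a toy sketch (not the real Windows API, just the splitting rule it applies) of how a 128-thread EPYC ends up as two processor groups that a group-unaware app can't span:

```python
import math

def processor_groups(logical_processors: int, group_size: int = 64) -> list:
    """Split a logical-processor count into Windows-style groups of at most 64."""
    n_groups = math.ceil(logical_processors / group_size)
    return [min(group_size, logical_processors - i * group_size)
            for i in range(n_groups)]

# A 64C/128T chip presents two full groups; an application that never calls
# the multi-group affinity APIs is confined to the first 64 threads.
print(processor_groups(128))  # [64, 64]
print(processor_groups(96))   # [64, 32]
```

This matches the Reaper-over-64-threads symptom described earlier in the thread: nothing is broken at the hardware level, but the software has to explicitly opt in to the second group.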

This has turned into an interesting thread!
Cheers!

Yes, absolutely. But I think what Pete was getting at is more that modern computers, in the rush to get better at multitasking and speculative execution, do more of their computing in a non-realtime manner. For live recording or MIDI work (especially through plugins) you need processing as near realtime as possible or you get annoying delays, and NUMA can complicate this; the system will drop out if you use very small audio buffers (which in turn put more load on the CPU).

Looking deeper into this, it really is about the computer architecture. Short of going back to all-analog workflows, a truly realtime DAW would maybe involve RISC CPUs, a realtime OS like QNX, lots of DSP, and a different, possibly non-NUMA memory scheme.

This is why Pro Tools HD and HDX, newer MOTU boxes, and UAD Apollo interfaces rely heavily on DSP to enable zero-latency monitoring / mixing - as well as, in the HDX and UAD cases, running effects. This leaves our general purpose PC more power to handle running the DAW, VST instruments, and shuffling audio files from disk to memory and back.

So at this point, no matter what platform I choose, if I want to do serious live recording down the road, I guess I’m going to have to make a serious investment in either Pro Tools or UAD gear. And for the solo-composition route, it really is about more cores and faster all-core turbo speed, even without hyperthreading enabled, and then bus and memory bandwidth. For now EPYC still seems the winner, but Ice Lake will be interesting.

1 Like

Was just perusing my YT subscription notifications, and Buildzoid just uploaded a video of him “rambling”, as he often calls it, about his experiences with overclocking RAM on Threadripper (I know TR is something you aren’t interested in). All the same, it might be interesting to hear a few things he has to say that really jibe with what you’re talking about.

It’s looking like, for high-end pro audio (which is far beyond my typical usage, though I can see the issues rearing their heads even then), the idea of using Intel-based systems and/or heavy DSP as you’ve mentioned is really the only avenue that can offer a workable platform.

I wasn’t really aware of the limitations it had with regard to this area of computing. It’s unfortunate, and it really makes me wish, as you do, that something could be done about it. Too niche of a market? I don’t think it’s a lack of existing technology. Either way, thanks for bringing all of this up in the thread.

Cheers!

The video starts at about where he begins talking about latency/bandwidth:

1 Like

Dual rank will perform better if the clocks are the same. It is harder on the memory controller, so it might make the RAM run at a lower speed, but if you can force it to run at the rated speed, dual rank would be best.

The exception to this is if you have two DIMMs per channel, but the RAM slot topology plays a part in that as well.
Wendell goes over it in one of his Ryzen videos; I forget which atm.

1 Like

Thanks! There is a lot of conflicting opinion on this, tbh. Others say it’s really about memory bandwidth, so using the fastest memory you can get helps, and there were optimizations in the newest Windows 10 builds to avoid buffer underruns. But Pete’s tests pointed to the fact that each die in a multi-chiplet TR / EPYC has its own memory controller, whereas Core / Xeon has a single memory controller for the whole CPU. I don’t know if this gives AMD chips an advantage in letting chiplets access memory separately, but it causes some sort of coordination latency (if I understand correctly).

1 Like

Make sure your DAW can actually utilize your chosen CPU core count. Logic Pro recently increased available virtual threads to a max of 56 to pair with the 28-core Intel chip in the Mac Pro: 2 x 28 = 56. My desired Threadripper build may need to be scaled back to 24 cores. I have absolutely no idea how Logic Pro would behave with more than 28 cores. More of a hackintosh problem, but something to consider regardless.

Another issue with higher-core-count CPUs is single-core spikes leading to crashes. Imagine 63 cores all below 10% except for one that’s constantly overloading due to lack of headroom, because of the lower clock speeds in higher-core-count CPUs. Your project depends on all these cores working together at the same time; if one peaks, you crash. Suddenly you’ll realize how that EPYC SERVER CPU behaves within a DAW: 64 little dudes on bicycles tethered together vs. 16, 24, or 32 dudes on Vespas tethered together, hauling giant stacks of pizza down the highway. Loads will never be distributed equally across your cores. Now rethink that example with your cores cut in half by virtual thread doubling. There’s definitely a Goldilocks zone concerning core counts and clock speeds with pro audio. It’s been a problem in Logic for years.
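The one-channel-one-core spike scenario described above can be sketched as a toy model (all the load numbers here are made up purely for illustration):

```python
def worst_core_load(channel_loads: list, n_cores: int) -> float:
    """Round-robin channels onto cores and return the busiest core's total load.

    Models the common DAW scheme where each channel strip's serial plugin
    chain is pinned to a single core; the session glitches when ANY one
    core exceeds 1.0, no matter how idle the others are.
    """
    core_loads = [0.0] * n_cores
    for i, load in enumerate(channel_loads):
        core_loads[i % n_cores] += load
    return max(core_loads)

# 63 light channels plus one heavy chain (say, a stacked convolution reverb):
channels = [0.05] * 63 + [1.2]
print(worst_core_load(channels, n_cores=64))   # the heavy channel's core overloads
print(sum(channels) / 64)                      # yet average utilization is tiny
```

The average load looks harmless, but the maximum per-core load is what determines whether the buffer deadline is met, which is why slower cores in higher-core-count parts hurt even half-idle sessions.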

PCIe 5.0 is coming fast. You’d be better off with a temporary Threadripper build, and building your dream platform on PCIe 5.0. Maybe send Hans Zimmer or someone else in that league an email asking about their builds.

PCIe NVMe SSDs within Kontakt are marginally faster than SATA III SSDs for most use cases, held back primarily by software limitations and sample-decompression bottlenecks. I’ve been scoping forums for over a year now. Very few people are doing it, and definitely not at your scale. The demand needed to force sample engines to adapt still doesn’t exist. Your productivity would be better served by more, and cheaper, SATA III SSDs to expand your current sample libraries, along with backup drives for your backup drive’s backup drive. Samsung has 4 TB 2.5-inch SATA III drives for around 480 dollars. 4 TB PCIe 4.0 M.2 drives are double that price and will be outclassed by 2021-22, with higher capacities and substantially lower prices with PCIe 5.0. …It’s coming like Intel’s life depends on it.
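A rough way to see why SATA III is rarely the streaming bottleneck: even an uncompressed 24-bit stereo voice only needs about 288 kB/s. Here's a sketch treating drive throughput as the only limit (real sample engines buffer ahead, decompress, and share the bus, so these are loose upper bounds, not measured figures):

```python
def max_streaming_voices(drive_mb_s: float, sample_rate: int = 48000,
                         bytes_per_sample: int = 3, channels: int = 2) -> int:
    """Upper bound on simultaneous uncompressed sample streams a drive can feed."""
    bytes_per_voice = sample_rate * bytes_per_sample * channels  # 288,000 B/s stereo
    return int(drive_mb_s * 1_000_000 // bytes_per_voice)

print(max_streaming_voices(550))    # typical SATA III SSD
print(max_streaming_voices(5000))   # typical PCIe 4.0 NVMe
```

Both numbers dwarf any realistic voice count, which lines up with the observation above that the practical bottleneck is the sample engine's decompression and scheduling, not raw drive speed.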

Have you looked into PCIe expansion chassis using a PCIe bridge? Stumbled across this Reddit thread. Can’t post the link, but if you copy-paste the section below into Google you should be able to pull it up.


Anyway to turn 16 lanes of PCIe 4.0 in 32 lanes of PCIe 3.0?

u/kilogolfbravo


Why

When Intel Xe comes out, hopefully with multi-discrete-GPU support through oneAPI, I want to get 2 discrete GPUs in mITX. The Xe GPUs will most certainly oversaturate an x8 link because their performance should be equivalent to or better than a 2080 Ti. They also will not support PCIe Gen 4. As a result, simple bifurcation of an x16 link into 2 x8 links will not cut it. So that is why I am interested in the possibility of taking the PCIe Gen 4 link and turning it into a PCIe Gen 3 link with double the lanes.

Other Questions
Are there any redriver risers with bifurcation? Any planned for the future? If this is not possible, why? (the riser, not dual dGPU in mITX because I have done that before)

TL;DR

AMD X570 features PCIe Gen 4.0. I want to use the x16 PCIe Gen 4 link on an X570 mITX motherboard to create 2 PCIe Gen 3 x16 links, in order to make a 2-dGPU setup (in 2 slots) with maximum performance, considering that Intel’s upcoming Xe dGPUs will most likely feature perfect or near-perfect multi-GPU scaling that is abstracted from the software, and considering that those GPUs will easily saturate the PCIe 3.0 x8 link that I would otherwise have with normal bifurcation or effectively with PLX switching.

1 Like

Hi there @Rickybobby. Welcome!

Interesting, though I would think that thermal throttling would take care of preventing a crash like that. Most DAWs also have a load meter and a way of limiting their usage.

Having extra threads/cores available hasn’t been too much of an issue for some time. It used to be games that suffered; a famous example was Fallout 3 and quad-core CPUs. I’m not certain it would affect a DAW in any negative way today. It doesn’t with my own setup. The threads or cores just won’t be used, which seems to be the main complaint. Besides, having extra threads available for plug-ins running alongside your DAW is a good thing.

I don’t want to encourage veering too much off-topic but last I’ve read, even a 2080Ti doesn’t come close to saturating PCIe 3.0 x8 lanes. So maybe I misunderstand what you’re trying to say?
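For reference, the lane arithmetic behind the pasted question works out like this (the per-lane figures are the commonly quoted approximate usable rates after encoding overhead, not measurements):

```python
# Approximate usable per-lane throughput in GB/s; each generation doubles it.
PCIE_LANE_GB_S = {3: 0.985, 4: 1.969, 5: 3.938}

def link_bandwidth_gb_s(gen: int, lanes: int) -> float:
    """Aggregate one-direction bandwidth of a PCIe link."""
    return PCIE_LANE_GB_S[gen] * lanes

# An x16 Gen 4 link carries the same payload as x32 Gen 3, which is why a
# Gen4-to-Gen3 switch could in principle double the downstream lane count.
print(f"Gen4 x16: {link_bandwidth_gb_s(4, 16):.1f} GB/s")
print(f"Gen3 x32: {link_bandwidth_gb_s(3, 32):.1f} GB/s")
print(f"Gen3 x8:  {link_bandwidth_gb_s(3, 8):.1f} GB/s")
```

So the question upthread is about bandwidth conversion, not extra bandwidth: whether a card on a Gen 3 x8 slice is actually constrained is a separate issue, and as noted above, current GPUs generally aren't.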

It was nearly a year ago, but I recall Intel arguing against PCIe 4.0 bringing much if any benefit to graphics at all. Most seem to agree, even knowing about the future existence of Xe. Word on the street is that it probably won’t be until PCIe 5.0 (which has already had its initial presentation) that video cards will actually need the extra bandwidth.

(Intel’s thoughts)

Anywhooz, let’s try to stay in the realm of audio before we upset the mods unless we’re talking about offloading to CUDA cores for DSP :grin: Otherwise, we really should be posting the graphics specific stuff in the proper part of the forum. More eyes will see your questions and answer them! :grinning:

Thanks and Cheers!

1 Like

This is just my experience with DAWs, but it happens immediately, even with cold restarts, if your channel strip settings become too ambitious. Each channel is sent to a single core. Virtual threading is helpful when your base clock is actually high enough to be cut in half. Thermal throttling makes everything substantially worse. There are thousands of plugins with unique/outdated ways of distributing processing. Hopefully with the release of OS X Catalina cutting 32-bit support, developers will finally optimize their plugins across the board, even for Windows.

I just copy-pasted the title of a thread discussing/speculating about PLX chips/bridges to convert PCIe 4.0 into more available 3.0 lanes for extra expansion capabilities. Intel’s next Xeon chips are running DDR5 and PCIe 5.0 in 2021. Probably currently useless except for one-upping AMD. I just feel bad for people building $50,000 systems in the new Mac Pro on PCIe 3.0. It’s the only reason I didn’t buy one and went down the hackintosh route instead.

2 Likes

All good points. I may have to scale back my ambitions to use very large orchestral libraries without secondary machines to run VEP, and use the “lite” versions for the time being, but I do still see PCIE 4 NVME as a win for read/write speeds - I will likely repurpose other SATA SSDs for secondary storage and backups and I have a QNAP external disk unit for longer-term backups.

I don’t particularly need superfast video cards so that’s not an issue - my WX4100 is more than enough.

To reply to another point, GPUs cannot be repurposed for DSP - if they could everyone would be doing it. There were some experiments to create reverb plugins using CUDA around 2009, but it never really caught on. Plus, video card architectures vary widely between manufacturers so it’s not easily standardizable, vs Avid or UAD which run on dedicated hardware. It’s kind of similar to the multithreading issue in that GPUs are designed for massively parallel operations while audio DSP is by nature highly serialized, chaining the output of one operation into another. Extensions to x86 like SSE and AVX help the CPU, but it’s not quite as powerful.
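The serial-chain point can be illustrated with a trivial sketch: each stage of a channel strip consumes the previous stage's output, so the chain itself can't be spread across thousands of GPU threads the way a shader workload can (the gain and clip stages here are hypothetical toy effects, not any real plugin):

```python
def gain(samples: list, db: float) -> list:
    """Scale samples by a decibel amount."""
    factor = 10 ** (db / 20)
    return [s * factor for s in samples]

def hard_clip(samples: list, ceiling: float = 1.0) -> list:
    """Clamp samples to +/- ceiling, like a brickwall limiter's crudest cousin."""
    return [max(-ceiling, min(ceiling, s)) for s in samples]

def process_chain(samples: list, stages: list) -> list:
    # Each stage must wait for the one before it: inherently sequential,
    # however parallel each individual stage might be internally.
    for stage in stages:
        samples = stage(samples)
    return samples

out = process_chain([0.5, -0.8], [lambda s: gain(s, 12), hard_clip])
print(out)
```

There is parallelism *within* a stage (across samples or channels), which is what SSE/AVX exploits, but the stage-to-stage dependency is what keeps the overall chain latency-bound rather than throughput-bound.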

To the point re: PCIe extension chassis, I suppose that can help but it’s expensive overkill… With a 7-slot system, after the graphics card, an SSD carrier card and maybe extra USB-C ports, that leaves 4 slots I could use for UAD Octo cards - and their drivers only support 4 cards in a single computer anyway. If I managed to max all that out I’d better be working on paid projects… :slight_smile:

1 Like

Not in userland. So no, everyone couldn’t do it, but if developers used them, then yes. They have in the past, as you said, and it didn’t take off. But that does not mean it cannot be done. There is plenty of talk about it even still, and with the way GPUs, or rather GPGPUs, are being used more and more, it is more than likely that we will see this in the future. It would only be natural to leverage them over traditionally developing your own silicon. They are also ahead of the curve in terms of processing power, especially now as compared to then.

Anyway, until this happens it is only guesswork, but I feel it’s a bet you could make. Avid appears to be one of the leaders behind the scenes, though again, that’s nothing more than rumour so far. Using one or two Tesla cards for your audio needs, instead of an entire rack of gear, would not only be faster and use less space, but also be orders of magnitude cheaper.

I’m thinking it’s just a matter of time.

One example of one such fairly recent article (2019) by a developer of a DSP library making use of NVIDIA GPGPUs:

It’s happening… :smiley:

So, just as a quick hardware update, the AsRock Rack ROMED8-2T (basically the PCIE 4.0 + upgraded version of the EPYCD8-2T) is supposed to be released around the end of March. I think the few examples that have been seen in the wild (like at Typhoon Systems in Singapore) are engineering samples. One enterprise system builder based in Palo Alto quoted me about $655 USD for the board - I have asked CDW to see if they have pricing / dates too.

FWIW I went to my local PC enthusiast chain store, Memory Express, and they simply don’t even deal with AsRock, so I guess I have to wait until the big mail-order enterprise IT shops start stocking them.

On another note, I’d been doing some thinking about the idea of a modern DSP-powered hardware DAW.

With its new Luna recording software, UA is sort of moving towards stronger software-hardware integration, more formally competing with ProTools, but I’m also thinking about things like the Rodecaster Pro multitrack. It’d be kind of cool to basically mash up a Rodecaster form factor with Apollo-style SHARC DSP onboard, and control it with a very lightweight, realtime OS running an embedded DAW - maybe on some sort of ARM CPU / APU with onboard graphics, mouse/keyboard support etc - like a beefier Chromebook under the hood.

It could have an internet-connected plugin store to buy audio effects and instruments (adapted VSTis or AUs?) or they could be sideloaded via USB or memory card; expandable internal M.2 storage + add-in slots for more DSP.

In any case, by removing a general-purpose CPU and the overhead of running a complex desktop OS, more system resources can be put towards recording, monitoring, and playing back audio with very low latency.

Yeah, that is what I have read as well.

Pete Kaine from Scan Pro noted in reviewing newer Threadrippers for audio that using “AMD calibrated” memory packs worked much better, and definitely using ones rated for the system speed (DDR 3200) helped eliminate some memory “holes” - talking about the tendency to have buffer under-runs on AMD vs Intel due to the multi-chiplet design.

That said it really just affects recording latency for realtime monitoring / playback, with DSP + “frozen” tracks it ought not to be as much of an issue. I don’t record multitrack live instruments anyway!

1 Like

Yup. This is why I think I would lean more heavily on using UAD DSP for channel FX and freezing tracks as needed.

So some updates after reading everyone’s very helpful replies:

  • I took a very close look at TRX40 and Threadripper.
  • I was very, very tempted by the Gigabyte Designare, especially with its onboard USB 3.2, 4x M.2 slots, and included Titan Ridge and 4x M.2 PCIe cards - but it’s an XL-ATX board, which doesn’t fit in the Dune Pro.
  • The Dune Pro is a pretty big case that can handle wider motherboards like EEB, but not taller ones like XL-ATX; see this size comparison at LANOC
  • Audio performance comes back to all-core boost speed rather than top 1-2 core boosts. EPYC is designed to run all cores at consistent boost speeds, so having 16-32+ consistent cores vs. having the system prioritize 1-2 cores to engage Ludicrous Speed! isn’t as helpful for multitrack audio. See Pete Kaine’s test results with the rebuilt DAWBench here.
  • And there’s the number of lanes / slots issue. Officially up to 4 UAD devices are supported on a single system, so on an ASRockRack ROMED8-2T, I could use 4 slots for DSP, 1 for video, 1 for quad NVME M.2s, and leave one free for future expansion.

So… yeah. I think it all comes down now to waiting for the case (held up because of coronavirus delays), motherboard (mid-April) and then seeing what prices 7002 CPUs are going for… :slight_smile:

4 Likes

Right, but testing also shows that it makes only a margin-of-error difference whether the box the RAM came in says AMD or Intel on it. I believe it was something like 0.1% in either direction for both labels. If the motherboard manufacturer programmed the BIOS for the RAM, you are good.

This is very interesting and I’m wondering how it turned out for you. What was your config? Is it working? How long did it take to get it working?

I too am coming from a Mac, and the plethora of choices is overwhelming.