Server build for extreme data bandwidth needs

Hi all,

I run a server for our company, which provides assets to the VFX industry (Matrix, Dune, Uncharted, etc., just to whet the appetite), and we’re in need of an upgrade.

We deal with massive volumes (though not massive on the enterprise scale), capturing around 20,000 photos in a 6-hour period. This currently takes ~2 weeks to process, which we need to get under 1 week. We’re generating about 8TB per week, but if bottlenecks weren’t a problem we could go much higher (16TB+).

A bit more detail
The 20,000 photos generate around 8TB of uncompressed storage, which is then combined into various datasets, generating another 6TB payload and 3TB of temporary cache storage. So it’s no surprise that our testing indicates our biggest current bottleneck is our HDD bandwidth. We have 4TB of NVMe cache storage to utilize, but it’s not enough to handle the whole volume, resulting in juggling data around or relegating some of it to slower spinning disks.
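For rough orientation, here’s a back-of-envelope sketch I did of how long just streaming one job’s worth of data (~17TB) takes at various storage speeds. The throughput figures are assumptions rather than measurements, and our pipeline touches the data more than once, so treat these as a floor, not a prediction.

```python
# Rough back-of-envelope: time to stream one job's worth of data
# (8TB raw + 6TB combined + 3TB temp cache) at assumed sustained speeds.
# Throughput numbers below are assumptions, not measurements.

TB = 1e12  # bytes

job_bytes = (8 + 6 + 3) * TB           # ~17TB touched per capture job

hdd_stream = 250e6                     # ~250 MB/s sustained per Exos (assumed)
nvme_stream = 3e9                      # ~3 GB/s sustained per NVMe drive (assumed)

for label, bw in [("single HDD", hdd_stream),
                  ("4-wide HDD stripe", 4 * hdd_stream),
                  ("single NVMe", nvme_stream),
                  ("4 x NVMe", 4 * nvme_stream)]:
    hours = job_bytes / bw / 3600
    print(f"{label:>18}: {hours:6.1f} h to stream ~17TB once")
```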

Current Processing Server:

  • OS: Unraid + VMs
  • CPU: Ryzen 5900X
  • MOBO: ASRock X570 Taichi
  • RAM: 128GB
  • GPU: 6800 XT
  • HDD: 188TB (Exos drives)
  • CACHE: 4TB NVMe - 1x 2TB Sabrent Rocket Plus, 1x 2TB Sabrent Rocket (btrfs pool)
  • Other PCIe: 10GbE fibre card, HBA card

Current limitations
With benchmarks pointing to storage as the main soak on time, if we could store everything on NVMe we’d be looking at a 4-7.5x performance boost depending on workload.
GPUs are also important, as one step in the process requires them. However, this makes up only ~25% of the time, so a single GPU would be sufficient for us moving forward, but two GPUs would be an ideal outcome.

Furthermore, we’ve maxed out our PCIe lanes, so adding an NVMe PCIe card would require a new server.

CPU Notes:
Ideally we would have purchased one of the new Threadripper Pro CPUs, but they’re OEM-locked. Essentially we need a balance of fast and numerous cores (hence the TR Pro). 256GB of RAM would be sufficient, but expandability to 512GB could be helpful.

GPU Notes
Two GPUs would be ideal. We’re using the 6800 XT due to its 16GB of VRAM: our datasets use 11.6GB each, so the 6800 XT has comfortable headroom.

Storage Notes
We want the spinning disks to be used only for archival storage, which means being able to store everything on fast NAND of some sort. Having options to expand our NAND capacity would be important; 16TB should be sufficient for now, however as we scale we may need twice that, or more. Our budget is anywhere from 10k to 25k USD (full build), so we’re not talking Kioxia CD6 drives.



What we likely need is some sort of low-end enterprise hardware, but as this is our first need for it, I’m unfamiliar with the options out there.

I’m not sure what else to cover, it’s a bit of a specific use case. I’m secretly hoping @wendell might see this and think it could be a fun challenge.




Looking forward to hearing everyone’s advice!

Heh

I wonder if you considered going nuclear and using more than 1 server with ceph, for example… or just traditional single host servers with multiple mount points.

haha, I’m unfamiliar with the vernacular of server aficionados, so that didn’t make much sense. But if I understand correctly, you’re referring to something akin to a DAS bay which can be connected to multiple servers to do the grunt work?

Sounds easily solvable with a Threadripper or Epyc setup and a couple of 3.84TB or 7.68TB NVMe drives; last-gen NVMe drives are getting cheap as high-end enterprise moves to PCIe 4.0.


This

with the addition of a 16-port SAS HBA (Broadcom 9400-16i) and a 4x U.2 adapter (HighPoint SSD7120) gets you to the 20K mark, including 4x 8TB Intel NVMe drives, 12x 16TB Exos drives, a 32-core Threadripper and 256GB of RAM …

Given the budget, the requirement for performance, and the risk of hitting snags/incompatibilities/conflicts/configuration issues, I would add 5K to the budget to have someone who can actually guarantee the performance and troubleshoot it if it has issues …

More along the lines of disaggregating your storage over a network. So instead of having 1 large, screaming-fast array of disks, you have a couple of disks in each of, e.g., 5 machines (disks == SSDs == NVMe drives).

Then for processing, just process the data on multiple machines in parallel, instead of one.

Need more space … add more storage nodes, need more compute, add more compute nodes.
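To make that concrete, here’s a minimal sketch of sharding the job across nodes, assuming the 20,000 images can be processed independently; the hostnames are placeholders, and any job queue (or plain SSH + rsync) could play the same role.

```python
# Minimal sketch: shard 20,000 independent images across worker nodes.
# Node hostnames are placeholders for illustration only.

images = [f"img_{i:05d}.raw" for i in range(20_000)]
nodes = ["node01", "node02", "node03", "node04"]  # hypothetical workers

# round-robin assignment so each node gets an equal slice of the work
shards = {node: images[i::len(nodes)] for i, node in enumerate(nodes)}

for node, work in shards.items():
    print(f"{node}: {len(work)} images")  # each node processes its shard locally
```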

@SoManyBanelings, being unfamiliar with these devices (beyond knowing about them), would this be a matter of purchasing a U.2 PCIe card to make them work? Or are there some motherboards which have several U.2 connectors on them? We would want at least 4 connectors for future-proofing.


@risk Your approach definitely sounds like the best avenue to pursue. Do you know of any articles or knowledge bases I should look at to start learning more about this node-based approach?

Just spitballing, I would think you want a Threadripper/Epyc system. It doesn’t even have to be one of the latest PRO workstations - a previous (current) gen system will be fine. What you really want is the huge amount of bus bandwidth - up to 128 PCIe lanes with Threadripper/Epyc over the measly 16/20 of a consumer-level system.

I imagine Wendell might steer you in the direction of Optane disks for the initial storage, offloading to spinning rust when you have the bandwidth, but I don’t know. I think you’d need to do the math on throughput and your future needs vs. time before you intend to upgrade again (and whether say a $25K build will pay for itself in that time period).

Alternatives are Xeon/dual Xeon systems.

Regarding U.2, it can be confusing since there’s a load of connectors that can all go to U.2 drives. But it’s all basically PCIe (in the form of OCuLink/SlimSAS/SFF-8643 connectors, depending on the motherboard/PCIe 3.0 card) to the SFF-8639 connector on the drives.

Most Epyc boards will have direct connectors to go straight to NVMe, usually in the form of OCuLink or SlimSAS; then you need the right cables to go either to a 2.5" NVMe enclosure or directly to the drives.
On motherboards without the right connectors, you can just get a simple LinkReal LRNV93NF to go from a PCIe 3.0 x16 slot to 4x SFF-8643.
Since it looks like you are not building a rack server but a desktop, you could get something like the Icy Dock ToughArmor removable 2.5" NVMe enclosures.

Optane sounds too expensive for his budget/use case, where he needs 16TB+ of NVMe.

It’s a bit unclear what kind of CPU usage your workload has - does it max out all cores on your Ryzen? I guess not, since it’s hitting write queues on your current HDD setup.

I think we might want to entertain the idea of a rack server build, so feel free to recommend/advise based on that as an option. The reason being, the current main server is in a Meshify 2 XL but will be outgrown within 6-12 months, especially if we remove further bottlenecks.


tl;dr = from my tests, a full NVMe workflow would be ~77% faster than we’re currently able to achieve, but there are some workloads that use the CPU heavily.

Regarding CPU usage etc., our first stage in the process linearizes the images. This uses about 10GB of RAM per instance and pins all CPU cores at 100%, so this is where I see a high core count and RAM availability being used most effectively. The input files are ~70MB each and written files are ~350MB. Each image takes 14 seconds on the current server. Overall, this process takes 77 hours for 20,000 images and is around 45% of the processing time for the start-to-finish pipeline.
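As a sanity check on those numbers, and a very rough guess at what more cores could buy, a small sketch (the assumption of near-linear scaling with core count is mine and may well not hold for our linearization tool):

```python
# Sanity check on stage 1 (image linearization), plus a hedged scaling estimate.
# Per-image time and image count are measured; the near-linear core scaling
# assumption is not and should be verified on real hardware.

images = 20_000
secs_per_image = 14            # measured on the 12-core 5900X

current_hours = images * secs_per_image / 3600
print(f"current: {current_hours:.1f} h")    # ~77.8 h, matches the ~77 h figure

for cores in (32, 64):                      # hypothetical TR/Epyc core counts
    est = current_hours * 12 / cores        # assumes near-linear core scaling
    print(f"~{cores} cores: ~{est:.0f} h (if scaling holds)")
```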

The second stage involves combining the datasets; this is mostly GPU, RAM and HDD bound. Each dataset requires reading around 30GB of files into RAM for processing and takes around 8 minutes to complete - I’ve run some tests showing that a full NVMe workflow would reduce this to 5 minutes per dataset, which is congruent with NVMe vs HDD speeds across 30GB of data. This stage takes around 15% of the total time.
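For reference, a quick sketch with assumed throughput figures (not measured on our array) shows the 8 min to 5 min estimate is in the right ballpark for 30GB of input:

```python
# Rough check that the 8 min -> 5 min estimate for stage 2 lines up with
# I/O time on 30GB of input. The throughput figures are assumptions.

GB = 1e9
dataset_bytes = 30 * GB

hdd_read = 0.2e9     # ~200 MB/s effective off the spinning array (assumed)
nvme_read = 3e9      # ~3 GB/s off NVMe (assumed)

hdd_io_min = dataset_bytes / hdd_read / 60
nvme_io_min = dataset_bytes / nvme_read / 60

print(f"HDD I/O per dataset:  ~{hdd_io_min:.1f} min")   # ~2.5 min
print(f"NVMe I/O per dataset: ~{nvme_io_min:.1f} min")  # ~0.2 min
# Saving ~2-3 min of I/O per dataset is consistent with 8 min dropping to ~5 min.
```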

Finally, there is a compositing script which runs across the entire dataset. This is vastly bottlenecked by the HDD speed; it takes around 65 hours (40%), and from my tests it would be around 3-8x faster depending on the internal stage (let’s say it averages a 4x benefit).



So you can somewhat see that we have a mixed use case, but with storage being the primary bottleneck at this stage. Across the board, we need more PCIe lanes in order to have enough NVMe storage, which means an upgraded CPU to TR or Epyc anyway.
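Since the stages respond so differently to faster storage, here’s a tiny Amdahl’s-law-style model I can plug scenarios into. The stage hours come from my measurements above (the 26 h for stage 2 is derived from its ~15% share), while the per-stage speedup factors are rough placeholders to swap for real measurements, so the output is illustrative rather than a prediction.

```python
# Toy model: overall pipeline time if only some stages benefit from NVMe.
# Hours are from my measurements; speedup factors are placeholder guesses.

stages = {                 # (hours today, rough speedup factor on NVMe)
    "linearize (CPU-bound)": (77, 1.0),   # storage barely matters here
    "combine datasets":      (26, 8 / 5), # 8 min -> ~5 min per dataset
    "compositing":           (65, 4.0),   # averaged 3-8x estimate
}

now = sum(t for t, _ in stages.values())
projected = sum(t / s for t, s in stages.values())
print(f"today: ~{now:.0f} h   projected: ~{projected:.0f} h   "
      f"overall speedup: ~{now / projected:.2f}x")
# The CPU-bound stage caps the overall gain, whatever the storage does.
```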

Have you gotten a quote from HPE, Pure, Netapp, Supermicro, etc. on a storage solution for your needs?

Sounds like you would want to future-proof a little here and get a beefy Threadripper or Epyc, since your processing sounds highly parallel; after switching to NVMe you might start getting CPU-limited pretty fast.
Also sounds like a rack system will be the way to go unless noise is an issue. Check out some of Gigabyte’s rack servers, like the R182-Z93, which can be configured with dual 32-core Epyc Milan, 512GB RAM and 16TB NVMe, and is expandable to a lot more NVMe storage, for under 20k.

No, not yet. I think I’m stuck in the “build your own because it’s farrrr cheaper” mentality of the HEDT market, but I’m starting to get a sense that it’s probably not quite the case when it comes to rack mounts - or at least it’s not worth the extra hassle.


Thanks for the link, I just played around for 30 minutes with different configs. It adds up fast but it also looks like we can get what we need for less than 20k as you mentioned.

Thankfully we’re still a few months out from requiring the upgrade, so I’m starting the research early. Are there any main points I should investigate regarding the rack server approach? Even just different providers/brands to look into.

You could do worse than check out the products available at lenovo.com

When I was researching a Threadripper build I was impressed with Lenovo’s offerings.

Thanks everyone for the input so far.

The last 2 weeks got pretty busy, so I had to put this to the side. It’s also something that’s not able to happen for another few months, so we have time to research. As we do, however, I’ll update this thread as I would really appreciate your input.


This does sound like a really interesting project.

I have to say that my heart sank when I saw what you described as your current server… a consumer CPU and motherboard, no mention of ECC RAM… consumer SSDs being used in a server role. Ouch. This sort of stuff is fine for a home lab, but a disaster waiting to happen when your business depends on it.

With GPU processing being a requirement, you do get a bit stiffed with current enterprise hardware, unfortunately; servers with capacity for more than 1 or 2 dual-slot GPUs are fairly rare/expensive.

We’ve bought a couple of new servers here recently and went for Dell R7515 units with 3rd-gen Epyc, which might be worth a look - I’m very much a roll-your-own guy when it comes to servers and had been looking at Supermicro, Gigabyte, ASRock Rack etc. options, but nothing came close to the value we could get there (and they can hold a dual-slot full-length GPU, and up to 12x 3.5" drives).

The key thing is that if you’re buying dell/enterprise kit you need to realise what they will screw you on… specifically, the drives and RAM. So you need to buy that machine as close to base spec as possible, then buy your own RAM and your own drives (and rails for the drives…)

Given that you’ve mentioned drives are a bottleneck for you, it’s worth pointing out that more, smaller hard drives may be the answer.

Suppose you are running on 6x 12TB drives right now… if you replace that with 24x 3TB drives, then you will get near enough 4x the IOPS out of those drives.
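As a rough illustration (the per-drive IOPS figure is an assumption, not a spec sheet number):

```python
# Why more, smaller spindles help random I/O: aggregate IOPS scales with
# drive count. ~175 random IOPS per 7200rpm drive is an assumed figure.

iops_per_drive = 175                      # assumed, 7200rpm SAS/SATA

for drives, size_tb in [(6, 12), (24, 3)]:
    total_iops = drives * iops_per_drive
    print(f"{drives} x {size_tb}TB: ~{total_iops} IOPS aggregate")
# Same 72TB raw either way, but 24 spindles gives ~4x the random IOPS of 6.
```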

3TB SAS drives are also really cheap second-hand, as lots of enterprises are upgrading to more capacity/moving to SSD - we only use big drives on our backup server.

24 bay hot swap server cases are pretty affordable if you are happy to roll your own:
https://www.servercase.co.uk/shop/server-cases/rackmount/4u-chassis/4u-server-case-w-24x-35-hot-swappable-satasas-drive-bays-12gbs-hd-minisas-sc-4824/

I’d also have concerns about using Unraid for what you are doing; again, it’s good for a homelab, but I’m not sure I’d trust a business with it.

I’d echo what’s been said above: Ceph may be in your future. It’s cluster storage; the advantage is that it’s really easy to scale up as your needs increase - you don’t need to upgrade your server, you just buy another one and add it to your cluster. It’s also really easy to use with Proxmox, which we use for our VMs - I’d suggest trying it with a couple of desktops or VMs, as it’s really powerful stuff.


Thanks for the input Ruklaw.

To be clear, we’re not in need of more than 2 full-size GPUs in a server. In fact, 1 is sufficient, though 2 would be a smart goal if possible - to future-proof a little.

I’ve begun researching the benefits of TrueNAS for our use. TrueNAS SCALE is probably more appropriate for us vs CORE, even though SCALE is very fresh and a little slower for now. ZFS and more numerous, smaller-capacity drives make sense too. I have a little more investigating to do, however.

Regarding Ceph, Proxmox and other solutions, unfortunately I’m not quite technical enough to be comfortable with those and I don’t want to deploy something that requires a tonne of my time to maintain and support simply because I don’t have the foundational knowledge.

So, current roadmap points to a TrueNAS migration, using smaller drives in ZFS. However, a bit more research is needed for sure.
