Storage organization for a GPU compute server

Hi there, I guess if there’s a place I could find useful advice on the subject, it would be here :slight_smile:

So, we’re planning on getting a pretty powerful GPU compute server for Deep Learning etc., and while the compute side is more or less set by now (EPYC 7702 + as many NVIDIA A100s as we can afford), the filesystem side of things seems a bit more arcane to me :confused:

So far we’ve been mainly prototyping with consumer GPUs (up to a 2080 Ti), and I only recently bumped into the limitations of the storage: when dealing with image data, the loader fetches [batch_size] ~1 MB files (with batch_size being 32-96, depending on the GPU), which already takes hundreds of milliseconds even on an SSD (granted, the slower kind) and thus becomes a bottleneck for the training.
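For context, a minimal sketch of how that fetch step can be overlapped with compute, by submitting each batch's file reads to a thread pool and prefetching the next batch while the current one trains (the function and parameter names here are made up for illustration, not from any particular framework):

```python
from concurrent.futures import ThreadPoolExecutor

def load_file(path):
    """Read one sample file fully into memory."""
    with open(path, "rb") as f:
        return f.read()

def batches(paths, batch_size):
    """Split the file list into consecutive batches."""
    for i in range(0, len(paths), batch_size):
        yield paths[i:i + batch_size]

def prefetching_loader(paths, batch_size=32, workers=8):
    """Yield batches of file contents. Each batch's reads run in
    parallel, and the *next* batch is already being fetched while
    the caller is busy with the current one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = None
        for batch in batches(paths, batch_size):
            futures = [pool.submit(load_file, p) for p in batch]
            if pending is not None:
                yield [f.result() for f in pending]
            pending = futures
        if pending is not None:
            yield [f.result() for f in pending]

# Usage sketch:
# for batch in prefetching_loader(file_list, batch_size=64):
#     train_step(batch)
```

On a single SATA SSD this mainly hides latency rather than adding bandwidth, but with NVMe drives (which need queue depth to reach their rated numbers) the parallel reads matter.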

So, for the A100s with 40 GB of GPU memory the batch sizes could go even higher, which makes me think about the organization of the filesystem.
The server we’re looking at has 8x 2.5″ hot-swap drive bays, 4 of which could be NVMe. Also, apart from the full-length PCIe slots that will be populated by the A100s, there are still 2x PCIe 4.0 x16 and 1x PCIe 4.0 x8 half-height half-length slots available.

I guess we would be OK with having 15-30 TB of usable space, and there are pretty large drives being sold, so we don’t need to get like 8-12 drives or anything…
This makes me think of ~3 possible scenarios for the file storage:

  1. populate the drive bays with SAS SSDs
  2. populate the drive bays with NVMe U.2 SSDs
  3. use the HHHL PCIe slots for NVMe M.2 adapters

Now, if I want to maximize the read speeds (and the training does shuffle the load order, so random read I/O seems important), which of the options are “good enough”? (I.e. only using the filesystem for the GPUs on the server; the CPU will not be 100%-ing at the same time.)
Are NVMe SSDs really so much faster than SAS ones? From the specs, the $/TB seems comparable, but the NVMe ones go up to 7 GB/s reads while SAS is ~2 GB/s. Are those specs realistic, or is that just “theoretical performance” or some marketing BS?
For 1.-2., would I need a controller? Does it need hardware RAID? Should it be PCIe gen 4, or is 3.1/3.0 enough? (Most of them use x8 lanes.) If an NVMe drive takes 4 lanes, does the controller need x16 lanes to optimally use 4 drives?
For 3., is software RAID needed?

I’ve tried googling around, but I’m probably not using the right terms and the info is pretty opaque, so I’d really appreciate some advice/basic info I could start with.

Decisions on these matters usually get made by budget. What is your budget?

Yes, NVMe is faster, especially in a proper RAID/ZFS config; the more, the merrier. I envision a hybrid solution with RAID 0 for the working data and then larger-capacity hard drives for the longer-term storage tier. This way you could move data to the appropriate tier and not go broke in the process on all-NVMe. SAS SSDs are faster than hard drives but not as fast as NVMe, and then there’s the $/TB (or $/GB) vs. NVMe; you have to weigh your options.

I’d probably look into using ZFS with NVMe NAND SSDs or even Optane. SAS drives are not that fast (relatively) in terms of both throughput and latency.

Regular Gen 4 NVMe SSDs are pretty fast, and their random IOPS could be good enough for your use case. However, latency suffers under heavy workloads. If that becomes a problem, consider Optane drives instead.

Using them as metadata drives (in a “special” VDEV) can dramatically help with random reads if you tune the recordsize correctly.
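For illustration, a rough sketch of what such a pool could look like (device names are placeholders; the vdev layout, recordsize, and small-block cutoff would need tuning for the actual dataset):

```shell
# Hypothetical device names -- substitute your own (see /dev/disk/by-id).
# Four NVMe drives striped across two mirrors, plus a mirrored Optane
# "special" vdev that holds metadata (and optionally small blocks).
zpool create tank \
    mirror /dev/nvme0n1 /dev/nvme1n1 \
    mirror /dev/nvme2n1 /dev/nvme3n1 \
    special mirror /dev/nvme4n1 /dev/nvme5n1

# Match recordsize to the ~1 MB files the loader reads:
zfs set recordsize=1M tank

# Optionally send blocks up to 64K to the special vdev as well:
zfs set special_small_blocks=64K tank
```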

This video might help:

Thanks for the responses!

Yes, looks like NVMe would be the way to go…

We’re doing mostly research, i.e. hyperparameter search, modeling and optimization; no production or inference intended. I’m hoping the datasets will not be larger than a few terabytes, and on the off chance they are, that would probably be the time to set up a separate fileserver anyway and upgrade the compute servers accordingly. So, for now, I’m leaning towards just using “local” storage.

The budget is not yet finalized; we’re probably aiming for ~$100k, split between at least two compute servers. One of our early ideas was to have them just do the compute and store everything on a third fileserver, but when accounting for the networking overhead it seems less productive than just shoving ~30 TB of SSDs into each server.

Now I’m still not fully clear on which approach is “the current meta”, so to say: U.2 drives, HHHL NVMe cards, or PCIe adapters for M.2s? The drives all seem pretty comparable in price, somewhere between $250-$300/TB for the 1.9-15 TB SSDs. I guess the server layout kind of points to getting 4x U.2s (Samsung PM1733 MZWLJ7T6HALA) and an HBA, something like the Broadcom HBA 9500-8i?

First question is what environment do you have? You call it a compute server; do you have a pre-existing Slurm cluster or similar? Do you have power and cooling in a rack somewhere that supports the 200-240 V power required for these chassis? If you can power one, are you sure you can power more than one?

Is your intent to have single jobs spanning the two+ servers, or one job per server? If you want spanning jobs you need InfiniBand or RoCE, and that’s very expensive, unless you buy old FDR hardware off eBay, which is questionable at best.

Why do you want EPYC 7702 processors? Are you sure your deep learning workloads benefit from having that many cores? It might actually be faster to use something like a 7542 frequency-optimized part, or even fewer cores and more speed, depending on how much CPU computing is actually done in a given job.

Second question is what chassis do you want this GPU server in? Is this a DIY project in a barebones, or are you using an OEM chassis that comes with GPUs/CPUs etc. preinstalled? The easiest way to go would be to buy something prebuilt from a configurator like Thinkmate, e.g. a Gigabyte G482-Z50. A chassis like that would also allow intelligent storage tiering if that’s something you’re really looking for (GPU VRAM < RAM < NVMe SSDs < HDDs < LAN storage server). Also, since EPYC has so many PCIe lanes, depending on the chassis you shouldn’t need an HBA to attach the U.2 drives; they might be pre-wired by the OEM, so if you plug a U.2 in it’s already presented to the OS.

NVMe SSDs are much faster than SAS SSDs. You could do something like a HighPoint SSD7540 (a lot like the Liqid Honey Badger) with 8x 4 TB Sabrent Rocket drives and have 32 TB at ~24 GB/s of bandwidth and maybe 2-4M IOPS for ~$9k. I don’t know offhand what the best FS to run on that would be, but I’d set up RAID 0 on the HighPoint’s hardware RAID controller and then run ext4 or XFS, not ZFS, on a scratch disk. I’m happy to be proven wrong if someone shows performance numbers for a crazy ZFS arrangement with special classes, a log device and the disk pool, but based on my experience, less complexity is probably better here.
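A minimal sketch of that scratch setup, assuming the card exposes the hardware RAID 0 set as a single block device (the device name here is hypothetical; check `lsblk` for the real one):

```shell
# Hypothetical device name for the volume the RAID card exposes.
mkfs.xfs -f /dev/hpt0

# Mount as a scratch area; noatime avoids metadata writes on every
# read, which matters for a read-heavy training workload.
mkdir -p /scratch
mount -o noatime,largeio /dev/hpt0 /scratch
```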

In terms of improving utilization: is this for a company? A single user? A small team of researchers? This affects how you deploy. Setting up a Slurm cluster for something like that would be a good idea; that way you can queue jobs, so Jim and Jake can each submit a job, and even if Jim’s takes 12 hours and finishes at 1 AM, Jake’s job starts running right away. You can also run Jim’s and Jake’s jobs at the same time if you have 6 GPUs and each job only needs 3.
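For example, a minimal Slurm batch script for that shared-GPU case (job name, script, and paths are made up for illustration):

```shell
#!/bin/bash
#SBATCH --job-name=train-jim
#SBATCH --gres=gpu:3          # request 3 of the node's GPUs
#SBATCH --cpus-per-task=16
#SBATCH --time=12:00:00

# Jake can submit an identical script at the same time; on a node
# with 6 GPUs, Slurm runs both jobs concurrently, 3 GPUs each.
srun python train.py --data /scratch/dataset
```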


This is intended for our research lab at the university, so there’s some server infrastructure in place; power definitely should not be a problem.

The project tasks will also be mainly research, as I said, so it’s yet unclear if we’d need to run one job across both servers; it would be interesting to try for sure, if not to improve performance then at least to get an idea of how to parallelize the code. That is also why we picked the CPU with 64 cores: it’s yet unknown which tasks we will need to run, and maybe the server can be used for other general-purpose compute for the group. Some reports suggest core count is preferable for AI tasks: [image: chart of relative training performance across CPUs at different clock speeds]

We are kind of forced to use an OEM; currently the best option we found was a D13z-M2-ZR-4xGPU from Delta Computer, but I’m open to suggestions (if they operate in Germany :smiley: ). I’ll probably have to ask the OEM rep whether there are U.2 connectors on the motherboard itself; the description is kind of vague on that.

Good that power and cooling are available. The need for a spanning job is definitely something to consider up front. As a cheap option, there are improvised ways to network two servers via InfiniBand with no switch that others have tried, with varying levels of success.

I think the results in that image perhaps lack the clarity a more robust study would show. The CPUs there have only 4 and 6 cores respectively, and only support AVX, not AVX2, so the loss of a few GHz worth of clock cycles is very noticeable to the GPUs, which need to be fed data by the CPUs. The actual elapsed training time is also omitted; the faster processor may still have a lower overall time to train, and a graph of relative performance percentages tells us nothing about absolute performance. E.g., the faster processor may finish 10% sooner to start with, and still be faster at the end even when underclocked. Just something to investigate for your specific workloads; I’m not an expert in machine learning, just CPU-based HPC. Higher clocks can be very beneficial in some cases.

I understand the requirement to go with an OEM. Definitely worth checking out your options and asking for specifics about the server; typically those bays would be wired directly to SlimSAS or OCuLink ports on the motherboard. What you’re looking at there is an ASUS ESC4000A-E10 server. I would recommend looking at the HighPoint cards aftermarket, as they’re cheaper than the U.2s. If they carry newer ASUS servers, the RS720A-E11-RS24U is one to check out as well; it’s got 2 CPUs, so if CPU cores matter to your general-purpose workloads, that could be great too. Good luck finding the right fit!

