Hi there, I guess if there’s a place to find useful advice on the subject, it would be here.
So, we’re planning on getting a pretty powerful GPU compute server for deep learning etc. The compute side is more or less set by now (EPYC 7702 + as many NVIDIA A100s as we can afford), but the filesystem side of things seems a bit more arcane to me.
So far we’ve mainly been prototyping with consumer GPUs (up to a 2080 Ti), and I only recently bumped into the limitations of the storage: when dealing with image data, the loader fetches batch_size × ~1MB files (with batch_size being 32-96, depending on the GPU), which already takes hundreds of milliseconds even on an SSD (granted, the slower kind) and thus becomes a bottleneck for training.
So, for the A100s with their 40-48GB of GPU memory the batch sizes could go even higher, which makes me think about how to organize the filesystem.
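For context, this is roughly how I’ve been confirming that the loader (not the GPU) is the slow part: a minimal timing sketch, assuming a PyTorch ImageFolder-style dataset. The ./train path and the batch size are just placeholders for our setup.

```python
import time

import torch
from torchvision import datasets, transforms

# Hypothetical dataset location: class subfolders full of ~1MB image files.
tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
ds = datasets.ImageFolder("./train", transform=tfm)
loader = torch.utils.data.DataLoader(
    ds, batch_size=96, shuffle=True, num_workers=4, pin_memory=True
)

t0 = time.perf_counter()
for i, (images, labels) in enumerate(loader):
    fetch_ms = (time.perf_counter() - t0) * 1000
    print(f"batch {i}: {fetch_ms:.0f} ms to fetch {images.shape[0]} images")
    if i == 20:  # a handful of batches is enough to see the trend
        break
    t0 = time.perf_counter()
```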
The server we’re looking at has 8x 2.5" hot-swap drive bays, 4 of which can take NVMe. Also, apart from the full-length PCIe slots that will be populated by the A100s, there are still 2x PCIe 4.0 x16 and 1x PCIe 4.0 x8 half-height, half-length (HHHL) slots available.
I guess we’d be OK with 15-30TB of usable space, and pretty large drives are being sold, so we don’t need to get like 8-12 drives or anything…
This makes me think of ~3 possible scenarios for the file storage:
1. populate the drive bays with SAS SSDs
2. populate the drive bays with NVMe U.2 SSDs
3. use the HHHL PCIe slots for NVMe M.2 adapters
Now, if I want to maximize read speeds (and the training does shuffle the load order, so random read I/O seems important), which of the options are “good enough”? (I.e. the storage only serves the GPUs on this server; the CPU won’t be at 100% at the same time.)
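To make “good enough” concrete, this is the kind of microbenchmark I’d run on a candidate drive: a rough sketch that reads ~1MB files in shuffled order, mimicking the training access pattern (fio would be the more rigorous tool; the /mnt/candidate_drive/images path is a placeholder for wherever the drive is mounted).

```python
import random
import time
from pathlib import Path

# Hypothetical directory of ~1MB image files on the drive under test.
# Drop the page cache first (e.g. `echo 3 > /proc/sys/vm/drop_caches` as root),
# or this measures RAM speed rather than drive speed.
files = list(Path("/mnt/candidate_drive/images").glob("*.jpg"))
random.shuffle(files)  # mimic the shuffled load order of training

n = min(1000, len(files))
total_bytes = 0
t0 = time.perf_counter()
for f in files[:n]:
    total_bytes += len(f.read_bytes())
elapsed = time.perf_counter() - t0
print(f"{n} files, {total_bytes / 1e6:.0f} MB in {elapsed:.2f} s "
      f"-> {total_bytes / 1e6 / elapsed:.0f} MB/s shuffled reads")
```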
Are NVMe SSDs really that much faster than SAS ones? From the specs the $/TB seems comparable, but the NVMe ones go up to 7GB/s reads while SAS is ~2GB/s. Are these specs realistic, or is that “theoretical performance” or some marketing BS?
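Just to sanity-check the spec-sheet math against our access pattern, assuming for a moment that those sequential numbers were actually achievable (which they presumably aren’t for shuffled 1MB reads):

```python
# Back-of-envelope: time to fetch one training batch at spec-sheet read speeds.
batch_mb = 96 * 1.0  # 96 files at ~1MB each

for name, gb_per_s in [("SAS SSD (~2 GB/s)", 2.0), ("NVMe gen4 (~7 GB/s)", 7.0)]:
    ms = batch_mb / (gb_per_s * 1000) * 1000
    print(f"{name}: {ms:.0f} ms per {batch_mb:.0f} MB batch")

# SAS: ~48 ms, NVMe: ~14 ms per batch at sequential speed. Both would be fine,
# but shuffled 1MB reads usually land well below these headline numbers.
```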
For 1.-2., would I need a controller/HBA? Does it need hardware RAID? Should it be PCIe gen 4, or is 3.0/3.1 enough? (Most of them use x8 lanes.) If an NVMe drive takes 4 lanes, does a controller need x16 lanes to optimally use 4 drives?
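The lane arithmetic I keep going back and forth on, using the usual per-lane figures of roughly 0.985 GB/s for PCIe 3.0 and 1.97 GB/s for PCIe 4.0 (after encoding overhead):

```python
# Approximate usable PCIe bandwidth per lane, GB/s (after encoding overhead).
per_lane = {"PCIe 3.0": 0.985, "PCIe 4.0": 1.969}

drives, drive_gb_per_s = 4, 7.0
need = drives * drive_gb_per_s  # aggregate bandwidth 4 fast NVMe drives could deliver

for gen, lane_bw in per_lane.items():
    for lanes in (8, 16):
        link = lanes * lane_bw
        verdict = "enough" if link >= need else "bottleneck"
        print(f"{gen} x{lanes}: {link:5.1f} GB/s uplink ({verdict} for {need:.0f} GB/s)")
```

So by that math only a gen4 x16 uplink avoids oversubscription with 4 fast drives, if I’m reading it right.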
For 3., would software RAID be needed?
I’ve tried googling around, but I’m probably not using the right terms and the info out there is pretty muddled, so I’d really appreciate some advice or basic info I could start with.