Best 300TB storage for $100K challenge

Hi, I work for a research hospital. I would appreciate any comments or advice on the hardware/software design for a storage solution. We have a few microscopes (running Windows 10), each of which can produce 3-4 GB/s of image data that needs to be processed by a small number of Linux servers (say 5). It is possible that all the microscopes write to the storage at the same time. All the Windows workstations and Linux servers are on InfiniBand HDR. The file sizes are 64 KB and larger. All the machines have PCIe v4. The data is not clinical or related to patients.

Here is a rough list of requirements:
- lowest-latency FS
- high performance at low queue depth
- high small-file performance (around 64 KB); we don't expect lots of 4 KB random R/W
- near-identical performance on Windows 10 (not Windows Server) and Linux, all on the IB fabric
- easy to add more drives (and servers, if that can't be avoided) without losing performance or drastically changing the FS
- performance and capacity are more important than resilience, but we don't want to deal with failures every couple of months

Should I consider other factors?

I'd say that's the wrong approach. I'm sure performance and capacity are important to you, but for your patients, who are essentially all that matters, security is top of the list by a country mile! Time and time again criminals get into these sensitive systems to hold very personal data hostage for ransom, and it needs to stop! One would be a fool to expect criminals to stop hacking their way into these sensitive systems, so preventing access by those criminals should be your top priority. You CANNOT "take some risk"! :rage:

Of course, severely cut-down corporate budgets blah blah blah, financial restraints blah blah blah, meeting legal obligations blah blah blah, preventive IT mitigations blah blah blah, it's all cr@p your patients do not want to hear if their life-saving data is locked behind a paywall by criminals. :skull_and_crossbones: :poop:

Just me tuppence, hoping you get your priorities right this time.

4 Likes

Sorry, I should have clarified that the data is not clinical. The microscopes are used for basic biology research only. What I meant by security was fault tolerance; data security is handled by the firewall.

2 Likes

Thx for clarifying that. :slight_smile:

Anyway, welcome (even if it’s a bit rough, sorry for that :woozy_face: )

2 Likes

Would mixed storage be an option? E.g. 20-50 TB of flash and the rest spinning rust? The budget is "easy" in that case.

For all-flash, the budget is a bit harder. Mid-tier, enterprise-ish flash is probably going to be ~$50k retail just for the capacity by itself, with no other costs counted.

I am thinking 2-node replication is an option w/ rust, but a single chassis with 15 TB drives in one 2U is about the only economical all-flash configuration that gives great perf all the way around.

Wendell

3 Likes

Completely unrelated to the topic (not my forte), but the title does sound like something out of a LinusTechTips video :stuck_out_tongue:

1 Like

Wendell, thank you for the suggestions; a single 2U with 15 TB drives is a viable option. When all the microscopes are working at the same time and the data is being processed, the total write rate can reach 40 GB/s. An overnight experiment can fill up 20 TB.

What file system would you suggest? ZFS? Would it work equally well on Linux and Windows?

1 Like

I feel like some units are being mixed up here…

Writing at 40 Gbps: [image of the calculation]

At 40 GB/s: [image of the calculation]

Neither of them is an overnight thing…
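
For context, a quick back-of-the-envelope sketch of those fill times, assuming the ~20 TB overnight figure from the post above (decimal units throughout):

```python
# Rough time-to-fill estimate for a ~20 TB overnight dataset.
# Decimal units: 1 TB = 1e12 bytes, 1 GB = 1e9 bytes.
dataset_tb = 20

def hours_to_write(rate_gb_per_s):
    return dataset_tb * 1e12 / (rate_gb_per_s * 1e9) / 3600

print(f"at 40 Gbps (~5 GB/s): {hours_to_write(5):.1f} h")   # ~1.1 h
print(f"at 40 GB/s:           {hours_to_write(40):.2f} h")  # ~0.14 h, i.e. ~8 min
```

Either way the dataset is written in far less than a night, which is the point being made here.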

1 Like

Sorry about the confusion. 40 GB/s can happen for very short periods of time. Most of the overnight experiments run at lower rates. The microscopes take a few images a minute; too much exposure to laser light damages the cells.

Sounds like this would be decent for a collab vid @wendell

Surely an OEM or two would like to help a hospital for the PR alone :thinking:

2 Likes

I wonder if a 1000-node Ceph cluster made out of Raspberry Pi CM4s + NVMe drives would do the job.

It'd be fun to try. I wonder if we could get @wendell and Jeff @geerlingguy to chat with Eben… and just give them all lots of candy to see how quickly they could stand up a small storage cell.

(edit: my gut feeling is that this would cost half the budget and be severe overkill in terms of aggregate throughput… but I'm guessing tail latency and manageability might make it impractical for hospital use. After all, per node, this workload is only ~40 MB/s, i.e. ~320 Mbps or ~625 64 KB read/write ops/s; see the sketch below. It'd be the biggest/baddest Raspberry Pi cluster in the world.)

(edit2: it might involve building a small fabric due to the sheer number of network ports needed)
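
For what it's worth, a minimal sketch of that per-node arithmetic, assuming the 40 GB/s aggregate is spread evenly over 1000 nodes:

```python
# Per-node share of a 40 GB/s aggregate spread evenly over 1000 storage
# nodes (the hypothetical Raspberry Pi CM4 + NVMe cluster above).
aggregate_gb_per_s = 40
nodes = 1000
io_size_kb = 64

per_node_mb_s = aggregate_gb_per_s * 1000 / nodes        # 40 MB/s
per_node_mbit_s = per_node_mb_s * 8                      # 320 Mbps
per_node_iops = per_node_mb_s * 1000 / io_size_kb        # ~625 x 64 KB ops/s

print(per_node_mb_s, per_node_mbit_s, round(per_node_iops))
```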

1 Like

Ceph would be my first choice here; adding servers gives you more performance and storage space. You already have multiple clients, which will help balance out the load too. However, every server you buy is money that isn't going to storage, which might be an issue with that budget.

So I'm an electron microscopy facility manager at an R1 university and have designed and implemented a number of storage and data movement solutions, mostly for cryo-EM.

For $100k you should easily be able to hit 300 TB of storage. In fact, you can probably hit about a petabyte for that budget. At the moment I can't remember the exact chassis, but Supermicro sells a 4U with 36 bays. We have a single zpool using RAIDZ2 (the ZFS equivalent of RAID 6) on a system with three 12-drive vdevs, each with 16 TB drives, for about 480 TB. We started off with just a single vdev of 12 disks, but you have to choose between scale-up vs. scale-out at the beginning, and scale-out is complex and expensive, especially at first.
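
For reference, a minimal sketch of that capacity math, assuming RAIDZ2's two parity drives per vdev (actual usable space will be a bit lower once ZFS metadata and padding are accounted for):

```python
# Capacity estimate for the pool described above: three 12-wide RAIDZ2
# vdevs of 16 TB drives, i.e. two parity drives per vdev.
vdevs = 3
drives_per_vdev = 12
parity_per_vdev = 2          # RAIDZ2
drive_tb = 16

raw_tb = vdevs * drives_per_vdev * drive_tb                          # 576 TB
usable_tb = vdevs * (drives_per_vdev - parity_per_vdev) * drive_tb   # ~480 TB
print(raw_tb, usable_tb)
```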

If you must have scale-out, going with something like Ceph is a pain in the ass and probably over-complex for your data needs, but something like MooseFS is simpler, and you can get off the ground faster while still scaling.

I know MooseFS is in use at Purdue in their structural biology department, which is one of the premier ones in the world and generates an ungodly amount of data, and they are pretty happy with it. When I talked to them, they said they were just adding a new off-the-shelf Supermicro server every year or so and it was keeping up.

We're using the free TrueNAS but are hopefully changing that to TrueNAS SCALE pretty soon. We have a 2x Epyc system with HHHL SSDs operating in caching mode, plus quite a bit of actual RAM cache (2x 4 TB SSDs + 128 GB RAM). Those caches keep up pretty well for writing (bursting up to 600 MB/s, which is faster than we can write data) but less so for reading; they tend to slow down. That's usually not a problem, though, because we normally aren't reading whole datasets sequentially, but rather read an image, process, read the next image, process, etc. If you're reading from multiple clients simultaneously (and I'd really encourage you to know your workload like the back of your hand before making decisions), you'd probably want to go mostly/all flash like @wendell said. Even high-performance spinning rust is going to bottleneck. If your software has local caching (for instance, for my workload, we read the images in, extract particles, cache those on the local machine for operations, and then eventually write out a handful of final structures/files that are much smaller), you can probably get away with spinning rust if you are at 12+ drives.

With our cameras, we are writing out a 5-10 GB image every… 30 seconds or so? We average about 100 images/hr during data collections, and a weekend data collection hits 20 TB+ pretty easily. We are using Globus timers to move the data every 5-10 minutes off the camera PC to the central storage, and it's all over 10 gigabit Ethernet without much trouble. That helps alleviate the load, because data collection takes longer than transfer, so transfer speed never becomes the bottleneck.
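
As a rough sanity check, assuming the midpoint of the 5-10 GB image size quoted above:

```python
# Sanity check of the camera's sustained output vs. the 10 GbE transfer link.
image_gb = (5 + 10) / 2            # 5-10 GB per image; use the midpoint
seconds_per_image = 30

collection_mb_s = image_gb * 1000 / seconds_per_image   # ~250 MB/s sustained
ten_gbe_mb_s = 10_000 / 8                                # ~1250 MB/s line rate

print(f"camera output  : ~{collection_mb_s:.0f} MB/s")
print(f"10 GbE capacity: ~{ten_gbe_mb_s:.0f} MB/s")
# The link comfortably outruns the camera, so periodic Globus transfers
# keep up without becoming the bottleneck.
```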

You’re probably going to want tapes as well. It’s just not practical for a person to manage that much data and keep it available. Depending on the type of research and who is funding it, you can look into OURRstore, as you may be eligible to use it. They have a number of options but I believe it’s just a few dollars/TB for tape backups and you just upload the files and they handle the rest for you. But it needs to be NSF/NIH related, I think.

Just some thoughts, feel free to message me if you want any more specific info. Microscopy and data management are both pretty close to my heart.

Usually there are different things happening, and recording is interspersed between them: moving the stage, focusing, taking trial shots, etc. One of our cameras is 96 megapixels and can write 1500 FPS at 8-bit depth, but we still only average 100 images/hr even fully automated, and those images are usually 50-100 frames each. So it could write 144 GB/s, but it rarely does.
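
A quick sketch of where that 144 GB/s peak comes from, assuming one byte per pixel at 8-bit depth:

```python
# Theoretical peak data rate for a 96 MP camera at 1500 fps, 8-bit depth.
megapixels = 96
fps = 1500
bytes_per_pixel = 1                # 8-bit depth

peak_gb_s = megapixels * 1e6 * bytes_per_pixel * fps / 1e9
print(f"theoretical peak: {peak_gb_s:.0f} GB/s")   # 144 GB/s
# In practice collection averages ~100 images/hr of 50-100 frames each,
# so the sustained rate is a tiny fraction of that peak.
```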

10 Likes

Thanks @COGlory. Lots of good suggestions. We are fortunate to have institutional storage that can be used for offloading the data at about 200 MB/s-1 GB/s; the speed depends on other activity across campus.
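
For a rough sense of that offload window, assuming the ~20 TB overnight figure from earlier in the thread:

```python
# Time to offload an overnight run (~20 TB) at the quoted institutional
# storage speeds.
dataset_tb = 20

for rate_mb_s in (200, 1000):
    hours = dataset_tb * 1e6 / rate_mb_s / 3600
    print(f"at {rate_mb_s:>4} MB/s: ~{hours:.1f} h")
# ~27.8 h at 200 MB/s, ~5.6 h at 1 GB/s -- at the slow end, the offload may
# not finish before the next day's collection starts.
```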

This storage (300 TB) will also be used for loading the data (after it's processed) for visualization and analysis, so fast reads are critical.

Imagine the next generation of detectors pushing double these rates soon, maybe in 2023?

@wendell, in a video from 2021 where you talked about the beta version of TrueNAS SCALE, you mentioned that ZFS would be a limiting factor for moving tens of gigabytes per second because of memory requirements. How much memory would I need for this application? Is 4 TB enough?

Any updates on the 100G project that you had planned? Would you recommend TrueNAS SCALE for this application?

Wellll, not enough is known about your workload. Idk if ZFS is appropriate, as it has overhead.

I think the context of the 10s of GB was b/c main memory bandwidth is a bit of a bottleneck when we're talking about ~22ish 15 TB drives w/ ZFS.
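
For what it's worth, a very rough sketch of why memory bandwidth becomes the ceiling at these rates; the factor of three is purely an assumption about how many times each block crosses the memory bus, not a measured ZFS figure:

```python
# Very rough memory-traffic estimate for a 40 GB/s ingest through ZFS.
# The multiplier is an assumption, not a measurement: data is copied into
# the ARC, read back for checksumming/compression, then read again when
# flushed to the vdevs. Real behaviour depends on recordsize, compression,
# sync settings, etc.
ingest_gb_s = 40
assumed_memory_passes = 3

dram_traffic_gb_s = ingest_gb_s * assumed_memory_passes
print(f"~{dram_traffic_gb_s} GB/s of DRAM traffic")   # ~120 GB/s
# That is a sizeable slice of a single Epyc socket's practical memory
# bandwidth, which is the kind of ceiling being described above.
```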

I'm also not sure about your InfiniBand setup and support. If TrueNAS CORE supports it, I'd probably go that route over SCALE. I like the flexibility of SCALE, but CORE has better performance right now, esp. with iSCSI. In a pinch iSCSI will outperform SMB, but I think SMB Multichannel would do what you want performance-wise. Probably.

Don't assume the firewall keeps you secure, though. Your IT team can easily misconfigure things, leaving your hosts vulnerable. Alternatively, someone could sneak a Raspberry Pi or similar into the private network and attack from within.

1 Like

I might be crazy, but I actually think this is a good case for hardware RAID.

Honestly, a near-line NAS or enterprise setup with the system tuned for smaller chunk sizes, and then 1-2 expansion disk shelves.

Low management requirements, high bandwidth, and fewer software bottlenecks.

SAS is probably perfect for this workload.

@wendell, below are more details about the setup and workload. Let me know if you think ZFS is not a good match.

The InfiniBand fabric consists of a single HDR switch that connects all the Linux compute nodes (AMD EPYC), the Windows 10 workstations (either connected to microscopes or standalone for visualization), and the future storage server(s). All connections go through the switch.

As far as the workload pattern goes, tens of large images or lots of small images are created per minute by the microscopes, and those are read back in two steps: a) in near real time for processing (like deconvolution) by the Linux compute nodes, which produce some additional files that need to be written back; b) at a later time for visualization (Windows 10) or analysis (compute nodes).

The 40 GB/s write figure is for when all the microscopes are running and the compute/analysis nodes are writing result data back to the volume. Read speed is critical for processing/visualization/analysis; the higher the better.

Honestly, this sounds quite a bit like shared video/image editing, and very similar to what we install for large newsrooms in TV.