Fast scratch space for data analysis and scientific software development (2022 edition)

In a very similar fashion to this question (from mid-2021) I am looking for fast scratch space (in mid-2022).

I do a lot of data analysis, development of scientific software (e.g. simulations, data processing and analysis pipelines) as well as visualizations. The technical constraints are defined by whatever project lands on my desk … which can actually be fairly diverse. One common denominator is parallelization: either the workloads are already parallelized or it’s part of my job to parallelize them. There are usually parallel reads and writes of chunks of data involved.

For “local” development, I am using an AMD Epyc 7443P CPU on an ASRock ROMED8-2T mainboard.

Data safety is less of a concern - important data resides on a NAS that receives regular backups. I’d copy data onto and off the scratch space as needed. If it fails, I lose a day of computation perhaps. That’s ok-ish.

Purely as a fast general-purpose scratch space, I intend to go for something “simple”, if that’s the right term: 2 to 4 striped NVMe drives on a dedicated PCIe card, e.g. Samsung SSD 980 PRO 2TB drives on an ASRock Hyper Quad M.2 Card.

I’ve had fairly good experiences with a similar setup involving a 2nd-gen EPYC (also 24 cores) and two PCIe 3.0 NVMe drives from Intel. There was usually a lot of room for optimizing read/write performance by tweaking data chunk sizes (I work a lot with zarr and similar tools) as well as the configuration of the EXT4 file system “on top” and LVM or Linux RAID “underneath”. In other words, I re-formatted and benchmarked this arrangement a few times, depending on projects’ needs. A sketch of the chunk-size tweaking follows below.
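To illustrate, the chunk-size tweaking is nothing fancy - a minimal sketch, assuming zarr v2, no compression (so the disk is timed rather than the codec), and a hypothetical array shape and scratch mount point:

```python
# Minimal sketch: write the same array with different chunk sizes and time it.
# Array shape, chunk sizes and the /mnt/scratch path are assumptions.
import time

import numpy as np
import zarr

data = np.random.default_rng(0).random((8192, 8192))  # ~512 MiB of float64

for chunks in [(512, 512), (1024, 1024), (2048, 2048)]:
    store = zarr.DirectoryStore(f"/mnt/scratch/bench_{chunks[0]}.zarr")
    z = zarr.create(shape=data.shape, chunks=chunks, dtype=data.dtype,
                    compressor=None, store=store, overwrite=True)
    t0 = time.perf_counter()
    z[:] = data  # written chunk by chunk to the striped array
    dt = time.perf_counter() - t0
    print(f"chunks={chunks}: {data.nbytes / dt / 1e9:.2f} GB/s write")
```

At this size the page cache flatters the numbers, so for sustained-throughput comparisons I use data sets well beyond RAM (or drop caches between runs).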

The common wisdom appears to be to go for enterprise SSDs in a scenario like mine, or even Optane. The latter is crazy expensive at the moment, and besides, Intel has just cancelled Optane altogether, although there still appears to be a lot of stock. The arguments against consumer SSDs include that they wear out more quickly and that they are usually highly susceptible to power losses. I rarely have power losses, and if so I’d lose a day of work - ok, see above. As for the wear: if I manage to wear them down in two to three years, the market might have changed (as usual) … so it might not be the worst idea to just, well, wear them down intentionally and replace them with whatever tech comes next. Besides, the quoted TBW figures no longer differ that much between better consumer and enterprise SSDs, as far as I can see. We’re talking the same order of magnitude: for 2TB drives, for instance, they hover between 1.2 and 2.0 PBytes.

Any mistakes in my thought process or alternative points of view? What would you guys recommend?

IMO you’re pretty spot on. As data loss is less of a concern in your use case, perhaps go for a RAID10 NVMe solution. Scrap that, you’re already doing so :stuck_out_tongue:

To mitigate the remaining (small) chance of power loss, use some super-caps (I have a few 5V5/4F ones, not bigger than a single Oreo biscuit) and associated hardware (a decently spec’d resistor, MOSFETs as diodes) to create a highly localised backup for power. On a single 4F super-cap, an idle NVMe drive could (in theory!) last for a few days.

(link provided as example, no endorsement of this particular vendor as I haven’t ordered from them)

Lol thanks. For all that goes into this solder job (time, equipment, figuring out schematics, layouts, risk of killing the entire “not-so-cheap” computer…) I could buy literally a ton of Optane :wink:

In practice you’d only get a handful of seconds of run time off a 4F capacitor for a flash drive; it’d only ever charge up to the 3.3 V rail and then discharge to ~2.7 V before power-failure handling kicks in.

If the NVMe was already in L1.2 (the deepest sleep a PCIe device can go into, for all intents and purposes) it could theoretically last for ~26 minutes in this scenario (2.4 coulombs, 1.5 mA constant-current discharge), but in practice there is some cache flushing that takes place after a power loss, which uses up around a thousand times more energy than the L1.2 state does.
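For reference, the arithmetic behind that estimate as a worked calculation (the 1.5 mA L1.2 sleep current is the assumption here):

```python
# Worked version of the ~26 minute estimate above.
C = 4.0            # F, super-cap capacitance
dV = 3.3 - 2.7     # V, usable window on the 3.3 V rail before power-fail kicks in
I_sleep = 1.5e-3   # A, assumed constant-current draw in L1.2

Q = C * dV                # usable charge: 2.4 coulombs
t_min = Q / I_sleep / 60  # runtime in minutes at constant-current discharge
print(f"{Q:.1f} C at {I_sleep * 1e3:.1f} mA ≈ {t_min:.1f} min")  # ~26.7 min
```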

This thread derailed quickly. Anyway, my original question still stands.

hahaha it did.

How big’s your dataset? You’ve already ruled out a ramdisk, haven’t you?

It’s not one specific data set. It’s more like whatever my clients are throwing at me. For very large data sets I try to select a “representative subset” so I can work on it locally at first, if possible, before moving stuff to remote servers / data centers etc. Very often it’s in the 100GB to 2TB range. Besides, I also use scratch drives to “cache” results of intermediate steps, which is very helpful for testing, development and, in some cases, simplifying workflows. Either way, “the more capacity the better”, though within reason as far as price is concerned. I do not want to blow more than 1k Euro on this at the moment. My initially mentioned solution is 0.5k plus VAT for 4 TB. The two disks combined should theoretically allow 14 GB/s sequential reads and close to 4 GB/s sequential writes (assuming, from experience, that I will often work way beyond the limits of the SLC write cache).
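The estimate is just per-drive numbers times two - a quick sketch, where the ~2 GB/s post-cache write figure is my own ballpark from experience rather than a spec:

```python
# Back-of-the-envelope for the striped (RAID 0) figures above.
# 7 GB/s is the 980 PRO's rated sequential read; 2 GB/s sustained write once
# the SLC cache is exhausted is an assumption from experience, not a datasheet value.
n_drives = 2
read_per_drive_gbs = 7.0
post_cache_write_gbs = 2.0

print(f"striped read:  ~{n_drives * read_per_drive_gbs:.0f} GB/s")    # ~14 GB/s
print(f"striped write: ~{n_drives * post_cache_write_gbs:.0f} GB/s")  # ~4 GB/s
```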

Well, I need my RAM for the computations themselves - the scratch disk is literally an extension of that. I am currently running on 256GB of RAM (4 DIMMs) with a direct option of going to 512GB (another 4 DIMMs into the 4 still empty slots) for about 1k Euro plus VAT if needed. Any workload that involves rendering results into some kind of animation, which is what I also typically use for debugging and discussions, eats this amount of RAM for breakfast. The motherboard supports 2TB of RAM, but this would be crazy expensive (and actually require slower DIMMs than I currently use). So, long story short, it’s all a balancing act / compromise for having a flexible dev and “exploration” system at a good price point.

To make the most of your money, it’s also worth keeping in mind the many used enterprise SSDs (with true power-loss protection) available on eBay, often for under $100 per TB.

These have speed specifications that at first appear slower than consumer drives, but those figures describe what the drive can sustain all day at steady state rather than for a few seconds or a minute. The majority of them have more than 95% of their rated life remaining, with many being basically unused.

I’m not offhand familiar enough with what’s out there to suggest one at the moment, but I can dig up my notes if you’re interested in suggestions.

The absolute best price per TB is in large used U.2 drives that pop up if you’re patient, but it’s a lot more expensive to connect a bunch of those than it is for M.2.


One more question:
Is the scratch pad used mostly sequentially, or is there a lot of random read/write going on too?

Also, if your simulation is more memory-bound and scales well with cores, you might be able to pick up more performance by filling out all 8 memory channels with RAM.

Definitely, I’d be thankful. If it does not waste too much of your time, of course.

Yep, unfortunately. I have not found anything reasonable (yet), though I keep looking.

Again, it depends on what kind of project I am on. I am exploring, essentially, trying to answer this very question among others, as well as to optimize and/or characterize certain workloads, before I look for the right hardware (or recommend a direction to look into).

Yep, that’s why I have this board :wink: One upgrade at a time, unless someone specifically wants to pay for it.

If it turns out that the access patterns include a decent amount of random reads and writes, it might be worth considering a hardware RAID controller for the NVMe drives. You can typically get better performance by offloading the transfer requests to an accelerator, even if it is just RAID 0; you’d be surprised at how inefficient a general-purpose CPU is as a data shuffler.
The full-bore x16 PCIe 4.0 adapters like the 3258p or 9670W are both >1500 USD, so on the very pricey side, but the x8 PCIe 4.0 adapters are slightly more reasonable at <1000 USD.

Also, if the newer Optane drives go on some kind of fire sale because Intel wants to dump them (seems unlikely, but I can dream), they would be well suited to this application.

There is a really interesting article published by Puget Systems on this topic, one of the very few at this level of detail that I could actually find. It’s based on Windows, not Linux as in my case, but it is nonetheless absolutely worth a read. Bottom line: NVMe RAID 0 at the OS level is surprisingly good under most circumstances (acknowledging that you have to budget in CPU cycles for it when doing calculations at the same time).
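For completeness, this is roughly the shape of the OS-level RAID 0 + ext4 arrangement I keep re-running between projects - a minimal sketch only, where the device names, chunk size and mount point are assumptions (it destroys data on the listed devices and needs root):

```python
# Sketch: assemble a striped md array and format it with a stripe-aligned ext4.
# Device names, chunk size and mount point are assumptions - adjust to your system.
import subprocess

devices = ["/dev/nvme0n1", "/dev/nvme1n1"]  # assumed: the two NVMe drives on the Hyper card
chunk_kib = 512                             # md stripe chunk in KiB

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Striped array via Linux software RAID (mdadm)
run(["mdadm", "--create", "/dev/md0", "--level=0",
     f"--raid-devices={len(devices)}", f"--chunk={chunk_kib}", *devices])

# ext4 aligned to the stripe: stride = chunk / 4 KiB block size,
# stripe_width = stride * number of data disks
stride = chunk_kib // 4
run(["mkfs.ext4", "-E", f"stride={stride},stripe_width={stride * len(devices)}",
     "/dev/md0"])

run(["mount", "/dev/md0", "/mnt/scratch"])  # assumed mount point
```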

This would be the “more professional” solution, indeed. Buying out of Germany, I have not found anything in this direction that makes economic sense for me at the moment.

And then I still need the disks …

I mean, the GPU market just had a “little hiccup” … sometimes dreams come true. You’ll just never know when exactly.

As I understand it, the HighPoint SSD7101A-1 is just a PCIe switch and not an actual RAID controller, meaning it uses some kind of special-sauce software RAID implementation when benchmarked as “HighPoint RAID”. VROC and the “Windows RAID” are also software RAID.

This is the only comparison of hardware NVMe RAID versus software NVMe RAID I could find:
Linux mdadm software RAID vs Broadcom MegaRAID 9560

Looks like a 30-50% performance improvement, depending on queue depth, when going from software RAID to hardware RAID. I’d imagine the performance delta with one generation newer hardware would be even greater; RAID cards are improving at a faster rate than CPUs.

…true. FWIW I was able to pick up a MegaRAID 9560 for a little under 600 USD on everyone’s favorite auction site a while back.

I’ve got my eyes peeled for any P5800x dumps.

Hmm, this does not entirely make sense: HighPoint SSD7101A-1 datasheet. It’s an x16 device, i.e. it uses the full bandwidth of a PCIe x16 slot. I would not call it a switch, but maybe I am wrong. Besides, they explicitly refer to a RAID controller on their card.

You’re right, they do call it a RAID controller, but looking at the card’s PCB, the only major chip on it is a PLX PEX8747, which is a dead giveaway for a switch-based NVMe card (there are no RAM packages to support a RAID controller, nor a RAID controller itself). The only benefit of the switch-based NVMe cards is that they allow multiple NVMe drives to sit on a motherboard’s PCIe slot without PCIe bifurcation support in the BIOS.

This makes a ton of sense. Thanks a lot for the explanation.

