Hard drive suggestions for heavy I/O scratch

I’m planning to add a ZFS pool of HDDs to my workstation as scratch storage for scientific computing workloads. When compute jobs are running, it’ll probably see several TB written per hour. Any suggestions for a robust drive with decent capacity (4+ TB) that can handle this kind of abuse?

This is fine for most drives as long as it’s sequential writes. (3.6 TB/h == 1 GB/s … You’ll need at least 5-6 drives in a simple RAID 0 setup.)

You may also want to consider flash instead of HDDs, simply because it’s easier to get the throughput, unless you need a large volume of scratch space.
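For anyone who wants to poke at those numbers, here’s a quick back-of-the-envelope sketch of the 5-6 drive estimate above (the 200 MB/s per-drive figure is just an assumption, not the spec of any particular drive):

```python
# Back-of-the-envelope version of the 5-6 drive estimate above.
tb_per_hour = 3.6                       # target scratch write rate from the thread
gb_per_sec = tb_per_hour * 1000 / 3600  # == 1.0 GB/s

# Assumed sustained sequential rate per HDD; real drives fall off toward
# the inner tracks, so 150-250 MB/s is a typical range.
per_drive_mb_s = 200
drives_needed = gb_per_sec * 1000 / per_drive_mb_s

print(f"{tb_per_hour} TB/h ≈ {gb_per_sec:.1f} GB/s ≈ {drives_needed:.1f} drives "
      f"at {per_drive_mb_s} MB/s each")
# -> 3.6 TB/h ≈ 1.0 GB/s ≈ 5.0 drives, hence the 5-6 drive figure
```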

1 Like

Optane

3 Likes

This is fine for most drives as long as it’s sequential writes. (3.6 TB/h == 1 GB/s … You’ll need at least 5-6 drives in a simple RAID 0 setup.)

Yeah, I was thinking striped volumes in a ZFS pool to get the throughput with HDDs.

You may also want to consider flash instead of HDDs, simply because it’s easier to get the throughput, unless you need a large volume of scratch space.

My requirements for scratch are anywhere from 1-10 TB stored at any given point, so flash would be great if highly durable drives at 10 TB could be found on the cheap. This rules out Optane for me too.

If you’re doing several TB per hour, hard drives are out IMHO. The cost of additional controllers, space, noise, power, physical size, etc… I’d go with a smaller number of SSDs personally. Plus there’s the whole unofficially-SMR’d drive problem at the moment, and those drives don’t play well with ZFS (or RAID in general).

Take all of those issues together and you’re better off with a smaller number of SSDs.

You simply aren’t going to get there with the data rate, even if your IO is large-block IO. If it isn’t large-block IO, or it’s bursty beyond that average… things will be even worse.

Assuming 200 MB/s sustained, you’re only going to do 720 GB in one hour.

Sure, you could RAID a bunch of disks together, but… what happens during a rebuild? Will your array still be able to keep up while it’s rebuilding? Or is it OK for scratch to die / be unusable due to speed problems until it’s rebuilt or repaired?
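To put the rebuild question into rough numbers (this assumes a redundant layout; a plain RAID 0 simply dies), here’s a sketch. The drive size and resilver rate are illustrative assumptions; real ZFS resilver speed depends heavily on how full the pool is and how busy it stays:

```python
# Rough estimate of how long a rebuild/resilver window lasts.
drive_tb = 8           # assumed capacity of the failed drive
resilver_mb_s = 150    # assumed effective resilver rate (often lower under load)

rebuild_hours = drive_tb * 1e6 / resilver_mb_s / 3600
print(f"Rebuilding one {drive_tb} TB drive at {resilver_mb_s} MB/s "
      f"takes ~{rebuild_hours:.0f} hours")
# -> ~15 hours during which the pool is degraded and busy, so sustaining
#    ~1 GB/s of scratch writes on top of that is optimistic.
```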

1 Like

Here’s an HDD suggestion: I just purchased five of these for my Unraid NAS about a week ago, and they can do 250 MB/s pretty much 24/7. I don’t know how many you would need to hit your pool size and required throughput, but this is a very nice drive for the price. They were also shipped very, very well. It’s also an enterprise drive, so I don’t think it would have a problem with your workload.

On the flash side, there are these WD SN630 drives I’m now seeing for sale for 760 EUR in 7-ish TB capacity.

You think a pair of those would last you a few years?
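One way to answer that is to work from a rated endurance figure. Everything below is a placeholder assumption (the DWPD rating, warranty period, and duty cycle); check the SN630 datasheet for the real numbers before trusting the result:

```python
# Rough endurance estimate for a pair of large flash drives used as scratch.
capacity_tb = 7.68       # per drive
n_drives = 2
dwpd = 0.8               # ASSUMED drive-writes-per-day rating; verify on the datasheet
rated_years = 5          # warranty period the DWPD rating is usually defined over

total_endurance_tb = capacity_tb * n_drives * dwpd * 365 * rated_years

write_rate_tb_h = 3.6    # scratch write rate from the thread
duty_cycle = 0.25        # ASSUMED fraction of time the workload is actually writing

tb_per_year = write_rate_tb_h * 24 * 365 * duty_cycle
print(f"~{total_endurance_tb / 1000:.0f} PB of total endurance, "
      f"~{total_endurance_tb / tb_per_year:.1f} years at this workload")
# -> ~22 PB and ~2.8 years with these assumptions
```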

2 Likes

Just another thing to be aware of - hard drive performance varies as the drive fills. It may do 250 MB/s or whatever at the start of the disk, but gets slower the closer you get to the inside of the platter (like, maybe half the speed).

So unless you’re planning to short-stroke them (and even then, the innermost tracks you use will be slower than the outermost ones, just to a lesser degree depending on how much space you give up), you need to keep that in mind with spinning-disk media.
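To give a feel for how big that effect is (and what short-stroking buys back), here’s a toy model. The outer/inner rates and the linear falloff over capacity are simplifying assumptions, not measurements of any particular drive:

```python
# Toy model: HDD sequential throughput falls off roughly linearly from the
# outer tracks to the inner tracks (real zone maps are steppier than this).
outer_mb_s = 250   # assumed rate at the start (outer edge) of the disk
inner_mb_s = 120   # assumed rate at the very end (inner edge)

def avg_rate(used_fraction):
    """Average sequential rate when only the first `used_fraction` of capacity is used."""
    end_rate = outer_mb_s - (outer_mb_s - inner_mb_s) * used_fraction
    return (outer_mb_s + end_rate) / 2

for frac in (1.0, 0.5, 0.01):
    print(f"using the first {frac:.0%} of the disk -> ~{avg_rate(frac):.0f} MB/s average")
# -> ~185 MB/s for the full disk, ~218 MB/s for the first half,
#    ~249 MB/s for a 1% short-stroked partition
```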

1 Like

One of the first videos Wendell was in was about partitioning hard drives for speed. If you only need a small amount of scratch space, you can partition something like 1% of the total space at the front of the disk; then the read/write head barely has to move, and you get more surface area per rotation.

SSDs are so cheap these days, though, that I’d just go with those.

2 Likes

Yup, SSDs have their own quirks, but it’s mostly cache vs. actual bulk NAND speed.

Hard drives, though… basically take the performance you get when they’re first installed, halve it (for the inner tracks you’ll end up on as the disk fills, assuming you don’t short-stroke and give up a bunch of space), and then take some more off as the filesystem gets fragmented.

So… go larger than you need for better performance.

As crazy as it sounds, another alternative to SSDs, if you have a bunch of random writes and are operating close to full capacity most of the time (which prevents wear leveling from operating efficiently), is to go with RAM. Slow 2133 DDR4 ECC RAM is 3-4x the cost of cheap flash (and actually cheaper than Optane in some cases).

Not sure what kind of scratch space your workloads expect, but something like a bunch of Redis/memcached instances might do the trick better than a filesystem in this case.
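If the jobs can be pointed at a key-value store instead of a filesystem, the idea looks roughly like this. This is a minimal sketch using the redis-py client; the connection details and key scheme are placeholders, not anything from the thread:

```python
# Minimal sketch: Redis as RAM-backed scratch space instead of a filesystem.
# Intermediate results are stored as binary blobs keyed by run and chunk id.
import redis

r = redis.Redis(host="localhost", port=6379)   # placeholder connection details

def put_chunk(run_id: str, chunk_id: int, data: bytes) -> None:
    # Expire after a day so abandoned runs don't pin RAM forever.
    r.set(f"scratch:{run_id}:{chunk_id}", data, ex=24 * 3600)

def get_chunk(run_id: str, chunk_id: int) -> bytes:
    return r.get(f"scratch:{run_id}:{chunk_id}")

put_chunk("run42", 0, b"\x00" * 1024)   # write a dummy 1 KiB chunk
assert get_chunk("run42", 0) is not None
```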

1 Like

Scratch can die. There’s no data that I need to store beyond the duration of the production run.

1 Like

That’s not at all crazy - I remember a company called, or making a product called, “RAMSAN” about 9 years ago.

If you need fast scratch, that’s where it’s at :joy:

Actually, RAMSAN fills a different niche - the same one Liqid does today: growing legacy workloads, supporting the business and business development, that surprisingly outgrow what “normal hardware” can provide and outpace what your developers can keep up with.

For example: a database that happens to be architected so poorly that it needs to run on a single host with 4 TB of RAM (the limit on new Epyc) + swap on PCIe RAM-backed NVMe…

Any sane business would have invested in moving this workload from that one giant machine to something distributed, either by simply splitting/sharding the database vertically or horizontally into two or more parts, at the cost of inconveniencing the small part of the company that needs cross-shard queries, or by investing in something “cloud native” (hate that term) and rewriting the use cases around it, or both.
The downside is that in legacy deployments it takes developer time to implement these changes, and like any change there’s all kinds of risk involved, and not everything may end up going as planned.


In this case, it seems that drives are fine, and you don’t need low-latency access (network is fine), so it’s just a question of throughput and overall hardware wear.

Overall rate is ~10 Gbps (1 GB/s ≈ 8 Gbps) … not that high.

If the data is fully erased every hour or every day, and then you proceed to fill the drives up again, SSD wear leveling will get a chance to kick in and durability will be fine.

If the workload is generated by multiple machines, you can attach the drives to each machine and use them over the network. (There are various mfs filesystems that can help you run a writable union across the individual filesystems, so you don’t need RAID 0 if your scratch space is spread across many files - this works over the network.)

If you have a few dozen machines, then use regular drives and Ceph, where you practically get RAID 0 across all the drives on all the machines.
The network is still lower latency than HDD seeks, and this setup is great for tail latency and aggregate write throughput when you have many machines writing to scratch and many machines reading.

Get two Micron 9300 MAX SSDs, put them in RAID 0, job done. The smallest drive (3.2 TB) can endure over 18 PB of writes; multiply that by roughly 2, since RAID 0 should split the write load evenly, and you have a ~6.4 TB device with ~4-5 GB/s of write speed and ~30 PB of endurance.

You can even go with a single drive, but they are surprisingly affordable little beasts, ~700-900 EUR for one 3.2 TB drive.
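Taking the ~30 PB figure above at face value, a quick cross-check against the write rate discussed earlier in the thread:

```python
# How long ~30 PB of combined endurance lasts at the scratch write rate
# discussed earlier, assuming the unrealistic worst case of writing non-stop.
endurance_pb = 30
write_rate_tb_h = 3.6

hours = endurance_pb * 1000 / write_rate_tb_h
print(f"~{hours:,.0f} hours ≈ {hours / (24 * 365):.1f} years of continuous writing")
# -> ~8,333 hours, i.e. roughly a year of writing flat-out; with a realistic
#    duty cycle that stretches to several years.
```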

1 Like
