128TB RAW ZFS build, Jumping the Storage Spaces ship - Questions

HIGH LEVEL

I maintain roughly 30TB of large video files on Windows Storage Spaces. I'd like to build a new system using Ubuntu and ZFS (if the numbers make sense).

BACKSTORY

One of the biggest drawbacks of Storage Spaces is the overall performance, especially write performance. This has been a major bottleneck in getting to a 10Gb network setup, and now that the Storage Spaces pool is in place I can't reconfigure it for performance, so I'm forced to build a new system.

USE

I hoard videos. Each file is about 8 to 24GB.

GOAL

I’m hopeful that ZFS Raidz2, and some hardware changes, can get me to the sustained write performance I desire (400+MB/s would make me very happy).

EXISTING HARDWARE (on new server)

1x Intel - Xeon E5-2670 2.6GHz 8-Core Processor

1x ASRock EPC602D8A LGA 2011

2x Samsung - 850 EVO-Series 250GB

2x Plextor M8Pe(G) 128GB NVMe PCI-Express

8x 4GB RAM HMT151R7BFR4C-H9

FUTURE HARDWARE

16x HGST 8TB Ultrastar He8 HUH728080ALE604
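
For context, a rough sketch of one possible layout for these 16 drives follows: two 8-wide raidz2 vdevs rather than a single 16-wide vdev, since each vdev contributes roughly one disk's worth of IOPS. Device names, the pool name, and the dataset settings below are placeholders/assumptions, not decisions made in this thread.

    # Sketch only: two 8-wide raidz2 vdevs. disk1..disk16 are placeholders;
    # a real build should use stable /dev/disk/by-id names.
    zpool create -o ashift=12 tank \
        raidz2 disk1 disk2 disk3 disk4 disk5 disk6 disk7 disk8 \
        raidz2 disk9 disk10 disk11 disk12 disk13 disk14 disk15 disk16

    # Settings that tend to suit large sequential video files (assumed, not
    # specified by the OP)
    zfs set recordsize=1M tank
    zfs set compression=lz4 tank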

QUESTIONS

I'd like the write-back cache of the pool to sustain, worst case, 100GB of data per transfer at 400+MB/s. Is this possible, and how can I get there? Is an NVMe SLOG the answer?

Whichever strategy is needed, I need the write-back cache to be only for write-back. If the system is going to place recently used or high demand data there, all it takes is a few video files, and the cache could be full. Can you have a write-back cache in ZFS that is write-back only? (I’m OK with the NVMe drives sitting and doing nothing most of the time if that’s what it takes)

Do I even need to worry about this? Will the performance of Raidz2 be enough on its own?

Also, is 400+MB/s unrealistic?

OTHER STUFF

Ultimately, the ZFS server will become the primary machine, and the current Storage Space server will be blown up and rebuilt, becoming a full backup.

Thanks in advance for any assistance

looks good.

more RAM is always better than a SLOG or an L2ARC; the ARC works a lot better than most caching schemes. If this is a pure storage appliance, the more raw ARC you can throw at the problem, the better.
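
For reference, on ZFS-on-Linux the ARC ceiling is the zfs_arc_max module parameter, which defaults to roughly half of RAM. A hedged sketch of setting it on a 32GB box like the one listed above (the value is only an illustration):

    # /etc/modprobe.d/zfs.conf -- let the ARC use up to ~24 GiB
    # (illustrative value; leaving the parameter unset keeps the ~half-of-RAM default)
    options zfs zfs_arc_max=25769803776

    # The same knob can be changed at runtime:
    echo 25769803776 > /sys/module/zfs/parameters/zfs_arc_max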

throw an nvme slog in there if you really want some increased write performance
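
Attaching one later is a one-liner; a sketch with placeholder device names (mirroring the log is the usual advice, since a dead SLOG combined with a crash can cost the last few seconds of sync writes):

    # Add a single NVMe log device to an existing pool named "tank" (placeholder names)
    zpool add tank log nvme0n1

    # Or a mirrored log, which protects in-flight sync writes if one device dies
    zpool add tank log mirror nvme0n1 nvme1n1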

AFAIK SLOG is more relevant for sync write operations.

You can set it to handle async writes as well, but the SLOG is a copy of the ZIL, which is always in RAM, so no matter how much SSD storage you give it, it can only use what already fits in RAM. There may be another (better) way of doing write caching, but I'm not sure what it is.

Like @tkoham says, for performance more ram is more better.

Also, if performance is what you're after, striped mirrors will be faster than raidz, but you will lose usable storage.
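
To make that trade-off concrete, here is a hedged sketch of the same 16 placeholder disks arranged as striped mirrors instead of the raidz2 layout sketched earlier: eight 2-way mirror vdevs means only about half the raw capacity is usable, but you get eight vdevs' worth of IOPS instead of two.

    # Eight striped 2-way mirrors (placeholder device names)
    zpool create -o ashift=12 tank \
        mirror disk1 disk2 \
        mirror disk3 disk4 \
        mirror disk5 disk6 \
        mirror disk7 disk8 \
        mirror disk9 disk10 \
        mirror disk11 disk12 \
        mirror disk13 disk14 \
        mirror disk15 disk16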

I’m fairly certain that’s untrue. The purpose of the zil is to persist over reboots and as such, it must be stored persistently.

Wait. Isn’t the SLOG a separate block device to assist the ZIL on sync writes? Like… I do a sync operation which is slow but ZFS points it to a hopefully fast flash device and then it gets flushed to the spinning rust disks after?

not exactly, no

functionally yes, though

As I understand it (could be wrong), the ZIL is stored in RAM and on the pool so that if there is an error during writing it can be recovered. The SLOG moves the ZIL from the pool to a separate device to improve write speed (so you're not writing to the pool twice), but there is always a copy in RAM.

I've never dissected the source code, but I have seen situations where I've used 128GB of SLOG on a system with 32GB of RAM. The reason is that the ZIL is stored, in a non-SLOG configuration, on disks in the pool.

This is done so that incomplete writes don't get lost. If a ZIL entry is incomplete, it's dumped. If the ZIL hasn't been flushed to permanent storage, it's flushed at boot. This all happens at startup after a power failure or unclean shutdown.

The SLOG just moves the ZIL from empty blocks on the array to a dedicated device, which can increase functional speed.

I think you may be right; not sure where I heard that. If you were to set a pool to sync=always, would the SLOG act as a write cache, or is there a better way of doing that?

the ZIL is native to the pool (not RAM) unless you offload it to a SLOG device
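
If it helps, the relevant knobs are easy to inspect. As far as I understand it, sync=always just forces every write through the ZIL/SLOG path (which usually slows async workloads down) rather than turning the SLOG into a general write cache. A sketch, assuming a pool named "tank":

    # Shows whether the pool has a dedicated "logs" vdev; without one the ZIL
    # lives on the pool disks themselves
    zpool status tank

    # Per-dataset sync behaviour
    zfs get sync tank
    zfs set sync=always tank     # force every write through the ZIL/SLOG
    zfs set sync=standard tank   # default: only honor syncs the application asks for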

I've been using this article as a reference:

By default, the short-term ZIL storage exists on the same hard disks as the long-term pool storage at the expense of all data being written to disk twice: once to the short-term ZIL and again across the long-term pool. Because each disk can only perform one operation at a time, the performance penalty of this duplicated effort can be alleviated by sending the ZIL writes to a Separate ZFS Intent Log or “SLOG”, or simply “log”. While using a spinning hard disk as SLOG will yield performance benefits by reducing the duplicate writes to the same disks, it is a poor use of a hard drive given the small size but high frequency of the incoming data.

The last paragraph of the article states that you can get much improved write performance by moving the SLOG to NVMe.

The conundrum: increased write performance is what I'm looking for, but the examples and theory I'm reading are geared toward many small, quick writes. My proposed writes will be huge, pushing 100GB at a time. At that size, does the SLOG still get the job done?

RAIDZ is still striped; having an asynchronous cache get overrun won't hurt performance much with the number of spindles you'll be spinning.
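
That asynchronous cache is just dirty data buffered in RAM: ZFS accumulates async writes up to zfs_dirty_data_max and flushes them out in transaction groups, so for a plain file copy the SLOG isn't really involved. A sketch of inspecting and adjusting that limit (the 8 GiB figure is only an illustration):

    # How much not-yet-flushed data ZFS will hold in RAM for async writes
    cat /sys/module/zfs/parameters/zfs_dirty_data_max

    # Illustrative bump to 8 GiB (runtime changes are capped by zfs_dirty_data_max_max)
    echo 8589934592 > /sys/module/zfs/parameters/zfs_dirty_data_max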

16 drives should be enough to sustain 400 MB/sec, I would think?

Being a little pessimistic, a single modern spinning disk can stream 50+ megabytes per second (sequential) even when pretty full, and you have 16 of them. Even if you're writing the data twice without an SSD log device, that's effectively 8 drives' worth of throughput, and 8 x 50 = 400, so I reckon you'd get close.

Put it this way: for large copies I'm often saturating 1 gigabit Ethernet on my little 4-drive NAS running 4x4TB SATA disks in RAID10. The limit is the NIC. Easily.

These are the drives I'm using… they bench upwards of 120 MB/s each sequential when around half full… your 8TB drives should be faster?

To get 400 MB/sec over the network, you may start running into NIC issues… you’ll be wanting to ensure your 10 gig NICs are actually capable of pushing harder than 4 gigabit…

May need tuning, e.g.
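
A common starting point on Linux is raising the TCP buffer limits for 10GbE; the values below are illustrative only, not a recommendation from this thread:

    # /etc/sysctl.d/90-10gbe.conf -- illustrative 10GbE TCP buffer sizing
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216

    # Load the new settings
    sysctl --system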

Sequential, yes. Random, you’re going to drop extremely fast.

Definitely. But from the sounds of it this is being used as a video archive…?

Fully random, worst case, you're looking at maybe 4-7 megabytes per second, depending on the RAIDZ2 configuration.

Yes, but that doesn’t necessarily mean it’s only going to be accessed by a single person. As soon as you get that second access, you’re going to see seek time kill the write performance.

Yeah, I forgot to ask: how many users is this going to be serving? Because, as you mention, it will definitely matter.

That’s why I support having an SSD for write caches.

It’s not 100% critical, but it definitely helps.

ZFS in theory serializes writes using RAM, but yeah, if you’re stuffing 16x 8TB drives into a box, spending a couple of hundred bucks for a pair of SSDs isn’t going to break the bank.

edit:
I do try to size my storage based on the disks, though, and use the SSD cache as performance "gravy".

Because if you start pushing past the SSD cache, that’s what you’ll see…
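
One way to see exactly where that point is, before relying on any SSD cache, is a large sequential write test straight against the pool. A sketch with placeholder paths (fio's numbers include RAM buffering unless the run is much larger than memory, hence the 100G size and the end fsync):

    # 100 GiB of sequential 1 MiB writes against a scratch directory on the pool
    fio --name=seqwrite --directory=/tank/fio-test \
        --rw=write --bs=1M --size=100G --ioengine=libaio \
        --iodepth=16 --numjobs=1 --end_fsync=1 --group_reporting

If the bare pool already sustains the 400+MB/s target, the NVMe drives really can stay as "gravy".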
