TrueNAS Scale and ZFS, is ECC important enough to get? (Cezanne APU)

The thread is titled “TrueNAS Scale and ZFS, is ECC Important”, etc.

Assuming we’re talking spinning drives here, it’d be really interesting to see how much fragmentation slows ZFS down, especially since it can’t readily be “fixed” the way it can with conventional filesystems or RAID arrays. I’ve never actually seen any benchmarking on this.

back on topic:
In my experience looking at memory fault logs, ECC usually just tells me when a DIMM wasn’t seated properly; 7 out of 10 times, reseating the RAM fixed the flurry of ECC errors I’d get.

Please do tell what’s equivalent to ZFS and Btrfs.

I’d suggest this is why ZFS (and NetApp, which is also COW) recommends keeping usage below roughly 80%, to ensure enough contiguous free blocks exist for writes (to minimise fragmentation). Writes are async and full-stripe.

Reads on a multiuser system are going to be fairly randomly distributed anyway (because all the users want different stuff), and this is why you have big read caches.

i.e., writes = mostly not impacted; they’re serialised and written as a transaction group in a full stripe. Reads = cached; if it’s slow, add more cache.

Not saying frag = no impact, but again, on a system with many users it’s probably not as bad as you may suspect - the reads are coming in all over the disk anyway.

As far as benchmarking it goes… that would be a real tricky benchmark to set up…
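If anyone does want to try, something like the sketch below is how I’d picture it: write a bunch of big files, “age” them with random COW overwrites, then time a cold sequential re-read. Everything here is made up for illustration (the dataset mount point, file counts and sizes), and you’d want to export/import the pool (or reboot) between the ageing pass and the read pass so the ARC doesn’t hide the seeks.

```python
#!/usr/bin/env python3
"""Rough sketch of a ZFS fragmentation ageing test (hypothetical paths/sizes).

Phase 1 writes large files, phase 2 "ages" them with random in-place
overwrites (each overwrite becomes a new COW extent), phase 3 times a
sequential re-read. Export/import the pool between phases 2 and 3 so
the ARC doesn't hide the seeks.
"""
import os
import random
import time

DATASET = "/mnt/tank/fragtest"   # hypothetical dataset mount point
FILE_COUNT = 20
FILE_SIZE = 512 * 1024 * 1024    # 512 MiB per file
CHUNK = 128 * 1024               # matches the default 128K recordsize
AGE_PASSES = 5000                # random rewrites per file

def write_files():
    for i in range(FILE_COUNT):
        with open(f"{DATASET}/f{i}", "wb") as f:
            for _ in range(FILE_SIZE // CHUNK):
                f.write(os.urandom(CHUNK))

def age_files():
    for i in range(FILE_COUNT):
        with open(f"{DATASET}/f{i}", "r+b") as f:
            for _ in range(AGE_PASSES):
                f.seek(random.randrange(FILE_SIZE // CHUNK) * CHUNK)
                f.write(os.urandom(CHUNK))   # COW: lands in a new location

def read_files():
    start, total = time.time(), 0
    for i in range(FILE_COUNT):
        with open(f"{DATASET}/f{i}", "rb") as f:
            while chunk := f.read(CHUNK):
                total += len(chunk)
    secs = time.time() - start
    print(f"read {total / 2**20:.0f} MiB in {secs:.1f}s "
          f"({total / 2**20 / secs:.1f} MiB/s)")

if __name__ == "__main__":
    write_files()   # run read_files() once here for the "fresh" number
    age_files()     # then export/import the pool and run read_files() again
    read_files()
```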


I would like to see benchmarks for passthrough VMs. This is where array benchmarks would help a ton.

I mean the fragmentation of the data on the vdev itself, as opposed to free-space fragmentation. I don’t think ZFS has a way to even measure data fragmentation, but its effects on read speeds are real (also, I’m paranoid about HDD longevity/reliability, and I know head thrashing/excessive seeking isn’t good for either).
I’m making the assumption here that the ARC can’t predict what I’m going to read out of my pool.

I know ONTAP has reallocation commands to fix the issue, because it affected its customers enough.


Good point about multiuser environments possibly drowning out fragmentation seeks with the seeks from different users’ requests; perhaps that is the case. Now I want to see seek-time accruals on a “fresh” ZFS system, a “used” ZFS system, and then something that would represent a multi-user environment.


Yeah, I got it… data frag.

I believe ARC can “read ahead” to some degree. And yeah, that was kinda my point - on a multi-user box reads will be coming in somewhat randomly anyway.

I do agree fragmentation is probably an issue. Just that in the situations ZFS is intended for (multiuser, in a managed environment where space will be kept free and upgraded as appropriate) it’s likely not as big an issue as we might think, due to the factors above.

edit:
It seems ZFS does a degree of fairly clever (and adaptive) prefetch.
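On Linux OpenZFS you can at least watch the prefetcher doing its thing: the ARC kstats expose prefetch hit/miss counters. A quick sketch - it assumes the usual /proc/spl/kstat/zfs/arcstats location and just greps for anything prefetch-related rather than hard-coding field names, since those can vary a little between versions:

```python
#!/usr/bin/env python3
"""Dump the prefetch-related ARC counters on Linux OpenZFS.

Assumes the usual /proc/spl/kstat/zfs/arcstats location; field names may
differ slightly between OpenZFS versions, so we just filter for 'prefetch'.
"""

ARCSTATS = "/proc/spl/kstat/zfs/arcstats"

with open(ARCSTATS) as f:
    for line in f:
        parts = line.split()
        # kstat rows look like: <name> <type> <value>
        if len(parts) == 3 and "prefetch" in parts[0]:
            print(f"{parts[0]:<30} {parts[2]}")
```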

ZFS by design won’t have data fragmentation, so it doesn’t require a ‘defrag’ pass like other filesystems. ZFS does have a free-space fragmentation issue, hence the rule of thumb of keeping the pool no more than 80% full. When a pool is nearly full, the search for free blocks to write data into degrades significantly in performance; that’s the major source of slowdown.
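If you want to keep an eye on that from a script, `capacity` and `fragmentation` are both standard `zpool list` properties. A minimal sketch (the 80% threshold is just the rule of thumb above):

```python
#!/usr/bin/env python3
"""Warn when a pool crosses the ~80% rule of thumb.

Uses `zpool list -H -o name,capacity,fragmentation` (tab-separated in
scripted mode); the 80% threshold is the rule of thumb discussed above.
"""
import subprocess

THRESHOLD = 80  # percent used

out = subprocess.run(
    ["zpool", "list", "-H", "-o", "name,capacity,fragmentation"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    name, cap, frag = line.split("\t")
    used = int(cap.rstrip("%"))
    print(f"{name}: {used}% used, free-space fragmentation {frag}")
    if used >= THRESHOLD:
        print(f"  -> {name} is past the {THRESHOLD}% rule of thumb")
```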

Btrfs is a close ‘clone’ of ZFS, but it seems Red Hat gave up on Btrfs to focus on XFS. Apple File System (APFS) is the latest COW filesystem in large-scale deployment that I can think of. It isn’t designed with RAID in mind, but it scales from mobile devices to large workstations.

Huh. I just presumed that data was fragmented, because of the copy-on-write nature of the system.

So a file that is modified might have all the original data in a few sequential transaction writes, but the modifications in disparate writes in other places, as stuff is saved.
It’s never bothered me with small or large files on rust, but I’m only a single user. Amplified writes extending seek times must affect multi-user setups more.

@Trooper_ish You’re right, COW filesystems will fragment more/faster than non-COW filesystems, since COW encourages new writes to land in new places; for this very reason Btrfs has the ability to turn off COW on a per-file/folder basis to tune the system to end-user needs. I still contend Btrfs is a buggy mess when used with anything beyond RAID1.
Snapshots also encourage further filesystem fragmentation.
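For reference, the per-file/folder opt-out on Btrfs is the `C` (NOCOW) file attribute; it only applies to files created after it’s set, so you normally put it on an empty directory. A minimal sketch with a made-up path:

```python
#!/usr/bin/env python3
"""Mark a Btrfs directory nodatacow so new files created inside skip COW.

The +C attribute only affects files created after it is set (so apply it
while the directory is empty); the path here is just an example.
"""
import pathlib
import subprocess

vm_dir = pathlib.Path("/mnt/btrfs/vm-images")   # hypothetical directory
vm_dir.mkdir(parents=True, exist_ok=True)

# chattr +C sets the NOCOW attribute; lsattr -d shows a 'C' flag to confirm.
subprocess.run(["chattr", "+C", str(vm_dir)], check=True)
subprocess.run(["lsattr", "-d", str(vm_dir)], check=True)
```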

How long have you had your pool running? Maybe your data hasn’t had time to fragment enough yet? Or maybe I’m just paranoid.

@thro The ZFS prefetch is definitely elaborate - smart enough that I couldn’t predict its behaviour on a given read pattern, hahaha.

Another good point: this likely isn’t an issue if there is a consistent upgrade cycle that “refreshes” the data by putting it on new drives/servers.

Hmm. I’ve had this pool running since… 2006…

Fragmentation 17%

It’s never been fully rewritten (as in, re-laid-out), just resilvered onto new, bigger disks one at a time.

You’re talking free-space fragmentation though, correct?

Very much doubt it, as I’ve recently upgraded drives and there’s a lot more free space than only a couple of months ago (54% used - around 40% of the new disks is likely contiguous “virgin” space).

But I’m not sure - it’s just the “fragmentation” property of the pool.

I’m 99% sure ZFS is only capable of reporting free space fragmentation, unless something changed very recently.

My assumption as to why a data fragmentation metric was never implemented in ZFS (a flimsy assumption, based on very little) is that, due to the complexity of the system, they either couldn’t figure out how to calculate it at all, or couldn’t do it in a way that didn’t tank performance.

It just feels like I must be missing something, because every other COW filesystem can defragment the data stored on it: ReFS, APFS, Btrfs, NetApp’s filesystem, XFS (sort of).

Computer science has precise definitions of internal fragmentation and external fragmentation in the domain of filesystems. In my understanding, ZFS has neither by design.

When folks colloquially say “data fragmentation”, it seems to me to mean: the blocks of a file not sitting in one big chunk of contiguous blocks. I’m not so sure this sense of fragmentation carries much meaning in practice. Access time depends on lots of other factors, and you can optimise around it with various other techniques.

Also, practically all filesystems will show this sense of “data fragmentation” sooner or later after their pristine state. If we’re too worried, perhaps we should start using magnetic tape, which guarantees no “data fragmentation” :slight_smile:

Perhaps look at it from the other side: almost all data is fragmented, so that files can be spread among the providers in a vdev/array.

Each vdev will try to fit transactions into the smallest space that is/has been marked free.

As data grows, one needs to free up older blocks to “make space”, the same as deleting normal files in a single-drive filesystem?

So all files are basically fragmented by design, to allow splitting among redundant providers and to allow parallel reads, even if seek times are not better. It only improves throughput.

Yes, I’d categorize this as data fragmentation. If we were all on SSDs or in main memory I wouldn’t care, because it would be inconsequential, but HDDs are electromechanical and very sensitive to having their reads broken up and having to physically reposition a head to a new location. Performance can degrade by orders of magnitude if data is fragmented enough, and there are reliability implications as well.
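Back-of-the-envelope numbers for why rust cares so much (all the drive figures below are assumptions, not measurements): if every 128 KiB record of a file ends up needing its own seek, the arithmetic comes out roughly like this:

```python
# Rough seek-penalty arithmetic for a 1 GiB file on a 7200 rpm HDD.
# All drive numbers are assumed, ballpark figures.
file_size  = 1 * 2**30          # 1 GiB
record     = 128 * 2**10        # 128 KiB records
seq_rate   = 180 * 2**20        # ~180 MiB/s sequential throughput
seek_time  = 0.010              # ~10 ms average seek + rotational latency

records    = file_size / record
sequential = file_size / seq_rate
fragmented = records * seek_time + sequential   # one seek per record, worst case

print(f"sequential: {sequential:.1f}s   fully fragmented: {fragmented:.1f}s "
      f"({fragmented / sequential:.0f}x slower)")
# -> roughly 5.7s vs ~88s here, about 15x; smaller records make the ratio far worse.
```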

You’re right that all filesystems will get fragmented over time, but almost all filesystems have the ability to defragment themselves, or even “optimize” by placing often-accessed data at the perimeter of the platter and in proximity to other often-accessed data.

I was curious about the impact of “data fragmentation” and did some research. Here is a study of common filesystems (ext4, Btrfs, XFS, ZFS…) and their read performance with respect to ageing, i.e. as blocks of the filesystem are increasingly rewritten and ‘randomly’ scattered around the disk drives.

The study may not be very representative of the subject filesystems’ performance in general. Nevertheless, it’s a good data point for seeing the trends. It seems all filesystems suck when they age, and they age pretty quickly.

One surprise the study shows is that ZFS is pretty robust in handling repetitive rewrites of files in systems with light write activity (see Figure 2 in the linked study). It performs ‘okay’ in systems with medium to heavy write activity (see Figure 4). And when a large number of small files are concurrently modified, ZFS is fair at keeping the new blocks in close proximity to the unchanged blocks (see Figure 5).

The other surprise from the study is that XFS fares the best against ageing among the subjects under test. So perhaps my next build will move from ext4 to XFS for the system disk, while ZFS will remain the choice for NAS and data-archive purposes.


This is good stuff! I am going to read this end to end, but Figure 9a shows the different systems’ relative performance in a fragmented state versus a defragmented state. I hadn’t ever heard of BetrFS before, but it’s performant.

Glad to hear you enjoyed it. Hope more people find it interesting.

Figure 9 is the ‘mailserver test’ result. It is ‘interesting’ and perhaps not that fair to ZFS. I’d love to hear what people think about why ZFS performs the ‘worst’ in that experimental setup.

My take is that the wrong recordsize was used for ZFS. Their ZFS config uses the recommended defaults, which I suspect means a recordsize of 128 KB. The mailserver experiment is set up with file sizes uniformly distributed between 1 KB and 32 KB, way below the recordsize. They probably should have picked a smaller recordsize, say 16 KB; then I bet ZFS performance would be similar to the other test subjects.
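For anyone wanting to re-run it, recordsize is a per-dataset property and only affects blocks written after the change; a minimal sketch with a placeholder dataset name:

```python
#!/usr/bin/env python3
"""Set a smaller recordsize on a mail dataset before running the benchmark.

recordsize is a per-dataset ZFS property and only affects blocks written
after the change; 'tank/mail' is a placeholder dataset name.
"""
import subprocess

DATASET = "tank/mail"   # hypothetical dataset

subprocess.run(["zfs", "set", "recordsize=16K", DATASET], check=True)
subprocess.run(["zfs", "get", "recordsize", DATASET], check=True)
```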

But XFS shows its true color again :slight_smile: