[HELP] Strange ZFS setup choices

I'm extremely new to using ZFS and there are a few things I want to do. I'll state my reasons down below.

  1. Turn off ARC and all kinds of write cache.

  2. Get a prompt before ZFS repairs a rotten bit of a file. Or if that is not possible, get a log of which file was “repaired”.

  3. Optimize it for an all-SSD pool.

Now that I've stated what I want to do, I'll talk about why I've made such "bad" choices. I live in India - where computer prices are forever skyrocketing and where old quad-core servers with a mere 8GB of ECC RAM cost at least $400 (yeah, it's cheaper in the USA, but it ends up costing about the same once I ship said server to India, so not really worth it). The only reason I chose ZFS is its checksumming and repair features. All I need from a file system is for it to checksum the data being written in real time - which ZFS does - and to periodically check the whole disk for any kind of bit rot - which ZFS also does.

Now, because of the high prices of both new and old hardware, ECC memory is out of the question here in India. I will be ordering a workstation for myself with a Ryzen 9 3900X for deep learning (yes, I know Ryzen has unofficial support for ECC memory, but I want all in or nothing at all - no hanging in between; a clear, blunt "I have ECC memory" or "I don't have ECC memory" in terms of troubleshooting and support). I'll have it equipped with 5x 1TB SSDs, with the intention of getting 3.5-4TB of usable space in raidz1.

Reasons why I’m going with SSDs instead of HDDs

  1. The storage pool will only be written once and is purely for archival purposes, so scrubs will be far faster thanks to SSD read/write speeds.

  2. Even if a disk fails, the rebuild speed of even a SATA SSD is a godsend.

  3. Energy efficient

As I'm only using SSDs, the write-speed impact of turning off ARC won't be a problem for me. Also, as shocking as this might sound to most of you, the SSDs I've chosen are consumer TLC SSDs, and I don't intend to use them as a fast editing NAS, so there's that. I'll only be filling at most 1.7TB of the 3.5-4TB pool once it's all set up, and writes after that will be at most 10GB of files, only once a month. I can't stress this enough. I chose SSDs because I want to spend less time cringing and being nervous when a scrub occurs, and in the future when I need to replace a disk and resilver.

As for my intended setup, please bear with the insanity of it: I will be setting up an Ubuntu Server VM and using zfsutils-linux for the ZFS part. I will perform the optimizations I asked for help with at the beginning of the post and then set up the 5x passed-through SSDs in raidz1. Then I will transfer all of my important data to the SSD pool. I also have an external 4TB USB HDD and a Raspberry Pi 4B 4GB. I will set up that Pi with Ubuntu Server as well and put the single HDD on ZFS too (this time I don't need any of the above-requested optimizations). The Raspberry Pi setup is just for redundancy, as you'll read more about below.

Reasons why I have opted for such a bizarre setup:

  1. I can't justify ECC memory with my budget, so I want to turn off any writes held in RAM if possible. If I have understood ZFS correctly, ZFS will first write to ARC and then to the disks, so a bit flip in non-ECC memory won't be detectable at write time. This should be fine because the pool's write speed is already pretty fast, considering all of it is just SSDs.

  2. The reason I requested a prompt before ZFS repairs (or attempts to repair) a file is that I have non-ECC memory. I want to check myself whether the file is actually corrupt. If it is, I will let ZFS go ahead; if it isn't, it's most probably a RAM issue and I will see if a reboot helps. Failing that, I do have a remote copy of the same file on the Raspberry Pi :slight_smile:

  3. If getting a prompt before ZFS repairs a file is not possible, I want a log entry saying which file was "repaired". I will then perform the same verification I described in the point above, as I have no problem doing a bit more work in the absence of ECC memory.

This will be my first ZFS build - as is clearly apparent. So any help/recommendations regarding my setup will be extremely well received. :')

I am also paranoid about bit rot, so I get it. I think this falls under the category of "you do you" - go for it. But a ZFS scrub isn't something I worry about; I run it when I expect the storage to be at low use and forget about it.

ARC is a read cache, not a write cache. The ZFS Intent Log (ZIL) often gets confused with a write cache, but it isn't one either. Either way, to turn off ARC you would do zfs set primarycache=none <nameOfPool>, and to turn off the ZIL you would make everything async with zfs set sync=disabled <nameOfDataset>. I still don't see why everybody is so worried about ECC/non-ECC memory, but it seems that I won't be able to dissuade you.
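Spelled out with concrete (made-up) names, assuming a pool called tank and a dataset called tank/archive, it would look something like:

zfs set primarycache=none tank        # stop caching user data and metadata in ARC for everything under tank
zfs set sync=disabled tank/archive    # treat every write as async, so nothing goes through the ZIL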

For the repairing, you can set up scripts to notify you when a scrub finishes via the ZFS Event Daemon (ZED), and it can give you more information.
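If I remember right, the stock zedlets can already email you on scrub completion; you edit /etc/zfs/zed.d/zed.rc and set something like the following (the address is obviously a placeholder), then restart the zfs-zed service:

ZED_EMAIL_ADDR="you@example.com"   # where notifications get sent
ZED_NOTIFY_VERBOSE=1               # notify on scrub finish even when the pool is healthy

zpool status will also tell you how much the last scrub repaired.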

ZFS will still cache writes in memory as it creates transaction groups.

It's kinda pointless to worry about; every computer without ECC has the same potential for errors, as does the source of the data (the data might have errors before ZFS ever receives it to write out, and ZFS will faithfully write all of those errors into the pool, for all of time). The difference is that other filesystems might not check for corruption at read, write, or scrub time (some do).

ECC is not a hard requirement, it’s just a suggestion.

Even without ZFS, ECC might be recommended for computation tasks that are business critical, but again, it's not needed, and not always used.

This is the first I've heard of transaction groups, but after some quick google-fu it seems that they only affect in-memory data structures? Like metadata mappings (which eventually get flushed to disk, but if it starts in memory and you have to have it, then ¯\_(ツ)_/¯).

I’m not an expert, and I presume you already found a brief overview here:

And maybe general write throttling here:
http://dtrace.org/blogs/ahl/2014/08/31/openzfs-tuning/

But Jim Salter does suggest a way to speed up writes when the cache is disabled:
https://jrs-s.net/2019/07/20/zfs-set-syncdisabled/

Tldr: for fastest write, disable cache, and set sync=disabled, which is unsafe, but fastest? (unless I’m reading it wrong?)

Not sure if this conforms to the goal of "no cache for safety's sake", though…

1 Like

Oh, and meant to link this:

https://jrs-s.net/2019/05/02/zfs-sync-async-zil-slog/
for more options, though OP wanted less cache, rather than more…

From here

primarycache = all | none | metadata
Controls what is cached in the primary cache (ARC). If this property is set to all, then both user data and metadata is cached. If this property is set to none, then neither user data nor metadata is cached. If this property is set to metadata, then only metadata is cached. The default value is all.

Do not set primarycache to none; you will amplify reads because ZFS will have to search for the metadata and then for the data blocks it's looking for. none is meant for testing and debugging purposes, and it can have weird performance consequences that are less well understood than ZFS normally is…

ARC resists caching large sequential reads, preferring to store smaller random reads, especially if they are read multiple times. ARC also frees memory when the system requests it, though in certain workloads it's sometimes not fast enough to avoid low-memory consequences. It generally doesn't cause the system problems.

If for some reason you really do not want to cache for a dataset, set:

# zfs set primarycache=metadata your_pool_name/your_dataset_name

This is actually useful for saving memory when using VMs (because the VMs ALSO cache data in memory, and don't have access to the host's ZFS ARC).
I believe there may be bad performance consequences when using Samba to access a ZFS share though, which brings us to the next thing.
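If you later want to check or undo it, these should work (same placeholder names):

zfs get primarycache your_pool_name/your_dataset_name       # show the current value and where it comes from
zfs inherit primarycache your_pool_name/your_dataset_name   # revert to the inherited default (all)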

The best way to limit ARC size is the following.

Create or edit zfs.conf:

nano /etc/modprobe.d/zfs.conf

Then add or modify the following line:

options zfs zfs_arc_max=8589934592

The size is in bytes (binary multiples):
1073741824=1GiB=1024^3
2147483648=2GiB
4294967296=4GiB
8589934592=8GiB
Etc…

Then you’ll probably need this to update things:

update-initramfs -u

Finally use the following to confirm it was changed. It should spit out the number you want.

cat /sys/module/zfs/parameters/zfs_arc_max
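I believe recent ZFS on Linux also lets you change this at runtime as root, without rebooting - the modprobe.d entry is still what makes it persist:

echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max   # apply the 8GiB limit immediately
cat /sys/module/zfs/parameters/zfs_arc_max                 # confirm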

I've run ZFS on a non-ECC system with 4GB of RAM before. It's fine for storage purposes. The only thing that FORCES the need for more memory is when you have a LOT of files; ZFS needs to keep track of that list in memory to function. I forget the calculation for estimating how much is needed, but the minimum needed at least partially depends on recordsize (which affects the number of blocks to keep track of).

Another thing you can do to decrease the amount of metadata needed (useful at SSD GB sizes, not so much at HDD TB sizes):

# zfs set recordsize=1M your_pool_name/your_dataset_name

The default is 128K. Changing it to 1M or 4M is very common for storage datasets. It can have negative performance consequences due to write amplification, depending on how the files are modified.
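Note that recordsize only applies to files written after the property is changed; existing files keep the record size they were written with. You can confirm the setting with:

zfs get recordsize your_pool_name/your_dataset_name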

I have a post here that has some other things you may be interested in.

1 Like

Turning off Write Cache?

As @Trooper_ish mentioned, you can set sync=always. This will broadly drop your write rate by an order of magnitude; mixed R/W workloads will be so bad it's silly. ZFS batches writes together into txgs and flushes them to disk every 5 seconds (or sooner, if they hit a certain size) by default, and it will do this whether you have sync writes or async writes. It's just that sync writes force additional writes to the disk in order to quickly respond to a program waiting on confirmation that a sync write has truly completed. It also doesn't automatically make things safer: sync writes are for allowing the program making the writes to be sure everything is as expected in case of a sudden crash. Databases may need them, but ZFS doesn't care.
Forcing sync isn't needed for simple data storage.
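If you're curious, on Linux that 5-second batching interval is exposed as a module parameter you can inspect (I wouldn't change it):

cat /sys/module/zfs/parameters/zfs_txg_timeout   # default is 5 (seconds)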

ECC Ram

Run memtest86 on your normal memory for a day or two, and then don’t worry about it.
Your single largest source of data corruption will be faulty/poorly connected SATA and SAS cables, hands down. I’ve personally caught this a few times, thanks to ZFS.
Following that is a faulty/overheating SATA or SAS controller. (Most SATA expander cards are garbage.)
After that, my guess would be bad sectors being revealed on a disk. They don’t really just start existing randomly, but rather were always a bit iffy and actual use simply caught them.

Scrub and Resilver Speed

Scrubs in recent ZFS versions are very fast. They should be set to run automatically monthly or every two months. You shouldn't worry about them or even notice them unless you get an email telling you something is wrong. By the way, DO set up email notifications.
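I think Ubuntu's zfsutils-linux already ships a monthly scrub cron job, but if you want to roll your own, a cron.d entry along these lines works (the pool name and schedule are just examples):

# /etc/cron.d/zfs-scrub - scrub "tank" at 02:00 on the 1st of every month
0 2 1 * * root /usr/sbin/zpool scrub tank

zpool status tank will then show when the last scrub ran and whether it repaired anything.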

Resilver speeds are slow on HDDs. If you have a disk go down, the first thing you should do is send/recv to your backup; then there's no need to worry. In fact, your snapshots and backups should take place automatically. (An alternative)
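A minimal sketch of what that automatic backup to the Pi could look like (pool, dataset, snapshot, and host names are all made up); tools like sanoid/syncoid or zrepl automate exactly this, snapshots included:

zfs snapshot tank/archive@2021-05-01
zfs send tank/archive@2021-05-01 | ssh root@backup-pi zfs receive backuppool/archive
# later runs only send the changes since the last common snapshot:
zfs snapshot tank/archive@2021-06-01
zfs send -i tank/archive@2021-05-01 tank/archive@2021-06-01 | ssh root@backup-pi zfs receive backuppool/archive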

By the way, you do have backups, right? Right? I saw the word redundancy, but it's easy to confuse terms.

The word "redundancy" in tech is typically used to mean there is extra stuff so that things keep working without breaking. A ZFS vdev configured in RAIDZ1 has redundancy, thanks to using an extra disk's worth of data for parity. A clearer term is "fault tolerance".

To truly satisfy the requirements and purpose of having a backup, you need both a copy of the data and the ability to roll back changes to arbitrary points in time, in case data is changed accidentally or maliciously. It should also preferably be physically separate from the location of the source data.

Because ZFS is Copy-On-Write, you can compare files in snapshots using zfs diff
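For example (hypothetical names again):

zfs diff tank/archive@before tank/archive@after   # list files created/modified/renamed/deleted between two snapshots
zfs diff tank/archive@before                      # or compare a snapshot against the live filesystem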

For faster resilvers, keep an eye on dRAID, which will likely come out in ZFS 2.1 or so. Basically, instead of having a whole disk sitting empty and ready to be used as a spare, the spare capacity is spread out over all the disks along with the data, forming a sort of virtual spare. That way, if a disk goes down, the remaining disks can all work at the same time to calculate and write the extra parity within just a few hours, instead of the ~16 hours it takes to write a whole 10TB disk full (the 30+ hour figure I originally quoted was for a read+write+read test I use for checking sectors, not a resilver, sorry). Actual disk replacements will still be slow: when the new disk is added back, it still needs to be written full as usual. Once the replacement disk has finished resilvering, the space taken by the virtual spare's data blocks is freed up.

4TB of data will resilver within a day, even with RAIDZ.

2 Likes

I always looked at Raid / Mirrors more as Uptime, and Backups as recovery/ disaster recovery.

Even non-RAID systems can recover with backups, whereas the best RAID in the world is useless if errors exceed its fault tolerance (like a cable to a controller going bad and half the vdevs dropping out, killing the pool ded (okay, maybe not ded, just degraded and unresponsive till the vdevs are restored)).

Also, deleting or changing a file on a RAID array might not be recoverable without backups unless it's spotted in time (snapshots may extend that time window).

But OP (@prathampatel) already has a backup device planned, with the risk already accounted for, which is good to see (a risk acceptable to OP).
If the Pi hadn't been bought already, an Odroid HC2 (the 3.5" version) might be simpler to use as a network backup device, as it needs fewer cables, but that's a semantic point.

I presume you mean on spinning rust.

As in, OP should not necessarily worry so much about getting SSDs; replacing dead hard drives at such small capacities is not expected to be problematic?
Like, larger drives take longer to resilver, and they are where one should worry about death-before-resilver?

1 Like

Maybe just get a pair of 5TB/6TB CMR drives (extra space for comfort), set up dm-crypt and RAID1 btrfs on top, and you're done. You also get checksum and scrub functionality, and you only have plain old Linux page cache semantics to worry about when it comes to RAM usage - easy peasy and cheap. When you need more space, just rebalance to RAID5. When you need more performance, just start using device mapper with a write cache on SSDs.
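Roughly like this, if anyone wants the shape of it (device names and mountpoint are placeholders):

cryptsetup luksFormat /dev/sdX
cryptsetup luksFormat /dev/sdY
cryptsetup open /dev/sdX crypt_a
cryptsetup open /dev/sdY crypt_b
mkfs.btrfs -m raid1 -d raid1 /dev/mapper/crypt_a /dev/mapper/crypt_b
mkdir -p /mnt/archive
mount /dev/mapper/crypt_a /mnt/archive
btrfs scrub start /mnt/archive   # checksum verification pass, analogous to a ZFS scrub

# later, after adding a third encrypted disk, convert data to raid5:
btrfs device add /dev/mapper/crypt_c /mnt/archive
btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt/archive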

1 Like

I don't know, I think OP's choice of ZFS is a good one; they just don't have to worry (as much) about the ECC and the SSDs if money is tight.

But it's not a bad thing to actually think through the consequences, rather than jumping into Death-Wish-RAID (then regretting it later) with no backups.

1 Like

I always looked at Raid / Mirrors more as Uptime, and Backups as recovery/ disaster recovery.

Exactly. It’s for giving time to mitigate the issue, not really for preventing data loss.

I presume you mean on spinning rust.
As in, OP should not necessarily worry so much about getting SSDs; replacing dead hard drives at such small capacities is not expected to be problematic?
Like, larger drives take longer to resilver, and they are where one should worry about death-before-resilver?

Yeah, HDDs.
A rough calculation of the time it takes is as follows.
Let's assume the disk is doing a low-end 180 MB/s of writes. If I recall correctly, most of mine seem to hit between 200 and 220 MB/s when they have a nice large uninterrupted sequential write.

1,000,000,000,000 bytes (1 TB) divided by 180,000,000 bytes/s
= 5,556 seconds
= 93 minutes
= roughly 1.5 hours to fill up 1 TB of space

Note that 10TB easystores can write starting out at about 210MB/s when empty, but this will gradually slow down to about 130MB/s at about 80% full

So a 10TB disk would fill up in roughly 16 hours, if filled completely.
A 4TB disk would completely fill up in a bit over 6 hours.

Now, there may be things that slow this down, but generally once the send/recv gets going it's going to be even faster, and additionally the disk isn't going to be completely full.

Basically, not really a concern.

2 Likes

I generalise down to 100MB/s for hard drives, because I buy cheap consumer rust, not those fancy 240MB/s SAS drives some people can afford

1 Like

All my disks are shucked easystores. You are probably used to seeing lower speeds because of everything else going on: more and smaller files, Samba, 1G Ethernet, etc. At this point, all HDDs are pretty much pouring out of the same tap. Manufacturers probably do some sort of testing and send the best ones off to enterprise, and stuff the rest into plastic.

Actually I did forget about ethernet. OP probably doesn’t have a coolguy 10G network from ebay like someone I know, so he’s gonna be maxed out at ~120 MB/s.

So, for shoving 1 TB of data over 1G Ethernet at ~100 MB/s:
= 10,000 seconds
= 167 minutes
= 2.8 hours

So a 10TB disk would fill up in roughly 28 hours, if filled completely.
A 4TB disk would completely fill up in about 11 hours.

1 Like

That’s another thing, if completely filled.

I've just re-jigged some 4TB drives hanging off a Pi, and it was resilvering at about 120MB/s. If I had unmounted, it probably would have been a little quicker, but the drives were not full, so it was pretty quick.

And I can see the dust from the derailed thread leading off into the distance…

1 Like

Upon some reflection, I've decided to stick with my decision. I will consider all of your input, and I feel overwhelmed by the generosity of the Level1 forums compared to r/zfs. To answer @Log: yes, I do have a remote backup. I intend to put my Raspberry Pi at my friend's house, and we both have the same "local" ISP, so I can send updates to my NAS over gigabit.

1 Like