Linux ZFS datasets/topology, new user

So I finally bit the bullet and decided to scrap ext4 for zfs. Played around with a VM before doing the real deal.
I couldn’t find a single guide covering exactly what I was looking for, so I pieced it together from a handful of other sites/guides, which got a little confusing. Wondering if my setup is okay or flat-out insane.
This is my setup & datasets (& zvol):

$ zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:03:18 with 0 errors on Sat Aug  5 01:59:26 2023
config:

	NAME                                                     STATE     READ WRITE CKSUM
	rpool                                                    ONLINE       0     0     0
	  mirror-0                                               ONLINE       0     0     0
	    ata-Samsung_SSD_860_EVO_500GB_S3Y1CC0IA58810B-part2  ONLINE       0     0     0
	    ata-Samsung_SSD_870_EVO_500GB_S7PXMTEO510313X-part2  ONLINE       0     0     0

errors: No known data errors
$ zfs list
NAME                             USED  AVAIL     REFER  MOUNTPOINT
rpool                            228G   222G       96K  none
rpool/ROOT                      7.78G   222G       96K  none
rpool/ROOT/archlinux            7.78G   222G     7.48G  /
rpool/ROOT/archlinux@Aug6_2023   315M      -     7.34G  -
rpool/ext4                       200G   337G     84.4G  -
rpool/home                      20.0G   222G     20.0G  /home
rpool/home@Aug7                 43.5M      -     19.9G  -

I’m using a mirror setup with 2 x 500GB SSDs (Samsung 860 EVO + Samsung 870 EVO), created with ashift=12 (search results were mixed between 12 and 13; the drives lie and report 512b sectors).
I guess my question is this: I do a fair bit of bittorrent and have several VMs. I created a ZVOL and put an ext4 filesystem on it. Would this eliminate the fragmentation that I would otherwise get if only using zfs? After some reading around, I set the volblocksize to 1M; all the other datasets are at the default recordsize (128K I believe).
I seem to be getting a pretty decent compressratio on the zvol @ 1.38x w/ LZ4 compression.
Is this the appropriate/correct usage of a ZVOL? Would it be an ideal way to store large files (and junk) that will be modified/created/destroyed semi-regularly, in terms of fragmentation (or any other issues)?
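
For reference, the zvol + ext4 bit was created roughly like this (options from memory, the mountpoint is just an example):

$ zfs create -V 200G -o volblocksize=1M rpool/ext4    # zvol with 1M blocks (compression inherited from the pool)
$ mkfs.ext4 /dev/zvol/rpool/ext4                      # plain ext4 on top of the zvol
$ mount /dev/zvol/rpool/ext4 /mnt/junk                # example mountpoint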

Overall things seem smooth. I created a tmpfs for my browser’s cache in RAM, and I see very little to no write activity now that the browser cache/profile at /home/user/.cache/chromium has been shifted there.
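
The tmpfs entry in /etc/fstab is something along these lines (the size and uid/gid values are just what I’d pick, adjust to taste):

tmpfs  /home/user/.cache/chromium  tmpfs  rw,nosuid,nodev,size=2G,uid=1000,gid=1000,mode=0700  0  0
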
Thanks for any insights from anyone more experienced with zfs

A zvol is still going to get fragmented. It’s all sparsely allocated and not guaranteed to be contiguous (in fact it won’t be, by design). Having used ZFS a while, I can say that in reality fragmentation isn’t as big a problem as it may seem. As long as you’re not filling your disks to capacity, ZFS will be clever about how it allocates blocks. Keep your disk use below about 75% and your fragmentation probably won’t go over 3-4%.


Workload Tuning — OpenZFS documentation

If it’s in a zvol, volblocksize should be 16K for BitTorrent, with a separate 1M zvol for long-term storage of media files.

Bit Torrent
Bit torrent performs 16KB random reads/writes. The 16KB writes cause read-modify-write overhead. The read-modify-write overhead can reduce performance by a factor of 16 with 128KB record sizes when the amount of data written exceeds system memory. This can be avoided by using a dedicated dataset for bit torrent downloads with recordsize=16KB.

When the files are read sequentially through a HTTP server, the random nature in which the files were generated creates fragmentation that has been observed to reduce sequential read performance by a factor of two on 7200RPM hard disks. If performance is a problem, fragmentation can be eliminated by rewriting the files sequentially in either of two ways:

The first method is to configure your client to download the files to a temporary directory and then copy them into their final location when the downloads are finished, provided that your client supports this.

The second method is to use send/recv to recreate a dataset sequentially.

In practice, defragmenting files obtained through bit torrent should only improve performance when the files are stored on magnetic storage and are subject to significant sequential read workloads after creation.
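
Putting those suggestions into commands, a dedicated download dataset plus the send/recv rewrite could look something like this (pool/dataset names are made up):

$ zfs create -o recordsize=16K rpool/torrent-downloads        # dedicated dataset for in-progress torrents
$ zfs snapshot rpool/torrent-downloads@rewrite                # snapshot the finished downloads
$ zfs send rpool/torrent-downloads@rewrite | zfs receive rpool/torrents-archive   # rewrites the data sequentially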

The proper blocksize to choose for where torrents are downloaded is kinda tricky, because torrents can be vastly different. I have to take issue with OpenZFS’s torrent tuning suggestions, because they neglect an aspect of how torrents work.

A torrent breaks everything into chunks called Pieces that range from 16KiB to 8MiB, so ( torrent_size / number_of_pieces ) is supposedly an optimal block size for a filesystem to ingest. Generally, when a torrent is correctly created, the larger the overall size, the larger the Piece size used, because there’s some rough range of Piece counts (a few hundred to just over a thousand?) that is considered better. Modern torrents seem to use larger Piece sizes overall than in the past. With internet as fast as it is, sending 16KiB Pieces is actually a bad idea. (Torrent clients may actually send multiple Pieces at a time, but I’m not sure on this.)

For me, I just checked a few torrents that are a few GiB in size, and they have either 4MiB or 8MiB piece sizes.

  • A generated nixos.iso is ~2.38 GiB in size, and consists of 610 Pieces which are 4MiB in size.

  • A 287.63GiB youtube channel archive I uploaded to the internet archive has 36818 Pieces that are 8MiB in size. lmao

When your torrent application reads and writes data, it could theoretically be doing so in some manner of 16KiB operations (no idea). In practice, though, at the macro scale it’s sequentially reading and writing out entire Pieces; to my understanding, torrents don’t send portions of Pieces.

So even if a torrent client requests a 16KiB read (again, unsure if this is really the case) from a much larger block on ZFS, ZFS is going to store that block in ARC and immediately have the rest available from system memory when the client asks for the next 16KiB. So there shouldn’t be read amplification. Incidentally, torrent seeding is a fantastic use for adding any old SSD as an L2ARC.

And torrents with very small Piece sizes are likely dealing with a small total size anyway, so any RMW amplification will be even more transient than it is for a regular torrent.

And for writes, we should also not forget that ZFS aggregates async writes in memory until a size or time threshold is hit. So if write amplification is truly a concern, you can increase those parameters; zfs_txg_timeout is the one I remember as immediately relevant.

There are also other options, but frankly I’m not clear on how they all interact with each other.

So in /etc/modprobe.d/zfs.conf you can put something like this and be fine. Hopefully someone who’s more into tuning can chime in on whether this is all that’s really needed for this scenario.

options zfs zfs_txg_timeout=30

Basically if you lose power, you have roughly up to 30 seconds of data loss.
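
If you want to test it before committing to the modprobe option, the same knob can be poked at runtime through the module parameter in sysfs (takes effect immediately, not persistent across reboots):

$ echo 30 | sudo tee /sys/module/zfs/parameters/zfs_txg_timeout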

So, short of self-harming by carefully testing and tracking write amplification and performance across many variables over fucking torrents, here’s what I have done and would continue to do for torrent ingestion:

  • Dataset: I just leave the recordsize at default, or go larger, like 1MiB.
  • ZVOL: I’d probably set volblocksize to 32KiB or 64KiB. Possibly larger, dunno; it would depend on the histogram of file sizes. (Rough commands for both are sketched after this list.)
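
Something like this, roughly (names are just examples):

$ zfs create -o recordsize=1M rpool/torrents                  # plain dataset, larger records for mostly-static files
$ zfs create -V 100G -o volblocksize=64K rpool/torrent-vol    # or a zvol, if you really want one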

For torrent seeding, bigger blocks are better. Treat it like the static data it is. This also cuts down on metadata (which takes up RAM) and the CPU resources needed to calculate checksums and compression. Also, the nice thing about torrents is we don’t give a damn about latency, outside of the metadata, which will end up in ARC.

:point_right: Also make sure to disable preallocate in your torrent client. :point_left:
If the client writes zeros to preallocate space, those get compressed to nothing by anything other than compression=off (which you shouldn’t use anyway; that’s basically for debugging only. If you ever decide that compression is truly a terrible thing, use zle instead). If your client tries to be clever by writing random data, then when the real data comes, it’s a COW filesystem, so it’s going to be written somewhere else anyway.


Let me know if I got anything twisted around

P.S. Setting the torrent client to move finished files to another dataset (or similar) will remove any fragmentation from the torrent, if you consider that an issue. The zfs-inplace-rebalancing script (GitHub - markusressel/zfs-inplace-rebalancing: Simple bash script to rebalance pool data between all mirrors when adding vdevs to a pool) is also intriguing, but I haven’t tested it yet.
Having larger recordsize/volblocksize values keeps free-space fragmentation (which is what the FRAG value actually measures) from becoming a problem, since ZFS will continue to easily find contiguous spaces to stuff data into without having to create tangled nests of sub-block pointers called gang blocks.
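
If you want to keep an eye on it, that FRAG value shows up in zpool list:

$ zpool list -o name,size,allocated,free,fragmentation,capacity rpool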


Hi @dshocker,

welcome to Level1techs!

Can you please elaborate on why you switched to zfs? There is nothing wrong with it, but knowing what you want to achieve allows you to test whether you actually achieve your goal(s).

Your zfs physical layout is a (relatively) simple mirror of two SATA SSDs. I don’t see that you’re using, or even need, the advanced capabilities of ZFS (e.g. snapshotting, compression, deduplication, etc.).
How do you think zfs is better than ext4 on top of a simple mdadm mirror?

You use a ZVOL only to create an ext4 file system on top of it (you technically didn’t “scrap” ext4). Why did you create a ZVOL instead of simply using a basic zfs dataset (refer to Aaron Toponce : ZFS Administration, Part X- Creating Filesystems)?
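
For comparison, a plain dataset is just a one-liner (name/mountpoint here are only an example):

$ zfs create -o mountpoint=/torrents rpool/torrents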

I don’t yet have snapshots because I’ve only been running it for less than a week. I am using lz4 compression. Basically I wanted redundancy: one disk is 3 years older than the other, so by the time one fails, chances are the other one will have enough life left to resilver onto a replacement disk. I was using Timeshift before, and it saved my ass on a few occasions, but it hardly compares to zfs snapshots. Being able to manipulate snapshots at boot with ZBM is quite appealing too.
And well, I didn’t realize I should just create another dataset for torrenting/junk. If that’s a superior solution to an ext4 zvol then I will happily destroy that zvol and just create a zfs dataset; I’m pretty new to all this and figuring things out as I go. But the two main reasons I’d say I wanted it are the snapshots and the redundancy.
It’s a laptop, so the layout has to stay pretty simple; there’s only room for 2 storage devices.
I had just read a lot about fragmentation with torrents and VMs; there was no other reason for creating the zvol. If I’m thinking along the wrong lines (it seems I am), I’ll gladly destroy that zvol and use a zfs dataset.

Yeah, I’d definitely recommend just going with straight datasets. Generally there’s no sense in ignoring the filesystem you already have just to put another filesystem on top of it.

ZVOLs are for when you really want to emulate a block device for some reason, typically virtual machines. Even then I personally prefer QCOW2 on a dataset and call it a day.
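
i.e. roughly along these lines (paths are just an example, and the 64K recordsize is only a common suggestion for qcow2, not gospel):

$ zfs create -o recordsize=64K -o mountpoint=/vm rpool/vm    # dataset to hold VM images
$ qemu-img create -f qcow2 /vm/win10.qcow2 100G              # qcow2 image living on the dataset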


Thanks for clearing that up for me. I had an incorrect way of looking at/interpreting what I was reading; it wasn’t obvious to me that I should use a dataset and not a zvol. Also, it seems like in every post I come across with recordsize recommendations, everyone has a different opinion. I’ll go ahead and make the recordsize 1M. I’d say the ratio of VM storage to bittorrent will be about 80/20. It’s basically a duckaround dataset that will never be snapshotted, a place to put random bits and bobs/junk.

Thanks to everyone for pointing things out to a zfs newbie. It’s not that there’s a lack of information out there; quite the opposite, it’s a bit overwhelming how much information there is, with lots of differing opinions.


You can assign a different recordsize value to each dataset you create in ZFS. What the best value is depends on how the dataset is used.
You can see stats on request sizes using zpool iostat -r (if you append a number, you get live stats refreshed every that-many seconds). This can help you find optimal recordsize values for your datasets.
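
For example, to watch the request-size histogram refresh every 5 seconds:

$ zpool iostat -r rpool 5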


Not too sure how to interpret these numbers (pre-VM dataset creation)

rpool         sync_read    sync_write    async_read    async_write      scrub         trim    
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0
4K          94.0K      0   672K      0    160      0  6.04M      0  9.76K      0      0      0
8K          11.0K     81  78.0K      2  1.29K      3  1.21M  1.73M  1.73K  3.06K      0      0
16K         5.72K    132  24.8K      0    657     43   292K   941K  1.99K  3.55K      0      0
32K         6.23K    184  14.6K     14  1.88K     87   439K   377K  7.35K  5.96K   365K      0
64K         11.1K    130  15.4K     29  2.35K    306   212K   285K  67.3K  57.6K   254K      0
128K        3.71K     31  72.7K      4  16.2K     42   349K  18.7K   364K  37.6K   206K      0
256K        6.42K      0    116      0    286      0   598K      0  44.8K      0   296K      0
512K        2.72K      0    184      0    119      0   175K      0  55.7K      0   140K      0
1M          35.6K      0  60.6K      0    816      0   135K      0  82.8K      0  58.4K      0
2M              0      0      0      0      0      0      0      0      0      0  21.9K      0
4M              0      0      0      0      0      0      0      0      0      0  8.53K      0
8M              0      0      0      0      0      0      0      0      0      0  2.87K      0
16M             0      0      0      0      0      0      0      0      0      0  3.59K      0
----------------------------------------------------------------------------------------------

I created the dataset with recordsize=1M

[]$ du -sh Windows\ 10\ x64/
88G	Windows 10 x64/

[]$ du -sh /vm/Windows\ 10\ x64/
56G	/vm/Windows 10 x64/
NAME      PROPERTY       VALUE  SOURCE
rpool/vm  compressratio  1.49x  -

Very impressed with the compression
Thanks again for all the useful information


Just a clarification on this: I think that volumes are “thick provisioned” by default, so refreservation == volsize, unless -s is specified in the zfs create command.

The cost to fragmentation is then at creation time rather than during the entire time new blocks are being allocated.

Not using -s with -V is probably the best one can do if the intention is to avoid pool free-space fragmentation.
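
In zfs create terms, the difference is just the -s flag (hypothetical names/sizes):

$ zfs create -V 100G rpool/thickvol       # thick: refreservation is set to the full volsize up front
$ zfs create -s -V 100G rpool/sparsevol   # sparse: no refreservation, space is only claimed as data is written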

Nah, that’s not what the reservation means in ZFS. It just means ZFS will pretend the volume takes up that much space for any other calculations, ensuring there’s always room to write as much data to it as the volume claims to hold. Useful for VMs in some scenarios.

ZFS is copy on write. Allocating everything at once is pointless, because the first write after the initial allocation is going to be split off and written in a new area, and so is every subsequent write. A byte once written is never overwritten with new data from the same position in that file; the new data is simply written elsewhere and the original byte marked as free (unless it happens to be part of a snapshot or is deduped and shared with other files).

This is why you can’t use a ZFS volume formatted as swap for hibernation: the data will be scattered all over the pool.

Everything’s working out nicely; verrrrryyyy happy to have real-deal snapshots and not the pseudo-snapshots from Timeshift.

I set zfs_txg_timeout=90; this way my disks are totally idle 80-90% of the time. Disk writes look more or less identical (actually lower) to what I’m used to seeing with ext4 (using a basic dumb conky script to monitor reads/writes). Any real downsides to this, or am I using it as intended?
I can definitely notice that certain repetitive actions I perform are very quick; guessing the zfs caching is doing its thing nicely. 75% of my 16GB of RAM is cache and ready to roll, instead of all that RAM previously sitting there doing little to nothing.
Have been reading more and more about zfs; it’s my new favorite toy, and I wish I had discovered it years ago… though I hear it was a minor little shitshow back when ZoL and OpenZFS were separate things. Hope more distributions start supporting ZFS as a first-class citizen, at the very least Arch; there’s about a snowball’s chance in hell that distros like Debian ever would.
