The Pragmatic Neckbeard 2: ZFS - SSD speeds at HDD capacities

AKA: The Pragmatic Neckbeard 2: Talk ZFS to me

In this installment, we're going to talk ZFS on Arch Linux. ZFS is your friend with the huge house. ZFS is your friend with the fast car. ZFS is also your friend who's a bit high-maintenance. ZFS takes a bit of work, but the benefits (in my opinion) outweigh the effort required for someone who's on a budget.

Why isn't ZFS default then?

ZFS is a bit of a bear to work with because of the way it has to be installed on your system. ZFS (CDDL) and Linux (GPLv2) have licenses that are widely considered incompatible, which discourages distributions from shipping pre-built ZFS kernel modules. That doesn't mean you can't build the ZFS modules yourself and load them into your kernel; there's a difference between distributing binaries and distributing a script that lets you combine them yourself. From my research, it appears that compiling ZFS for your own kernel for personal use does not put you in breach of Linux's GPLv2 license (I am not a lawyer, correct me if I'm wrong). That, among other reasons, is why ZFS isn't the default.

Cool, how can I get it?

Getting ZFS is actually quite easy on Arch. Start by making sure your installation is up to date, then install the AUR package zfs-dkms.
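
If you don't use an AUR helper, the manual build looks roughly like this; this is just a sketch of the standard AUR workflow, and keep in mind that zfs-dkms depends on other AUR packages (zfs-utils, and at the time of writing spl-dkms), which would need to be built the same way first:

$ git clone https://aur.archlinux.org/zfs-dkms.git
$ cd zfs-dkms
$ makepkg -si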

By now, you must be thinking: "Wow, that was easy. I thought you said this was going to be a pain?!"

That's true. I did say that. Let's talk about updates. At the time of writing, Linux, DKMS and ZFS don't always play nicely together, and every time your kernel updates, DKMS can fail to rebuild the zfs modules, which can cause issues with your system. The fix is simple: every time you install a new kernel or upgrade your existing one, run sudo dkms autoinstall so that DKMS recompiles and installs the ZFS and SPL modules for the new kernel.
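
To check that the rebuild actually happened (and to load the module without rebooting), something like this works; the exact output of dkms status varies by version:

# dkms autoinstall
# dkms status
# modprobe zfs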

Awesome! Can I mkfs.zfs yet?

Nope. Actually, you will never mkfs.zfs. ZFS doesn't work that way.

Let's talk about ZFS on a conceptual level before we get into functional demonstrations.

How ZFS works

Think of ZFS not as a filesystem, but as a package of tools for dealing with storage. At its core, ZFS accomplishes four basic things: it provides a filesystem, snapshot management, volume management and pool management.

Think of ZFS like a beefed-up software RAID without the usual downsides of software RAID. Let's say we have ten 2TB drives. We also have two 128GB SSDs. From there, we can build an array with 16TB of usable storage by creating the equivalent of a RAID 6 array, which ZFS calls raidz2. This means we sacrifice two drives' worth of capacity, 4TB, for parity, and in exchange any two drives in the array can fail without losing any data.
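
As a sketch, creating that hypothetical ten-drive raidz2 pool would look something like this (the pool name and device names are made up purely for illustration):

# zpool create bigtank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf \
                              /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk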

Now, we haven't allocated our two SSDs yet. Let's do that now. ZFS allows you to use your SSDs to cache frequently accessed data and to buffer data destined to be written to the array.

As I mentioned in a previous article, ZFS doesn't suffer from the "write hole" issue that plagues many RAID solutions. This is because ZFS has a feature called the ZFS Intent Log (ZIL for short). Essentially, this feature stores intents (and the data therein) to write data to the array, and ZFS then moves that data into the cold storage array. The ZIL normally resides on random free sectors of the array, but if we specify a dedicated "log" disk for this, in this case, one of our SSDs, we can increase our effective write speed to that of the SSD. This gives you the benefit of having the speed of an SSD and the capacity of your spinning media.

Corrections: It's been brought to my attention that the paragraph above may be misleading. To clarify how the ZIL (or SLOG) works, it's best to read through this thread on the FreeNAS forums. Thanks @Vitalius!

The cache device (known as the L2ARC) works in a similar way: it keeps copies of the most frequently accessed data on the array so that data can be served back more quickly. Tracking what's in the cache consumes some RAM, but it's worth it in the end, because between the cache and the log device, your most frequently accessed data gets SSD speed for both reads and writes.
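
Both of these roles can be attached to an existing pool at any time. Assuming our two hypothetical SSDs show up as /dev/sdl and /dev/sdm, adding them to the sketch pool from above would look like:

# zpool add bigtank log /dev/sdl     # SLOG device for the ZIL
# zpool add bigtank cache /dev/sdm   # L2ARC read cache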

Now, let's say we've nearly filled up our 16TB of storage and want to add more. No problem! The way zpools work, you can have multiple raidz vdevs within one pool and it will still present as one filesystem. This essentially means that ZFS does on its own what you'd otherwise need Linux software RAID (mdadm) and LVM in conjunction to accomplish, but that's not the end of it.
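
Growing the pool is just another zpool add. Sticking with the hypothetical pool from above, bolting on a second six-drive raidz2 vdev would look roughly like:

# zpool add bigtank raidz2 /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds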

Let's say you've got a dataset in your ZFS pool that you're going to make some major changes to, but you aren't sure you'll want to keep them in the end. ZFS will allow you to take snapshots and roll back to them, or even mount a snapshot separately from the main dataset.

Snapshots in ZFS are awesome, so let's drill down and explain in detail how they work. For the purposes of this example, let's say we have a dataset named home and a brand new snapshot named test. So it will look like:

tank                20G
tank/home           20G
tank/home@test       8K

As you can see, the snapshot test has no difference in data from the main dataset and is taking up only 8K of additional space. This is the magic of copy-on-write. When a file in the home dataset is modified or deleted, ZFS writes the new data to a different physical location on disk (as is standard with copy-on-write) and updates the dataset's pointers to the new location. It then sees that the snapshot still references the old blocks, so it doesn't free that space. This is why snapshots grow over time: they take ownership of the original blocks of their parent dataset as data changes.

ZFS also supports sending and receiving data, both as full streams and as incremental ones. This means you can take daily snapshots of a dataset and send just the changes to another computer with a ZFS pool, over standard tools like netcat or ssh.

It's important to note that ZFS can be quite rigid in some respects. When it comes to physical disk configuration, ZFS trails behind its competitor, BTRFS, in that the RAID level and disk layout cannot be reconfigured without destroying and recreating the zpool.

Using ZFS

Now that we've got a bit of knowledge about how ZFS works, we can move on to the fun stuff. Let's start by outlining my setup, so that those who are following can make proper adjustments.

I'm using two 3TB Western Digital Reds and a single 120GB Samsung 850 Pro for the array. I'm going to be over-provisioning the SSD and only using 100GB of it, which should, in theory, let wear leveling spread writes more evenly. I've partitioned the SSD into two partitions: one 80GB partition to be used as the ZFS cache and one 20GB partition to be used as a dedicated log (SLOG) device for the ZIL. I'm going to use the two 3TB drives striped (the equivalent of RAID 0) because I have no mission-critical data on these drives. If you're using two drives and intend to store important data on them, put them in a mirror configuration instead. Remember, you can always add storage at a later point.

So, let's look at my disks:

# fdisk -l
Disk /dev/sdc: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: D30AA847-F3AB-4B45-A69F-D522C16DB1B0

Device          Start        End    Sectors  Size Type
/dev/sdc1        2048 5860515839 5860513792  2.7T Solaris /usr & Apple ZFS
/dev/sdc9  5860515840 5860532223      16384    8M Solaris reserved 1


Disk /dev/sdd: 119.2 GiB, 128035676160 bytes, 250069680 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x7f8ac676

Device     Boot Start       End   Sectors   Size Id Type
/dev/sdd1  *     2048 250067789 250065742 119.2G 83 Linux


Disk /dev/sdb: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 6495F007-EFC5-9D4C-95B2-499ABBFABC9E

Device          Start        End    Sectors  Size Type
/dev/sdb1        2048 5860515839 5860513792  2.7T Solaris /usr & Apple ZFS
/dev/sdb9  5860515840 5860532223      16384    8M Solaris reserved 1


Disk /dev/sda: 119.2 GiB, 128035676160 bytes, 250069680 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x2eea4214

Device     Boot    Start       End   Sectors Size Id Type
/dev/sda1           2048  41945087  41943040  20G 83 Linux
/dev/sda2       41945088 209717247 167772160  80G 83 Linux

As you can see, I'm using a second 120GB SSD for my OS partition. I'm doing this so that if something goes terribly wrong with the ZFS array, I can still boot my PC and log in as root (or any other user whose $HOME isn't under /home/).

Now, my ZFS SSD is showing up as /dev/sda, and my two 3TB devices are /dev/sdb and /dev/sdc. This makes things easy. Let's create our pool.

# zpool create -o ashift=12 -O mountpoint=/media/tank tank /dev/sdb /dev/sdc log /dev/sda1 cache /dev/sda2

The -o ashift=12 option tells the pool to use 4096-byte sectors instead of the 512-byte default (ashift is the base-2 exponent of the sector size, so 2^12 = 4096). This improves performance significantly on my system because the drives' physical sector size is 4096 bytes, as the fdisk output above shows.

The -O mountpoint=/media/tank option tells ZFS to mount the pool's root dataset at /media/tank. If this option is not specified, you'll wind up with a new folder in / with the same name as your zpool. As you can see, I've named my pool tank.

Let's have a look at how tank's set up now.

# zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:
	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  sdb       ONLINE       0     0     0
	  sdc       ONLINE       0     0     0
	logs
	  sda1      ONLINE       0     0     0
	cache
	  sda2      ONLINE       0     0     0
errors: No known data errors

You should see something like this. Periodically checking in on this output will show you when you have issues with a disk in the array.

One last thing we need to do to get ZFS working after a reboot is enable zfs.target and zfs-import-cache.service so that everything is configured properly on boot.

# systemctl enable zfs.target
# systemctl enable zfs-import-cache.service

ZFS has two main commands that you'll use while working with a pool, and the split is very logical: if you're operating on the pool itself, you use the zpool command; if you're operating on the data within the pool, you use the zfs command. You can read the manpages for each if you're interested in finding out more. Be warned that you can easily spend an hour in the manpages when working with ZFS.
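
To give you a feel for the split, here are two examples: a scrub (a pool-level integrity check) goes through zpool, while setting a property on a dataset goes through zfs:

# zpool scrub tank
# zfs set compression=lz4 tank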

Creating datasets

Now that we've got roughly 6TB of storage available to use, let's mount our /tmp on it.

# zfs create -o mountpoint=/tmp -o setuid=off -o sync=disabled -o devices=off tank/tmp

This will create a new zfs dataset, try to mount it at /tmp and set some security options on it.

-o sync=disabled will improve the performance of the /tmp filesystem, at the cost of that dataset's integrity in the event of a sudden shutdown. Keep in mind that this won't affect the integrity of the rest of the pool, just /tmp.
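
If you ever want to double-check or undo that choice, the property can be read and flipped back at any time:

# zfs get sync tank/tmp
# zfs set sync=standard tank/tmp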

If you're using Arch Linux (or derivatives like Manjaro or Antergos), like I recommended in the previous article, you'll need to mask systemd's tmp.mount.

# systemctl mask tmp.mount

After a restart, your system will mount the ZFS dataset on /tmp instead of the ramdisk that is used by default. One caveat: ZFS won't mount a dataset over a directory that already has contents, so if ZFS fails to load and something populates /tmp before it comes back, the dataset won't be mounted there.

I also have my /home on a ZFS dataset, just to keep user data separate from the system. I'll go through what's required to do that. First, log out of your normal user and log in as root; we need to make sure that no files in /home are open.

The first step is to move /home somewhere else, keeping it on the same drive.

# mv /home /home-old

If you're only logged in as root and you get an error about files being in use, try restarting your PC.

Now we can create our home dataset:

# zfs create -o mountpoint=/home tank/home

ZFS will automatically mount the dataset at /home, and we can begin copying our data back (note the trailing slashes, which make sure hidden dotfiles get copied too):

# rsync -av --progress /home-old/ /home/

Once all our data is synced across, we can delete the old home directory and take a snapshot of the current state of our home directory, for safekeeping.

# rm -rf /home-old
# zfs snapshot tank/home@initial-copy

We've now successfully set up our zpool.

Let's have a look at our filesystem usage:

# zfs list -t all
NAME                     USED  AVAIL  REFER  MOUNTPOINT
tank                    80.9G  5.19T    96K  /media/tank
tank/home               21.8G  5.19T  21.8G  /home
tank/home@initial-copy      0      -  21.8G  -
                        -- snip --
tank/tmp                3.06G  46.9G  3.06G  /tmp

This will show you a list of all datasets, zvols and snapshots that are on pools in your system.

Let's say you want to back up your home directory. ZFS makes that easier than ever. You've already got a snapshot, so I'll just send that to my FreeNAS server, for example. Create a new dataset on the remote machine to hold your backups first; setting gzip compression on that dataset will help keep the size of the backups down.
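
On the receiving side, that preparation might look like this (assuming the remote pool is named storage, to match the receive command below; the received dataset will inherit the compression setting from its parent):

# zfs create -o compression=gzip storage/backups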

# zfs send tank/home@initial-copy | ssh 192.168.1.50 -l root "zfs receive -F storage/backups/home"

This command will send all the data, over ssh, to the backup dataset on the remote server. Keep in mind that you'll get at most ~100MB/s over gigabit ethernet, so this can easily take a couple of hours. That's why we have incremental transfers.

Let's say we've worked for a few days and we've accumulated more data:

# zfs list -t all
NAME                     USED  AVAIL  REFER  MOUNTPOINT
tank                    83.0G  5.19T    96K  /media/tank
tank/home               23.9G  5.19T  23.8G  /home
tank/home@initial-copy  67.1M      -  21.8G  -
                        -- snip --
tank/tmp                3.06G  46.9G  3.06G  /tmp

Now, as you can see, the snapshot has 67M of space unique to it, meaning that much of the data it references has since been changed or deleted, and about 2G of additional data has been created since the snapshot was taken. Let's snapshot again and send the changes.

# zfs snapshot tank/home@another-snapshot

Now that we've got our next snapshot, let's send only the changed data to reduce the network usage. This is a bit more complicated, because we have to reference both snapshots: the snapshot that both sides already have, and the snapshot we want the remote server to receive. This allows ZFS to calculate which blocks to send over to make both sides match.

# zfs send -i tank/home@initial-copy tank/home@another-snapshot | ssh 192.168.1.50 -l root "zfs receive -F storage/backups/home"

So, the argument to -i is the older snapshot and the final argument to send is the latest snapshot, if that wasn't obvious from the command above. You'll notice that the transfer happens significantly faster because it's transferring less data overall. Whenever possible, use incremental transfers when backing up data over the network.

What happens if you want to get data back from a snapshot after the dataset has since changed? We've got two different ways to do this. The first way is the nuclear option: rolling back. ZFS allows you to roll back an entire dataset to a previous point in time. This is simple and effective, but it has a drawback: it destroys any snapshots and data created between the target snapshot and the dataset's current state. It's pretty messy, but it can be done if you don't care about that newer data.
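
For reference, a rollback to the snapshot from earlier looks like this; the -r flag is what destroys any snapshots newer than the target, and ZFS will refuse to roll back past them without it:

# zfs rollback -r tank/home@initial-copy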

The second option is a bit more tedious, but is a cleaner method: mounting the snapshot and copying data. You can do this with a simple mount command:

# mount -t zfs tank/home@initial-copy /mnt/snapshot

It can be unmounted with:

# umount /mnt/snapshot

Now you can traverse the snapshot and copy out the data you need. This approach is cleaner in many ways, but it's a bit more time consuming, so your preferred method will depend heavily on the situation.

So far we've covered most of the basics of ZFS, but what if I decide I don't want a dataset or snapshot anymore? That's easy: zfs destroy will remove snapshots and datasets alike:

# zfs destroy tank/home@initial-copy

This will destroy the snapshot without confirmation, so be careful when using this command. It also won't operate on a dataset that still has snapshots or children unless you pass -r, which tells zfs destroy to recursively destroy the dataset's snapshots and descendants (-R goes further and also destroys any clones that depend on them).
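
If you're about to destroy something recursively, it's worth doing a dry run first; -n tells zfs destroy not to actually do anything and -v lists what would be destroyed:

# zfs destroy -nvR tank/home

Drop the -n once you're happy with what it reports.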

ZFS volumes

Now that we've spent about 3,000 words talking about regular datasets, let's talk about virtual block devices, or volumes, which ZFS calls zvols.

Creating a zvol is very similar to creating a dataset. All you need to do is specify -V followed by the desired size of the volume. You can also add -s to create a "sparse" volume, which doesn't take up space on the pool until it's actually needed; this is known as thin provisioning. To create a thinly provisioned zvol, you can use the following command:

# zfs create -s -V 50G tank/windows-vm

This will create a 50GB zvol in the pool tank, and it will show up as /dev/zvol/tank/windows-vm. You can format and mount it like any other physical disk, or use it to back a virtual machine, which we'll go over in part 4.
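
As a quick sketch with a hypothetical zvol named tank/scratch, using one as a plain block device looks like this:

# zfs create -V 10G tank/scratch      # hypothetical example zvol
# mkfs.ext4 /dev/zvol/tank/scratch
# mkdir -p /mnt/scratch
# mount /dev/zvol/tank/scratch /mnt/scratch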

zvols can use snapshots in the same way that a normal dataset can, so we won't go into it again; just know that you can snapshot and roll back exactly as you would with a normal dataset.

Before you go hog-wild and start making zvols all over the place, it's critically important that you understand block sizes and how they fit in with ZFS. ZFS zvols default to an 8K block size. Linux filesystems (ext4, xfs, btrfs) detect this and align themselves to it automatically, but NTFS and other Windows filesystems don't. You can fix this with the -b option, which specifies the block size in bytes. The default NTFS cluster size is 4K, so using -b 4096 will result in a properly aligned NTFS partition on your zvol. If you're creating the partition from the Windows installer GUI (i.e. you're setting up an OS drive), you'll want to do this. Otherwise, leave off the -b option and format the zvol from Disk Management with an NTFS cluster size of 8192. If the block sizes don't line up, every misaligned write turns into a read-modify-write cycle and performance suffers badly.
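
So, if I were creating the 50GB zvol from above fresh as a Windows OS disk, the command would look the same as before, just with the block size pinned to 4K:

# zfs create -s -V 50G -b 4096 tank/windows-vm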

Conclusion

I hope you got something out of this article. As I mentioned before, I'm not much of a writer, so I'm working on improving my skills, which is one reason why I'm doing this series in the first place. As I get better, I'll go back and fix the earlier guides and make sure that they flow better and are easy to understand.

I also plan on releasing supplemental content on how ZFS can be tuned for different workloads, but that's a ways out. I'm still working out certain kinks on my own system.

Feel free to check out my new-new blog here

P.S. I fell quite ill this weekend, so I wasn't able to finish or proof this article as well as I'd have liked, since the illness is affecting my coherence in a negative way. I should be good by next weekend though.


Number 3 of the neckbeard series on the books yet, or not?

As I mentioned in a previous article, ZFS doesn't suffer from the "write hole" issue that plagues many RAID solutions. This is because ZFS has a feature called the ZFS Intent Log (ZIL for short). Essentially, this feature stores intents (and the data therein) to write data to the array, and ZFS then moves that data into the cold storage array. The ZIL normally resides on random free sectors of the array, but if we specify a dedicated "log" disk for this, in this case, one of our SSDs, we can increase our effective write speed to that of the SSD. This gives you the benefit of having the speed of an SSD and the capacity of your spinning media.

Hmm, this isn't entirely correct. Or at least it gives the wrong impression.

What is SLOG?

The hardware RAID controller with battery backed write cache has a very fast place to store writes: its onboard cache, and usually all writes (not just sync writes) get pushed through the write cache. A ZFS system isn't guaranteed to have such a place, so by default ZFS stores the ZIL in the only persistent storage available to it: the ZFS pool. However, the pool is often busy and may be limited by seek speeds, so a ZIL stored in the pool is typically pretty slow. ZFS can honor the sync write request, but it'll get slower and slower as the pool activity increases.

The solution to this is to move the writes to something faster. This is called creating a Separate intent LOG, or SLOG. This is often incorrectly referred to as "ZIL", especially here on the FreeNAS forums. Without a dedicated SLOG device, the ZIL still exists, it is just stored on the ZFS pool vdevs. ZIL traffic on the pool vdevs substantially impacts pool performance in a negative manner, so if a large amount of sync writes are expected, a separate SLOG device of some sort is the way to go. ZIL traffic on the pool must also be written using whatever data protection strategy exists for the pool, so an in-pool ZIL write to an 11 drive RAIDZ3 will be hitting multiple disks for every block written.

If your qualm is "But that's FreeNAS." both FreeBSD and Linux use OpenZFS. Might be different versions, but both functionally work the same way.

I recommend anyone who actually intends to use ZFS read that entire post. Tons of useful information regarding ZFS can be found on the FreeNAS forums since it's their go-to file system.

Fun Fact: Enabling a SLOG can make your system slower, not faster, if it's not needed, even if it's an SSD, just because of how RAM usage works out with that.

Another "fun fact" is that ECC RAM is definitely something you want when using any checksum based file system that protects against bit rot. The reason why is simply that if your RAM has any kind of error regarding the checksums during a Scrub (Where ZFS checks for data integrity violations like bit rot), it's like putting your data through a paper shredder.

ZFS will rewrite the data transparently, but the data was fine originally, and is now overwritten with a corrupt version. ZFS will say everything is fine and you won't realize there's a problem until you try and access corrupted files.

Of course, the odds of that are pretty low, but it's the type of thing you don't think about until it happens to you.

For anyone who wishes to read more in-depth about how ZFS works, here:

http://doc.freenas.org/9.3/freenas_intro.html#zfs-primer

And there are more links at the bottom of that page to more info regarding ZFS specifics.


Yes and no. I've been extremely busy lately; I'm finding myself working 70-hour weeks. I've been wanting to sit down and hammer it out, and I'm hoping to have some down time this weekend to finally do it. (fuck yeah, holidays!)

I agree that ECC is nice, but it's not 100% necessary. I've been running ZFS for years on devices without ECC and I've never had a single checksum error. I'll follow this up by linking you to this article. Not to say it's not helpful, but if you're on a budget, you're fine.

The moral of the story is to back up your data.

I may have been a bit misleading. It's probably more accurate to say that you've got the potential performance of the SSD, but it all comes down to RAM. I'll write an addendum when I have some time.


Thanks for the update. Good to hear from you on this. :)

I'm not about to ignore feedback of any sort. I'm doing this for the benefit of the forum's users, so I'm glad you caught the issues. I'll be adding a part 2.5 at some point in the future talking about zfs tuning and data integrity.

To be honest, I'm so busy with work that when I had a free hour or so on Sunday, I totally forgot what I do for fun. We're hiring some consultants to ease the load, so in the next few weeks, I should be ready to come back in a more permanent fashion.


Great article! I am going to follow your blog since it's almost exactly what I would like to do.

Say, can I use a partition on my SSD for the OS and the rest of it as ZIL / L2ARC? So 80GB OS, 120GB Cache and some overprovisioning. Any pitfalls to consider if I decide to do that?


This is what I've heard: ZFS uses the RAM it's given. It seems like people look at the RAM usage and go, "uh oh, it's using all the memory it can handle, it must need more." Better to let it cache in RAM than to slow it down with an SSD. Or at least that's how I remember it. Correct me if I'm way off.

Related, I have a C2D laptop running Ubuntu on ZFS and it only has 3GB of DDR2. No slowdowns at all. Also have a server doing the same with 8GB. No performance problems there either.

Could ZFS have kernel support for the newer, i.e. v5000+, non-Oracle versions? There is that whole forked-version mess. I figured that was the version packaged when it's installed from common repos, but I've never checked.

Sorry, meant to quote that whole paragraph, but it isn't working on my phone right now for some reason. Dunno.


Yes, but you'll get better performance if you don't partition it.

Glad you enjoyed! I'll be releasing the next one next week.


ZFS on Linux always partitions the disk anyway, though. I see no reason to advise against it.

Great write-up to get people started. I have a few quick remarks.

I think your license paragraph is just an opinion. Ubuntu ships ZFS, and while this of course generated drama, the core question remains whether shipping a ZFS module linked against the kernel breaches the license. I also don't think it is possible to compile ZFS into the kernel itself.

"-o sync=disabled" has as a consequence no /tmp data will ever be written to your SSD SLOG devices. Just mentioning this. I have no idea how much sync data a /tmp directory usually holds. I mount my /tmp as tmpfs, this is still my recommendation for most users.

A quick tip: you can pipe zfs send/receive through 7z and compress the stream a lot before it hits the network (if your bandwidth is not that big). You can do this:
zfs send -Rv dataset@snapshot | 7z a -mx=9 -si -so | ssh host "7z x -si -so | zfs recv -F destination"

I'm talking about splitting it into multiple partitions that are configured on a single device, i.e.:

/dev/sdc1 - log
/dev/sdc2 - cache

This can cause performance issues if you're reading and writing to the same pool simultaneously.

Thanks! Always happy to hear responses.

I could have sworn I saw multiple sources saying that Ubuntu shipping ZFS the way it does constitutes a breach of the GPL. If it's still an open question, I'll throw in an addendum.

It really depends on your resource requirements for /tmp. Particularly on Arch (when compiling from the AUR), certain packages have sizeable space requirements; I've had my /tmp fill up when using an 8GB tmpfs. I do agree with you, though: if you can avoid using ZFS for /tmp, it's probably better.

I maybe should have put it in there, but honestly, this was more of an overview. There's so much more I could cover. I may do a series on just ZFS some day. (assuming I ever finish this series)

I'm preparing a small series as well :-) Get the word out you know.


Only 12 likes? Thanks OP for taking the time to post something like this.


Eh, it's not very well written and honestly, not everyone wants to spend 20 minutes reading 3500 words on something they probably don't use.

@sam_vde Tag me when you post the entries. I'm not as active on the forum as I'd like to be, but I'm definitely interested in reading it!