ZFS - Questions

I want to run a Linux host and a Windows VM in KVM, and I am currently laying out my storage and backup plan. But before I start, I need a few questions answered:

  1. Can I start with a single disk zpool and add another disk for mirroring later on?
  2. Is it a good idea to store the OS in ZFS or should I consider a separate partition for that?
    2.1 Can I use my SSD for that purpose and use the rest of its capacity for caching? Are there limitations (full disk vs. partition)?

Thanks in advance!

Regarding your first question:
https://www.freebsd.org/doc/handbook/zfs-zpool.html

19.3 zpool Administration

19.3.2 Adding and Removing Devices

A pool created with a single disk lacks redundancy. Corruption can be detected but not repaired, because there is no other copy of the data. The copies property may be able to recover from a small failure such as a bad sector, but does not provide the same level of protection as mirroring or RAID-Z. Starting with a pool consisting of a single disk vdev, zpool attach can be used to add an additional disk to the vdev, creating a mirror. zpool attach can also be used to add additional disks to a mirror group, increasing redundancy and read performance.
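In practice that looks like the following; the pool name and device paths are just examples (using /dev/disk/by-id paths instead of /dev/sdX is generally recommended, so names survive reboots):

```
# Start with a single-disk pool
zpool create tank /dev/sda

# Later: attach the second disk to the existing one, turning the vdev into a mirror
zpool attach tank /dev/sda /dev/sdb

# Watch the resilver complete
zpool status tank
```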

Regarding your second question:

Do you mean the Linux OS or the Windows VM's OS image? I'll assume the Linux OS in what follows.

Yeah, go for it. Honestly, you can do either. ZFS builds pools out of virtual devices (vdevs), so you could partition the SSD into two partitions and create a vdev from each, then put the OS on one and the virtual machine storage on the other. If you wanted to mirror one part of the drive and not the other, you could do it that way. The handbook I linked says: "Pools can also be constructed using partitions rather than whole disks. Putting ZFS in a separate partition allows the same disk to have other partitions for other purposes. In particular, partitions with bootcode and file systems needed for booting can be added." So you could totally do that.
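As a rough sketch of the two-partition idea (device, pool, and partition names are made up, and a real bootable root-on-ZFS setup needs extra steps like a boot partition and bootloader support):

```
# Assuming the SSD was split into /dev/nvme0n1p1 and /dev/nvme0n1p2
zpool create rpool  /dev/nvme0n1p1   # host OS pool
zpool create vmpool /dev/nvme0n1p2   # VM storage pool
```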

When you say "use the rest of its size for caching", I assume you mean a ZIL and an L2ARC. How big those should be depends entirely on how much RAM you will have, because the two are connected.

I recommend reading this to find out about that: https://forums.freenas.org/index.php?threads/some-insights-into-slog-zil-with-zfs-on-freenas.13633/

Specifically the section "Tangent: Dangers of overly large ZFS write cache".

In this case, more is not always better.

More info on what those specifically are: http://www.45drives.com/wiki/index.php?title=FreeNAS_-_What_is_ZIL_%26_L2ARC

I only post that because the terminology trips people up. For example, the ZIL is not the ZFS write cache. The ARC (in your RAM) is the read cache, and the L2ARC is only useful if it's larger than your ARC, since it needs to contain everything that's in the ARC, and only if what it's caching isn't already coming from an SSD. In other words, having the same data on some other SSD and on the L2ARC partition of your OS SSD wouldn't really make much of a difference.
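If you want to see what the ARC is actually doing, ZFS on Linux exposes its counters under /proc (this path is specific to ZFS on Linux; other platforms differ):

```
# Current ARC size and target maximum, plus hit/miss counters
grep -E '^(size|c_max|hits|misses) ' /proc/spl/kstat/zfs/arcstats
```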

ZFS is complex. If you configure it by making everything large without considering how the parts interact, you can get badly degraded performance and probably won't understand why.

I want to add that on Linux, ZFS will always partition the disk, even if you ask it to use the whole disk. So to avoid confusion with the vdev naming (it will still show the whole-disk name in that case, even though the real target is a partition), I always partition the disk myself.
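A minimal sketch of that (sgdisk and the device name are just what I'd use; any GPT partitioner works):

```
# Wipe the disk and create one GPT partition spanning it, typed for ZFS
sgdisk --zap-all /dev/sdb
sgdisk -n 1:0:0 -t 1:bf01 /dev/sdb   # bf01 = Solaris/ZFS partition type

# Build the pool on the partition, so the vdev name matches reality
zpool create tank /dev/sdb1
```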

WRT the OS on ZFS: if you can, do it! Just make sure you use a distribution that supports it. Ubuntu is ace for running fully native ZFS systems; Debian can do it, but I have seen a lot of confusion about how exactly (it does not seem to be fully baked in). No remarks on other distributions.

Thank you! I am going to dig through all the links soon.

Could you give me a concrete example for my system regarding ARC / L2ARC? From what I've gathered so far, these might underperform if set too large. Is the ZIL safe to use with a consumer-grade SSD?

What I have is:

  • 1x WD Red 3TB (another one for mirroring later)
  • 1x Samsung 950 Pro 256GB
  • 32GB RAM (I am fine with using up to 16GB for caching)

Is the ZIL safe to use with a consumer-grade SSD?

So, that link I posted mentions something regarding that.

The ZIL always exists. What you mean is a SLOG, a Separate Intent Log.

By default, ZFS puts the ZIL on the pool itself, meaning it shares the same storage as your data. The link also mentions that the more activity you have, the slower ZFS gets, since the pool and the ZIL are competing for the same storage.

And that's why you'd want a SLOG. But partitioning the SSD so that the OS and the SLOG share one device would defeat the purpose if you mean the VM's OS, I think. If it's just the Linux host OS, it might be fine.

But regarding the SLOG on a consumer SSD: the SSD is non-volatile storage (meaning it won't lose data if you lose power), so it should be fine. Consumer SSDs are rated pretty highly in terms of writes before failure, so yeah.
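For reference, adding a SLOG (and an L2ARC) to an existing pool is one command each; the pool and partition names here are placeholders:

```
zpool add tank log   /dev/nvme0n1p1   # dedicated SLOG
zpool add tank cache /dev/nvme0n1p2   # L2ARC
```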

Regarding ARC/L2ARC: on FreeNAS, the default is to let caching use up to 80% of RAM. I believe FreeNAS configures OpenZFS to do this and it may differ from the stock default, but OpenZFS on Mac does something similar, so it may be the general default and not just FreeNAS.

I'd read the Virtual machines section on this page before deciding whether you should bother with a SLOG: http://open-zfs.org/wiki/Performance_tuning#Virtual_machines

My preference after reading is to install the host OS inside the ZFS pool and use the entire SSD for L2ARC and SLOG. Windows will definitely be installed in a zvol for the best performance.
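Something like this is what I have in mind for the zvol (name, size, and volblocksize are placeholders I'd still have to tune):

```
# Sparse 100GB zvol for the Windows guest
zfs create -s -V 100G -o volblocksize=8K tank/win10

# KVM/libvirt gets the resulting block device
ls /dev/zvol/tank/win10
```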

From what I've gathered so far, the high RAM usage isn't much of a problem, since ZFS frees it when other applications need it, but maybe I'm wrong about this. If so, is there a way to limit RAM usage?

There is one more thing I need more info about: the ZFS write cache and transaction groups.
So, the default size of a transaction group would be 4GB in my case. How can I find out whether that's too large, so I can avoid filling up another group before the first one has been flushed to storage?
And how do I determine the size of my SLOG partition?

ZFS should be freeing RAM as it's needed.
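It does. But since you mentioned capping it at 16GB: on ZFS on Linux you can limit the ARC with the zfs_arc_max module parameter (the byte value below is 16GiB):

```
# Persistent: put this in /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=17179869184

# Or set it at runtime, as root
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
```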

the default size of a transaction group being 4GB in my case

You can tell whether that's too large by seeing how long ZFS takes to commit a transaction group to the pool, i.e. how long the write to the pool takes.

How long it takes for another group to be full is how long it takes you to fill 4GB of SSD space, essentially.

So it's a matter of input speed vs. output speed. You want output to be faster than input; that prevents you from filling a second transaction group before the first is fully written. You can't just run a benchmark and use those numbers directly, though, as ZFS adds a bit of overhead (metadata and such).

Whether you look at it as writes flowing into the SLOG or as the SLOG flushing out to the zpool, the sustainable rate is capped by whichever device is slower. That will almost certainly be the zpool.

So I would create the zpool on its own, benchmark both it and the SLOG, and figure out which is the bottleneck (again, most likely the zpool since it's on an HDD). Then do the math to see how fast the second transaction group could fill versus the first being written.
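For a rough ballpark (paths and the partition name are placeholders, and the raw-device write is destructive, so only aim it at the empty future SLOG partition):

```
# Pool write throughput: write 4GB to a dataset on the HDD pool
# (use /dev/urandom instead of /dev/zero if compression is enabled,
#  or the number will be meaningless)
dd if=/dev/zero of=/tank/bench.tmp bs=1M count=4096 conv=fsync
rm /tank/bench.tmp

# Raw SSD partition throughput (DESTRUCTIVE on that partition)
dd if=/dev/zero of=/dev/nvme0n1p2 bs=1M count=4096 oflag=direct
```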

http://dtrace.org/blogs/ahl/2014/08/31/openzfs-tuning/

That URL covers ways to benchmark OpenZFS performance.

From the forum thread I linked earlier explaining the SLOG and ZIL:

This gives us some guidance as to what the size of a SLOG device should be, to avoid performance penalties: it must be large enough to hold a minimum of two transaction groups.

If your transaction groups are 4GB, your SLOG needs to be 8GB. Having it larger may help but is probably not worth dealing with. Having it smaller will prevent you from having two transaction groups' worth of space and thereby cause I/O to be paused when one is full.

So I would do 200GB for L2ARC, 8GB for SLOG, and leave 48GB unpartitioned as overprovisioning for synchronous I/O (as mentioned in the General Recommendations section of the OpenZFS performance tuning page I linked earlier).

Following that paragraph:

The choice of 4GB is somewhat arbitrary. Most systems do not write anything close to 4GB to ZIL between transaction group commits, so overprovisioning all storage beyond the 4GB partition should be alright. If a workload needs more, then make it no more than the maximum ARC size. Even under extreme workloads, ZFS will not benefit from more SLOG storage than the maximum ARC size. That is half of system memory on Linux and 3/4 of system memory on illumos.

So based on that, don't make the SLOG larger than the maximum ARC size, as it's pointless. And the maximum ARC size on Linux is apparently 50% of RAM, i.e. 16GB with your 32GB.
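As a sketch of the 200GB/8GB/48GB split from above (sgdisk and the device name are assumptions; the overprovisioning is simply the space left unpartitioned):

```
sgdisk -n 1:0:+8G   -t 1:bf01 /dev/nvme0n1   # SLOG
sgdisk -n 2:0:+200G -t 2:bf01 /dev/nvme0n1   # L2ARC
# the remaining ~48GB stays unpartitioned as overprovisioning
```

Then attach the partitions to the pool with zpool add log and zpool add cache, as above.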

Alright, I think I am ready to start building my zpool. Thank you for your time and all this information, @Vitalius!
