Linux OS disk layout recommendations, RAID?

I’m putting together a small Debian-based storage system and trying to come up with a sensible OS disk layout.

The storage portion will be just 2 x 8 TB WD Reds with ZFS on both of them.

The primary disk for the OS (sda) will be a 2.5" Samsung 860 Pro drive. I actually have two of these available and could do RAID1 (software) or just stick with a single disk. I’m wondering what a good OS partition scheme would be?

Just use one. When SSDs wear out they usually just go read-only, and at that point you can clone the drive to a new one. SSDs also have error correction built in, so I wouldn’t overthink it further.
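For what it’s worth, that clone can be as simple as a raw block copy onto an equal-or-larger replacement. A rough sketch, with placeholder device names (double-check them before running anything like this):

    # copy the old OS SSD (sda) wholesale onto its replacement (sdb)
    dd if=/dev/sda of=/dev/sdb bs=1M conv=noerror,sync status=progress
    sync

If the old drive already has unreadable sectors, ddrescue is the friendlier tool for the same job.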

I’d use a ZFS mirror.

  1. Because redundancy
  2. ZFS can read from both sides of a mirror, which improves read performance (quick sketch of the setup below).
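For illustration, a rough sketch of creating such a mirror and checking it; the pool name and device IDs are placeholders, and a real root-on-ZFS install involves more steps than this:

    # build a two-way mirror from the two SSDs (placeholder device IDs)
    zpool create -o ashift=12 ospool mirror \
        /dev/disk/by-id/ata-Samsung_SSD_860_PRO_AAAA \
        /dev/disk/by-id/ata-Samsung_SSD_860_PRO_BBBB

    # confirm the layout, then watch reads being spread over both sides
    zpool status ospool
    zpool iostat -v ospool 5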

Have you considered ZFS on root with the SSD as a log or L2ARC device? I’m not sure how well that works (if at all) on Linux, but if it’s feasible, you’ll get the benefit of SSD-accelerated bulk storage. But I’m no ZFS-on-Linux guru.

If it’s a storage-only system (or otherwise a non-interactive box), maybe consider FreeNAS: boot from USB and use all of your mentioned disks for data on ZFS.

But I’m guessing it’s not basically headless?

What is your use case?

edit:
SSDs do occasionally fail without just going read-only. That might be how the NAND fails, but there are other failure modes, uncommon as they may be.

Good point; that can happen with a burned-out controller, for example.

I had a controller that burned out. $2400 to recover a 128GB disk at drivesavers.

Also, this is why you back up your systems.

My “use case” is simple storage, headless. I’m moving from an external USB drive that I connect “sometimes” to move data to/from, to a more available variant of that. The Debian system will be connected to my UPS. Since someone has already mentioned “backup”: I encourage anyone passing through this thread who feels the need to grandstand about RAID not being backup to please take that contribution elsewhere. You’ll just have to trust that my backup situation is under control, and keep your fingers from crapping up this thread on that topic.

In my mind, the OS disk would only hold the OS bits and a handful of relatively static config files.

Recovery of the OS disk would consist of reinstalling the OS on a replacement and then dropping my handful of config files back into place.

I was considering a RAID boot config as a path of less hassle when the occasional OS disk failure happens.

Maybe a RAID boot/OS disk is added complexity with not much added benefit?

What about partitions and layout? Are you using RAM disk(s) to minimize writes to the SSD so it lives longer? My little box has 32 GB, so there is headroom to consider doing that.
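For example, what I had in mind is something along these lines in /etc/fstab (the size is just a guess for my box):

    # keep scratch files in RAM instead of writing them to the SSD
    tmpfs  /tmp  tmpfs  defaults,noatime,size=2G  0  0

/var/log could get the same treatment, at the cost of losing logs on every reboot.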

ZFS will use RAM automatically to minimise disk activity, both by serializing writes (e.g., lots of small random writes hit RAM first, where they can be turned into a single big stripe write to the disks) and by acting as a read cache. There’s no need to set up a RAM disk; ZFS will grab and use the RAM for that sort of thing itself.
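If you want to see or bound that behaviour rather than just trust it, the ARC exposes its stats through the kernel, and its ceiling is tunable. A sketch (the 8 GiB cap is just an arbitrary example):

    # current ARC size and its ceiling
    grep -E '^(size|c_max)' /proc/spl/kstat/zfs/arcstats

    # optionally cap the ARC at 8 GiB via a module option (takes effect on module reload/reboot)
    echo 'options zfs zfs_arc_max=8589934592' > /etc/modprobe.d/zfs.conf

The same parameter can also be poked at runtime through /sys/module/zfs/parameters/zfs_arc_max.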

ZFS really requires decoupling your thinking from “I have these disks and I plan to use them for these purposes”; it doesn’t work like a regular filesystem and is a bit more than a traditional FS. It integrates a logical volume manager and various other features, so it can make better use of resources (for storage purposes) than a traditional filesystem + RAID controller + LVM/partition setup.

The concept with ZFS is that all system resources are used for storage as the kernel deems appropriate; it has a better idea of what is going on and what demands caching, etc. than the user normally does (and it can work at the block level rather than the file or folder level).

The more spindles/cache/RAM you can throw at a zpool, the better the system can optimize its use of those resources. Put another way, anything you take away from it for a RAM disk, a separate storage pool, etc. reduces its ability to make use of those resources, which reduces performance.

For what you describe as your use case (headless storage), I’d seriously recommend FreeNAS with a ZFS mirror for your spinning rust, and look into using the SSD as an L2ARC (read cache) and/or LOG (write cache) device. FreeNAS is intended to be booted from a USB stick (or other separate storage) so that it is fully decoupled from your storage media. In the event your FreeNAS box dies, you can plug all your drives into any ZFS-capable box (in any order, etc.) and get the data back. You also get performance monitoring, a web GUI for ZFS, email alerts for disk failures, etc. that you probably won’t get from a roll-your-own ZFS-on-Linux setup.
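That “plug your drives into any ZFS-capable box” recovery really is just a pool import; a sketch with a placeholder pool name:

    # scan attached disks for pools that can be imported
    zpool import

    # import the pool found above, referencing disks by stable IDs
    zpool import -d /dev/disk/by-id tank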

It’s not Linux, but from the sounds of it you’re planning to basically reinvent FreeNAS on Linux if you go down that route…

edit:
All that said, unless you’re on some high-speed network, the SSD may be overkill. Gigabit Ethernet won’t keep up with a pair of decent modern hard drives for large (single-stream) reads or writes. If your workload involves lots of small random reads and writes, though (or you have many users), the SSD will help.

I’d maybe build it without the SSDs (keep them aside until you finish testing) and see if the system is fast enough without them. L2ARC or LOG devices can be added afterwards without needing to reinstall or destroy your pool. That way, if you find they aren’t required, you can use them for something else :slight_smile: If you build with them from the outset, you’ll never know whether you could have happily lived without them.
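Adding (and removing) them later really is a one-liner each way. A sketch, assuming a pool called tank and placeholder partitions on the SSD:

    # bolt a read cache and a separate log onto an existing pool
    zpool add tank cache /dev/sda3
    zpool add tank log /dev/sda2

    # changed your mind? cache and log devices can be removed again
    zpool remove tank /dev/sda3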

Oh, one more thing.

If you use your SSD for cache (L2ARC or LOG), an SSD failure won’t result in total data loss. You’ll lose nothing from an L2ARC failure, and only outstanding transactions (15 seconds or so) from a LOG device failure. Unless you have a mirrored LOG device, in which case you’re good…

Maybe this is a dumb question, but can the L2ARC and LOG be located on the same two-SSD mirror pair?

You don’t always need or benefit from caches, and if your workload isn’t heavy on time-sensitive synchronous writes, the log won’t make a difference. As @Thro said, you should test your workload and tune the system without SSDs before resorting to them.

I personally would put the SSDs somewhere they could actually help, like in a desktop or laptop. The storage server is not likely to benefit from the SSDs at all. It’s always on, programs it runs are small enough to load once and stay resident in RAM, etc. You would get a real boost sticking the SSDs in a desktop or laptop, on the other hand. They have to boot quickly, install updates as you’re sitting there watching in agony, load large graphical programs, etc.

I’d boot off ZFS. You now have compressed ARC, which the OS should benefit from, and disaster recovery is very easy when needed: snapshot the boot/root filesystem and send it to your 8 TB mirror. When disaster strikes you can boot off that copy to get going ASAP, then restore to a drive when you have a replacement available. Whether you want to mirror the 860s or not really depends on how resilient you want this to be. It doesn’t sound like you really need a mirror for the OS.
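A sketch of that snapshot-and-send step, with rpool as the root pool and tank as the 8 TB mirror (both names are placeholders):

    # snapshot the root filesystem tree recursively
    zfs snapshot -r rpool@predisaster

    # replicate it onto the big mirror without mounting it there
    zfs send -R rpool@predisaster | zfs receive -u tank/rpool-backup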

You talking about the log?

Because by default the ZIL is in the zpool, so it would be two writes anyway.

ARC / L2ARC are caches; no need for redundancy.

The ZIL / intent log is worth mirroring; losing the ZIL is the same as losing some data / going back in time / losing some consistency.

When you write randomly to a filesystem and sync, that sync causes data to get flushed from RAM to disk, or with ZFS to the ZIL in one sequential write instead of many random ones. The random writes will still happen eventually. Even if your apps don’t sync, the filesystem metadata should still be consistent; ZFS can make these metadata mutations faster by writing them to the ZIL first and then lazily applying them in their proper storage place later, when the drive is more idle. If your ZIL goes, your data goes too.
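For reference, the per-dataset knobs that steer this behaviour; tank/data is a placeholder dataset:

    # see how a dataset currently handles synchronous writes
    zfs get sync,logbias tank/data

    # force every write through the ZIL, or bias large writes away from the SLOG
    zfs set sync=always tank/data
    zfs set logbias=throughput tank/data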

Absolutely right, my mistake for conflating the two. I’ll edit the post to clear up the confusion.

Here are some relevant excerpts from the FreeNAS blog:

When synchronous writes are requested, the ZIL is the short-term place on disk where the data lands prior to being formally spread across the pool for long-term storage at the configured level of redundancy. There are however two special cases when the ZIL is not used despite being requested: If large blocks are used or the “logbias=throughput” property is set.

By default, the short-term ZIL storage exists on the same hard disks as the long-term pool storage at the expense of all data being written to disk twice: once to the short-term ZIL and again across the long-term pool. Because each disk can only perform one operation at a time, the performance penalty of this duplicated effort can be alleviated by sending the ZIL writes to a Separate ZFS Intent Log or “SLOG”, or simply “log”.

So from the looks of things, as @risk stated, there is some benefit in terms of integrity to putting a mirrored SLOG on the same SSDs as the rest of the pool, as by default the ZIL is not redundant.

I’m actually surprised the ZIL is not at least written to more than one device, but I guess it is only short term storage, so the risk is minimal and it is better than just sitting in RAM.

I was looking to see if there was any sort of “copies” setting for the ZIL or something like that, but didn’t find one.
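So, to answer the earlier question concretely: the mechanics would be carving partitions out of the two SSDs and feeding them in as a mirrored log plus cache. A sketch, with tank and the partition names as placeholders (a SLOG partition can be small; a few GB is typically plenty):

    # small partitions on each SSD become a mirrored SLOG
    zpool add tank log mirror /dev/sda2 /dev/sdb2

    # the remaining space on both SSDs becomes L2ARC (cache devices can't be mirrored)
    zpool add tank cache /dev/sda3 /dev/sdb3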

More reference material, from zpool(8):

  Intent Log
     The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous
     transactions. For instance, databases often require their transactions to
     be on stable storage devices when returning from a system call.  NFS and
     other applications can also use fsync(2) to ensure data stability. By
     default, the intent log is allocated from blocks within the main pool.
     However, it might be possible to get better performance using separate
     intent log devices such as NVRAM or a dedicated disk. For example:

       # zpool create pool da0 da1 log da2

     Multiple log devices can also be specified, and they can be mirrored. See
     the EXAMPLES section for an example of mirroring multiple log devices.

     Log devices can be added, replaced, attached, detached, imported and
     exported as part of the larger pool. Mirrored log devices can be removed
     by specifying the top-level mirror for the log.

   Cache devices
     Devices can be added to a storage pool as "cache devices." These devices
     provide an additional layer of caching between main memory and disk. For
     read-heavy workloads, where the working set size is much larger than what
     can be cached in main memory, using cache devices allow much more of this
     working set to be served from low latency media. Using cache devices
     provides the greatest performance improvement for random read-workloads
     of mostly static content.

     To create a pool with cache devices, specify a "cache" vdev with any
     number of devices. For example:

       # zpool create pool da0 da1 cache da2 da3

     Cache devices cannot be mirrored or part of a raidz configuration. If a
     read error is encountered on a cache device, that read I/O is reissued to
     the original storage pool device, which might be part of a mirrored or
     raidz configuration.

     The content of the cache devices is considered volatile, as is the case
     with other system caches.

Booting Debian off a ZFS mirror sounds like a headache, and IMO I don’t see the point until ZOL supports TRIM (even then, I don’t think I’d personally ever use a non-native filesystem for boot).

ZFS for data, sure but MD/LUKS/LVM is fine for a mirrored boot device. Just scrub it periodically.
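A sketch of that MD route with placeholder partitions; Debian’s mdadm package also ships a checkarray cron job that handles the periodic scrub for you:

    # mirror the two OS partitions and put a filesystem on top
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
    mkfs.ext4 /dev/md0

    # kick off a scrub by hand
    echo check > /sys/block/md0/md/sync_action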

I’d also stay away from SSD caches (again until ZOL supports TRIM).

They seem to have it well under way; I really need to motivate myself to finish the other guy’s work (TRIM, that is).

I can confirm it is not a headache but very easy to do :slight_smile: The benefits of having it this way have already paid off for me. We’ll see whether the risk of using a non-native FS materializes one day or not. The foundations seem well established.

Opinions seem to differ about the need for TRIM.

Here are my two cents:

  1. If you are just going to reinstall the boot OS on a drive failure, ZFS on the boot drive seems like major overkill, especially if you have to recompile the kernel on upgrades (not sure how Debian handles ZFS on Linux).
  2. L2ARC can actually hurt performance in certain situations, so be careful. Basically, you want to max out on RAM before thinking about L2ARC: L2ARC has to be indexed in RAM, so the more L2ARC you have, the more RAM it consumes (a quick way to check that overhead is shown below).
  3. I don’t see how a SLOG can hurt your performance, especially if you have a spare SSD lying around. You might see more benefit from using that SSD somewhere else, depending on your workload, but I don’t think it will hurt.

I would personally go for regular Debian install on the OS SSD, then use the other SSD for a SLOG. Also, keep in mind you can add additional SLOGs and L2ARC later.
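On point 2, the RAM cost of the L2ARC index is visible directly in the ARC stats, so you can measure it rather than guess (field names as exposed by ZFS on Linux):

    # ARC size, L2ARC size, and the RAM eaten by L2ARC headers
    grep -E '^(size|l2_size|l2_hdr_size)' /proc/spl/kstat/zfs/arcstats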

How are you accomplishing this without having to get your hands dirty in the kernel on install/updates?

@nx2l, didn’t you have ZFS SSD performance issues that you determined were attributable to TRIM (or something else in ZOL vs FreeBSD)? Sorry if I’m misremembering.

@SgtAwesomesauce, surprised to not see a BTRFS plug from you.

Negative… my performance issue was related to using zvols.
When I switched to fileio with LIO (targetd), the issue went away (I was using blockio with a zvol).
