Most appropriate storage topology

Hi All.

I have started putting together a TR-based machine that will be used as:

  1. A storage server
  2. A workstation
  3. A gaming rig

I will set up separate Windows 10 VMs with GPU passthrough to accomplish [2] and [3], but am unsure as to the best way to achieve [1]. At this stage I have ordered the following:

  1. A TR 2950X CPU
  2. Gigabyte x399 Aorus Xtreme motherboard
  3. 64GB of Crucial unbuffered DDR4 2666MT/s ECC memory

so I still have the opportunity to select components that will enable me to best accomplish [1]. My current PC uses the same SSD for the Debian host OS and a single Windows 10 VM guest, again with GPU passthrough. This time around I am not sure whether I should:

  • Configure two NVMe SSDs as RAID1 for use by both the host and VM guests, with a separate RAID5 HDD array for bulk storage
  • Configure two NVMe SSDs as RAID1 for exclusive use of the host, and pass through separate SSDs for use by the guests (again with a RAID5 HDD array for bulk storage)
  • Do something else entirely

I have prior experience with mdadm but none with ZFS, so am unsure if ZFS would be appropriate here. Can anyone make a recommendation as to which storage option I should investigate further?

1 Like

I would personally use a normal SATA SSD for the boot drive, and a pair of NVMe drives as VM storage.

I also keep data separate from O/S so I can just blow the host away and re-install, and just transfer /home or whatever elsewhere.
But that’s just me

I found a 4-deep mirror of HDDs was not quite low-latency enough for AAA gaming on a Windows VM, so I switched the game storage to a file on a small SSD array.
Made a big difference in COD:MW.
But my lord that game has some large updates…

How much bulk storage do you need?

RAID1 or RAID10 is usually simpler than RAID5/RAID50/RAID6/RAID60

ZFS gets you data checksumming and recovery, plus RAID journaling, out of the box. The latter buys you quick rebuilds; the former gives you peace of mind when you have a large volume of data and will statistically accumulate bit flips. With mdraid you typically only get journaling.
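A minimal sketch of how that checksumming surfaces day to day, assuming a pool named `tank` (the pool name is illustrative):

```shell
# Kick off a scrub: ZFS re-reads every block and verifies its checksum,
# repairing from redundancy (mirror/raidz) wherever a bad copy is found.
zpool scrub tank

# Check scrub progress and any checksum errors found/repaired so far
zpool status tank

# Verbose view, including per-device read/write/checksum error counters
zpool status -v tank
```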

With ZFS you can’t remove/reconfigure storage as easily as with LVM (devicemapper) or with BTRFS.

  • With ZFS, the only option is usually doubling the number of disks (or adding disks in batches if you already have multiple vdevs), or swapping disks for bigger ones.

  • BTRFS has a roughly similar feature set to ZFS, but is more user friendly. With BTRFS you can convert between raid5/raid10 literally in place, while mounted and while using the filesystem. It is also much more nicely integrated into Linux: just a kernel driver in the usual place, documented as usual, plus a btrfs-progs package. To mount, pick any drive of the set (no zpool export/import/-f), or just make sure one of the drives that is part of the set appears and mount by filesystem UUID plus, optionally, a subvolume ID (if you are keeping the OS separate from home, but on the same set). It is a lot friendlier/easier than mdadm/dmraid/ZFS, which I appreciate when having to deal with dead disks.
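A hedged sketch of that in-place conversion, assuming the filesystem is mounted at `/mnt/pool` (the mount point is illustrative) and the kernel supports the target profile:

```shell
# Convert data and metadata profiles to raid10 while the filesystem
# stays mounted and in use; btrfs rewrites extents in the background.
btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt/pool

# Watch progress from another terminal
btrfs balance status /mnt/pool

# Confirm the new profiles once the balance completes
btrfs filesystem df /mnt/pool
```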

The recommendation in all cases is to keep your bootloader and /boot separate from your storage pool, as all of these RAID setups require some kind of manual intervention to boot degraded. With systemd these days, by default, you might not even be able to SSH into the host if a drive that is part of your RAID set goes missing.

Personally, between the btrfs/ZFS/LVM setups spread across 4-6 drives each that I have had at home for the last 5-10 years, btrfs has served me best (running it in raid10 at the moment).

1 Like

I like this approach too. Using normal SATA SSD for boot drive.

I have not used NVMe drives for VM storage (VM images), but I have used software-defined SSD partitions as swap space for whatever processes I have running on such VMs, which has been very useful. OS write-back to the drives after boot is pretty low in modern images for the most part, so you could use JBOD for VM storage too.

This is just one more option to consider…

1 Like

Another option you may want to consider.

  1. Use Unraid or Proxmox as the host, running from a cheap SSD or even a USB pen drive; set up the NAS as a standalone container (in Unraid) or as a dedicated FreeNAS VM (in Proxmox) with ZFS for your bulk storage (pass through a GPU for the initial install only, then run headless).

  2. Run both Debian and Windows in dedicated VMs, each with a single fast disk passed through.

  3. Set up your game library as a share from the NAS and share as needed; same with media etc.

Basically a hyperconverged system in a box. Could give you more flex to upgrade either os without worrying about the virtualisation, and you can snapshot or have multiple instances sharing the same hardware for different needs.

Just an idea.

2 Likes

Thank you for your replies. FYI, my bulk storage needs are quite modest - 16TB of redundant storage would more than cover my needs for the next few years. However, this is still more capacity than I can afford to host on SSDs (be they SATA or NVMe), so mechanical HDDs will continue to have a role to play.

With regard to Proxmox, I would be lying if I said I was overly familiar with it, but as far as I have been able to ascertain, it would not do anything that I could not otherwise do myself - i.e. I do not need Proxmox to be able to set up VMs or containers on a base Debian install. Happy to be corrected, of course, but to me it just seems like an unnecessary layer of abstraction.

I think the first thing I need to determine is whether or not I need to have the base OS (in this case Debian 10) running on ZFS (or BTRFS) as well as the bulk storage. As previously mentioned, I am currently running Debian on a RAID1 array using mdadm, but am unsure if this is sufficient. After all, it's all well and good ensuring that your precious data files (in the bulk storage pool) are safe, but if your OS crashes because of an uncorrected error or similar, then you are in just as much trouble.

Peanuts. Get a pair of Seagate Exos 16T drives and be done; btrfs in raid1 will do. Then, in a couple of years, upgrade to a pair of 30T PCIe SSDs. Alternatively, get a pair of smaller drives today and then, in a year when you're at 80% full, get a third and run btrfs fi balance.
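A minimal sketch of that grow path, assuming the two starter disks are /dev/sda and /dev/sdb, the later addition is /dev/sdc, and the filesystem mounts at /mnt/bulk (all names illustrative; these commands destroy existing data on the disks):

```shell
# Start with two disks in a btrfs raid1 filesystem (data and metadata mirrored)
mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sdb
mount /dev/sda /mnt/bulk

# Later, when approaching ~80% full, add a third disk to the set...
btrfs device add /dev/sdc /mnt/bulk

# ...and rebalance so raid1 chunk pairs are spread across all three drives
btrfs balance start /mnt/bulk
```

Note that btrfs raid1 keeps two copies of each chunk on any two devices, so an odd number of drives works fine.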

I think I first need to gain a better understanding of the level of redundancy that each system component requires. Take the system description in the OP as a case in point:

  1. The host OS will be a standard Debian Buster install. Because everything rests on this foundation, I think that this must have some level of protection. The current plan is to use two ~500GB SATA SSDs with power loss protection in a simple RAID1 (ext4) array. This can be done during the initial install. This is not foolproof by any means, and it would be better to have the host OS on ZFS, but having had a quick look at what is involved I am not sure my simple needs require such an involved solution - at least not yet.
  2. The two Windows VMs could then have their own dedicated ~500GB NVMe SSDs passed through, with all important files stored on a dedicated ZFS-based storage array (see below). I am still unsure if the VMs themselves should also be stored on the ZFS array, or if that is going overboard.
  3. The storage array would then likely be a simple ZFS RAID10 array of four ~4TB HDDs, possibly with a 32GB Optane NVMe to serve as a write cache.
  • 4x4T in raid1/10 yields 8T (~7TiB, because of HDD manufacturers' decimal rounding) of usable space.

For the rest:

How about: get a pair of 1TiB Samsung Evo Plus drives (or similar SSDs whose latency doesn't go to s**t as soon as you start writing) and partition them: first partition 256MB for UEFI, second 32G for a mirrored ZIL, remaining space a raid1 btrfs partition for the host OS as well as VM image files. (100G is plenty for a Linux VM host, and you can always stretch into your bulk ZFS if you need to.)
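A hedged sketch of that partition layout using sgdisk, assuming the two SSDs appear as /dev/nvme0n1 and /dev/nvme1n1 (device names illustrative; double-check them first, as this wipes the drives):

```shell
for d in /dev/nvme0n1 /dev/nvme1n1; do
  sgdisk --zap-all "$d"                         # wipe existing partition tables
  sgdisk -n1:0:+256M -t1:ef00 -c1:esp  "$d"     # 256MB EFI system partition
  sgdisk -n2:0:+32G  -t2:bf01 -c2:slog "$d"     # 32G half of the mirrored ZFS log
  sgdisk -n3:0:0     -t3:8300 -c3:os   "$d"     # remainder: btrfs raid1 member
done

# Mirrored btrfs across the third partitions for the host OS and VM images
mkfs.btrfs -d raid1 -m raid1 /dev/nvme0n1p3 /dev/nvme1n1p3
```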

Sneaky. I had not thought of partitioning the NVMe drives in such a manner. The only problem I can see with such an approach is that if you suffer a power outage your ZIL won't retain its contents (which would be very bad in the case of synchronous writes), and you could also mess up your RAID1 partition as well (I think - I don't know much about Btrfs; I have been focusing on ZFS). If the two SSDs had power loss protection (i.e. a BBU) then this would work, but this would mean using SATA SSDs rather than NVMe, and I am reluctant to do so.

I would also probably do things a little differently were I to go down the SATA SSD route - stick with your twin UEFI and mirrored ZIL partitions, but then create an ext4 RAID1 partition for the host and a RAID1 ZFS vdev for the VMs. That way the host is at least mirrored with BBU, and the VMs can be snapshotted etc.

No - mirror the ZIL across both NVMe drives; the OS will sync its contents with a barrier, there's no magic.

In other words, you're syncing NVMe drives (quick) rather than HDDs (sloooow).

Why won't it?

There was a thing about PLP SSDs for SLOG, but it never bothered me enough. The system won't write junk. I would much rather lose writes than worry about PLP/BBU.

Edit: SLOG, not ZIL.
The SLOG is separate; the ZIL is built in (RAM).

SLOG and ZIL are related (but the ZIL is not in RAM)…

By default, the short-term ZIL storage exists on the same hard disks as the long-term pool storage at the expense of all data being written to disk twice: once to the short-term ZIL and again across the long-term pool. Because each disk can only perform one operation at a time, the performance penalty of this duplicated effort can be alleviated by sending the ZIL writes to a Separate ZFS Intent Log or “SLOG”, or simply “log”. While using a spinning hard disk as SLOG will yield performance benefits by reducing the duplicate writes to the same disks, it is a poor use of a hard drive given the small size but high frequency of the incoming data.
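As a sketch of what adding such a log device looks like, assuming an existing pool named `tank` and two NVMe partitions set aside for the purpose (all names illustrative):

```shell
# Attach a mirrored SLOG to an existing pool; synchronous writes now land
# on the fast NVMe log partitions instead of the main HDD vdevs.
zpool add tank log mirror /dev/nvme0n1p2 /dev/nvme1n1p2

# Verify the pool layout now shows a "logs" section with the mirror
zpool status tank
```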

1 Like

When you write to a filesystem, the device has a certain amount of DRAM cache on it (unless it's super-cheap flash with no DRAM cache - ignore that for now). This helps performance because writes return faster. But those writes are not guaranteed to survive a reboot or power cycle. To ensure durability, the user/app needs to issue a sync command. This will block rather than return immediately; it returns once the data in the cache has made it to durable storage.

Battery backed cache dram is considered durable.

But operationally it just means that writes can return immediately without waiting on caches to clear - super fast syncs.

Thank you all for your replies. I found the iXsystems link provided by @nx2l to be particularly helpful. I think I now have a plan that will work and that actually makes some sort of sense.

Using two 1TB Samsung 970 EVO Plus NVMe SSDs:

  1. Create a 256MB UEFI partition on each
  2. Create a 32GB partition on each for use as a mirrored SLOG (see below)
  3. Create an LVM RAID1 array for the host OS

Then using three 8TB Seagate Exos HDDs:

  1. Create a “slow” zpool and set sync to always
  2. Create a three-disk mirrored vdev
  3. Add the mirrored SLOG to the slow zpool

The end result should be a configuration in which the host OS is mirrored on fast SSDs, and bulk storage has a fast SLOG with up to 3x the read speed of a single HDD, along with the ability to lose up to two HDDs. I can then add a second three-disk mirror at a later stage and double my storage if I so wish.

The only other thing I am considering is whether or not to reduce the size of the host OS partitions on the SSDs down to something like 100GB, and then create a second, “fast” mirrored zpool (again with sync set to always) out of the remaining space. As posters have pointed out, this “fast” zpool would not need a SLOG due to the fact it lives on an SSD rather than a HDD.
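For what it's worth, a hedged sketch of the steps above (device names, the pool name `tank`, and the volume group name are all illustrative; verify against the actual hardware before running anything):

```shell
# Host OS: LVM RAID1 across the third partition of each NVMe SSD
pvcreate /dev/nvme0n1p3 /dev/nvme1n1p3
vgcreate vg_os /dev/nvme0n1p3 /dev/nvme1n1p3
lvcreate --type raid1 -m1 -L 100G -n root vg_os

# Bulk storage: one three-way mirror vdev of the 8TB HDDs
zpool create tank mirror /dev/sda /dev/sdb /dev/sdc

# Force every write through the ZIL so the SLOG is always exercised
zfs set sync=always tank

# Mirrored SLOG on the second partition of each NVMe SSD
zpool add tank log mirror /dev/nvme0n1p2 /dev/nvme1n1p2
```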

Oh? That’s some serious redundancy.

Or you can put / on ZFS; you just need to be slightly careful with the setup and make sure you can boot with a degraded pool.

Also, have you thought about encryption? Typically folks would use devicemapper and cryptsetup underneath the filesystem.

Also, different manufacturers (or different models from the same manufacturer) can have slightly different sizes. Try to leave a bit of space empty/unpartitioned to accommodate a future replacement.

This topic was automatically closed 273 days after the last reply. New replies are no longer allowed.