BTRFS - 4 disks with RAID0 nested inside RAID1?

Hi all,

This is my first post! I've been enjoying the Level1Techs YouTube channel for about a year. I'm upgrading my (somewhat unusually set up) NAS and was looking at L1's videos on bit rot, as I want to keep data for a long time.

BTRFS looks like an ideal fit for my use case, but I was wondering if it's possible to have nested pools?

I have four 8 TB HDDs and this is the configuration I want:

2x8 TB exposed as a single 16 TB volume (RAID 0)

2x8 TB exposed as a second 16 TB volume (RAID 0)

Then, I want to RAID 1 the two 16 TB volumes using BTRFS so I have redundancy and can easily check for errors and fix bit rot.

Is this possible? I looked at the documentation but couldn’t find anything obvious.

If distro matters, I'm using CentOS 7 at the moment, but I'm planning to upgrade to CentOS Stream at some point, so if a newer kernel is needed this is a good excuse for me to get around to migrating to Stream.

BTRFS is, IMHO, pretty dead. But yes, you can do a RAID 0 stripe across two sets of mirrored drives. What you are looking for info on is a “striped mirror”, i.e. a “RAID 10” setup.

I haven't used BTRFS myself, but looking for info on RAID 10 may help you.

What would you recommend for this situation?

This was the video that highlighted the problem btrfs can solve: [for some reason I’m not allowed to include links. YouTube ID: ?v=pv9smNQ5fG0 ]

Is this also possible with xfs or a different filesystem?

Personally I’d use ZFS. But normal MDADM would be another alternative.

ZFS is cross-platform; BTRFS is not. BTRFS also isn't as mature. There are RAID 5/6 bugs in BTRFS, but they aren't applicable to your situation.
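
If you went the mdadm route, a rough sketch of the nested layout you originally described (two stripes, mirrored together) might look something like the below; device names are placeholders, and note that plain mdadm + XFS won't give you data checksums to detect bit rot, which was your original concern:

  # two 2-disk RAID 0 stripes
  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
  mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd
  # mirror the two stripes together, then put a filesystem on top
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/md0 /dev/md1
  mkfs.xfs /dev/md2

In practice most people would just create a single mdadm --level=10 array across all four drives rather than nesting.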

I would just run all four drives in a single btrfs array set to RAID 10. The filesystem will ensure there are two copies of everything and stripe across the drives. For example, I run a six-drive array in RAID 10 and it will read from three drives simultaneously for a big file.

1 Like

BTRFS doesn't provide a block device like mdadm or LVM does, so you cannot nest different filesystems inside it. However, BTRFS already has RAID 10 support; you would create the filesystem like
mkfs.btrfs -m raid10 -d raid10 /dev/sda /dev/sdb /dev/sdc /dev/sdd

of course, replacing sda and so on with the correct block devices for your drives.
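
Once it's made, the bit-rot checking you asked about is just scrubbing the mounted filesystem. A rough sketch (mount point is a placeholder; any member device mounts the whole array):

  mount /dev/sda /mnt/pool
  btrfs filesystem usage /mnt/pool    # shows the RAID10 data/metadata profiles and free space
  btrfs scrub start /mnt/pool         # verifies checksums and repairs from the good copy
  btrfs scrub status /mnt/pool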

2 Likes

A note on which block devices to use for your btrfs:

I would not use the entire block device. On mine I made a GPT on each drive, reserving a 500 MB EFI System partition and a 1 GB data partition on each one. So each of my btrfs devices is something like /dev/sda3, /dev/sdb3, etc.

That way if I need to make one into a boot drive for some reason I have space for an EFI boot and a small Linux install.

Plus, drives are a lot easier to manage with most tools if they have a partition table. Even if you decide to use the entire drive I’d recommend a GPT and one data partition.
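For example, a sketch of that layout with sgdisk (sizes and names mirror the post above but are placeholders; this wipes the drive):

  sgdisk --zap-all /dev/sdX
  sgdisk -n1:0:+500M -t1:ef00 -c1:"EFI system" /dev/sdX
  sgdisk -n2:0:+1G   -t2:8300 -c2:"spare"      /dev/sdX
  sgdisk -n3:0:0     -t3:8300 -c3:"btrfs data" /dev/sdX

Then you'd hand /dev/sdX3 from each drive to mkfs.btrfs instead of the whole disk.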

Thanks for all the helpful replies!

Especially for the tip about RAID 10; I'd never used it before and didn't know that's what RAID 10 was. It makes sense: 1+0, so well done to whoever came up with the numbering.

I hadn’t realised that btrfs was “pretty dead”. I know Red Hat deprecated it but I’d thought it was still a decent choice for cases like this.

I can see that both btrfs and ZFS support RAID 10 with checksums, so my next question is: should I use ZFS without in-kernel support, or kernel-supported btrfs? ZFS looks more mature, but what happens if people stop doing the development needed to make it work with whatever kernel is in use in five years?

And actually I've just answered my own question in my mind while writing this response. I might as well include my question and my own answer: my point was going to be that although I could move the data to a different filesystem before upgrading to a kernel that didn't support it, I'd need enough free space to copy all my files from the four drives (or however many I have by then).

But as I was typing I realised that I'm doing mirrored striping, so I already have that space. I could scrub the filesystem so I know the data on both mirrors is fine, format two of the disks with a different filesystem, copy the data across, and then add mirroring on the new filesystem. So as long as a future filesystem supports moving from RAID 0 to RAID 10, this will never be an issue in my situation.
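
For what it's worth, if the pool ends up on ZFS, the "break one side of each mirror" step would be roughly this (pool and device names are placeholders):

  zpool scrub tank              # verify both sides are good first
  zpool detach tank /dev/sdb    # drop one disk from the first mirror
  zpool detach tank /dev/sdd    # drop one disk from the second mirror

The pool keeps running without redundancy on the remaining two disks, and the detached disks are free to be formatted for whatever filesystem I'm migrating to.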

I just tried setting this up with ZFS and it was worryingly simple. I definitely didn’t feel like I earned it :wink:
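
For reference, the "worryingly simple" part really is more or less one command; the pool name and device names here are placeholders (using /dev/disk/by-id paths instead of sdX is the usual advice):

  zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd

That gives two mirror vdevs that ZFS stripes across, i.e. the striped-mirror layout discussed above.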

If I'm only using them for storage, is this still a consideration? I haven't used an HDD as a boot drive since I got my first SSD in 2007 and don't think it'll ever be something I need. I have several spare 64 GB and 128 GB SSDs that are too small for me to find a decent use for, which I can use as boot devices if I ever want to play around with different OSs.

ZFS on Linux is fine as long as you aren't trying to boot from it and have enough RAM.

1 Like

Thanks. I have 16 GB of RAM in my server and 32 TB of space (16 TB mirrored). How many TB could I add before my 16 GB of RAM became a limiting factor?

edit: Having said that, I'm accessing it over a 1 Gbit network, which I'm guessing is going to be more of a bottleneck than the amount of RAM. There are only three PCs accessing it at a time (two users and an HTPC), so high performance really isn't an issue.

The rule of thumb I keep hearing for ZFS is 1 GB of RAM per 1 TB of disk.

I'm using both ZFS and BTRFS at home (on two small server/NAS-like machines). So far I've had fewer issues with BTRFS because it's more forgiving of operator errors, and I have about the same amount of mileage with each.
With both of them I've experienced data loss as well as metadata corruption, to the point where the filesystem wouldn't mount by itself on reboot regardless of the flags/mount options.

(e.g. since drives don't fail that often… I once managed to use zpool add instead of zpool replace, which there's no way to undo… on BTRFS you can remove devices; ZFS, not so much)
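
For anyone following along, the difference between the two is roughly this (pool and device names are placeholders):

  # intended: swap a new disk in for a failing member of a mirror
  zpool replace tank /dev/sdc /dev/sde
  # the mistake: this adds the disk as a new, non-redundant top-level vdev
  zpool add tank /dev/sde

zpool will usually warn about the mismatched replication level and want -f, but forcing it is an easy reflex, and on older ZFS releases there's no way to remove that vdev again afterwards.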

1 Like

1 GB RAM per TB of storage. It depends on compression and other things like that too.

How much load does it take for the 1 GB of RAM per 1 TB rule to matter, though? I have the 2x16 TB volume mounted and I'm only seeing 382 MB of RAM usage.

I'm only on a home network and the disks I'm talking about are used purely for periodic backups. I also have 12 TB of ext4 drives on the same machine, and machines on the same network, which store data and back up to those.

It will use that RAM for the ARC, which is the in-memory cache. It's required for the filesystem to not be slow as hell.
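
If you want to see how much of that usage is actually ARC, ZFS on Linux exposes the counters; a quick sketch:

  grep -E '^(size|c_max)' /proc/spl/kstat/zfs/arcstats   # current ARC size and its cap, in bytes
  arc_summary                                            # nicer report, if the tool is installed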

It’s a rule of thumb …
In a home environment where most of the use is “archival storage”, you basically don't care, as long as tweaking and fixing it doesn't eat all your spare time. Doing an HDD seek to read a bit of metadata once in a while is not going to break the bank by leaving your $10k CPUs underutilized, and it's not going to hurt the productivity of a background backup.
As long as you can spare at least 2 GB of RAM for the OS running the zpools, ZFS can work fine without stability issues; maybe a bit slow at times, but that's it. The ZFS ARC will get resized based on the RAM available, and you can tune some parameters there if you really want to run it on a Raspberry Pi (not recommended, for many reasons).
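
If you ever do want to cap it (say on a box doing other things with its RAM), it's one module parameter; the 4 GiB value below is just an example:

  # /etc/modprobe.d/zfs.conf — applies on the next module load / reboot
  options zfs zfs_arc_max=4294967296

  # or change it live
  echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max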

ZFS also supports using flash devices intelligently, for caching as well as low-latency transaction log storage (the ZIL). For home use, if you're looking for a more responsive ZFS system and have 4x10 TiB drives on 4 GiB of RAM, before adding more RAM I'd look into getting a pair of DRAM-backed flash drives, partitioning off a small part, maybe 4 GiB, for a mirrored ZIL, and then repurposing the rest as the L2ARC. This would make all those metadata writes and reads feel seamless, and a good chunk of recent and small files would land in the L2ARC as well.
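
Adding those after the fact is also straightforward; a sketch with placeholder partitions on two NVMe drives:

  # small mirrored SLOG (separate ZIL device)
  zpool add tank log mirror /dev/nvme0n1p1 /dev/nvme1n1p1
  # the rest as L2ARC (cache devices are disposable, no mirroring needed)
  zpool add tank cache /dev/nvme0n1p2 /dev/nvme1n1p2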

1 GB per TB on ZFS is only a “rule of thumb” if you are running de-duplication.

DO NOT RUN DE-DUPLICATION

Anyone prattling off 1 GB per TB as a general rule of thumb without specifying that it's for dedupe is spreading misinformation.

5 Likes

Only those who know they need de-dup, need de-dup.

2 Likes

Free RAM is wasted RAM. As you use the filesystem, the default behavior of the kernel and ZFS is to grow the disk cache and the ARC to fill most of the available space. Linux doesn't consider this space "used", but it doesn't count it as "free" either.

That said, if you aren't reading much off the disk, nothing is really going to stay cached between reboots. Also, if most of your files are large, using a larger block size can decrease metadata overhead (at the expense of packing efficiency).
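
On ZFS that knob is the dataset recordsize; for example, for a dataset of mostly large media files (dataset name is a placeholder):

  zfs set recordsize=1M tank/media    # default is 128K; only affects newly written files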

Pretty much, and even then I would suggest not doing it.

ZFS, unlike other platforms, does de-duplication live, inline - not as a scheduled process. To make write performance not suck massively, it needs to store the dedupe table in RAM. There's no way around this; it's an inherent cost of live de-dupe.

If

  • you have a ZFS box acting as a dedicated NAS ONLY
  • AND you KNOW your data-set contains a lot of duplicate data (because you’ve run the ZFS check first; see the sketch after this list)
  • AND you can’t get enough of a win with compression
  • AND you can’t afford enough disk

and only then… consider turning on de-dupe. AFTER reading up on all the caveats.
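
The check mentioned in the list above is, as far as I know, zdb's dedup simulation, which prints a DDT histogram and a projected dedup ratio without actually enabling anything:

  zdb -S tank    # pool name is a placeholder; can take a long time on big pools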

3 Likes

Firstly let me say that I’m all for ZFS - especially on servers!

Just want to say that the whole “BTRFS is dead” thing is a bit of a myth, as far as I understand. There are lots of use cases where it really does make sense to use it.

I think the misconception started with Red Hat dropping it. The reason they decided to do that is that they're promoting XFS - they have a lot of devs working specifically on XFS and don't want to support BTRFS. They only ever supported BTRFS as a technology preview.

BTRFS is still being actively developed and new features are being added regularly. As an example, they just added support in the new kernel for RAID1C3 and RAID1C4 - RAID 1 with three or four copies.
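
For example (a sketch only; needs kernel 5.5+ and a matching btrfs-progs, device names are placeholders):

  mkfs.btrfs -m raid1c3 -d raid1c3 /dev/sda /dev/sdb /dev/sdc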

There are still cases where running ZFS doesn't make sense and the low overhead and simplicity of BTRFS comes out ahead for me personally - like most workstations. That said - if running ZFS makes sense, then do it!

To me, ZFS just isn't ready for some situations. Starting up applications with large memory footprints, like VMs, can fail when ZFS doesn't release memory fast enough, and even when it does, it causes delays while it clears the RAM…