Storage Pool in Linux

Hey guys

I have eight 2TB hard drives that I want to use as an encrypted storage pool. I am using Fedora 24 Server.

Originally I had four 2TB drives in an encrypted LVM. Somewhat surprisingly, this encrypted LVM showed up after I completely rebuilt my home server recently. I'm talking new motherboard, new hard drive controller, new OS, everything. I built my new home server, installed my new PERC H310 and flashed it to IT-mode firmware, plugged everything in, installed Fedora Server, and there was my old encrypted LVM. I unlocked and mounted it and everything was still there. Pretty cool.

I then spent the next 3 or 4 hours fighting to add the four new drives to the LVM. Nothing I did worked. Eventually I wiped everything and attempted to add all the drives to the Fedora volume group, which also did not work for whatever reason.

So at this point I am planning on starting from scratch. Again. But my question is should I go with an encrypted LVM again, or is there a better option?

How did you try to extend the storage pool? You need to create a physical volume on each new drive and then add it to your existing volume group.
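
Roughly something like this, as a sketch (the volume group, LV, and device names here are just placeholders):

```bash
# Make each new drive an LVM physical volume
sudo pvcreate /dev/sde /dev/sdf /dev/sdg /dev/sdh

# Add the new PVs to the existing volume group (check its name with `vgs`)
sudo vgextend myvg /dev/sde /dev/sdf /dev/sdg /dev/sdh

# Grow the logical volume into the free space and resize the filesystem
# in one step (-r picks the right resize tool, e.g. resize2fs for ext4)
sudo lvextend -r -l +100%FREE /dev/myvg/mylv
```

One caveat: if there's a LUKS layer sitting between the LV and the filesystem, the opened container also needs a `cryptsetup resize <mapper-name>` before the filesystem can actually be grown.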

Another option is to use btrfs, which supports storage pools natively.

2 Likes

Yeah, I did that for each new drive. I added them to the volume group, I think using vgextend. It seemed to work, because the group grew to 16TB. I then used lvextend to extend the logical volume to the maximum size available (+100%FREE, I think), and that also seemed to work. But no matter what I did, I could not extend the file system using resize2fs. It always said it was already at the maximum size, or something like that.

To be honest, I started to get myself confused going around in circles for hours. I eventually got frustrated and nuked everything to start over. All the data is backed up, so I didn't lose anything.

I am not sure what happened when I tried to add the 16TB volume to the Fedora volume group. It shows up, but nothing happens when I try to access it.

This is where I am sitting now:

If I try to mount that 16TB volume, either nothing happens in the Disks application or, if I go through Files, it crashes.

To clear LVM metadata I have sometimes used dd to write zeros to the first part of the drive. That said, IMO doing an LVM over JBOD like this isn't ideal; it's maybe one step up from RAID 0 in terms of data risk.
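
For example (destructive, obviously, and sdX is a placeholder):

```bash
# WARNING: destroys whatever is on the target disk -- double-check the device name.
# Zero the first few MB, which is where partition tables and LVM/LUKS headers live:
sudo dd if=/dev/zero of=/dev/sdX bs=1M count=16

# Or, more surgical, let wipefs find and erase the known signatures:
sudo wipefs --all /dev/sdX
```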

A traditional way would be to use a combination of Linux MD + LVM, but there is ZFS now.
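
As a rough sketch of the MD + LVM route (device names and the RAID level are just placeholders):

```bash
# Eight disks in a RAID 6 md array, with LVM layered on top of it
sudo mdadm --create /dev/md0 --level=6 --raid-devices=8 /dev/sd[a-h]
sudo pvcreate /dev/md0
sudo vgcreate storage /dev/md0
sudo lvcreate -l 100%FREE -n data storage
sudo mkfs.ext4 /dev/storage/data
```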

1 Like

I would recommend a BTRFS setup. You can use LVM with it, I believe. I have a file server running Rockstor on it, which is BTRFS on CentOS. It is simple to set up and run.

Yeah, it always bothered me that I was running 8TB of storage with no protection. I mean, I had backups, and I actually backed it up semi-regularly. Now that I have 16TB of potential data at risk, that bothers me even more.

I will have to do some research on ZFS and BTRFS. I am leaning toward ZFS on Linux. My only problem is I want it to be encrypted.

If I installed ZoL per: https://github.com/zfsonlinux/zfs/wiki/Fedora

Then set up a pool per: https://wiki.archlinux.org/index.php/ZFS#Create_a_storage_pool

Then set up ZFS encryption per: https://docs.oracle.com/cd/E23824_01/html/821-1448/gkkih.html

Would that work? Does ZFS on Linux support encryption? I ask because I have seen conflicting information on that. Some say ZoL doesn't support encryption because Oracle locked that feature down and it only works on Oracle's version of ZFS. I am not sure. Maybe I just need to research more.

If encryption does work on ZoL, then that would be super easy to set up.

edit: I am not sure that encryption works on ZFS for Linux. If anyone knows definitively whether it does or not, let me know.

On the other hand, installing and setting up ZFS was amazingly easy. Now I have an 11.8TB RAIDZ1 pool.

I've only done this with btrfs, but you should be able to encrypt the disks with LUKS and then create the ZFS array from the encrypted disks. So rather than encrypting the ZFS file system, you're encrypting the actual disks.

Seems that it does work okay.

So say you have two disks, /dev/sda and /dev/sdb. You would first encrypt those with LUKS using cryptsetup. The opened disks will be available at /dev/mapper/sda and /dev/mapper/sdb (or whatever you choose to name them). Then you create the ZFS array from the devices under /dev/mapper/ rather than /dev/sdX.
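
Something along these lines (disk names, mapper names, and the pool layout are just an example):

```bash
# Encrypt each disk and open it (repeat for every disk in the pool)
sudo cryptsetup luksFormat /dev/sda
sudo cryptsetup open /dev/sda crypt_sda
sudo cryptsetup luksFormat /dev/sdb
sudo cryptsetup open /dev/sdb crypt_sdb

# Build the pool from the opened mapper devices, not the raw disks
sudo zpool create tank mirror /dev/mapper/crypt_sda /dev/mapper/crypt_sdb
```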

Look up a guide for using luks and cryptsetup and it should make sense.

Be careful with BTRFS: the RAID 5 and 6 modes are FUBAR and are not to be considered "production ready".

I remember reading an article somewhere quoting one of the devs saying that it's not stable enough to be used with any assumption of data integrity.

@marasm ZFS on Linux doesn't support encryption directly. Use LUKS to create encrypted containers, then make a zpool on top of them. reference...

1 Like

@Dexter_Kane
Thank you very much for the information. The link you provided gave me a great place to start. I am now using this guide to work everything out: http://www.makethenmakeinstall.com/2014/10/zfs-on-linux-with-luks-encrypted-disks/

It seems like everything should work out great. My dual Xeon 2670s are fast with aes-xts encryption (over 1500MB/s with a 256-bit key and over 1200MB/s with a 512-bit key). So I don't think I will have any slowdowns on the encryption side of things.
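
For anyone else wanting to check their own hardware, `cryptsetup benchmark` will spit out the same kind of numbers:

```bash
# In-memory benchmark of the ciphers and key sizes cryptsetup can use
cryptsetup benchmark

# Or just the cipher/key size you actually plan to use
cryptsetup benchmark --cipher aes-xts-plain64 --key-size 512
```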

I did, however, find a small hiccup with my drives themselves. I decided to check all the drives to find out if they are all Advanced Format (4K block size). It turns out that two of the older drives (Samsung Spinpoint drives) are actually 512B block size. The other six drives are identical Seagate drives and are all 4K block size. How much of a performance hit would I get if I set up the ZFS pool for 512B instead of 4K? Should I order two more Seagate drives?

@SgtAwesomesauce
Yeah, I think I am going to steer clear of BTRFS for now. I have been hearing that it's just not ready for prime time for quite a while now. ZFS seems to be able to do what I need. Also, thanks for that link.

My other big question is, will all of this automatically unlock and mount at system startup? Something something fstab? Keyfiles?

You configure it to auto-unlock the disks in /etc/crypttab. You can use a keyfile stored on a USB thumb drive or something if you don't want to enter a password. There's also another file that I can't remember the name of, but you set the location of the keyfiles there so that the drive gets mounted before it tries to unlock the disks. Just look for a guide on auto-mounting LUKS and it should explain it. Then you add the fstab stuff for ZFS like normal.
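
A minimal sketch of what the crypttab entries end up looking like (UUIDs, mapper names, and the keyfile path are placeholders):

```
# /etc/crypttab
# <mapper name>  <device>                                    <keyfile>            <options>
crypt_sda        UUID=aaaaaaaa-1111-2222-3333-444444444444   /root/keys/pool.key  luks
crypt_sdb        UUID=bbbbbbbb-1111-2222-3333-444444444444   /root/keys/pool.key  luks
```

The UUIDs come from running `blkid` against the LUKS-formatted disks.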

Another tip (I recently ran into this problem myself): it's worth making a backup of the LUKS header for each disk, because if a header becomes corrupt there's no other way to fix it and it becomes impossible to unlock the disk.
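
Something like this per disk, keeping the backup files somewhere off the array:

```bash
# Save a copy of the LUKS header (repeat per disk)
sudo cryptsetup luksHeaderBackup /dev/sda --header-backup-file sda-luks-header.img

# If a header ever gets corrupted, it can be put back with:
# sudo cryptsetup luksHeaderRestore /dev/sda --header-backup-file sda-luks-header.img
```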

Also you may be interested in these topics I made if you haven't seen them already:

ZFS is love, ZFS is life... Seriously, mirrored VDEVs are what you want. I could go on for a while ranting about bit rot and RAIDZ, but this guy does a far better job of that here: http://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-not-raidz/ than I am currently capable of doing at work.

I found this regarding the ashift block size stuff: http://louwrentius.com/zfs-performance-and-capacity-impact-of-ashift9-on-4k-sector-drives.html
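
For reference, ashift is set per vdev at pool creation time and can't be changed afterwards; something like this forces 4K sectors (pool name and devices are placeholders):

```bash
# ashift=12 -> 2^12 = 4096-byte sectors; ashift=9 -> 512-byte sectors
sudo zpool create -o ashift=12 tank raidz1 \
    /dev/mapper/crypt_sda /dev/mapper/crypt_sdb \
    /dev/mapper/crypt_sdc /dev/mapper/crypt_sdd
```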

In any case, I ordered two more Seagate 2TB drives to replace the Samsung drives, just to give me the peace of mind that they all match. I will do some tests like in the link above. I think I want to go for max capacity over max performance, simply because everything will be bottlenecked by the GbE interface anyway. Maybe someday soon I'll do some NIC teaming for 2GbE or 4GbE, but even as-is it should be sufficient for bulk storage and video streaming. I think I'll use the two Samsung drives for another ZFS pool geared more toward performance so I can store some VM images on it.

@Dexter_Kane
And yes, I have followed your exploits with your ultra-secure NAS project. Really cool stuff. I might borrow some of your ideas to use with my server.

I need to learn about crypttab and fstab. I have never been comfortable with fstab.

1 Like

Actually, Red Hat uses it as the default FS for RHEL, CentOS, and Fedora. I believe so does OpenSuse. I know for a fact that Facebook uses BTRFS for their entire filesystem.

Here is a quote from Vitaltothecore.com, Feb 2016:

> There are many misconceptions around BTRFS on the Internet, some come from real initial problems that the filesystem had, but also because people usually don't check the date of the information they read. Yes, BTRFS was really unstable at the beginning, but if you read about huge data corruption problems in a blog post or topics like that, and that post was written in 2010, well, maybe things have changed since then. The most important part of a file system is its on-disk format, that is, the format used to store data onto the underlying media. Well, the filesystem disk format is no longer unstable, and it's not expected to change unless there are strong reasons to do so. This alone should be enough to tell people that BTRFS is stable.
>
> So, why is it considered unstable by many? There are a few reasons: first, as I said, people are scared of change when it comes to filesystems. Why change from a trusted and known one to something new? And it's not just Linux, the same is happening with Microsoft and its will to move from NTFS to ReFS. But then, I've always seen a paradox here: ok for XFS that has 20 years of stable development, but ext4, the "trusted" default file system, has been developed as a fork of ext3 in 2006. So, it's just 1 year older than BTRFS!!!
>
> Second reason, probably, the fast development cycles. While the on-disk format is finalized, the code base is still under heavy development to always improve performance and introduce new features. This, together with management tools that had been stabilized only recently, made people think that the entire project wasn't stable.
>
> First important note, it's not another ranting post from 8 years ago, it's Chris Mason himself speaking at NYLUG in May 2015, so just a few months ago. And the examples he brings are the best proof about BTRFS: they are using the filesystem at Facebook, where they store production data. And the nice part is that, precisely because BTRFS is used in production at Facebook, the size of the used storage helps in testing and fixing the code at a pace that wouldn't be possible in smaller installations.
>
> And if you watch the video, you'll see how some really heavy weights in the industry are supporting and working to improve BTRFS: Facebook, SuSE, RedHat, Oracle, Intel… And the results are showing up: starting from SuSE Linux Enterprise Server 12, released in October 2015, BTRFS has become the default file system of this distribution. Kudos to the guys at SuSE, because for sure the best way to push its adoption is to place a statement like this: "we are a profit company, not a group of Linux geeks, and we trust this filesystem to the point that it's going to become our default one".
>
> Ok, so BTRFS is stable enough to be trusted. Or at least I trust it, together with guys whose judgement has way more value than mine, like Facebook and SuSE Linux experts. At this point, if you still don't trust it, stop reading this post and keep using ext4 or xfs, no problem.

Here are some of the features listed in the article, along with a link to the full list.

  • BTRFS has been designed from the beginning to deal with modern data sources, and in fact is able to manage modern large hard disks and large disk groups, up to 2^64 bytes. That number means 16 EiB of maximum file size and file system size, and yes, the E means exabyte. This is possible thanks to the way it consumes space: other file systems use disks in a contiguous manner, layering their structure in a single space from the beginning to the end of the disk. This makes the rebuild of a disk, especially a large one, extremely slow, and there's also no internal protection mechanism, as one disk is seen as a single entity by the filesystem itself.

  • BTRFS instead uses “chunks”. Each disk, regardless of its size, is divided into pieces (the chunks) that are either 1 GiB in size (for data) or 256 MiB (for metadata). Chunks are then grouped into block groups, each stored on a different device. The number of chunks used in a block group will depend on its RAID level. And here comes another awesome feature of BTRFS: the volume manager is directly integrated into the filesystem, so it doesn't need anything like hardware or software RAID, or volume managers like LVM. Data protection and striping are done directly by the filesystem, so you can have different volumes that have inner redundancy.

  • Another aspect of BTRFS is its performance. Because of its modern design and the b-tree structure, BTRFS is damn fast. If you didn't already, look at the video above starting from 30:30. They ran a test against the same storage, formatted at different stages with XFS, EXT4 and BTRFS, and they wrote around 24 million files of different sizes and layouts. XFS took 430 seconds to complete the operations and was performance-bound by its log system; EXT4 took 200 seconds to complete the test, and its limit comes from the fixed inode locations. Both limits are the result of their design, and overcoming those limits was one of the original goals of BTRFS. Did they succeed? The same test took 62 seconds to complete on BTRFS, and the limit was the CPU and memory of the test system, while both XFS and EXT4 were only able to use around 25% of the available CPU because they were quickly IO-bound.

  • Writable and read-only snapshots

  • Checksums on data and metadata (crc32c): this is great in my view, as every stored block is checked, so it can immediately identify and correct any data corruption
  • Compression (zlib and LZO)
  • SSD (Flash storage) awareness: another sign of a modern filesystem.
    BTRFS identifies SSD devices, and changes its behaviour automatically.
    First, it uses TRIM/Discard for reporting free blocks for reuse, and
    also has some optimisations like avoiding unnecessary seek
    optimisations, sending writes in clusters, even if they are from
    unrelated files. This results in larger write operations and faster
    write throughput.
  • Background scrub process for finding and fixing errors on files with redundant copies
  • Online filesystem defragmentation: being a COW (copy-on-write) filesystem, each time a block is updated the block itself is not overwritten but written to a different location on the device, leaving the old block in place. If the old block is at some point no longer needed (for example, if it's not part of any snapshot), BTRFS marks the chunk as available and ready to be reused.

  • In-place conversion of existing ext3/4 file systems

  • BTRFS Wiki

Simply put, Facebook, Red Hat (the guys who make and sell enterprise Linux desktop and server OSes), OpenSuse, Oracle, etc. consider BTRFS production ready and have staked their entire livelihoods on it being production stable. So, don't write off BTRFS yet, especially if you think ZFS is able to solve your problems, since ZFS requires ungodly amounts of RAM to work correctly.

Also, why use RAID 5 or RAID 6? You can get all the performance and security you need from RAID 10 under BTRFS and there are no issues there.
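
For what it's worth, a multi-device RAID 10 btrfs volume is about as simple as it gets (devices and mount point are placeholders):

```bash
# Data and metadata both striped and mirrored across four example disks
sudo mkfs.btrfs -d raid10 -m raid10 /dev/sda /dev/sdb /dev/sdc /dev/sdd
sudo mount /dev/sda /mnt/pool   # any member device works for mounting
```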

Another point of interest for ZFS and BTRFS is:

> As a btrfs volume ages, you may notice performance degrade. This is because btrfs is a Copy On Write file system, and all COW filesystems eventually reach a heavily fragmented state; this includes ZFS. Over time, logs in /var/log/journal will become split across tens of thousands of extents. This is also the case for sqlite databases such as those that are used for Firefox and a variety of common desktop software. Fragmentation is a major contributing factor to why COW volumes become slower over time.
>
> ZFS addresses the performance problems of fragmentation using an intelligent Adaptive Replacement Cache (ARC); the ARC requires massive amounts of RAM. Btrfs took a different approach, and benefits from—some would say requires—periodic defragmentation.
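
In practice that btrfs maintenance is just a couple of commands you can hang off a cron job or systemd timer (the mount point is a placeholder):

```bash
# Recursively defragment a mounted btrfs volume
sudo btrfs filesystem defragment -r /mnt/pool

# Scrub checks (and, with redundant profiles, repairs) checksummed data
sudo btrfs scrub start /mnt/pool
```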

1 Like

AFAIK, and having read that page, all the strategies give similar protection against bit rot; mirrored VDEVs let you build for performance.

> Actually, Red Hat uses it as the default FS for RHEL, CentOS, and Fedora. I believe so does OpenSuse. I know for a fact that Facebook uses BTRFS for their entire filesystem.

Just because Red Hat uses something as a default doesn't mean it's a good idea. Red Hat does not always make the best decisions. Also, BTRFS, when stable, is better to ship than ZFS because it doesn't break the license.

> Simply put, Facebook, Red Hat (the guys who make and sell enterprise Linux desktop and server OSes), OpenSuse, Oracle, etc. consider BTRFS production ready and have staked their entire livelihoods on it being production stable. So, don't write off BTRFS yet, especially if you think ZFS is able to solve your problems, since ZFS requires ungodly amounts of RAM to work correctly.

Facebook uses BTRFS only for ramdisk filesystems that are wiped every night.

OpenSuse? Who uses that in production? (Or even SUSE Enterprise?)

The problem with it is that it has fundamental issues with some of its core functions. If you simply avoid those functions, you're fine. I don't like to have to avoid functions I need.

> Also, why use RAID 5 or RAID 6? You can get all the performance and security you need from RAID 10 under BTRFS and there are no issues there.

Not everyone has the money to shell out for N*2 drives. It's much more cost effective to work with N+2 or N+3 on large arrays.

Don't get me wrong, I want to use BTRFS. If it were stable for my use case, I would. It has features I want to use, but at this point in time, it's just not stable enough to risk my data on it.

If I were a good enough developer to work on the FS myself, I would help out, but I suck at programming, so the best I can do is sit by and watch. Hopefully by kernel 5 it will be solid.

On your list of features, the only ones that ZFS doesn't have are online defragmentation and in-place conversion, but again, that's not the idea behind ZFS. It's (mostly) meant for arrays and multiple devices.

1 Like

You got it :)

Not by default in Fedora 24; it still uses ext4 + LVM, or at least it does if you're encrypting the drive. There's been no recommendation yet from the Fedora kernel team to switch to it, so it's not happened.

However, I agree on the misconceptions. BTRFS is quite stable; I've not heard of any issues recently. If you're on a system with an up-to-date kernel, you're unlikely to have any issues at all.

Well, I got my storage pool running.

I ordered two more Seagate 2TB drives, so now all the drives in the main pool are identical. I now have a 13TB encrypted pool. Pretty cool.

I did the LUKS encryption, added the drives to crypttab, and they unlock using a keyfile at boot. The keyfile is stored in root's home and is only accessible by root. Not really the best scenario security-wise, but it'll do.

I made the ZFS pool on top of the unlocked (decrypted) devices, and it mounts automatically at boot too.

I did the same thing with the two 2TB Samsung drives that I replaced, setting that pool to an ashift of 9 because of their smaller 512B physical sector size. I am going to use this smallish (2TB) pool for VM image storage.

As to the whole "Fedora defaults to BTRFS" thing, I can say that in my case(s) it definitely does not. My Fedora 23 desktop and laptops defaulted to encrypted LVM and ext4. My Fedora 24 server defaulted to encrypted LVM and XFS.

As an aside, does anyone know how to get Conky to display storage space in TB and GB instead of TiB and GiB? It's confusing.

1 Like