Storage Pool in Linux

Hey guys

I have eight 2TB hard drives that I want to use as an encrypted storage pool. I am using Fedora 24 Server.

Originally I had four 2TB drives in an encrypted LVM. Somewhat surprisingly, this encrypted LVM showed up after I completely rebuilt my home server recently. I'm talking new motherboard, new hard drive controller, new OS, everything. I built my new home server, installed my new PERC H310 and flashed it to IT-mode firmware, plugged everything in, installed Fedora Server, and there was my old encrypted LVM. I unlocked and mounted it and everything was still there. Pretty cool.

I then spent the next 3 or 4 hours fighting to add the four new drives to the LVM. Nothing I did worked. Eventually I wiped everything and attempted to add all the drives to the Fedora volume group, which also did not work for whatever reason.

So at this point I am planning on starting from scratch. Again. But my question is should I go with an encrypted LVM again, or is there a better option?

How did you try to extend the storage pool? You need to create a physical volume on each new drive and then add it to your existing volume group.
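
Roughly something like this, as a sketch (the volume group, LV, and device names here are just placeholders):

```bash
# Make each new drive an LVM physical volume
sudo pvcreate /dev/sde /dev/sdf /dev/sdg /dev/sdh

# Add the new PVs to the existing volume group (check its name with `vgs`)
sudo vgextend myvg /dev/sde /dev/sdf /dev/sdg /dev/sdh

# Grow the logical volume into the free space and resize the filesystem
# in one step (-r picks the right resize tool, e.g. resize2fs for ext4)
sudo lvextend -r -l +100%FREE /dev/myvg/mylv
```

One caveat: if there's a LUKS layer sitting between the LV and the filesystem, the opened container also needs a `cryptsetup resize <mapper-name>` before the filesystem can actually be grown.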

Another option is to use btrfs, which supports storage pools natively.

2 Likes

Yeah, I did that for each new drive. I added them to the volume group, I think using vgextend. It seemed to work, because the group grew to 16TB. I then used lvextend to extend the logical volume to the maximum size available (+100%FREE, I think), and that also seemed to work. But no matter what I did, I could not extend the file system using resize2fs. It always said it was already at the maximum size, or something like that.

To be honest, I started to get myself confused going around in circles for hours. I eventually got frustrated and nuked everything to start over. All the data is backed up, so I didn't lose anything.

I am not sure what happened when I tried to add the 16TB volume to the Fedora volume group. It shows up, but nothing happens when I try to access it.

This is where I am sitting now:

If I try to mount that 16TB volume, either nothing happens in the Disks application or, if I go through Files, it crashes.

To clear LVM metadata I have sometimes used dd to write zeros to the first part of the drive. That said, IMO doing an LVM over JBOD like this isn't ideal; it's maybe one step up from RAID 0 in terms of data risk.
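
For example (destructive, obviously, and sdX is a placeholder):

```bash
# WARNING: destroys whatever is on the target disk -- double-check the device name.
# Zero the first few MB, which is where partition tables and LVM/LUKS headers live:
sudo dd if=/dev/zero of=/dev/sdX bs=1M count=16

# Or, more surgical, let wipefs find and erase the known signatures:
sudo wipefs --all /dev/sdX
```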

A traditional way would be to use a combination of Linux MD + LVM, but there is ZFS now.
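
As a rough sketch of the MD + LVM route (device names and the RAID level are just placeholders):

```bash
# Eight disks in a RAID 6 md array, with LVM layered on top of it
sudo mdadm --create /dev/md0 --level=6 --raid-devices=8 /dev/sd[a-h]
sudo pvcreate /dev/md0
sudo vgcreate storage /dev/md0
sudo lvcreate -l 100%FREE -n data storage
sudo mkfs.ext4 /dev/storage/data
```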

1 Like

I would recommend a BTRFS setup. You can use LVM with it, I believe. I have a file server running Rockstor on it, which is BTRFS on CentOS. It is simple to set up and run.

Yeah, it always bothered me that I was running 8TB of storage with no protection. I mean, I had backups, and I actually backed it up semi-regularly. Now that I have 16TB of potential data at risk, that bothers me even more.

I will have to do some research on ZFS and BTRFS. I am leaning toward ZFS on Linux. My only problem is I want it to be encrypted.

If I installed ZoL per: https://github.com/zfsonlinux/zfs/wiki/Fedora

Then set up a pool per: https://wiki.archlinux.org/index.php/ZFS#Create_a_storage_pool

Then set up ZFS encryption per: https://docs.oracle.com/cd/E23824_01/html/821-1448/gkkih.html

Would that work? Does ZFS on Linux support encryption? I ask because I have seen conflicting information on that. Some say ZoL doesn't support encryption because Oracle locked that feature down and it only works on Oracle's version of ZFS. I am not sure. Maybe I just need to research more.

If encryption does work on ZoL, then that would be super easy to set up.

edit: I am not sure that encryption works on ZFS for Linux. If anyone knows definitively whether it does or not, let me know.

On the other hand, installing and setting up ZFS was amazingly easy. Now I have an 11.8TB RAIDZ1 pool.

I've only done this with btrfs, but you should be able to encrypt the disks with LUKS and then create the ZFS array from the encrypted disks. So rather than encrypting the ZFS file system, you're encrypting the actual disks.

Seems that it does work okay.

So say you have two disks, /dev/sda and /dev/sdb. You would first encrypt those with LUKS using cryptsetup. The opened disks will be available at /dev/mapper/sda and /dev/mapper/sdb (or whatever you choose to name them). Then you create the ZFS array from the devices under /dev/mapper/ rather than /dev/sdX.
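
Something along these lines (disk names, mapper names, and the pool layout are just an example):

```bash
# Encrypt each disk and open it (repeat for every disk in the pool)
sudo cryptsetup luksFormat /dev/sda
sudo cryptsetup open /dev/sda crypt_sda
sudo cryptsetup luksFormat /dev/sdb
sudo cryptsetup open /dev/sdb crypt_sdb

# Build the pool from the opened mapper devices, not the raw disks
sudo zpool create tank mirror /dev/mapper/crypt_sda /dev/mapper/crypt_sdb
```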

Look up a guide for using luks and cryptsetup and it should make sense.

Be careful with BTRFS: the RAID 5 and 6 modes are FUBAR and are not to be considered "production ready".

I remember reading an article somewhere quoting one of the devs saying that it's not stable enough to be used with any assumption of data integrity.

@marasm ZFS on Linux doesn't support encryption directly. Use LUKS to create encrypted containers, then make a zpool on top of them. reference...

1 Like

@Dexter_Kane
Thank you very much for the information. The link you provided gave me a great place to start. I am now using this guide to work everything out: http://www.makethenmakeinstall.com/2014/10/zfs-on-linux-with-luks-encrypted-disks/

It seems like everything should work out great. My dual Xeon 2670s are fast with aes-xts encryption (over 1500MB/s with a 256-bit key and over 1200MB/s with a 512-bit key). So I don't think I will have any slowdowns on the encryption side of things.
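
For anyone else wanting to check their own hardware, `cryptsetup benchmark` will spit out the same kind of numbers:

```bash
# In-memory benchmark of the ciphers and key sizes cryptsetup can use
cryptsetup benchmark

# Or just the cipher/key size you actually plan to use
cryptsetup benchmark --cipher aes-xts-plain64 --key-size 512
```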

I did, however, find a small hiccup with my drives themselves. I decided to check all the drives to find out if they are all Advanced Format (4K block size). It turns out that two of the older drives (Samsung Spinpoint drives) are actually 512B block size. The other six drives are identical Seagate drives and are all 4K block size. How much of a performance hit would I get if I set up the ZFS pool for 512B instead of 4K? Should I order two more Seagate drives?

@SgtAwesomesauce
Yeah, I think I am going to steer clear of BTRFS for now. I have been hearing that it's just not ready for prime time for quite a while now. ZFS seems to be able to do what I need. Also, thanks for that link.

My other big question is, will all of this automatically unlock and mount at system startup? Something something fstab? Keyfiles?

You configure it to auto-unlock the disks in /etc/crypttab. You can use a keyfile stored on a USB thumb drive or something if you don't want to enter a password. There's also another file that I can't remember the name of, but you set the location of the keyfiles there so that the drive gets mounted before it tries to unlock the disks. Just look for a guide on auto-mounting LUKS and it should explain it. Then you add the fstab stuff for ZFS like normal.
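
A minimal sketch of what the crypttab entries end up looking like (UUIDs, mapper names, and the keyfile path are placeholders):

```
# /etc/crypttab
# <mapper name>  <device>                                    <keyfile>            <options>
crypt_sda        UUID=aaaaaaaa-1111-2222-3333-444444444444   /root/keys/pool.key  luks
crypt_sdb        UUID=bbbbbbbb-1111-2222-3333-444444444444   /root/keys/pool.key  luks
```

The UUIDs come from running `blkid` against the LUKS-formatted disks.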

Another tip (I recently ran into this problem myself): it's worth making a backup of the LUKS header for each disk, because if a header becomes corrupt there's no other way to fix it and it becomes impossible to unlock the disk.
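
Something like this per disk, keeping the backup files somewhere off the array:

```bash
# Save a copy of the LUKS header (repeat per disk)
sudo cryptsetup luksHeaderBackup /dev/sda --header-backup-file sda-luks-header.img

# If a header ever gets corrupted, it can be put back with:
# sudo cryptsetup luksHeaderRestore /dev/sda --header-backup-file sda-luks-header.img
```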

Also you may be interested in these topics I made if you haven't seen them already:

ZFS is love, ZFS is life... Seriously, mirrored VDEVs are what you want. I could go on for a while ranting about bit rot and RAIDZ, but this guy does a far better job of that here: http://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-not-raidz/ than I am currently capable of doing at work.

I found this regarding the ashift block size stuff: http://louwrentius.com/zfs-performance-and-capacity-impact-of-ashift9-on-4k-sector-drives.html
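
For reference, ashift is set per vdev at pool creation time and can't be changed afterwards; something like this forces 4K sectors (pool name and devices are placeholders):

```bash
# ashift=12 -> 2^12 = 4096-byte sectors; ashift=9 -> 512-byte sectors
sudo zpool create -o ashift=12 tank raidz1 \
    /dev/mapper/crypt_sda /dev/mapper/crypt_sdb \
    /dev/mapper/crypt_sdc /dev/mapper/crypt_sdd
```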

In any case, I ordered two more Seagate 2TB drives to replace the Samsung drives, just to give me the peace of mind that they all match. I will do some tests like in the link above. I think I want to go for max capacity over max performance, simply because everything will be bottlenecked by the GbE interface anyway. Maybe someday soon I'll do some NIC teaming for 2GbE or 4GbE, but even as-is it should be sufficient for bulk storage and video streaming. I think I'll use the two Samsung drives for another ZFS pool geared more toward performance so I can store some VM images on it.

@Dexter_Kane
And yes, I have followed your exploits with your ultra-secure NAS project. Really cool stuff. I might borrow some of your ideas to use with my server.

I need to learn about crypttab and fstab. I have never been comfortable with fstab.

1 Like

Actually, Red Hat uses it as the default FS for RHEL, CentOS, and Fedora. I believe so does OpenSuse. I know for a fact that Facebook uses BTRFS for their entire filesystem.

Here is a quote from Vitaltothecore.com, Feb 2016:

> There are many misconceptions around BTRFS on the Internet, some come from real initial problems that the filesystem had, but also because people usually don't check the date of the information they read. Yes, BTRFS was really unstable at the beginning, but if you read about huge data corruption problems in a blog post or topics like that, and that post was written in 2010, well, maybe things have changed since then. The most important part of a file system is its on-disk format, that is, the format used to store data onto the underlying media. Well, the filesystem disk format is no longer unstable, and it's not expected to change unless there are strong reasons to do so. This alone should be enough to tell people that BTRFS is stable.
>
> So, why is it considered unstable by many? There are a few reasons: first, as I said, people are scared of change when it comes to filesystems. Why change from a trusted and known one to something new? And it's not just Linux, the same is happening with Microsoft and its will to move from NTFS to ReFS. But then, I've always seen a paradox here: ok for XFS that has 20 years of stable development, but ext4, the "trusted" default file system, has been developed as a fork of ext3 in 2006. So, it's just 1 year older than BTRFS!!!
>
> Second reason, probably, the fast development cycles. While the on-disk format is finalized, the code base is still under heavy development to always improve performance and introduce new features. This, together with management tools that had been stabilized only recently, made people think that the entire project wasn't stable.
>
> First important note, it's not another ranting post from 8 years ago, it's Chris Mason himself speaking at NYLUG in May 2015, so just a few months ago. And the examples he brings are the best proof about BTRFS: they are using the filesystem at Facebook, where they store production data. And the nice part is that, precisely because BTRFS is used in production at Facebook, the size of the used storage helps in testing and fixing the code at a pace that wouldn't be possible in smaller installations.
>
> And if you watch the video, you'll see how some really heavy weights in the industry are supporting and working to improve BTRFS: Facebook, SuSE, RedHat, Oracle, Intel… And the results are showing up: starting from SuSE Linux Enterprise Server 12, released in October 2015, BTRFS has become the default file system of this distribution. Kudos to the guys at SuSE, because for sure the best way to push its adoption is to place a statement like this: "we are a profit company, not a group of Linux geeks, and we trust this filesystem to the point that it's going to become our default one".
>
> Ok, so BTRFS is stable enough to be trusted. Or at least I trust it, together with guys whose judgement has way more value than mine, like Facebook and SuSE Linux experts. At this point, if you still don't trust it, stop reading this post and keep using ext4 or xfs, no problem.

Here are some of the features listed in the article, along with a link to the full list.

  • BTRFS has been designed from the beginning to deal with modern data sources, and in fact is able to manage modern large hard disks and large disk groups, up to 2^64 bytes. That number means 16 EiB of maximum file size and file system size, and yes, the E means exabyte. This is possible thanks to the way it consumes space: other file systems use disks in a contiguous manner, layering their structure in a single space from the beginning to the end of the disk. This makes the rebuild of a disk, especially a large one, extremely slow, and there's also no internal protection mechanism, as one disk is seen as a single entity by the filesystem itself.

  • BTRFS instead uses “chunks”. Each disk, regardless of its size, is divided into pieces (the chunks) that are either 1 GiB in size (for data) or 256 MiB (for metadata). Chunks are then grouped into block groups, each stored on a different device. The number of chunks used in a block group will depend on its RAID level. And here comes another awesome feature of BTRFS: the volume manager is directly integrated into the filesystem, so it doesn't need anything like hardware or software RAID, or volume managers like LVM. Data protection and striping are done directly by the filesystem, so you can have different volumes that have inner redundancy.

  • Another aspect of BTRFS is its performance. Because of its modern design and the b-tree structure, BTRFS is damn fast. If you didn't already, look at the video above starting from 30:30. They ran a test against the same storage, formatted at different stages with XFS, EXT4 and BTRFS, and they wrote around 24 million files of different sizes and layouts. XFS took 430 seconds to complete the operations and was performance-bound by its log system; EXT4 took 200 seconds to complete the test, and its limit comes from the fixed inode locations. Both limits are the result of their design, and overcoming those limits was one of the original goals of BTRFS. Did they succeed? The same test took 62 seconds to complete on BTRFS, and the limit was the CPU and memory of the test system, while both XFS and EXT4 were only able to use around 25% of the available CPU because they were quickly IO-bound.

  • Writable and read-only snapshots

  • Checksums on data and metadata (crc32c): this is great in my view, as every stored block is checked, so it can immediately identify and correct any data corruption
  • Compression (zlib and LZO)
  • SSD (Flash storage) awareness: another sign of a modern filesystem.
    BTRFS identifies SSD devices, and changes its behaviour automatically.
    First, it uses TRIM/Discard for reporting free blocks for reuse, and
    also has some optimisations like avoiding unnecessary seek
    optimisations, sending writes in clusters, even if they are from
    unrelated files. This results in larger write operations and faster
    write throughput.
  • Background scrub process for finding and fixing errors on files with redundant copies
  • Online filesystem defragmentation: being a COW (copy-on-write) filesystem, each time a block is updated the block itself is not overwritten but written to a different location on the device, leaving the old block in place. If the old block is at some point no longer needed (for example, if it's not part of any snapshot), BTRFS marks the chunk as available and ready to be reused.

  • In-place conversion of existing ext3/4 file systems

  • BTRFS Wiki

Simply put, Facebook, Red Hat (the guys who make and sell enterprise Linux desktop and server OSes), OpenSuse, Oracle, etc. consider BTRFS production ready and have staked their entire livelihoods on it being production stable. So, don't write off BTRFS yet, especially if you think ZFS is able to solve your problems, since ZFS requires ungodly amounts of RAM to work correctly.

Also, why use RAID 5 or RAID 6? You can get all the performance and security you need from RAID 10 under BTRFS and there are no issues there.
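
For what it's worth, a multi-device RAID 10 btrfs volume is about as simple as it gets (devices and mount point are placeholders):

```bash
# Data and metadata both striped and mirrored across four example disks
sudo mkfs.btrfs -d raid10 -m raid10 /dev/sda /dev/sdb /dev/sdc /dev/sdd
sudo mount /dev/sda /mnt/pool   # any member device works for mounting
```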

Another point of interest for ZFS and BTRFS is:

> As a btrfs volume ages, you may notice performance degrade. This is because btrfs is a Copy On Write file system, and all COW filesystems eventually reach a heavily fragmented state; this includes ZFS. Over time, logs in /var/log/journal will become split across tens of thousands of extents. This is also the case for sqlite databases such as those that are used for Firefox and a variety of common desktop software. Fragmentation is a major contributing factor to why COW volumes become slower over time.
>
> ZFS addresses the performance problems of fragmentation using an intelligent Adaptive Replacement Cache (ARC); the ARC requires massive amounts of RAM. Btrfs took a different approach, and benefits from—some would say requires—periodic defragmentation.
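
In practice that btrfs maintenance is just a couple of commands you can hang off a cron job or systemd timer (the mount point is a placeholder):

```bash
# Recursively defragment a mounted btrfs volume
sudo btrfs filesystem defragment -r /mnt/pool

# Scrub checks (and, with redundant profiles, repairs) checksummed data
sudo btrfs scrub start /mnt/pool
```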

1 Like

AFAIK, and having read that page, all the strategies give similar protection against bit rot; mirrored VDEVs let you build for performance.

> Actually, Red Hat uses it as the default FS for RHEL, CentOS, and Fedora. I believe so does OpenSuse. I know for a fact that Facebook uses BTRFS for their entire filesystem.

Just because Red Hat uses something as a default doesn't mean it's a good idea. Red Hat does not always make the best decisions. Also, BTRFS, when stable, is better to ship than ZFS because it doesn't break the license.

> Simply put, Facebook, Red Hat (the guys who make and sell enterprise Linux desktop and server OSes), OpenSuse, Oracle, etc. consider BTRFS production ready and have staked their entire livelihoods on it being production stable. So, don't write off BTRFS yet, especially if you think ZFS is able to solve your problems, since ZFS requires ungodly amounts of RAM to work correctly.

Facebook uses BTRFS only for ramdisk filesystems that are wiped every night.

OpenSuse? Who uses that in production? (Or even SUSE Enterprise?)

The problem with it is that it has fundamental issues with some of its core functions. If you simply avoid those functions, you're fine. I don't like to have to avoid functions I need.

> Also, why use RAID 5 or RAID 6? You can get all the performance and security you need from RAID 10 under BTRFS and there are no issues there.

Not everyone has the money to shell out for N*2 drives. It's much more cost effective to work with N+2 or N+3 on large arrays.

Don't get me wrong, I want to use BTRFS. If it were stable for my use case, I would. It has features I want to use, but at this point in time, it's just not stable enough to risk my data on it.

If I were a good enough developer to work on the FS myself, I would help out, but I suck at programming, so the best I can do is sit by and watch. Hopefully by kernel 5 it will be solid.

On your list of features, the only ones that ZFS doesn't have are online defragmentation and in-place conversion, but again, that's not the idea behind ZFS. It's (mostly) meant for arrays and multiple devices.

1 Like

You got it :)

Not by default in Fedora 24; it still uses ext4 + LVM, or at least it does if you're encrypting the drive. There's been no recommendation yet from the Fedora kernel team to switch to it, so it's not happened.

However, I agree on the misconceptions. BTRFS is quite stable; I've not heard of any issues recently. If you're on a system with an up-to-date kernel, you're unlikely to have any issues at all.

Well, I got my storage pool running.

I ordered two more Seagate 2TB drives, so now all the drives in the main pool are identical. I now have a 13TB encrypted pool. Pretty cool.

I did the LUKS encryption, added the drives to crypttab, and they unlock using a keyfile at boot. The keyfile is stored in root's home and is only accessible by root. Not really the best scenario security-wise, but it'll do.

I made the ZFS pool on top of the unlocked (decrypted) devices, and it mounts automatically at boot too.

I did the same thing with the two 2TB Samsung drives that I replaced, setting that pool to an ashift of 9 because of their smaller 512B physical sector size. I am going to use this smallish (2TB) pool for VM image storage.

As to the whole "Fedora defaults to BTRFS" thing, I can say that in my case(s) it definitely does not. My Fedora 23 desktop and laptops defaulted to encrypted LVM and ext4. My Fedora 24 server defaulted to encrypted LVM and XFS.

As an aside, does anyone know how to get Conky to display storage space in TB and GB instead of TiB and GiB? It's confusing.

1 Like