1st Production Deployment. Need RAID Guidance and Help (+ any tips you might have) <3

Hello,

First of all, a huge thanks to whoever in the L1 team might read this; I have learned a lot from your content.

Now, as per the title, I will soon be deploying my first production server for a web app I have developed. It will serve 2-4 people daily and won’t scale (at least for the next 5 years) past 15 total registered users. Thus I have opted for a “consumer” build: a Ryzen 5600 + 32 GB of ECC memory on an ASRock B550M Pro4 (should support ECC mode).

What I’m seeking some help with is the storage. I was intending to set up a RAID 1 with 2x 1 TB NVMe drives. The reason for it is simply resilience (I have separate plans for backups, on and off the machine).

Possible constraints:
- Lack of UPS.
- OS: Debian 12

Possible ‘opposite-of-constraint’:
- The on-device backups will undergo integrity/hash checks as part of the nightly maintenance.

Am I being paranoid seeking storage mirroring? Do I need a RAID card? Do I just do it in software and cross my fingers? Is there some other solution I’m missing?

PS: any other tips you might have for a self-taught newbie passing through the first deployment tunnel are most welcome. Love to all <3

I’m making the assumption the web app isn’t something crazy IOPS-intensive;
hardware RAID shouldn’t be necessary unless you need really high performance in a parity RAID (RAID 5/6), and it wouldn’t help much in a relatively simple RAID 1.

I wouldn’t say mirroring the drives makes you paranoid; SSDs have a habit of suddenly dying with little to no warning (as opposed to HDDs, which typically give some warning).
That being said, mirroring the drives doesn’t get you data integrity per se; it would take a checksumming file system on top of the drives to accomplish that (assuming you want that; it sounds like you already have nightly integrity checks happening as part of the backup).
Interestingly enough, some older enterprise-grade hardware RAID cards could do real-time checksumming, but these have fallen out of popularity because of improvements in media error rates; typical modern hardware RAID only checks data integrity during scrubs, and only in parity arrays.

Regarding the lack of UPS, some NVMe drives have power-loss-protection capacitors on them to help with power-loss scenarios; they might be worth looking into. SSDs with PLP usually advertise it clearly, but they are often the longer 110 mm (22110) M.2 form factor, which a lot of motherboards don’t support.

Thanks for taking the time!

Your assumption is correct. I do not expect my database to go over 2-4 GB, and that’s looking into the future. Even more so because, on startup, I make a copy of it in memory and use that for any read operations. And that’s with a worst-case scenario of 20 queries per minute (in the future)…

Yes, what I’m expecting from a mirroring solution is to have no unscheduled downtime should a drive fail. On integrity, though, Debian has an option for Btrfs, but I do not know if that would help or what implications it might have, as I have never used it.

The lack of UPS does not worry me in terms of a component getting bricked, more in terms of not-completed drive ops, data loss, and how that might stack with a software RAID solution.

I just realised that in a bit-rot (or similar) situation, my integrity checks cannot discern which drive is at fault, and since (at least to my knowledge) RAID 1 does not know how to handle it, I will still have unscheduled downtime (?). So my drive-redundancy plan so far only yielded avoiding complete data loss. It’s something :'). Seems like tomorrow’s assignment is reading up on Btrfs and the cards you mentioned :stuck_out_tongue:

I would not consider a RAID card given your hardware and requirements; go with the data on a dedicated ZFS RAID 1 (mirror) and scrub it regularly, and when people complain that you need ECC for ZFS, nod politely and keep using it…
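
If you want the scrubs on a schedule, a plain cron entry is one way to do it; a minimal sketch, assuming the pool is named "tank" (the name, path and schedule are placeholders to adjust):

```
# /etc/cron.d/zfs-scrub -- example only; "tank" is a placeholder pool name
# run a scrub every Sunday at 03:00
0 3 * * 0 root /usr/sbin/zpool scrub tank
```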

Thank you for the input!

After some research it does seem like cards are another can of worms to add to the stack, so I will be avoiding them.

I’m also zeroing in on a solution close to what you suggested: ZFS + mirroring. I’m not confident that I understood the differences between RAID-Z1 and mirroring, though. My config should support ECC :slight_smile: (hopefully).

PS: I love how finding 1 answer introduces X more. Like, what should my partitions look like? What are the steps to set it up? Can I keep the OS installation along with the mirrored data? What are the potential pitfalls of this new file system I know nothing about? How do I have it inform me that a drive should be swapped? etc :stuck_out_tongue:

Rough equivalence in terminology; ZFS has a lot more nuances you can add to the mix that make it different from traditional RAID terminology (see the command sketch after the table)…

| RAID | ZFS |
| --- | --- |
| RAID 1 | mirror |
| RAID 5 | raidz1 |
| RAID 6 | raidz2 |
| RAID 10 | striped vdevs of mirror pairs |
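
To make that concrete, a rough sketch of how the two relevant layouts get created (the pool name "tank" and the device paths are placeholders, and the two commands are alternatives, not a sequence):

```
# option A -- two-way mirror: the ZFS equivalent of RAID 1, what fits your 2x NVMe plan
zpool create tank mirror /dev/disk/by-id/nvme-DRIVE_A /dev/disk/by-id/nvme-DRIVE_B

# option B -- raidz1: roughly RAID 5, typically three or more drives
zpool create tank raidz1 /dev/disk/by-id/nvme-A /dev/disk/by-id/nvme-B /dev/disk/by-id/nvme-C
```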

Keep It Simple, then add gradual improvements
You don’t want to plan for everything from the beginning, as you’ll never get anything going.
Your project, as complex as it may look to you, does not require a full enterprise production environment. It requires a backup plan, an agreement from the users as to how much unplanned downtime they can take (RTO), and possibly how much of their work they can afford to lose/re-input in case of a major failure (RPO). That will drive your backup strategy and most of the cost…


This! It makes sense now, because all mentions of raidz1 that I found were about 3 drives and up. Seems like I’ve come full circle back to RAID 1, just through ZFS, and calling it mirror instead of R1.

Agreed, I just want to spend the time I have until the hardware arrives to figure out the best I can do out the door.

The framework/guidelines you mentioned on agreed-upon/accepted downtime and loss recenter things a bit. Being self-taught, I was led by logic to think about what a failure would look like and what’s acceptable, but it took a bit to get there. Good stuff!


You want temp, log and home in separate partitions.
It would be nice to use ZFS for these, but Linux support for root on ZFS is going away and is not an option anymore; the next best in my experience would be ext4 on LVM, and you do NOT want to use the same device as your data.
The reason for that is that in the event of a complete hardware/OS breakdown, you will still be able to pull the data storage out and reuse it on another machine / after reinstalling the OS.

You do not want to be trying to recover a non-booting OS on a device that shares your data at 4 am after a sudden crash…
Also, your RAID strategy for the OS will likely be different from the one for your data.
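
As a rough illustration of the "pull the data and move on" scenario, assuming the data lives in a ZFS pool called "tank" (a placeholder name):

```
# on the old box (if it still boots); otherwise just move the disks over
zpool export tank

# on the replacement machine / fresh install, once ZFS is installed
zpool import tank     # add -f if the pool was never cleanly exported
```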

You might realize how far ahead ZFS is compared to every other file system, especially after realizing what you can do with snapshots and remote copies, and try to install it everywhere.
The major downside is that it’s not fully GPL-compliant, so sometimes it needs fiddling to keep it going on Linux, especially when updating kernels.

The same way you do with other file systems; you can use LibreNMS, Checkmk, Prometheus…

These are screenshots from LibreNMS; all the volumes are ZFS.
All zd devices are VM virtual disks running off ZFS zvols…




Yeah I watched an introductory presentation (by Dan Langille) and it touched on those topics. I’m very intrigued by snapshots, they are on my todo list after the basics are taken care of.

Interesting. For clarification, this is a simple Flask app behind nginx, served locally, behind a VPN, no VMs, no OS users, no NAS/SAN/FTP/SMB etc., just HTTPS and probably SSH for me (firewall everything else). Thus I do not intend on messing with it much. Perhaps I could get by with a soon-to-be-deprecated root on ZFS?
Otherwise, do you have any suggestions on how to achieve a semblance of redundancy/fault-tolerance on ext4? RAID 1 without ZFS’s holistic approach seems kind of incomplete.

(Depending on pricing I could do 2 SATA SSDs for boot.)

That looks sweet! I played around a bit with Prometheus, and want to try adding Grafana to it. I do have a slight suspicion that it might be overkill for my situation, but all in due time.

RAID 1 on standard mdadm + LVM + ext4?

Are you using a database of any kind?

Prometheus and Grafana are a great way to get to know a lot of modern patterns for monitoring systems.
They are fine for collecting and visualizing data, but they are a little bit the Kubernetes of the monitoring world. You can create whatever you want with them, but it requires sweat, money, time and expertise.
For most use cases, you might as well just use Checkmk or LibreNMS, which you can get up and running in minutes and which will give you, out of the box, all the monitoring and especially the alerting capability you’ll need until you move your stuff to a datacenter, without requiring databases, containers, time-series logging and custom programming for every aspect of your monitoring solution.

Thanks for confirming my suspicion hehe. I have added both Checkmk and LibreNMS to tomorrow’s todo list.

Sorry, I was not specific enough about my concerns (taking note of that guide though :stuck_out_tongue:). Firstly, power failures: my understanding is that RAID 1 can be vulnerable to them (something that ZFS seems to tolerate?). Secondly, whether this ‘stack’ can discern the failing disk in softer situations, not a full-on brick.

Yes, I went with SQLite. Not the most feature-rich solution, but the decision was made a year ago, and I was still early on my dev path :upside_down_face:.

Ok, you are definitely overthinking the requirements for resilience of your solution…
ZFS will survive a power failure just fine (ext4 as well), and your SQLite database will survive a power failure; just make sure you have a backup somewhere else with an appropriate time range…

mdadm doesn’t care; it’s the filesystem that needs to be able to recover. ZFS and ext4 will recover just fine, Btrfs as well; XFS… in my experience… not so much…
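
On the SQLite backup, the sqlite3 CLI can take a consistent copy of a live database, so something like this could hang off your nightly maintenance job (a sketch only; paths are placeholders):

```
# take an online, consistent copy of the live database
sqlite3 /srv/app/app.db ".backup /srv/backups/app-$(date +%F).db"

# sanity-check the copy before shipping it off the machine
sqlite3 /srv/backups/app-$(date +%F).db "PRAGMA integrity_check;"
```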

It was bound to happen at some point, exploring the unknown and all!

I think I’m settled on most things thanks to your guidance. You’ve really helped both cut down the time I’d have spent researching this and focus some of the research I have left. A huge thank you!

I do need some clarification on 3 points if you care to indulge my ignorance.

  1. Using md/LVM/ext4 (RAID 1 for the OS), in bad-data situations without reported errors (rare), I won’t know which drive to swap. Should I go with Btrfs (XFS in the guide you provided) for the scrub/self-healing, or should I consider the OS a throwaway part that should be easy and fast to reproduce from scratch on new drives?

  2. If ext4 is still the answer to the previous question: do I first use LVM to create 2 ext4 volumes and then RAID them with mdadm, or do I first RAID with md and then use LVM to create a volume?

  3. You mentioned keeping temp, log and home separate. In my root dir I see ‘/tmp’ and ‘/home’ but not ‘/log’; what am I missing here? (Did you mean to keep the whole installation separate from the ZFS RAID and I simply misunderstood?)

Again, thank you very much and apologies for the ‘noobiness’. If this is too much feel free to ignore; I do not want to hog the community’s resources and goodwill! <3


This. You do not want to use XFS. Btrfs I have no experience with, so I can’t comment.

mdadm RAID first, then LVM on top, then a single ext4 filesystem.
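
Roughly, the layering would look like this (a sketch only; device names, sizes and the volume group name are placeholders):

```
# 1. mirror the two boot drives (their partitions, really) with mdadm
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

# 2. put LVM on top of the md device
pvcreate /dev/md0
vgcreate vg_os /dev/md0
lvcreate -L 30G -n root vg_os

# 3. a single ext4 filesystem per logical volume
mkfs.ext4 /dev/vg_os/root
```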

Sorry, I meant /var/log or /var; depending on what type of software you use, that mount point may overflow.

The rationale for that is that in a server the places where your file systems will fill up are:

  • Data directories
  • Home directories
  • Logs (and caches in /var or /var/lib, or data in /var/lib if you use Docker, PostgreSQL, MySQL)

If they fill up and are in a separate mount, your server will keep working, monitoring will work, and you will be able to log in, understand what happened and fix the problem.
If your root dir fills up, weird things will start to happen. Linux is resilient and will not blue-screen, but any disk write (logs, apps, login prompts) will fail and you will be chasing red herrings until you understand your root folder is full.

In a long-running server, what typically fills up is logs, sometimes /tmp.
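
Continuing the earlier sketch, the extra logical volumes and their fstab entries might look something like this (sizes and names are illustrative only):

```
# separate LVs so a runaway log/tmp/home can't fill the root filesystem
lvcreate -L 10G -n var_log vg_os
lvcreate -L 5G  -n tmp     vg_os
lvcreate -L 20G -n home    vg_os
mkfs.ext4 /dev/vg_os/var_log
mkfs.ext4 /dev/vg_os/tmp
mkfs.ext4 /dev/vg_os/home
```

And the corresponding /etc/fstab excerpt:

```
/dev/vg_os/var_log  /var/log  ext4  defaults               0 2
/dev/vg_os/tmp      /tmp      ext4  defaults,nosuid,nodev  0 2
/dev/vg_os/home     /home     ext4  defaults               0 2
```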


That makes total sense. After a few hours of research this morning I think I have a plan and I will proceed with vm testing. I might post if something comes up (hopefully not :stuck_out_tongue:). You have been of great help, thank you very much!

I will attempt to add some value here for anyone stumbling upon this in the future by logging my findings and thought process on this last subject (keep in mind I’m a noob).

So root partition and redundancy + ease/speed of recovery

The gold standard (for my needs) seems to be ZFS in RAID 1/mirror (not RAID-Z1), most importantly due to checksumming the data, enabling me to know which disk is bad in a silent-failure situation.
This is a no-go, since it would take some meddling around using root-on-ZFS, which was recently removed from Ubuntu, yielding me ZFS but with a limited set of features.

Next up, RAID 1 and XFS. This was easy: XFS’s scrub (self-heal) is experimental, and I think it’s meant for metadata. Didn’t go further to see how that would play with RAID 1.

Btrfs, the string theory of file systems. Took some time, because on paper it’s perfect: checksums, scrubs, RAID, etc. But then I found this article by Ars Technica… and suffice it to say that between scrub weirdness, the clunkiness of booting a degraded array, and the seeming lack of ease of use when most needed (during failures), this isn’t it.

dm-integrity. Promising middle ground. Sits below mdadm and adds checksumming to the data. What’s more, it reports a disk error if it detects silent bad data. NICE! Not so nice is the setup process described in the answer to this question. In addition, the previous article points out that dm-integrity uses CRC32, which is collision-prone (didn’t follow up on the chance % and its relation to the data size being hashed), and that it needs to write over the whole disk on initialisation or disk replacement.
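
For reference, my understanding of the standalone dm-integrity route is roughly this (a sketch; partition names are placeholders, and the format step is destructive and writes out the whole device, which is the slow part mentioned above):

```
# give each partition a dm-integrity superblock (destructive, takes a while)
integritysetup format /dev/nvme0n1p3
integritysetup format /dev/nvme1n1p3

# open them as integrity-checked mappings
integritysetup open /dev/nvme0n1p3 int0
integritysetup open /dev/nvme1n1p3 int1

# build the RAID 1 on top: a checksum mismatch surfaces as a read error on
# one leg, and md can then repair it from the good copy
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/mapper/int0 /dev/mapper/int1
```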

So!
I can make one of two trades.

  1. dm-integrity. Yields a chance to know which drive reports bad data in a silent situation. This saves time because it allows swapping only the failing drive and carrying on. The price for this is increased complexity/time during installation.

  2. Expendable OS. As discussed above with @MadMatt, I can go with the far simpler stack of mdadm RAID 1 → LVM → ext4, and consider my OS a possibly expendable part. This keeps the installation overall simpler and faster to redo from scratch, which is good for other failure scenarios as well. On the flip side, silent bad data means more downtime, because if I don’t know which disk is bad, then both have to go. There is also the money to buy 2 disks, but for dedicated boot drives it shouldn’t be much.

For my particular situation the trade-off of the expendable OS seems good, especially when you add into the mix a backup image of the installation or a snapshot using LVM (haven’t read much on how feasible that is, though!).
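
On the LVM snapshot idea, a rough sketch of how that could work, assuming some unallocated space is left in the volume group (names, sizes and paths are placeholders):

```
# snapshot the root LV (needs free space in the VG to hold changed blocks)
lvcreate --snapshot --size 5G --name root_snap /dev/vg_os/root

# mount it read-only and archive it somewhere off the OS drives
mkdir -p /mnt/root_snap
mount -o ro /dev/vg_os/root_snap /mnt/root_snap
tar -czf /tank/backups/os-$(date +%F).tar.gz -C /mnt/root_snap .

# clean up
umount /mnt/root_snap
lvremove -y /dev/vg_os/root_snap
```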


You could try this:
https://docs.zfsbootmenu.org/en/latest/index.html

Additionally, you could install the OS as a VM on top of TrueNAS SCALE and take advantage of ZFS features with minimal management headache.

As for ECC…

Yeah, I agree, what I meant to say was ‘nod politely and keep Not using it’ :smile:

Yolo lol