ZFS RAID Config for old disks

Greetings everyone,

I’m in the process of building a Proxmox server for my home lab using some old hardware that I retired from work a long time ago and that has been collecting dust since. Got a Xeon X3440 with 32 GB of ECC RAM. Planning to run 5 to 7 VMs on it, the most important being a mail server, a Matrix Synapse + Jitsi server and maybe a PeerTube instance. I will also have a separate VM to use as off-site backup for security camera recordings from a location ~75 miles away (I’ll allocate 2 or 3 TB to it).

So, here comes the question about the setup. I have 10x 2 TB drives, basically 2 batches: 5x ST2000NC001 made in 08/2014, and 5x ST2000VN000 made in 10/2015. I don’t really trust those drives, and I won’t have a backup for the majority of the data I will start putting on them (except the mail server; I will have a separate box for mail backup with a 1 TB mirror, where I also rsync my /home folder).

Given this convenient split of two models and manufacturing dates, the most logical thing to do is to go with ZFS striped mirrors. Another issue is that the box will basically be “unmaintained” (i.e. I will only access it remotely) for a long time, so I’m thinking of going 4x mirror vdevs (RAID10) and keeping 2 hot spares. Sarge is a big influence on me here at Level1 Forum and I know he doesn’t like hot spares… and frankly, neither do I. I’d rather change a disk when I see an alarm, but in this situation I have no choice but to leave at the very least 1 hot spare out of the 10 disks. But I’m starting to get greedy for storage (especially if I will host a PeerTube instance). If I could somehow replace the disks remotely, I would go full RAID-Z2 on 10 drives for that sweet storage.

IOPS may not necessarily be that important here; the VMs won’t be very demanding (and will have low resources allocated), so I was thinking that RAID-z2 on 8 drives or RAID-z3 on 9 drives would net me an extra TB or 2, but at the cost of IOPS and potentially decreased performance, according to:

So, can someone explain to me the ups and downs of using old spinning rust in each setup?

  • Striped mirrors x4 vdevs + 2 hot spares for ~7.3 TB of usable storage (rough zpool sketch after this list)
  • RAID-z2 x8 disks + 2 hot spares for ~10.9 TB of usable storage
  • RAID-z2 x9 disks + 1 hot spare for ~12.7 TB of usable storage
  • RAID-z3 x9 disks + 1 hot spare for ~10.9TB of usable storage
  • RAID-z2 x10 disks + screw hot sparing, we’re going all or nothing, for ~14.6 TB of usable storage + no performance degradation from a badly configured array
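
For reference, a minimal sketch of what the first option would look like as a single command (device names are placeholders; in practice I’d use /dev/disk/by-id/ paths):

    # Rough sketch: 4 mirror vdevs plus 2 hot spares, 4K-sector alignment.
    zpool create -o ashift=12 tank \
        mirror diskA diskB \
        mirror diskC diskD \
        mirror diskE diskF \
        mirror diskG diskH \
        spare diskI diskJ
    # Note: on Linux, unattended hot-spare activation is handled by the ZED
    # daemon, so zfs-zed needs to be running for the spares to kick in on their own.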

My gut feeling tells me to go RAID10 (as I would in my production environments), lose some capacity, skip the bs and not have to worry about resilvering times. I always liked this article: https://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-not-raidz/
But while searching for the ideal config of 8 drives, I stumbled upon this reddit thread (I probably shouldn’t trust plebbit tho’): https://www.reddit.com/r/zfs/comments/erizhw/at_what_size_would_you_stop_using_a_striped/
The idea behind RAID-Z* on >4 drives is that the parity calculation is done faster than the drives can write the data, so it wouldn’t really matter whether you go mirror or parity, because the drives can’t write fast enough to be the bottleneck anyway. BUT, given again that I’m using old, used drives, I’m afraid the stress put on all the remaining drives during a raidz resilver is just asking for trouble. Then again, these being just 2 TB drives, a resilver technically shouldn’t stress them too hard for too long.

The last thing I could do is split the zpool into:

  • RAID-z2 x6 disks + 1 hot-spare for ~7.3 TB of usable storage
    and
  • Mirror x2 disks + 1 hot-spare for ~1.8 TB of usable storage

Not sure if the hassle of splitting into 2 zpools is worth the potential performance increase of the RAID-z2 zpool. I’d rather just have 7.3 TB and go RAID10. Am I being too paranoid about this? Should I just risk it and go all or nothing with RAID-z2 on 10 disks? Plz halp!

2 Likes

I always use RAID10 unless I have significant reasons otherwise.

Why?

  • Better IOPS; RAID10 generally means more VDEVs (assuming you use 2-3 drive mirrors). Each VDEV within a pool delivers roughly the IOPS of its slowest single device, i.e. you don’t sum IOPS across the disks inside a VDEV - a 1-disk VDEV has about the same IOPS as a 30-disk VDEV - so pool IOPS scale with the number of VDEVs (zpool iostat -v shows this per-VDEV view; see the sketch after this list).
  • More flexibility come upgrade time. You can upgrade one mirror at a time, one drive at a time. Yes, the pool will become “unbalanced” if you temporarily have different-sized mirror VDEVs in it, but I have noticed no ill effects; it’s no different from adding another VDEV to an existing pool. ZFS will balance it out as it reads/writes to the pool.
  • If you use RAID10, ZFS can read from both sides of a mirror (as well as from all the stripes!) at the same time. Because it uses checksums to validate data, it doesn’t need to validate one side of the mirror against the other (unless a block’s checksum fails, in which case it reads the other side of the mirror). Thus, with RAID10, you get the read performance of all of your non-hot-spare drives.
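
(A quick way to watch that per-VDEV behaviour on a live pool, assuming a pool named tank:)

    # Per-vdev bandwidth and IOPS, refreshed every 5 seconds; each mirror vdev
    # gets its own line, so you can see reads spread across all of them.
    zpool iostat -v tank 5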

Now, some of that stuff may not matter so much to you, but it does for me, as my home NAS also acts as a play VM store over NFS and iSCSI for lab purposes. With spinning rust, I need all the speed I can get out of it, and capacity (for rust) is already cheap enough for more than I need :smiley:

I’d agree. As soon as you get into multiple pools, you just know that at some point you will run out of space on one of them and play the data shuffling game. And both pools will perform worse than they would if they had the drives that are sitting in the other pool.

For a home “general purpose” situation, I’d focus on putting as many VDEVs as you can into a single pool.

I didn’t mention it because I thought it was obvious, given that the server will be remote: I don’t intend to upgrade it, like, ever. It will stay with 2 TB drives until I upgrade to a whole better server with higher-capacity drives (and probably fewer of them; I’m only doing 10 drives because that’s all I’ve got).

At work, I go with RAID10 because I run way more than 7 VMs from 1 array. But for home use, I thought that maybe, just maybe, I could get away with a higher-capacity but slower array, which is why I am so torn on this issue. I have no idea how to calculate MTBF, but if the drives could last at least 2 years, I’d be more than happy to take the risk of not having hot spares.

Fair enough, take out the upgrade comments, the rest applies :slight_smile:

I like keeping my options open and capacity is cheap :stuck_out_tongue:

My experience with spinning rust for VMs is that you run out of IOPS way before you run out of disk capacity; however, with ZFS you could at least add some cache to help if performance turns out to be insufficient?
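
(If it came to that, adding a cache device later is non-destructive; the SSD path below is just a placeholder, and L2ARC only really pays off for specific workloads:)

    # Attach an SSD as an L2ARC read cache to an existing pool...
    zpool add tank cache /dev/disk/by-id/ata-SOME-SSD
    # ...and it can be taken back out again without harming the pool.
    zpool remove tank /dev/disk/by-id/ata-SOME-SSD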

1 Like

I was thinking of doing some testing and, if push comes to shove, just getting more RAM. The server is filled to the brim; I have no more expansion options on it (it had one x16 and two x8 PCIe slots, which I filled with three PCIe x1 cards that each add 2 SATA ports).

As mentioned above, I won’t need a lot of IOPS in this case, so the thread is mostly a question of the reliability of old disks under different kinds of ZFS RAID configs. Performance and capacity are just nice bonuses to have, but I’m more interested in having this whole zpool last 2 years without me having to touch the server (which is why I’m thinking of opting for hot spares inside it). And I’m willing to sacrifice some performance for capacity, but not resilience. If RAID-Z2 would mean potentially more dead drives, I’d rather have lower capacity and more spare disks, but not to the point of going 3-way mirrors.

It’s all theorycrafting, weighing resilience, usable storage and performance, in that exact order of preference.

Could add some RAM (it’s very cheap right now too)… L2ARC doesn’t help as much as folks think with ZFS, except for rather specific workloads.

If it were me… mirrored vdevs, period. Otherwise I’d spend sleepless nights wondering when that first drive in a raidz vdev is going to fail and take the whole thing down for good due to the other drives not being able to survive a resilvering.

3 Likes

Good point. Mirrors re-silver fast :slight_smile:

Yeah, RAM cache is still cache :smiley:

Just read that ZFS 2.0 added sequential resilvering. If it works as advertised, this will be a moot point…

Not totally moot from a reliability perspective. Faster and less mechanically brutal, yes. But you still need to read from multiple old drives from only 2 batches to resilver, and write to an old disk from one of those batches as the replacement (since you said you’d be using the old drives as your hot spares). The drives in a vdev are all going to have roughly the same amount of writes over time if they’re all put in the vdev at once. If one from a batch is dead due to age, others from that batch in the same vdev are probably not far away from dying. Especially if the old drives you’re using have already been in use for a few years. How many hours on them? Any issues reported by SMART?
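
Something like this would answer that quickly (assuming smartmontools is installed; /dev/sdX is whichever drive you’re checking):

    # Power-on hours, reallocation counters and overall health for one drive:
    smartctl -A -H /dev/sdX | grep -E 'Power_On_Hours|Reallocated_Sector|Current_Pending|Offline_Uncorrectable|overall-health'
    # Kick off a long self-test in the background; read the result later with smartctl -a:
    smartctl -t long /dev/sdX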

So all-eggs-in-one-basket (one big raidz vdev) here looks to me like a disaster waiting to happen. You don’t have backups. The drives are old. You’re not going to maintain it. Feel free to prove me wrong though! :slight_smile:

An advantage to the mirrors is that you can remotely add vdevs as needed. Meaning you don’t have to start with all drives participating in I/O. Start with one mirror vdev. Out of space, add another. Out of space again, add another. Hence you can stagger the usage of the drives, which can help with eking more life out of them. Of course, if they’re spinning even as spares… you never know. I’ve had drives fail from spindle bearing failure, head failure, and of course media failure. These days I generally don’t let the drives in my home ZFS pools go much beyond warranty before I swap them (3 or 5 years depending on the drive). Especially the pool I use for on-site backups.
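
Staggering it would look something like this (device names are placeholders again):

    # Start with a single mirror vdev plus one spare...
    zpool create -o ashift=12 tank mirror diskA diskB spare diskC
    # ...and only grow the pool when space actually runs out:
    zpool add tank mirror diskD diskE
    zpool add tank spare diskF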

If you go raidz, especially one big vdev, I’d at least seriously consider new drives as your hot spares. Or are your old drives still new in terms of usage? Maybe consider what a pair of 10TB spares would give you in terms of options when something goes awry on the vdevs composed of the 2TB drives? Or the fact that instead of hot spares, they could be a mirror vdev in a separate pool that’s used as the automated backup target of the main pool?

3 Likes

As expected, everyone is pointing towards raid10 with hotspares.

I don’t yet have all the parts to start assembling it and look at the SMART reports (and maybe do a stress test). Already ordered; they should be here by Thursday.

I mentioned I will back up my mail server at least. The rest of the VMs will not have backups. I am assuming that risk. But the data in them will either not be lost forever (I can get it back from elsewhere) or won’t be important enough to back up. My plan is to have this junkyard server running for a year, or at most a year and a half, while I save some money to replace it with something better. I have some savings, but those are reserved for other priorities, and I thought I could experiment earlier than expected with some stuff that I’ve wanted to do for a while now.

Unfortunately, the first 2 or 3 of the ~7.3 TB in the pool will be immediately allocated as off-site backup for the surveillance camera footage, so 2 vdevs’ worth of space is gone in one move. I could add 1 more vdev and be OK for a while, but I wanted to have the pool balanced from the start. I prefer adding all disks to the pool from the beginning, but that’s what I do when I buy new storage servers.
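
(Either way, I’ll probably cap the camera data with a quota so it can’t eat the whole pool; the dataset name below is just an example:)

    # Carve out a dataset for the camera backups and cap it at 3 TB:
    zfs create -o quota=3T tank/cameras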

So, these being used disks, is it wiser to add vdevs as I need more storage, instead of just letting ZFS write to all the disks from the beginning? I mentioned I don’t like hot spares, so having even more disks that just idle while being powered is something I’d prefer to avoid if I can help it (unless, of course, I can get more usage out of said drives). And since data will be continuously written and deleted by the surveillance cameras, the pool will eventually balance itself anyway (probably within a month, since I write about 2 TB of footage per month to my DVR’s internal 2 TB HDD).

At this point my bias towards striped mirrors seems to be confirmed, so the thread will continue with the question of adding all drives at once or on demand (I technically don’t need all the performance of RAID10; even a single mirror would suffice). But if someone manages to make a good argument for parity RAID, I’m all ears (or rather, eyes :eyes:).

Of course you’re right, around 0.02% :wink:

But seriously, everything you say is worth considering; still, if you scrub reasonably often and monitor SMART, the chances of that scenario really go down.
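
(On Debian/Proxmox the ZFS utilities already ship a periodic scrub cron job, if I remember correctly; running one by hand and checking the result is just:)

    # Scrub the whole pool; it runs in the background.
    zpool scrub tank
    # Quick check afterwards: prints "all pools are healthy" unless something needs attention.
    zpool status -x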

And what I would argue is that this new feature (if it “just works ™” ;)) pushes the real chance of the array failing way down.
Like $(old_time/new_time) of the rebuild time down.

And if that percentage starts to resemble the chances of, for example, a server room fire, flood, lightning strike and so on, then you’re getting to the “moot point” I was talking about.

Because in the past 8 years I’ve already had 2 floodings in my server room, one from a leaky roof and one from the AC. Yet I still haven’t had an array fail due to resilvering, and I have like 10 raidz2 arrays.

And I really wasn’t factoring in flooding when I was planning my array layouts. If you agree that would be ridiculous, then well… :wink:

Hope I proved my point.

Now, as for going striped mirrors instead of raidz.
First and foremost, I would mention the flexibility of changing the number of devices, just like you did.
Because that’s almost certain to come up, especially if you’re planning on doing that in the future.

Although, taking into account how lazy the average person is, and that half of them are even lazier… oh well, I may have fudged my numbers a bit :wink:

2 Likes

In a year and a half at most, I will most likely not be changing devices but replacing the whole server and migrating everything. I’m certain I don’t want to keep this old lad around long enough for it to become a liability. At most I might turn it into a testing / staging server, but given its power usage (compared to the Celeron J3455s I’ve got now), I don’t think that will happen.

I am assuming this is an argument for RAID-Z2. Can you pretty please tell me how old your 10 raid-z2 arrays are? How big are they (number of drives and size)? What were your resilvering times? What are you running on them (VMs, backups, DBs)? Would you actually recommend it for disks as old as mine? (As mentioned before, I will come back to this thread on Thursday, when it’s all assembled, and post SMART test results; those will probably have a big influence on how much trust I put in the drives.)

I love war stories from the server room, especially those kinds of really unexpected things, like floods and fires. They make me grateful that I’ve never had to deal with anything like that.

A few 6x 2 TB, 8x 3 TB and 8x 4 TB, working since somewhere between 2010 and 2015 (can’t remember exactly when each went in).
Also, the most recent 6 arrays are 8x 6 TB; the oldest is like 3 years old? Can’t remember when 6 TB drives became economical enough.

Thinking about going 8 TB next year probably, gonna replace the smaller disks.

The 6x and 8x widths usually come from my case and workflow restrictions. And it’s mainly storage for large satellite/aerial images, so high IOPS aren’t required; SSDs handle that.

Resilver takes around 24h? It varies with ZFS, and I can’t remember the last time a 6 TB died. But I scrub everything every weekend: 12-24h depending on the array.

I run separate VMs for each of the storage arrays and their corresponding backups.
And also a s**tload of other VMs (like 6 license servers alone, a few web servers, a conferencing VM (BigBlueButton), a WMS proxy, a PBX…).
Too many to list them all, and not much point, since I spec each VM depending on what it needs to do.

1 Like

There’s an update to make here. It turns out I didn’t have 5x of one HDD model and 5x of another. I have 4x ST2000VN000 and 6x ST2000NC001. Also, I will bring another, relatively unused IronWolf ST2000VN04 from work (it’s from 2017; it was used to back up some data and then sat outside of any system as cold storage). This brings the total number of drives to 10 + 1 hot spare, which means I’ll be going RAID-Z2. I feel like this is a bad idea, but omg, 14.6 TB just for myself, it feels unreal (and for the first time, I’ve got ECC RAM in my home lab, hurray!). I have a feeling I will be bashing my head against a wall in a (relatively long) while, but again, I am assuming the risk.
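
For my own notes, the pool creation would look roughly like this (placeholder names; I’ll use /dev/disk/by-id/ paths on the real thing):

    # 10-wide raidz2 with the IronWolf as the hot spare:
    zpool create -o ashift=12 tank \
        raidz2 disk1 disk2 disk3 disk4 disk5 disk6 disk7 disk8 disk9 disk10 \
        spare disk11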

I guess I solved my own issue, but your input is very much appreciated. And for the love of smoke, I don’t want to go through this again next time. I guess that’s what happens when you get free stuff. When I build my next server (in around a year, hopefully), I will just buy 10 TB drives and put them in ZFS RAID10. Fewer drives, lower energy bill, less hassle.

It will be fine :slight_smile: I would go raidz2 up to maybe 12 disks. Then I would consider raidz3.
Just use SMART and scrubbing and make it auto-mail you if there’s a problem.
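
Something along these lines covers the auto-mail part on Debian/Proxmox (assuming local mail delivery works; the address is a placeholder):

    # /etc/zfs/zed.d/zed.rc -- have ZED e-mail ZFS events (faults, resilvers, scrub results):
    #   ZED_EMAIL_ADDR="admin@example.com"
    # /etc/smartd.conf -- have smartd watch every drive and mail when attributes start failing:
    #   DEVICESCAN -a -m admin@example.com
    systemctl enable --now zfs-zed   # then restart smartd (or smartmontools) after editing its config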

12 disks would mean performance degradation (128 KiB record / 10 data disks = ~12.8 KiB per disk = BAD), and I don’t want to mess with the ZFS recordsize. 10 disks is optimal (128 KiB / 8 data disks = 16 KiB per disk = good), and I really can’t fit more HDDs in my case anyway.
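
(The back-of-the-envelope math, for anyone following along; compression blurs these neat power-of-two splits in practice:)

    # Default 128 KiB recordsize divided across the data disks of a raidz2 vdev:
    echo $(( 128 / (10 - 2) ))   # 10-wide raidz2 -> 8 data disks -> 16 KiB per disk
    echo $(( 128 / (12 - 2) ))   # 12-wide raidz2 -> 10 data disks -> 12 (really 12.8) KiB per disk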

https://forum.level1techs.com/t/post-what-new-thing-you-acquired-recently/149881/5738


Yeah, those are just my arbitrary ballpark numbers.

That’s for the best IMO, because 10 disks on SATA is the absolute maximum I would try to run. (Again, just my arbitrary numbers.)

Build log update:

  1. I made a rookie mistake by not checking the specs of the disks:
  • ST2000VN000 (x4) have 5900 RPM and 64 MB of cache
  • ST2000NC001 (x6) have 7200 RPM and 64 MB of cache

I’ve got some older ST2000NM0033 drives (09/2013) that have 7200 RPM and 128 MB of cache. Should I combine these with the NC001s in my array? I think I trust those even less, but idk… better to have disks with the same RPM.

  2. One of the VN000s is already reporting SMART errors. I didn’t look at the screen for long; I just noted the SN and device ID (it was /dev/sdk, so I knew which SATA controller it was on, which made it easy to identify), then shut down the system and took the HDD out. It was either “spin_retry_count” or “when_failed” (probably the latter), telling me “FAILING_NOW - backup data!!!11! reeeee”.

  3. The **** server is rebooting at random times. 2 days ago it rebooted like 4 or 5 times. I did some system updates and it had an uptime of a day and something, then rebooted once again today (which is when I noticed the SMART errors). Proxmox is reporting that I have the Foreshadow bug, also known as L1 Terminal Fault (L1TF) (basically another one of Intel’s CPU vulnerabilities; I’ve got an Intel Xeon X3450), but I doubt that’s the cause of the reboots. And neither journalctl nor syslog reports anything: journalctl basically says there’s nothing wrong, while syslog reported a ****ton of RRD and RRDC update errors and only 1 line full of “@@@@@@” around the time the server restarted (I believe that has to be a corrupted log entry from the reboot). The commands I’ve been using to dig through the logs are sketched right after this list.
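
In case it helps anyone with the same problem, this is roughly what I’ve been running to dig through the previous boots (nothing Proxmox-specific about it):

    # List recorded boots, then dump the errors from the boot before this one
    # (needs persistent journald storage, i.e. an existing /var/log/journal):
    journalctl --list-boots
    journalctl -b -1 -p err
    # See whether any machine-check / memory (EDAC) events were logged:
    ras-mc-ctl --errors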

Any tips on how to update an Intel Server Board ( S3420GP )?

Back with more updates. I’m still not sure what is happening and I can’t find any logs, but the server randomly restarts. I brought in an older Seasonic 430W 80+ Bronze PSU, in the hope that the PSU is the part that’s crapping out. The S3420GP does have a blinking amber LED, which doesn’t pin down anything (the manual says either the RAM had too many corrected errors, the system may be too hot (doubt it), or the system may be failing, which could be possible). And ras-mc-ctl reports no errors.

Because the server keeps restarting, I can’t finish the long SMART tests on the HDDs. I managed to find another defective one (it was crapping out a lot), but now I have 11x 2 TB 7200 RPM disks, plus 4x 2 TB 5900 RPM disks on the side (I’ll make a backup system out of them). And I’ve got 2 defective disks lying around that I’ll take to an electronics recycling bin. I do have another similar, slightly weaker system at work whose components I could swap around (replace the motherboard and keep my RAM and CPU). I’ll keep you updated on what happens next (this has basically become an old-HDD build log).