New consolidated data storage server sanity check

I’m thinking of consolidating 5 of my NAS systems down to a single server.

Right now, my hard drives are somewhat scattered between those systems, and the plan is to move to Proxmox VE 7.3-3, running ZFS.

I wanted to run the idea of setting up 4 vdevs of 8 drives each, with each vdev being raidz2.

Two of the vdevs will consist of 6 TB drives (2 vdevs x 8 drives x 6 TB) and two of the vdevs will consist of 10 TB drives.
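For reference, a back-of-envelope calculation of what that layout yields in usable space (raw marketing TB, before ZFS metadata and slop overhead, so the real figure will be somewhat lower):

```shell
# 8-wide raidz2 = 6 data drives + 2 parity drives per vdev
usable_6tb=$((2 * 6 * 6))     # two vdevs x 6 data drives x 6 TB  = 72 TB
usable_10tb=$((2 * 6 * 10))   # two vdevs x 6 data drives x 10 TB = 120 TB
echo "approx usable: $((usable_6tb + usable_10tb)) TB"   # 192 TB
```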

If I have such a massive storage pool like that, but broken out into vdevs that are no more than 8-wide, is there anything that I should be aware of before setting something up like this?

Any “gotchas” that I should be thinking about or be wary of?

(I tested this configuration in a VM environment with TrueNAS and it didn’t seem to have an issue with it, but I wanted to double-check before deploying Proxmox on bare metal and then using Proxmox’s ZFS capabilities to set something like this up.)

Your help is greatly appreciated.

Thank you.

Certainly a lot of HDDs :slight_smile: But ZFS doesn’t really care and excels at vertical scaling.

8-wide RaidZ2 is a proper setup (there is an argument for going RaidZ3 on 8 disks, but that’s about it), and having 4 vdevs improves performance. You also get the option of having both 6TB and 10TB drives in a single pool without making concessions on available space (unlike mixing sizes within the same vdev, where a 10TB drive would effectively be downgraded to 6TB).

Depending on the drive models, you will see an unequal distribution of allocation between the 6TB vdevs and the 10TB vdevs, simply because 80%-filled 6TB drives are slower than 50%-filled 10TB drives, and ZFS prefers write performance over perfect balance. But that’s more a quirk than a problem.

You can replace 6TB with 10TB ones but not vice versa. And additional vdevs also have to be 8-wide. But otherwise I don’t see a problem. Get some RAM, add L2ARC and maybe a special vdev and have fun working with that new server.

Since you have multiple NAS’s (?) in service, maybe turn one of them into a backup target for your storage server.


The proposed system will have either 128 GB of RAM or 256 GB of RAM (because I am also going to be consolidating my VMs from my Beelink GTR5 5900HX mini PC to the new, single, consolidated server).

I’m still debating about consolidating my gaming system over to the server as well, because whilst I can use PCIe passthrough for the GPU, I think that the problem is going to be the speed of the noVNC client (and its ability to send the video data quickly enough over GbE).

So my gaming system might end up still being a physically separate system. Not sure yet. Still doing research about that and also some “proof-of-concept” testing with my older HP Z420 workstation and a GTX 980.

(If I can migrate my gaming system over as well, that would be great, because then I can cut my total power consumption even further.)

So, I figure that 256 GB of RAM should be plenty/sufficient.

I’m not particularly concerned about read caching. If it really becomes an issue, then, in theory, I should be able to add PCIe L2ARC read-cache devices ex post facto. We’ll see how that goes.

I already have an LTO-8 tape backup system in a father-son backup scheme.

(This was part of the reason why I posted the thread here asking about how to back up iSCSI targets to LTO-8 tape, since iSCSI is block-level storage, which means that LTO-8 won’t see the files on said iSCSI targets.)

Thank you.

When I do the initial setup of the ZFS pool, do I need to add all 4 vdevs right away, when the pool is created, or can I add them in as the data gets migrated over?

Thanks.

(If I need to create all 4 vdevs when I do the initial creation of the pool, then it would mean that I would have to send all of the data that currently resides on the NAS systems over to my LTO-8 tapes first, so I can clear the drives, then physically move all of the drives over into the new server, then set up the ZFS pool (along with the VMs, etc.), and then put all of the data back. Not a TERRIBLE issue, but it would be easier if I could just consolidate one NAS system at a time (playing musical chairs with the data).)

Thanks.

You can add one vdev at a time. Create the pool with one vdev and add other 8-wide RaidZ vdevs later. ZFS won’t balance the data across the vdevs by itself, so you might end up with an uneven distribution of data across the vdevs in the end. Having all vdevs available from start is more optimal in that regard.
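If it helps, the one-vdev-at-a-time path might look something like this. The pool name and disk names below are placeholders, not your actual devices; on real hardware you’d want the stable `/dev/disk/by-id/` paths:

```shell
# Create the pool with the first 8-wide raidz2 vdev
zpool create tank raidz2 \
    disk1 disk2 disk3 disk4 disk5 disk6 disk7 disk8

# Later, after migrating data off the next NAS, grow the pool
# by adding another 8-wide raidz2 vdev
zpool add tank raidz2 \
    disk9 disk10 disk11 disk12 disk13 disk14 disk15 disk16

# Inspect per-vdev capacity and allocation afterwards
zpool list -v tank
```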

Is there a way to “force” ZFS to re-balance the data after a vdev is added?

Thanks.

Deleting the pool and restoring from backup is the usual approach to ensure even balance. This also applies to e.g. special vdevs where those devices were added later and all the stuff is still stored on HDDs.
There is no inherent ZFS command or tool to do so. Over time the imbalance will cancel itself out as those blocks get modified/overwritten. I’ve seen some script mentioned here in the forums that will do this, but I have no experience using it myself, and it will probably take longer to run than a resilver (which is quite some time by itself, considering your vdev config).

Yeah…then that would be the same as what I mentioned earlier about having to offload all of the data onto tape and then load it onto the array/pool.

It’s too bad that it’s very difficult/impossible to predict what the performance impact of an unbalanced pool/array is going to be, if I were to add the vdevs to the pool one at a time after the initial creation of said pool.

Can’t you create new datasets and copy data to it, so you still got the old copy, and the new data is written (roughly) to all the providers?

Should work. Copying stuff to other datasets creates new data. Assuming ~125% (plain copy + single stripe of 4 vdevs) of the entire pool data fits into the initial vdev.

Problem: It still causes uneven distribution across the vdevs, because the initial vdev will be disproportionately filled, and drive performance (and thus allocation) is worse when drives are filled to the brim than when they are empty. ZFS will probably allocate most stuff to the empty, fast vdevs in the process. It will be far less severe, but it should be noticeable in e.g. zpool list -v
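A rough sketch of that dataset-copy approach (the pool and dataset names are hypothetical, and you’d obviously want to verify the new copy before destroying anything):

```shell
# Rewrite the data into a new dataset so the new blocks get
# allocated across all vdevs that exist at copy time
zfs create tank/media_rebalanced
rsync -a /tank/media/ /tank/media_rebalanced/

# After verifying the copy, retire the old dataset and swap names
zfs destroy -r tank/media
zfs rename tank/media_rebalanced tank/media
```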

Just to clarify things: There is no ZFS pool in the world with perfect distribution and striping across all vdevs. Even in my pool with 6 identical drives in 3 mirrors, allocation varies between vdevs and some files just don’t get striped. It’s normal behaviour to have some deviation because of how ZFS works: optimize write performance when things are written. Data gets written to the disk that reports “finished with the queue” and ZFS says “alright, here is more work for you while I wait for your slower neighbors to catch up”.


Two things about this:

  1. If I read this correctly though, if you have 6 identical drives in 3 mirrors, you won’t get a stripe. Ever. Cuz there’s no striping (as I understand it) between 3 mirrors.

If you have three vdevs, A, B, and C, and they’re all two-drive mirrors, then my understanding is that, by virtue of that topology/layout, you don’t have striping that occurs between them anyway.

  2. In terms of performance, I am not 100% sure that I am going to necessarily be too worried about it given that I am using spinning rust HDDs.

I think that the Supermicro chassis (SuperChassis 847BE1C12-R1K68LPB4) has a front 24-port SAS3 backplane and a rear 12-port SAS3 backplane.

(I can’t tell if it is 16+8+8+4 topology or if it is really going to be just “flat” 24+12 topology.)

My point is that if it is just going through an SFF-8643 connector (which might be a SAS 12 Gbps x4 link), that means the backplanes might only support up to 48 Gbps, which is then divided among the 24 drives in the front (2 Gbps each) or the 12 drives in the rear (4 Gbps each).
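As a sanity check on those numbers (this assumes a single SFF-8643 x4 SAS3 link per backplane, which is an assumption about this particular chassis):

```shell
link_gbps=$((4 * 12))   # one SFF-8643 connector: 4 lanes x 12 Gbps = 48 Gbps
echo "front: $((link_gbps / 24)) Gbps per drive (24 bays on one link)"   # 2 Gbps
echo "rear:  $((link_gbps / 12)) Gbps per drive (12 bays on one link)"   # 4 Gbps
```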

So, it is quite possible that the backplanes might end up being the bottleneck anyways. Not sure.

(Trying to calculate what the most likely bandwidth/throughput is going to be is actually not very straightforward at all, so I have no idea, but I don’t think that the speed is going to be very high either way.)

But it would be interesting to see how ZFS will or won’t balance the data across the four raidz2 vdevs because if I create all four vdevs when the pool is created, then I won’t really have much of a say in terms of how the data is distributed across the vdevs.

Thanks.

Data is striped across vdevs no matter if they are single disks, mirrors or RaidZ. In my case it’s 3 vdevs each being a 2-way-mirror.

Depends on how many SFF-8643 connectors there are. Usually it’s one link per 4 drives which is plenty for HDD. But if there is some active backplane or SAS expander within the chassis, things will be different.

Which isn’t a problem most of the time, because HDDs are slow beasts. Having 2 Gbps is plenty for a HDD in such a setup.

I didn’t know that.

That’s interesting. I did not expect that.

Yeah, there are SAS3 backplanes.

(The parts list for the chassis says that some of the ports can be NVMe capable, hence the 16+8+8+4 notation. But if I am using all SATA and/or SAS drives, then I am not sure if it will just see it as 24+12 or if it will still keep it in the 16+8+8+4 groupings.)

For the most part, yes.

But where it gets complicated is that, traditionally speaking, if you had ONLY a stripe (i.e. RAID 0), you should be able to, at least in theory, sum up the individual drives’ bandwidth, and that would be the total bandwidth available to you (at least as a theoretical limit).

But when you have four 8-wide raidz2 vdevs, trying to calculate or predict the theoretical limit is practically impossible (at least with any degree of certainty that that is what you will actually get).
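You can still get a crude upper bound by summing the data disks’ streaming bandwidth (parity drives don’t add read bandwidth). The ~180 MB/s per-drive figure below is an assumption; real throughput will land well below this once seeks, checksumming, and the backplane get involved:

```shell
per_disk_mbps=180       # assumed sequential rate for a 6-10 TB HDD
data_disks_per_vdev=6   # 8-wide raidz2 = 6 data + 2 parity
vdevs=4
echo "optimistic ceiling: $((per_disk_mbps * data_disks_per_vdev * vdevs)) MB/s"   # 4320 MB/s
```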

Switching topics for a moment – my new, bigger concern is how the indexing of the contents of the system is going to work (i.e. how do companies handle indexing millions and millions of files such that the indexing task itself doesn’t grind the system to a halt for its users?).

This topic was automatically closed 273 days after the last reply. New replies are no longer allowed.