Overthinking PCIe lanes for NAS/multipurpose server

Copy-paste of my Reddit thread, but L1T is more technical, so I guess this should have been the place I started with…

Skip to the next part if you wanna get a shorter version of the rambling below.

I’m considering my options for expanding my NAS, and while I currently run TrueNAS SCALE, with the way the charts keep breaking previous setups on updates and such, I’m considering swapping to Unraid.

Which of course leads me to consider other aspects of the NAS that I’m not quite satisfied with, and thus my first steps to remedy the situation. I have ordered one of those Rosewill L4500U chassis so I can get 15 3.5" drives in, but I wanna also set up expansion for other PCIe-heavy stuff.

Cue the topic of the title. I’ll need to find an LSI card, I think, which can support the 15+ disks I expect to plug in (sneaking a few 2.5" SSDs or something in along the sides), but it seems the cheap cards only support 8 drives each. So, that’s two cards at least.

I’d also like to slap in a few Optane P1600Xs as a special metadata vdev. (I think that’s why Windows Explorer keeps lagging out when I try to save images to a folder with a lot of unmanaged files at the moment?)

Thinking 3 or 4 of them in either a RAID 10-alike or, more likely, a RaidZ1. And maybe 2~4 NVMe SSDs for an application pool, or just a really fast pool. Apps can survive on the SATA SSDs fine, right?

Additionally, I’d possibly grab an Arc GPU for media transcoding.

And if 10Gbit NICs come down in price in the future, I might wanna grab one of those too. SFP+ or RJ45, I don’t care at the moment, but my switch only has the one 10Gbit SFP+ uplink as it is, so I’m not in a hurry on that. Probably just a 2.5Gbit card for now.

Anyways. Adding all that up, there are at least 40 lanes being used on just the Optane (4 × x4), NVMe (2~4 × x4), and Arc (1 × x16?), and I don’t know how many lanes the LSI cards would take, plus the extra lane for the 2.5Gbit NIC, not accounting for future expansion.

So. Physically, that’d be around 5~6 PCIe slots already, and way more lanes than a consumer board offers too.

Am I overallocating things here? Overthinking stuff? Or is the only way to run a decently kitted-out NAS to abandon any notion of AM4/AM5 and go Threadripper with the higher TDP, or EPYC, or some sort of Xeon, to get the number of lanes required?

On a side note, should I go with two 6-wide RaidZ2 vdevs and three hot spares, or three 5-wide vdevs…?


So, adding onto the initial post, this is what I’m currently thinking, with some of Reddit’s suggestions folded in.

A recap, though. The NAS is currently running TrueNAS SCALE, but with how often the TrueCharts stuff breaks on updates, it will either be moved to Unraid (still on ZFS), or stay on TrueNAS while I spin up a VM and start learning Ansible for the apps. It will be handling media playback, serving documents, and hosting game servers.

Firstly, the special metadata vdev. Wendell’s video on Optane drives is what gave me the initial idea to add that to the plans.

File Explorer is actually kind of sluggish, lurching forward when I try to save to a folder that contains many images. I’m hoping this would help with that issue.

The plan is a RaidZ1 of 3 or 4 Optane P1600Xs (I’d have much preferred the smaller M10s for the much lower cost, but the same video said their performance makes them practically pointless, never mind the capacity), split between two carrier boards to get the benefits of Optane, with the redundancy there in case one fails. However, there was apparently some concern that the special metadata vdev may not be big enough in that case.

As I understand it, I should aim for about 0.3% of the main pool’s capacity. I’m currently running the commands to check the block sizes, along with the zdb -Lbbbs poolname command, to try and get that info. As it stands, I have a single-vdev pool of five 14TB drives in RaidZ2, with room for 10 more drives. I can either expand the current vdev to 6 wide and put another 6-wide vdev into the same pool, keeping the remaining 3 drives as hot spares, or leave the 5-wide vdev as is and add the other 10 drives as two more 5-wide vdevs, making a pool of three 5-wide RaidZ2 vdevs.
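For reference, this is roughly what I’m running to size that out (a sketch, with “tank” standing in for my actual pool name, and the 0.3% figure being just the rule of thumb above):

```bash
# Block statistics for the pool (-L skips leak detection); the per-type
# LSIZE/PSIZE/ASIZE breakdown near the end is what gets added up to
# estimate how much metadata a special vdev would have to hold
zdb -Lbbbs tank

# Very rough special-vdev target at 0.3% of usable capacity for the
# current 5-wide RaidZ2 of 14TB drives (3 data disks' worth of space)
awk 'BEGIN { printf "%.0f GB\n", 3 * 14000 * 0.003 }'
```

Scaling that ~126 GB to the full 15-drive layout roughly triples it, which I assume is where the “may not be big enough” concern comes from, given the P1600X tops out at 118 GB.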

I have also grabbed an LSI 9305-16i as recommended by Reddit, though I may run some SATA SSDs off the motherboard’s built-in SATA ports too.

I’m now on the fence about a dedicated GPU for transcoding in Jellyfin.
For the time being I’d like to stay on DDR4-level hardware, and what I have on hand is what I’m working with, so I’m sticking with the unbuffered ECC from my current system. That’s also why I can’t use an Intel iGPU for transcoding and am instead using the Vega iGPU in the 4650U, though I’m not confident it’s actually working.

I’ve also gotten a 2.5Gbit network card for now, and that only takes up one PCIe Gen 2 lane anyway.

There’s no intention to add a SLOG or an L2ARC. The general consensus seems to be that the latter is only needed if the ARC miss rate is above ~10%, which mine isn’t, and a SLOG is kind of useless except for sync writes like NFS, which is how the charts access the vault.
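For what it’s worth, the check that conclusion is based on is nothing fancier than the stock OpenZFS tools, roughly:

```bash
# Sample ARC stats once a second, ten times; the hit/miss percentage
# columns are what the "do I even need an L2ARC" rule of thumb goes by
arcstat 1 10

# Or the full report, which includes the overall hit ratio since boot
arc_summary | less
```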

Still, that does mean at least 4 PCIe cards, maybe 5 if I also get an SSD vdev/separate pool for super-fast stuff.
This is as follows:

x16 PCIe Gen 3/4, bifurcated, for the 4 Optane disks + 4 NVMe drives across two carrier cards = 32 lanes
x8 PCIe Gen 3 LSI 9305-16i = 40 lanes running total
x8/x16 (?) PCIe Gen 3/4 transcode GPU = 48/56 lanes
x1 PCIe Gen 2 2.5Gbit RJ45 NIC = 49/57 lanes (though if I do get a 10Gbit RJ45/SFP+ card in the future, it’s still only x1 at Gen 4)
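Or, as a quick sanity check on those running totals (just the arithmetic):

```bash
# 2 bifurcated x16 carriers + x8 HBA + x1 NIC, then the GPU at x8 or x16
awk 'BEGIN { base = 2*16 + 8 + 1; print (base + 8) " or " (base + 16) " lanes" }'
```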

So, given this admittedly and knowingly rather overkill setup, what would my options be for the platform? Xeon? Threadripper? EPYC? Those should be the only options capable of handling that many lanes, but I’d also like to keep power draw down, obviously.

Any suggestions? Anything that doesn’t make sense that I could whittle down to save on lanes/money/power/etc.? I’d like to hear the criticisms, since this is obviously the sort of “I’d love this, but is it actually sensible?” kind of build, and I’d like some help grounding myself back in reality.

I’m space constrained too, which is why I don’t try to split the transcoding/charts off to a separate device. It’s certainly open as a potential option in the future, but that’s quite far away, and this project is going to be piecemeal upgrades as parts get cheaper and I try to snipe them.

Additionally, I’m in Canada, so the suggestion to just get an R730XD for cheap (the $300~$400 range) isn’t exactly viable unless I’m missing something locally, since the shipping cost to get a box like that here is around 2~3x the cost of the box itself.

Finished running both zdb -Lbbbs and the histogram command from the ZFS metadata thread, but I’m not entirely sure how to interpret the former’s result.
The latter, however, seems to show a majority of 1k blocks, followed by 8k, 2k, and 16k blocks respectively.
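For context, the histogram I ran was along the lines of the script in that thread; a simplified stand-in looks roughly like this (the path is just a placeholder for wherever the dataset is mounted):

```bash
# Bucket file sizes into powers of two to see what special_small_blocks
# would actually catch; anything at or under 1K lands in the first bucket
find /mnt/tank/dataset -type f -printf '%s\n' 2>/dev/null | awk '
  { b = 1024; while (b < $1) b *= 2; count[b]++ }
  END { for (b in count) printf "%12d bytes  %d files\n", b, count[b] }
' | sort -n
```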

If it helps, I can drop the output here as well, if someone can help me figure out what it means?

I too have found myself annoyed by the lack of PCIe lanes on consumer hardware; I’ve gone back and forth between enterprise and consumer gear because of it. My current approach is back to consumer gear, making a “hybrid” zpool with two large enterprise NVMe SSDs as the mirrored special vdev for a rust-backed RaidZ2. I can then set the special_small_blocks flag for any dataset that wants/needs NVMe performance.

Using the bifurcation available on my Z790 board (x16 to x8/x8), I have space for 5 NVMe drives, a 10GbE card, and an HBA. I don’t need a GPU because I’m running Intel and can use the iGPU for transcoding.

Optane is undoubtedly superior from a latency standpoint vs my Gen3 U.2 drives, but the performance of this setup vs rust alone is still massively better, and I have room on my board to stick in a couple of Gen4 NVMe drives if I want to make an even faster NVMe pool. Now, I have no experience with Optane, and maybe the performance uplift vs the plain old Gen3 NVMe I’m using for my special vdev is similar to the uplift of going from rust to NVMe. But I’m not sure that it is, and the extra capacity of non-Optane U.2/U.3 made the choice pretty easy for me as a homelabber.
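To put that concretely, the whole approach boils down to two commands (a sketch; the pool/dataset names and device paths here are placeholders):

```bash
# Add a mirrored pair of NVMe drives as the special vdev of an existing
# rust pool; the pool's metadata lands on it from then on
zpool add tank special mirror /dev/disk/by-id/nvme-drive-a /dev/disk/by-id/nvme-drive-b

# For any dataset that wants NVMe performance, also send its small data
# blocks (here, anything 64K and under) to the special vdev
zfs set special_small_blocks=64K tank/apps
```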

If you find a board with some M.2 slots, you can usually get PCIe adapters for them. That might help?

I think you could make your life much easier by finding some kind of JBOD/disk shelf solution for the 15 disks and just having them separate. That would free up two of your PCIe slots in the NAS.

Am I overallocating things here?

It’s difficult to say, since I also didn’t catch any description of your use case. Are you a bachelor with sole access to the server? Then yes, it’s probably very overbuilt. Are you a family of 10? Then it might be fine. Etc.

I think the Optane stuff might be a bit overkill regardless of how many people would be hitting the server. If you’re having problems in (Windows?) File Explorer, I would suggest that’s an issue with your PC, and certainly not something you need to build a $500 cache to troubleshoot!

Unfortunately, as pointed out in the original post, I’m a bit space constrained, which is why there’s no JBOD solution, at least for the moment!

Regarding access, there are going to be at least 3 users hitting it fairly regularly, and possibly more friends on top of that, but they’re all over the globe, so whatever I can do to reduce latency for streaming Linux ISOs to them would be great. There are also plans for the server to host a few more game servers, on top of the existing charts that are running.

And in that case I might drop the Optane stuff, though I think I should still look into grabbing a few to keep on hand for the future, since Intel’s no longer making them and they’re kind of hard to find…

I’d still have to find something with 33/41 lanes, but I think that’s at least within workstation-class capabilities, rather than server/enterprise-level stuff.

So I guess it’s back to the age-old question of “what’s the lowest-power, fairly cheap, high-PCIe-lane-count chip”, which there have been multiple threads about already!

I’m not really super experienced here, so do take my opinions with a grain of salt re: Optane.

Given the new information about the other users, here is my thinking:

If you are in New York and your friend is in London, suppose they make a request to your server with the Optane cache, and suppose there is 10 microseconds of latency in the block device response. Even in a perfect world where you and your friend have a direct fibre connection to one another, light in glass needs something like 28 milliseconds to cover that distance one way, so the block device is already a rounding error.

And the real world is worse: a transatlantic packet (one way) seems to be something more like 35 milliseconds (that’s 35,000 microseconds). Optane’s 10 microseconds would be responsible for roughly 0.03% of the one-way latency.

Edit: actually it might be helpful to give the “average latency” for other types of storage: SATA SSD seems to range from 70-200 microseconds, HDD is 10-15 milliseconds. Obviously this is highly dependent on the nature of what is being read/written. I would think there are diminishing returns past SATA SSD (i.e., going all the way to Optane), but I don’t really understand the latency differences between NVMe and Optane, so maybe I’m missing something entirely in my analysis… it seems like you could even drop from NVMe back to SATA and, while you’d be kneecapped on transfer speed, you might get comparable effective latency to Optane for a remote user?
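To put rough numbers on that, using the figures above and the 35 ms one-way trip (just back-of-the-envelope arithmetic):

```bash
# Share of the one-way latency attributable to the storage device itself,
# for a remote user ~35 ms away (device figures are the rough estimates above)
for dev in "optane 10" "sata_ssd 150" "hdd 12500"; do
  set -- $dev
  awk -v name="$1" -v us="$2" 'BEGIN {
    wan = 35000
    printf "%-9s %6d us -> %5.2f%% of the one-way trip\n", name, us, 100 * us / (us + wan)
  }'
done
```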

What’s a few dozen milliseconds between friends?

Source
