Looking for advice on TrueNAS setup - Pools, Cache, Specials


First post, and have tried having a dig through the forums, Wendell’s ZFS/Optane Youtube clips and reddit but was unable to get a clear direction.

Basically building a TrueNAS system, and had obtained 2* Optane 118GB drives with the intention of using these as metadata special devices. Unfortunately, I didn’t do enough forward planning and suspect these won’t end up being large enough for the amount of spinning rust.

I just purchased some IronWolf Pros at a good price (was considering Exos/WD HC550s but the price on these was too good), so I have 9 * 20TB on the way, planning to do 4 * 20TB mirrors in the pool. Reason for buying 9 was to mitigate the risk of one drive arriving DOA and then having to buy a single at a premium (happened in the last batch I ordered); worst case I have a cold spare.

The base PC has the following:

  • 192GB DDR4 RAM
  • in X16 Slot: 4*4 NVMe carrier (handles the x16 slot’s lane splitting/bifurcation)
  • in X8 Slot: 2*10Gb SFP+ Card (Intel X520-DA2 maybe)
  • in X4 Slot: LSI 9201-16e (have this now working in Truenas with some test drives)
  • GPU (in x1 slot, might do some Plex transcoding etc, so assume this doesn’t need more than x1)
  • in X16 Slot: SPARE (considering another NVME carrier)
  • There are also U.2 flex bays (currently booting off a 1TB drive, but could go to SATA boot drives instead and then utilise the U.2s in a mirror).

Storage Hardware
2* Intel Optane 118GB P1600X
2* WD SN850X 4TB (plan to mirror, VM Storage, Bulk video editing storage etc)
8* Ironwolf Pro 20TB Drives (plan to run in mirrors for ~80TB)

In terms of data storage, was planning to use this to back up 2 * Synology NAS units that have 16TB mirrors in them, with not much space used (they are currently backing each other up using the Synology Hyper Backup tool). Will also be storing GoPro footage, PC backups, Time Machine MacBook backups and then obviously general other stuff. Unlikely to come anywhere near filling the total 80TB; mostly spanning more mirrors for speed (just upgraded to 10/20Gb bonds throughout the home). Eventually this server will also house IP cam footage.

I would say, generally speaking, the bulk of the data is larger files, especially on the hard drives, but I do have some reasonably sized datasets containing heaps of small things too. Also want to run a Steam/game cache so I don’t need to keep everything on the laptop and PC all the time.

Basically, looking for recommendations on the best setup to go with. I.e. get 2 more P1600Xs and have roughly 240GB of metadata special for the spinning-rust pool (at ~40TB of data that gives me a ~0.5% ratio). My biggest worry is creating a metadata special now and then it not being big enough long term. Reality is I probably won’t have heaps of data at the start, so could easily add the special vdev later and then re-write the data to refresh the metadata on it. Alternatively, do I find a half-decent 1TB NVMe and use that as the metadata special (in theory it will be easier to replace quickly with off-the-shelf NVMe in future)?

Things to clarify:

  • Do I bother with a SLOG, or given the VMs would be on half-decent storage, is it not a big deal? I could use the Optanes for SLOG on the spinning rust if I don’t go with a metadata special.
  • Metadata special - do I bother, and how much storage do I need? (Obviously hard to know since the data set isn’t loaded into ZFS yet and will just grow over time.)
  • If I add more NVMe for either of the above options, would I be better having the bulk-storage NVMe on a separate PCIe carrier board so it’s less likely I accidentally pull the incorrect drive in the event of a fault? (Planning to physically label drives with serial numbers etc. so I can more easily correlate in TrueNAS, but my understanding is if I have a free slot there, I can install the new drive, then “repair” without having to remove the faulty drive first.)
  • I understand if I am using mirrors throughout (i.e. Not RaidZx) that the metadata special can be removed and migrated back onto the spinning rust drives.

Sorry for the large amount of info - I have seen too many threads of people detailing how they wish they had set up their TrueNAS system differently and now it’s too late to make major changes.

You can test this by letting the pool run with sync=disabled. If it’s noticeably faster (the disks will tell you), you will benefit from a SLOG. Use your P1600Xs in this case; they make a great SLOG.
If it’s not faster, sync writes aren’t a factor and you don’t have to worry about random writes. You may want to test this with the NVMe pool as well. Flash usually performs worst for random I/O, so you want some way to avoid it.
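A quick synthetic version of that comparison, as a sketch (the pool/dataset names are placeholders, and dd is a crude stand-in for a proper benchmark like fio against your real workload):

```shell
# Scratch dataset on the pool ("tank" is a placeholder name);
# turn compression off so /dev/zero isn't compressed away
zfs create tank/synctest
zfs set compression=off tank/synctest

# Force every write to be a sync write and note the throughput
zfs set sync=always tank/synctest
dd if=/dev/zero of=/mnt/tank/synctest/testfile bs=128k count=10000

# Disable sync entirely and repeat
zfs set sync=disabled tank/synctest
dd if=/dev/zero of=/mnt/tank/synctest/testfile bs=128k count=10000

# A large gap between the two runs means a SLOG would help.
zfs destroy tank/synctest
```

The real pool should stay on sync=standard afterwards; sync=disabled trades crash safety for speed and is only for testing.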

If you can afford all those disks, 1-4 TB NVMe for a mirrored special vdev is barely noticeable in cost. No need to limit yourself to ~100GB and make things complicated. And with more space to work with, you can include small files too. Bandwidth isn’t a limiting factor, so PCIe 3.0 or even SATA will be fine.
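For reference, adding the special mirror and opting small files into it is only a couple of commands. A sketch with placeholder pool/device names (note this only affects newly written data):

```shell
# Add a mirrored special vdev to the pool
# ("tank" and the nvd device names are placeholders)
zpool add tank special mirror /dev/nvd0 /dev/nvd1

# Route blocks of 64K or smaller to the special vdev along with
# metadata. Keep this below the dataset recordsize (128K default),
# or ALL data would land on the special vdev.
zfs set special_small_blocks=64K tank
```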

Correct. But it doesn’t work the other way: ZFS doesn’t move existing data off the disks to fill a newly added special vdev, just like it doesn’t rebalance the drives when adding a new mirror for data.
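If you do want old data to benefit from a later-added special vdev, the usual workaround is to rewrite it in place. A rough sketch (paths are placeholders; this needs enough free space for a full copy and plays badly with existing snapshots, which keep holding the old blocks):

```shell
# Copy the data so it is rewritten -- the new metadata/small blocks
# now land on the special vdev -- then swap the copy into place
cp -a /mnt/tank/media /mnt/tank/media.rewrite
rm -rf /mnt/tank/media
mv /mnt/tank/media.rewrite /mnt/tank/media
```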

Check out my guide for general use and recommendations on ZFS and support vdevs: https://forum.level1techs.com/t/zfs-guide-for-starters-and-advanced-users-concepts-pool-config-tuning-troubleshooting/


Huh? If flash is the worst, what’s the best?

Compared to sequential writes in the “flash bracket”. If you can avoid writing random I/O, you should, even with flash. It also saves on drive endurance. There isn’t any faster storage other than Optane and DRAM, so a SLOG for NVMe pools is much harder to do than it is for HDDs, if it can make a difference at all.


Oh, gotcha. Yeah, random IO is just the worst no matter what. :+1:

@Exard3k I have a related question about the special vdev. So with standard vdevs you can “upgrade” their capacity by changing out the drives one at a time with larger ones and resilvering in between. Can this also be done with a special vdev?

I’ve never done this for a special vdev, but zpool replace is used for all vdevs. You gotta replace failed drives after all. Keep in mind that you temporarily need an additional slot to replace a disk safely.

autoexpand=on should work just as fine as with any other vdev.
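As a sketch, the drive-at-a-time upgrade would look like this (pool and device names are placeholders; wait for each resilver to finish before touching the next drive):

```shell
# Let the vdev grow automatically once all members are larger
zpool set autoexpand=on tank

# Replace the first special-vdev member with a bigger drive
zpool replace tank /dev/nvd0 /dev/nvd2
zpool status tank        # wait until resilvering completes

# Then the second member
zpool replace tank /dev/nvd1 /dev/nvd3
zpool status tank        # once done, the special vdev expands
```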

I tend to overbuy rather than underbuy so I don’t have to mess with plugging M.2 drives in and out too often. So don’t quote me on that :slight_smile:

So that’s a “probably, but no one has tested that yet.” I guess I’ll fire up a VM later and see what happens.

Personally I live dangerously and just remove one working disk while I’m upgrading. Definitely not recommended and definitely a “do as I say, not as I do.” :stuck_out_tongue:

Hi Exard3k, appreciate the detailed reply to all my concerns. I will definitely have a solid read through of the thread posted.

Thanks for that info, had seen bits and pieces about using regular 2.5" SSDs but again nothing super concrete.

In terms of affordability, there is a lot less available in Australia where I reside, so often the good deals that come up also come with a large postage cost which negates any savings. There is not a huge difference in price between M.2 NVMe and 2.5" SSDs, so I guess it makes sense to go NVMe.

Are there any specific recommendations? Presumably reliability (the Intel 670p is available at a reasonable price) and low latency are what matter here. This would probably also push me to add an additional NVMe carrier board so I have the slots to replace a drive without removing the faulty one first in the event of failure.

2TB for an 80TB pool gives me a 2.5% ratio, which should be heaps factoring in small file storage as well.

Again, thanks for this info, very handy to know. I imagine that this is something that may be supported in the future.

I did forget to mention in the original post that all systems are on UPS backups and will be set to shut down.

I’m already too far down the hardware purchasing path to turn back, so the additional cost of a few more drives I will have to just eat.

Obviously another consideration is Core vs Scale, which I was planning to do a bit of testing on before actually committing to deploying.

I don’t have huge VM needs, so I imagine Core can handle the simple things (Pi hole type stuff) I want running all the time and my regular machines can run VMs with the network stores where required.

If you have the slots, get NVMe. The main benefit comes from NOT having it stored on mechanical drives. So SATA is totally fine, but with available PCIe lanes, there is hardly a reason to not use NVMe.
Any consumer NVMe drive that is not garbage-tier (without cache or really low TBW values) is totally fine. I’ve used Mushkin Pilot-E PCIe 3.0 drives in the past (can’t get any more average than that) and I never hit performance limits in practice, nor would endurance (listed as 650TBW for the 1TB model) be a problem in the next 20 years.

No. That’s a design decision we have to live with. However, it will balance over time as you modify and delete stuff. If it really troubles you, you can always recreate the pool and restore from backup.

No need to read through all of it, but I covered all “support” vdevs as well as useful tuning parameters, setting up properties and links to administration guides once you got everything up and running. Should alleviate 90% of all concerns you have.

For storage and ZFS, go Core. If you need clustering, Kubernetes, GPU passthrough, more than 16 cores per VM, etc., Scale just offers more for virtualization. It uses QEMU/KVM on Linux after all.
But Scale is also a fairly new product and comes with teething issues. I had to revert back to Core because of buggy iSCSI implementation on Scale. But YMMV.
I use Proxmox for everything virtualization and Core for storage. I just think Proxmox is better for my virtualization needs and Core is better for my storage. It’s very much a personal preference. No one is preventing you from downloading the ISOs and booting up a VM to test drive everything.

I found the why on this quite interesting. (I’m sure you know this already Exard3k)

So when you give ZFS a bunch of stuff to write, it turns to its vdevs and gives each a small bit of that data to write. Whichever vdev finishes first gets the next piece of data, the second gets the piece after that, and so on down the line. So the fastest vdev is going to get a bit more data written to it. Then over time, as that fastest vdev fills up a bit more, it’s going to start slowing down and the other vdevs will start to be favored for the placement of data. It’s a very clever and simple mechanism for balancing the pool. Even if you add vdevs, over time it’ll all balance out on its own.
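You can actually watch this happening on a live pool; the per-vdev allocation drifts toward even as data is written (“tank” is a placeholder pool name):

```shell
# Per-vdev capacity: the ALLOC/FREE columns show how full each mirror is
zpool list -v tank

# Per-vdev activity: refresh every 5 seconds to see where writes land
zpool iostat -v tank 5
```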

I think things like this about ZFS are just the coolest. :slight_smile:

If you’ve never really run TrueNAS before, I’d go with Core. It’s the OG, and it’s dead reliable* at what it does. I think Core is also the best way to learn the TrueNAS design philosophy.

If you want one box to rule them all, Scale is a better choice. It gives you all the usual Linux trappings that people are used to these days.

*(except for 13.0U4 which has slightly broken ARC, but they’ve fixed it and we’re just waiting for U5 now)


Those Intel SSDs are nice, but I think the Samsung Pro models would do the job, and they are cheap right now (1TB for under $100). Those Optane drives have great endurance and crazy IOPS, but I think a quality larger NVMe might work better, though I could be wrong. I do love the durability of Intel storage. I used to put 3510s and 3520s as boot drives in everything at one job many years ago because they would never die. I love durability.