How-To - Kioxia 4x 1tb ZFS Root

What are We Doing?

ZFS – once you know, it’s hard to go back.

ZFS has a fair bit of overhead, and I suspect there are changes coming down the pike to better support performance of NVMe devices, but ZFS is unmatched when it comes to engineering and stability.

For me, it is worth the performance tradeoff to have the redundancy and reliability.

In this writeup, I’m going to share some performance numbers and tuning. (Note: ZFS Vets, share your tips and tricks too, please!). The “ingredients” for this writeup are aimed at someone with a fairly high-end workstation:

  • Intel Xeon or AMD Threadripper, or equivalent platform (lots of PCIe lanes)
  • Multiple devices for Speed and Resiliency
  • 4x NVMe (I’m using Kioxia 1TB M.2 NVMe drives, and they are nice)
  • Gigabyte CMT4034 compact 4x NVMe to PCIe x16 adapter.

For Intel Z490/Z390/Z370 users

I don't recommend bothering with anything except perhaps mirroring two NVMe devices. Generally, these motherboards are troublesome because they have multiple NVMe slots wired through the DMI interface, which itself is limited to about 4GB/sec. For 4 NVMe drives, it just doesn't make sense. It is also tricky, in my experience, to get the 16 lanes from the CPU split up as x8 to your GPU and then 2x x4 to two NVMe drives, because most motherboards don't support this well.

Note for X470 and X570 users

It is possible to run NVMe as follows for an optimal setup: 1x NVMe through the CPU NVMe lanes, 1x NVMe through the chipset, and 2x NVMe through an add-in card that splits the second x8 slot into x4/x4. This often requires setting the slot to “x4x4” in the BIOS, so be sure you get a board that supports that.

As such I recommend going for a mirrored configuration only on AMD X470/X570 using just two NVMe and not 4.
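
If you go that route, the pool itself is a one-liner. The pool name and device paths below are placeholders, so substitute your own /dev/disk/by-id names:

# hypothetical device names - list yours with: ls -l /dev/disk/by-id/
zpool create -o ashift=12 tank mirror /dev/disk/by-id/nvme-DRIVE_A /dev/disk/by-id/nvme-DRIVE_B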

For Intel X299 Users

Intel offers VROC under somewhat screwy conditions on their X299 motherboards. I have several “VROC Keys” that unlock different types of VROC. The idea is that the CPU handles some of the RAID functions more transparently. This is actually a legit concern – Windows really struggles with bootable software RAID arrays, and this is an attempt at handling NVMe redundancy kinda-sorta in hardware. Unfortunately the marketing team seems to have gimped what the engineers tried to give us, rendering it nearly unusable in this context and scenario. I’ve never been able to get it to work, on any operating system, to my liking. For ZFS specifically, it is always best to have ZFS deal directly with the underlying block storage devices, so VROC is not applicable here for that reason alone.

Where are the bodies buried with ZFS?

I love ZFS, but the ZFS design predates the era in which storage arrays approach main-memory bandwidth limits. Track this thread on GitHub:

Eventually we’ll get “real” Direct I/O support for ZFS pools. For this 4-up NVMe setup, it’s not that bad. It’s also possible to use a striped mirror (two mirrored vdevs), but then only half the capacity is usable. That’s quite a fast setup, though.
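
For reference, a striped mirror of four drives looks roughly like this (placeholder device names; ZFS stripes across the two mirror vdevs automatically):

zpool create -o ashift=12 tank mirror nvme0 nvme1 mirror nvme2 nvme3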

This is pretty much the only downside of ZFS. I am aware of BTRFS and I use it. It can be fine, too, as an alternative to ZFS, depending on your use case. My preference, for technical reasons, is still ZFS.

What are we working with?

Here we are – for the guide I experimented with the Kioxia BG4 (BGA form factor) NVMe as well as more standard 80mm M.2 drives from Kioxia – all 1TB in size, as that’s the sweet spot for capacity/performance/endurance right now, imho.

I’m using the ASUS ROG Zenith II Extreme Alpha with Threadripper. In this case, the Threadripper 3990X. This motherboard has a total of 5 NVMe slots: two on a DIMM.2 riser and three on the motherboard itself. Many Threadripper motherboards also include add-in cards to support a large number of M.2 devices.

For this guide, I will use the Gigabyte CMT4034 carrier, which is a half-height, half-length PCIe card. This card is designed for use in servers and other places with high airflow. If you use it, be sure it gets appropriate airflow. It is unnecessary with the ROG motherboard and many other Threadripper motherboards, and it wastes a PCIe slot I could otherwise use for another add-in card (versus using onboard M.2).

Alternatives to m.2 cards

There are M.2 carrier cards in the market now that offer a PLX bridge, including a highpoint device which also offers raid functions. With this, it would be possible to use an x8 slot, for example, to support 4x x4 NVMe devices at full bandwidth (until you hit the upstream limit of the x8 PLX bridge).

This would be awesome if you could get a PCIe 4.0 PLX bridge because x8 PCIe 4.0 = PCIe 3.0 x16 worth of bandwidth, and that’s plenty of bandwidth for storage. Currently, I know of no such beast on the horizon.

Liqid offers some killer all-in-one HHHL cards that have a PLX bridge and all the performance you need. They’re tailor-made for what we’re doing here, and often use Kioxia devices for their underlying storage. That’s the “off-the-shelf” solution for what we’re doing here, if you don’t want to DIY it.

Hardware Setup

Most motherboards refer to this as “x4x4x4x4” mode, but ASUS calls it PCIe RAID mode. Not to be confused with AMD’s RAIDXpert.
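
Once the slot is in x4x4x4x4 mode, it’s worth confirming from a live environment that all four drives actually enumerate before installing anything. A quick sanity check (nvme list comes from the nvme-cli package):

lspci | grep -i "non-volatile"   # should show four NVMe controllers
sudo nvme list                   # lists model, capacity, and firmware per drive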

Ubuntu Fresh Install

“The Ubuntu installer can do ZFS on root!” – well, not so fast. It’s really rough. I’d recommend you remove any other disks you do NOT plan to use with the Ubuntu installer. Dual booting? Pop that disk out!

It’s possible to set up ZFS right from the Ubuntu installer, but only when you let it partition automatically. This creates a couple of ZFS storage pools: bpool for /boot and rpool for / (the root filesystem).

This will give us 3TB of usable space, and any one of our 1TB NVMe drives can die without our computer missing a beat. Modern versions of ZFS support TRIM just fine, so NVMe performance shouldn’t fall off as the drives are used.
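
Automatic TRIM is not necessarily enabled out of the box, though, so it’s worth checking. Assuming the installer-created pool names bpool and rpool:

zpool get autotrim rpool bpool
sudo zpool set autotrim=on rpool   # continuous TRIM
sudo zpool trim rpool              # or kick off a one-shot manual TRIM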

It is worth experimenting with how you create dataset(s) for your files and seeing what performs well for your use case.
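
As a starting point for that experimentation, here’s a sketch of a scratch dataset with a couple of the usual knobs turned. The dataset name and the values are just examples, not recommendations:

sudo zfs create -o recordsize=1M -o compression=lz4 -o atime=off rpool/scratch
zfs get recordsize,compression,atime rpool/scratch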

TODO: Here’s how I set up Ubuntu

lsblk -o name,size,uuid


zpool create -o ashift=12 tank raidz /dev/disk/by-partuuid/something ... 
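
Then confirm the layout and that the ashift value stuck:

zpool status tank
zpool list -v tank
zpool get ashift tank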

Halfbakery

Note: In general, if you set up /home on a separate partition, it makes it easier to totally reinstall your OS or to distro-hop. The downside is guesstimating a size for / that doesn’t include home. I usually go for 250-450GB for /, not including home, especially when I’ve got more than 2TB of total disk space. YMMV, however.
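
On ZFS specifically, you can get the same flexibility without guessing a partition size by giving /home its own dataset instead of its own partition. If you are rolling your own pool rather than using the installer’s layout, a minimal sketch (the dataset name is just an example) would be:

sudo zfs create -o mountpoint=/home rpool/home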

In Conclusion

With the relatively high density and low cost of flash, everyone can enjoy some redundancy and fault tolerance on their workstations these days. Sure, flash is far more reliable than spinning rust, but that doesn’t mean it is bulletproof. And RAID, ZFS, snapshots, etc. are no substitute for a functioning backup system. Thanks to Kioxia for supplying some hardware for my mad science. I’m very satisfied with the performance here.

5 Likes

I blanket apply the following to my datasets, and also have it applied to my proxmox mirrored root pool (note, some of these kill BSD compatibility, which I’m fine with):
zfs set acltype=posixacl xattr=sa atime=off relatime=off aclinherit=passthrough rpool

Normally I ALSO include dnodesize=auto in that list; however, I came across the following:

https://lists.proxmox.com/pipermail/pve-user/2019-March/170503.html


So don’t use dnodesize=auto on rpool in proxmox, and possibly other distributions. Also be wary of ZFS using it as a default when creating a new pool and moving your root pool. It seems zfs set dnodesize=legacy rpool is what is needed.
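
A quick way to check what your datasets are currently using before touching anything:

zfs get -r dnodesize rpool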

I’m sure you know this, but it should be mentioned so others can search for it more easily: the term for splitting x16 into x8x8 or x4x4x4x4 is “PCIe bifurcation”, and it should be present and searchable in the motherboard manual if the feature is supported.

As far as hardware goes, I’ve had zero issues with the ASUS Hyper M.2 X16 PCIe 4.0 X4 Expansion Card, though all it is doing is holding some cheap eBay 16GB Optanes (which only use two PCIe 3.0 lanes each) for my pools’ SLOGs.
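
For anyone copying that idea, adding a SLOG after the fact is a one-liner; the pool name and device paths here are placeholders:

sudo zpool add tank log /dev/disk/by-id/nvme-OPTANE_A
# or, for a mirrored SLOG:
sudo zpool add tank log mirror /dev/disk/by-id/nvme-OPTANE_A /dev/disk/by-id/nvme-OPTANE_B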

4 Likes

I’m so envious of you Wendell with 4TB worth of M.2 2230 goodness. I can’t find the BG4 for sale anywhere. All signs point to it being OEM only.

Good to know that VROC is useless if I ever wanted to do this on my X299 system.

This just jumped at me on eBay. https://www.ebay.es/itm/Kioxia-BG4-Series-Solid-state-drive-1-TB-internal-M-2-2230-PCI-KBG40ZNS1T02/124444342168
The same seller has more sizes.

Good luck getting that, ’cause last time I ordered from there my item never arrived because of Italy’s lockdown.

That is almost certainly not an official channel to sell these SSDs. And shipping to the Americas is gonna cost an arm and a leg IF it even makes it out of Italy.

1 Like

ZFS is pretty cool.

My issues with ZFS sit around the sheer amount of RAM needed to use its more advanced features. I don’t run stuff capable of RDIMMs, so RAM is a limitation for me.

I usually opt for RAID6 instead, which also has the decent advantage of being resistant to an SMR drive self-destructing. The CMR vs SMR debacle is a bit of a nuisance right now. I understand it’s cheaper to manufacture, but WD doesn’t need to lie.

Fortunately in your case you use SSDs lol, so this won’t even bother you. Now I can’t remember: are those 3.0 or 4.0? Assuming 3.0 due to age.

300 Euro for 1TB? Pass. The size is pretty crazy, but it just doesn’t justify 3x the price of a normal SSD.

1 Like

It’s a third party reseller through unofficial channels, that’s part of the price inflation there.

If those tiny M.2’s float your boat, wait’ll you get a load of Toshiba/Kioxia’s XFMEXPRESS form factor: up to PCIe x4 in just 18x14mm!

https://business.kioxia.com/en-jp/memory/xfmexpress.html

https://www.anandtech.com/show/14711/toshiba-introduces-new-tiny-nvme-ssd-form-factor

It was announced over a year ago, but sadly I haven’t seen or heard of any production designs that implement it. Seems like it would be perfect for mobile and embedded devices with high I/O demands.

2 Likes

I am blown away that no OEM is making XQD cards from their BGA form factor. The margins would be over 25% at current XQD prices, which is quite a lot in this sector.

I am aware of some OEM designs using this, but nothing is in-market yet. I don’t know if it’s real, but I think I saw this form factor in a metal slide/slot inside a phone. Like a memory card, but PCIe. EVERYTHING is going PCIe, for speed and efficiency. It can’t get here fast enough. Even the Raspberry Pi has PCIe interfaces now! Exciting.

2 Likes

I’d love it if someone would make a PCIe 4.0 x16 card with a PCIe switch and 12 PCIe 3.0 M.2 2230 slots. When fully utilised, each M.2 drive would have ⅔ of peak PCIe 3.0 bandwidth.

With 14 of these little Kioxia drives (2 on the motherboard), that’d be optimal for dRAID2 (8:2:1) plus a 3-way mirrored special vdev for small blocks.
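
For what it’s worth, on OpenZFS 2.1 or newer that layout would look roughly like this: 11 drives in the dRAID vdev (8 data + 2 parity + 1 distributed spare) and 3 in the special mirror. Device names and the small-block cutoff are placeholders:

zpool create tank draid2:8d:11c:1s \
    nvme0 nvme1 nvme2 nvme3 nvme4 nvme5 nvme6 nvme7 nvme8 nvme9 nvme10 \
    special mirror nvme11 nvme12 nvme13
zfs set special_small_blocks=32K tank   # route small blocks to the special vdev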

Liqid has this – 8 m.2 ports behind a PLX bridge in x16. It would be just as you describe if you populate 12 of the 16 bays. $$$ though.

2 Likes

Yeah, unfortunately, that’d use 2 precious x16 slots. :wink: And like you said, $$$.

Highpoint has the SSD7540

Also big money

They have some PCIe 3.0 cards that might be cheaper/on sale in various places, but there’s no 12-port card, just 4 or 8 ports.

1 Like

Hi Wendell,

You mentioned in your video that you were searching AliExpress for a super small adaptor. Have you had any luck finding one?

You got me kind of interested in what actually goes into one of these NVMe raid/adaptor cards, and I was bored so I did some stuff.

This is by no means a working board. More of a first look/proof of concept. Here are a few rambling things I have looked into so far:

  • I believe at their simplest these cards are pretty much a direct connection of the PCIe lanes to the corresponding locations on the NVMe drives. The 16 lanes get split up into 4 groups of 4.

  • There are some other signals you have to take here or there, but those are just details. Haven’t done much with them yet. Would need to do some research.

  • You need a DC-DC converter to turn the 25W you get from the PCIe 12V rail into usable 3.3V for the drives (that’s the thing to the left).

  • I think the boards might be doable for something in the neighborhood of $30 with an MOQ of 5 units. The boards are available at ~$40 for 5 (less at higher volume), and the BOM cost is probably in the range of $10. The difference is wiggle room for something I probably forgot.

So to the question: Is this something you would be at all interested in Wendell? I don’t particularly want to go dig through 400 pages of NVMe documentation for those “details” I mentioned earlier if you already found something elsewhere.

3 Likes

Never did! I regret that I have but one like to give. Do I need to DIY-solder it together, or can the proto service populate it for a few bucks more?

Each of these tiny cards is max like 4ish watts, so we have more than 5 watts of overhead to spare lol

you miiight need some decoupling caps on those signal lines too

The board house I was looking at (JLCPCB) does assembly, but only of parts that are in their catalogue. They charge about $10 for it. The problem is that they don’t have NVMe connectors, and those would be the hardest to solder (0.5mm pitch).

A different supplier (PCBWay) will accept custom parts for assembly, but they are significantly more expensive ($100-150 for 5 boards and another $88 to have them assembled). That said, they do offer some features that you should technically have on this type of edge card, like hard gold plating and beveled edges.

I think assembly doesn’t really become practical until you get to larger runs, but that leaves the problem of soldering tiny parts. I can probably manage 0.5mm pitch, but that is likely beyond most people’s equipment setups.

I wasn’t too worried about the power budget. Even 5W on each drive lets me be as low as 80% efficiency on the converter.

The signal lines are run as 85-ohm differential pairs as per the PCIe spec. There are supposed to be DC blocking caps on each line, but from what I can tell those are handled by the transmitter (so either the mobo or the SSD, depending on which pair of lines). Any of the simple adaptor cards (ones without RAID or other special functionality) I can find decent pictures of seem to be direct connections.

Do you happen to have the exact model of those Kioxia SSDs? Or better yet, their dimensions and hole locations. I need to actually put the mounting holes on the board. I think you mentioned these ones aren’t quite standard lengths.

Sorry to dig up this 3-year-old post. If you’re using atime=off, then relatime=off is ignored.

And if you use acltype=posixacl, then aclinherit is also ignored. By the way, posixacl is a synonym for posix.

JUST FYI.

https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops.7.html
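
For anyone checking their own pool, the effective values are easy to query:

zfs get acltype,aclinherit,atime,relatime rpool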

1 Like