What should I do with this hardware (pve/esxi/zfs/ceph)

I recently decided to rebuild and redesign my NAS and home lab after upgrading my LAN to 10G.

I was previously running TrueNAS virtualised on PVE, plus several nested hypervisors for testing Terraform and Ansible.

I'm trying to make two decisions:

PVE vs ESXi as the main hypervisor. I like the storage flexibility and built-in backups with PVE; however, ESXi is what roughly 70% of enterprises use, and it's good to stay plugged into that.

ZFS NAS vs Ceph cluster - this is really where I'm stuck. I'm trying to avoid spending more on hardware and would like to keep using what I have.

I've done some basic tests on ZFS and a small 3-node Ceph cluster. ZFS totally destroyed Ceph on performance. I understand a distributed file system is going to be less performant, but in my setup the Ceph SSD pools performed only slightly better than a single ZFS spinning disk.

My Ceph setup consisted of 3 nodes with one HDD-backed OSD on each, using a consumer NVMe as the DB/WAL device. ZFS was tested using both HDD and NVMe. Network is 10G.
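
For anyone wanting to reproduce the comparison, something like a QD=1 4k random-write fio run against each backend works; this is only a sketch, and the RBD device and ZFS file paths are placeholders for my setup:

```bash
# QD=1, 4k random writes, 60 s (device/file paths are placeholders)
fio --name=ceph-qd1 --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
    --iodepth=1 --numjobs=1 --runtime=60 --time_based \
    --filename=/dev/rbd0                 # an image mapped from the Ceph SSD pool

fio --name=zfs-qd1 --ioengine=libaio --rw=randwrite --bs=4k \
    --iodepth=1 --numjobs=1 --runtime=60 --time_based \
    --size=4G --filename=/tank/fio-test  # a file on the ZFS pool (no O_DIRECT; older ZFS rejects it)
```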

Given this hardware list, would Ceph be viable, or do I need more scale for performance?

20 HDDs of varying sizes, the smallest being 2 TB 5400 RPM
6 enterprise SSDs of varying sizes
12 consumer NVMe drives
12 consumer SSDs

I have 4 ATX-sized machines of different generations; the main 3 are:
Threadripper 2950X, 128 GB ECC
Ryzen 9 5900X, 128 GB
i5-8400, 64 GB

The other machines are mainly NUCs/mini PCs with limited CPU power.

Is Ceph viable on this hardware, or should I just go with a ZFS SAN and deal with the main host being a SPOF once again?

You can't beat ZFS on performance. You need a lot of nodes and good gear to beat it with Ceph, and even then you still have to fight the latency that cripples your QD=1 4k writes, simply because everything has to go over the network.

And ZFS is fine with all kinds of drives. Ceph needs SSDs with supercaps (power-loss protection, PLP) to deliver any reasonable number of IOPS, so none of your consumer drives will do well in Ceph.

You can probably use your enterprise SSDs as DB/WAL devices, and maybe as a replicated pool depending on their capacity. The HDDs will do fine if you can balance the OSDs in number and capacity across your nodes.
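
On Proxmox that layout is roughly a one-liner per disk, if I have the flags right; device names below are placeholders:

```bash
# Sketch (device names are placeholders): one HDD-backed OSD with its
# RocksDB/WAL carved out of an enterprise SSD.
pveceph osd create /dev/sdb --db_dev /dev/sde --db_size 60   # db_size in GiB

# Roughly the same thing with plain Ceph tooling:
ceph-volume lvm create --data /dev/sdb --block.db /dev/sde1
```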

If you want a fancy NVMe pool in Ceph, be ready to buy enterprise NVMe drives and new CPUs, as BlueStore OSDs on NVMe will eat your cores alive.
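
If you do go down that path, Ceph's device classes keep the fast pool separate from the HDD pool; a sketch with made-up rule/pool names:

```bash
# Sketch (rule/pool names made up): a replicated rule limited to OSDs of
# device class "nvme", then a pool pinned to it with 3 copies across hosts.
ceph osd crush rule create-replicated nvme-only default host nvme
ceph osd pool create fastpool 64 64 replicated nvme-only
ceph osd pool set fastpool size 3
```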

But Ceph is slow. If you can put everything in a single server, ZFS is boss. People here like to get JBODs for some extra room for HDDs and SATA SSDs, so everything stays inside one server instead of being spread out over the network.
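
For the single-box route, the usual layout is a pool of striped mirrors that you extend two drives at a time as caddies or JBOD bays free up; a sketch with placeholder disk names (use /dev/disk/by-id paths in practice):

```bash
# Sketch (disk names are placeholders): striped mirrors, grown one vdev at a time.
zpool create tank mirror sda sdb mirror sdc sdd
zpool add tank mirror sde sdf     # later, when more bays become available
zpool status tank
```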

If you still want HA and clustering, check out GlusterFS. It is usually used with ZFS as the underlying filesystem and should give you more options with your hardware. I'm not that deep into Gluster and can only report what I've read. It will have the same problem with network latency (THE problem with clustered storage), so prepare to be disappointed by the performance.
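
From what I've read (so treat this as a sketch, with hostnames and dataset paths made up), a replica-3 volume over ZFS-backed bricks looks roughly like this:

```bash
# Each node exposes a ZFS dataset as a brick; Gluster replicates across the three nodes.
# Bricks live in a subdirectory so Gluster doesn't complain about using a mount point directly.
zfs create tank/brick1                                   # run on every node
gluster peer probe node2 && gluster peer probe node3     # run on node1
gluster volume create gv0 replica 3 \
    node1:/tank/brick1/data node2:/tank/brick1/data node3:/tank/brick1/data
gluster volume start gv0
```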

But it doesn't matter to your hypervisor whether the shared storage is a single server or a cluster. A block device is a block device.

Thanks for helping me rationalise this. I had started looking into Gluster, and maybe I should go back there.

The issue I had with Gluster is that by the time you account for ZFS mirroring and then Gluster mirroring, you're sitting at 4x storage consumption (e.g. 10 TB usable costs 40 TB raw), which seems a little insane. You could use ZFS stripes at the node level, so the node becomes the failure domain, but I keep hearing that Gluster replication is nowhere near as performant as ZFS, and I'm not sure how it stacks up against Ceph.

Hrm decisions decisions

Failure domains… you have to see what you're comfortable with. Having ZFS mirrors on each node gets expensive really quickly. Ceph is designed to have very customizable failure domains via the CRUSH map, and you can end up with roughly the same storage efficiency as a single ZFS server, with a node-based failure domain instead of an OSD/disk-based one.
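
To make the failure-domain point concrete, here's a sketch (names made up); note the erasure-coding example needs k+m hosts when the failure domain is the host:

```bash
# The failure domain is just a property of the CRUSH rule / EC profile.
ceph osd crush rule create-replicated across-hosts default host   # copies land on different nodes
ceph osd crush rule create-replicated across-osds default osd     # copies may share a node

# Erasure coding is where the ZFS-like efficiency comes from (here 4 data + 2 parity,
# i.e. ~67% usable, but it needs 6 hosts with a host-level failure domain):
ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
ceph osd pool create ecpool 64 64 erasure ec42
```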

I don't know the details of Gluster, but the network is really, really slow. Just sending a packet takes time: CPU processing on each end, asking the drive to read/write, and the whole round trip back to the client.
I've seen Ceph benchmarks on really good hardware (enterprise NVMe + EPYC + 25 Gbit networking) where QD=1, BS=4k lands at around 1,000-1,500 IOPS. That's the best random-write performance you can expect with the best hardware. The bottleneck is the network and Ceph, not the drives.

That said, it scales to dozens of clients, and you can get millions of IOPS with enough nodes and threads or queue depth.
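
If anyone wants to reproduce that kind of number on their own pool, Ceph's built-in benchmark is enough; a sketch (pool name is a placeholder):

```bash
# 60 s of 4 KiB writes with a single op in flight (-t 1), i.e. the latency-bound
# QD=1 case, then random reads against the same objects, then cleanup.
rados bench -p testpool 60 write -b 4096 -t 1 --no-cleanup
rados bench -p testpool 60 rand -t 1
rados -p testpool cleanup
```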

This statement never gets old when talking storage.

I'm in a similar position, but I don't have that much hardware and have to buy new stuff. I'm building a 3-node Ceph cluster with both HDD and NVMe pools. I really hate leaving ZFS because it's so fast, and I'll really miss easy backups (Ceph just doesn't have an equivalent of zfs send). But I'm doing this because I want to learn clustering, Ceph, and more networking, and I love tweaking things, so Ceph is a welcome construction site. A virtualized cluster with ZFS zvols as OSDs is nice and all, but it only gets you so far.
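
The closest thing I've found to zfs send on the RBD side is snapshot diffs; not as slick, but it does cover incremental copies. A sketch with made-up pool/image names:

```bash
# Incremental copy of an RBD image via snapshot diffs (pool/image names made up).
rbd snap create vms/disk0@mon
rbd snap create vms/disk0@tue
rbd export-diff --from-snap mon vms/disk0@tue tue.diff   # delta between the two snapshots
rbd import-diff tue.diff backup/disk0                    # replay onto a copy that already has @mon
```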

I would still recommend a single ZFS server to anyone with <250 TB of data or a need for (NVMe) speed, which is all of us homelabbers :wink:

Valid point. So what happens when you run out of drive caddies in a typical tower? I've ended up with 12 drives in the case (an RM400) and another 12 smaller ones gathering dust. I'd love to put them to work, which is what got me thinking about Ceph.
