If performance doesn’t matter (even a slow Ceph cluster can handle a decent amount of workload) but you want the redundancy (and the fanciness) of a hyperconverged cluster, go with Ceph. If you want performance, go with option C.
Option B will be very manual, and replicating and balancing the data will be a struggle, even with ZFS send. With 2 nodes it wouldn’t be a big deal, but with 3 it’s a bit much. Say you have 6 guest servers:
- 1 and 2 run on A: 1 gets replicated to B and 2 to C
- 3 and 4 run on B: 3 gets replicated to A and 4 to C
- 5 and 6 run on C: 5 gets replicated to A and 6 to B
This is the best configuration when it comes to both resource and storage load balancing, because if one node fails, the load gets distributed equally across the other two.
Even if we’re being generous and assume a single node has enough resources to handle the load of two (which is a complete waste of idle resources), and you replicate all the services from A to B, B to C and C to A, configuring all of that replication is still a manual procedure, roughly like the sketch below repeated for every single guest.
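To make that concrete, here’s a rough sketch of one replication step with plain zfs send/recv. The pool, dataset and host names (tank, vm1, nodeB) are made up, it assumes the initial full send already happened, and it skips snapshot pruning:

```bash
#!/bin/bash
# Incremental replication of one guest's dataset from node A to node B.
# Assumes the first full "zfs send | ssh nodeB zfs recv" was already done.
PREV=$(zfs list -H -t snapshot -o name -s creation tank/vm1 | tail -n 1)  # last snapshot we sent
NEW="tank/vm1@repl-$(date +%Y%m%d%H%M%S)"

zfs snapshot "$NEW"                                            # new point-in-time snapshot on node A
zfs send -i "$PREV" "$NEW" | ssh nodeB zfs recv -F tank/vm1    # ship only the delta to node B
```

Multiply that by every guest, every replication pair and every schedule, plus making sure it actually ran, and it adds up fast.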
Option C.
Get 3 hypervisor hosts and 2 NASes. Replicate the ZFS pool from one NAS to the other. Trust me™, I’ve run more than 2 dozen VMs off a single md RAID 10 of spinning rust on HP ProLiant MicroServer Gen8s and they were still working well (heck, I had one of these with Samsung 860 Pros in RAID 10 and the CPU and the 8 GB of RAM weren’t even being used and the SSDs weren’t even sweating - on a bond mode 6 over 2x gigabit NICs).
Have all the services run from a single NAS and, if it fails, have the other one take over. If you’re feeling fancy, you can also load balance them by running half the services from a share on NAS 1 and the other half from a share on NAS 2, with both shares replicated to each other, roughly like the layout sketched below.
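As a rough picture of that split-share layout (pool name, subnet and export options are just placeholders), NAS 1 would look something like this, with NAS 2 being the mirror image:

```bash
# On NAS 1: share1 is active and exported to the hypervisors; share2 only exists
# here as a replica received from NAS 2 (created by the initial full zfs send/recv).
zfs create tank/share1
echo '/tank/share1 10.0.0.0/24(rw,sync,no_subtree_check)' >> /etc/exports
exportfs -ra   # NAS 2 does the opposite: share2 active and exported, share1 received as a replica
```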
While I haven’t tried this, you could probably just serve the NFS shares on a VIP and, using something like keepalived, do a hot failover to the other NAS (a minimal sketch follows). But someone with more experience needs to weigh in (or you should test it in VMs as a PoC before building all this). Otherwise, you might need corosync and pacemaker, with the 3 Proxmox hosts acting just as voters in the quorum between the 2 NASes (they will all obviously vote the same way, because it’s just an OR statement on which NAS is down - and then you can have a hard rule to prefer NAS 1 for share 1 and NAS 2 for share 2).
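Purely as an untested sketch of the keepalived idea (the interface, VIP and router ID are made up): NAS 1 gets something like the config below, and NAS 2 gets the same file with `state BACKUP` and a lower priority. The hypervisors mount their NFS storage from the VIP, so whichever NAS currently holds it serves the shares.

```bash
# On NAS 1 (the preferred holder of the VIP).
cat > /etc/keepalived/keepalived.conf <<'EOF'
vrrp_instance NFS_VIP {
    state MASTER
    interface eth0            # NIC facing the hypervisors
    virtual_router_id 51
    priority 100              # NAS 2 uses e.g. 50, so it only takes over when NAS 1 is gone
    advert_int 1
    virtual_ipaddress {
        10.0.0.50/24          # hypothetical VIP the hypervisors mount their NFS storage from
    }
}
EOF
systemctl enable --now keepalived
```

Keep in mind keepalived only moves the IP around; whether the NFS clients and the guests ride through that switchover cleanly is exactly the open question below.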
Now the replication question comes to mind: how will the hypervisor handle the share migrating to the other NAS while a VM or container is running? How about the services themselves? I have no experience with this, but I can guarantee that with Ceph it wouldn’t be an issue.
If you pull the rug out from under NAS 1, I think the delay before NAS 2 becomes active for share 1 will be long enough that either the service crashes or errors out on a failed read or write (or worse, kernel panics), or the hypervisor is smart enough to notice the failed I/O and the missing share and pause/suspend the VM. Hypervisors are typically smart enough to resume the VM after such a delay, but I was never faced with that exact scenario. Live migration from one host to another, or a simple disk migration from one NAS to another, was never a problem in Proxmox - it was smart enough to pause and resume the VM just in time to keep everything synced up - but I’m not sure how it handles unexpected rug pulls.
Option C is easier to handle than B when it comes to replication and load balancing, but you need more hardware for it. Thankfully, the hardware doesn’t need to be very fast; a striped ZFS mirror will behave well on pretty low-end hardware. With SSDs you might need something beefier to keep up, though.
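For reference, by a striped ZFS mirror I mean the RAID 10 style layout (device names are placeholders):

```bash
# Two mirrored pairs striped together - ZFS's rough equivalent of RAID 10.
zpool create tank \
  mirror /dev/sda /dev/sdb \
  mirror /dev/sdc /dev/sdd
```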
Conclusion time. A Ceph cluster will basically ensure that the data is always accessible, at all times, from any host. With option B or C and replicated storage, it’s not entirely certain what happens when the active storage goes down (at least to my knowledge - I’d love for someone to educate me). But in my experience, that happens way less frequently than a hypervisor going down, which in turn happens less frequently than a service misbehaving at random.