Proxmox + Ceph 3-node cluster and network redundancy help

So I just joined a startup as their sysadmin; my role is to build the servers from scratch, both hardware and software. My experience is setting up a single-node Proxmox homelab with things like OMV, Emby, an nginx reverse proxy, Guacamole, … so I’m a noob, but I’m learning.

The use case is hosting things like git and a monitoring/database web server (think something like a farm management system, with a bunch of sensors everywhere sending data to the server’s database for logging). The VMs hosting these services will run Windows Server or just plain Windows 10/11 Pro/Home.

Here’s the setup I have in mind:

3x Proxmox/Ceph nodes in HA (identical specs):

Each Node will have:

The boot drives will be 2x NVMe in a ZFS mirror

Ceph config:

Pool of NVMes for VMs => Proxmox will use this for the VMs (mostly Windows VMs)

Pool of HDDs for storage (with either NVMe or SATA SSDs as WAL/DB and cache) => CephFS with Samba to allow file access

=> I’m not too familiar with Ceph - I haven’t used it yet. But I’m thinking 3x of each drive for redundancy, alongside the redundancy from the 3 replicas across the 3 nodes. (A rough sketch of the pool split is below.)
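For reference, here’s a minimal sketch of how the two pools could be split by device class from the CLI on a recent Proxmox release. The rule and pool names (replicated-nvme, vm-nvme, etc.) are just placeholders I made up, not anything Proxmox creates for you:

```
# CRUSH rules that pin a pool to one device class (nvme / hdd);
# Ceph auto-detects the class of each OSD when it is created.
ceph osd crush rule create-replicated replicated-nvme default host nvme
ceph osd crush rule create-replicated replicated-hdd  default host hdd

# VM pool on the NVMe rule, bulk/CephFS data pool on the HDD rule.
pveceph pool create vm-nvme --crush_rule replicated-nvme
pveceph pool create fs-hdd  --crush_rule replicated-hdd
```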

3x 24-port 10GbE switches.

The expected network will be 10GbE, with bonding if I need extra speed.

I’m thinking of doing some type of bonding/aggregation so that each node connects to all 3 switches.

I am a total noob when it comes to more complex networking; my only experience is connecting a bunch of switches together and making sure they work.

They will be connected like this:

The goal is to have no SPOF (single point of failure): anything can fail and the system will continue working, transparently to the end user, while I replace the failed component and the system heals/rebalances itself. I’m thinking of adding a second router/ISP, but the most important thing is the internal network.

From my understanding, in my current setup, if the router fails, everything in the system will fail because the DHCP server will be down. If I replace the 3 switches with aggregation-capable (Layer 3) switches, would that take care of it? I.e., if the router is down, only internet access would be lost, and everything inside the system would continue functioning normally?

Is there anything wrong with this setup, or any possible future issues with expanding or upgrading it?

And what would be good switches for this task?

Hardware recommendations for the nodes would also be nice. VM use should be about 2TB, and storage about 20TB.

Thanks so much, everyone!

Do you actually have/need three switches?

For redundancy you only need two; this also means your servers will only need two network interfaces.

For your servers you should be using SFP28 network interfaces; this is the sweet spot IMHO for price/performance, and it’s backwards compatible with SFP+ if that’s what your switches have.

3 switches look like overkill - 2 switches will make it simpler for you to utilize dual ports on the machines - and you can use dumb switches for simplicity, or some simple L2/L2+ switches (to prevent loops).
You don’t need VLANs or L3 routing on the switches in your network. A simple flat topology will work fine for your use case (e.g. you could get away with L2 for the physical devices + VXLANs for the VMs - see the sketch below).
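If you do go the VXLAN route for VM traffic, here’s a minimal sketch with plain iproute2 (Proxmox’s SDN feature can manage this for you; the IPs, VNI, and the bond0 interface name are assumptions):

```
# Unicast VXLAN tunnel between two hypervisors, no multicast needed.
ip link add vxlan100 type vxlan id 100 dstport 4789 \
    local 192.168.10.11 dev bond0
# Flood unknown/broadcast frames to the peer hypervisor.
bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 192.168.10.12
ip link set vxlan100 up
# Then attach vxlan100 to the bridge your VMs plug into.
```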

Ceph on 3 nodes is perhaps overkill - try making 3 VMs with e.g. 8 virtual disks each and try setting up a small Ceph cluster with 24 OSDs - I wish you luck, and I hope this works out for you.
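To make that experiment concrete - inside each test VM with 8 extra virtual disks, OSD creation is one command per disk. A sketch, assuming the hypervisor exposes the disks as /dev/vdb through /dev/vdi (adjust to whatever your VMs actually see):

```
# On each of the 3 test VMs, after `pveceph install` and creating a monitor:
for disk in /dev/vd{b..i}; do
    pveceph osd create "$disk"
done
```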

DHCP is not needed on your network, but you can run it in a pair of VMs and use some kind of keepalived/VRRP for convenience - same as DNS.

Here’s a barebones setup for making DHCP/DNS/routing redundant in a small network with keepalived and OpenWrt; the same principle applies regardless of your router vendor, e.g. pfSense can run in a failover-like configuration.
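For the keepalived part, a barebones sketch of the MASTER side - the interface name, VRID, and VIP below are placeholders; the BACKUP box gets the same block with `state BACKUP` and a lower priority:

```
# /etc/keepalived/keepalived.conf (primary DHCP/DNS VM)
vrrp_instance LAN_GW {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 150
    advert_int 1
    virtual_ipaddress {
        192.168.10.1/24   # clients point DHCP/DNS/gateway at this VIP
    }
}
```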

Do you have multiple ISPs/uplinks?

Three nodes is basically the bare minimum you should use for production Ceph, but it should be done with an eye to scaling up in the future (the beauty of Ceph being how easy it is to add nodes/drives as your needs grow).

Ceph performs better the more nodes are involved - in a three-node cluster every write goes to every node, which can end up a bit slow for write-intensive stuff. If this is important to you, linstor/DRBD should be considered, or just a classic ZFS-with-replication setup.
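That “every write goes to every node” behavior follows directly from the usual replication settings - a sketch, assuming a pool named vm-nvme as in the earlier example:

```
# size 3    = every object is written to 3 OSDs (one per node on 3 nodes);
# min_size 2 = I/O keeps flowing as long as 2 of the 3 copies are reachable.
ceph osd pool set vm-nvme size 3
ceph osd pool set vm-nvme min_size 2
```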

Thanks - after doing more research I think I’ll stick with 2 switches using MLAG for redundancy. Multiple ISPs/routers is one of the things I’m planning for. Should it just be as simple as setting up VRRP for the 2 routers, with each router hooked up to a different ISP?

The SFP28 idea is a great suggestion; I’ll look into that. I picked Ceph with scaling in mind, and the redundancy concept seems good. The only thing I’m worried about is performance, since I’m planning to use it to store my database, which will constantly be written to and read from.

I’ll check out the other storage suggestions.

I’m thinking of going with 2 switches in MLAG and 2 routers in VRRP so I can achieve redundancy.
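For what it’s worth, the host side of an MLAG pair is just a normal LACP bond - a sketch of /etc/network/interfaces on one Proxmox node (the NIC names and addresses are assumptions for illustration):

```
auto bond0
iface bond0 inet manual
    bond-slaves enp1s0f0 enp1s0f1   # one port to each MLAG switch
    bond-mode 802.3ad               # LACP; the switch pair must present one LAG
    bond-miimon 100
    bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet static
    address 192.168.10.11/24
    gateway 192.168.10.1            # the VRRP VIP of the router pair
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
```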

“Ceph on 3 nodes is perhaps overkill - try making 3 VMs with e.g. 8 virtual disks each and try setting up a small Ceph cluster with 24 OSDs - I wish you luck, and I hope this works out for you.”

I’m not sure how this is different from using 3 nodes - can you help explain the benefit?

There’s nothing in the world of networking that will allow hardware to die in a way where a bit of traffic isn’t taken away with it for a few seconds, until forwarding (L2 or L3 or whatever) converges. It’s doable for planned shutdowns, but not for unplanned failures.

No, you’d connect each router to both ISPs. Each of your two ISPs would see a single IP - the same IP regardless of which of your routers is active (VRRP).

I’m assuming you don’t get your own AS number. Without an ASN you don’t own your IPs, so when one of your ISPs fails, the traffic that was going over that ISP breaks, and that’s OK. It’s not a thing that happens often, and it doesn’t require waking anyone up. You can set up automation to fail over DNS entries within ~5 minutes or so; that should take care of unassuming traffic. Your own stuff on the internet can be configured to handle this case better, usually by trying multiple endpoints.
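That ~5-minute DNS failover can be as dumb as a cron job. A sketch only - the provider API endpoint, record name, and token below are entirely hypothetical, so substitute whatever API your DNS host actually exposes:

```
#!/bin/bash
# Hypothetical failover: if the primary uplink's gateway stops answering,
# repoint the public A record at the backup ISP's address.
PRIMARY_GW="203.0.113.1"     # placeholder: ISP 1 gateway
BACKUP_IP="198.51.100.10"    # placeholder: our address on ISP 2

if ! ping -c 3 -W 2 "$PRIMARY_GW" > /dev/null 2>&1; then
    # api.example-dns.invalid is a made-up endpoint -- use your provider's API.
    curl -s -X PUT "https://api.example-dns.invalid/v1/records/app.example.com" \
        -H "Authorization: Bearer ${DNS_API_TOKEN}" \
        -d "{\"type\": \"A\", \"content\": \"${BACKUP_IP}\", \"ttl\": 300}"
fi
```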


When moving Proxmox VMs, how are you moving the networking?

I’m thinking each of your routers would need to know which of the 6 Proxmox IPs to use as a gateway for your VMs. How is Proxmox propagating/advertising these IPs to your routers?