Background
The goal is to set up a high-performance but low-power, high-availability cluster, with software such as Proxmox or XCP-ng to host and manage cluster operations.
What I came up with for the video was to use 3 mini PCs connected to each other via “high speed” thunderbolt, while using the 1 or 2 built-in 2.5 gigabit NICs for network-facing operations.
Each Mini PC idles at less than 9 watts; at full load the cluster is using less than 300 watts of electricity.
A downside of this is that often these mini PCs only have one or two network connections, and they usually top out at 2.5 gigabit.
In another video we covered M.2 cheat codes to give them faster network cards… but for a lot of home use cases… 2.5 gigabit is fine.
And more often than not people want something that doesn’t use a lot of electricity more than they want raw speed.
A NAS could be used for optional bulk storage… connected with iSCSI or a simple file share.
Let’s take a closer look at the setup.
Setup
The physical setup is pretty easy – I used some different Intel NUCs I have been collecting around the office to test for this kind of use case.
Install Proxmox and plan your IP addresses on each node. Connect the thunderbolt ports together. If you plan to use 3 nodes, make sure that each node actually has two thunderbolt ports and connect each node to each other node.
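For three nodes the cabling ends up as a ring, with each node having one cable to each of the other two. Which physical port the kernel names thunderbolt0 versus thunderbolt1 can vary per machine, but a pattern consistent with the IP table below is:
Nuc1 thunderbolt0 <──> thunderbolt1 Nuc2
Nuc2 thunderbolt0 <──> thunderbolt1 Nuc3
Nuc3 thunderbolt0 <──> thunderbolt1 Nuc1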
Security/Thunderbolt
The hardest part is to modprobe thunderbolt-net
and then configure thunderbolt for, essentially, no security (just to make our lives easier).
vi /etc/udev/rules.d/99-local.rules
and add this line:
ACTION=="add", SUBSYSTEM=="thunderbolt", ATTR{authorized}=="0", ATTR{authorized}="1"
That should be enough for thunderbolt0 to show up as a network interface. If you don’t see a thunderbolt0 interface, double-check your thunderbolt cables. Often USB-C to USB-C cables will “work” but not really for this use case – you want a cable that is actually rated for Thunderbolt.
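A quick check, assuming the kernel named the interfaces thunderbolt0 and thunderbolt1 as described above:
ip link show thunderbolt0
ip link show thunderbolt1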
For a three node cluster, we want to ensure that all of our nodes can see each other node.
Design Pattern for High Speed Interconnect
Two ports per node, direct-connected, three nodes – this pattern is very useful outside of Thunderbolt, too. Remember those cheap Intel Omni-Path cards I bought?
Thunderbolt is almost forgotten for networking purposes. It is very unoptimized, and the out-of-the-box ping latency on the Linux kernel leaves a lot to be desired. I don’t think anyone who knows what they are doing is testing this anymore, and thunderbolt-net does not (yet) seem to even support the AMD chipset solutions (though it can with some work – more on that in the future).
Okay… mini PCs and “anemic” 40 gigabit are too slow for you? Then what about speedy 100 gigabit? Intel Omni-Path cards are around $30 – put two of those in some regular desktop machines and apply the same pattern. They work fine with cheap QSFP fiber adapters for direct connections. Cool them well and the sky is the limit for an inexpensive way to scale your cluster without a central switch.
IP Addressing
Since Thunderbolt is a direct connect beast, that means each port essentially gets its own IP address.
Node | thunderbolt0 IP | thunderbolt1 IP
Nuc1 | 192.168.100.1   | 192.168.102.1
Nuc2 | 192.168.101.1   | 192.168.100.2
Nuc3 | 192.168.102.2   | 192.168.101.2
This assumes the subnet mask is 255.255.255.0 (a /24). If you’d rather do it with a /30 subnet for each port to show the class how talented you are at subnetting, or just to virtue signal, or to confuse newbs in the replies – yes, any IPs will work as long as you get the subnetting and netmasks right.
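As a sketch of what that looks like in practice, here is the relevant piece of /etc/network/interfaces on Nuc1 with the /24 addressing from the table (depending on when the link comes up, you may want allow-hotplug instead of auto):
auto thunderbolt0
iface thunderbolt0 inet static
    address 192.168.100.1/24

auto thunderbolt1
iface thunderbolt1 inet static
    address 192.168.102.1/24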
Proxmox also needs another trick to work properly, and that is getting hostname resolution for each node right.
Each NUC will need hosts entries that point the other 2 nodes’ names at the thunderbolt IPs it shares with them. And this is different on each node – not something pveproxy really expects to be a thing – but it does seem to work anyway.
What I mean is that if you are on the terminal for Nuc2 and ping nuc1 and nuc3, you should get 192.168.100.1 and 192.168.101.2 back as the IPs (and they should respond to pings).
This requires that you manually edit /etc/hosts
on each of the 3 nodes in the cluster and add the other two nodes’ IP Addresses.
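For example, on Nuc2 the two extra /etc/hosts lines would be (hostnames here match the node names used above; adjust to whatever yours are actually called):
192.168.100.1 nuc1
192.168.101.2 nuc3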
It is also very important that the thunderbolt cables are attached as shown in the video and that all of the ping commands work.
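A quick sanity check from Nuc2 would look like this; repeat the equivalent from the other two nodes:
ping -c 3 nuc1   # should answer from 192.168.100.1
ping -c 3 nuc3   # should answer from 192.168.101.2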
Setting up The Proxmox Cluster
With the network interfaces configured, it is essentially a textbook Proxmox cluster configuration from here:
https://pve.proxmox.com/wiki/Cluster_Manager
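If you prefer the CLI to the web UI, the textbook version boils down to something like this (the cluster name is just an example; the node names resolve to the thunderbolt IPs thanks to the /etc/hosts entries above):
# on Nuc1
pvecm create tb-cluster
# then on Nuc2 and Nuc3
pvecm add nuc1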
Can the cluster have more than 3 nodes?
The reason it is easy to run 2 or 3 nodes is that each node can connect directly to every other node using one or two ports. If you wanted to run 4 nodes, you’d need 3 fast connections on each node.
Why am I only getting 800-1200 megabytes per second?
Internally, our little NUCs share a 40 gigabit link for both 40 gigabit thunderbolt ports – at most we’d only ever see about 4 gigabytes/sec between nodes. Software limitations mean we typically see 10-15 gigabit per port, give or take, anyway.
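If you want to measure your own links, iperf3 between two nodes over the thunderbolt IPs is the simplest test (assuming iperf3 is installed on both, e.g. apt install iperf3):
# on Nuc1
iperf3 -s
# on Nuc2, pointed at Nuc1's thunderbolt IP
iperf3 -c 192.168.100.1 -P 4 -t 30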
The other problem is that a fair amount of bitrot is creeping into the thunderbolt-net driver, it seems. I’ve been working on some fixes around AMD PCIe tunneling, but nothing to share here just yet. If you’re itching for something to work on and you understand kernel drivers (or want to learn), this would be a pretty fantastic project for you, I think.
Would a thunderbolt NAS work for the cluster?
Not really, because there is no easy way for the 3 nodes of the cluster to share a single thunderbolt-attached storage device.