Proxmox with Ceph - Failover Issue

Hi there,
I have a question regarding a failover issue on a Ceph-enabled Proxmox cluster.
I have built a test environment with three identical nodes, added all of them to the Proxmox cluster, and enabled Ceph. The Proxmox cluster has its own subnet. I put a few VMs on the Ceph-based storage and everything worked fine. I am able to migrate the VMs across all three nodes without any problems. However, if I pull the power on one of the nodes to simulate a power outage, the VMs (running or not) do not get failed over to the other nodes in a timely manner. It takes over three minutes before they finally fail over, which is far from ideal. I do not know if this is by design (which would render it unusable for me), or if there is a setting I missed somewhere.
I’d appreciate any input you can provide me.
Thank you

Just want to confirm you have HA set up in addition to the cluster.

HA is its own configuration, separate from the cluster itself, under the datacenter settings:

Screenshot 2023-09-27 at 7.27.17 PM

And VMs need to be explicitly configured for HA:

Screenshot 2023-09-27 at 7.29.22 PM

Another thing is fencing. The nodes that still have quorum have to be sure the failed node is entirely down and no longer running the VM in question, because running the same VM twice is very dangerous for data integrity.

Having a fencing device in any form ensures the remote shutdown/powercycle of the node in question and failover can (safely) start.
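To make the quorum part concrete, here is a minimal sketch of the majority rule. It assumes one vote per node and a plain strict majority; the function names are made up for illustration and are not part of any Proxmox or Corosync API.

```python
def votes_needed(total_votes: int) -> int:
    """Strict majority: more than half of all configured votes."""
    return total_votes // 2 + 1


def is_quorate(total_votes: int, votes_online: int) -> bool:
    """A partition may act only if it holds a strict majority of votes."""
    return votes_online >= votes_needed(total_votes)


# Three-node cluster, one vote per node (assumption for this sketch).
TOTAL = 3
for online in (3, 2, 1):
    print(f"{online} of {TOTAL} nodes online -> quorate: {is_quorate(TOTAL, online)}")
```

With three nodes, the two survivors keep quorum after one node loses power, while the isolated node does not. Quorum alone only says who is allowed to decide; it does not prove the missing node has actually stopped, which is the job of fencing.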

Check the Proxmox wiki on fencing:

https://pve.proxmox.com/wiki/Fencing

I hope this helps you understand and fix/optimize your failover issues.

Yes, the HA settings are correct and HA itself works OK. The issue is that it takes a minimum of three minutes to fail over.

Thank you. I have never configured fencing before; I will look into it and report back.

Does 3 minutes include the reboot time? It’s been a while since I tested it, but I believe a hard cut will necessitate a full reboot of the VMs since the RAM doesn’t sync over the network…
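One way to answer that is to split the observed downtime into phases and time each one separately. The sketch below is only a bookkeeping aid: the class name, the phase names, and all the numbers are hypothetical placeholders, not measured or documented Proxmox values.

```python
from dataclasses import dataclass


@dataclass
class FailoverBudget:
    """Phases of an unplanned failover, in seconds (all values hypothetical)."""
    detect_s: int      # until the cluster notices the node is gone
    fence_wait_s: int  # until the node is considered safely fenced
    vm_boot_s: int     # cold boot of the VM on a surviving node

    def until_vm_started(self) -> int:
        """Time before the VM is even started elsewhere."""
        return self.detect_s + self.fence_wait_s

    def until_service_back(self) -> int:
        """Time until the guest has booted and the service answers again."""
        return self.until_vm_started() + self.vm_boot_s


# Placeholder numbers; replace them with your own stopwatch measurements.
budget = FailoverBudget(detect_s=60, fence_wait_s=60, vm_boot_s=45)
print("VM started on new node after:", budget.until_vm_started(), "s")
print("Service reachable again after:", budget.until_service_back(), "s")
```

If most of the three minutes sits in the first two phases, faster guest boot will not help much; if it sits in the last phase, the cluster side is already doing its part.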

Digging more into the Proxmox support wiki…

https://pve.proxmox.com/wiki/High_Availability

I’m getting into this at the moment because I want to build an HA cluster in the short to medium term. So far I’ve only got some improvised homelab hardware and VMs to work with, to get more familiar with everything.

But I do know that live migration is way faster than failover. A simple asynchronous copy from machine to machine is a much simpler process than declaring a node dead.

Maybe you got a latency spike, or some quirky cable that recovers the network link after 10 seconds. But if the quorum decided to boot a failover VM in the meantime and the original machine then comes back with its VM still running, both are writing data to the same shared storage…
That’s a consistency nightmare and probably requires restoring from backup at the very least.
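A toy model of why the wait exists: treat the right to run a VM as a lease that the owning node must keep renewing, and only allow a takeover once that lease has expired. This is a generic lease sketch with made-up names (`VmLease`, `node-a`, `node-b`) and a made-up 60-second TTL, not Proxmox’s actual implementation.

```python
class VmLease:
    """Toy lease: the holder must renew before expiry to keep running the VM."""

    def __init__(self, holder: str, now: float, ttl: float) -> None:
        self.holder = holder
        self.ttl = ttl
        self.expires_at = now + ttl

    def renew(self, node: str, now: float) -> bool:
        """Only the current holder may renew, and only while the lease is valid."""
        if node == self.holder and now < self.expires_at:
            self.expires_at = now + self.ttl
            return True
        return False  # the caller must stop the VM (self-fence)

    def try_take_over(self, node: str, now: float) -> bool:
        """Another node may start the VM only after the old lease has expired."""
        if now >= self.expires_at:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False


# node-a owns the VM at t=0 with a hypothetical 60 s lease, then loses its link.
lease = VmLease(holder="node-a", now=0, ttl=60)
early = lease.try_take_over("node-b", now=15)   # too early: node-a may still be alive
late = lease.try_take_over("node-b", now=60)    # lease expired: safe to start the VM
stale = lease.renew("node-a", now=70)           # node-a returns and finds it lost the lease
print(early, late, stale)
```

The short glitch at 15 seconds does not trigger a second copy of the VM, and the returning node learns it must not keep writing. The price of that safety is the waiting period, which is why failover is slower than a live migration.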

Yeah, if the requirement is to have services with under one minute of downtime on host failure, either the service itself needs to be clustered (e.g. CTDB for Samba) or you need Kubernetes. All of the above could run in Proxmox VMs though.