Tips on how to spec out a multi-node VMware cluster for failover

Hey everyone,

My company is pushing forward with standing up a multi-node cluster in a colocation site and using VMware’s Site Replication product so that, if our main office cluster goes down, it can fail over more or less seamlessly to the off-site cluster without much downtime. Our current solution replicates our SAN to a second office 250 miles away, but if our main cluster goes down, restoring everything from the SAN is completely manual, so it’s not really viable.

I’ve been made the lead on this project, and for the first time since taking this job I feel like I’m really out of my depth. For context, two years ago I was working in MSP hell, fixing label printers and un-fucking Outlook PSTs for the billionth time. Now I’m one of two tier 3 server admins at my current company. I’ve been able to hit the ground running with most things, but trying to come up with specifications for a server cluster that’s probably going to cost over $80k is… phew.

Our current main VMware cluster is made up of HPE DL360 (Gen10, I think) hosts totaling 216 cores and 2.75 TB of memory across the hosts, along with a pair of PureStorage SAN arrays. The hosts have no internal storage. Right now, metrics show CPU usage across the hosts never goes above 30% even during peak times, but memory usage easily exceeds 60-70% during peak loads. The 95th-percentile IOPS average is right around 6k, excluding the spikes when the SANs and Veeam run their backup jobs.

For the colocation site we’re looking at getting 2/3 to 3/4 of the performance of our current cluster, and we want to use internal storage pooled together with vSAN rather than purchase another SAN array. Does anyone have tips for how to spec out what would be needed to cover those needs? Maybe sites that let you compare and contrast different configured systems and show potential max throughput and such?

Any help is appreciated.

Reach out to vendors, get them to pitch you some ideas, and use them as a feeler for pricing and ideas.

I recently got a quote for a minimum-spec storage server that costs that much 💸

If you are happy with your SAN situation, I would just replicate what you know works for you. Regarding performance, napkin math says ~150 CPU cores and ~1.8 TB of memory, which could be done with two mid-spec EPYC boxes; use three for redundancy and off you go!
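
Spelling that napkin math out, using only the figures from the original post (216 cores, 2.75 TB RAM, 60-70% peak memory use, 2/3-3/4 scale target); nothing here is measured from real hardware:

```python
# Rough sizing for the colo cluster, using the figures from the original post.
current_cores = 216
current_ram_tb = 2.75
peak_mem_util = 0.70  # memory peaks at 60-70%; take the high end

for factor in (2 / 3, 3 / 4):
    print(f"scale {factor:.2f}: {current_cores * factor:.0f} cores, "
          f"{current_ram_tb * factor:.2f} TB RAM")

# Sanity check: today's peak working set is ~0.70 * 2.75 ≈ 1.9 TB,
# so ~1.8 TB at the 2/3 scale leaves almost no memory headroom.
print(f"peak working set today: {current_ram_tb * peak_mem_util:.2f} TB")
```

That lands right on ~150 cores / ~1.8 TB, though the memory side is tight if the whole peak load ever has to run at the colo.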

Didn’t Wendell just do a video about this topic?

Where he asked if people were speccing out VDI + VMware clusters?

He asked what their experience was like speccing out VDI clusters from vendors, and if the vendors are recommending any graphics acceleration in the spec, whether that’s Intel Arc or NVIDIA, IIRC.


Yeah, CPU and RAM are easy to estimate for future-proofing. My main concern is disk throughput, since we’ll be moving from an external PureStorage SAN to internal disks on the hosts using vSAN. It’s hard to find any metrics that say “each host with 8 disks gives this many potential IOPS or this much throughput,” etc.
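
One way to get a rough number is to start from the drives’ spec-sheet IOPS and apply the write amplification of the vSAN storage policy (with FTT=1 and RAID-1 mirroring, every write is committed twice). A back-of-the-envelope sketch; the per-disk ratings and the 70/30 read/write mix below are placeholder assumptions, not real drive specs:

```python
# Rough per-host vSAN IOPS estimate. All figures are placeholders:
# substitute your drives' spec-sheet IOPS and your measured read/write mix.
disks_per_host = 8
disk_read_iops = 90_000    # hypothetical SSD 4K random read rating
disk_write_iops = 30_000   # hypothetical sustained 4K random write rating
read_ratio = 0.70          # assumed 70/30 read/write mix
write_penalty = 2          # FTT=1 / RAID-1 mirroring: each write lands twice

raw_read = disks_per_host * disk_read_iops
raw_write = disks_per_host * disk_write_iops

# Weighted-harmonic mix: time per average I/O, inverted back into IOPS.
effective = 1 / (read_ratio / raw_read + (1 - read_ratio) * write_penalty / raw_write)
print(f"~{effective:,.0f} effective mixed IOPS per host "
      f"(ignores cache, network, and CPU limits)")
```

vSAN also spreads replicas across hosts and absorbs writes in cache, so treat the result as a loose upper bound to sanity-check against your ~6k 95th-percentile baseline, not a promise.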

We aren’t doing that for the colo site because each PureStorage array is somewhere around $50k by itself. If we can keep the entire project under $100k including the hosts, that would be ideal, and PureStorage makes that hard. Plus, given the metrics off our current array, we don’t really push PureStorage hard enough to justify its performance benefits over a more traditional SAN, in my opinion. The management and support are fantastic, but our arrays are networked over 10Gbit and there’s not enough load to even saturate those links.
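
For what it’s worth, the arithmetic behind “not enough load to saturate those links”: ~6k IOPS only gets near 10Gbit at very large block sizes. A quick check, where the 32 KB average I/O size is a guess and the real number should come from the array’s stats:

```python
# Does ~6k IOPS come anywhere near a 10 Gbit link?
iops = 6_000
avg_io_kb = 32                 # assumed average I/O size; check the array's stats
throughput_mb_s = iops * avg_io_kb / 1024
link_mb_s = 10_000 / 8         # ~1250 MB/s on paper, ignoring protocol overhead

print(f"~{throughput_mb_s:.0f} MB/s of ~{link_mb_s:.0f} MB/s available "
      f"({throughput_mb_s / link_mb_s:.0%} of the 10Gbit link)")
```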
