Ultimate Availability HV Setup

Looking for advice on a configuration to maximize uptime for the virtual machines in my Windows environment. I am operating about 10 VMs for data acquisition from scientific equipment. We currently have two hypervisors to protect against catastrophic hardware failure. I am wondering if there is a better arrangement to "guarantee" 100% uptime even when the primary hypervisor needs to reboot for updates or goes offline.

Which hypervisor?

And how many physical machines are those VMs spread across?


Afaik, the high availability offerings from oVirt, Proxmox and maybe XCP-ng all involve bringing a VM back up on another hypervisor based on distributed storage or a SAN, so the VM still goes down; it just automatically comes back up.

From what I remember, VMware has a "fault tolerance" feature where the VM can survive a sudden node failure without a restart. How this works I have no idea, since it would presumably require syncing memory across the network. There's also a decent chance I'm remembering that wrong or didn't fully understand the feature at the time, but obviously it's possible, since cloud IaaS does it.

You can get live migration with most hypervisors (only about 1 ms of downtime).

It's free on Proxmox and some other open-source hypervisors.

For an HA cluster, most say to use 3 hosts.

For preventative stuff, moving VMs around usually doesn't have any (real) downtime.

But if a VM is on a host that fails, it could take a minute for the VM to start on another host.
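
For reference, a rough sketch of what that looks like on the Proxmox CLI (assuming a cluster with shared storage is already set up; the VM ID 100 and node name "node2" are placeholders):

```
# Live-migrate a running VM to another node (planned maintenance, rebalancing).
# Memory is copied over while the guest keeps running; the final cutover is brief.
qm migrate 100 node2 --online

# Enroll the VM in the HA manager so it is restarted automatically on a
# surviving node if its current host dies unexpectedly.
ha-manager add vm:100 --state started

# Sanity-check cluster quorum and HA state.
pvecm status
ha-manager status
```

Note that the HA path still involves a restart on the surviving node, which is the downtime described above.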

All running on Windows Server using the Microsoft hypervisor (Hyper-V).

We have all of our VMs (8 instances) running on a single virtualization server, with a secondary virtualization server lying in wait. Unfortunately, that failover is not acceptable for my instruments; a VM reboot or loss of connectivity is not ideal.

Yep, fault tolerance / vLockStep has been around forever. It’s got pretty strict requirements for guest, host, and network configuration but if you need to guarantee the guest stays up in the event of host failure then this is a good solution.

It’s not perfect.

The guest OS can still crash, or go down for patching, or whatever.

If you can run the data-gathering application in Kubernetes or a similar solution, that might be better. I know that might be difficult for this application, ingesting from an instrument.

Hi Koonsie.

If your hosts have the same CPUs and sufficient resources you might be able to pull off VMware Fault Tolerance, but that's the only solution I'm aware of that will directly satisfy your requirements.

You’ll need the paid version of vSphere / ESXi.

That's only if you initiate a migration (for resource balancing, scheduled maintenance, or a similar reason). If the host actually has a catastrophic hardware failure as described, the VM crashes and has to be booted from a powered-off state on the other node.


Same with oVirt.

I’ll try to approach this particular reliability problem from a different perspective:

How is this “data acquisition” being done?
Can this be re-engineered somehow to allow for parallel “data acquisition” from the instrument(s) to multiple physical hosts simultaneously?

How much data are we talking about?
What’s the format of the data?
What’s the protocol for getting it?

The reasoning here is that no system is 100% reliable, but if you shrink the fault-intolerant part of the system into something smaller and simpler, the system overall becomes more reliable.

In other words, you can divide and conquer, split the problem of reliability into smaller parts, and solve each problem separately.

For example:

If your instruments are outputting data onto a serial port, you could have a microcontroller with a small buffer that relays the data over the network, in parallel, to any host connected to it over HTTP or MQTT - acting like a "tee" of sorts. That way, tripping over a power cable on one of the big fat storage hosts midway through "the experiment" doesn't cause an outage. Normally the data would be recorded twice and deduplicated later; when someone trips over a cable, only one copy would survive.
In this example, the microcontroller could be a $10 WT32-ETH01, and you could reuse the esphome firmware stack, writing your own UART component to hold the ring buffer and your own async web handler component to stream data from the ring buffer over HTTP. Overall I estimate you'd be relying on < 1000 lines of custom C++ (< 100 maybe, but that's a stretch); the rest would be widely used open-source code, and cheap, easy-to-replace hardware.
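
To make that concrete, here's a very rough Arduino-style sketch of the idea (not the esphome component described above, and untested; the board setup, pin/baud choices, and the /buffer endpoint are made up for illustration). It just fills a ring buffer from the instrument's serial port and lets any number of hosts poll it over HTTP:

```cpp
// Rough sketch: relay instrument serial output to multiple hosts over HTTP.
// Assumes an ESP32 Arduino-core board with wired Ethernet (e.g. WT32-ETH01)
// and text-based instrument output; binary-safe streaming would need more care.
#include <Arduino.h>
#include <ETH.h>        // wired Ethernet support in the ESP32 Arduino core
#include <WebServer.h>

static const size_t BUF_SIZE = 8192;  // small ring buffer for slow instrument output
static uint8_t ring[BUF_SIZE];
static size_t head = 0;               // total bytes written so far

WebServer server(80);

// Serve whatever is currently in the buffer; each host polls this periodically
// and deduplicates overlapping reads on its own side.
void handleBuffer() {
  size_t n = (head < BUF_SIZE) ? head : BUF_SIZE;
  size_t start = head - n;
  String out;
  out.reserve(n);
  for (size_t i = 0; i < n; i++) {
    out += (char)ring[(start + i) % BUF_SIZE];
  }
  server.send(200, "text/plain", out);
}

void setup() {
  Serial2.begin(9600);  // UART wired to the instrument; baud rate is a guess
  ETH.begin();          // the WT32-ETH01 typically needs board-specific PHY parameters here
  server.on("/buffer", handleBuffer);
  server.begin();
}

void loop() {
  // Copy any pending instrument bytes into the ring buffer.
  while (Serial2.available() > 0) {
    ring[head % BUF_SIZE] = (uint8_t)Serial2.read();
    head++;
  }
  server.handleClient();
}
```

A real version would want sequence numbers or framing so the receiving hosts can deduplicate cleanly, which is roughly what the esphome UART + web handler components described above would provide.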

Your SPOFs (single points of failure - things that can fail and don't have a replacement) then become your instrument, the UART cable to your microcontroller, the microcontroller running your code, the Ethernet cable to your switch, and the switch itself; from there you could have a redundant "real network" with multiple VM hosts.

In general, the fewer moving parts in the setup the better; you can avoid a lot of complexity by replicating your important data early.


FYI, I said the same things, if you read the rest of what I wrote.

I have gas chromatograph/mass spectrometer instruments from a vendor, and their typical configuration is a single computer connected locally over Ethernet. The software is proprietary (not as simple as RS232), and the process of collecting the data involves a lot of relatively small read and write operations. Upon completion of the data acquisition, the files are saved to my file server, and the process repeats 20-ish times a day for 7 different instruments. The ultimate file size for each acquisition is only 2 MB. The VMs are set up with 6 cores at 2.3 GHz and 16 GB of RAM; the hardware allocation is adequate based on the specs listed by the manufacturer. This software cannot be mirrored as far as I know. It also only runs on Windows, so I can't do some of the other things mentioned.

Unfortunately a lot of scientific equipment is like this. I used to work with some spectrographs at a major soft drinks manufacturer and they were the same. They went for running it on an AS/400, which gave me my first exposure to Unix. I didn't know what hit me.

@Koonsie I think VMware Fault Tolerance is your only option.

I appreciate this may not be reasonable financially, but possibly you can get academic or nonprofit pricing?

In case I forgot to mention and you aren’t familiar, Fault Tolerance synchronises your running VM to a running but passive second copy of the VM on a second node. In the event the primary host fails, the passive copy comes online without any interruption in service.

It’s used in medical and scientific applications quite regularly.

The open-source hypervisors like Proxmox and XCP-ng don't have fault tolerance as far as I know, just HA.

The underlying QEMU software supports FT, but nothing is implemented in the hypervisor distributions yet. I'm sure it will come at some point, but there's less pressure with people using Kubernetes et al.

The commercial version of the Citrix hypervisor has an FT feature, I believe, but I'm not experienced with it. It may be cheaper if so.


@Koonsie what’s the device vendor/model name?

I agree that doing VM migrations is worth it, assuming your storage is networked (disks backed by iSCSI, for example).


Background for reengineering

My mom used to do XRD. When I was in my teens I ended up helping her migrate the ancient proprietary controller PC into a VM, and later I reverse-engineered the serial protocol used (the VM helped do MITM on the serial line - in your case I'd consider busting out Wireshark to look at some Ethernet pcaps).

My dad used to work in a civil engineering lab where they did soil core sample testing and concrete testing - a bunch of test rigs hooked up to transducers and other physical sensors that would log data at various resolutions and sample rates depending on the type of testing, allowing the experiments to run mostly unattended 24/7. Again, back in my teens, I wrote a simple Linux driver to help them with data acquisition and then did some database/workflow modeling.

A friend of mine has a hobby of reverse engineering glucometers, and most recently did the LIN-bus-based protocol used by LG HVACs.

I’m not saying re-engineering the control side of the instrumentation is always practical, but it’s certainly technically possible, and it’s definitely doable if there’s a pressing need. Folks who are doing science often aren’t really aware of what options are out there when it comes to software/computer engineering or what’s involved in getting some of these things running. In my experience science folks are often very aware of how the instrumentation works at a physics/hardware level, but software/control is a black box to them.


An Agilent mass spectrometer. And sorry I dropped off the thread for a while. I also have an instrument from Trace Elemental, a Thermo Scientific ion chromatograph, and a few others.
