A few weeks back I ran into a problem that got the cogs in my head turning.
After I updated my SG-5100 to 2.4.5 I had an utter failure. The system crashed during the update and it wouldn’t complete. I tried a recovery afterwards with a Netgate-provided ISO and was able to get back into production, but it had various issues that I later discovered were caused by a problem with the onboard eMMC storage.
I’ve since reinstalled the OS on an m.2 drive and haven’t had any issues. Perhaps too much logging wore out the NAND (the firewall was purchased in November)? That being said, shit happens, especially to me. I thought spending a mint on Netgate hardware would mitigate enough of that risk to keep me in production without much of an issue, but clearly I was wrong.
I know that, generally speaking, when you build HA clusters, especially in pfSense, best practice is to use identical pairs, but I have enough resources on my VMware host to potentially run an HA cluster that is half physical (the SG-5100) and half virtual (a VM). Is this something my friends here have ever messed around with?
Let me know what land mines lie ahead, if you can.
I work with my brother’s MSP IT business (37West); he runs pfSense as the core router at the data center where he colocates. When he upgraded his setup recently, this was something we debated heavily: virtualize it or go bare metal. We opted for bare metal for peace of mind and simplicity of architecture. Our thinking was that although virtualization would give us all the magic and wonders of virtualization, it would also add abstraction: additional layers, additional complexity, and potentially more points of failure.

One HUGE plus we did find: if you pay for pro support for your pfSense, the SN/support key tied to the pfSense installation carries with the VM, whereas with a bare metal setup it is tied to the hardware. So if your brand new server catches fire moments after you tie the hardware to the SN/support key, you are hosed and have to buy a new one. We confirmed this with the support team.

Personally and professionally, I recommend going bare metal with some good quality servers and lots of old-fashioned hardware redundancy. We have 2 identical boxes tied together with CARP in a failover setup (primary and active hot spare), configured to automatically switch all traffic should one box go down. He also (I think recklessly, haha) uses that mechanism to perform live upgrades/patches in prod. He’s had a patch fail and just had to press a key to flip over to the other box; after that it was just a matter of re-applying the good config to the broken box.

Unfortunately I am not well versed on all the details of pfSense in this configuration. Perhaps others could chime in and add detail. Hope this helps!
The problem is the SG-5100 is in my house… and it is $700…
I haven’t done this specifically, but depending on the hypervisor you are using, the ARP cache can sometimes be quite sticky.
In the case of a failure/fail-back you may need to manually clear the VMware host’s ARP table. I think I’ve even had to reboot a VMware ESXi host before to get around a similar stuck IP->MAC mapping. I can’t remember exactly HOW my situation occurred, but I suspect it had something to do with a MAC address or IP address moving around outside of vMotion’s knowledge (i.e., outside of the virtual environment, possibly like you will run into when mixing physical/virtual). Basically, I had traffic not getting routed properly due to incorrect ARP information in the virtual environment.
Not saying you WILL run into the issue, but depending on your version/configuration of VMware, the virtual switch’s/host network stack’s idea of reality and the actual reality may not quite match. I’d give it a shot, but if you DO run into an issue, keep ARP cache issues in mind; a reboot may just fix it.
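To confirm a stale entry before rebooting anything, one option is to compare what a host has cached against the mapping you expect after failover. Here is a minimal Python sketch, assuming you have already collected both IP-to-MAC mappings yourself. The function name and the sample addresses are hypothetical, and it does not query ESXi or pfSense in any way.

```python
def find_stale_arp(cached, expected):
    """Compare two IP -> MAC mappings and return the entries that disagree.

    cached:   what the host currently believes (e.g. copied from its ARP table)
    expected: what the mapping should be after failover
    Returns {ip: (cached_mac, expected_mac)} for every mismatch.
    """
    stale = {}
    for ip, want in expected.items():
        have = cached.get(ip)
        # Only flag entries the host actually has cached; compare case-insensitively.
        if have is not None and have.lower() != want.lower():
            stale[ip] = (have, want)
    return stale


# Made-up sample data for illustration only.
cached = {"192.0.2.1": "AA:BB:CC:00:00:01", "192.0.2.2": "aa:bb:cc:00:00:02"}
expected = {"192.0.2.1": "aa:bb:cc:00:00:99", "192.0.2.2": "AA:BB:CC:00:00:02"}
print(find_stale_arp(cached, expected))
```

Any IP that shows up in the result is a candidate for a manual ARP flush on that host.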
Yup… exhibit A in my post above
For clustering you need three or more static IPs in each subnet where you want to use CARP. If you can’t get that from your ISP you’ll need another device in front to do NAT.
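As a rough illustration of the three-address requirement, here is a minimal Python sketch that checks whether a planned set of addresses fits in one subnet. The function name and the example addresses (taken from the 203.0.113.0/24 documentation range) are made up for illustration; it only does address arithmetic and does not talk to pfSense.

```python
import ipaddress


def check_carp_plan(subnet, primary, secondary, vips):
    """Return a list of problems with a planned CARP address layout (empty if none)."""
    net = ipaddress.ip_network(subnet)
    problems = []
    if not vips:
        problems.append("need at least one CARP VIP")

    # Label every planned address so problems are easy to read.
    planned = [("primary", primary), ("secondary", secondary)]
    planned += [(f"vip{i}", vip) for i, vip in enumerate(vips)]

    seen = {}
    for name, text in planned:
        ip = ipaddress.ip_address(text)
        if ip not in net:
            problems.append(f"{name} {ip} is outside {net}")
        elif ip in (net.network_address, net.broadcast_address):
            problems.append(f"{name} {ip} is the network or broadcast address of {net}")
        if ip in seen:
            problems.append(f"{name} {ip} duplicates {seen[ip]}")
        seen[ip] = name
    return problems


# A /29 has room for primary, secondary and one VIP.
print(check_carp_plan("203.0.113.0/29", "203.0.113.2", "203.0.113.3", ["203.0.113.4"]))
# A /30 only has two usable host addresses, so the third one collides with broadcast.
print(check_carp_plan("203.0.113.8/30", "203.0.113.9", "203.0.113.10", ["203.0.113.11"]))
```

The second call shows why a typical single-address or /30 ISP handoff is not enough on its own and why a NAT device in front may be needed.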
Oh man!! Ya, that’ll sure throw a good wrench in things. Given this, and after re-reading your original post, I would advise you to invoke the manufacturer warranty on the Netgate appliance (it should be at least 1 year and cover the storage issue), and then either flip it or keep it. Then, for roughly the same price as one of those new, buy two 1U SuperMicro mini servers from eBay. eBay is a great place to pick up second-hand enterprise hardware, with lots of reliable sellers and good components. It’s a healthy market. Additionally, the SuperMicro mini servers come in a few different price points and configurations to fit your fancy.
I have worked with these boxes in the field and installed them as primary routers onsite for clients’ businesses. I recommend them and have had good (if not very good) experiences with them and their reliability. They have good enough performance for what most routers do.
For HA clusters in pfSense, here is the Netgate documentation: https://docs.netgate.com/pfsense/en/latest/highavailability/index.html. It does mention and include information for VMware hosts doing this.
As @carnogaunt mentioned, there are requirements for CARP. This is from that same documentation "For a cluster to function, a few things are required, such as:
Minimum of three IP addresses per subnet (one for primary, one for secondary, one or more for CARP VIPs) – This can be avoided on pfSense software version 2.2, but is still recommended.
A dedicated interface for state and configuration synchronization. For security reasons it is best to keep this isolated, though it could be run on a local network interface it is not advised.
Layer 2 equipment that properly handles multicast
Upstream/ISP/other involved routers that properly respect the addresses used by CARP."
You may have issues going this way, since you are on a home ISP connection and your ISP may not like you doing this. Not to say it can’t be done, just that there may be more involved.