NVME Based Storage Clusters (Ceph vs AzureHCI vs VSAN)

Hi Everyone,

First post here but I’ve been watching the YouTube videos and was hoping for some direction/guidance.

We are looking to deploy a multi-node, multi-site enterprise infrastructure for my company’s virtualization and storage needs. With the affordability of NVMe drives these days we are excited to go all-NVMe, but we are struggling to compare the different options meaningfully head-to-head.

For example, VMware vSAN seems to offer the best of everything performance- and feature-wise, but the costs are becoming increasingly discouraging (licensing is almost as much as the hardware!). Azure HCI was our original default pick, but watching some of the YouTube tech channels it seems that there are issues with overloading system RAM on high-throughput transfers (I saw this mentioned on LTT, but I’m not sure if that is still true?). Finally, I like the open source/reliability/scalability/flexibility of Ceph, but the performance seems to be a real issue (or is it?).

Just wondering how best to approach this and most importantly where to find some good head to head reviews (ironically on Google almost everything I’m seeing dates back to the pre-all-NVME era 2019, etc).

Thank you!

Sync or async failover site?

Likely async because we are looking at trans-continental distances on a private network so ~100ms type latencies. We aren’t yet at the stage of being able to justify in-metro redundancy.
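To put a rough number on why sync is hard at that distance, here is a back-of-envelope sketch in Python. It assumes the ~100 ms figure is the round-trip time and that every synchronous write has to wait for one full round trip before it is acknowledged; the function name and the queue depth parameter are illustrative and not taken from any product.

```python
def max_sync_writes_per_second(round_trip_ms: float, queue_depth: int = 1) -> float:
    """Upper bound on acknowledged writes/sec when each write waits one round trip.

    Assumption: the remote site must confirm every write before the local
    side acknowledges it, and local disk latency is ignored.
    """
    if round_trip_ms <= 0:
        raise ValueError("round_trip_ms must be positive")
    if queue_depth < 1:
        raise ValueError("queue_depth must be at least 1")
    # Each in-flight write completes once per round trip.
    return queue_depth * 1000.0 / round_trip_ms


if __name__ == "__main__":
    for depth in (1, 32):
        ceiling = max_sync_writes_per_second(100, depth)
        print(f"queue depth {depth}: at most {ceiling:.0f} writes/s")
```

With one outstanding write that is a ceiling of 10 acknowledged writes per second per stream. More outstanding writes raise the ceiling, but every individual write still pays the full round trip.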

That said, if we do a multi-cluster setup (for example 2x 2-4 node Supermicro BigTwins), then we would likely establish some active-active interaction.

If it only needs to be async, why not ZFS with replication between the sites?

So for the start and medium term async replication, and then in the long term active-active to get what VMware calls “automatic transparent failover” or fault tolerance?
That speaks for VMware, I don’t think you’ll get to that level with Ceph, but not sure, I’ve been out of the topic for a while.
We have a hybrid 4 node VSAN cluster since VMWare 6.0, now 7.0.3 to get rid of the flash-based web client.
It is a great solution in combination with Veeam Backup, no real problems so far, but unfortunately the price …
To have a comparison, we have a Proxmox/ZFS cluster running at another location.

It’s by far not the same level as VMware, e.g. you have to ensure that your snapshots are “application-aware”, but otherwise it’s fine and works for us.
Both nodes run active VMs and each is synchronized to the other node every 15 minutes, so if one node has problems, only 50% of the VMs need to be restarted.
Since our most important database has its own active-active cluster solution, we don’t lose any important data.
When it’s time to replace the VSAN cluster, it will be replaced by Proxmox.
And Azure HCI is then again a completely different approach; there is certainly more to discuss than just the performance.
And in the end, it also depends on your team; there are no solutions without the right customer :wink:
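For anyone weighing the 15-minute replication setup described above, here is a minimal Python sketch of what it implies on a node failure. It assumes two nodes with the VMs split between them and that anything written after the last completed sync is lost on failover; the node names, VM counts, and function names are made up for illustration.

```python
from datetime import datetime, timedelta

REPLICATION_INTERVAL = timedelta(minutes=15)  # interval mentioned in the post


def data_loss_window(last_completed_sync: datetime, failure_time: datetime) -> timedelta:
    """Time span of writes that exist only on the failed node."""
    if failure_time < last_completed_sync:
        raise ValueError("failure_time is before the last completed sync")
    return failure_time - last_completed_sync


def restart_fraction(vms_per_node: dict, failed_node: str) -> float:
    """Share of all VMs that must be restarted on the surviving node(s)."""
    total = sum(vms_per_node.values())
    if total == 0:
        raise ValueError("no VMs configured")
    return vms_per_node[failed_node] / total


if __name__ == "__main__":
    cluster = {"node-a": 10, "node-b": 10}  # hypothetical even split
    lost = data_loss_window(datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 12, 14))
    print(f"restart share: {restart_fraction(cluster, 'node-a'):.0%}")
    print(f"unreplicated window: {lost}, interval: {REPLICATION_INTERVAL}")
```

The window can exceed the interval slightly if a replication run takes a while or fails, which is why the important database having its own active-active clustering matters.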

That’s definitely an option and I like ZFS a lot. I just started to become aware of Ceph, and the flexibility to scale up, down, and around was really appealing. That way we can, for example, get a Storinator or a JBOD and just add on incrementally as needed without having to buy more chassis, do reconfigs, etc.

You bring up a good point which I was also thinking about: just rely on application-native HA to manage the real failover, which makes things like VMware Continuous Availability less relevant (frankly I don’t think we have any workloads that couldn’t survive an unexpected reboot, but it was a nice feature to have peace-of-mind-wise).

The other feature I really like that VMware is offering is the per-VM storage policies. This seems like a really smart way of balancing disk space/raw data inefficiency against resiliency, but again I’m not sure it justifies that cost. You can afford to waste a few TBs of raw capacity if you aren’t paying VMware licensing!
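To make the space-versus-resiliency tradeoff concrete, here is a small Python sketch of the raw capacity a given amount of VM data consumes under plain replication versus erasure coding. The formulas are the generic ones (copies, or data plus parity chunks); the specific layouts in the example are illustrative assumptions and not a statement of what any particular product's policy maps to.

```python
def raw_tb_replicated(logical_tb: float, copies: int) -> float:
    """Raw capacity used when every block is stored `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return logical_tb * copies


def raw_tb_erasure_coded(logical_tb: float, data_chunks: int, parity_chunks: int) -> float:
    """Raw capacity used with data_chunks + parity_chunks erasure coding."""
    if data_chunks < 1 or parity_chunks < 0:
        raise ValueError("need at least one data chunk and non-negative parity")
    return logical_tb * (data_chunks + parity_chunks) / data_chunks


if __name__ == "__main__":
    logical = 12.0  # TB of VM data, hypothetical
    print("2 copies:", raw_tb_replicated(logical, 2), "TB raw")
    print("3 copies:", raw_tb_replicated(logical, 3), "TB raw")
    print("3+1 EC  :", raw_tb_erasure_coded(logical, 3, 1), "TB raw")
    print("4+2 EC  :", raw_tb_erasure_coded(logical, 4, 2), "TB raw")
```

Per-VM policies let you pick one of these per workload; without them you typically choose one layout per pool or datastore and accept the overhead for everything in it.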

Curious: what is your high-level concern with Azure HCI?

Well, it looks like I didn’t get the memo that MS has had a counterpart to vSphere/vSAN since 2020.
I thought Azure HCI was a cloud-only solution and not an on-prem hybrid cloud solution for Azure.

I don’t know, maybe I’m biased, but I can’t get Windows and “vhost” together in one sentence, an antivirus program at the hypervisor level?
These are problems I don’t want to think about.
I tested a bit with Hyper-V and Storage Spaces some time ago and wasn’t convinced; performance with Storage Spaces wasn’t good, but that has probably gotten better.
The release of Windows 8 made me think; since then we have been removing dependencies on Microsoft and proprietary solutions in general with every change in our IT.
But that always depends on the situation. With the requirement to possibly operate a stretched cluster across several locations, I probably wouldn’t plan to switch to Proxmox completely now either.
Although there will certainly be solutions without tying yourself to Microsoft or similar bloodsuckers.