So I was looking for a good server solution for multiplayer games as a solo dev. I don’t like Windows, but figured I should learn Hyper-V, and found this. It sounds like the setup I’m looking for, and I thought it might help any solo devs or small teams.
7 hypervisors running Windows Hyper-V Server 2019 in a cluster.
All nodes are running Threadrippers with a minimum of 128 GB of memory per node; probably half the nodes have 256 GB.
The Threadrippers have so many PCIe lanes it’s amazing. Each node has 3 NVMe drives, 10 gig networking (a dual-port NIC on some nodes, two separate cards on the ones that don’t have one), and a GPU.
These are used for our build and test farm. All of the nodes have a VM with the GPU passed through to the VM so we can run automated tests of our game.
The rest of the capacity is a mix of Windows and Linux VMs that act as build nodes and test servers (running the server, not the game), plus a couple of random other VMs such as our internal game servers (Valheim, Minecraft, Factorio, etc.).
All the VMs we run can be brought up from nothing to fully configured via Ansible, so we don’t really care about having them on a CSV (Cluster Shared Volume) for high availability. We prefer to have them on local NVMe disks for speed. Our build servers also usually have 750 GB or 1.2 TB VHDs because our builds are fucking stupid in size. Honestly, our biggest limitation is local disk space, so with that in mind we don’t replicate the VMs. This is why we make sure we can bring new build servers up from scratch quickly if needed.
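To put the local disk space limit in numbers, here is a rough sketch of how many build-server VHDs fit on one node. The drive size is not given in the post, so the 3 x 2000 GB figure is purely hypothetical, as are the function name `vhds_that_fit` and the 10% free-space reserve. It also assumes every VHD eventually grows to its full size and treats the local drives as one pool, which is a simplification.

```python
def vhds_that_fit(local_capacity_gb: int, vhd_size_gb: int, reserve_percent: int = 10) -> int:
    """Return how many VHDs of vhd_size_gb fit in local_capacity_gb.

    reserve_percent is the share of the disk kept free (hypothetical default).
    Integer math is used throughout so the result is exact.
    """
    if local_capacity_gb < 0 or vhd_size_gb <= 0:
        raise ValueError("capacity must be >= 0 and VHD size must be > 0")
    if not 0 <= reserve_percent < 100:
        raise ValueError("reserve_percent must be in [0, 100)")
    usable_gb = local_capacity_gb * (100 - reserve_percent) // 100
    return usable_gb // vhd_size_gb


if __name__ == "__main__":
    # Hypothetical node: 3 NVMe drives of 2000 GB each (sizes are NOT from the post).
    node_capacity_gb = 3 * 2000
    for vhd_gb in (750, 1200):  # the two VHD sizes mentioned above
        count = vhds_that_fit(node_capacity_gb, vhd_gb)
        print(f"{vhd_gb} GB VHDs per node: {count}")
```

If the drives are used separately rather than pooled, pass a single drive's capacity instead. Either way, only a handful of build VMs fit per node, which is why replicas are an expensive use of space.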
We do have a couple of CSVs though. One is for things like ISOs, base VHDs that we can copy from, and other random things. The other CSV runs the VMs that don’t have such critical IO needs, such as our game server. We run those off of a TrueNAS server.
Everything I just described is what we call our “Tier 2” cluster. Basically, if it goes down we fix it, but it isn’t an emergency.
We have a Tier 1 cluster as well. It is 3 Dell R6515 nodes with AMD EPYC CPUs and 256 GB of memory, and it runs all the VMs where uptime matters. This includes things such as our TeamCity server, Perforce, monitoring infrastructure, Ansible AWX, GitLab, and more. Each node here has 4 SFP+ ports, 2 onboard and 2 via an expansion card.
This cluster is backed by a Dell ME412* and two expansion ME4012*. The head unit is connected to the compute nodes directly via DAC cables, with no switching layer in between. The nodes use iSCSI with MPIO to access the LUNs that back the CSVs.
Overall we have been super happy with it. Pushing our build servers onto commodity Threadripper hardware has been great. We get far more build capacity for the same money, and performance has been great for the build machines, especially the newer ones that have PCIe 4.0 NVMe drives.
We could get away with not clustering our Tier 2 nodes. Having the CSV just makes things a smidge easier. We have migrated VMs between nodes and, other than it taking a bit of time because 1.2 TB ain’t small yo, we have never had a problem on the Threadripper nodes.
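For a rough sense of why moving a 1.2 TB VHD takes a bit of time, here is a back-of-the-envelope sketch. It assumes the migration runs over a single 10 gig link and that 1.2 TB means 1200 decimal GB; neither is confirmed in the post. The function name `transfer_minutes` and the `efficiency` factor are made up for illustration.

```python
def transfer_minutes(size_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Estimate minutes needed to move size_gb (decimal GB) over a link.

    link_gbps is the nominal link speed in gigabits per second.
    efficiency is the fraction of line rate actually achieved (assumed, 0 < e <= 1).
    """
    if size_gb < 0 or link_gbps <= 0:
        raise ValueError("size must be >= 0 and link speed must be > 0")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits_to_move = size_gb * 1e9 * 8            # bytes -> bits
    bits_per_second = link_gbps * 1e9 * efficiency
    return bits_to_move / bits_per_second / 60  # seconds -> minutes


if __name__ == "__main__":
    for size_gb in (750, 1200):  # the two VHD sizes mentioned earlier
        best_case = transfer_minutes(size_gb, 10)
        print(f"{size_gb} GB over 10 gig at line rate: {best_case:.1f} min")
```

At full line rate a 1.2 TB VHD works out to about 16 minutes, and real migrations will be slower once disk throughput and protocol overhead come into play.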
There is always more than one way to weld 2 cats together, so if you have other suggestions…