Hi,
this is gonna be a long one. So today's topic is my server, which runs a broad range of services: GitLab (with CI), Jellyfin, websites, and a number of game servers (and that's just an excerpt). The specs are roughly:
- AMD 7601
- Supermicro board
- 128GB RAM
- ZFS “raid_01”, 3x2TB WD Red HDD (RAIDZ1)
- ZFS “raid_02”, 3x1TB WD Red SSD (RAIDZ1)
- Everything runs Debian 10
The host acts only as ZFS storage system and hypervisor; everything else runs inside its own libvirt VM. All VMs can access the ZFS datasets via NFS (over an internal virtio network). I've got different VMs for different purposes: two for isolated Docker environments (trusted and public), two game-server VMs (one with NFS storage, one that stores everything on its own disk), and some others (not relevant today).
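For reference, the share/mount setup is roughly equivalent to this sketch (the dataset name, the 10.0.0.0/24 subnet, and the mount point are placeholders, not my actual values):

```shell
# On the host: export a dataset over the internal virtio network via ZFS
# (dataset name and subnet are placeholders)
zfs set sharenfs="rw=@10.0.0.0/24,no_root_squash" raid_01/gameservers

# Inside a VM: mount the export (host address is a placeholder)
mount -t nfs 10.0.0.1:/raid_01/gameservers /mnt/gameservers
```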
The VMs' root disks all live on the raid_02 pool (inside a shared dataset); raid_01 is my general data pool (the game server files are located there too).
All pools have lz4 compression enabled and are tuned to pass through any TRIM events from the VMs (yes, there are two minor data disks also located on raid_01 - not relevant).
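Concretely, that tuning amounts to something like this (a sketch; property names are the standard ZFS ones, the libvirt detail is noted as a comment):

```shell
# lz4 compression on both pools (inherited by all datasets)
zfs set compression=lz4 raid_01
zfs set compression=lz4 raid_02

# Pass TRIM/discard through to the physical disks automatically
zpool set autotrim=on raid_01
zpool set autotrim=on raid_02

# For the VM disks themselves, the libvirt disk definitions additionally
# use discard='unmap' so guest TRIM actually reaches the host storage
```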
What's the problem?
I use Pterodactyl to host my game servers - it also provides a handy service to create server backups. I currently have about 8 active Minecraft servers on the VM called “Gameserver-1”. Until recently their backups (which basically copy the whole server directory) all ran simultaneously at the top of every hour. As one can imagine, this made all the game servers lag badly and freeze for several seconds, often timing out the players on them. I've since spread the backup schedules across the hour and that problem is now “gone” - but the performance of the NFS-attached datasets is still not impressive. I only have simple copy-paste benchmarks for now, but peak write performance feels slow (my PC, 1 Gbit/s → ssh → VM → NFS → root server → ZFS: ~50 MB/s) while direct performance is much better (my PC, 1 Gbit/s → ssh → root server → ZFS: ~120 MB/s).
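If more solid numbers would help, I could redo the benchmark with fio instead of copy-paste - once inside the VM on the NFS mount and once directly on the host (the directory path is a placeholder):

```shell
# Sequential 1M writes, 2 GiB total - run in both locations and compare
fio --name=seqwrite --rw=write --bs=1M --size=2G \
    --directory=/mnt/gameservers --direct=1 --group_reporting
```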
What I tried…
I assume something is going wrong on the ZFS side, as NFS is managed by it. All NFS shares are mounted with the options “defaults,_netdev,x-systemd.automount,x-systemd.requires=network-online.target”. Adding “async” does not affect the performance.
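For completeness, one of the fstab entries inside a VM looks roughly like this (server address and paths are placeholders), and on the host one can check whether ZFS is forcing synchronous writes:

```shell
# /etc/fstab inside a VM (example entry)
# 10.0.0.1:/raid_01/gameservers  /mnt/gameservers  nfs  defaults,_netdev,x-systemd.automount,x-systemd.requires=network-online.target  0  0

# On the host: NFS write behavior follows the dataset's sync property
zfs get sync raid_01
```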
All VMs are connected to the server through a virtual network with virtio devices - so the only limit should be the CPU?!
Help…
So here are my questions:
- Do you see any general problems that need to be addressed?
- Any ideas how I could improve the NFS performance inside my VMs?
- I've thought about shrinking raid_02 and adding a SLOG / L2ARC on the freed-up SSD space - any thoughts on this?
- I use Netdata for all monitoring, so in case you have an idea which graphs I should look at…
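For the SLOG / L2ARC question, I imagine the change would look roughly like this (device paths are placeholders; as far as I know a SLOG only helps synchronous writes, which is exactly what NFS generates by default):

```shell
# Add a mirrored SLOG (intent log) and an L2ARC (read cache) to raid_01,
# using spare partitions on the raid_02 SSDs - device names are placeholders
zpool add raid_01 log mirror /dev/disk/by-id/ssd1-part2 /dev/disk/by-id/ssd2-part2
zpool add raid_01 cache /dev/disk/by-id/ssd3-part2
```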
Anyway: thanks for reading, for every response, and for your ideas!