I have a strange issue that I am not having a lot of luck finding information about.
I have a Windows Server 2019 VM running on a VMware ESXi 6.5.0 host that has all-SSD storage in a RAID 50 array.
The VM is running SQL Server 2019 and has 3 virtual disks: C: for the OS and applications, D: for the database files, and F: for manual database snapshots, backups and shuffling things around.
The VM has 16 cores and 96 GB of RAM, and the database is currently sitting at 136 GB.
Recently we have noticed that database performance has fallen off a cliff, and it seems to be related to the virtual disk D: that the database files are located on. It basically has the performance of an old spinning disk from the 90s.
The strange thing is that there is no obvious reason why, at least not to me with my limited experience administering SQL Server. The other 2 virtual disks are performing exactly as expected, with SSD speeds and response times.
The CPU, memory and network are all well within requirements, with none of them coming anywhere close to 100% utilisation. However, when looking at the disk metrics in Resource Monitor we are seeing some strange behaviour.
When there is little or no load on the server it looks OK at a glance, but as soon as any load hits the disk, active time shoots to 100%, queue length and response time rise rapidly, and transfer speeds cap out at only a couple of megabytes a second. This basically brings the database to a screeching halt, all while the CPU, memory and network barely register a blip.
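As a rough sanity check on how those metrics relate to each other, Little's law says that average queue length equals IOPS multiplied by average latency. The sketch below is only illustrative: the function name is made up, the inputs are placeholder values rather than anything measured on this server, and the 8 KB I/O size is an assumption based on the SQL Server page size (real I/O sizes vary).

```python
def estimate_latency_ms(queue_length, throughput_mb_s, io_size_kb=8.0):
    """Estimate average per-I/O latency (ms) using Little's law.

    queue_length    : average number of outstanding I/Os on the disk
    throughput_mb_s : observed transfer rate in MB/s
    io_size_kb      : assumed average I/O size in KB (8 KB is an assumption)
    """
    if throughput_mb_s <= 0 or io_size_kb <= 0:
        raise ValueError("throughput and I/O size must be positive")
    # Convert throughput to I/O operations per second.
    iops = (throughput_mb_s * 1024.0) / io_size_kb
    # Little's law: queue_length = iops * latency, so latency = queue_length / iops.
    return queue_length / iops * 1000.0


if __name__ == "__main__":
    # Placeholder values for illustration only, not measurements.
    print(estimate_latency_ms(queue_length=10, throughput_mb_s=2))
```

The point of the arithmetic is that a deep queue combined with only a couple of MB/s implies a long wait per I/O, which matches the pattern of high response times alongside low transfer speeds.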
It's got me a bit stumped, as I have not seen this before and I’m not experienced enough to diagnose it.
All three virtual disks are in the same storage pool on the host, are connected to the same virtual controller in the VM config, and are all provisioned the same way (thick provisioned, lazy zeroed). It's like something is sapping all the performance of that one disk but is not showing up in the OS.
Any pointers as to what I could investigate would be appreciated. If I can’t figure this out I will have to call in the consultants.
Edit: As I’m sure someone will ask, yes, we checked the RAID array and everything is healthy, with green lights on everything on the server.