Strange virtual disk performance on Server 2019 VM running SQL Server

Hi everyone.

I have a strange issue that I am not having a lot of luck finding information about.
I have a Windows Server 2019 VM running on a VMware ESXi 6.5.0 host that has all-SSD storage in a RAID 50 array.

The VM is running SQL Server 2019 and has 3 virtual disks: C: for the OS and application, D: for the database files, and F: for manual database snapshots, backups, and shuffling things around.

The VM has 16 cores and 96 GB of RAM; the database is currently sitting at 136 GB.

Recently we have noticed that database performance has fallen off a cliff, and it seems to be related to the virtual disk D: that the database files are located on. It basically has the performance of an old spinning disk from the 90’s.

The strange thing is there is no obvious reason why, or at least none apparent to me with my limited experience administering SQL Server. The other 2 virtual disks are performing exactly as expected, at SSD speeds and with expected response times.

The CPU, memory and network are all well within requirements, with none of them ever coming anywhere close to 100% utilisation. However, when looking at the disk metrics in Resource Monitor we are seeing some strange behaviour.

When there is little or no load on the server it looks OK at a glance, but as soon as any load hits the disk, active time shoots to 100%, queue length and response time rise rapidly, and transfer speeds cap out at only a couple of megabytes a second, basically bringing the database to a screeching halt. All this while the CPU, memory and network barely register a blip.
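For anyone who wants to capture the same pattern in numbers rather than eyeballing Resource Monitor, a minimal sketch along these lines (assuming Python with the psutil package; the PhysicalDriveN names are just how Windows labels the disks) logs per-disk throughput and a rough per-I/O latency every second:

```python
# Minimal sketch: log per-disk throughput and average I/O latency once a second,
# so the stall pattern (a few MB/s with rising response times) shows up in the log.
# Assumes the psutil package; on Windows, disks appear as PhysicalDrive0, 1, 2...
import time
import psutil

INTERVAL = 1.0  # seconds between samples

prev = psutil.disk_io_counters(perdisk=True)
while True:
    time.sleep(INTERVAL)
    cur = psutil.disk_io_counters(perdisk=True)
    for disk, c in cur.items():
        p = prev[disk]
        ops = (c.read_count - p.read_count) + (c.write_count - p.write_count)
        mb = ((c.read_bytes - p.read_bytes) + (c.write_bytes - p.write_bytes)) / 1e6
        busy_ms = (c.read_time - p.read_time) + (c.write_time - p.write_time)
        avg_ms = busy_ms / ops if ops else 0.0  # rough ms per I/O in this sample
        print(f"{time.strftime('%H:%M:%S')} {disk}: {mb / INTERVAL:6.1f} MB/s, "
              f"{ops:5d} IOs, ~{avg_ms:6.1f} ms/IO")
    prev = cur
```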

It’s got me a bit stumped, as I have not seen this before and I’m not experienced enough to diagnose it.

All three virtual disks are in the same storage pool on the host, are connected to the same virtual controller in the VM config, and are all thick provisioned, lazy zeroed. It’s like something is sapping the performance of that one disk, but it’s not showing in the OS?

Any pointers as to what I could investigate would be appreciated. If I can’t figure this out, I will have to call in the consultants.

Edit: as I’m sure someone will ask, yes, we checked the RAID array and everything is healthy; green lights on everything on the server.

Q: Why are you still running ESXi 6.5? It’s ancient.

I can’t remember if 6.5 has official support for Server 2019 at all. It may have capped at 2016; it definitely doesn’t at RTM, only possibly at a later patch level (and if you aren’t running vSphere, you have no hypervisor patch management?).

I’d upgrade to a fully patched 6.7 or 7 and see if the issues persist.

Edit:
Also: not sure if ESXi supports SSD TRIM, and if it does, maybe only in more recent versions or on specific controllers. I’d wager your hardware maybe isn’t on the VMware HCL(?), which does matter if you want a properly supportable environment.

I’d at least look into TRIM, confirm whether or not your version of ESXi even supports Server 2019 to start with, and remediate if necessary.
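For the guest side of that, something like this quick sketch (a Python wrapper around the real fsutil command; run it from an elevated prompt) tells you whether Windows itself has TRIM enabled. Whether the UNMAP actually makes it through the virtual disk layer and ESXi 6.5 to the array is a separate question:

```python
# Quick guest-side check: ask Windows whether TRIM/delete notifications are enabled.
# "DisableDeleteNotify = 0" means TRIM is on in the guest; whether the UNMAP
# actually reaches the SSDs through ESXi is a separate question.
import subprocess

out = subprocess.run(
    ["fsutil", "behavior", "query", "DisableDeleteNotify"],
    capture_output=True, text=True, check=True,
).stdout
print(out)
if "= 0" in out:
    print("Guest-level TRIM is enabled.")
else:
    print("Guest-level TRIM appears disabled.")
```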

Just because it runs doesn’t mean it is fully supported or fully functional. I’ve personally run into ESXi vs Windows version compatibility issues in the past; they’ve caused hypervisor pink screens of death and other problems.

I’ve never done it myself as we don’t run massive SQL here, but from memory, heavy SQL use is also one of those scenarios where raw disk passthrough to the VM is worth considering.

If, on the off chance, you do have VMware support, I’d also suggest logging a ticket. They’re generally pretty good.

If you don’t, and are running free ESXi in production due to license costs, I’d seriously suggest considering a move to Hyper-V, as at least if you’re up to date with Windows licenses it covers virtual instances much more cheaply, and Windows is less picky about hardware.

Sorry if this post smacks of “you’re running an unsupported platform”, but… diagnosing this stuff in a virtual environment is more complex than physical, as there are far more layers of abstraction going on and far more room for bugs or configuration “gotchas”.

Can’t stress enough: at least try to get up to date to the level where ESXi officially supports Server 2019; it does matter. As things stand in your current environment I’d be pointing the finger at ESXi, not inside the Windows VM (but also maybe at VM resource allocation).

Windows Resource Monitor won’t give you jack shit on that, and without vSphere the performance metrics on the VMware side will be awkward to extract without getting into the ESXi command line.


Oh, other things…

The VMware scheduler can trip over itself if your host is resource-constrained.

If you don’t need 16 cores (i.e. based on observed consumption, not a dickhead consultant reading a spec sheet written by a nerd based on bang-for-buck on physical hardware), do not allocate them. Especially if you don’t have more than 16 real cores (not hyper-threaded ones) in the host. Unused but allocated virtual cores in VMware are super not good; they burn time slices for nothing.

Having 16 virtual processors in a VMware VM means the VM does not get scheduled to run until all 16 cores are available at the same time, which may mean it is delayed (e.g. a 16-real-core box needs some cores for the RAID, virtual networking, hypervisor overhead/housekeeping, etc.). Be careful with over-subscription and allocating a heap of cores.

You say the VM is nowhere near hitting its limits. Try taking some CPU and RAM off it, based on checking the stuff below.

You can see if this is happening, on both the host and the guest, by measuring the CPU “co-stop” and “ready” metrics.

Ready and co-stop are indications that the VM is sitting there waiting to be scheduled while the processors it needs aren’t all available. Ideally both of these metrics should be ZERO, but less than 10 (on the real-time graph) isn’t terrible. They show how many ms your VM was frozen waiting for CPU scheduling during the sampling period.
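Without vSphere, one way to get at these is to capture esxtop in batch mode from the ESXi shell (something like `esxtop -b -d 5 -n 120 > stats.csv`) and scan the CSV. A rough sketch like this pulls out the worst % Ready / % CoStop values per VM; the counter header pattern is an assumption based on esxtop’s perfmon-style CSV, so check it against your actual export:

```python
# Sketch: scan an esxtop batch export for per-VM CPU "% Ready" and "% CoStop"
# columns and report the worst value seen for each.
# The header pattern ("\\host\Group Cpu(id:vmname)\% Ready") is an assumption
# based on esxtop's perfmon-style CSV; verify against your file.
import csv

THRESHOLD = 5.0  # percent; sustained values above a few percent are worth chasing

with open("stats.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    cols = [i for i, h in enumerate(header)
            if "Group Cpu" in h and ("% Ready" in h or "% CoStop" in h)]
    worst = {header[i]: 0.0 for i in cols}
    for row in reader:
        for i in cols:
            try:
                worst[header[i]] = max(worst[header[i]], float(row[i]))
            except (ValueError, IndexError):
                continue  # skip blank or ragged cells

for name, value in sorted(worst.items(), key=lambda kv: -kv[1]):
    if value >= THRESHOLD:
        print(f"{value:6.1f}%  {name}")
```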

Reducing the CPU core count in that case can sometimes actually improve performance, as the VM can be scheduled to run with fewer delays.

What does the RAM consumption look like on the host? Any memory ballooning on the guest? Is SQL Server’s memory setting tuned appropriately? Just because you’ve given the guest 96 gig and SQL thinks it has 96 gig doesn’t mean the hypervisor will actually supply it. In the hypervisor metrics, “balloon” is memory the hypervisor has reclaimed from the guest; in the guest, any memory used by the balloon driver is how much of the VM’s real memory has been swapped out by the hypervisor.

If you’re seeing ballooning, consider how much memory you have available in the host vs how much the guest really needs vs what to give it and what to tune SQL for.

Again, if SQL thinks it has 96 gig and is tuned for that, and VMware has reclaimed, say, 50 gig, it won’t all be happy.

If you definitely have plenty of RAM, and definitely want to ensure SQL has it, consider locking the memory in the VM settings (i.e. a full memory reservation).
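A quick way to sanity-check the SQL side of this is to compare what the guest OS exposes against the configured cap. A sketch, assuming the pyodbc package and a trusted local connection (the driver name and SERVER=localhost are placeholders to adjust for your instance):

```python
# Sketch: compare what the guest OS exposes to SQL Server with the configured cap.
# Assumes pyodbc and a trusted connection; driver/server names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;Trusted_Connection=yes"
)
cur = conn.cursor()

# What the guest OS reports to SQL Server. If the hypervisor has ballooned memory
# away, these guest-side numbers can still look healthy, which is the trap.
cur.execute("""
    SELECT total_physical_memory_kb / 1024 AS total_mb,
           available_physical_memory_kb / 1024 AS available_mb,
           system_memory_state_desc
    FROM sys.dm_os_sys_memory
""")
print("guest OS view:", cur.fetchone())

# The configured 'max server memory' cap (the default is effectively unlimited).
cur.execute("""
    SELECT CAST(value_in_use AS bigint)
    FROM sys.configurations
    WHERE name = 'max server memory (MB)'
""")
print("max server memory (MB):", cur.fetchone()[0])
```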

Hi Thro, thank you so much for your reply.

So, to answer some of the points you mentioned:

The server has 2 x 12c/24t CPUs, so 24 physical cores and 48 threads, and is currently running 2 production VMs: one with 16 cores (SQL) and one with 8 (an application server). It also has 256 GB of RAM installed and 8 x 1 TB HP SSDs in RAID 50. We are well within the hardware limits of the host server.

It was configured with the RAID and this version of ESXi before I started working here, and it’s what they had a license for. I then set up the new VMs on it for new ERP software.

You may be right about the age of the hypervisor and Server 2019. Your suggestion of switching to Hyper-V is well received; I had suggested that myself and was going to start testing on another server we have, to see if I can convert the VM to Hyper-V. (I know I could just build a new VM and move the database files over, but I don’t want to rebuild the app server if I can avoid it. It was a bloody pain to set up, so I would rather convert if I can.)

The idea being that if I can convert the virtual disks and spin them up on the Hyper-V host, I can then reconfigure the production host with the latest version of Hyper-V and migrate them back. I also have a lot more experience with Hyper-V than ESXi, so I’m more comfortable with it.

I will have a look into the settings and metrics in ESXi you mentioned and see if they help. Given I can’t go buying more licensing for newer ESXi versions at the moment, there may be some limitations I can’t overcome, as you pointed out.

I will let you know what I find.

Yeah, I’m actually in a similar situation here: free ESXi installs in a lot of sites (we were a vSphere shop with a lot of shitty little remote sites that couldn’t justify vSphere). This was from before Hyper-V was a thing.

Not sure if you’re aware, but ESXi itself is free, with no support from VMware. I’m not sure there is a paid license for it on its own; that’s normally associated with vSphere/vCenter Server, which does the control/central logging/license server, etc.

If you’re running ESXi standalone, it may well be a FREE UPGRADE for you, as you may well be on the free license. If you have no vCenter, I’d suggest this is probably the case.

If you have a VMware support subscription (and vCenter/vSphere), I believe upgrades are included.

Virtual disks can be converted, either offline with StarWind V2V Converter, or sort of online using Microsoft SCVMM and vCenter Server.

vCenter is available as an eval for 60 days. You’ve (maybe?) had contact with SCVMM before if you’ve run Hyper-V with multiple hosts in a cluster (it’s Microsoft’s equivalent product to vSphere).

Basically, you can add your vSphere cluster to SCVMM and it can handle migration of VMs. Not whilst running, unfortunately, but it will automate disk conversions, etc.
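If you’d rather script the offline conversion than click through a GUI, a sketch like this works too. qemu-img is an assumption on my part (a free third-party tool, not something either of us mentioned), and the staging paths are hypothetical:

```python
# Sketch: script an offline VMDK -> VHDX conversion with qemu-img, a free
# third-party tool (an assumption; StarWind V2V Converter does the same job
# with a GUI). Run against copies of the disks while the VM is powered off,
# pointing qemu-img at the descriptor .vmdk files.
import subprocess
from pathlib import Path

SRC_DIR = Path(r"D:\v2v\source")     # hypothetical staging folder for the VMDKs
DST_DIR = Path(r"D:\v2v\converted")  # hypothetical output folder

DST_DIR.mkdir(parents=True, exist_ok=True)
for vmdk in SRC_DIR.glob("*.vmdk"):
    if vmdk.name.endswith("-flat.vmdk"):
        continue  # skip the data extents; qemu-img wants the descriptor file
    vhdx = DST_DIR / (vmdk.stem + ".vhdx")
    subprocess.run(
        ["qemu-img", "convert", "-p",   # -p prints progress
         "-f", "vmdk", "-O", "vhdx",
         "-o", "subformat=dynamic",     # dynamic (thin) VHDX
         str(vmdk), str(vhdx)],
        check=True,
    )
    print("converted", vmdk.name, "->", vhdx.name)
```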

We do run vSphere, and before they hired me (more for my experience with the ERP software than my IT admin experience, although I do alright most of the time), they had a contract with a small local IT company that managed and maintained all their servers for them and did all the other general IT stuff like client setup, support, networking, etc.

Their annual spend with them had got to the point where it made sense to hire someone full time, and that’s where I came in.

We still have them on contract, but now only as a backup to me and to assist me where needed; I had contacted them previously about this issue.
I called them again today, mentioned some of the points you raised, and asked why we were on such an old version. He told me it was to maintain compatibility for migrating VMs to some of the older servers we have for failover purposes.

So that’s a reason, I guess. He is going to help me look into the stuff you suggested in ESXi and then do the Hyper-V tests; he thinks we can restore/convert the VMware VMs to Hyper-V from within Veeam Backup, which should make life a little easier.
