On an ESXi host that I’ve been messing with, the SSD I use for the main datastore (a 120GB CS900) will intermittently disappear and cause the running VM to crash. Ultimately the host crashes as well when I try to access the datastore in the web UI. Everything is fine upon reboot.
Not really knowing what I’m doing, I SSH’d into the host machine and eventually found plenty of I/O errors in /var/log/ relating to this SSD. So I think ‘OK, the drive is just going bad’, but when I move the drive to another machine and run short and long tests with smartctl, all tests complete with no errors, and I can mount its file system with vmfs6-tools and nothing looks out of place.
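Beyond the pass/fail result of the short and long self-tests, the raw SMART attribute counters are often more telling. A minimal sketch of pulling a couple of relevant counters out of a `smartctl -a` attribute table (the sample output below is hypothetical, not from the drive in question):

```python
# Sketch: extract raw SMART counters from a `smartctl -a` attribute table.
# SAMPLE is a hypothetical excerpt for illustration, not real output from
# this drive; the column layout follows smartctl's standard table.

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1800
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
"""

def parse_attributes(text):
    """Map attribute name -> raw value (int) from a smartctl attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have 10+ columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs

attrs = parse_attributes(SAMPLE)
print(attrs["Power_On_Hours"])        # 1800
print(attrs["UDMA_CRC_Error_Count"])  # 0 - a rising count here points at the cable/link
```

A rising UDMA_CRC_Error_Count in particular indicates link-level problems (cable or connector) rather than failing flash, which is worth watching given that the self-tests pass.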
Replacing the drive is no problem, but as an experiment I’d really appreciate some input on further steps that I can take to confirm that this drive is causing the problem I’m having.
It might be a BIOS thing. Or the CPU/chipset. The only way to exclude the troublesome SSD is to replace it with another one and let it run a while to see what it does. If that one fails too, a hardware upgrade would be in order.
It may also be a cache issue or, more likely, a wear-levelling issue: if the drive is quite full, the controller may not have enough free memory cells to account for failed ones. But that should show up in SMART, and you’ve reported no errors. Besides, I’d only be worried about such an issue when the drive has 50k+ power-on hours. (I had an OCZ Onyx series SSD fail on me, after well over 70k hours.)
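The wear-levelling concern comes down to how many cells the controller can still rotate writes through. A back-of-envelope sketch (the raw-NAND and fill figures are illustrative assumptions, not specs for the CS900):

```python
# Back-of-envelope for the wear-levelling concern: the controller can only
# rotate writes and remap failed blocks using cells not pinned by live data.
# Numbers are illustrative assumptions, not actual CS900 specifications.

advertised_gb = 120      # advertised capacity
physical_gb = 128        # assumed raw NAND (~7% factory overprovisioning is typical)
used_fraction = 0.95     # a nearly full drive

factory_spare = physical_gb - advertised_gb     # cells the user never sees
free_user_space = advertised_gb * (1 - used_fraction)
wear_pool_gb = factory_spare + free_user_space  # total headroom for wear-levelling

print(f"cells free to rotate/remap: ~{wear_pool_gb:.0f} GB")
```

With a nearly full drive the pool shrinks toward the factory spare alone; with ~70% of the drive free, as in this case, the pool is large, which is consistent with wear-levelling being an unlikely culprit here.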
50k, holy smokes. This drive only has ~1800 hours on it and runs with ~70% available in this application. Not to say it hasn’t been abused in the past. As affordable as SSDs are in this capacity I think your advice to retest with a second drive is a good idea.
I have been testing with different onboard SATA ports, but it didn’t occur to me to test the cable itself. The cable I’m using is at least as old as the drive.
Swapping ports dropped the MTBF from days to hours, which very well could be explained by a flaky cable.
The known good cable didn’t solve it. It occurred to me this morning that the drive ESXi is installed on may be the one having problems; I’m going to take a look at it next time I take the server down.
Also, recovering ESXi dump files isn’t as straightforward as I expected, though I’m sure it’s second nature once you’re used to it.
As I was digging through the BIOS to enable PXE for further smartctl tests, I noticed that xHCI hand-off was disabled in the BIOS. I had been intermittently seeing xHCI hand-off errors, but the hardware seemed to be working and I didn’t know what it was, so I ignored it. I’m passing a PCIe USB controller through to one of the guest VMs, so I’ll see if clearing that up makes the difference.
This was the first time using PXE on a physical computer, which was pretty cool.
Exactly: while the cables are robust, they are subject to heat stress and vibration.
Loose cable heads can cause flaky operation due to thermal expansion and contraction.
Loose electrical connections have a slightly higher resistance and will in turn generate even more heat.
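That feedback loop can be put in rough numbers: power dissipated in a contact is P = I²R, so even a modest rise in contact resistance multiplies the local heating. A quick sketch (the current and resistance figures are assumptions for illustration, not measurements):

```python
# Why a loose connector runs hot: dissipated power in a contact is P = I^2 * R,
# so a small rise in contact resistance multiplies local heating.
# Current and resistance values below are illustrative assumptions.

current_a = 2.0            # rough peak draw on an SSD's 5 V rail
good_contact_ohm = 0.01    # clean, tight power-pin contact
loose_contact_ohm = 0.5    # oxidised or loose contact

p_good = current_a**2 * good_contact_ohm    # 0.04 W
p_loose = current_a**2 * loose_contact_ohm  # 2.0 W, concentrated in one pin

print(f"good contact: {p_good:.2f} W, loose contact: {p_loose:.2f} W")
```

More heat means more expansion, which can loosen the contact further, so the effect compounds in exactly the way described above.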
The xHCI errors persisted, so as a diagnostic step I’m running the workload on bare metal to see if the issues persist without the hypervisor. It’s looking like I was simply experiencing some of the USB hub passthrough sketchiness that I’ve read about.