Intermittently failing SSD, but drive "tests" good

wizarddata · November 18, 2021, 7:14pm

On an ESXi host that I’ve been messing with, the SSD that I use for the main datastore (120GB CS900) will intermittently disappear and cause the running VM to crash. Ultimately the host crashes as well when I try to access the datastore in the web UI. Everything is fine upon reboot.

Not really knowing what I’m doing, I SSH’d into the host machine and eventually found plenty of I/O errors in /var/log/ relating to this SSD. So I think ‘Ok, the drive is just going bad’, but when I move the drive to another machine and run short and long tests with smartctl, the tests all complete with no errors, and I can mount its file system with vmfs6-tools and nothing looks out of place.
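
For anyone retracing this step, here is a minimal sketch of pulling the I/O error lines out of a copied log file, so the failure times can be lined up against when the datastore dropped. The `I/O error` pattern and the sample lines are assumptions made up for illustration, not actual ESXi log output; adjust the pattern to whatever the logs really say.

```python
import re
from typing import Iterable, List


def find_error_lines(lines: Iterable[str], pattern: str = r"i/o error") -> List[str]:
    """Return the lines that match `pattern`, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if regex.search(line)]


# Hypothetical sample lines, invented for this example.
sample = [
    "2021-11-18T18:02:11Z storage: I/O error on device A\n",
    "2021-11-18T18:02:12Z network: link up\n",
    "2021-11-18T18:05:40Z storage: i/o ERROR on device A\n",
]

hits = find_error_lines(sample)
print(len(hits), "matching lines")
for line in hits:
    print(line)
```

On a real log, the same function can be fed an open file object (for example one opened with `errors="replace"`), since it only needs something that yields lines.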

Replacing the drive is no problem, but as an experiment I’d really appreciate some input on further steps that I can take to confirm that this drive is causing the problem I’m having.

Thanks in advance.

Dutch_Master · November 18, 2021, 9:07pm

It might be a BIOS thing. Or the CPU/chipset. The only way to exclude the troublesome SSD is to replace it with another one and let it run a while to see what it does. If that one fails too, a hardware upgrade would be in order.

It may also be a cache or, more likely, a wear-levelling issue: if the drive is quite full, the controller may not have enough free memory cells to replace failed ones. But that should show up in SMART, and you’ve reported no errors. Besides, I’d only be worried about such an issue once the drive has 50k+ power-on hours. (I had an OCZ Onyx series SSD fail on me after well over 70k hours.)

HTH!

SgtAwesomesauce · November 18, 2021, 9:35pm

Check the cables? I’ve had faulty/loose SATA cables cause issues similar to what you’re describing.
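
A loose or failing SATA cable often shows up as interface CRC errors, which many drives count in SMART attribute 199 (`UDMA_CRC_Error_Count`), even when the short and long self-tests pass. Here is a rough sketch that pulls that raw value out of the text printed by `smartctl -A`. The sample output is invented, and whether this particular drive reports attribute 199 at all is an assumption.

```python
from typing import Optional


def crc_error_count(smartctl_output: str, attr_id: int = 199) -> Optional[int]:
    """Return the raw value of a SMART attribute from `smartctl -A` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID and have at least 10 columns;
        # the raw value begins at the tenth column.
        if len(fields) >= 10 and fields[0] == str(attr_id):
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


# Invented sample in the layout smartctl uses for ATA attribute tables.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1800
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       37
"""

print(crc_error_count(sample))
```

If the count keeps climbing between checks, that would point at the link (cable, port, or connector) rather than the flash itself.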

wizarddata · November 18, 2021, 10:35pm

Dutch_Master,

50k, holy smokes. This drive only has ~1800 hours on it and runs with ~70% available in this application. Not to say it hasn’t been abused in the past. As affordable as SSDs are in this capacity I think your advice to retest with a second drive is a good idea.

SgtAwesomeSauce,
I have been testing with different onboard SATA ports, but it didn’t occur to me to test the cable itself. The cable I’m using is at least as old as the drive.

Swapping ports dropped the MTBF from days to hours, which very well could be explained by a flaky cable.

SgtAwesomesauce · November 18, 2021, 10:48pm

Sounds like you’ve got a suspect. I’d grab a known-good cable and see if that fixes it up.

Everything degrades with age, and cables are often forgotten about.

Dexter_Kane · November 19, 2021, 4:40am

I’ve had similar issues with disks in my NAS which turned out to be a bad PSU.

wizarddata · November 19, 2021, 5:30am

Was it just that the voltage was sagging under load or something more insidious?

Dexter_Kane · November 19, 2021, 7:37am

I’m not sure, I just tried it with a different PSU and all the problems went away.

wizarddata · November 19, 2021, 7:35pm

The known-good cable didn’t solve it. It occurred to me this morning that the drive that ESXi is installed on may be the one having problems. I’m going to take a look at it next time I take the server down.

Also, recovering ESXi dump files isn’t as straightforward as I expected, though I’m sure it’s second nature once you’re used to it.

wizarddata · November 19, 2021, 11:58pm

As I was digging through the BIOS to enable PXE for further smartctl tests, I noticed that xhci-handoffs were disabled in BIOS. I had been intermittently seeing xhci-handoff errors, but the hardware seemed to be working and I didn’t know what it was, so I ignored it. I’m passing a PCIe USB controller to one of the guest VMs, so I’ll see if clearing that up makes the difference.

This was the first time using PXE on a physical computer, which was pretty cool.

Gnuuser · November 20, 2021, 7:30pm

Exactly. While the cables are robust, they are subject to heat stress and vibration.
Loose cable heads can cause flaky operation due to thermal expansion and contraction.
Loose electrical connections have a slightly higher resistance and will in turn generate even more heat.

wizarddata · November 22, 2021, 6:38pm

The xhci errors persisted, so as a diagnostic step I’m running the workload on bare metal to see if the issues persist without the hypervisor. It’s looking like I was simply experiencing some of the USB hub passthrough sketchiness that I’ve read about.

wizarddata · November 24, 2021, 4:15am

After a few days of testing I’m fairly certain that the issue is with ESXi’s compatibility with my USB controller and not either SSD.
