Bizarre PCIe trouble with 4TB Crucial T500 NVMe SSD

So I just got this drive, and it has worked flawlessly in every other system I’ve tried it in, including my ancient Skylake testbench board.

I have an MSI PRO X670-P WIFI which, while otherwise working fine (including with other NVMe drives), is displaying truly bizarre behavior with this particular drive.

And yes, I’ve just swapped out the CPU with a brand new one and the behavior is unchanged, so this isn’t some CPU PCIe circuitry degradation or something.

When I install this drive, in any M.2 slot on the board, it will work flawlessly for exactly one power cycle. I can reboot, I can hit the reset switch, the drive works just fine.

Until I turn off the machine. After that has been done, the drive will never ever detect ever again. Until I physically remove and reinstall the drive, then it works completely perfectly for exactly one power cycle.

What the eff? This screams BIOS issue to me, though MSI just blames the SSD.

I have a working Linux/SystemRescueCD environment on the machine, so if any PCIe geniuses here have any ideas, I’m all ears.

I’ve compared lspci outputs in three states - ‘not installed’, ‘installed and working’, and ‘installed and not working’. The ‘installed and not working’ outputs are absolutely identical to the ‘not installed’ outputs. The relevant host bridge registers are exactly the same between ‘installed and working’ and ‘installed and not working’.

lspci tree outputs
broken-tree.txt (3.4 KB)
working-tree.txt (3.5 KB)

lspci -vvv outputs
broken-details.txt (161.1 KB)
working-details.txt (170.3 KB)
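
In case anyone wants to reproduce the three-state comparison above, here is a minimal sketch in Python. It assumes lspci is on the PATH (e.g. in the SystemRescueCD shell, ideally run as root so lspci -vvv can read the full config space); the lspci_diff.py name, the lspci-captures/ directory, and the state labels are placeholders of my own choosing, not anything from the posts above.

```python
#!/usr/bin/env python3
"""Capture lspci output under a state label and diff two captures.

A rough sketch, not the exact procedure used above: it saves `lspci -t`
and `lspci -vvv` together per state so any difference between
"not-installed", "installed-working" and "installed-not-working"
shows up in a unified diff.
"""
import difflib
import pathlib
import subprocess
import sys

CAPTURE_DIR = pathlib.Path("lspci-captures")  # placeholder location


def capture(state: str) -> pathlib.Path:
    """Save `lspci -t` plus `lspci -vvv` output as <state>.txt."""
    CAPTURE_DIR.mkdir(exist_ok=True)
    tree = subprocess.run(["lspci", "-t"], capture_output=True, text=True, check=True).stdout
    details = subprocess.run(["lspci", "-vvv"], capture_output=True, text=True, check=True).stdout
    path = CAPTURE_DIR / f"{state}.txt"
    path.write_text(tree + "\n" + details)
    return path


def diff(old: pathlib.Path, new: pathlib.Path) -> str:
    """Unified diff of two captures; an empty string means they are identical."""
    return "".join(difflib.unified_diff(
        old.read_text().splitlines(keepends=True),
        new.read_text().splitlines(keepends=True),
        fromfile=old.name, tofile=new.name,
    ))


if __name__ == "__main__":
    # Usage:
    #   python3 lspci_diff.py capture installed-not-working
    #   python3 lspci_diff.py diff lspci-captures/not-installed.txt lspci-captures/installed-not-working.txt
    if sys.argv[1] == "capture":
        print(capture(sys.argv[2]))
    else:
        print(diff(pathlib.Path(sys.argv[2]), pathlib.Path(sys.argv[3])) or "captures are identical")
```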

Does it reset to working again if, rather than removing the drive, you switch the PSU off/pull the plug for a while? I.e. is it the complete loss of power that makes it work again?

Or do you have to boot the computer without the drive installed once before it starts working again after reinstall?

If the latter, it clearly points to a UEFI bug IMO. Otherwise, it’s perhaps more difficult to say where the problem is? Either way, bizarre, as you say!

Nope. From a not-working state, just turning the machine off, pulling the drive and plugging it back in is all that’s needed to return to working order. For one power cycle.

I started following this line of thought by pulling the BIOS button cell and disconnecting the ATX power supply, but this board seems to have some quite considerable capacitance somewhere. I was quite surprised to measure 1.9V across some of the pins on the ATX plug in this state.

I think my KVM switch was supplying some kind of voltage to the board via the HDMI and USB plugs. Once I disconnected those, the phantom voltages went away and the board was completely inert. I then reconnected the PSU and KVM, and the SSD works. (For one power cycle.)

Is this some kind of phantom voltage keeping the SSD at just enough voltage to crash the controller, but not low enough to trigger some kind of reset signal?

I think this line of inquiry is going somewhere. After shutting down the computer, ensuring the coin cell is not present, and disconnecting the 24-pin ATX and 8-pin ATX12V for the CPU, there’s 1.9V consistently present on the 3.3V pins of the 24-pin ATX socket (ergo all the 3.3V logic and devices on the board are getting 1.9V, unless my understanding is off) until I disconnect the HDMI cable from my KVM, at which point it almost instantly goes to zero.

Plugging the HDMI cable back in does not restore the 1.9V; the 3.3V line remains at zero… Is disconnecting the HDMI cable between power cycles all that’s needed here?

Yes. Yes it is. And the fault isn’t limited to my KVM. I took the KVM out of the picture and just connected my HP ZR24w monitor (via DVI->HDMI adapter) directly to the HDMI socket on the motherboard, and the issue persists. The issue does not occur when the monitor is connected via DisplayPort.

I’d be really curious to know if my board is somehow defective, or if MSI made an oopsie on this design. Hilariously, if I hadn’t been simplifying for troubleshooting by relying on the integrated graphics instead of an add-on board, this problem would never have occurred.

I suspect a problem with BIOS. Might sound stupid, but do you have the latest BIOS version?

I’ve tried several recent BIOSes; none of them made any difference. This seems like an electrical design fault on the board; they’ve got some transistor that NPNs when it should have PNP’d when the board is turned off, or something.

For anyone finding this thread in the future, this was cross-posted over on Hacker News and prompted some additional discussion over there - PCIe trouble with 4TB Crucial T500 NVMe SSD for >1 power cycle on MSI PRO X670-P | Hacker News

For anyone curious, this doesn’t seem to affect all MSI boards. I’ve got a new X870-P WIFI, and it does not experience this phantom 3V3 with a connected HDMI display.

I have a similar problem with my Crucial T500 2TB in combination with an ASRock B550M Steel Legend mainboard. When I cold boot the PC after the power has been unplugged, it boots into UEFI and the SSD is not recognized.

I found two solutions:
(1) Pressing the power button repeatedly to turn the PC off and back on will work after ~8-15 attempts.
(2) Removing the power cord after the first failure and replugging it, or using the power toggle switch on the PSU, will make the SSD work almost instantly; sometimes a second attempt is required.

The SSD works normally then. When I just reboot the PC from that point, the SSD is recognized instantly.

The issues started when I replaced my GTX 1070 Ti with an RTX 3070. I am using 2x DP to connect my displays. I have already acquired new DP cables; that didn’t change anything.

I have a similar problem with my Lenovo ThinkStation P520 and a Crucial T500 2TB. The drive is not detected after powering on. In my case the computer can work for weeks of turning on and off without problems as long as it doesn’t go into hibernation (that causes the BIOS to miss the drive right away), but it appears to hit this issue at random every so often. The only way to recover is to pull the PSU from the system; fortunately it’s one of the proprietary ones that slots directly into the motherboard with no cables, so it’s quick.

I suspect that the drive sometimes exceeds some maximum time the BIOS spends trying to detect drives, causing it to be skipped; it comes up quickly enough when starting from a completely unpowered state. Once working, I had no problems with the drive. After a year and a half of attempts to resolve this (latest BIOS, drive firmware, drivers), I decided to purchase a new Samsung drive. Will see and update how that ends.

This resolved my issue. It appears to be a compatibility issue. Strangely, since I mounted the T500 as a second drive in a PCIe-to-M.2 adapter, it works perfectly. Not sure if it would work flawlessly as a boot drive this way.

If your HDMI cable is connecting the shield ground on both sides, this can allow a DC bias between two devices to leak into the system. I’m pretty sure this breaks the spec, but by itself doesn’t prevent the cable from working, just invites these gremlins in. So a lot of cheaper or careless cables are made like that. What kind of cable is it?