NVMe Controller Reset Under Load (Mushkin Vortex)

I picked up a new Mushkin Vortex 2TB NVMe Gen 4 drive (amazon product/B09T5DGV6R) back in Feb, to replace my aging 1TB drive of the previous generation.

A couple of weeks later, I started getting minor errors on the old drive, so I migrated everything to the new drive. Not a problem, and lucky timing. Or so I thought. Shortly after, I started having serious instability issues, which I eventually traced to the CPU (a single-core workload spiking temps to 90°C on a 3800X, voltage staying too high, repasting didn't help) and replaced it. I'd always assumed I'd just lost the silicon lottery with it, but from some investigation elsewhere it seems the TIM was defective and has likely separated from the IHS internally. Supposedly, early 3800X units were prone to that. Regardless, with the new CPU the system is happy as can be: temps and voltages normal, and no random memory corruption.

But that left me with an install disk that had seen several weeks of abuse. Not a huge deal; I just needed to identify which files got damaged by comparing checksums and restore them from backups. So I started a scrub, and the system promptly crashed. Cue several hours of investigation. I finally have some answers, and some questions. I'm hoping someone here can help with the latter.

From the dmesg log, something goes wrong and the controller resets itself. The reset fails, and the kernel removes the device and emergency-remounts the filesystem. The error message implied that ASPM might be involved, but it is not: the drive is under active use, so there's no chance it's trying to sleep when the issue shows up. Regardless, I tried disabling ASPM and NVMe sleep states, with no effect.
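For reference, the usual knobs for ruling this out live on the kernel command line. This is a sketch of the standard settings, not the exact line from my system; adjust for your bootloader:

```shell
# /etc/default/grub - standard options for disabling PCIe/NVMe power
# management while debugging:
#   pcie_aspm=off                          disables ASPM entirely
#   nvme_core.default_ps_max_latency_us=0  keeps the drive out of its
#                                          low-power (APST) states
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off nvme_core.default_ps_max_latency_us=0"
# then regenerate the grub config (e.g. update-grub) and reboot
```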

I also checked the drive temperature. According to the data sheet, it is good for up to 70°C; I'm not sure whether that means 70°C ambient or 70°C internal, but in either case the warmer of the two sensors topped out at 68°C before a crash. Still, thinking it might be temperature related, I gave the drive a dedicated fan, improved its contact with its heat spreader, and ultimately attached a water block to it. This helped, but did not stop the crashing: at 65°C it crashes in seconds, at 55°C it crashes in minutes. I manually paused and resumed the process, keeping the controller temperature below 50°C, and it still crashed. A more natural workload (compiling the latest available kernel) did not crash, despite a fair bit of IO.
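For anyone wanting to watch the sensors themselves, both nvme-cli and smartmontools expose them (the device path here is an example):

```shell
# Composite temperature plus the individual sensors from the SMART page
nvme smart-log /dev/nvme0 | grep -i temp
# smartctl labels each sensor separately ("Temperature Sensor 1", ...)
smartctl -a /dev/nvme0 | grep -i temp
```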

No, the updated kernel did not help (6.2.8 and 6.3.5 both affected).

I also tried limiting the interface to Gen 3, or forcing it to Gen 4. No change.

The nvme error log and smart log show no issues.
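For completeness, these are the nvme-cli checks I mean (device path is an example):

```shell
nvme error-log /dev/nvme0   # controller's persistent error-log entries
nvme smart-log /dev/nvme0   # media errors, critical warnings, wear
```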

So I decided to see whether the issue was an interplay between the FS drivers and the hardware, and used a blkio cgroup to limit the read IOPS to something low. And the problem has gone away. Even if I reduce the cooling and let the disk warm up, it's okay at 2500 IOPS. Setting it to 10k IOPS crashes in about 5 minutes.

Obviously, this is not ideal, as 2500 IOPS works out to somewhere in the 300 MiB/s range on a disk that should handle about 10x that in semi-random reads, but until I have a proper solution I've set a global IOPS limit on the device, and my system appears to be stable.
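In case it helps someone else, the throttle looks roughly like this. This is a sketch using the v1 blkio controller; the group name is made up, and the major:minor pair is read from sysfs rather than hardcoded:

```shell
# Find the device's major:minor number (e.g. 259:0 for an NVMe disk)
DEV=$(cat /sys/block/nvme0n1/dev)

# Create a throttled group and cap reads at 2500 IOPS
mkdir -p /sys/fs/cgroup/blkio/nvme-throttle
echo "$DEV 2500" > /sys/fs/cgroup/blkio/nvme-throttle/blkio.throttle.read_iops_device

# Move the current shell (and its children) into the group
echo $$ > /sys/fs/cgroup/blkio/nvme-throttle/cgroup.procs

# cgroup-v2 equivalent: echo "$DEV riops=2500" > <group>/io.max
```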

It is possible the device simply needs a quirks-table entry in the kernel, as the drive is relatively new and from a relatively minor vendor. So I've ordered another of the same drives. If it has the same behavior out of the box, then it's a Linux / device-firmware issue and I'll pursue getting that sorted (and probably swap to a different DRAM-cached Gen 4 drive, since they've come down in price). If it doesn't, then I'll probably RMA the one from Feb.

The question I have here is whether there is some other kernel parameter that might matter, or some other test or diagnostic I missed, which would be useful either to the RMA people or for writing the quirks entry for the drive. Also, assuming it's not just a bad drive, what are people recommending these days in roughly 2TB, Gen 4, DRAM-cached NVMe drives?

Alright, I have some answers…

First, some of the difficulty in my initial troubleshooting was caused by a faulty ram stick. Gotta love when multiple components semi-break at the same time.
memtester helped me track down which stick was failing; it failed even at its SPD speed. Yanking it left the system relatively stable, but did not fix the NVMe controller reset.
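If it's useful to anyone, the invocation is simple. The size and pass count here are placeholders, not what I ran:

```shell
# Test 4 GiB of RAM for 3 passes; run as root so memtester can
# mlock() the region and keep it from being paged out during the test.
sudo memtester 4G 3
```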

New memory and another Vortex drive arrived today. The new memory is successfully running at its advertised speed. For cost reasons I did give up CL16; the new stuff is CL18. Fortunately, the cache on the X3D should mitigate the impact of that. Once the RMA completes on the old RAM, I may see if it is happy running at CL18 and get extra capacity for my trouble, or I may stick it on the shelf to pass along to someone later.

I considered simply dd'ing or ddrescue'ing the questionable drive over to the new one, but rather than write possibly bad data to the new drive and have the copy fail halfway through, I started with a dd to /dev/null from the old drive (I know, the more I use it, the higher the risk of it failing completely, but I trust my backups). Surprise of all surprises: it cannot successfully read sequentially. The same sequential end-to-end read on the new drive is still in progress, but is well past where the other drive failed, and it is running cooler. So I think we can safely say this is a bad drive rather than a Linux driver quirk issue.
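The read test itself was nothing fancy (device path is an example; make sure it points at the right disk):

```shell
# Sequential end-to-end read of the whole device; a failing region
# shows up as an I/O error in dmesg and a short record count from dd.
dd if=/dev/nvme0n1 of=/dev/null bs=1M status=progress
```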

Looks like I’ll need to secure-erase it and RMA it too.
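For the erase, nvme-cli's format command covers it. This is the generic approach; check what the vendor's RMA process actually asks for:

```shell
# --ses=1 performs a user-data erase; --ses=2 is a cryptographic
# erase, if the drive supports it. Double-check the device path first.
nvme format /dev/nvme0n1 --ses=1
```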

Bottom line: if anyone else is having random drive issues, run memtester before wasting too much time, but keep in mind it might also just be a bad drive.
