I picked up a new Mushkin Vortex 2TB NVMe gen 4 drive (Amazon ASIN B09T5DGV6R) back in Feb, to replace my aging 1TB of the previous generation of the same drive.
A couple weeks later, I started getting minor errors on the old drive, so I migrated everything to the new drive. Not a problem, and lucky timing. Or so I thought. Shortly after, I started having serious instability issues, which I eventually traced to the CPU (single-core workloads spiking temps to 90C on a 3800X, voltage staying too high, repasting didn't help) and replaced it. I'd always assumed I'd just lost the silicon lottery with it, but from some investigation elsewhere, it seems the TIM was defective and had separated from the IHS internally. Supposedly, the early 3800X units were prone to that. Regardless, with the new CPU the system is happy as can be, temps and voltages normal, and no more random memory corruption.
But that left me with an install disk that had seen several weeks of abuse. Not a huge deal; I just needed to identify which files got damaged by comparing checksums and restore them from backups. So I started a scrub, and the system promptly crashed. Cue several hours of investigation. I finally have some answers, and some questions. I'm hoping someone here can help with the latter.
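(The scrub itself was nothing exotic; btrfs shown below for illustration, and a zpool scrub would be the equivalent. I doubt the filesystem matters here.)

```
# foreground scrub with per-device stats
sudo btrfs scrub start -B -d /
# progress/status from another shell
sudo btrfs scrub status /
```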
From the dmesg log, something goes wrong and the controller resets itself. The reset fails, and the kernel removes the device and emergency-remounts the filesystem. The error implied that ASPM might be involved, but it isn't: the drive is under active use, so there's no chance it's trying to sleep when the issue shows up. Regardless, I tried disabling ASPM and NVMe sleep states, with no effect.
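For completeness, these are the standard knobs I mean, added to the kernel command line via the bootloader config. The first disables ASPM outright; the second disables APST so the drive never transitions into low-power states. Neither changed anything.

```
pcie_aspm=off nvme_core.default_ps_max_latency_us=0
```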
I also checked the drive temperature. According to the data sheet it is good for up to 70C (not sure if that's 70C ambient or 70C internal), but either way, the warmer of the two sensors topped out at 68C before a crash. Still, thinking it might be temperature related, I gave the drive a dedicated fan, improved its contact with its heat spreader, and ultimately attached a water block to it. This helped, but did not solve the crashing: at 65C it crashes in seconds, at 55C in minutes. I manually paused and resumed the process to keep the controller temperature below 50C, and it still crashed. A more natural workload (compiling the latest available kernel) did not crash, despite a fair bit of IO.
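(Those readings are from the drive's own sensors, polled with nvme-cli while loading the disk; /dev/nvme0 assumed here:)

```
# poll both on-controller temperature sensors once a second
sudo watch -n 1 "nvme smart-log /dev/nvme0 | grep -i temp"
```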
And no, a newer kernel did not help (6.2.8 and 6.3.5 are both affected).
I also tried limiting the interface to gen 3, or forcing it to gen 4. No change.
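I verified the negotiated link each time with lspci; 01:00.0 below is my drive's PCI address, an assumption on your system:

```
# LnkCap = what the device advertises, LnkSta = what was actually negotiated
sudo lspci -vv -s 01:00.0 | grep -E "LnkCap:|LnkSta:"
```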
The nvme error-log and smart-log show no issues.
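Specifically:

```
sudo nvme error-log /dev/nvme0   # came back clean
sudo nvme smart-log /dev/nvme0   # came back clean
```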
So I decided to see if the issue was some interplay between the FS drivers and the hardware, and used a blkio cgroup to limit read IOPS to something low. And the problem went away. Even if I reduce the cooling and let the disk warm up, it's fine at 2500 IOPS. Setting it to 10k IOPS crashes the system in about 5 minutes.
Obviously, this is not ideal, as that works out to somewhere in the 300 MiB/s range on a disk that should handle about 10x that in semi-random reads, but until I have a proper solution I've set a global IOPS limit on the device, and my system appears to be stable.
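For anyone wanting the mechanics, it's just the stock blkio throttle (cgroup v1 shown, since that's what I used; on cgroup v2 the equivalent is io.max with a riops= cap). The 259:0 below is my device's major:minor, check yours with lsblk:

```
# find the device's major:minor
lsblk -o NAME,MAJ:MIN /dev/nvme0n1        # e.g. 259:0

# cap reads to 2500 IOPS for everything in this cgroup
sudo mkdir -p /sys/fs/cgroup/blkio/nvme-limit
echo "259:0 2500" | sudo tee /sys/fs/cgroup/blkio/nvme-limit/blkio.throttle.read_iops_device

# move the current shell (and its children) into the throttled group
echo $$ | sudo tee /sys/fs/cgroup/blkio/nvme-limit/cgroup.procs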
It is possible the device simply needs a quirks-table entry in the kernel, as the drive is relatively new and from a relatively minor vendor. So I've ordered another of the same drives. If it shows the same behavior out of the box, then it's a Linux / device firmware issue and I'll pursue getting that sorted (and probably swap to a different DRAM-cached gen 4 drive, since they've come down in price). If I can't reproduce the problem on the new drive, then I'll probably RMA the one from Feb.
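If anyone wants to check whether they have the same controller, the PCI vendor:device ID (what a quirks-table entry keys on) shows up in lspci:

```
# the [xxxx:xxxx] pair at the end of the line is the vendor:device ID
lspci -nn | grep -i "non-volatile"
```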
My questions: is there some other kernel parameter that might matter, or some test or diagnostic I've missed, that would be useful either to the RMA folks or for writing the quirks entry for the drive? And, assuming it's not just a bad drive, what are people recommending these days for roughly 2TB, gen 4, DRAM-cached NVMe drives?