NVMe SSD disappears/disconnects from the system

I’ve been struggling with this for a few months and could use some help.

There are two systems with the same issue:

  • Gigabyte MZ32-AR0 (rev. 1.0, with rev. 3.0 BIOS) + Epyc 7302P
  • ASUS Pro WS TRX50-SAGE WIFI + Threadripper 7970X

On both systems the OS NVMe SSD periodically disappears (it’s not visible in lspci or in the BIOS until the machine is turned off completely, unplugged from the wall, and cold started again). When this happens the motherboard’s disk LED typically either blinks in a fixed pattern or stays lit continuously; on the Gigabyte system it doesn’t turn off even after shutdown until I unplug the machine from the wall.
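
For reference, here’s roughly how I check whether the kernel still sees the controller when it drops (just a sketch using the standard lspci/sysfs paths; on my systems only a full power cycle actually brings the drive back):

# list the NVMe controllers still visible on the PCI bus
lspci -nn | grep -i 'non-volatile'

# ask the kernel to rescan the PCI bus (harmless to try, but if the
# controller firmware has locked up this usually finds nothing)
echo 1 | sudo tee /sys/bus/pci/rescan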

Both systems are running Ubuntu 24.04 (one server and one desktop edition).

Looks something like this:

[ 3181.444715] nvme nvme5: I/O tag 262 (2106) opcode 0x2 (I/O Cmd) QID 12 timeout, aborting req_op:READ(0) size:16384
[ 3185.732599] nvme nvme5: I/O tag 15 (f00f) opcode 0x2 (I/O Cmd) QID 11 timeout, aborting req_op:READ(0) size:4096
[ 3185.732631] nvme nvme5: I/O tag 16 (1010) opcode 0x1 (I/O Cmd) QID 11 timeout, aborting req_op:WRITE(1) size:20480
[ 3185.732646] nvme nvme5: I/O tag 17 (a011) opcode 0x1 (I/O Cmd) QID 11 timeout, aborting req_op:WRITE(1) size:4096
[ 3185.732708] nvme nvme5: I/O tag 854 (2356) opcode 0x2 (I/O Cmd) QID 14 timeout, reset controller

I had these issues with Solidigm P44 Pro 2TB and SK Hynix P41 Platinum 2TB drives (one on each system), which I thought were defective or incompatible in some way, so I swapped them for Samsung 990 Pro 4TB drives, and the result is basically the same, maybe slightly less frequent.

I tried swapping PSUs and changing various BIOS options (like disabling AES and downgrading the slot to PCIe 3.0), but nothing seems to help.

I also tried nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off, and that doesn’t help either.
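
In case anyone wants to replicate that, I set those the usual Ubuntu way through GRUB (a sketch, adjust for your own setup):

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"

# regenerate the config, reboot, then confirm the parameters are active
sudo update-grub
cat /proc/cmdline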

The SSDs run at good temperatures (under 60 °C most of the time) and do not report any internal errors.
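
(To be precise, by “no internal errors” I mean the SMART/health and error logs from nvme-cli are clean; checked roughly like this, the device name will differ:)

# temperature, media errors, critical warnings
sudo nvme smart-log /dev/nvme0

# controller error log entries
sudo nvme error-log /dev/nvme0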

I reported the issue to SK Hynix and Solidigm. SK Hynix support told me to contact their Amazon shop (which I can’t do, since I’m a second-hand owner), and Solidigm offered an RMA, but that didn’t work out for me either.
Local Samsung support was unwilling to help, so I contacted US/Global support and have yet to receive a response from them.

It seems that a fuller or more heavily loaded SSD trips more easily.

Given that this is 4 drives across 2 machines, I doubt it’s hardware related. Maybe it’s firmware, or just the Linux kernel, so I posted in these two places (kernel logs are attached there as well):

My computers are barely usable; any kind of help is highly appreciated.

Yeah, with the error appearing on two totally different systems with several different drive models, I’d certainly look at the OS (and the kernel in particular).

Here’s someone with what looks like the same issue. Either downgrading to linux-6.7 or playing with module params, e.g. nvme_core.default_ps_max_latency_us=100 nvme_core.io_timeout=3000, seems to have worked as a workaround for that poster.

I have a 6.11 kernel on one system and the stock Ubuntu 6.8 on the other, so I’m not sure the kernel version alone explains it. I guess I can downgrade the kernel, but I usually prefer to run the latest.

I already tried setting nvme_core.default_ps_max_latency_us to different values, but I hadn’t seen nvme_core.io_timeout mentioned in the various issues online yet. I guess I can try that next, thanks for the hint!
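
For the record, the values can also be confirmed at runtime through sysfs, so it’s at least easy to check that the parameters actually took effect:

cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
cat /sys/module/nvme_core/parameters/io_timeout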

No, nvme_core.io_timeout=3000 changed nothing. The disk disconnected after a scrub had been running for 0:09:41:

[  775.624418] nvme nvme5: I/O tag 8 (1008) opcode 0x2 (Admin Cmd) QID 0 timeout, reset controller
[  857.059312] nvme nvme5: Device not ready; aborting reset, CSTS=0x1

Response from Samsung Memory Services:

Unfortunately, we are unable to support Linux based systems.

Looks like I’ll have to try older kernels, and then try to bisect from there if one of them works.
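
Rough plan for the bisect, assuming one of the older kernels turns out to be good (the thread linked above suggests 6.7, so that’s my starting guess; just a sketch):

git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux
git bisect start
git bisect bad v6.8     # shows the disconnects here
git bisect good v6.7    # reportedly fine for the poster in the linked thread
# build, install and boot each candidate the bisect checks out,
# then mark it with: git bisect good / git bisect bad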

I’ll add to the confusion: I had a similar issue with a Crucial T705 2TB NVMe (PCIe Gen 5).

This was the OS drive and it would just vanish after reboot. It would work long enough for the live installer to install my OS, then, same as you, it would vanish (not visible in lspci or BIOS until turned off and cold-restarted).
The weird thing is that this was intermittent. The drive would sometimes reappear, but there was no pattern I could find to when it would disappear: no intense I/O (like others have reported), no system-specific (or OS-specific) trigger.

I tried downgrading the PCIe speed, kernel options, and different OSes (Fedora 40 and Pop!_OS 22.04). Rebooting the system was a coin toss on whether the OS drive would disappear (actually, I found the probability of it disappearing closer to 80%).
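
(Side note: after forcing a lower PCIe speed in the BIOS you can confirm what the link actually negotiated with lspci; the bus address below is just a placeholder:)

# LnkCap = what the link supports, LnkSta = what it actually negotiated
sudo lspci -vv -s 01:00.0 | grep -iE 'LnkCap|LnkSta'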

I had 2 NVMe drives in this system (both Crucial T705), so I just removed the 2TB one and sent it for replacement, and used the 1TB one for OS, which has been working fine ever since. No disappearing acts ever. I eventually added a second T705 drive (4TB) for storage, also solid performance and 100% reliability even after multiple reboots.

System specs:
ASUS TRX50 + Threadripper 7970X
4 * 64GB RAM

Okay, so we have at least 4 drive models (2 of them very similar), all on the AMD platform, showing this behavior. Very suspicious :thinking:

It does disappear randomly, but in my case a higher sustained load (even if it’s just reading) trips the Samsung 990 Pro in 100% of cases (I’ve already sold the Solidigm and SK Hynix drives).
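
If anyone wants to try reproducing it, any sustained read load over the drive is enough in my case; a generic fio job like this is one way to generate it without running a filesystem scrub (a sketch, read-only, but point it at the right device):

sudo fio --name=seqread --filename=/dev/nvme0n1 --rw=read --bs=1M \
    --direct=1 --ioengine=libaio --iodepth=32 --time_based --runtime=1800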

My drives worked fine for a while; I think filling them up (and using them for databases) helps trigger this issue.

One of the kernel developers replied at 216809 – nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting. It looks like they’re pointing fingers at the drive manufacturers, who point fingers back at the Linux kernel, and nobody is willing to do anything about it at all.

Any pointers to debug this will be highly appreciated :pray:

For what it is worth, lots of people have had issues with Samsung NVMe drives (even Pro ones) in any kind of long uptime server/workstation applications.

Something very much like this happened with one of my Samsung 980 Pro’s in my Epyc server. It’s like the firmware of the thing just hard locked. The drive would not come back unless power cycled.

The thread above is about Windows Server, but I had the same issue in Linux (a Proxmox install). In my case there were two 512GB Samsung 980 Pro drives in a simple ZFS mirror that the system booted from.

All of the heavy I/O work on this server is done on other storage pools, so these two boot drives saw relatively light loads. Even so, with longer uptimes I ran into this issue.

I only ever had it occur with uptimes greater than a month or two. I have since used these drives (after removing them from the server) in a client machine with more typical client loads that involve regular reboots, and I never saw the issue again. But others seem to have hit it with heavy database-type workloads even at shorter uptimes.

The conclusion over at ServeTheHome forums seems to be that Samsung NVMe drives (at least anything after the 970 Pro) just aren’t good for anything but consumer workloads.

Which shouldn’t be surprising. They are consumer drives after all, and server-like applications are something they are not validated for.

But it is odd that you had similar issues with actual enterprise drives.

I really won the lottery, getting a few drives like this in a row. What luck!

I guess I’ll be looking for a replacement then; I don’t have the patience to keep dealing with them.

Thanks for the link!