I’ve been struggling with this for a few months and could use some help.
There are two systems with the same issue:
- Gigabyte MZ32-AR0 (rev. 1.0, with rev. 3.0 BIOS) + EPYC 7302P
- ASUS Pro WS TRX50-SAGE WIFI + Threadripper 7970X
On both systems the OS NVMe SSD periodically disappears (not visible in lspci or the BIOS until the machine is powered off completely, unplugged from the wall, and cold-started). When this happens, the motherboard’s disk-activity LED typically either blinks in a fixed pattern or stays lit continuously; on the Gigabyte system it doesn’t turn off even after shutdown until I unplug the machine from the wall.
Both systems are running Ubuntu 24.04 (one server and one desktop edition).
In the kernel log it looks something like this:
[ 3181.444715] nvme nvme5: I/O tag 262 (2106) opcode 0x2 (I/O Cmd) QID 12 timeout, aborting req_op:READ(0) size:16384
[ 3185.732599] nvme nvme5: I/O tag 15 (f00f) opcode 0x2 (I/O Cmd) QID 11 timeout, aborting req_op:READ(0) size:4096
[ 3185.732631] nvme nvme5: I/O tag 16 (1010) opcode 0x1 (I/O Cmd) QID 11 timeout, aborting req_op:WRITE(1) size:20480
[ 3185.732646] nvme nvme5: I/O tag 17 (a011) opcode 0x1 (I/O Cmd) QID 11 timeout, aborting req_op:WRITE(1) size:4096
[ 3185.732708] nvme nvme5: I/O tag 854 (2356) opcode 0x2 (I/O Cmd) QID 14 timeout, reset controller
I had these issues with a Solidigm P44 Pro 2TB and an SK Hynix P41 Platinum 2TB (one in each system), which I thought were defective or incompatible in some way, so I swapped them for Samsung 990 Pro 4TB drives; the result is basically the same, maybe less frequent.
I tried swapping PSUs and changing various BIOS options (like disabling AES and downgrading the slot’s PCIe version to 3.0), but nothing seems to help.
I also tried booting with nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off, but that doesn’t help either.
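For reference, this is how those parameters were applied (assuming a standard GRUB-based Ubuntu install; the "quiet splash" part is just the stock Ubuntu default):

```shell
# /etc/default/grub — append the NVMe/PCIe power-management workarounds
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"

# Then regenerate grub.cfg, reboot, and confirm the running kernel picked them up:
sudo update-grub
cat /proc/cmdline
```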
The SSDs have good temperatures (under 60 °C most of the time) and do not report any internal errors.
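For anyone who wants to check the same things, these are standard nvme-cli commands (device path is an example; adjust for your system):

```shell
# Install the NVMe management CLI (Ubuntu package: nvme-cli)
sudo apt install nvme-cli

# SMART data: composite temperature, media errors, "num_err_log_entries"
sudo nvme smart-log /dev/nvme0

# Controller error log (zeroed entries mean no errors recorded)
sudo nvme error-log /dev/nvme0
```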
I reported the issue to SK Hynix and Solidigm. SK Hynix support told me to contact their Amazon shop (which I can’t do, being a second-hand buyer); Solidigm offered an RMA, but that didn’t work out for me either.
Local Samsung support was unwilling to help, so I contacted their US/global support and have yet to receive a response.
It seems that the fuller or more heavily loaded an SSD is, the more easily it trips this.
Given that this is 4 drives across 2 machines, I doubt it is drive hardware related. Maybe it’s firmware, or just the Linux kernel, so I posted in these two places (kernel logs are attached there as well):
- 216809 – nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting
- Bug #1910866 “nvme drive fails after some time” : Bugs : linux package : Ubuntu
My computers are barely usable, so any kind of help is highly appreciated.