Hello,
When my ASUSTOR FlashStor 12 is idle, a random drive will drop offline from my RAID-6 array.
I have been working with ASUSTOR on this for the past 6-8 months; they noted that only three people have reported this issue, and I am one of them.
All of the drives are on the latest firmware. I've also logged two cases with WD; they said they can RMA the drives for me, but since it's a different, random drive that drops offline each time, I don't think an RMA will help.
Has anyone else seen issues with the WD Red SN700 NVMe drives? I am beyond frustrated. My next step is to write a canary that automatically shuts the system down and then wakes it with WOL so the array rebuilds without me having to do it manually. However, this is just a workaround.
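In case it helps anyone picture it, here is roughly what I have in mind for that canary. This is just a sketch and completely untested on ADM; the polling interval, paths, and commands are assumptions on my part (I'm assuming the stock shutdown command and /proc/mdstat are available on the Flashstor). The idea is to watch /proc/mdstat for a degraded array and power the box off cleanly:

#!/usr/bin/env python3
# Canary sketch (untested on ADM): poll /proc/mdstat and shut the NAS down
# cleanly when an md array goes degraded, so a second machine can WOL it back
# up and let md re-add the drive and rebuild on boot.
import re
import subprocess
import time

POLL_SECONDS = 60  # placeholder interval

def degraded_arrays():
    # A degraded array's status line looks like "[12/11] [UUUU_UUUUUUU]";
    # a healthy one has matching counts, e.g. "[12/12]".
    bad = []
    current = None
    with open("/proc/mdstat") as f:
        for line in f:
            name = re.match(r"^(md\d+)\s*:", line)
            if name:
                current = name.group(1)
                continue
            counts = re.search(r"\[(\d+)/(\d+)\]", line)
            if current and counts and counts.group(1) != counts.group(2):
                bad.append(current)
                current = None
    return bad

while True:
    degraded = degraded_arrays()
    if degraded:
        print(f"Degraded array(s): {degraded} - shutting down for a power cycle")
        subprocess.run(["shutdown", "-h", "now"], check=False)  # assumes shutdown exists on ADM
        break
    time.sleep(POLL_SECONDS)

The other half would run on a second, always-on machine and send the WOL magic packet to power the NAS back on (the MAC below is obviously a placeholder for the Flashstor's NIC, and WOL would need to be enabled in ADM):

#!/usr/bin/env python3
# Minimal WOL sender: a magic packet is 6 bytes of 0xFF followed by the target
# MAC repeated 16 times, sent as a UDP broadcast (port 9 by convention).
import socket

NAS_MAC = "00:11:22:33:44:55"  # placeholder

payload = bytes.fromhex("FF" * 6 + NAS_MAC.replace(":", "") * 16)
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
    s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    s.sendto(payload, ("255.255.255.255", 9))

The second box would still need some trigger (e.g. noticing the NAS stopped answering pings) before sending the packet, but that's the general shape of it.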
I haven't been able to follow hardware issues as closely as I did years ago, and I was wondering whether anyone on the L1 forums might have more insight into this.
Below is what it looks like in the kernel dmesg when a drive decides to drop off:
[642985.151226] nvme nvme1: I/O 2 QID 0 timeout, reset controller
[643057.092841] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[643067.614213] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[643067.620833] nvme nvme1: Removing after probe failure status: -19
[643078.134211] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[643078.141064] nvme1n1: detected capacity change from 7814037168 to 0
[643078.147752] asustor remove disk dev nvme1n1
[643083.967301] md/raid1:md0: Disk failure on nvme1n1p2, disabling device.
[643083.967301] md/raid1:md0: Operation continuing on 11 devices.
[643083.990205] md: remove_and_add_spares bdev /dev/nvme1n1p2 not exist
[643083.990213] md: remove_and_add_spares rdev nvme1n1p2 remove sucess
[655967.698332] md/raid:md1: Disk failure on nvme1n1p4, disabling device.
[655967.705136] md/raid:md1: Operation continuing on 11 devices.
[655967.717105] md: remove_and_add_spares bdev /dev/nvme1n1p4 not exist
[655967.717114] md: remove_and_add_spares rdev nvme1n1p4 remove sucess
[676077.270628] rejecting I/O to as-locked device
[676077.270653] rejecting I/O to as-locked device
[676077.270656] Buffer I/O error on dev mmcblk0p1, logical block 496, async page read
[676077.278888] rejecting I/O to as-locked device
[676077.278903] rejecting I/O to as-locked device
[676077.278904] Buffer I/O error on dev mmcblk0p2, logical block 62448, async page read
[676077.287339] rejecting I/O to as-locked device
[676077.287352] rejecting I/O to as-locked device
[676077.287354] Buffer I/O error on dev mmcblk0p3, logical block 62448, async page read
There is a similar Amazon review describing issues with these drives, quoted below. Interestingly, that customer runs into the problem under load, whereas for me the drives drop offline when completely idle.
The review below is from Danny V on Amazon; I take no credit for it. I'm including it to provide further context, as there may be a larger issue with these drives.
Capacity: 4TB
I had 16 of these installed in my storage cluster (Dell servers) and started to get weird critical failures every night only on nodes with WD Red 4TB SN700 NVMe installed. Further investigation revealed that these drives have catastrophic failures under load which leads to a complete disconnect from the PCIe bus. Even a reboot does not help and you need to power-cycle the entire machine.
Needless to mention that all failures went away when these drives were replaced with an NVMe from a different vendor…
Honestly, I expected more from a product specifically targeting storage systems… A huge disappointment.