ASUSTOR Flashstor 12 - WD Red 4TB SN700 NVMe drives going offline

Hello,

When my ASUSTOR Flashstor 12 is idle, a random drive will drop offline from my RAID-6 array.

I have been working with ASUSTOR for the past 6-8 months; they noted that only 3 people have reported this issue, and I am one of them.

The drives are all on the latest firmware. I have also logged 2 cases with WD; they noted they can RMA the drives with me, but since it is a random drive that drops offline each time, I don't think an RMA will help.

Has anyone else seen issues with the WD Red SN700 NVMe drives? I am beyond frustrated, and my next step is to write a canary that automatically shuts the system down and uses WOL to bring it back up so the array rebuilds without me doing it manually. However, this is just a workaround.
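Roughly what I have in mind for that canary, just as a sketch (it assumes the box exposes a normal /proc/mdstat, that a degraded array showing up there is signal enough, and that it runs as root; the poll interval and the poweroff call are placeholders I would tune for my setup):

# md_canary.py - power the box off cleanly when an md array goes degraded (sketch)
import re
import subprocess
import time

MDSTAT = "/proc/mdstat"
CHECK_INTERVAL_S = 60  # placeholder poll interval

def array_degraded() -> bool:
    # mdstat status lines look like "[12/11] [UUUUU_UUUUUU]";
    # an underscore in the bracket pair means a member has dropped out
    with open(MDSTAT) as f:
        return bool(re.search(r"\[U*_[U_]*\]", f.read()))

def main() -> None:
    while True:
        if array_degraded():
            # clean shutdown; a machine elsewhere on the LAN sends WOL later so the array can rebuild
            subprocess.run(["poweroff"], check=False)
            return
        time.sleep(CHECK_INTERVAL_S)

if __name__ == "__main__":
    main()

The WOL side would be a second script on another machine on the LAN that notices the NAS going quiet and sends the magic packet a few minutes later.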

I haven't been able to follow hardware issues as closely as I did years ago, and I was wondering whether anyone on the L1 forums might have more insight into this.

Below is what it looks like (kernel dmesg) when the drive decides to drop off:


[642985.151226] nvme nvme1: I/O 2 QID 0 timeout, reset controller
[643057.092841] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[643067.614213] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[643067.620833] nvme nvme1: Removing after probe failure status: -19
[643078.134211] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[643078.141064] nvme1n1: detected capacity change from 7814037168 to 0
[643078.147752] asustor remove disk dev nvme1n1 
[643083.967301] md/raid1:md0: Disk failure on nvme1n1p2, disabling device.
[643083.967301] md/raid1:md0: Operation continuing on 11 devices.
[643083.990205] md: remove_and_add_spares bdev /dev/nvme1n1p2 not exist
[643083.990213] md: remove_and_add_spares rdev nvme1n1p2 remove sucess
[655967.698332] md/raid:md1: Disk failure on nvme1n1p4, disabling device.
[655967.705136] md/raid:md1: Operation continuing on 11 devices.
[655967.717105] md: remove_and_add_spares bdev /dev/nvme1n1p4 not exist
[655967.717114] md: remove_and_add_spares rdev nvme1n1p4 remove sucess
[676077.270628] rejecting I/O to as-locked device
[676077.270653] rejecting I/O to as-locked device
[676077.270656] Buffer I/O error on dev mmcblk0p1, logical block 496, async page read
[676077.278888] rejecting I/O to as-locked device
[676077.278903] rejecting I/O to as-locked device
[676077.278904] Buffer I/O error on dev mmcblk0p2, logical block 62448, async page read
[676077.287339] rejecting I/O to as-locked device
[676077.287352] rejecting I/O to as-locked device
[676077.287354] Buffer I/O error on dev mmcblk0p3, logical block 62448, async page read

There is a similar review on Amazon about these drives having issues, quoted below. Interestingly, that customer is having issues when the drives are under load, whereas for me they drop offline when completely idle.

The review below is from Danny V on Amazon; I take no credit for it. The purpose is to provide further context, as there may be a larger issue with these drives…?

Capacity: 4TB

I had 16 of these installed in my storage cluster (Dell servers) and started to get weird critical failures every night, only on nodes with the WD Red 4TB SN700 NVMe installed. Further investigation revealed that these drives have catastrophic failures under load, which leads to a complete disconnect from the PCIe bus. Even a reboot does not help and you need to power-cycle the entire machine.

Needless to say, all the failures went away when these drives were replaced with NVMe drives from a different vendor…

Honestly, I expected more from a product specifically targeting storage systems… A huge disappointment.

Hi.
I found your topic while searching for similar issues with the WD Red SN700.
My setup:

Motherboard: Gigabyte B550M DS3H
Drive: 1TB WD RED SN700
Operating system: Rocky Linux 9.2

Around every 4 weeks or so, the OS goes offline. If I restart it from the button, it goes into the BIOS and shows no M.2 drive; the drive only shows up again after a full shutdown.

For the last ~10 months the system has just been idling with minimal power consumption.

I have 4 similar setups with the same motherboard, OS, and PSU, just different drives, and only this one has problems.

I replaced the PSU, but that was not it. Out of my 4 similar computers, this one originally had a cheaper Inter-Tech PSU; I then replaced it with a Be Quiet! Gold series unit. Now all 4 have the “gold” PSUs, but the WD Red SN700 still goes offline.

I am glad I found this topic and others. I wanted to build an 8TB Raspberry Pi NAS with 2x 4TB of these WD Red SN700 drives, and I already have a Samsung 990 PRO 4TB, so I could not decide between the WD Red and the Samsung. With this problem in mind, even though the WD Red is optimised for NAS use, I might go with the Samsung.
Both drives would run slow-ish on a Raspberry Pi board; I wanted one of these SSDs for reliability.

Less important details:

  • drive is around half full if it matters
  • SSD wear is low, at 4%
  • power on hours: 1 year and 3 months

Around every 4 weeks or so, the OS goes offline. If I restart it from the button, it goes into the BIOS and shows no M.2 drive; the drive only shows up again after a full shutdown.

Yes, I have been experiencing the same thing. After troubleshooting with ASUSTOR for several months, we are trying an upgraded (90W -> 120W) power supply. I replaced it a week or so ago and am waiting to see if the issue presents itself again.

I also went the WD route, and they claim there is nothing wrong with the NVMe SSDs. They do offer a replacement under RMA, but that would not help much: in my case a different drive typically goes offline each time, which points to a power, firmware, or other unknown issue rather than to any one specific drive.

I agree the write endurance on these drives is quite good; that is why I went with them. I am really hoping to see stability over the next few months after the PSU upgrade; if not, I am not sure where to go from here. At one point I was considering 2 x Sonnet 8-port NVMe PCIe cards and going that route, but if the drives themselves are the issue, that would not help either.

It must be related to idle, since in both our cases it happens at idle. I think the drive maybe gets to a point where it doesn't get enough power and goes offline, or it cannot properly go in and out of the various sleep / power-save states. My computer was not set to go to sleep, but maybe the SSD enters some power-save state when it is idle and then cannot resume.

I vaguely remember seeing something about a feature on some PSUs (or a BIOS setting) that prevents the computer from going into a power-save state that is too low. It said something like some computers have trouble resuming from a low-power state.

I will try to compare the BIOS power settings on all 4 identical computers (only the NVMe drive is different); maybe this BIOS has some different power-saving options enabled.

Or… maybe the previously used PSU damaged the drive and it now doesn't work even with the higher-end PSU. That is less likely, though; the previous PSU worked well in other computers and caused no problems.

Yes, ASUSTOR just wrote back to me: they tried sending a 120W PSU to another customer and they had the same issue. I am waiting to hear back on whether that customer had the same drives (I assume that is the case). This is very frustrating; WD says there are no issues, and ASUSTOR has not been able to reproduce the problem (it happens around once a month).

ASUSTOR just got back to me; they noted that only the SN700 drives and LEVEN JPS850 drives are affected by this issue. There is no other explanation for all of these failures. The team at ASUSTOR has been unable to reproduce the issue, even under full load.

Did you check with them whether the drive batch is the same as yours? Maybe WD made some changes to the supply chain, and some NAND or controller revision plays better with their PCIe switch.

Not sure about the batch, but from what I recall they matched the same drive model and firmware… This is really frustrating if these drives are the root cause of the issue and not the PCIe switch.
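If anyone else wants to compare notes, the model, serial, and firmware revision are exposed in sysfs, so a quick sketch like this (assuming a stock Linux kernel with the usual /sys/class/nvme layout) prints them for every controller:

# nvme_fw.py - list model, serial and firmware revision for each NVMe controller (sketch)
from pathlib import Path

for ctrl in sorted(Path("/sys/class/nvme").glob("nvme*")):
    # model, serial and firmware_rev are standard sysfs attributes of an NVMe controller
    info = {attr: (ctrl / attr).read_text().strip() for attr in ("model", "serial", "firmware_rev")}
    print(f"{ctrl.name}: {info['model']}  fw {info['firmware_rev']}  s/n {info['serial']}")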

Since you’re so deep in this issue I’d say it’s worth trying to figure that out.

I saw another post in the thread mentioning someone reporting the drives dropping out in a Dell cluster, so it might be drive-related and not something to do with the Asustor device.

I've followed up with ASUSTOR to ask them about this, thanks.

These things are the bane of my existence. Around 12 o'clock these things crash and indeed take down my whole PC. I have tried 3 different PCs, all Linux with a Ceph workload. I can't correlate it to a single cron job; it has to be a combination of multiple things crashing this. It is consistent to within 1 or 2 minutes every night.

This sounds like a PCIe power-management issue. Have you checked PCIe ASPM (Active State Power Management) and tried forcing the drives into an always-active state so they never sleep? Doing so would confirm whether it's a power-state issue; then it comes down to a drive or OS issue.
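For reference, a quick way to see what is currently in effect on the Linux side (just a sketch reading sysfs; the paths assume a mainline-ish kernel with the nvme and pcie_aspm modules present):

# pm_check.py - show the PCIe/NVMe power-management knobs most often behind drop-offs (sketch)
from pathlib import Path

PARAMS = {
    "PCIe ASPM policy": "/sys/module/pcie_aspm/parameters/policy",
    "NVMe APST max latency (us)": "/sys/module/nvme_core/parameters/default_ps_max_latency_us",
}

for label, path in PARAMS.items():
    try:
        print(f"{label}: {Path(path).read_text().strip()}")
    except OSError as err:
        print(f"{label}: unavailable ({err})")

If APST turns out to be involved, booting with nvme_core.default_ps_max_latency_us=0 (and pcie_aspm=off to rule ASPM out as well) is the usual way to keep the drives out of the deeper power states.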

I feel like I've really explored everything. Different M.2 slots: one directly to the CPU, the other via the southbridge (if I understand the schematics correctly).

Yes, I disabled APST and ASPM, both in the BIOS and at the Linux level, to no avail. When APST is active I do see a final console message along the lines of (paraphrased) “could not go from d3cold to d0”. However, I think that is just a symptom; when APST is disabled you simply don't see that message, and it still takes down the whole system.
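Since that message points at D3cold specifically, one more thing I might still try is keeping the kernel from ever putting the drive into D3cold via its d3cold_allowed attribute. A sketch of that is below (run as root; 0000:01:00.0 is just a placeholder for the drive's real PCI address from lspci, and the setting does not persist across reboots):

# d3cold_off.py - disallow D3cold for one PCIe device (sketch, run as root)
import sys
from pathlib import Path

# placeholder address; substitute the NVMe drive's real address from lspci
bdf = sys.argv[1] if len(sys.argv) > 1 else "0000:01:00.0"

attr = Path(f"/sys/bus/pci/devices/{bdf}/d3cold_allowed")
print("before:", attr.read_text().strip())
attr.write_text("0")  # 0 = never enter D3cold
print("after: ", attr.read_text().strip())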

It's a Linux kernel bug. My KLEVV C910 4TB NVMe has the same issue: it works 100% of the time in Windows, but under Linux it only works after booting into Win11 first, and even then only about 50% of the time.

Sometimes it will just work fine for a while. It's an intermittent issue.

I'm considering opening a proper bug report against the Linux kernel about this sort of problem.

PS: Yes, I've tried EVERYTHING as well, short of going in and tweaking kernel code myself.

NOTE: My drive isn't so much dropping off mid-use; rather, on boot it times out waiting to become ready. Very similar error, though.

My bet is your drive doesn't drop off under Windows.

Yeah, probably. It also doesn't happen if I disable the Ceph OSD service on the disk, so it's the combination of the Ceph workload and (probably) some cron job. But it's incredibly difficult to reproduce, because if I trick the system clock, Ceph immediately notices and stops being functionally active; Ceph only works if the system clocks of all machines in the cluster are synced to within <50 ms (or something small, at least).

Currently my drive is working fine, at the moment at least.

I noticed that when I went into Windows and removed all partitions from the drive, it stopped being detected under Linux entirely. So I went back into Windows, made an NTFS partition, and it showed up under Linux again.

So from there I decided to delete that partition under Linux and put a Btrfs partition on instead, and so far Linux has detected the drive each time. It's a very perplexing and weird issue, and no doubt this won't be the end of it.

I was very close to doing an RMA; fortunately I probably won't need to, and I'm currently looking into the issue via a kernel bug report, so I'll get to the bottom of it one day. If I had RMA'd the drive it would have cost me a bit, because the currency and prices keep getting worse with each passing month in Australia. Things are getting real bad here.