RAID 5 'at risk'

Dear All, I’m hoping someone can help me.

I have a file server in our small office which I set up with Windows Server 2019 Essentials. I repurposed a small, old Intel 3rd-gen machine by putting 5 SSDs in it. One SSD is for boot; the other 4 were set up as a software RAID 5 volume formatted with ReFS, using Disk Management in Windows. It holds database data for clients’ financial files, which computers on the network can access.

I have a UPS connected to it, which shows as healthy in its software, is only a year and a half old, and carries a load of only around 10%. However, we had a power outage in the area and generally unstable electricity for a few days, and the UPS didn’t keep the computer on long enough for a graceful shutdown; only a few seconds and then off. This unfortunately happened a few times.

When I checked the disks in Disk Management, it first showed them resynching as expected. However, after some time it stopped with errors and the array was marked ‘at risk’. One of the disks has an exclamation mark next to it, and when I tried to reactivate it to resynch, after some time I got the same error. A restart did not help.

I’ve checked Event Viewer and, under Windows Logs → System, there are multiple error entries with the description: The device, \Device\Harddisk3\DR3, has a bad block. Also, on the regular backup of data from the array to the NAS, Windows backup gives an error for one file with the description: Error in backup of Z:\Data\Sage Data\Payroll\Company_049\ARCHIVE\W19072020_16-13-36.001 during read: Error [0x80070017] Data error (cyclic redundancy check).
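To see whether that Payroll file is the only one affected or just the first one the backup hit, I was thinking of running something along these lines — a rough Python sketch, assuming the array stays mounted as Z:\ — that just tries to read every file and lists any that error out (roughly the same read load as a full backup pass):

```python
# Minimal sketch (assumes the array is still mounted as Z:\ and you have read
# access): walk the volume and try to read every file, logging any that fail
# with an I/O error such as the CRC error the backup reported. This only tells
# you which files are currently unreadable; it does not repair anything.
import os

ROOT = r"Z:\Data"          # adjust to the path you want to sweep
CHUNK = 1024 * 1024        # read in 1 MiB chunks

bad_files = []
for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            with open(path, "rb") as f:
                while f.read(CHUNK):    # force a full read of the file
                    pass
        except OSError as exc:          # CRC errors surface as OSError / WinError 23
            bad_files.append((path, exc))
            print(f"UNREADABLE: {path} -> {exc}")

print(f"Swept {ROOT}: {len(bad_files)} unreadable file(s)")
```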

My question is: if I bring this disk 3 ‘offline’ through Disk Management, wipe it, run chkdsk on it to ‘repair’ the bad blocks, and then reassociate it with the RAID 5 array, can this work? Is this a viable solution?
In the meantime, I have ordered a couple more SSDs, arriving tomorrow, so I can add one to the system if the disk needs replacing.

Should I be doing anything else? My data does have a backup, but I’d rather not have to wipe everything and restore if I can possibly help it.

Many thanks for the time of anybody who reads and replies.

It certainly sounds like the disk is bad. Sadly, I don’t have any ReFS experience to help you with anything particular to ReFS, so I will be speaking in general terms only.

Is the ReFS pool telling you it has dropped the drive, or is it still connected but just reporting a bad CRC check?

If it is RAID 5, you should be able to survive a single drive failure. However, if it is telling you it is at risk, you may already have a failed drive, which means the rest of your data no longer has any redundancy in the pool. If anything else goes wrong, there is nothing to protect you, and if there are any bad sectors on the rest of the drives, you will not catch any bit rot that may have occurred when you rebuild onto a new drive.

This is why most people have recommended against RAID 5 since ~2009 or so. It’s better to step up to RAID 6, so you still have some redundancy during the rebuild if a drive fails.
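To put rough numbers on that: the drive size and URE rate below are illustrative assumptions, not your actual specs, but the back-of-envelope looks like this:

```python
# Back-of-envelope sketch: chance of hitting at least one unrecoverable read
# error (URE) while rebuilding, assuming independent bit errors. Drive size and
# URE rate below are illustrative assumptions, not the OP's actual drive specs.
DRIVE_BYTES = 1e12          # assume 1 TB per drive
URE_PER_BIT = 1e-15         # assume a 1-in-10^15 bits URE spec

def rebuild_risk(surviving_drives: int) -> float:
    """Probability of >=1 URE while reading every surviving drive in full."""
    bits_read = surviving_drives * DRIVE_BYTES * 8
    p_clean = (1 - URE_PER_BIT) ** bits_read
    return 1 - p_clean

# RAID 5 with 4 drives and one already failed: 3 drives must read back cleanly,
# and any URE during the rebuild is unrecoverable because the parity is already gone.
print(f"RAID 5 rebuild, 3 surviving drives: ~{rebuild_risk(3):.1%} chance of a URE")

# RAID 6 in the same situation still has one parity left, so a single URE during
# the rebuild can be repaired instead of costing you data.
```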

Here are my thoughts (keeping in mind that I have zero ReFS experience; I usually use ZFS).

If the drive is still actively part of the storage pool, you might still be protected from bad sectors. As soon as you remove a drive, the rest of the drives will still retain data, but you lose the parity/checksum data, so there is no redundancy left.

If this data is important (and financial files/databases do sound like they are), I’d try to make a new backup of the data before you take any drives offline, just to make sure you have a current one. That way you will likely still have checksum data to catch any flipped bits. Back it up to something with redundancy, even if it is just a two-way mirror using hard drives for now.
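If you want extra assurance that the fresh copy actually matches the source, a small sketch along these lines (the paths are placeholders, not your real ones) hashes both sides and flags any mismatch or unreadable file:

```python
# Minimal sketch (paths are placeholders): hash every file under the source and
# the fresh backup and report any mismatches, so a silently corrupted copy
# doesn't get mistaken for a good backup.
import hashlib
import os

SOURCE = r"Z:\Data"                  # live array (placeholder)
BACKUP = r"\\NAS\Backup\Data"        # freshly written backup (placeholder)

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

mismatches = 0
for dirpath, _dirs, files in os.walk(SOURCE):
    for name in files:
        src = os.path.join(dirpath, name)
        dst = os.path.join(BACKUP, os.path.relpath(src, SOURCE))
        try:
            if sha256_of(src) != sha256_of(dst):
                mismatches += 1
                print(f"MISMATCH: {src}")
        except OSError as exc:           # unreadable on either side
            mismatches += 1
            print(f"ERROR: {src} -> {exc}")

print(f"Done, {mismatches} problem file(s)")
```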

Then start troubleshooting the drive.

It’s always better to have a backup, and hope you don’t need it.

Otherwise, what you are proposing seems reasonable: detach the drive from the pool, check it, and if it tests good, reconnect it. But I am guessing it is on its way out. I’m actually a little surprised, as SSDs in my experience usually fail hard when they fail; they don’t tend to just develop random bad blocks. That’s more hard-drive-like behavior.
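One way to sanity-check the drive once it is out of the array: if you install smartmontools (not built into Windows, and the device name below is an assumption — `smartctl --scan` lists what is actually there), a little Python wrapper can pull the SMART health verdict, the error log, and the full attribute dump in one go:

```python
# Sketch using smartmontools (must be installed separately); the device name
# below is an assumption -- run "smartctl --scan" first to see how the suspect
# SSD is actually enumerated on your system.
import subprocess

DEVICE = "/dev/sdd"   # placeholder for the suspect SSD

for args in (
    ["smartctl", "-H", DEVICE],            # overall SMART health verdict
    ["smartctl", "-l", "error", DEVICE],   # logged drive errors, if any
    ["smartctl", "-a", DEVICE],            # full attribute dump (look for
                                           # reallocated sectors / media errors)
):
    print("$", " ".join(args))
    result = subprocess.run(args, capture_output=True, text=True)
    print(result.stdout)
```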

All of this said, if you go this route, then when you rebuild the pool by adding the drive back in you are still going to run the risk of flipped bits becoming permanent, as there are no checksums left to repair from.

Your best option may be to just replace the drive directly with one of the new drives you bought, without first removing and troubleshooting it (I don’t know how ReFS handles this, but on ZFS this is trivial, and your redundancy stays mostly intact during the rebuild process).

Then you can troubleshoot the removed drive and, if it is good, either put it back in again or just keep it as a spare.

Have these drives seen a lot of writes? Is it possible you have used up your NAND write cycles?
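If you want to put a rough number on it, something like this works; every figure below is a placeholder, so swap in the rated TBW from the drives' spec sheet and whatever total-writes figure SMART reports:

```python
# Rough endurance check; every number here is a placeholder assumption -- use
# your drives' rated TBW from the spec sheet and the host-writes figure SMART
# reports (e.g. "Total_LBAs_Written" or "Data Units Written").
RATED_TBW = 300          # assumed endurance rating, in TB written
WRITTEN_TB = 180         # assumed total host writes so far, in TB
YEARS_IN_SERVICE = 2.0   # assumed age of the drives

used = WRITTEN_TB / RATED_TBW
yearly = WRITTEN_TB / YEARS_IN_SERVICE
remaining_years = (RATED_TBW - WRITTEN_TB) / yearly

print(f"~{used:.0%} of rated write endurance used")
print(f"~{yearly:.0f} TB/year write rate; ~{remaining_years:.1f} years of rated endurance left at that rate")
```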

If this were my data, I’d probably rebuild this storage pool as RAID 6 going forward, so you still have redundancy even with a failed drive and are protected during the rebuild process. But that is just me; I’m a little paranoid like that. Yes, I do have backups as well, but they are a pain to restore from and can require downtime.