I have a six-disk software RAID 10 array: six SATA drives, with partition 2 on each drive as the RAID member (i.e. sda2, sdb2, sdc2, etc.).
LVM is on top of the RAID 10 array. The host OS is not on this array and it is not used for booting.
We had some power issues last weekend (brownouts followed by an outage). The server is on a UPS, but I’m not sure if it shut down cleanly.
There is now an issue where a certain amount of I/O on the array causes MD to kick out one or two disks as failed.
I can re-add them without problems, though. I cross-swapped disks and at first thought the hot-swap slot was bad, because the disk in the same slot was the one to fail both times, even after swapping. But I then moved all the drives to a different backplane on a different, known-working SATA card, and MD still kicks one or two disks under I/O.
I have tried swapping SFF cables. All drives appear at boot, and when they get kicked, they still show in lsblk.
I ran smartctl -t long on all six physical disks with the RAID array stopped. The test reported no errors on any of them (some disks do have errors in their logs, but those were not from this test).
/proc/mdstat:
md124 : active (auto-read-only) raid10 sdf2[0] sdg2[5] sdh2[4] sdi2[3] sdk2[2] sdj2[1]
41010455040 blocks super 1.2 512K chunks 2 near-copies [6/6] [UUUUUU]
bitmap: 0/306 pages [0KB], 65536KB chunk
I can reproduce the array failure reliably by running:
dd if=/dev/md124 of=/dev/null
It always fails after 44GB:
dd: error reading '/dev/md124': Input/output error
85953664+0 records in
85953664+0 records out
44008275968 bytes (44 GB, 41 GiB) copied, 91.4055 s, 481 MB/s
In dmesg I get:
Buffer I/O error on dev md124, logical block 10744208, async page read
I always get that same block number, always at 44GB.
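As a sanity check that the dmesg block number and the dd stopping point refer to the same place, here is a small Python sketch. It assumes the "logical block" in the kernel message is counted in 4096-byte units (my assumption, based on the page size) and that dd used its default 512-byte record size. The function name is just for illustration.

```python
# Assumption: the kernel's "logical block" here is in 4096-byte (page-sized) units.
BLOCK_SIZE = 4096
# dd was run without bs=, so each record is the default 512 bytes.
DD_RECORD_SIZE = 512


def block_to_byte_offset(block, block_size=BLOCK_SIZE):
    """Return the byte offset at which a logical block starts."""
    return block * block_size


dd_bytes = 85953664 * DD_RECORD_SIZE        # records out, as reported by dd
block_offset = block_to_byte_offset(10744208)  # block number from dmesg

print(dd_bytes)                  # 44008275968
print(block_offset)              # 44008275968
print(dd_bytes == block_offset)  # True
```

Under those assumptions the two numbers agree exactly, so dd stops on the first byte of the block that dmesg complains about.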
On a single disk, this would indicate a bad block, wouldn’t it? But on an MD device it seems like something that needs a deep scan-and-fix routine, like fsck, only at the RAID level rather than the filesystem level. The rebuild that runs after re-adding a drive with --re-add doesn’t seem to resolve the issue.
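To narrow down which members might hold the failing area, here is a rough Python sketch of how I understand the "near" layout to place chunks. It takes the byte offset where dd stopped (44008275968) and the geometry from /proc/mdstat (512K chunks, 6 devices, 2 near-copies). The mapping is my reading of the near layout, not something I have verified against this array, and it only holds when the number of devices is a multiple of the number of copies. The slot numbers are positions in the array; they would still need to be matched to actual devices with mdadm --examine, and the per-member data offset is not accounted for.

```python
# Geometry taken from /proc/mdstat for md124.
CHUNK_BYTES = 512 * 1024   # "512K chunks"
RAID_DISKS = 6             # [6/6]
NEAR_COPIES = 2            # "2 near-copies"


def locate_near(offset, chunk_bytes=CHUNK_BYTES,
                raid_disks=RAID_DISKS, copies=NEAR_COPIES):
    """Map an array byte offset to (chunk, offset_in_chunk, stripe, slots).

    Sketch of the 'near' layout: each chunk is written `copies` times
    on consecutive device slots. Assumes raid_disks % copies == 0, so
    the copies of one chunk never wrap onto the next stripe.
    """
    assert raid_disks % copies == 0, "sketch only covers the evenly divisible case"
    chunk = offset // chunk_bytes
    offset_in_chunk = offset % chunk_bytes
    first_position = chunk * copies            # position counting every copy
    stripe = first_position // raid_disks      # row of chunks on each member
    slots = [(first_position + c) % raid_disks for c in range(copies)]
    return chunk, offset_in_chunk, stripe, slots


print(locate_near(44008275968))  # (83939, 65536, 27979, [4, 5])
```

If that reading is right, both copies of the unreadable block sit in stripe 27979 on the members in slots 4 and 5, which would at least be consistent with one or two disks being kicked at the same point.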
Is there a process for fixing bad logical blocks or correcting this issue? The host is on Debian 10.