Is there something like fsck for MD RAID arrays?

I have a six-disk software RAID 10 array: six SATA drives, with partition 2 on each drive being the RAID member, i.e. sda2, sdb2, sdc2, etc.

LVM is on top of the RAID 10 array. The host OS is not on this array and it is not used for booting.

We had some power issues last weekend (brownouts followed by an outage). The server is on a UPS, but I’m not sure if it shut down cleanly.

There is now an issue where a certain amount of I/O on the array causes MD to kick out one or two disks as failed.

I can re-add them OK though. I cross-swapped disks, and at first I thought a bad hot-swap slot was to blame, because the disk in the same slot was the one that failed both times, even after swapping. But I moved them all to a different backplane on a different, known-working SATA card, and MD still kicks one or two disks under I/O.

I have tried swapping SFF cables. All drives appear at boot, and when they get kicked, they still show in lsblk.

I ran smartctl -t long on all six physical disks with the RAID array stopped. The test reported no errors on any of them (some disks have older errors logged, but none came from this test).
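Roughly, for each of the six disks in turn (device names vary):

# start the long offline self-test
smartctl -t long /dev/sdf
# after it finishes, review the self-test log and the logged attributes/errors
smartctl -l selftest /dev/sdf
smartctl -a /dev/sdf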

/proc/mdstat:

md124 : active (auto-read-only) raid10 sdf2[0] sdg2[5] sdh2[4] sdi2[3] sdk2[2] sdj2[1]
      41010455040 blocks super 1.2 512K chunks 2 near-copies [6/6] [UUUUUU]
      bitmap: 0/306 pages [0KB], 65536KB chunk

I can reproduce the array failure reliably by running

dd if=/dev/md124 of=/dev/null

It always fails after 44GB:

dd: error reading '/dev/md124': Input/output error
85953664+0 records in
85953664+0 records out
44008275968 bytes (44 GB, 41 GiB) copied, 91.4055 s, 481 MB/s

In dmesg I get:

Buffer I/O error on dev md124, logical block 10744208, async page read

I always get that same block number, always at 44 GB (10744208 blocks × 4096 bytes per block = 44008275968 bytes, exactly the byte count dd reports), so the failure is at one fixed offset on the array.

On a single disk, this would indicate a bad block, wouldn't it? But on an MD device it seems like something that needs a deep scan-and-fix routine like fsck, only at the RAID level rather than the filesystem level. The rebuild that runs after --re-adding a drive doesn't seem to resolve the issue.

Is there a process for fixing bad logical blocks or correcting this issue? The host is on Debian 10.

Just a thought that crossed my mind:
HDDs have an onboard cache (usually 64 MB or so).
On SATA drives you can disable the read and write cache with hdparm, if I remember correctly.
Maybe rebuild the array without any disk cache?
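A rough sketch of what that could look like (member device names taken from the mdstat output above; check the current settings first):

# show the current on-drive write-cache setting
hdparm -W /dev/sdf
# disable the on-drive write cache, and optionally the read look-ahead, on each member
hdparm -W0 /dev/sdf
hdparm -A0 /dev/sdf
# re-enable later with -W1 / -A1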

Update - also do the dd read on each member partition (/dev/sdX2) of this array - it might shed some light on which disk(s) cause the read error.
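For example, something like this (device names taken from the mdstat output above):

# read every member in turn and note which one throws an I/O error
for dev in /dev/sdf2 /dev/sdg2 /dev/sdh2 /dev/sdi2 /dev/sdj2 /dev/sdk2; do
    echo "=== $dev ==="
    dd if="$dev" of=/dev/null bs=1M status=progress
done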

Also dmesg - any ATA-messages or errors?

Can you please post the output of mdadm -D /dev/md124 first?

What you are looking for to resolve your problem is a manual resync. Please read up on this on your own, since I don't have much experience with mdadm, but it should work something like this:

# stop array
mdadm --stop /dev/md124
# force resync
mdadm --assemble --run --force --update=resync /dev/md124 /dev/sdf2 /dev/sdg2 /dev/sdh2 /dev/sdi2 /dev/sdk2 /dev/sdj2

A resync should try to repair errors on a disk from the redundant copy (RAID 10 keeps mirrored copies rather than parity).

/dev/md124:
           Version : 1.2
     Creation Time : Mon Oct 19 03:11:19 2020
        Raid Level : raid10
        Array Size : 41010455040 (39110.62 GiB 41994.71 GB)
     Used Dev Size : 13670151680 (13036.87 GiB 13998.24 GB)
      Raid Devices : 6
     Total Devices : 6
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sun Feb  7 18:17:55 2021
             State : clean 
    Active Devices : 6
   Working Devices : 6
    Failed Devices : 0
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : archiso:md2
              UUID : 3e50715e:3cd2172c:2969e12a:d41b5e5e
            Events : 18829

    Number   Major   Minor   RaidDevice State
       0       8       82        0      active sync set-A   /dev/sdf2
       1       8      146        1      active sync set-B   /dev/sdj2
       2       8      162        2      active sync set-A   /dev/sdk2
       3       8      130        3      active sync set-B   /dev/sdi2
       4       8      114        4      active sync set-A   /dev/sdh2
       5       8       98        5      active sync set-B   /dev/sdg2

So I tried to do this, but I don’t think it did anything. I stopped the array, and ran the command:

mdadm --assemble --run --force --update=resync /dev/md124 /dev/sdf2 /dev/sdj2 /dev/sdk2 /dev/sdi2 /dev/sdh2 /dev/sdg2

It put the array in read only mode, and /proc/mdstat said:

md124 : active (auto-read-only) raid10 sdf2[0] sdg2[5] sdh2[4] sdi2[3] sdk2[2] sdj2[1]
  41010455040 blocks super 1.2 512K chunks 2 near-copies [6/6] [UUUUUU]
  	resync=PENDING
  bitmap: 0/306 pages [0KB], 65536KB chunk

I read that you need to put the array into readwrite mode for the resync to occur, so I did:

vgchange -an hdd_14tb_r10_main

(because LVM turns itself on automatically on RAID assemble, and I wanted it off), and then:

mdadm --readwrite  /dev/md124

Immediately, cat /proc/mdstat says:

md124 : active raid10 sdf2[0] sdg2[5] sdh2[4] sdi2[3] sdk2[2] sdj2[1]
      41010455040 blocks super 1.2 512K chunks 2 near-copies [6/6] [UUUUUU]
      bitmap: 0/306 pages [0KB], 65536KB chunk

So it doesn’t look like it re-synced anything. But dmesg did say:

[ 2154.518049] md: resync of RAID array md124
[ 2154.518309] md: md124: resync done.

mdadm -D /dev/md124 still shows clean:

/dev/md124:
           Version : 1.2
     Creation Time : Mon Oct 19 03:11:19 2020
        Raid Level : raid10
        Array Size : 41010455040 (39110.62 GiB 41994.71 GB)
     Used Dev Size : 13670151680 (13036.87 GiB 13998.24 GB)
      Raid Devices : 6
     Total Devices : 6
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sun Feb  7 18:53:08 2021
             State : clean 
    Active Devices : 6
   Working Devices : 6
    Failed Devices : 0
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : archiso:md2
              UUID : 3e50715e:3cd2172c:2969e12a:d41b5e5e
            Events : 18832

    Number   Major   Minor   RaidDevice State
       0       8       82        0      active sync set-A   /dev/sdf2
       1       8      146        1      active sync set-B   /dev/sdj2
       2       8      162        2      active sync set-A   /dev/sdk2
       3       8      130        3      active sync set-B   /dev/sdi2
       4       8      114        4      active sync set-A   /dev/sdh2
       5       8       98        5      active sync set-B   /dev/sdg2

When I run dd if=/dev/md124 of=/dev/null again with the volume group turned off, it fails in the same spot again:

dd: error reading '/dev/md124': Input/output error
85953664+0 records in
85953664+0 records out
44008275968 bytes (44 GB, 41 GiB) copied, 84.9385 s, 518 MB/s

dmesg:

[ 2604.996225] Buffer I/O error on dev md124, logical block 10744208, async page read

Since the resync didn't take any time, I think it's working from some kind of index (presumably the write-intent bitmap) rather than checking the entire drive pool byte-by-byte. These are 12.7 TiB partitions on mechanical drives; a full integrity check should take at least 24 hours to run.

Do I understand you correctly that after running the assemble command the resync got stuck in the PENDING state? Because as far as I understood the instructions I read, it should have started on its own and required no further intervention. I would even guess that putting it into read-write mode may have aborted the process.
But to be honest, as I said before, I do not have much experience with mdadm.

Correct. When I looked online, someone on another site said it waits until the array is put into read-write mode before running the resync. There was something in dmesg indicating a sync occurred, but it took no time at all, so I think the resync is based on the write-intent bitmap rather than a full byte-by-byte comparison of the drives.

Sorry, I am out of ideas then.

Did you scrub it?

https://wiki.archlinux.org/index.php/RAID#Scrubbing

Definitely stop writing data to it until it’s healthy.
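Per that wiki page, a scrub is started through sysfs; roughly, using the md124 name from this thread:

# read every copy of every block and count mismatches
echo check > /sys/block/md124/md/sync_action
# watch progress
cat /proc/mdstat
# number of mismatches found by the check
cat /sys/block/md124/md/mismatch_cnt
# rewrite mismatched/unreadable blocks so the copies agree again, instead of only counting them
echo repair > /sys/block/md124/md/sync_action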

Have you run badblocks or vgck? I assume you ran fsck since you mentioned it.


Could be the disks and/or controller are actually damaged, given you mention power issues being involved.

All the file system repair and scrubs etc in the world won’t help if the hardware is now flaky. In fact they may actually hurt.


Disks are passing SMART, so a disk issue is unlikely, but it might be a good idea to put a known-good drive onto the controller and write/read directly to it. If you get I/O errors there, then it's probably the controller (or a cable, though I don't see how a power outage would kill a cable).
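Something along these lines, assuming the spare drive shows up as /dev/sdX and holds nothing you care about:

# read-only surface scan of the known-good drive on the suspect controller
badblocks -sv /dev/sdX
# or just a long sequential read to exercise the controller/cable path
dd if=/dev/sdX of=/dev/null bs=1M status=progress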


Yeah, controller maybe? Dodgy capacitors on the board? Either way… servers (any machine, but especially servers, as they rarely power down) plus power cycling… bad juju.

If hardware is going to fail or degrade, in my experience it's most likely during a power cycle. That seems to be when hardware gets most stressed, or when initialisation-type problems rear their head. Also, unless your air conditioning is also on the UPS, maybe it got hot?

Point being, though - if your hardware claims it is having problems… try (if possible) to validate that the hardware is actually OK before doing too much in the way of software-side repair.