Mdadm and Zombie disks

I have a RAID 6 managed by mdadm. A disk failed and was automatically marked as removed from the array. I may not have removed it properly: I tried "--fail" and "--remove", but mdadm complained because the disk was simply "gone". I eventually found the problem was at the backplane, and rerouting the SATA cable directly to a new disk allowed the RAID to recover. So the trouble is solved. "But wait," I hear you ask, "what is your question?"
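For reference, this is roughly the sequence I was attempting, with --add as the step that would follow once a replacement is in place (sdX1/sdY1 are placeholders for my actual devices):

mdadm /dev/md1 --fail /dev/sdX1     # mark the failed member as faulty
mdadm /dev/md1 --remove /dev/sdX1   # remove it from the array
mdadm /dev/md1 --add /dev/sdY1      # add the replacement and let the rebuild start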

When I look at the array status I see:
md1 : active raid6 sde1[8] sdc1[1] sdg1[4] sdh1[7] sdb1[0] sdd1[6] sdf1[3] sda1[5]

It's driving me nuts that there is no device "sdX1[2]", i.e. the devices are numbered 0-1 and 3-8 instead of 0-7 as they were before the failure.

Is this a problem? Can I fix that? Or maybe it’s just best to accept and move on?

Thanks for the considerations!

Does this behaviour persist after a reboot?

My thinking is that every drive in the RAID has a UUID; mdadm removes that UUID from the array on failure, and when the drive comes back via a different controller its UUID has changed, so it can be added to the array as a new member. After a reboot, the old UUID is gone from the mdadm config, so the devices can once more be numbered sequentially.
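If you want to see what mdadm itself has recorded, something along these lines should show both the slot assignments and the per-member UUIDs (field names quoted from memory, so check against your own output):

mdadm --detail /dev/md1                                          # the Number/RaidDevice columns show the slots
mdadm --examine /dev/sde1 | grep -E 'Device UUID|Device Role'    # superblock info for one member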


Hmmmm. That would be great! I am waiting for a new backplane, so I won't be able to test this until I swap out the faulty, "never breaking", backplane. It's a 24/7/365 system, so rebooting can only happen with careful planning.

It just may be that I like order and consistency. When it doesn't happen, it drives me nuts!

Is there a way to list all the devices that mdadm expects to be there? Everything I’ve tried just shows what is “hooked up”.

Ah, RAIDs, so much redundancy. So much confusion!

Thank you so much for your help! This community is so very helpful!

You may want to investigate the mdadm.conf file (it's in the /etc/mdadm/ directory), but you can find the UUIDs with the blkid command:

blkid /dev/sdX  #always run blkid as root or with sudo if you must!

If you specify only the block device (sdX) it'll list that device and all partitions on it, but you can specify the partition too (sdX1) to list just that partition. Replace the X with a * (like /dev/sd*) to list all block devices and partitions on the system. In that case it's wise to redirect the output to a file:

blkid /dev/sd* > blkid.txt

It's convenient to have a record of it when there are more than a couple of drives involved.
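As for listing what mdadm expects versus what is hooked up: comparing the array's own view against the config file should get you there. Something like this (output formatting varies a bit between versions):

mdadm --detail --scan        # arrays mdadm currently sees, with their array UUIDs
cat /etc/mdadm/mdadm.conf    # arrays (and UUIDs) mdadm is configured to assemble
mdadm --detail /dev/md1      # per-member table: Number, RaidDevice and State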

HTH!

Nice! I’ll do that tomorrow.

/proc/partitions only shows active partitions, so maybe this will show more. I may also just be reacting to my paranoia; a reboot might make this go away. But I have constraints: I can't just reboot (or "reboost", as we joke) at will.
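For anyone else with the same itch, these are the views I plan to compare tomorrow (device names are for my box, adjust to taste):

cat /proc/partitions     # partitions the kernel currently knows about
cat /proc/mdstat         # the assembled array, with the [N] slot numbers
sudo blkid /dev/sd*      # UUIDs of every attached device and partition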

This helps a lot. Thank you! I hope this thread will help others who have the same paranoia I do.

Thank you Dutch_Master and thank you all who have read this!

It's been nagging me all day now, so I might as well just say it :stuck_out_tongue: You may want to investigate setting up a high-availability cluster if your infra is that important. Not only does it allow for data replication (the primary method of preventing data loss), it also makes maintenance easier, as a 2nd or even 3rd server takes over when the primary fails or is shut down for repairs/upgrades. It doesn't have to be expensive: US$1000 will get you a decent server case, a Supermicro H11SSL-i mainboard, an EPYC 7351P CPU, 128GB of ECC LRDIMM RAM and a 2U EPYC cooler with fan, with ample room for expansion (PCIe Gen 3 is still pretty fast!) and probably some budget left over.


Yes, I am now looking for a high availability setup.

But I've rebooted the server and device number two is still not there; 0, 1, 3, 4, 5, 6, 7 and 8 are. Everything else looks good, though, so I'm just going to ignore it.