Hello. What happened: one disk in an 8-drive array was giving read errors. Before replacing it, I ran a device delete (btrfs device delete) to remove the failing drive from the array (there was enough free space to do so), and that had been running for the last couple of days. Today the computer hosting the array unexpectedly shut down, and after restarting, the system is in this state:
Status of the disk array:
- The array does not mount normally.
- Mounting with degraded/recovery options (-o degraded, usebackuproot) fails as well.
- dmesg | tail shows "parent transid verify failed" and "bad tree block" errors, and it also fails to read the chunk tree on the array.
- The rescue/recovery tools report that the chunk tree cannot be read and that checksum verification fails.
- btrfs rescue chunk-recover exits with "Segmentation fault (core dumped)".
- btrfs rescue super-recover gives the same error as the other operations, but it does show that the bad superblocks are on /dev/sdd (the drive that is failing, and the one we intended to replace).
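For anyone hitting similar symptoms, the relevant kernel errors can be pulled out of dmesg in one pass. The sample lines below are illustrative stand-ins; the byte offsets and generation numbers are made up, not output from this array. On a real system the equivalent is just dmesg | grep.

```shell
# Hypothetical sample of btrfs kernel errors of the kind described above;
# the byte offsets and generation numbers are invented for illustration.
log='BTRFS error (device sdd): parent transid verify failed on 31302336512 wanted 62455 found 62451
BTRFS error (device sdd): bad tree block start, want 31302336512 have 0
BTRFS error (device sdd): failed to read chunk tree: -5'

# "wanted N found M" means a metadata block's generation is older than its
# parent expected -- stale/corrupt metadata, which fits an interrupted
# device delete better than a second physically dying drive.
printf '%s\n' "$log" | grep -c 'BTRFS error'

# On the affected machine, the real command would be:
#   dmesg | grep -i 'BTRFS error'
```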
Any ideas would be appreciated. Please help; there are 35 TB of data at stake.
Thanks.
@wendell might have some thoughts on this topic.
The first thing you should do is dd (or, better, ddrescue) each failing disk to a new disk of the same size or larger. Recovery operations can often make things worse when they don't work, so it may already be too late if the recovery tools have written changes to the disk.
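The imaging step above can be sketched as follows. This version runs against scratch files so it is safe to copy-paste; on the real system, SRC would be the failing device (e.g. /dev/sdd) and DST a spare disk or image file at least as large, and ddrescue with a map file is the better tool because it retries bad sectors and can resume.

```shell
# Sketch of imaging a disk before attempting any repair.
# SRC/DST are scratch files here so this can run anywhere; on the real
# system SRC would be the failing device (e.g. /dev/sdd) and DST a spare
# disk or an image file on known-good storage.
SRC=$(mktemp)
DST=$(mktemp)
head -c 1048576 /dev/urandom > "$SRC"   # stand-in for the disk contents

# conv=noerror,sync makes dd continue past read errors (zero-filling the
# unreadable blocks) instead of aborting. ddrescue handles this better and
# keeps a map file so the copy can be resumed and retried:
#   ddrescue -d /dev/sdd /mnt/spare/sdd.img /mnt/spare/sdd.map
dd if="$SRC" of="$DST" bs=64K conv=noerror,sync status=none

cmp -s "$SRC" "$DST" && echo "images match"
```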
With the ddrescue'd copies you are then free to experiment a bit more and attempt recovery. I'm not clear on whether this is a multiple-drive failure or a revenge-of-the-removed-disk type problem?
After following wendell's advice and backing up your data (I assume 35 TB is worth the time, effort, and money to save): if debug symbols are enabled, you could run the chunk recovery under gdb and see where it's segfaulting.
gdb --args ./btrfs rescue chunk-recover -v /dev/sdd
Additionally, btrfs-progs built from GitHub master might already have a fix for the segfault.
Sadly, that's all I've got. I still use hardware RAID controllers... because I like to play with fire in different ways.
The drives are being copied. As for the cause, I don't think the drives are physically failing (except the one that already was); I think the chunk tree was corrupted when power was lost in the middle of the device delete.