This video is about a disk crash in Logan's travel/editing NAS. It turns out this is also where he kept the latest copies of the new album!
So we were in a situation where 2 of 3 disks had failed, the ZFS pool was totally unmountable, everything was on fire, and the data was gone. Fortunately we're nerds, and we turned it up to 11 to try to get everything back.
The first problem we had was a hardware one: a dual-disk crash so severe that we had to send the drives off to recovery specialists. Our friends at Gillware were able to get us going again, though one of our disks was a lost cause.
Big thank-you to them for their assistance. If you haven’t seen the burnisher in action, take a look: https://blog.gillware.com/data-recovery/data-recovery-101-burnishing-platters
Once we got the drives operational, it became a challenge to actually extract the data from ZFS.
If you have a damaged ZFS pool, stop everything right now. You will likely need to offline your pool and clone the constituent disks before attempting any of the steps shown here.
It is very unlikely that you have a pool that is actually damaged enough to warrant the hackery you see in this video. If your data is important, please consult experts.
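As a minimal sketch of that first step, getting the pool out of harm's way might look like the following (the pool name `tank` is hypothetical; substitute your own):

```shell
# Stop all I/O to the damaged pool: export it, then work only on clones.
zpool export tank

# If you absolutely must inspect the original disks, import read-only so
# nothing writes to the pool while you look around.
zpool import -o readonly=on tank
```

These commands obviously require the real pool and hardware, so treat them as an outline rather than something to paste blindly.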
The exercises you see in the video were performed on a clone of the ZFS pool made with ddrescue. Learn more about cloning the disks in your pool here:
The fact that we’re using clones means that if (when) we screw up, we can go back to the original drives and try again. This is substantially less convenient if you have a large number of disks.
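For reference, cloning a single failing disk with GNU ddrescue looks roughly like this (device and file names are hypothetical; repeat for each disk in the pool):

```shell
# Image the failing disk onto a file, using direct disk access to bypass
# the kernel cache (-d) and retrying bad sectors up to three times (-r3).
# The mapfile records which regions were rescued and which failed, so an
# interrupted run can be resumed later without re-reading good areas.
ddrescue -d -r3 /dev/sdb /recovery/images/sdb.img /recovery/images/sdb.map
```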
Some of the commands here appear to operate in a read-only mode, but from reading the source code I can't be sure that manipulating the transaction log pointers during a mount doesn't touch some of the metadata, so please be careful with that.
Our setup was a relatively simple three-drive RAIDZ1 pool where one drive had failed spectacularly and ddrescue was reporting about 46 read errors on one of the two remaining disks.
After doing numerous experiments, echoing stuff into /proc/ to turn off metadata verification, and a lot of other fun exercises, we decided that using zdb was the best course of action. Unfortunately, zdb crashes. A lot. Even with the -AAAAAAA flags (which are supposed to bypass safety checks) we needed to patch the zdb source code to continue.
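For illustration, invoking zdb against the cloned images looks roughly like this (pool name and image directory are hypothetical):

```shell
# -e    operate on an exported pool rather than an imported one
# -p    directory to search for vdevs (here, our ddrescue disk images)
# -AAA  don't abort on assertion failures, and enable panic recovery
# -d    dump dataset and object information
zdb -AAA -e -p /recovery/images -d tank
```

On a badly damaged pool this is still liable to crash partway through; the point is to get as far into the object dump as possible before it does.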
Once we were able to get zdb to dump a reasonably complete list of filesystem object entries, we could spot entries that sat below corrupt (Input/output error) directory entries. For example, /mp3 was completely inaccessible, but we could use zdb to see files located under the /mp3/whatever directory.
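A hedged sketch of why this works: a ZFS directory is a ZAP object mapping names to object numbers, so you can dump a surviving child directory object directly and skip over an unreadable parent (the dataset name and object number below are hypothetical):

```shell
# Dump object 1234 from the dataset at high verbosity (-dddd). For a
# directory object this prints its name -> object-number entries, which
# lets you keep descending even when the parent directory returns EIO.
zdb -AAA -e -p /recovery/images -dddd tank/music 1234
```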
Once we could see them, extracting them was the next problem. It turns out zdb allows you to dump file blocks, but not individual files. Does anyone know a command for this? It seems weird there wouldn't be one built in. The problem is perhaps that I can't read.
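The block-dumping facility we leaned on is zdb's -R option, which reads a single block given a vdev:offset:size triple (all values below are hypothetical):

```shell
# The :r flag dumps the raw block bytes to stdout, which we capture as one
# chunk of the file being recovered. If the block is compressed, adding d
# to the flags asks zdb to decompress it first.
zdb -AAA -e -p /recovery/images -R tank 0:1a2b3c00:20000:r > chunk000.bin
```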
Anyway, with a good bit of help from a friend, we were able to construct some Perl scripts to extract each chunk and then reassemble the chunks back into usable files.
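Our actual scripts were Perl, but the reassembly step amounts to something like this shell sketch (the chunk names and final size are hypothetical; the true file length comes from the znode dump):

```shell
# Concatenate the dumped blocks in order; zero-padded names sort correctly.
cat chunk*.bin > recovered.mp3

# The last block is padded out to the full block size, so trim the file
# down to the byte length recorded in its znode.
truncate -s 4815162 recovered.mp3
```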
Do you have a ZFS war story? If so, come over to the forums and let's swap stories of triumph and defeat. If anything, our experience in this instance only makes us love ZFS more.
This is a companion discussion topic for the original entry at https://teksyndicate.com/videos/zfs-data-recovery-logan-got-his-groove-back