Resilvering a FreeNAS drive

Back story:

So I've been running two FreeNAS boxes: one is a file server, the other is a media server. Both have been running 24/7/365 for a little over two years. The file server sees casual use as far as file movement goes, but the media server has a ton of data added and removed, plus a ton of reads, because it is our only source for TV viewing. All of our TVs have either WD TV Live boxes or older Asus media boxes connected to them; we don't have cable or satellite, but we do subscribe to Hulu.

Main Story:

So for about six months FreeNAS had been sending me an email every day about the errors on one of the drives in its pool. This is a 5-disk RAID-Z2 pool, so it has redundancy built right in... for several weeks I watched the errors get worse in frequency, but the pool wasn't degraded. I ordered a replacement drive, a WD 3 TB NAS (Red). The drive came in a few days later, and I put it on a shelf and moved on to other projects.

About a month ago I got a nasty email from the server telling me that the drive was offline and that the status of the pool was degraded... that got my attention. But the server kept on humming along, serving up TV whenever we asked, so I knew the drive swap was now on my radar as something I needed to plan on doing sooner rather than later.

So fast-forward to Saturday: we have some type of power surge that flickers the lights and sets off all the UPS alarms in the shop that houses the servers. All of the UPSs reset except the one on the two FreeNAS boxes, so I go over to check it, hit the button to manually reset it, and it resets and stops beeping.

I look at the two servers connected to the UPS and both are off. I go around to the front of the rack and push the power button for the file server, and it starts to boot up. Then I push the button on the media server and nothing happens... weird. So I go around to the back and flip the switch on the power supply in the media server, and something inside arcs, sending a bright flash through the case... this ain't looking good.

Meanwhile the file server boots up just fine, so I have to pull the media server out of the rack and put it on the bench: a big-ass 4U case with a tiny ASRock Atom server motherboard. The case weighs a ton, but I get it out, pull the cover off, and start looking for burn marks. Finding nothing on the motherboard, I figure let's hook it up on the bench and see what happens... it's dead, totally dead.

So all I can think is that 8 TB+ of data is gone forever and that I'm going to have to start from scratch rebuilding my library. I pull the PSU (a 700 W OCZ; yeah, not the best PSU on the planet) and it has the smell of smoke being let out of a component. I set it aside and find another PSU to test with; all I have is an old 400 W Rosewill, but it has the necessary connections to try to power the system back up... nada, it's dead. The cooling fans spin up, I can hear the drives spin up, and the motherboard has all its green (diagnostic) lights on... but no video output.

At this point I step back and really start to ponder the fact that my data might be totally lost. I go outside for a smoke and a drink while trying to figure out my next move. Then I scrounge around and find a used motherboard that has a CPU and memory, try to figure out why I have it in the first place since it seems complete (is it bad? I just can't remember), and decide to just give it a try. I hook it up outside the case to see if I can get it to power up and check the BIOS to see if it can boot from USB (my FreeNAS boxes boot from USB sticks). Luckily it supports booting from USB and has enough SATA connections for the drive pool plus the drive that needs to be added for the resilvering (thinking ahead).

So I pull the ASRock board out and install the used motherboard, connect everything to it, power it up, and head to the BIOS to check things out. I change the boot device to USB, which it sees (thank God), save the config, and reboot. It takes FreeNAS a while to boot, and it looks like when it went down there were processes that got truncated and errors created that had to be resolved, but 10 minutes later it was back up and my data was still intact. (Thank God again!)

So I look around and find the serial numbers for the drives; FreeNAS is nice enough to show you this info. The drive that was offline is still offline, and it shows a serial number that doesn't correspond to any drive in the pool. Through the powers of deduction, eliminating the drives it does show correct serial numbers for, I find the bad drive (the second-to-last one I looked at). I leave it connected because I need to go through the resilvering process, and it needs to know which drive I'm replacing.
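For anyone who wants to do the same deduction from the shell instead of the GUI, here's a rough sketch. The pool name `tank` and the `ada?` device names are placeholders for whatever your system actually uses:

```shell
# Show pool health -- the failed disk shows up as OFFLINE/UNAVAIL,
# often identified only by its ZFS GUID once the device node is gone.
zpool status -v tank

# Print the serial number of each disk that still responds, so you
# can match them against drive labels and find the one that's missing.
for disk in /dev/ada?; do
    echo "== $disk =="
    smartctl -i "$disk" | grep -i 'serial'
done
```

The drive whose serial never appears in that list is the dead one.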

I shut it back down, re-position a few things, hook up the replacement drive, and boot it back up. A little quicker this time, but it still has the network to parse and jails to load, so it takes about six minutes to get up and stable. The data is still intact... I'm ecstatic!

So I followed this guide.....

And 26 hours later the system finished resilvering the new drive, the pool state is healthy, and all my data is intact. The interesting part is that the server continued all of its processes the whole time: the programs in the jails kept running, data was being added, and data was being served up on demand. Yes, the pool was degraded, but it continued to function, which is really astonishing.
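The replace/resilver step itself boils down to a couple of `zpool` commands. This is just a sketch; the pool name `tank`, the GUID, and the new device `ada5` are placeholders you'd substitute from your own `zpool status` output:

```shell
# Tell ZFS to replace the failed disk (referenced by the GUID that
# `zpool status` prints for the missing member) with the new drive.
# Resilvering starts immediately in the background.
zpool replace tank 1234567890123456 /dev/ada5

# Check on it whenever you like -- status shows percent complete and
# an ETA while the pool keeps serving data (degraded, but online).
zpool status tank
```

That's why the jails and streams never stopped: the resilver runs while the pool stays imported and in service.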


So I have a new motherboard, a new PSU, and another replacement 3 TB drive to put on the shelf, all ordered. I'll rebuild the server with the new components and put it back in the rack. The sucky thing is the ASRock motherboard I lost was about $250 and out of warranty, so it'll go on the wall of shame as a reminder. Hopefully the 16 GB of RAM that's in that board is OK; if not, then it might be a good lesson to buy good high-quality PSUs and not scrimp.

And that's my diagnosis: the power surge somehow caused a MOSFET in the PSU to explode (I found the evidence when I took it apart), which killed the motherboard, which prompted me to replace the dead drive that I'd been putting off for weeks. Funny how things happen sometimes.

Sorry, no TL;DR... suck it up and read the whole story. I hope you get a chuckle out of it, and if you're not using FreeNAS, use this story as a compelling reason why data redundancy in a RAID system is a very good idea.


Good read.

Maybe it's time to start thinking about a cold offsite backup?

(also, want pics of fried psu)


Yeah, if it was anything but TV shows I'd have another plan, but honestly I delete shows/seasons after we watch them, so keeping a current backup would be time-consuming. I do have about 3 TB of what I call "core programs" that we watch over and over again, but it's mostly old stuff that was hard to come by because it was ripped from DVDs or some other source I don't have access to anymore, so it is backed up to removable drives.

LOL... the fried PSU: I tried to take pictures of the MOSFET that blew apart, but it was on the side of a heat sink and under the wire bundle, so I could never get a clear shot. Really, when I took it apart I saw a little black square fall onto the table. I picked it up and noticed it had writing on one side, so the hunt started; it took about five minutes to find where the chip came from because it was buried... it all went in the dumpster except the fan, which I salvaged.

I'm just so impressed... I mean, I procrastinated for a long time because I didn't want to pull the case out of the rack, and then I was forced to. I hate losing the ASRock motherboard; it was pretty sweet and sipped power, but as they say, that's how it goes.


Indeed, even with RAID-Z2 you want backups on external drives.
I learnt that lesson last year when my HBA developed a firmware/driver mismatch after a FreeNAS update, putting all my data at risk.
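ZFS makes those external copies fairly painless with snapshots and send/receive. A minimal sketch, assuming a dataset `tank/media` and an external drive imported as a pool named `backup` (both names are placeholders):

```shell
# Take a point-in-time snapshot of the dataset to copy.
zfs snapshot tank/media@2016-01-10

# Replicate the snapshot to the external pool. Later runs can use an
# incremental send (-i old-snap new-snap) to copy only the changes.
zfs send tank/media@2016-01-10 | zfs receive backup/media
```

After that, `zpool export backup` lets you unplug the drive and stick it on a shelf as a true cold copy.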


Yep... I guess it shouldn't impress me so much that the pool operated for as long as it did with one dead drive. It was degraded but still functioned in the role I use it for with no ill effects. I've never had that type of experience with RAID and a drive failure.

what thread?

This one.

Also, my server thread.