Back story:
So I've been running two FreeNAS boxes: one is a file server, the other is a media server. Both have been running 24/7 for a little over two years. The file server sees only casual use as far as file movement goes, but the media server has a ton of data added and removed, plus a ton of reads, because it is our only source for TV viewing. All of our TVs have either WD TV Live boxes or older Asus media boxes connected to them. We do not have cable or satellite, but we do subscribe to Hulu.
Main Story:
So for about 6 months FreeNAS has been sending me email every day about errors on one of the drives in its pool. This is a 5-disk RAID-Z2 pool, so it has redundancy built right in... For several weeks I watched the errors get worse in frequency, but the pool wasn't degraded. I ordered a replacement drive, a WD Red 3TB NAS drive. It came in a few days later, and I put it on a shelf and continued on to other projects.
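For anyone wondering what "redundancy built right in" works out to, here's a rough back-of-the-envelope sketch. It assumes all five drives in the pool are 3TB like the replacement (I didn't state that above) and it ignores ZFS overhead, so treat the numbers as ballpark only. The function names are made up for illustration.

```python
def raidz_usable_tb(disks: int, disk_tb: float, parity: int) -> float:
    """Rough usable capacity: data disks times disk size, ignoring ZFS overhead."""
    if disks <= parity:
        raise ValueError("need more disks than parity drives")
    return (disks - parity) * disk_tb


def pool_state(parity: int, failed: int) -> str:
    """Classify a single RAID-Z vdev by how many drives have failed."""
    if failed == 0:
        return "healthy"
    if failed <= parity:
        return "degraded"  # still serving data, but with less (or no) safety margin
    return "lost"          # more failures than parity drives


# Assumed layout: 5 x 3TB drives in RAID-Z2 (2 parity drives).
usable = raidz_usable_tb(disks=5, disk_tb=3.0, parity=2)
print(f"approx usable: {usable} TB")
print("one drive down:", pool_state(parity=2, failed=1))
print("three drives down:", pool_state(parity=2, failed=3))
```

The point is that RAID-Z2 can lose two drives and keep going, so one flaky drive leaves the pool running but with one less drive's worth of safety margin.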
About a month ago I get a nasty email from the server telling me that the drive is offline and that the status of the pool is degraded..... That gets my attention. The server keeps humming along, serving up TV whenever we ask, but I know the drive change is now on my radar as something I need to do sooner rather than later.
So fast-forward to Saturday: we have some type of power surge that flickers the lights and sets off all the UPS alarms in the shop that houses the servers. All of the UPSs reset except the one feeding the two FreeNAS boxes. I go over to check it and hit the button to reset it manually, and it does reset and stops beeping.
I look at the two servers connected to the UPS and both are off, so I go around to the front of the rack and push the power button on the file server, and it starts to boot up. Then I push the button on the media server and nothing happens...weird. So I go around to the back and flip the switch on the media server's power supply, and something arcs, sending a bright flash through the inside of the case...this ain't looking good.
Meanwhile the file server boots up just fine, so I have to pull the media server out of the rack and put it on the bench. It's a big-ass 4U case with a tiny ASRock Atom server MB, and the case weighs a ton, but I get it out, pull the cover off, and start looking for burn marks. Finding nothing on the motherboard, I figure let's hook it up on the bench and see what happens....it's dead, totally dead.
So all I can think of is that 8TB+ of data is gone forever and that I'm going to have to start from scratch rebuilding my library. I pull the PSU (a 700W OCZ, yeah, not the best PSU on the planet) and it has the smell of smoke being let out of a component. I set it aside and find another PSU to test with. All I have is an old 400W Rosewill, but it has the necessary connections to try to power the system back up.....nada, it's dead. The cooling fans spin up, I can hear the drives spin up, and the motherboard has all its green diagnostic lights on...but there's no video output.
At this point I step back and really start to ponder the fact that my data might be totally lost. I go outside and have a smoke and a drink while trying to figure out my next move. Then I scrounge around and find a used motherboard that has a CPU and memory. I try to figure out why I have it in the first place, since it seems complete (is it bad? I just can't remember), and decide to just give it a try. I hook it up outside the case to see if I can get it to power up, and look at the BIOS to see if it can boot from USB (my FreeNAS boxes boot from USB sticks). Luckily it supports booting from USB and has enough SATA connections for the drive pool plus the drive that needs to be added for the resilvering (thinking ahead).
So I pull the ASRock board out and install the used MB, connect everything to it, power it up, and head to the BIOS to check things out. I set the boot device to the USB stick, which it sees (thank God), save the config, and reboot. It takes FreeNAS a while to boot, and it looks like some processes got cut off when it went down, leaving errors that had to be resolved, but 10 minutes later it's back up and my data is still intact. (Thank God again!)
So I look around and find the serial numbers for the drives. FreeNAS is nice enough to show you this info, but the drive that was offline is still offline, and it shows a serial number that doesn't correspond to any drive in the pool. Through the powers of deduction, eliminating the drives that it does show the correct serial numbers for, I find the bad drive (second to the last one I looked at). I leave it connected because I need to go through the resilvering process and it needs to know which drive I'm replacing.
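The elimination trick is really just a set difference, and could be sketched like this. The serial numbers here are invented placeholders, not my actual drives, and the function name is hypothetical; you'd type in what FreeNAS reports and what's printed on the drive labels.

```python
def find_suspect_drives(physical_labels, reported_serials):
    """Return the label serials that the system does NOT report correctly."""
    reported = {s.strip().upper() for s in reported_serials}
    return sorted(
        label.strip().upper()
        for label in physical_labels
        if label.strip().upper() not in reported
    )


# Placeholder serials read off the stickers on the five drives.
labels = ["WD-EXAMPLE-1", "WD-EXAMPLE-2", "WD-EXAMPLE-3",
          "WD-EXAMPLE-4", "WD-EXAMPLE-5"]

# Placeholder serials as shown by the system: four match, one is bogus.
reported = ["WD-EXAMPLE-1", "WD-EXAMPLE-2", "WD-EXAMPLE-3",
            "WD-EXAMPLE-5", "0000-UNKNOWN"]

print(find_suspect_drives(labels, reported))
```

Whatever label is left over after crossing off the matches is the drive to replace.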
I shut it back down and reposition a few things, hook up the replacement drive, and boot it back up. It's a little quicker this time, but it still has the network to parse and jails to load, so it takes about 6 minutes to get up and stable. But the data is still intact....I'm ecstatic!
So I followed this guide.....
And 26 hours later the system is finished resilvering the new drive, the pool state is healthy, and all my data is intact. The interesting part is that the server kept up all of its processes the whole time: the programs in the jails kept running, data kept being added, and data kept being served up on demand. Yes, the pool was degraded, but it continued to function, which is really astonishing.
So I have a new MB, a new PSU, and another replacement 3TB drive (to put on the shelf) on order. I'll rebuild the server with the new components and put it back in the rack. The sucky thing is that the ASRock MB I lost was about $250 and out of warranty, so it'll go on the wall of shame as a reminder. Hopefully the 16GB of RAM that is in that board is OK. If not, then it might be a good lesson to buy good, high-quality PSUs and not scrimp.
And that is my diagnosis: the power surge somehow caused a MOSFET in the PSU to explode (I found the evidence when I took it apart), which caused the MB to die, which prompted me to finally replace the dead drive, a job I'd been putting off for weeks. Funny how things happen sometimes.
Sorry, no TL;DR.....suck it up and read the whole story. Hope you get a chuckle out of it, and if you're not using FreeNAS, use this story as a compelling reason why data redundancy in a RAID system is a very good idea.