Superblock Not found: All Drives lost?

Well, I knew something like this was bound to happen on my server, but last time I knew what the cause was. This time, I do not. I have been running a Nas4Free install, on FreeBSD 11.0.4, for two years now without any issues, yet today I heard the server chirp and then... a drive error occurred that I couldn't repair.

I tried booting to a live CD and running fsck on the unmounted drives, but I kept getting the same error: "Superblock not found, attempting to find alternate superblock." Then that would fail, and that was it. I couldn't run fsck on any of the drives. All four drives, of various sizes, were reporting the same error. What sort of thing would cause this? It looked pretty random to my eyes.

My first thought: since I'm not running ECC memory, a bit got flipped somewhere, couldn't be corrected, and then BAM, corruption written straight to disk. The only thing I'd question is why it hit everything, since I'm not running the drives in any sort of RAID array.

Thankfully I have a backup of the most important media and documents, but all the work setting everything up is out the window. Is it time to upgrade my hardware to a proper Xeon and ECC memory? What should I be on the lookout for the next time this sort of thing happens? All I know to do now is salvage what I can, reformat, and start over.

If you think your RAM is faulty, use Memtest86 to test it.

However, read this comment.

And, if you "routinely have failed hard drives," something's wrong at your workplace! Even in continuous duty, a drive mechanism should last for years.

('Scuse me ... what's that noise ... 'click click click' ... uh oh ...)

Easily the most common cause of disk-drive failures is even a very slight instability in the incoming electrical power. Drives use direct-current synchronous motors which are absolutely intolerant of any variation in current. You should have beefy UPS (Uninterruptible Power Supply) boxes on everything. Yes, buy one for every secretary.

Their purpose is actually not to keep the lights on when the lights go out. The battery is there to provide the means to stabilize the current, in much the same way that a buffer-tank or column suppresses "hammer" surges in a water supply line.

Have an electrician review the wiring. Be sure, for example, that photocopiers, laser printers, and other power-hungry devices are on separate circuits from the computers. (They don't need UPSes, and will very quickly chew them up.) These machines can send a nasty surge down the line every time they start to do something ... especially the older copier equipment that is based on mirror-optics rather than scanners. (Which you should get rid of, anyway.) In poorly designed and especially "repurposed" office buildings, the culprit could be the tenant next door.

The only reason why "superblocks" might be seen as "going bad" is that they're (of course) the blocks most frequently written. Therefore, if the drive is going fishy, this is the block you are most likely to realize has been corrupted ... especially given that corruption in that block can cut off access to everything else.

Modern disk drives have a built-in diagnostic capability, called SMART, in which the drive's own electronics keeps statistics, chooses on its own to "spare out" questionable blocks, and so on. Linux (and all the others) does have tools that can read this information, and there are hardware monitors (e.g. Nagios plug-ins) that can monitor it. You can often, in this way, discover a pending failure well before it happens. If you are proactively looking for such things . . .

Therefore, I need to know, was your server connected to a UPS?

Source for above quote
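(On the "proactively looking" point above: smartmontools also ships a daemon, smartd, that can watch SMART attributes and run scheduled self-tests for you. A minimal /etc/smartd.conf sketch, with a made-up device name and mail address:)

# /etc/smartd.conf (example only)
# -a   monitor all SMART attributes
# -o on   enable automatic offline data collection
# -S on   enable attribute autosave
# -s (S/../.././02|L/../../6/03)   short self-test daily at 02:00, long test Saturdays at 03:00
# -m   address to mail warnings to
/dev/ada0 -a -o on -S on -s (S/../.././02|L/../../6/03) -m root@localhost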


It was connected to a UPS; I've had it connected for a while now. However, maybe it's time to replace it or the battery? I did have a power supply fail a few months back, and something in the wall caused a TV on the same circuit to fry. Unfortunately I live in an apartment, and the landlord/electric company won't inspect the apartment, as they claim everything is fine.

Have you tried TestDisk? It's available for most distros.

I have not; so far I've only run fsck, going off what I could find on Google. I have a live USB key running Linux Mint, and I'm trying to use gpart to read the filesystem and mount it read-only so I can start fresh.

Would test disk be able to repair the filesystem?

I'm attracted to the theory that some nasty electrical phenomenon fried the drives. Simultaneously. All of them. Maybe irreparable damage.

fsck works at the file system level and assumes that the physical media is sound and the electronics are in good shape. If the drives are physically damaged, you may be getting noise instead of data off the platters. Then fsck would be hopelessly confused.

If the drives are SMART-capable (almost certainly true), then smartctl can talk to the physical devices, if that is still possible (https://www.smartmontools.org/). Very likely it's on your linux-mint-on-a-thumbdrive, in /usr/sbin, so it may not be visible to the unprivileged. If it is not on the USB Linux Mint stick, I would highly recommend the System Rescue CD, which is Linux on a USB specifically geared toward examining systems that won't boot. It has the smartmontools set and other goodies useful for reconstructing distressed file systems.

The game would be to see if you can get the drives to SMART-test themselves, which they can do if they're still reasonably alive, and they do so independently of any OS or BIOS, using drive-internal firmware. My play would be to boot the box with System Rescue, find out how that linux-on-a-stick is identifying drives (cat /proc/partitions), and then run smartctl -a /dev/sd"x" ("x": some letter) to see if the drives still want to talk to the outside world. If you get a reasonable dump from the drive, then smartctl -t short /dev/sd"x" to kick off a five-minute self-assessment, then smartctl --xall /dev/sd"x" for a thorough report of the test.
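Spelled out as commands (using /dev/sdb purely as an example; substitute whatever /proc/partitions reports on your box):

cat /proc/partitions           # see how this boot has named the drives
smartctl -a /dev/sdb           # will the drive still talk at all?
smartctl -t short /dev/sdb     # start the short self-test (a few minutes)
smartctl --xall /dev/sdb       # afterwards: everything, including the self-test log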

If you work your way through that, and the drives pass, then they have a future as something other than paperweights.

Hope this helps. Good luck.
(edited to remove silly typos - other silly typos left undisturbed for amusement)

I've been able to SMART-test all the drives and see that the data is still intact. The issue now is mainly getting this superblock problem figured out. Last time I did this (I borked things messing with UnRaid), I used UFS Explorer to retrieve the data and then dump it to another drive. Now I don't have that program anymore, so I'm more at a loss. Honestly, losing this data makes me rethink whether Nas4Free and UFS are the way to go. I can't seem to repair much of it from the command line.

Also, my apartment complex came to check the electrical out... the guy takes one look at the wall and the surge protector plugged into it and says, "Yup, that's your issue, too many devices plugged in right there..." Terrible electrical wiring, and no way to fix it or move it to another breaker.

At this point I'd either leave or run a long-ass power cable.

That's very true, thankfully my lease is up on 31 May.

Should I look into replacing my UPS? All the self-run stats on the machine say it's fine, yet it didn't stop the surge that busted the power supply, or this one that caused the current issue...

Before you replace the UPS:

  • How old is it?

  • What is its model?

It's maybe 1.75 years old; it's an AVR825.

I have one just above it (an EC850LCD) and another one. But two years old should still be fine. Usually, UPSes are supposed to protect against the ripples that would cause these sorts of issues, and the specs say it does.

My next question, then: did you make sure your server was connected to one of the actual battery-backup outlets and not just one that was surge-protected? Even if the UPS helps against brownouts, it would only do that if the server was connected to one of the battery-backup outlets.

It was connected to a battery-backup/surge outlet. At this point, to recover the data with data-recovery software (UFS Explorer) it would take 4-5 days; each drive would take roughly 10-18 hours to decode and find all the files. What's not making sense is that when I use TestDisk to look at the disks, I can see the file structure and all the data is there; the filesystem is read correctly and everything. Yet I can't get it back to a mount point in Linux that would let me easily back up the new data and just rebuild. If I could do that, I'd hold out until I move and hope this doesn't happen at my new place.

Maybe just the root filesystem for your server is what's borked?

From the live CD, are you able to mount the data drives in, say, /media?

I wasn't able to mount them even on a separate machine. I took them out of the server and tried to mount them individually on a spare laptop I had around. It still produced the same errors, though, complaining about a bad magic number/superblock.

A caution on SMART testing: if you've used smartmontools and smartctl as I suggested, you've performed short tests. They take a couple of minutes to run and test the basic integrity of the drive mechanics. These short tests do not address the integrity of the media. More comprehensive long tests are needed for that, such as smartctl -t long /dev/sd"x", but these take a while to run: 240 minutes (four hours) on a 1TB drive. If you perform that kind of test and it passes (no bad LBA blocks), then you can say with authority that "the data is still intact." Otherwise it is a busted flush. (Strictly speaking, you can only say the "media is still intact"; who knows, at the higher level, whether bits have been flipped?)
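For reference, the long-test round trip looks roughly like this (again, /dev/sdb is only a stand-in for the real device):

smartctl -t long /dev/sdb      # start the extended self-test
smartctl -c /dev/sdb           # shows the estimated completion time
# ...come back some hours later...
smartctl -l selftest /dev/sdb  # self-test log: look for "Completed without error"
smartctl -A /dev/sdb           # attributes: reallocated / pending sector counts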

WTFShelly suggested TestDisk (http://www.cgsecurity.org/wiki/TestDisk), which is included on the System Rescue CD. Alas, it seems that its UFS support is a bit thin: the web site suggests that it can find lost UFS partitions, with no promise to repair ones that may have been corrupted. I've not used these tools; my FreeBSD and UFS experience tends to the theoretical, so there is not much I can say practically about fixing such.
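That said, one UFS-specific thing that may be worth knowing: fsck_ffs accepts a -b flag to use a backup superblock instead of the primary one. A rough sketch from a FreeBSD/Nas4Free live environment, with /dev/ada1p1 standing in for the real partition (the backup locations depend on the newfs parameters the filesystem was created with):

gpart show                  # list partitions and find the UFS ones
newfs -N /dev/ada1p1        # -N is a dry run: it only prints the backup superblock locations, writes nothing
fsck_ffs -b 32 /dev/ada1p1  # retry the check using an alternate superblock (32 is the classic first backup)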

If the data are very valuable, then this may be the time to purchase fresh media, image-copy the drives block-for-block to the brand-new media, and then attempt to recover from the images. If you screw up the recovery, you still have the originals, no more screwed up than they already are. This all seems like it will shoot a three-day weekend. Venerable dd can do an image copy, eventually. It can also stall on bad blocks, performing endless retries that can further damage injured media. Partimage tries to do the same image copying more intelligently, but it requires file system knowledge, and if you are having trouble reading superblocks, Partimage may not work. Also, Partimage is generally weak on UFS; it calls its support of that file system "beta." (dd is everywhere on Unix; Partimage is also on the System Rescue CD.)
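A minimal dd sketch, assuming the ailing drive shows up as /dev/sdb and a bigger scratch disk is mounted at /mnt/rescue (both hypothetical names; status=progress needs a reasonably recent GNU dd):

# Image the whole disk; keep going past read errors and pad unreadable
# sectors with zeros so everything stays at the same offset
dd if=/dev/sdb of=/mnt/rescue/sdb.img bs=64k conv=noerror,sync status=progress

# Then experiment on the copy, not the original
losetup -fP --show /mnt/rescue/sdb.img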

Hope this helps.

All of this is still theoretical, as I don't know the cause. I would buy new disks, say two 8TB drives, and consolidate my media, although that is expensive right now and there's no guarantee the cause wasn't something else.

I'll toss the drives back in, boot up a FreeBSD live USB, and run smartctl on them, this time with the long test, and see what comes out of it. If, at the very least, I lose some data but the drives are okay, I'll purchase a 10TB NAS drive from Seagate and a 4-bay enclosure to start expanding cold-storage backup.

cd /usr/src/linux
bertha linux # cat .config | grep UFS
# CONFIG_UFS_FS is not set
bertha linux

cat /proc/filesystems ?

The spare laptop knows UFS? Not every kernel knows about every type of file system (mine doesn't; see above). Sorry to be niggly, but I myself am just the kind of person who would test the integrity of a UFS file system on a UFS-incapable system. Methinks if you're mounting a UFS file system on a Linux laptop that doesn't speak UFS, you may get exactly those kinds of error messages.
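If the laptop's kernel does have UFS support (built in or as a module), mounting read-only is the safe way to poke at it. A sketch with /dev/sdb1 as a placeholder for the real partition:

grep -i ufs /proc/filesystems                     # is UFS support already available?
modprobe ufs                                      # load it if it's built as a module
mkdir -p /mnt/ufs
mount -t ufs -o ro,ufstype=ufs2 /dev/sdb1 /mnt/ufs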

I've loaded the box back up with a live USB image of Nas4Free, with all the proper kernel code and such to read UFS. Gonna see what I can make of it now.

I'm going to issue the commands to run the SMART long tests on all drives and then review the results. Should be done by tomorrow night.

Update: All SMART long tests passed with no errors. On a whim I tried to mount the drives in the Nas4Free live USB and they gave an error, which I figured would happen; then I ran fsck from the web GUI... and they were fixed! I do not know the difference between running it there in the web GUI versus running it in the shell as I did yesterday in the first place.
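(Purely a guess, but the GUI may simply have invoked the FreeBSD-native checker with the filesystem type and device spelled out, roughly the equivalent of something like this, with /dev/ada1p1 as a placeholder:)

fsck -t ufs -y /dev/ada1p1   # dispatches to fsck_ffs and answers yes to repair prompts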

I still have no idea what caused it, but I am backing up all my files as we speak and making sure my backups are up to date.