Hard Drive issue on FreeNAS 9.10.2-U6

Hostile · February 20, 2020, 11:25pm

Hi folks. I logged into my FreeNAS server today via the web interface and I was greeted with this message:
“Device: /dev/ada0, 16 Currently unreadable (pending) sectors”

It seems that one of my hard drives (ada0) needs to be replaced. The question I have is, how do I determine which physical drive in my server is the malfunctioning one? I have 10 WD Red 6TB drives (RAID Z2) in there and I would like to find out if there is an easy way to determine which one is the faulty drive.

noenken · February 21, 2020, 1:09am

Sure, that’s easy. You disable that drive in FreeNAS and the little light on your hot swap bays stops blinki…

Oh… let me guess.

Hostile · February 21, 2020, 1:48am

I’m not using hot-swappable bays. It’s a regular ATX case. That gives me an idea though. I disabled the drive and I’m going to wait a few minutes for it to cool down. My hand should be able to tell which drive is cooler to the touch.

freqlabs · February 21, 2020, 3:27am

FreeNAS 9.10 is very old so I don’t know how one would do this in the GUI, but diskinfo -s ada0 will tell you the serial number of the drive.
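
To label every drive in one pass, the serial numbers can also be collected in a loop. The following is only a minimal sketch, not a FreeNAS tool: it assumes `smartctl` (from smartmontools) is on the PATH, that it is run as root, and that the drives are named `ada0` through `ada9`. The helper names `parse_serial`, `serial_for`, and `list_serials` are made up for this example.

```python
import re
import subprocess


def parse_serial(smartctl_info):
    """Return the serial number found in `smartctl -i` output, or None."""
    match = re.search(r"^Serial Number:\s*(\S+)", smartctl_info, re.MULTILINE)
    return match.group(1) if match else None


def serial_for(device):
    """Run `smartctl -i` on one device and extract its serial number."""
    result = subprocess.run(
        ["smartctl", "-i", device],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_serial(result.stdout)


def list_serials(count=10):
    """Map /dev/ada0 .. /dev/ada<count-1> to serial numbers.

    Assumption: the drives use consecutive adaN names.
    """
    devices = ["/dev/ada{}".format(n) for n in range(count)]
    return {dev: serial_for(dev) for dev in devices}
```

Calling `list_serials()` on the NAS and printing the result gives a device-to-serial table that can be matched against the labels on the physical drives.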

Dynamic_Gravity · February 21, 2020, 3:54am

And then if he powers off the system he can just check each drive’s s/n!

Hostile · February 21, 2020, 4:31am

Yeah, thanks guys, I found the drive and made sure to write down the name, number, and location of every drive in case something similar happens in the future. Turns out I had HGST drives, 10 x 6TB. It’s been 4 years since I built this NAS; my last NAS had WD Reds, and I forgot these were HGSTs. Anyway, warranty is out of the question: standard is 3 years and this drive was bought 4 years ago.

Hostile · February 21, 2020, 4:37am

Advertised at 1 million hours MTBF.
34K on my clock, almost a million.
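
For context, MTBF is a fleet-wide statistic, not a lifetime promise for any single drive. Here is a rough sketch of the arithmetic, assuming a constant failure rate (the exponential model, which is a simplification and not something the datasheet guarantees). It also counts whole-drive failures only, not a handful of pending sectors. The function names are just for illustration.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days


def annualized_failure_rate(mtbf_hours):
    """Probability that a single drive fails within one year of operation."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def prob_any_failure(mtbf_hours, drives, hours):
    """Probability that at least one of `drives` fails within `hours`."""
    return 1.0 - math.exp(-drives * hours / mtbf_hours)


afr = annualized_failure_rate(1000000)
p_any = prob_any_failure(1000000, drives=10, hours=34000)
print("AFR per drive: {:.2%}".format(afr))
print("At least one failure among 10 drives in 34K hours: {:.0%}".format(p_any))
```

Under these assumptions a 1 million hour MTBF works out to a bit under 1% per drive per year, and a little under a 30% chance that at least one of ten drives fails within 34K hours.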

Airstripone · February 21, 2020, 7:03am

Just a point of principle: that error message doesn’t mean the drive is dead yet, just that it has some duff sectors. 16 is not enough to pull the disk and a scrub should deal with it, but keep an eye on the drive. True, it is a sign that it is on its way out, but you can live with it for a while if you have RAIDZ2 and a hot spare.

You can migrate any “pre-fail” disks down to a lower priority array like the CCTV or secondary backup and keep them going a bit longer.

For future reference, when a device is actually unreadable the pool will say “Degraded” and tell you which drive is out of action. Guidance here:

+1. This is no longer in the active maintenance train. Back up and upgrade to 11.3. It is stable.

Trooper_ish · February 21, 2020, 8:21am

Can OP export the pool, then re-import using /diskid for a more meaningful name?

I was thinking of the Linux way of using /dev/disk/by-id but the internet says maybe?

Shukaze · February 21, 2020, 10:22am

I’ve had a similar situation and it turned out to be just a bad SATA cable, easy fix.

freqlabs · February 21, 2020, 12:56pm

The error is coming from SMART not ZFS as I understand it.

Airstripone · February 21, 2020, 3:38pm

That is a ZFS error message, a standard alert solved with regular pool scrubs or drive replacement.

SMART errors are a little more dramatic.

Edit: this is not accurate - ignore

freqlabs · February 21, 2020, 3:46pm

I just checked, that’s definitely an error message in smartd:

https://github.com/smartmontools/smartmontools/blob/151ab7e65253165d96c2a15c0463df415ab8925c/smartmontools/smartd.cpp#L3534


  PrintOut(LOG_CRIT, "Device: %s, failed to read SMART Attribute Data\n", name);
  MailWarning(cfg, state, 6, "Device: %s, failed to read SMART Attribute Data", name);
  state.must_write = true;
}
else {
  reset_warning_mail(cfg, state, 6, "read SMART Attribute Data worked again");


  // look for current or offline pending sectors
  if (cfg.curr_pending_id)
    check_pending(cfg, state, cfg.curr_pending_id, cfg.curr_pending_incr, curval, 10,
                  (!cfg.curr_pending_incr ? "Currently unreadable (pending) sectors"
                                          : "Total unreadable (pending) sectors"    ));


  if (cfg.offl_pending_id)
    check_pending(cfg, state, cfg.offl_pending_id, cfg.offl_pending_incr, curval, 11,
                  (!cfg.offl_pending_incr ? "Offline uncorrectable sectors"
                                          : "Total offline uncorrectable sectors"));


  // check temperature limits
  if (cfg.tempdiff || cfg.tempinfo || cfg.tempcrit)
    CheckTemperature(cfg, state, ata_return_temperature_value(&curval, cfg.attribute_defs), 0);
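
The same two counters that smartd watches can be read by hand with `smartctl -A`. Below is a minimal sketch, assuming the usual ATA attribute IDs 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable), which are smartd's defaults but may differ on some drives, and assuming the standard ten-column attribute table. `pending_counts` and `read_pending` are made-up helper names.

```python
import subprocess

# Assumption: default attribute IDs; some drives report these differently.
WATCHED_IDS = {197: "current_pending", 198: "offline_uncorrectable"}


def pending_counts(smartctl_attrs):
    """Extract raw values for attributes 197/198 from `smartctl -A` output."""
    counts = {}
    for line in smartctl_attrs.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the raw value is the tenth column.
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id = int(fields[0])
            if attr_id in WATCHED_IDS and fields[9].isdigit():
                counts[WATCHED_IDS[attr_id]] = int(fields[9])
    return counts


def read_pending(device):
    """Run `smartctl -A` on a device (needs root) and parse the result."""
    result = subprocess.run(
        ["smartctl", "-A", device],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return pending_counts(result.stdout)
```

Running `read_pending("/dev/ada0")` on the affected machine would show whether the pending count is still non-zero, which is the condition behind the alert quoted at the top of the thread.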

Hostile · February 21, 2020, 4:31pm

Yes, it is a SMART error and not a ZFS error. I know that this drive might be fine for years to come and I’m not gonna throw it away. I’ll just use it as a tertiary backup. I won’t take the chance of it screwing with my zpool and that’s why I’m replacing it. It could die tomorrow or last another 20 years, but why risk it.

disarrayer · February 21, 2020, 6:31pm

That is really not enough. Is it very hot in there? Lots of vibrations? Brown-outs? Or do you have power management enabled so HDDs power down when not used?

Nevertheless: contact HGST with that screenshot. If they’re in a good mood, they might actually replace it.

Hostile · February 21, 2020, 7:18pm

It’s a well ventilated case with rubber mounts for the HDD trays. The system is connected to a UPS so it’s well protected all around.

Hostile · February 21, 2020, 7:25pm

The product isn’t even listed on their website:

https://www.westerndigital.com/support.hgst.data-center-drives

Mine is a “HGST Deskstar NAS H3IKNAS600012872SN (0S03839)”

disarrayer · February 21, 2020, 7:47pm

Even stranger. I mean; shit happens, but this is way too low…

Might be a model that was not ported to the new DB when WD phased out the brand. Still: try it. Heard some good things about their handling of stuff like that. Maybe you at least get a discount. Worth a mail, don’t you think?

Hostile · February 21, 2020, 8:11pm

Couldn’t hurt I suppose. I shot them an e-mail.

Hostile · February 22, 2020, 11:10pm

Interesting development. I used CCleaner’s Drive Wiper tool and did an advanced overwrite (3 passes). Took about 30 hours to finish, but after that, the SMART error is gone.

Is that normal? I wonder if FreeNAS’s scrub function would have done the same.