FreeNAS: Dealing with 'FAULTED' errors going away on later scrubs?

Hi all,

I had 1x 10TB WD Red Pro drive show up as FAULTED for the entire week, but the alert has now cleared in FreeNAS, with the following explanation:

  pool: big-primary
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0 days 00:00:58 with 0 errors on Mon Apr 20 00:00:59 2020
config:

    NAME                                                STATE     READ WRITE CKSUM
    big-primary                                         ONLINE       0     0     0
      raidz2-0                                          ONLINE       0     0     0
        gptid/c3da3c92-199f-11ea-bef6-b4969130e724.eli  ONLINE       0   214     1
        gptid/b4cde05e-0187-11ea-9ba8-b4969130e724.eli  ONLINE       0     0     0
        gptid/b54a0478-0187-11ea-9ba8-b4969130e724.eli  ONLINE       0     0     0
        gptid/b5bf4293-0187-11ea-9ba8-b4969130e724.eli  ONLINE       0     0     0
        gptid/b63637ad-0187-11ea-9ba8-b4969130e724.eli  ONLINE       0     0     0
        gptid/b6bde417-0187-11ea-9ba8-b4969130e724.eli  ONLINE       0     0     0
        gptid/b7437013-0187-11ea-9ba8-b4969130e724.eli  ONLINE       0     0     0
        gptid/b8b2f2b1-0187-11ea-9ba8-b4969130e724.eli  ONLINE       0     0     0

errors: No known data errors

What should I do? Replace the drive or wait till it gets super dead?

Right now I can't order replacements due to the COVID-19 lockdown, so I'm stuck there as well!

I've got 3x drives showing similar issues in 3 separate pools (inclusive of this one). EEK!

Thanks, Mike

I would check the bay/backplane/cables for any crud, then do a zpool clear and a scrub, letting it complete all the way.
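
Roughly something like this, assuming you want to clear and re-check the whole pool (pool name taken from your status output above):

# clear the logged error counters for the pool, then kick off a fresh scrub
zpool clear big-primary
zpool scrub big-primary

# watch whether the write/checksum counters come back on that member
zpool status -v big-primary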

Is the pool basically empty? Scrub complete in 1 hour on 10TB drives? Sounds pretty good!

I presume you are in an exotic location that gets sunshine and heat? It is not likely a heat issue, but you could try swapping 2 drives in case one in a corner is not getting much cooling, though I doubt that would be the issue.

Cables are good, this has been running trouble free for a couple months now actually.

My two main pools are at ~45% usage; what I call the 'dead-pool' is closer to 80%, but it's made up of old 16x 4TB drives. My configuration is as follows:

(Host 1) big-primary // separate rack & hardware, separate 3KVA UPS backup

(Host 2) big-backup // separate rack & hardware, separate 3KVA UPS backup

big-primary syncs to big-backup via ZFS snapshots daily; dead-pool is on the same host as big-backup, and that one is a local sync thanks to FreeNAS 11.3.
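
(For anyone curious, that nightly sync boils down to an incremental ZFS send/receive under the hood; the snapshot names and host below are made up, the real replication tasks are configured in the FreeNAS GUI.)

# illustrative only - hypothetical snapshot names and target host
zfs send -R -i big-primary@auto-20200419 big-primary@auto-20200420 | \
    ssh host2 zfs recv -dF big-backup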

Yes - I'm in Colombo, Sri Lanka, but since getting this hardware in place, I've been running the AC in the "server room" continuously, although it is turned off during the night when it is very cool. This way the room is fairly humidity-free and cool during the hotter daytime.

big-primary has an average temp of 32-33°C across the drives. Host 2 runs much hotter though…
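
(The temps come from SMART; a rough check looks like this, with the device list adjusted for whatever your controller actually exposes:)

# rough sketch only - da0..da2 here is a guess, substitute your pool members
for d in /dev/da0 /dev/da1 /dev/da2; do
    echo "${d}:"; smartctl -A "${d}" | grep -i temperature
done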

Take the drive out and run a badblocks scan. Note: it runs 4 scans, and it takes around 36 hours to run 1 scan on a 10 TiB drive, so you might want to run it in the background or in a tmux session.

# replace /dev/sdX with the target drive (X = drive letter)
sudo badblocks -b 4096 -nsv /dev/sdX > badsectors.txt

The test is non-destructive, but I would not run it with the drive still in the pool.
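
If the rest of the pool has to stay up while you test, something like this keeps ZFS from trying to use the member in the meantime (gptid is the one showing errors in your status output; just a sketch):

# take the suspect member offline before pulling it
zpool offline big-primary gptid/c3da3c92-199f-11ea-bef6-b4969130e724.eli
# after testing, either bring it back online or replace it
zpool online big-primary gptid/c3da3c92-199f-11ea-bef6-b4969130e724.eli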

Point to note: when I purchased the 8x 10TB set, I ran them through a burn-in process, which included badblocks. What you're suggesting is a good idea, since I've archived the badblocks report from the initial burn-in at first power-on and can compare it with the current run as you've suggested.

Does badblocks mark bad sectors to be ignored by the drive in the future?

No. badblocks only identifies the bad areas; it's the drive's controller that remembers them. The controller is then supposed to know which sectors are bad and stop using them (remapping them to spares).
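
You can see whether the controller has actually remapped anything from the SMART counters; roughly (attribute names vary a bit by vendor):

smartctl -A /dev/sdX | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'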

Kudos for doing a burn-in btw!

Thanks! Also, this is my modified burn-in script. The final report is quite detailed :slight_smile:

Going to steal that if you don't mind.

Not at all :slight_smile:

Here's a tweaked version:

# grab the drive's serial from smartctl so the report is named per-disk
DEVICE=/dev/sdb && SERIAL="$(smartctl -i ${DEVICE} | grep "Serial" | awk '{ print $3 }')"
badblocks -b 4096 -nsv ${DEVICE} > badsectors_${SERIAL}.txt

Got this running in tmux at the moment, thanks!
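
In case anyone wants the same setup, the tmux wrapper is just something like this (session name is arbitrary; DEVICE and SERIAL come from the lines above):

# run the scan detached so it survives the SSH session dropping
tmux new-session -d -s burnin "badblocks -b 4096 -nsv ${DEVICE} > badsectors_${SERIAL}.txt"
# reattach later to check on progress
tmux attach -t burnin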

How have I never thought of that?
… I'm so gonna steal that name! :stuck_out_tongue_closed_eyes:

Quick update - the WD 12TB UltraStar is very, very dead; it ended up with a 400MB+ file listing bad blocks, so I cancelled the run. It didn't even get past 2% on the first pass.

# Serial 8CKJ18VE

[root@mike-fedora postmortems]# badblocks -b 4096 -nsv ${DEVICE} > badsectors_${SERIAL}.txt
Checking for bad blocks in non-destructive read-write mode
From block 0 to 2929721343
Checking for bad blocks (non-destructive read-write test)
Testing with random pattern: badblocks: Input/output error during test data write, block 8444785
badblocks: Input/output error during test data write, block 25156459
badblocks: Input/output error during test data write, block 41868133
badblocks: Input/output error during test data write, block 58579807
badblocks: Input/output error during test data write, block 75291481
badblocks: Input/output error during test data write, block 92003155
3.43% done, 18:56:25 elapsed. (100485100/0/0 errors)

The other WD Red Pro 10TB is doing much better, so I think this can be re-used once the badblocks run is finished…

Testing with random pattern:  79.51% done, 37:37:01 elapsed. (0/0/0 errors)

Oh my, I hope you can get that drive replaced? Does it still have a warranty on it?

Yes, although there's a caveat. Since I'm way out in practically "nowhere important" (Sri Lanka), I source my disks from the US (typically B&H Photo, since they ship via DHL or Amazon). When I register the serial with WD, they flag this as 'Out of region', which may or may not mean that they'll honor the warranty.

For the past few years they've been OK with the Reds/Red Pros, but with the UltraStars they were trying to pull a fast one on me. They basically wrote back saying they'll offer a warranty replacement if I sign a waiver giving up any future claims during the same year (WUT!).

Also, given the recent SMR drive mess, I personally feel the warranty replacement drives are as good as DOA (or dead within 3 months). It's CHEAP money for WD to provide crappy warranty replacement drives; just good business.

Regarding the 10TB WD Red Pro, I ran the non-destructive badblocks test twice and there were 0 bad blocks reported.

Just gave it a quick-wipe and it's resilvering now:

  scan: resilver in progress since Thu Apr 30 18:54:05 2020
	2.17T scanned at 2.01G/s, 102G issued at 282M/s, 35.6T total
	11.7G resilvered, 0.28% done, 1 days 12:40:11 to go
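
(For reference, the underlying operation is just a zpool replace; the FreeNAS GUI handles the gptid/geli plumbing, so the new member name below is only a placeholder.)

# rough equivalent of the GUI's "replace" action - new gptid is a placeholder
zpool replace big-primary gptid/c3da3c92-199f-11ea-bef6-b4969130e724.eli gptid/<new-member>.eli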

It's 10:30am now, ~15 hrs later, and things have sped up:

  pool: big-primary
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Apr 30 18:54:05 2020
    32.7T scanned at 727M/s, 30.4T issued at 675M/s, 35.6T total
    3.71T resilvered, 85.22% done, 0 days 02:16:21 to go
