I had 1x 10TB WD Red Pro drive show up as faulted for the entire week, but it has now gone away in FreeNAS alerts with the following explanation
pool: big-primary
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-9P
scan: scrub repaired 0 in 0 days 00:00:58 with 0 errors on Mon Apr 20 00:00:59 2020
config:
NAME                                              STATE     READ WRITE CKSUM
big-primary                                       ONLINE       0     0     0
  raidz2-0                                        ONLINE       0     0     0
    gptid/c3da3c92-199f-11ea-bef6-b4969130e724.eli  ONLINE       0   214     1
    gptid/b4cde05e-0187-11ea-9ba8-b4969130e724.eli  ONLINE       0     0     0
    gptid/b54a0478-0187-11ea-9ba8-b4969130e724.eli  ONLINE       0     0     0
    gptid/b5bf4293-0187-11ea-9ba8-b4969130e724.eli  ONLINE       0     0     0
    gptid/b63637ad-0187-11ea-9ba8-b4969130e724.eli  ONLINE       0     0     0
    gptid/b6bde417-0187-11ea-9ba8-b4969130e724.eli  ONLINE       0     0     0
    gptid/b7437013-0187-11ea-9ba8-b4969130e724.eli  ONLINE       0     0     0
    gptid/b8b2f2b1-0187-11ea-9ba8-b4969130e724.eli  ONLINE       0     0     0
errors: No known data errors
What should I do? Replace the drive or wait till it gets super dead?
Right now I can't order replacements due to the COVID-19 lockdown either, so I'm stuck there as well!
I've got 3x drives showing similar issues across 3 separate pools (including this one). EEK!
I would check the bay/backplane/cables for any crud, then do a zpool clear and a scrub, letting it complete all the way.
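Something along these lines, using the pool name from the status output above:
# clear the logged read/write/checksum error counters
zpool clear big-primary
# then start a full scrub and let it run to completion
zpool scrub big-primary
# check on progress periodically with
zpool status -v big-primary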
Is the pool basically empty? A scrub completing that quickly on 10TB drives sounds pretty good!
I presume you are in an exotic location that gets sunshine and heat? It's not likely a heat issue, but you could try swapping two drives in case one in a corner doesn't get much cooling; I doubt that's the problem though.
Cables are good; this has been running trouble-free for a couple of months now, actually.
My two main pools are at ~45% usage; what I call the "dead-pool" is closer to 80%, but it's made up of 16x old 4TB drives. My configuration is as follows:
(Host 1) big-primary // separate rack & hardware, separate 3KVA UPS backup
(Host 2) big-backup // separate rack & hardware, separate 3KVA UPS backup
big-primary syncs to big-backup via daily ZFS snapshots. dead-pool is on the same host as big-backup, and that one is a local sync, thanks to FreeNAS 11.3.
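(For context, the daily sync is plain ZFS snapshot replication under the hood. A hand-rolled sketch of what the FreeNAS replication task does, with hypothetical dataset, host, and snapshot names:)
# take a recursive, dated snapshot on the source pool
zfs snapshot -r big-primary/data@daily-2020-04-20
# incremental send to the backup host, based on the previous day's snapshot
zfs send -R -i big-primary/data@daily-2020-04-19 big-primary/data@daily-2020-04-20 | \
    ssh host2 zfs receive -F big-backup/data
# the dead-pool copy is the same idea minus the ssh hop, since it lives on the same host as big-backup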
Yes - I'm in Colombo, Sri Lanka, but since getting this hardware in place I've been running the AC in the "server room" continuously, although it is turned off during the night when it is very cool. This way the room stays fairly humidity-free and cool during the hotter daytime.
big-primary has an average temp of 32-33°C across the drives. Host 2 runs much hotter though…
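(If anyone wants to pull the same numbers, drive temperature is readable from SMART; something like the line below per disk. The device node is just an example — on FreeNAS the disks show up as /dev/adaN or /dev/daN.)
# Temperature_Celsius is usually SMART attribute 194
smartctl -A /dev/ada0 | grep -i temperature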
Take the drive out and run a badblocks scan. Note: it runs 4 passes, and one pass takes around 36 hours on a 10TB drive. You might want to run that in the background or in a tmux session.
# /dev/sdX, where X is the drive letter of the disk under test
# -b 4096: block size, -n: non-destructive read-write mode, -s: show progress, -v: verbose
sudo badblocks -b 4096 -nsv /dev/sdX > badsectors.txt
The test is non-destructive, but I would not run it with the drive still attached to the pool.
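If you go the tmux route, the usual pattern is something like this (session name is arbitrary):
# start a detachable session so the scan survives a dropped SSH connection
tmux new -s badblocks
# inside the session, kick off the scan as before
sudo badblocks -b 4096 -nsv /dev/sdX > badsectors.txt
# detach with Ctrl-b d, re-attach later with:
tmux attach -t badblocks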
Point to note: when I purchased the 8x 10TB set, I ran them through a burn-in process which included badblocks. What you're suggesting is a good idea, since I've archived the badblocks reports from the initial burn-in at first power-on and can compare them with the current run.
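Since the report is just one block number per line, comparing the burn-in run against the current one is a one-liner; a quick sketch, assuming the two files are named as below:
# list blocks that are bad now but were not flagged at burn-in
comm -13 <(sort burnin_badsectors.txt) <(sort badsectors.txt)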
Does badblocks mark bad sectors to be ignored by the drive in future?
No. Badblocks just identifies the bad areas; it's the drive controller that remembers them. The controller is then supposed to know which areas are bad and not use them.
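You can watch that remapping happen from the drive's side via SMART; on most drives these two attributes are the ones to check (device node is an example):
# Reallocated_Sector_Ct = sectors the firmware has already remapped
# Current_Pending_Sector = sectors it wants to remap on the next write
smartctl -A /dev/sdX | grep -Ei 'reallocated|pending'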
Quick update - the WD 12TB UltraStar is very, very dead. It ended up with a 400MB+ file listing bad blocks, so I cancelled the run; it barely got a few percent into the first pass.
# Serial 8CKJ18VE
[root@mike-fedora postmortems]# badblocks -b 4096 -nsv ${DEVICE} > badsectors_${SERIAL}.txt
Checking for bad blocks in non-destructive read-write mode
From block 0 to 2929721343
Checking for bad blocks (non-destructive read-write test)
Testing with random pattern: badblocks: Input/output error during test data write, block 8444785
badblocks: Input/output error during test data write, block 25156459
badblocks: Input/output error during test data write, block 41868133
badblocks: Input/output error during test data write, block 58579807
badblocks: Input/output error during test data write, block 75291481
badblocks: Input/output error during test data write, block 92003155
3.43% done, 18:56:25 elapsed. (100485100/0/0 errors)
The other WD Red Pro 10TB is doing much better, so I think this one can be re-used once the badblocks run is finished…
Testing with random pattern: 79.51% done, 37:37:01 elapsed. (0/0/0 errors)
Yes, although there's a caveat. Since I'm way out in practically "nowhere important" (Sri Lanka), I source my disks from the US (typically B&H Photo, since they ship via DHL, or Amazon). When I register the serial with WD, they flag it as "Out of region", which may or may not mean that they'll honour the warranty.
For the past few years they've been OK with the Reds/Red Pros, but with the UltraStars they tried to pull a fast one on me. They basically wrote back saying they'd offer a warranty replacement if I signed a waiver releasing them from any further claims during the same year (WUT!).
Also, given the recent SMR shingled-drive mess, I personally feel the warranty replacement drives are as good as DOA (or dead within 3 months). It's CHEAP money for WD to provide crappy warranty replacement drives; just good business.
Regarding the 10TB WD Red Pro, I ran the non-destructive badblocks test twice and there were 0 bad blocks reported.
Just gave it a quick wipe and it's resilvering now:
scan: resilver in progress since Thu Apr 30 18:54:05 2020
2.17T scanned at 2.01G/s, 102G issued at 282M/s, 35.6T total
11.7G resilvered, 0.28% done, 1 days 12:40:11 to go
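(For anyone doing this from the shell rather than the FreeNAS UI, the step that kicks off a resilver like this is roughly a zpool replace; the gptid names below are placeholders, not the real ones.)
# swap the old member for the freshly wiped disk; the resilver starts automatically
zpool replace big-primary gptid/<old-partition-id>.eli gptid/<new-partition-id>.eli
# then keep an eye on progress with
zpool status big-primary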
It's 10:30am now, ~15 hrs later, and things have sped up:
pool: big-primary
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Thu Apr 30 18:54:05 2020
32.7T scanned at 727M/s, 30.4T issued at 675M/s, 35.6T total
3.71T resilvered, 85.22% done, 0 days 02:16:21 to go