HDD burn-in testing, or how I learned to fail fast and love the RMA

Hey there!

Today, I woke up to a degraded raidz1 pool, which was still scrubbing. So I stopped the scrub, ran zpool replace to swap the failed drive for the cold spare, and let it resilver. (It just finished resilvering as I’m typing this - now I’m scrubbing again to make sure everything is fine.)
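For reference, the whole dance was roughly this - a sketch with made-up pool/device names, yours will differ:

zpool scrub -s tank                                    # stop the running scrub
zpool replace tank scsi-FAILED_DRIVE scsi-COLD_SPARE   # resilver onto the cold spare
zpool status tank                                      # watch the resilver progress
zpool scrub tank                                       # once resilvered, scrub again to verify everything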

I think I should’ve burned in my HGST Ultrastars before using them. It’s “only” a home server, but it runs the whole show with firewall/router and service VMs, and it’d be terribly unpleasant to lose any of my precious bits.

Back on topic: by burning ’em in prior to putting ’em into production, I would’ve satisfied the “Fail Fast” design pattern. In one of the many tech talks I’ve consumed, someone talked about the failure modes of spinning storage - they either fail early or die of old age, and there’s not much happening in between. And indeed, mine failed at ~1100 Power_On_Hours with reallocated sectors, unable to complete either the short or the extended SMART self-test.
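For reference, this is roughly how the diagnosis looks with smartmontools - the device path is a placeholder:

smartctl -t short /dev/disk/by-id/scsi-YOUR_DRIVE      # queue a short self-test (a couple of minutes)
smartctl -t long /dev/disk/by-id/scsi-YOUR_DRIVE       # …or the extended one (hours on big drives)
smartctl -l selftest /dev/disk/by-id/scsi-YOUR_DRIVE   # self-test log - aborted/failed runs show up here
smartctl -A /dev/disk/by-id/scsi-YOUR_DRIVE            # attributes like Reallocated_Sector_Ct and Power_On_Hours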

Since ZFS also keeps per-device error counters, degrades the pool when a drive misbehaves, and reports it, one might as well use this additional layer of diagnostics.
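In practice that extra layer boils down to zpool status (pool name is a placeholder):

zpool status -x         # either “all pools are healthy” or the name of the pool that isn’t
zpool status -v tank    # per-device READ/WRITE/CKSUM counters and any permanent errors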

With compression on, dumping /dev/zero into a file probably wouldn’t do much, other than rapidly incrementing a few counters without actually writing anything to disk. Afaik, /dev/random or /dev/urandom doesn’t have the throughput that might be desired. I think disabling deduplication and duplicating a whole bunch of ISO images over and over might be what I’m looking for.
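A throwaway dataset with compression and dedup off, filled straight from /dev/urandom, might also do it - a sketch with made-up pool/dataset names, and urandom will likely be the bottleneck, as said:

zfs create -o compression=off -o dedup=off -o primarycache=none tank/burnin
dd if=/dev/urandom of=/tank/burnin/fill.bin bs=1M status=progress    # stop it (or add count=) before the pool fills up
zfs destroy tank/burnin                                              # clean up afterwards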

So, how do you burn-in your spinning storage?
…for how long?
…do you do a verification read test, with checksumming?


I badblocks every new drive with the full destructive read-write test… Because my drives are always above 2 TB, it’s easy to manually switch the block size to something huge so it doesn’t get held up… On 8 TB drives or thereabouts it usually takes 48-72 hours to complete… If I’m doing multiples, I’ll use a tmux session per disk to do it all in parallel…
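Something along these lines… a sketch with placeholder device names; -b and -c bump the block size and blocks-per-pass so big drives don’t get held up, -o logs any bad blocks to a file:

tmux new-session -d -s burnin-a "badblocks -wsv -b 4096 -c 65536 -o bad-a.txt /dev/disk/by-id/DISK_A"
tmux new-session -d -s burnin-b "badblocks -wsv -b 4096 -c 65536 -o bad-b.txt /dev/disk/by-id/DISK_B"
tmux attach -t burnin-a        # peek at progress, detach again with Ctrl-b d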


You can use openssl to generate incompressible (effectively random) data much faster than /dev/urandom:

openssl enc -rc4 -nosalt -pass pass:"$(dd if=/dev/urandom bs=1k count=1 2>/dev/null|base64)" </dev/zero
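To actually put that stream on a disk, pipe it straight into dd - roughly like this, with a placeholder target path (and note that on OpenSSL 3.x the rc4 cipher may need the legacy provider, or you can swap in something like -aes-128-ctr):

openssl enc -rc4 -nosalt -pass pass:"$(dd if=/dev/urandom bs=1k count=1 2>/dev/null|base64)" </dev/zero \
  | dd of=/dev/disk/by-id/scsi-MAKE_SURE_ITS_THE_RIGHT_ONE bs=1M status=progress    # destructive: overwrites the whole drive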


As I’m reading the badblocks manpage, I think you’re referring to the -w parameter. The manpage states: “This option may not be combined with the -n option, as they are mutually exclusive.”

(-n = Use non-destructive read-write mode.)

i.e. sudo badblocks -w /dev/disk/by-id/scsi-MAKE_SURE_ITS_THE_RIGHT_ONE #lol

EDIT:
I’ve stumbled over the 32-bit block-count issue in badblocks. Now I’m running the following command on my failed 6 TB Ultrastar:
badblocks -sw -t random -b 4096 /dev/disk/by-id/scsi-MAKE_SURE_ITS_THE_RIGHT_ONE

  • -s show status
  • -w write destructively
  • -t random test pattern
  • -b block size, should be specified if you run into “must be 32 bit” issues

…I know it’s futile, since the drive is already diagnosed via SMART and the RMA has been filed - I just want to see what a failure looks like, and you can’t break what’s already broken
…for science :vulcan_salute: :nerd_face:
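If you want to keep the evidence, -o logs the bad-block list to a file (file name made up here), and smartctl shows whether the reallocation counters moved during the run - same placeholder device path as above:

badblocks -sw -t random -b 4096 -o badblocks-ultrastar.txt /dev/disk/by-id/scsi-MAKE_SURE_ITS_THE_RIGHT_ONE
wc -l badblocks-ultrastar.txt                                    # zero lines = no bad blocks found
smartctl -A /dev/disk/by-id/scsi-MAKE_SURE_ITS_THE_RIGHT_ONE     # check Reallocated_Sector_Ct / Current_Pending_Sector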

EDIT 2:
…sorry, I forgot to post the result of badblocksing that failed Ultrastar: you get a long list of failed sectors/LBAs - if badblocks finishes without printing lots of numbers, the drive is fine. About a month ago, I finished my “hot-prod-burn-in” procedure.

hot-prod-burn-in procedure
Burning in HDDs after provisioning ’em to a hot-production array.
It involved using the cold spare: zpool replace one HDD of the 4-way-Z1 pool with the spare, badblocks the pulled drive, then zpool replace the next one with the previous “validation PASS”.
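One round of that rotation looked roughly like this - a sketch with placeholder pool/device names; repeat it with the freshly validated drive acting as the next spare:

zpool replace tank scsi-DISK_1 scsi-COLD_SPARE                  # swap disk 1 out for the spare
zpool status tank                                               # wait for the resilver to finish
badblocks -wsv -b 4096 -t random /dev/disk/by-id/scsi-DISK_1    # destructive burn-in of the pulled drive
zpool replace tank scsi-DISK_2 scsi-DISK_1                      # disk 1 passed, so it becomes the next spare
badblocks -wsv -b 4096 -t random /dev/disk/by-id/scsi-DISK_2    # …and so on for disks 3 and 4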

fun fact
After running zpool replace on every single HDD in that 4-way-Z1 pool, zpool scrub finishes much earlier, with higher overall throughput.

