HDD SATA Advice for ZFS, many dead WD UltraStar drives

Hi all,

I’m running 3x vdevs of 8x HDDs each, and for my primary backup array I’ve been switching the drives over to 16TB UltraStar (SATA) units. Since I live in the boonies, I order these from Amazon or B&H Photo, usually as single-pack “Retail” drives.

Since I’m not in the US, warranty support gets voided as “out of region”. In the past year I’ve probably had 3-4 drives present errors. The most recent case was 2x units with read/write errors after just 2 months of operation.

Each pair of drives runs ~$700, and this is getting expensive, really fast!

I’m starting to pull my hair out - Any suggestions? Should I try something from Seagate?

Thanks!
Cheers, Mike.

In general, for the same class of HDD the failure rate will be about the same, the exception being particularly bad batches/designs, of which Seagate used to be the champion (10+ years ago). Nowadays, looking at the stats, that’s not the case anymore.

You’re running 24 drives, so proper mounting/vibrations/power/temperature may be a factor in the failure rate …
Also, I’m not familiar with the latest generation of Ultrastars, so I’m assuming they’re not SMR and you’re not using them for a purpose they weren’t made for?

2 Likes

These are the drives that I’ve noticed issues with:

They are marked as CMR. I’ve now started to shift the replacement drives to different slots on my HBA to potentially isolate any “bad” connections as well.

Having run these arrays for over 2 years now, I also had my very first “data corruption” event, and ZFS nicely corrected it. Was pretty sweet to finally see this in action.

Previously, all my issues have been read/write errors in the zpool report.

Oh, I’m also running my drives like this:

  • Primary host (Norco chassis): 8x drives (1x vdev)
  • Secondary host (Supermicro): 16x drives (2x vdevs). I’ve got 8 at the top and 8 at the bottom, with 4x2 (2x rows) empty in the middle.

The primary replicates to backup 1, then it makes a local replication to backup 2 on the secondary host.

Yes, I understand that these aren’t 3-2-1 backups, but given the volume of data, it’s the best I can do; with super-critical data being pushed into BackBlaze.

What specific errors are you getting?

I’d be suspicious of the cables, the backplane (if any), and the HBA first and foremost.

1 Like

i would try going to a shop.
picking them up and taking them home yourself.
one of the biggest killers of drives is home delivery.

not kidding, every drive i have had delivered to my door has failed within warranty, more often than not long before the warranty was out.
8 drives across 4 builds failed in 1 year :confused:
one of them had 4 drives, 2 in raid 0, all 4 failed within 4 months.

every one i picked up from the shop.
still working as far as i know.

tried seagate and toshiba. both have been decent brands for longevity on 1-2 tb drives in 20+ builds.
but again i picked all of em up from the store. while only a few of them have been for nas/jbod, they do the job well enough for my mates’ and friends-of-family customers’ needs.

higher capacities i can’t speak to.

2 Likes

A lot of the time the errors look like this - sadly I haven’t saved the system log, but typically it’s a read/write type error. Next time I stumble on this I’ll make sure to share it here.

	NAME                                                STATE     READ WRITE CKSUM
	big-backup                                          DEGRADED     0     0     0
	  raidz2-0                                          DEGRADED     0     0     0
	    gptid/959a3de9-c80f-11ea-974f-ac1f6bbcf2fa.eli  ONLINE       0     0     0
	    gptid/10c68366-10c7-11eb-a6e2-ac1f6bbcf2fa.eli  ONLINE       0     0     0
	    gptid/421c7812-f968-11ea-974f-ac1f6bbcf2fa.eli  FAULTED     22   397     0  too many errors
	    gptid/cee7ea93-aeba-11ea-854f-ac1f6bbcf2fa.eli  ONLINE       0     0     0
	    gptid/fbc55005-bfba-11ea-974f-ac1f6bbcf2fa.eli  ONLINE       0     0     0
	    gptid/ea6a92ba-258d-11eb-a6e2-ac1f6bbcf2fa.eli  ONLINE       0     0     0
	    gptid/6e56c621-1ab8-11eb-a6e2-ac1f6bbcf2fa.eli  ONLINE       0     0     0
	    gptid/e8e5a1d3-4afb-11ea-b7e7-ac1f6bbcf2fa.eli  ONLINE       0     0     0

All my drives are shipped from the US and make a LONG trip to Sri Lanka in a box via DHL or UPS - basically a trip halfway across the globe.

The most recent pair had the drives inside a LARGE box, then each drive inside its own box, with plastic “ears” to hold the drives. One of the “ears” was fully shattered like glass.

So …you’re definitely onto something!!

1 Like

Look into the SMART data on the drives for the errors; that can help clarify what kind of error it is.

If the drive is reallocating sectors, then yeah it’s a drive problem.

Things like cable problems generate transmission errors that show up under attributes like Ultra DMA CRC Error Count.

Also note that the values in SMART data can’t be taken at face value. They could be a percentage, an absolute value, or just straight up weird and arbitrary bullshit.
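To make the distinction above concrete, here’s a minimal sketch of that triage logic: parse the attribute table from `smartctl -A` output and flag the two attributes discussed (reallocated sectors vs. UDMA CRC errors). The attribute names follow the usual ATA convention; your drive’s table and raw-value formats may differ, so treat this as illustrative, not definitive.

```python
import re

def parse_smart_attributes(smartctl_output: str) -> dict:
    """Map attribute name -> raw value from `smartctl -A` style text."""
    attrs = {}
    for line in smartctl_output.splitlines():
        # Data rows start with the numeric attribute ID and end with the raw value.
        m = re.match(r"\s*(\d+)\s+(\S+)\s+.*\s(\d+)\s*$", line)
        if m:
            attrs[m.group(2)] = int(m.group(3))
    return attrs

def triage(attrs: dict) -> str:
    """Rough first guess at drive-vs-link problems from two SMART attributes."""
    if attrs.get("Reallocated_Sector_Ct", 0) > 0:
        return "drive problem (media is reallocating sectors)"
    if attrs.get("UDMA_CRC_Error_Count", 0) > 0:
        return "link problem (check cables/backplane/HBA)"
    return "no smoking gun in these two attributes"
```

Note this only looks at the raw values; as mentioned, some vendors report raw values in odd encodings, so a nonzero number is a prompt to look closer, not a verdict.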

Also, what do you mean by “failure”? As in the drive completely stops working, or you pulled it because of errors in ZFS?

well that sucks… guess there’s no local newegg or similar close by…
also just saw what’s going on over there.
stay safe mate.

one thing that can save some wear on drives is the way you mount em.
the other real killer of mechanical drives is sympathetic vibrations.
so make sure your cages are snug and the drives are snug in them.

dampen the drives with rubber grommets of various densities, as they will change the vibrations each drive absorbs from and emits to the other drives in the box.
the aim being to stop the drives from syncing up, which adds wear to the moving parts.

much like what’s going on here. a similar effect happens inside your drives, with the motors and actuators having to absorb the vibrations till they hit a sync harmonic.
as i said, this adds extra wear to components, and if the harmonics aren’t dampened to a frequency the drives won’t sync to, you can destroy a whole array pretty quickly through harmonic resonance.

3 Likes

I think Hitachi drives are better than Seagate, at least of the two types I’m using. I swapped out the Seagates in one of my arrays for Hitachis and it’s a lot better now: cooler, and no error buildup in ZFS.

I don’t think those errors mean the drive is gone; you can probably resilver them away, which is what I was doing until one broke and I decided to replace them all, one at a time.

SAS drives are better because they are designed for arrays, see the metronome post.

I would suggest you start again building a new server using what you’ve learned and maybe this time the drives will be long lived.

1 Like

I’m running WD Red Plus at the moment, no complaints so far, other than WD as a company being traitorous fucks who keep screwing people with firmware changes, submarined SMR drives in the normal red line, etc.

Recent (lol) drive history for me: ran a bunch of WD blacks in ZFS for 6-7 years just fine. Replaced with a bunch of Seagate drives (forget which), died after 4 years or so. Replaced with the red plus a few months ago.

I’m not sure, and you may want to check - but wouldn’t the 16 TB drives be SMR? If so… bad for ZFS. If not (CMR), play on… but definitely check the spec sheet of any spinning rust you buy these days. SMR doesn’t work properly with RAID or ZFS.

With both previous sets of drives, ZFS has done its thing and scrubbed/fixed/worked just fine until one of the drives totally died and/or I replaced the pool.

1 Like

I’m with @HEXiT on this, rough delivery can permanently and significantly reduce the life of an HDD; I’m sure we’ve all heard the yahoo shopping cart horror story about this.
My suggestion, if you can’t get the HDDs locally, would be to order a large number of drives all at once; most reputable sellers will ship larger quantities of HDDs in their original packaging, which is designed to protect the drives better than “normal” packing.

Here’s the original box of some Toshiba MG08s, which basically says that as long as more than 11 drives are shipped at once, they get the superior packing:

3 Likes

My MG08s came with exactly the same packaging. I didn’t get the full box (5 drives), but shittons of extra padding. Always good when your retailer and manufacturer know how to do things properly.

slightly offtopic: How are your MG08s doing? Got 6 drives running for 9 months now (around 5.5k hours) without any complaints. I’m really happy with them so far.

1 Like

Nice. I bet the sellers use the OEM packing when they have extras laying around from breaking bulk shipments into low drive-count orders… hopefully that will be the case for my next order.

I’ve only had them ~5 months so far, but everything has been good. I’ve got 12 running currently and am going to purchase some more to expand to 16 soon.
I had been researching which drives to get after my HGSTs started to get long in the tooth, and after the Q1 Backblaze report came out saying the MG08s were superior to other contemporary drives, I pulled the trigger.

I’ve checked, the 16TB UltraStar drives are CMR. I don’t touch SMR stuff, if I can help it!

2 Likes

Recently, I’ve been a tad lazy and just replace drives whenever ZFS posts read/write errors. I also tried the wipe-and-resilver approach, but within a few days errors would stack up on the same drive again.

In the past I used to be on the ball, and any time I pulled a drive I would subject it to a stringent script, which would:

  • Run a short SMART test
  • Run an extended SMART test
  • dd zeros over the whole drive
  • Run a full badblocks pass
  • Run another short SMART test
  • Run a final extended SMART test

It would log everything to a super long, detailed log file; over the course of a few years I built up a git repo with all these logs. Most times I would throw a drive back into circulation, it would throw up issues again ~6 months later. Since then, I’ve stopped doing this.
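For reference, the six-step sequence above could be sketched as a simple command builder; this is not the original script, and the device path plus the smartctl/badblocks flags are illustrative assumptions.

```python
# Sketch of one burn-in pass as described above. The flags shown
# (-t short/long, -wsv) are common smartmontools/badblocks usage,
# but check your own versions before running anything destructive.
def burnin_commands(dev: str) -> list[str]:
    """Return the shell commands for one burn-in pass, in order."""
    return [
        f"smartctl -t short {dev}",         # short SMART self-test
        f"smartctl -t long {dev}",          # extended SMART self-test
        f"dd if=/dev/zero of={dev} bs=1M",  # zero the whole drive (destructive!)
        f"badblocks -wsv {dev}",            # full destructive badblocks pass
        f"smartctl -t short {dev}",         # short SMART test again
        f"smartctl -t long {dev}",          # final extended SMART test
    ]

if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in burnin_commands("/dev/da11"):
        print(cmd)
```

Printing the commands rather than running them keeps the sketch safe; the real script would also wait for each SMART self-test to finish and append results to the log.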

Quick example - I’ve got two 16TB drives in the same vdev with two issues. One is failing SMART checks; the other has errors as recent as the 13th of July. From the 14th onwards the issue hasn’t appeared in the log, and ZFS hasn’t raised any warnings (yet!).

Jul 13 22:09:46 freenas-backup 	(da11:mpr0:0:84:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 915 Aborting command 0xfffffe000189f350
Jul 13 22:09:46 freenas-backup mpr0: Sending reset from mprsas_send_abort for target ID 84
Jul 13 22:09:46 freenas-backup 	(da11:mpr0:0:84:0): READ(16). CDB: 88 00 00 00 00 05 66 a7 e8 18 00 00 00 08 00 00 length 4096 SMID 581 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Jul 13 22:09:46 freenas-backup 	(da11:mpr0:0:84:0): READ(16). CDB: 88 00 00 00 00 01 ab 34 ca e0 00 00 00 10 00 00 length 8192 SMID 274 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): READ(16). CDB: 88 00 00 00 00 05 66 a7 e8 18 00 00 00 08 00 00 
Jul 13 22:09:46 freenas-backup 	(da11:mpr0:0:84:0): READ(16). CDB: 88 00 00 00 00 05 66 a7 e7 90 00 00 00 08 00 00 length 4096 SMID 845 terminated ioc 804b l(da11:mpr0:0:84:0): CAM status: CCB request completed with an error
Jul 13 22:09:46 freenas-backup oginfo 31130000 scsi 0 state c xfer 0
Jul 13 22:09:46 freenas-backup 	(da11:mpr0:0:84:0): READ(10). CDB: 28 00 c7 4f 30 10 00 00 10 00 length 8192 SMID 489 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Jul 13 22:09:46 freenas-backup 	(da11:mpr0:0:84:0): READ(16). CDB: 88 00 00 00 00 01 3c be a5 88 00 00 00 28 00 00 length 20480 SMID 383 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Jul 13 22:09:46 freenas-backup 	(da11:mpr0:0:84:0): READ(16). CDB: 88 00 00 00 00 01 3c be a5 58 00 00 00 28 00 00 length 20480 SMID 145 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Jul 13 22:09:46 freenas-backup 	(da11:mpr0:0:84:0): READ(16). CDB: 88 00 00 00 00 01 3c be a5 20 00 00 00 08 00 00 length 4096 SMID 224 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Jul 13 22:09:46 freenas-backup 	(da11:mpr0:0:84:0): READ(16). CDB: 88 00 00 00 00 01 3c be a4 20 00 00 01 00 00 00 length 131072 SMID 986 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Jul 13 22:09:46 freenas-backup mpr0: Unfreezing devq for target ID 84
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): Retrying command
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): READ(16). CDB: 88 00 00 00 00 01 ab 34 ca e0 00 00 00 10 00 00 
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): CAM status: CCB request completed with an error
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): Retrying command
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): READ(16). CDB: 88 00 00 00 00 05 66 a7 e7 90 00 00 00 08 00 00 
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): CAM status: CCB request completed with an error
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): Retrying command
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): READ(10). CDB: 28 00 c7 4f 30 10 00 00 10 00 
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): CAM status: CCB request completed with an error
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): Retrying command
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): READ(16). CDB: 88 00 00 00 00 01 3c be a5 88 00 00 00 28 00 00 
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): CAM status: CCB request completed with an error
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): Retrying command
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): READ(16). CDB: 88 00 00 00 00 01 3c be a5 58 00 00 00 28 00 00 
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): CAM status: CCB request completed with an error
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): Retrying command
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): READ(16). CDB: 88 00 00 00 00 01 3c be a5 20 00 00 00 08 00 00 
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): CAM status: CCB request completed with an error
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): Retrying command
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): READ(16). CDB: 88 00 00 00 00 01 3c be a4 20 00 00 01 00 00 00 
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): CAM status: CCB request completed with an error
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): Retrying command
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): CAM status: Command timeout
Jul 13 22:09:46 freenas-backup (da11:mpr0:0:84:0): Retrying command
Jul 13 22:09:47 freenas-backup (da11:mpr0:0:84:0): READ(16). CDB: 88 00 00 00 00 01 3c be a5 20 00 00 00 08 00 00 
Jul 13 22:09:47 freenas-backup (da11:mpr0:0:84:0): CAM status: SCSI Status Error
Jul 13 22:09:47 freenas-backup (da11:mpr0:0:84:0): SCSI status: Check Condition
Jul 13 22:09:47 freenas-backup (da11:mpr0:0:84:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jul 13 22:09:47 freenas-backup (da11:mpr0:0:84:0): Retrying command (per sense data)