ZFS Z2, 2 Drives Faulted. What next?

Hi everyone,

Not the first time I have had drives go bad… but it is the first time I have had 2 fault at the same time. Thankfully it’s a Z2 array, and I have a cold spare waiting to go in. But with 2 drives down, I am a little wary of how to approach this.

Would it make sense to do a zpool clear to force it to think everything is OK, and then replace one of the drives with my cold spare? See how the resilver goes, do a scrub, and monitor the situation? I don’t want to get myself into a worse situation by jumping to any conclusions prematurely.
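
In command terms, the plan I’m picturing is roughly this; the gptid and device node are placeholders, so treat it as a sketch rather than a recipe:

zpool clear pergamum
zpool replace pergamum <faulted-gptid> <cold-spare-device>
zpool scrub pergamum    # once the resilver completes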

I have had a few SMART errors pop up over the past few SMART tests; I guess I didn’t dig deep enough into them because I thought the drives were still working fine (I have had erroneous errors in the past that didn’t actually result in bad drives or any corruption). Looking at the SMART status of da5 (the drive with 62 faults according to zpool status), I am seeing:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       43
  3 Spin_Up_Time            0x0027   186   161   021    Pre-fail  Always       -       5700
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       234
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   028   028   000    Old_age   Always       -       52765
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       228
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       224
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       131
194 Temperature_Celsius     0x0022   121   105   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   055   000    Old_age   Always       -       608
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

da7 (Drive with 61 faults):

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   184   159   021    Pre-fail  Always       -       5766
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       235
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   028   028   000    Old_age   Always       -       52758
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       228
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       224
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       107
194 Temperature_Celsius     0x0022   120   105   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged
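
For reference, both dumps above came from running smartctl against each device node:

smartctl -a /dev/da5
smartctl -a /dev/da7

The -a flag prints the vendor attribute table and the drive’s internal error log shown above.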

Zpool status:

  pool: pergamum
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
	repaired.
  scan: scrub repaired 1.89M in 06:25:16 with 0 errors on Thu Dec 21 06:25:36 2023
config:

	NAME                                            STATE     READ WRITE CKSUM
	pergamum                                        DEGRADED     0     0     0
	  raidz2-0                                      DEGRADED     0     0     0
	    gptid/ab0351e8-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
	    gptid/abbfceac-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
	    gptid/e3c7752a-1fc4-11ea-8e70-000c29cab7ac  ONLINE       0     0     0
	    gptid/6ebdcf54-ac93-11ec-b2a3-279dd0c48793  ONLINE       0     0     0
	    gptid/ae0d7e64-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
	    gptid/aeca106f-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
	    gptid/af89686d-44ea-11e8-8cad-e0071bffdaee  FAULTED     61     0     0  too many errors
	    gptid/b04ad4fc-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
	    gptid/b10b6452-44ea-11e8-8cad-e0071bffdaee  FAULTED     62     0     0  too many errors
	    gptid/b1d949c1-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0

errors: No known data errors

What should my next steps here be?

To stave off any questions about hardware: it can all be found in my signature, but the controller is an H310 going to a SAS expander, and it has been in good working order for 7+ years (minus a few drive failures over the years). It’s possible a SAS → SATA cable (or two) is going bad; it’s happened to me before. But I am thinking this is not that sort of situation.

Yes, I’d do this, and replace da5 first since it has the higher read error rate.

Out of curiosity, how old/fragmented is the pool? This can significantly affect the success of resilvers.


The pool is ~6 years old, but I do scrubs every 2 weeks and it’s typically in good health; 2 faults back to back during the same scrub seems statistically insane…

Both of these drives are original to this pool, both 6 years old.

Snapshot of some of the info for fun:


/dev/da9	WD-	PASSED	28*C	1y 8m 23d 4h	63	0	0	0	0	0	0	100%	0	Short
/dev/da4	WD-	PASSED	30*C	6y 0m 8d 9h	234	0	0	0	0	0	0	100%	0	Short
/dev/da0	WD-	PASSED	29*C	5y 7m 13d 19h	214	0	0	0	0	0	0	100%	0	Short
/dev/da2	WD-	PASSED	29*C	6y 0m 8d 12h	239	0	0	0	0	0	0	100%	0	Short
/dev/da8	WD-	PASSED	28*C	3y 11m 28d 2h	92	0	0	0	0	0	0	100%	0	Short
/dev/da1	WD-	PASSED	27*C	6y 0m 7d 16h	235	0	0	0	0	0	0	100%	0	Short
/dev/da3	WD-	PASSED	27*C	6y 0m 8d 0h	239	0	0	0	0	0	0	100%	0	Short
/dev/da5	WD-	PASSED	29*C	6y 0m 8d 11h	234	0	0	0	0	0	608	100%	0	Short
/dev/da6	WD-	PASSED	30*C	6y 0m 7d 7h	227	0	0	0	0	0	0	100%	0	Short
/dev/da7	WD-	PASSED	30*C	6y 0m 8d 4h	235	0	0	0	0	0	0	100%	0	Short

How do I check fragmentation exactly…? It’s at 12% fragmented, 56% capacity.
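
For what it’s worth, those numbers come straight from zpool list; as I understand it, the FRAG column is freespace fragmentation and CAP is used capacity:

zpool list pergamum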

And what would you suggest I do exactly? Do a zpool clear (will this online all of the drives? Do I need to do anything to tell ZFS to use the faulted-out drives?). Once they are all online, I would use the GUI to replace da5?

This is good. I’ve run across so many people who think ZFS doesn’t need to scrub because of the inline integrity checks during reads.

I realize now that I can tell from your scrub times that your data isn’t badly fragmented (I’m assuming your drives are >4TB; that number would actually be pretty bad for a <4TB drive). My concern was that badly fragmented ZFS pools need to do a high proportion of random seeking in order to resilver, compared to a “fresh” pool that mostly needs to do sequential reads. Lots of random seeks can be very hard on HDDs and can even push older, healthy-ish ones over the edge into failure.

ZFS can’t detect how fragmented it is, but you can check freespace fragmentation as a very, very loose proxy. An even better proxy is how much your scrub times increase with age/use.

Are you on TrueNAS?

Yea, I know a lot of folks over the last ~5 years have jumped into ZFS not really knowing what they are doing. I only sort of know what I am doing, which is why I am unsure how to address this issue properly. I know enough to know enough, but… it’s always good to seek additional advice.

They are 4 TB drives, 5400 RPM, WD Reds (yes, CMR Reds, not SMR Reds, fucking WD…).

Yes, TrueNAS Core 13.

Hmm, I went to reboot TrueNAS after I added my cold spare, but the GUI didn’t seem to detect it. So I went for a reboot. It seems to be hanging, and looking in the Proxmox console, I am now seeing this:

I didn’t look at the console prior to issuing the reboot command from the TrueNAS GUI, but this makes it seem like maybe I do have a failing controller? It seems awfully mad about something. Looks like da0 was also removed from the array?

I don’t have any previous history as I can’t scroll up… I am at a bit of a loss. TrueNAS is still hung trying to reboot.

Hmmm… I didn’t factor 5400 RPM into my guesstimating. I think you’re still doing okay-ish on fragmentation.

Yeah, you’re thinking the same as me. What controller are the drives connected to? An HBA, or directly to the motherboard?

SAS HBA → SAS expander. The HBA is an H310. Just rebooted, and the Proxmox console is showing this for the TrueNAS VM:

This is curious: upon reboot, all drives are reporting correctly (even with the above errors being shown on the console), and apparently it was 71% done with a resilver… and still trying to resilver? I shut it down; if there is a bad HBA throwing whatever these errors above are, I don’t think trying to resilver is a good idea. Unless I am reading that wrong and it is just spitting out errors for a particular drive? I don’t know, I am starting to be a bit confused as to what is happening here.

But at time of shutdown, all drives reported online and the pool reported healthy. I am not entirely sure I trust that, though.

Next step is to leave TrueNAS off, get myself a new HBA, bring it up, and see what it reports?

Oh, that’s right, I remember reading about the expander earlier. A SAS expander complicates troubleshooting because it’s another possible failure point you need to account for.

100% agree that until you’ve made sure it isn’t the HBA, the expander, or the cables, you shouldn’t try to resilver.
My first step in troubleshooting would be to unplug everything and then plug it back in, hopefully reseating an iffy cable connection if one exists. If that first step doesn’t work, then I’d probably try replacing the HBA.

I did that (unplugged and reseated both cards and all cables), and upon boot I got the error posted a few posts above (the screenshot with da6 throwing SCSI errors).

Maybe it’s just a bad drive throwing all sorts of errors and not the HBA?

Zpool replace is my recommendation. You have to resilver one at a time anyway, so you may as well start with the first disk now, order the second one and continue after the resilver is done.
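
Something like this, substituting the faulted disk’s gptid; the /dev/da10 here is just a made-up example for the new disk:

zpool replace pergamum gptid/b10b6452-44ea-11e8-8cad-e0071bffdaee /dev/da10

If the old disk is still attached and readable, ZFS keeps it in a temporary “replacing” vdev and can read from it during the resilver.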

If you feel particularly uncomfortable with a vulnerable array (no redundancy left for possibly a week or more), you can copy the most valuable stuff to a backup disk. Pool runs fine in degraded mode, it’s just slower and you’re basically running a RAID0 now.
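
As a minimal sketch of that, assuming a dataset named pergamum/important and a backup pool named backup (both names invented here):

zfs snapshot pergamum/important@pre-resilver
zfs send pergamum/important@pre-resilver | zfs recv backup/important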

I wouldn’t tinker, scrub, or do other questionable things until the resilver is done.

If it’s a hardware error on the cable/controller side of things, the drives may be fine once you solve the issue.


Is it consistently da6 every reboot that throws the errors? If so, I would think it less likely that the HBA or cabling is the issue, and more likely that the drive, or possibly the SAS expander, is the problem.

I am certainly uncomfortable with no redundancy, which is why the system is currently off pending next steps. I have gotten a few bits of advice to do a zpool replace, so I think that is the best bet.

I have enough spare cables and an extra slot to plug the cold spare in, in addition to all current drives. Is there a way to leave ALL current drives in the pool and let ZFS add the new drive to the array? Obviously I can’t add a new drive to increase capacity, but is there a mechanism to rebuild the array with the current drives that are online, since when I did boot up, it showed all 10 drives as online?

I am not sure if it’s the same drive each time (the daX number would change, but I have not verified if it’s the same serial number). I do agree, this would help me identify if it’s the drive, or potentially a controller/cable/expander issue.
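
Something like this should tie a device node to a serial number, assuming I have the smartctl flags right:

smartctl -i /dev/da6 | grep -i serial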

Well, assuming the TrueNAS GUI is not nonsensical, I am replacing the worse of the 2 drives with my cold spare now. It looks like it was already resilvering the current worst drive (it is again listed as da6), so once that is finished it will start the replacement with the cold spare.

I am not sure why it is trying to resilver the currently bad drive since it was faulted out… I guess when I rebooted it earlier, the drive came back online and that kicked off a resilver? Either way, the pool is currently listed as online with no data errors across the pool itself, but the worst offending drive is showing 10 read errors currently.

  pool: pergamum
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Dec 21 11:48:40 2023
	19.5T scanned at 30.0G/s, 16.6T issued at 25.6G/s, 20.5T total
	32K resilvered, 81.23% done, 00:02:33 to go
config:

	NAME                                              STATE     READ WRITE CKSUM
	pergamum                                          ONLINE       0     0     0
	  raidz2-0                                        ONLINE       0     0     0
	    gptid/ab0351e8-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
	    gptid/abbfceac-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
	    gptid/e3c7752a-1fc4-11ea-8e70-000c29cab7ac    ONLINE       0     0     0
	    gptid/6ebdcf54-ac93-11ec-b2a3-279dd0c48793    ONLINE       0     0     0
	    gptid/ae0d7e64-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
	    gptid/aeca106f-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
	    gptid/af89686d-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
	    gptid/b04ad4fc-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
	    replacing-8                                   ONLINE       0     0     0
	      gptid/b10b6452-44ea-11e8-8cad-e0071bffdaee  ONLINE      10     0     0  (resilvering)
	      gptid/a1020a2d-a04d-11ee-8a53-0002c95458ac  ONLINE       0     0     0  (awaiting resilver)
	    gptid/b1d949c1-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0

errors: No known data errors

Thanks for the advice everyone, we shall see how this goes. The new drive to replace the potential second bad drive shows up tomorrow; assuming all is well and nothing is worse, I plan to badblocks it as I normally do prior to deploying a drive, and then repeat this process. If the second suspected bad drive starts to yeet itself from the array, I may forgo badblocks, as I think I would rather have a likely-working new drive than a known-bad drive in my array. We will see…
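
For reference, my usual burn-in is a destructive write-mode badblocks pass, roughly like this (the -b 4096 keeps the block count under badblocks’ 32-bit limit on 4 TB drives; -w wipes the disk, so only on drives with no data):

badblocks -b 4096 -wsv /dev/daX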

Any updates? I was going to suggest swapping the cables to different drives if possible, but it seems you went ahead with the replacement.