Storage server bugging out

Posting here in hopes of getting some idea of how to further troubleshoot this. This may be more of a hardware question than really a Linux question but we’ll see where we end up.

What I have here is a prosumer setup with mostly consumer hardware:

It was all working reasonably well for a while until one of the drives started failing. All right, I thought, no big deal: I have a bunch of spares, the resilver is ongoing, nothing to worry about, right?

Wrong.

The resilver would come to a grinding halt and zpool status would report:

errors: List of errors unavailable: pool I/O is currently suspended

Looking at the video output there were a number of kernel errors that I probably should have photographed, but they were all along the lines of:

ataXX: COMRESET failed
Emask something (ATA bus error)

Stuff like that, i.e. communication errors with the drives, I guess. The next thing I tried was taking out the drive that went rogue initially. (I actually also took out the spare, which triggered a spare for the spare and the original. I have 4 spares. I probably should not have done that, but anyway.)

That didn’t really change anything. Since then I’ve done several reboots. It works fine at boot, then it starts spitting out errors on the HDMI out and in some cases just locks up completely, leaving me no recourse other than cutting the power.

Do you get anything from zpool list and zpool status right after reboot?
Do you see controllers in lspci?
What do dmesg/journalctl say?

Probably a cable/controller problem, but this needs more info.
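
One way to narrow down cable vs. controller vs. backplane is to check which disks hang off which controller. Below is a minimal Python sketch, assuming the usual libata sysfs layout in which the resolved /sys/block/<disk> path contains the controller's PCI address followed by an ataN component. The function names are made up for illustration, and the PCI address in the example comment is invented.

```python
import os
import re
from collections import defaultdict

# Matches the PCI function directly above the libata port in a resolved
# sysfs path, e.g. (hypothetical PCI address):
# /sys/devices/pci0000:00/0000:00:1c.0/0000:03:00.0/ata20/host19/target19:0:0/19:0:0:0/block/sdm
_PATH_RE = re.compile(r"/([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])/ata(\d+)/")


def controller_and_port(sysfs_path):
    """Return (pci_address, ata_port) for a resolved sysfs disk path, or None."""
    m = _PATH_RE.search(sysfs_path)
    if m is None:
        return None
    return m.group(1), int(m.group(2))


def disks_by_controller(sys_block="/sys/block"):
    """Group block devices by the PCI address of the controller they sit on."""
    if not os.path.isdir(sys_block):
        return {}
    groups = defaultdict(list)
    for name in sorted(os.listdir(sys_block)):
        resolved = os.path.realpath(os.path.join(sys_block, name))
        hit = controller_and_port(resolved)
        if hit is not None:  # skips loop devices, NVMe, and other non-libata disks
            pci, port = hit
            groups[pci].append((port, name))
    return {pci: sorted(disks) for pci, disks in groups.items()}


if __name__ == "__main__":
    for pci, disks in disks_by_controller().items():
        print(pci)
        for port, name in disks:
            print(f"  ata{port}: /dev/{name}")
```

Run it right after boot, before the kernel starts disabling ports, since a disabled disk may no longer show up under /sys/block. If every failing ataN lands under the same PCI address, the controller or whatever sits behind it becomes the more likely suspect.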

Yeah, I fear it’s the backplane, which would suck. One of those 2 controllers was already in use in a previous incarnation (without the backplane), with a different second controller. With that controller, the disks would not even be recognized.

Anyway I started the monster up again and had a look at journalctl for one of the previous boots and found this:

Nov 19 14:33:33 goliath kernel: ata23.00: exception Emask 0x0 SAct 0xc0000 SErr 0x0 action 0x6 frozen
Nov 19 14:33:33 goliath kernel: ata23.00: failed command: READ FPDMA QUEUED
Nov 19 14:33:33 goliath kernel: ata23.00: cmd 60/20:90:b0:c3:9d/00:00:01:00:00/40 tag 18 ncq dma 16384 in
                                         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 19 14:33:33 goliath kernel: ata23.00: status: { DRDY }
Nov 19 14:33:33 goliath kernel: ata23.00: failed command: READ FPDMA QUEUED
Nov 19 14:33:33 goliath kernel: ata23.00: cmd 60/20:98:d0:c3:9d/00:00:01:00:00/40 tag 19 ncq dma 16384 in
                                         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 19 14:33:33 goliath kernel: ata23.00: status: { DRDY }
Nov 19 14:33:33 goliath kernel: ata23: hard resetting link
Nov 19 14:33:33 goliath kernel: ata32.00: exception Emask 0x0 SAct 0x60000000 SErr 0x0 action 0x6 frozen
Nov 19 14:33:33 goliath kernel: ata32.00: failed command: READ FPDMA QUEUED
Nov 19 14:33:33 goliath kernel: ata32.00: cmd 60/20:e8:d0:be:9d/00:00:01:00:00/40 tag 29 ncq dma 16384 in
                                         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 19 14:33:33 goliath kernel: ata32.00: status: { DRDY }
Nov 19 14:33:33 goliath kernel: ata32.00: failed command: READ FPDMA QUEUED
Nov 19 14:33:33 goliath kernel: ata32.00: cmd 60/20:f0:18:bf:9d/00:00:01:00:00/40 tag 30 ncq dma 16384 in
                                         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 19 14:33:33 goliath kernel: ata32.00: status: { DRDY }
Nov 19 14:33:33 goliath kernel: ata32: hard resetting link
Nov 19 14:33:33 goliath kernel: ata30.00: exception Emask 0x0 SAct 0x6000000 SErr 0x0 action 0x6 frozen
Nov 19 14:33:33 goliath kernel: ata30.00: failed command: WRITE FPDMA QUEUED
Nov 19 14:33:33 goliath kernel: ata30.00: cmd 61/40:c8:08:bc:9d/00:00:01:00:00/40 tag 25 ncq dma 32768 out
                                         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 19 14:33:33 goliath kernel: ata30.00: status: { DRDY }
Nov 19 14:33:33 goliath kernel: ata30.00: failed command: WRITE FPDMA QUEUED
Nov 19 14:33:33 goliath kernel: ata30.00: cmd 61/a0:d0:50:bc:9d/00:00:01:00:00/40 tag 26 ncq dma 81920 out
                                         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 19 14:33:33 goliath kernel: ata30.00: status: { DRDY }
Nov 19 14:33:33 goliath kernel: ata30: hard resetting link
Nov 19 14:33:33 goliath kernel: ata31.00: exception Emask 0x0 SAct 0x30 SErr 0x0 action 0x6 frozen
Nov 19 14:33:33 goliath kernel: ata31.00: failed command: READ FPDMA QUEUED
Nov 19 14:33:33 goliath kernel: ata31.00: cmd 60/20:20:c8:be:9d/00:00:01:00:00/40 tag 4 ncq dma 16384 in
                                         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 19 14:33:33 goliath kernel: ata31.00: status: { DRDY }
Nov 19 14:33:33 goliath kernel: ata31.00: failed command: READ FPDMA QUEUED
Nov 19 14:33:33 goliath kernel: ata31.00: cmd 60/20:28:10:bf:9d/00:00:01:00:00/40 tag 5 ncq dma 16384 in
                                         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 19 14:33:33 goliath kernel: ata31.00: status: { DRDY }
Nov 19 14:33:33 goliath kernel: ata31: hard resetting link
Nov 19 14:33:33 goliath kernel: ata29.00: exception Emask 0x0 SAct 0x18 SErr 0x0 action 0x6 frozen
Nov 19 14:33:33 goliath kernel: ata29.00: failed command: READ FPDMA QUEUED
Nov 19 14:33:33 goliath kernel: ata29.00: cmd 60/20:18:a8:c3:9d/00:00:01:00:00/40 tag 3 ncq dma 16384 in
                                         res 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Nov 19 14:33:33 goliath kernel: ata29.00: status: { DRDY }
Nov 19 14:33:33 goliath kernel: ata29.00: failed command: READ FPDMA QUEUED
Nov 19 14:33:33 goliath kernel: ata29.00: cmd 60/20:20:c8:c3:9d/00:00:01:00:00/40 tag 4 ncq dma 16384 in
                                         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 19 14:33:33 goliath kernel: ata29.00: status: { DRDY }
Nov 19 14:33:33 goliath kernel: ata29: hard resetting link
Nov 19 14:33:38 goliath kernel: ata32: link is slow to respond, please be patient (ready=0)
Nov 19 14:33:38 goliath kernel: ata30: link is slow to respond, please be patient (ready=0)
Nov 19 14:33:38 goliath kernel: ata23: link is slow to respond, please be patient (ready=0)
Nov 19 14:33:38 goliath kernel: ata31: link is slow to respond, please be patient (ready=0)
Nov 19 14:33:38 goliath kernel: ata29: link is slow to respond, please be patient (ready=0)
Nov 19 14:33:43 goliath kernel: ata31: COMRESET failed (errno=-16)
Nov 19 14:33:43 goliath kernel: ata31: hard resetting link
Nov 19 14:33:43 goliath kernel: ata32: COMRESET failed (errno=-16)
Nov 19 14:33:43 goliath kernel: ata32: hard resetting link
Nov 19 14:33:43 goliath kernel: ata23: COMRESET failed (errno=-16)
Nov 19 14:33:43 goliath kernel: ata23: hard resetting link
Nov 19 14:33:43 goliath kernel: ata29: COMRESET failed (errno=-16)
Nov 19 14:33:43 goliath kernel: ata29: hard resetting link
Nov 19 14:33:43 goliath kernel: ata30: COMRESET failed (errno=-16)
Nov 19 14:33:43 goliath kernel: ata30: hard resetting link
Nov 19 14:33:43 goliath kernel: ata30: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 19 14:33:43 goliath kernel: ata29: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 19 14:33:43 goliath kernel: ata31: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 19 14:33:43 goliath kernel: ata23: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 19 14:33:43 goliath kernel: ata32: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 19 14:33:48 goliath kernel: ata31.00: qc timeout (cmd 0xec)
Nov 19 14:33:48 goliath kernel: ata31.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 19 14:33:48 goliath kernel: ata31.00: revalidation failed (errno=-5)
Nov 19 14:33:48 goliath kernel: ata31: hard resetting link
Nov 19 14:33:48 goliath kernel: ata32.00: qc timeout (cmd 0xec)
Nov 19 14:33:48 goliath kernel: ata32.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 19 14:33:48 goliath kernel: ata32.00: revalidation failed (errno=-5)
Nov 19 14:33:48 goliath kernel: ata32: hard resetting link
Nov 19 14:33:48 goliath kernel: ata30.00: qc timeout (cmd 0xec)
Nov 19 14:33:48 goliath kernel: ata30.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 19 14:33:48 goliath kernel: ata30.00: revalidation failed (errno=-5)
Nov 19 14:33:48 goliath kernel: ata30: hard resetting link
Nov 19 14:33:48 goliath kernel: ata23.00: qc timeout (cmd 0xec)
Nov 19 14:33:48 goliath kernel: ata23.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 19 14:33:48 goliath kernel: ata23.00: revalidation failed (errno=-5)
Nov 19 14:33:48 goliath kernel: ata23: hard resetting link
Nov 19 14:33:48 goliath kernel: ata29.00: qc timeout (cmd 0xec)
Nov 19 14:33:48 goliath kernel: ata29.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 19 14:33:48 goliath kernel: ata29.00: revalidation failed (errno=-5)
Nov 19 14:33:48 goliath kernel: ata29: hard resetting link
Nov 19 14:33:49 goliath kernel: ata30: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 19 14:33:49 goliath kernel: ata31: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 19 14:33:49 goliath kernel: ata29: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 19 14:33:49 goliath kernel: ata23: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 19 14:33:49 goliath kernel: ata32: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 19 14:33:59 goliath kernel: ata32.00: qc timeout (cmd 0xec)
Nov 19 14:33:59 goliath kernel: ata32.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 19 14:33:59 goliath kernel: ata32.00: revalidation failed (errno=-5)
Nov 19 14:33:59 goliath kernel: ata32: limiting SATA link speed to 3.0 Gbps
Nov 19 14:33:59 goliath kernel: ata32: hard resetting link
Nov 19 14:33:59 goliath kernel: ata29.00: qc timeout (cmd 0xec)
Nov 19 14:33:59 goliath kernel: ata29.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 19 14:33:59 goliath kernel: ata29.00: revalidation failed (errno=-5)
Nov 19 14:33:59 goliath kernel: ata29: limiting SATA link speed to 3.0 Gbps
Nov 19 14:33:59 goliath kernel: ata29: hard resetting link
Nov 19 14:33:59 goliath kernel: ata23.00: qc timeout (cmd 0xec)
Nov 19 14:33:59 goliath kernel: ata23.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 19 14:33:59 goliath kernel: ata23.00: revalidation failed (errno=-5)
Nov 19 14:33:59 goliath kernel: ata23: limiting SATA link speed to 3.0 Gbps
Nov 19 14:33:59 goliath kernel: ata23: hard resetting link
Nov 19 14:33:59 goliath kernel: ata31.00: qc timeout (cmd 0xec)
Nov 19 14:33:59 goliath kernel: ata31.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 19 14:33:59 goliath kernel: ata31.00: revalidation failed (errno=-5)
Nov 19 14:33:59 goliath kernel: ata31: limiting SATA link speed to 3.0 Gbps
Nov 19 14:33:59 goliath kernel: ata31: hard resetting link
Nov 19 14:33:59 goliath kernel: ata30.00: qc timeout (cmd 0xec)
Nov 19 14:33:59 goliath kernel: ata30.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 19 14:33:59 goliath kernel: ata30.00: revalidation failed (errno=-5)
Nov 19 14:33:59 goliath kernel: ata30: limiting SATA link speed to 3.0 Gbps
Nov 19 14:33:59 goliath kernel: ata30: hard resetting link
Nov 19 14:33:59 goliath kernel: ata23: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
Nov 19 14:33:59 goliath kernel: ata29: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
Nov 19 14:33:59 goliath kernel: ata30: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
Nov 19 14:33:59 goliath kernel: ata31: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
Nov 19 14:33:59 goliath kernel: ata32: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
Nov 19 14:34:05 goliath kernel: nfsd: last server has exited, flushing export cache
Nov 19 14:34:30 goliath kernel: ata23.00: qc timeout (cmd 0xec)
Nov 19 14:34:30 goliath kernel: ata23.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 19 14:34:30 goliath kernel: ata23.00: revalidation failed (errno=-5)
Nov 19 14:34:30 goliath kernel: ata23.00: disabled
Nov 19 14:34:30 goliath kernel: ata23: hard resetting link
Nov 19 14:34:30 goliath kernel: ata31.00: qc timeout (cmd 0xec)
Nov 19 14:34:30 goliath kernel: ata31.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 19 14:34:30 goliath kernel: ata31.00: revalidation failed (errno=-5)

That goes on for a while but I think you get the idea.
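
To get a per-port overview of a log like this, a small tally can help. This is only a sketch: the `tally` function and the event labels are my own naming, and it simply matches the libata message prefixes visible in the excerpt above.

```python
import re
from collections import Counter, defaultdict

# "ata23.00: failed command: ..." or "ata23: COMRESET failed ..."
PORT_RE = re.compile(r"\bata(\d+)(?:\.\d+)?: (.*)")

# label -> prefix of the libata message as it appears in the log
EVENTS = {
    "failed command": "failed command",
    "COMRESET failed": "COMRESET failed",
    "IDENTIFY failed": "failed to IDENTIFY",
    "speed limited": "limiting SATA link speed",
    "disabled": "disabled",
}


def tally(log_text):
    """Count selected libata events per ATA port number in kernel log text."""
    per_port = defaultdict(Counter)
    for line in log_text.splitlines():
        m = PORT_RE.search(line)
        if m is None:
            continue  # continuation lines ("res 40/00:...") carry no port prefix
        port, msg = int(m.group(1)), m.group(2)
        for label, prefix in EVENTS.items():
            if msg.startswith(prefix):
                per_port[port][label] += 1
    return dict(per_port)
```

For example, save the output of `journalctl -k -b -1` to a file and pass its contents to `tally()`. If the same handful of ports fail together every time, that would hint at something they share (controller, cable, backplane, power) rather than at the individual drives.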

This is from the last boot, just dmesg:

[   77.764898] docker_gwbridge: port 2(vethc25e5ec) entered blocking state
[   77.764904] docker_gwbridge: port 2(vethc25e5ec) entered forwarding state
[  297.991975] ata20.00: exception Emask 0x0 SAct 0x6 SErr 0x0 action 0x6 frozen
[  297.991995] ata20.00: failed command: READ FPDMA QUEUED
[  297.992004] ata20.00: cmd 60/08:08:10:56:48/00:00:02:00:00/40 tag 1 ncq dma 4096 in
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  297.992016] ata20.00: status: { DRDY }
[  297.992022] ata20.00: failed command: READ FPDMA QUEUED
[  297.992030] ata20.00: cmd 60/30:10:18:56:48/00:00:02:00:00/40 tag 2 ncq dma 24576 in
                        res 40/00:fe:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[  297.992041] ata20.00: status: { DRDY }
[  297.992049] ata20: hard resetting link
[  297.992060] ata14.00: exception Emask 0x0 SAct 0x18000000 SErr 0x0 action 0x6 frozen
[  297.992073] ata14.00: failed command: WRITE FPDMA QUEUED
[  297.992082] ata14.00: cmd 61/20:d8:e0:55:48/00:00:02:00:00/40 tag 27 ncq dma 16384 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  297.992094] ata14.00: status: { DRDY }
[  297.992099] ata14.00: failed command: WRITE FPDMA QUEUED
[  297.992108] ata14.00: cmd 61/08:e0:08:56:48/00:00:02:00:00/40 tag 28 ncq dma 4096 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  297.992119] ata14.00: status: { DRDY }
[  297.992124] ata14: hard resetting link
[  297.992135] ata19.00: exception Emask 0x0 SAct 0x30 SErr 0x0 action 0x6 frozen
[  297.992147] ata19.00: failed command: WRITE FPDMA QUEUED
[  297.992156] ata19.00: cmd 61/08:20:08:56:48/00:00:02:00:00/40 tag 4 ncq dma 4096 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  297.992168] ata19.00: status: { DRDY }
[  297.992173] ata19.00: failed command: WRITE FPDMA QUEUED
[  297.992182] ata19.00: cmd 61/08:28:08:fb:a5/00:00:59:00:00/40 tag 5 ncq dma 4096 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  297.992196] ata19.00: status: { DRDY }
[  297.992203] ata19: hard resetting link
[  297.992214] ata13.00: exception Emask 0x0 SAct 0x3800 SErr 0x0 action 0x6 frozen
[  297.992226] ata13.00: failed command: READ FPDMA QUEUED
[  297.992235] ata13.00: cmd 60/28:58:30:56:48/00:00:02:00:00/40 tag 11 ncq dma 20480 in
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  297.992247] ata13.00: status: { DRDY }
[  297.992252] ata13.00: failed command: READ FPDMA QUEUED
[  297.992260] ata13.00: cmd 60/08:60:58:56:48/00:00:02:00:00/40 tag 12 ncq dma 4096 in
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  297.992271] ata13.00: status: { DRDY }
[  297.992276] ata13.00: failed command: READ FPDMA QUEUED
[  297.992284] ata13.00: cmd 60/08:68:f0:57:48/00:00:02:00:00/40 tag 13 ncq dma 4096 in
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  297.992295] ata13.00: status: { DRDY }
[  297.992302] ata13: hard resetting link
[  298.306762] ata20: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  303.323973] ata20.00: qc timeout (cmd 0xec)
[  303.323981] ata20.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  303.323983] ata20.00: revalidation failed (errno=-5)
[  303.324000] ata20: hard resetting link
[  303.343966] ata19: link is slow to respond, please be patient (ready=0)
[  303.347952] ata14: link is slow to respond, please be patient (ready=0)
[  303.348046] ata13: link is slow to respond, please be patient (ready=0)
[  308.023964] ata19: COMRESET failed (errno=-16)
[  308.023983] ata19: hard resetting link
[  308.027965] ata14: COMRESET failed (errno=-16)
[  308.027976] ata14: hard resetting link
[  308.027984] ata13: COMRESET failed (errno=-16)
[  308.027995] ata13: hard resetting link
[  308.338998] ata19: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  308.342775] ata13: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  308.342901] ata14: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  308.675963] ata20: link is slow to respond, please be patient (ready=0)
[  313.355961] ata20: COMRESET failed (errno=-16)
[  313.355976] ata20: hard resetting link
[  313.563965] ata14.00: qc timeout (cmd 0xec)
[  313.563972] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  313.563973] ata14.00: revalidation failed (errno=-5)
[  313.563984] ata14: hard resetting link
[  313.563993] ata19.00: qc timeout (cmd 0xec)
[  313.564000] ata19.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  313.564001] ata19.00: revalidation failed (errno=-5)
[  313.564011] ata19: hard resetting link
[  313.564063] ata13.00: qc timeout (cmd 0xec)
[  313.564070] ata13.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  313.564071] ata13.00: revalidation failed (errno=-5)
[  313.564081] ata13: hard resetting link
[  313.670758] ata20: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  313.878791] ata14: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  318.915959] ata19: link is slow to respond, please be patient (ready=0)
[  318.915963] ata13: link is slow to respond, please be patient (ready=0)
[  323.595958] ata13: COMRESET failed (errno=-16)
[  323.595977] ata13: hard resetting link
[  323.604127] ata19: COMRESET failed (errno=-16)
[  323.604139] ata19: hard resetting link
[  323.803953] ata20.00: qc timeout (cmd 0xec)
[  323.803959] ata20.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  323.803960] ata20.00: revalidation failed (errno=-5)
[  323.803972] ata20: limiting SATA link speed to 3.0 Gbps
[  323.803974] ata20: hard resetting link
[  324.059949] ata14.00: qc timeout (cmd 0xec)
[  324.059955] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  324.059956] ata14.00: revalidation failed (errno=-5)
[  324.059967] ata14: limiting SATA link speed to 3.0 Gbps
[  324.059968] ata14: hard resetting link
[  324.207960] ata13: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  324.215958] ata19: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  329.155921] ata20: link is slow to respond, please be patient (ready=0)
[  329.415955] ata14: link is slow to respond, please be patient (ready=0)
[  333.839956] ata20: COMRESET failed (errno=-16)
[  333.839975] ata20: hard resetting link
[  334.103917] ata14: COMRESET failed (errno=-16)
[  334.103929] ata14: hard resetting link
[  334.299964] ata19.00: qc timeout (cmd 0xec)
[  334.299972] ata19.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  334.299973] ata19.00: revalidation failed (errno=-5)
[  334.299985] ata19: limiting SATA link speed to 3.0 Gbps
[  334.299987] ata19: hard resetting link
[  334.299999] ata13.00: qc timeout (cmd 0xec)
[  334.300006] ata13.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  334.300007] ata13.00: revalidation failed (errno=-5)
[  334.300017] ata13: limiting SATA link speed to 3.0 Gbps
[  334.300019] ata13: hard resetting link
[  334.332046] ata20: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[  334.418864] ata14: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[  339.651953] ata19: link is slow to respond, please be patient (ready=0)
[  339.655952] ata13: link is slow to respond, please be patient (ready=0)
[  344.335949] ata13: COMRESET failed (errno=-16)
[  344.335965] ata13: hard resetting link
[  344.335974] ata19: COMRESET failed (errno=-16)
[  344.335986] ata19: hard resetting link
[  344.650603] ata13: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[  349.687950] ata19: link is slow to respond, please be patient (ready=0)
[  354.367950] ata19: COMRESET failed (errno=-16)
[  354.367968] ata19: hard resetting link
[  354.682672] ata19: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[  365.531949] ata20.00: qc timeout (cmd 0xec)
[  365.531957] ata20.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  365.531958] ata20.00: revalidation failed (errno=-5)
[  365.531974] ata20.00: disabled
[  365.531988] ata20: hard resetting link
[  365.532033] ata14.00: qc timeout (cmd 0xec)
[  365.532039] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  365.532040] ata14.00: revalidation failed (errno=-5)
[  365.532050] ata14.00: disabled
[  365.532061] ata14: hard resetting link
[  370.883930] ata20: link is slow to respond, please be patient (ready=0)
[  370.884036] ata14: link is slow to respond, please be patient (ready=0)
[  375.563942] ata20: COMRESET failed (errno=-16)
[  375.563958] ata20: hard resetting link
[  375.563966] ata14: COMRESET failed (errno=-16)
[  375.563982] ata14: hard resetting link
[  375.771934] ata13.00: qc timeout (cmd 0xec)
[  375.771940] ata13.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  375.771942] ata13.00: revalidation failed (errno=-5)
[  375.771951] ata13.00: disabled
[  375.771964] ata13: hard resetting link
[  375.878713] ata20: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[  375.878727] sd 19:0:0:0: [sdm] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[  375.878729] sd 19:0:0:0: [sdm] tag#1 Sense Key : Not Ready [current] 
[  375.878731] sd 19:0:0:0: [sdm] tag#1 Add. Sense: Logical unit not ready, hard reset required
[  375.878733] sd 19:0:0:0: [sdm] tag#1 CDB: Read(16) 88 00 00 00 00 00 02 48 56 10 00 00 00 08 00 00
[  375.878735] print_req_error: I/O error, dev sdm, sector 38295056
[  375.878758] sd 19:0:0:0: [sdm] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[  375.878759] sd 19:0:0:0: [sdm] tag#2 Sense Key : Not Ready [current] 
[  375.878761] sd 19:0:0:0: [sdm] tag#2 Add. Sense: Logical unit not ready, hard reset required
[  375.878762] sd 19:0:0:0: [sdm] tag#2 CDB: Read(16) 88 00 00 00 00 00 02 48 56 18 00 00 00 30 00 00
[  375.878763] print_req_error: I/O error, dev sdm, sector 38295064
[  375.878773] ata20: EH complete
[  375.878857] sd 19:0:0:0: [sdm] tag#3 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.878861] sd 19:0:0:0: [sdm] tag#3 CDB: Read(16) 88 00 00 00 00 00 02 48 56 48 00 00 00 f8 00 00
[  375.878863] print_req_error: I/O error, dev sdm, sector 38295112
[  375.878875] sd 19:0:0:0: [sdm] tag#5 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.878903] sd 19:0:0:0: [sdm] tag#5 CDB: Read(16) 88 00 00 00 00 00 02 48 57 40 00 00 00 70 00 00
[  375.878905] print_req_error: I/O error, dev sdm, sector 38295360
[  375.878986] sd 19:0:0:0: [sdm] tag#7 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.878989] sd 19:0:0:0: [sdm] tag#7 CDB: Read(16) 88 00 00 00 00 00 d7 31 d1 80 00 00 00 08 00 00
[  375.878991] print_req_error: I/O error, dev sdm, sector 3610366336
[  375.879133] sd 19:0:0:0: [sdm] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.879136] sd 19:0:0:0: [sdm] Sense not available.
[  375.879216] sd 19:0:0:0: [sdm] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.879218] sd 19:0:0:0: [sdm] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.879219] sd 19:0:0:0: [sdm] Sense not available.
[  375.879222] sd 19:0:0:0: [sdm] Sense not available.
[  375.879232] sd 19:0:0:0: [sdm] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.879234] sd 19:0:0:0: [sdm] Sense not available.
[  375.879299] sd 19:0:0:0: [sdm] 0 512-byte logical blocks: (0 B/0 B)
[  375.879301] sd 19:0:0:0: [sdm] 4096-byte physical blocks
[  375.879321] sd 19:0:0:0: [sdm] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.879322] sd 19:0:0:0: [sdm] Sense not available.
[  375.879356] sd 19:0:0:0: [sdm] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.879357] sd 19:0:0:0: [sdm] Sense not available.
[  375.879379] sd 19:0:0:0: [sdm] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.879382] sd 19:0:0:0: [sdm] Sense not available.
[  375.879384] sd 19:0:0:0: [sdm] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.879387] sd 19:0:0:0: [sdm] Sense not available.
[  375.879388] sd 19:0:0:0: [sdm] 0 512-byte logical blocks: (0 B/0 B)
[  375.879390] sd 19:0:0:0: [sdm] 4096-byte physical blocks
[  375.879422] sd 19:0:0:0: [sdm] 0 512-byte logical blocks: (0 B/0 B)
[  375.879425] sd 19:0:0:0: [sdm] 4096-byte physical blocks
[  375.879486] sd 19:0:0:0: [sdm] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.879487] sd 19:0:0:0: [sdm] Sense not available.
[  375.879490] sd 19:0:0:0: [sdm] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.879492] sd 19:0:0:0: [sdm] Sense not available.
[  375.935950] ata14: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[  375.935962] sd 13:0:0:0: [sdj] tag#27 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[  375.935964] sd 13:0:0:0: [sdj] tag#27 Sense Key : Not Ready [current] 
[  375.935966] sd 13:0:0:0: [sdj] tag#27 Add. Sense: Logical unit not ready, hard reset required
[  375.935968] sd 13:0:0:0: [sdj] tag#27 CDB: Write(16) 8a 00 00 00 00 00 02 48 55 e0 00 00 00 20 00 00
[  375.935969] print_req_error: I/O error, dev sdj, sector 38295008
[  375.935991] sd 13:0:0:0: [sdj] tag#28 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[  375.935993] sd 13:0:0:0: [sdj] tag#28 Sense Key : Not Ready [current] 
[  375.935994] sd 13:0:0:0: [sdj] tag#28 Add. Sense: Logical unit not ready, hard reset required
[  375.935995] sd 13:0:0:0: [sdj] tag#28 CDB: Write(16) 8a 00 00 00 00 00 02 48 56 08 00 00 00 08 00 00
[  375.935996] print_req_error: I/O error, dev sdj, sector 38295048
[  375.936007] ata14: EH complete
[  375.936040] sd 13:0:0:0: [sdj] tag#29 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.936043] sd 13:0:0:0: [sdj] tag#29 CDB: Write(16) 8a 00 00 00 00 00 59 a5 fb 08 00 00 00 18 00 00
[  375.936044] print_req_error: I/O error, dev sdj, sector 1504049928
[  375.936151] sd 13:0:0:0: [sdj] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.936154] sd 13:0:0:0: [sdj] tag#0 CDB: Write(16) 8a 00 00 00 00 01 16 6c ef 58 00 00 00 08 00 00
[  375.936155] print_req_error: I/O error, dev sdj, sector 4671205208
[  375.936240] sd 13:0:0:0: [sdj] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.936242] sd 13:0:0:0: [sdj] Sense not available.
[  375.936291] sd 13:0:0:0: [sdj] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.936293] sd 13:0:0:0: [sdj] Sense not available.
[  375.936324] sd 13:0:0:0: [sdj] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.936327] sd 13:0:0:0: [sdj] Sense not available.
[  375.936332] sd 13:0:0:0: [sdj] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.936334] sd 13:0:0:0: [sdj] Sense not available.
[  375.936346] sd 13:0:0:0: [sdj] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.936347] sd 13:0:0:0: [sdj] Sense not available.
[  375.936360] sd 13:0:0:0: [sdj] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.936362] sd 13:0:0:0: [sdj] Sense not available.
[  375.936390] sd 13:0:0:0: [sdj] 0 512-byte logical blocks: (0 B/0 B)
[  375.936392] sd 13:0:0:0: [sdj] 4096-byte physical blocks
[  375.936404] sd 13:0:0:0: [sdj] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.936405] sd 13:0:0:0: [sdj] 0 512-byte logical blocks: (0 B/0 B)
[  375.936407] sd 13:0:0:0: [sdj] Sense not available.
[  375.936409] sd 13:0:0:0: [sdj] 4096-byte physical blocks
[  375.936412] sd 13:0:0:0: [sdj] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  375.936414] sd 13:0:0:0: [sdj] Sense not available.
[  375.936449] sd 13:0:0:0: [sdj] 0 512-byte logical blocks: (0 B/0 B)
[  375.936450] sd 13:0:0:0: [sdj] 4096-byte physical blocks
[  375.936459] sd 13:0:0:0: [sdj] 0 512-byte logical blocks: (0 B/0 B)
[  375.936461] sd 13:0:0:0: [sdj] 4096-byte physical blocks
[  376.086801] ata13: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[  376.086813] sd 12:0:0:0: [sdi] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[  376.086815] sd 12:0:0:0: [sdi] tag#11 Sense Key : Not Ready [current] 
[  376.086817] sd 12:0:0:0: [sdi] tag#11 Add. Sense: Logical unit not ready, hard reset required
[  376.086819] sd 12:0:0:0: [sdi] tag#11 CDB: Read(16) 88 00 00 00 00 00 02 48 56 30 00 00 00 28 00 00
[  376.086820] print_req_error: I/O error, dev sdi, sector 38295088
[  376.086848] ata13: EH complete
[  376.086938] sd 12:0:0:0: [sdi] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  376.086939] sd 12:0:0:0: [sdi] Sense not available.
[  376.086947] sd 12:0:0:0: [sdi] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  376.086948] sd 12:0:0:0: [sdi] Sense not available.
[  376.086954] sd 12:0:0:0: [sdi] 0 512-byte logical blocks: (0 B/0 B)
[  376.086955] sd 12:0:0:0: [sdi] 4096-byte physical blocks
[  376.087130] sd 12:0:0:0: [sdi] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  376.087133] sd 12:0:0:0: [sdi] Sense not available.
[  376.087229] sd 12:0:0:0: [sdi] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  376.087235] sd 12:0:0:0: [sdi] Sense not available.
[  376.087258] sd 12:0:0:0: [sdi] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  376.087260] sd 12:0:0:0: [sdi] Sense not available.
[  376.087286] sd 12:0:0:0: [sdi] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  376.087288] sd 12:0:0:0: [sdi] Sense not available.
[  386.011943] ata19.00: qc timeout (cmd 0xec)
[  386.011950] ata19.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  386.011952] ata19.00: revalidation failed (errno=-5)
[  386.011968] ata19.00: disabled
[  386.011981] ata19: hard resetting link
[  386.326677] ata19: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[  386.326687] scsi_io_completion: 4 callbacks suppressed
[  386.326691] sd 18:0:0:0: [sdl] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[  386.326693] sd 18:0:0:0: [sdl] tag#4 Sense Key : Not Ready [current] 
[  386.326695] sd 18:0:0:0: [sdl] tag#4 Add. Sense: Logical unit not ready, hard reset required
[  386.326697] sd 18:0:0:0: [sdl] tag#4 CDB: Write(16) 8a 00 00 00 00 00 02 48 56 08 00 00 00 08 00 00
[  386.326698] print_req_error: 4 callbacks suppressed
[  386.326699] print_req_error: I/O error, dev sdl, sector 38295048
[  386.326725] sd 18:0:0:0: [sdl] tag#5 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[  386.326726] sd 18:0:0:0: [sdl] tag#5 Sense Key : Not Ready [current] 
[  386.326728] sd 18:0:0:0: [sdl] tag#5 Add. Sense: Logical unit not ready, hard reset required
[  386.326729] sd 18:0:0:0: [sdl] tag#5 CDB: Write(16) 8a 00 00 00 00 00 59 a5 fb 08 00 00 00 08 00 00
[  386.326730] print_req_error: I/O error, dev sdl, sector 1504049928
[  386.326742] ata19: EH complete
[  386.326756] sd 18:0:0:0: [sdl] tag#12 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.326757] sd 18:0:0:0: [sdl] tag#12 CDB: Read(16) 88 00 00 00 00 00 02 48 56 30 00 00 00 10 00 00
[  386.326758] print_req_error: I/O error, dev sdl, sector 38295088
[  386.326779] sd 18:0:0:0: [sdl] tag#14 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.326788] sd 18:0:0:0: [sdl] tag#14 CDB: Write(16) 8a 00 00 00 00 00 59 a5 fb 10 00 00 00 10 00 00
[  386.326790] print_req_error: I/O error, dev sdl, sector 1504049936
[  386.326792] sd 18:0:0:0: [sdl] tag#13 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.326815] sd 18:0:0:0: [sdl] tag#16 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.326818] sd 18:0:0:0: [sdl] tag#16 CDB: Read(16) 88 00 00 00 00 00 d7 31 d1 80 00 00 00 08 00 00
[  386.326820] print_req_error: I/O error, dev sdl, sector 3610366336
[  386.326886] sd 18:0:0:0: [sdl] tag#13 CDB: Read(16) 88 00 00 00 00 00 02 48 57 f0 00 00 00 08 00 00
[  386.326887] print_req_error: I/O error, dev sdl, sector 38295536
[  386.326938] sd 18:0:0:0: [sdl] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.326940] sd 18:0:0:0: [sdl] Sense not available.
[  386.326988] sd 18:0:0:0: [sdl] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.326990] sd 18:0:0:0: [sdl] Sense not available.
[  386.327025] sd 18:0:0:0: [sdl] tag#11 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.327036] sd 18:0:0:0: [sdl] tag#11 CDB: Read(16) 88 00 00 00 00 00 02 48 56 40 00 00 00 08 00 00
[  386.327038] sd 18:0:0:0: [sdl] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.327042] sd 18:0:0:0: [sdl] 0 512-byte logical blocks: (0 B/0 B)
[  386.327044] sd 18:0:0:0: [sdl] Sense not available.
[  386.327048] sd 18:0:0:0: [sdl] 4096-byte physical blocks
[  386.327053] print_req_error: I/O error, dev sdl, sector 38295104
[  386.327083] sd 18:0:0:0: [sdl] tag#1 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.327094] sd 18:0:0:0: [sdl] tag#1 CDB: Write(16) 8a 00 00 00 00 01 16 6c ef 58 00 00 00 08 00 00
[  386.327096] print_req_error: I/O error, dev sdl, sector 4671205208
[  386.327104] sd 18:0:0:0: [sdl] tag#8 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.327140] sd 18:0:0:0: [sdl] tag#8 CDB: Read(16) 88 00 00 00 00 00 02 48 56 10 00 00 00 f8 00 00
[  386.327142] print_req_error: I/O error, dev sdl, sector 38295056
[  386.327153] sd 18:0:0:0: [sdl] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.327158] sd 18:0:0:0: [sdl] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.327160] sd 18:0:0:0: [sdl] Sense not available.
[  386.327179] sd 18:0:0:0: [sdl] Sense not available.
[  386.327192] sd 18:0:0:0: [sdl] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.327195] sd 18:0:0:0: [sdl] tag#18 CDB: Read(16) 88 00 00 00 00 00 02 48 57 10 00 00 00 a0 00 00
[  386.327196] print_req_error: I/O error, dev sdl, sector 38295312
[  386.327296] sd 18:0:0:0: [sdl] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.327302] sd 18:0:0:0: [sdl] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.327305] sd 18:0:0:0: [sdl] Sense not available.
[  386.327340] sd 18:0:0:0: [sdl] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.327343] sd 18:0:0:0: [sdl] Sense not available.
[  386.327370] sd 18:0:0:0: [sdl] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.327371] sd 18:0:0:0: [sdl] Sense not available.
[  386.327379] sd 18:0:0:0: [sdl] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.327381] sd 18:0:0:0: [sdl] Sense not available.
[  386.327411] sd 18:0:0:0: [sdl] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.327413] sd 18:0:0:0: [sdl] Sense not available.
[  386.327443] sd 18:0:0:0: [sdl] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.327445] sd 18:0:0:0: [sdl] Sense not available.
[  386.327457] sd 18:0:0:0: [sdl] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.327460] sd 18:0:0:0: [sdl] Sense not available.
[  386.327466] sd 18:0:0:0: [sdl] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.327468] sd 18:0:0:0: [sdl] Sense not available.
[  386.327497] sd 18:0:0:0: [sdl] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.327498] sd 18:0:0:0: [sdl] Sense not available.
[  386.328825] sd 18:0:0:0: [sdl] Sense not available.
[  386.328848] sd 18:0:0:0: [sdl] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.328850] sd 18:0:0:0: [sdl] Sense not available.
[  386.328935] sd 18:0:0:0: [sdl] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.328937] sd 18:0:0:0: [sdl] Sense not available.
[  386.328981] sd 18:0:0:0: [sdl] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  386.328984] sd 18:0:0:0: [sdl] Sense not available.
[  388.322302] sd 19:0:0:0: [sdm] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  388.322305] sd 19:0:0:0: [sdm] Sense not available.
[  388.322338] sd 19:0:0:0: [sdm] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  388.322339] sd 19:0:0:0: [sdm] Sense not available.
[  388.322400] sd 19:0:0:0: [sdm] Write Protect is on
[  388.322402] sd 19:0:0:0: [sdm] Mode Sense: 00 00 04 fb
[  388.323214] sd 18:0:0:0: [sdl] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  388.323216] sd 18:0:0:0: [sdl] Sense not available.
[  388.323247] sd 18:0:0:0: [sdl] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  388.323248] sd 18:0:0:0: [sdl] Sense not available.
[  388.323266] sd 19:0:0:0: [sdm] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  388.323269] sd 19:0:0:0: [sdm] Sense not available.
[  388.323294] sd 18:0:0:0: [sdl] Write Protect is on
[  388.323296] sd 18:0:0:0: [sdl] Mode Sense: 00 00 04 fb
[  388.323315] sd 19:0:0:0: [sdm] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  388.323317] sd 19:0:0:0: [sdm] Sense not available.
[  388.323393] sd 19:0:0:0: [sdm] Write Protect is off
[  388.323396] sd 19:0:0:0: [sdm] Mode Sense: 00 00 00 00
[  388.323656] sd 19:0:0:0: [sdm] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  388.323658] sd 19:0:0:0: [sdm] Sense not available.
[  388.323694] sd 19:0:0:0: [sdm] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  388.323696] sd 19:0:0:0: [sdm] Sense not available.
[  388.323961] sd 13:0:0:0: [sdj] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  388.323964] sd 13:0:0:0: [sdj] Sense not available.
[  388.323995] sd 13:0:0:0: [sdj] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  388.323996] sd 13:0:0:0: [sdj] Sense not available.
[  388.324007] sd 18:0:0:0: [sdl] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  388.324011] sd 18:0:0:0: [sdl] Sense not available.
[  388.324038] sd 13:0:0:0: [sdj] Write Protect is on
[  388.324039] sd 13:0:0:0: [sdj] Mode Sense: 00 00 04 fb
[  388.324066] sd 18:0:0:0: [sdl] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  388.324068] sd 18:0:0:0: [sdl] Sense not available.
[  388.324146] sd 18:0:0:0: [sdl] Write Protect is off
[  388.324148] sd 18:0:0:0: [sdl] Mode Sense: 00 00 00 00
[  388.324519] sd 12:0:0:0: [sdi] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  388.324521] sd 12:0:0:0: [sdi] Sense not available.
[  388.324544] sd 12:0:0:0: [sdi] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  388.324546] sd 12:0:0:0: [sdi] Sense not available.
[  388.324731] sd 13:0:0:0: [sdj] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  388.324733] sd 13:0:0:0: [sdj] Sense not available.
[  388.324748] sd 13:0:0:0: [sdj] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  388.324750] sd 13:0:0:0: [sdj] Sense not available.
[  388.324790] sd 13:0:0:0: [sdj] Write Protect is off
[  388.324791] sd 13:0:0:0: [sdj] Mode Sense: 00 00 00 00

This time it managed to shut down though.
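
For anyone wading through a log like this, a small script can boil it down to which ports and disks are affected. The following is only a minimal Python sketch, not something used in this thread: the sample lines are copied from the excerpt above, while the names `summarize` and `cdb16_lba` and the choice of which messages to count are assumptions made for illustration. It tallies link-level events per ATA port and I/O errors per disk, and decodes the LBA from a 16-byte read/write CDB so it can be compared with the sector number the kernel prints.

```python
import re
from collections import Counter

# A few lines copied from the kernel log above (not the full excerpt).
SAMPLE_LOG = """\
[  375.563942] ata20: COMRESET failed (errno=-16)
[  375.563966] ata14: COMRESET failed (errno=-16)
[  375.771951] ata13.00: disabled
[  375.878735] print_req_error: I/O error, dev sdm, sector 38295056
[  375.878763] print_req_error: I/O error, dev sdm, sector 38295064
[  375.935969] print_req_error: I/O error, dev sdj, sector 38295008
[  386.011968] ata19.00: disabled
"""

# Which messages to count is an assumption; extend the alternation as needed.
ATA_EVENT = re.compile(r"\] (ata\d+)(?:\.\d+)?: (COMRESET failed|disabled)")
IO_ERROR = re.compile(r"I/O error, dev (sd[a-z]+), sector (\d+)")


def summarize(log_text):
    """Return (events per (port, event), I/O error count per disk)."""
    ata_events = Counter()
    io_errors = Counter()
    for line in log_text.splitlines():
        match = ATA_EVENT.search(line)
        if match:
            ata_events[(match.group(1), match.group(2))] += 1
        match = IO_ERROR.search(line)
        if match:
            io_errors[match.group(1)] += 1
    return ata_events, io_errors


def cdb16_lba(cdb_hex):
    """Decode (LBA, block count) from a 16-byte READ(16)/WRITE(16) CDB.

    Bytes 2-9 hold the big-endian LBA, bytes 10-13 the transfer length.
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    lba = int.from_bytes(cdb[2:10], "big")
    blocks = int.from_bytes(cdb[10:14], "big")
    return lba, blocks


if __name__ == "__main__":
    events, errors = summarize(SAMPLE_LOG)
    for (port, event), count in sorted(events.items()):
        print(f"{port}: {event} x{count}")
    for disk, count in sorted(errors.items()):
        print(f"{disk}: {count} I/O errors")
    # First Read(16) CDB logged for sdm above.
    print(cdb16_lba("88 00 00 00 00 00 02 48 56 10 00 00 00 08 00 00"))
```

In the full excerpt the errors are spread over ata13, ata14, ata19 and ata20 (sdi, sdj, sdl, sdm) within about 20 seconds, which fits a shared component such as a cable, backplane or power feed better than a single bad disk.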

Yeah, I’ve had the same messages when my SATA cable wasn’t up to snuff. It worked for a while and then I started getting link resets. I had to replace it and then all was good.

The kernel only knows it can’t communicate with the device, and if there’s no module for that device that can tell you more, then you have to troubleshoot the hardware.

So this can be the backplane, a port on the backplane, a cable, or even a port on your controller.
That’s the main reason I switched to miniSAS, because troubleshooting more than a few connections is a PITA.

BTW, do you run SMART monitoring? If not, check smartctl on all drives, and configure smartd after you fix this issue.

Edit: Oh, and I would check the fanout cable first: it’s the most likely culprit and the easiest to replace. Swap them around, even. But I would take out all the drives first, and put in some spares.

Yeah I could swap out the SATA HBAs for mini-SAS HBAs, but that gets expensive pretty quick (like 600 euro), which is stupid because it’s the same tech with different connectors, but what can we do.

I just moved around the cable connections, doesn’t change anything. If anything, the problem manifested earlier after boot. No smartmon, wanted to install but box already dead before I could.

Edit: also these cables were already pretty hard to come by. The “reverse” kind are not that widely available, it would suck if I would have to find a different source.

I’ve completely bypassed one of the controllers and the bottom two rows on the backplane. So I am now running on motherboard SATA and 1 of the controllers. It is holding out a lot longer than before.

I removed one of the spares to make this possible. But there’s now a lot of resilvering going on which I’d rather just stop and put the drives back as spares:

$ sudo zpool status -v
  pool: data
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Nov 19 13:37:26 2020
	455G scanned out of 15.6T at 150M/s, 29h20m to go
	83.7G resilvered, 2.85% done
config:

	NAME                                            STATE     READ WRITE CKSUM
	data                                            ONLINE       0     0     0
	  raidz2-0                                      ONLINE       0     0     0
	    wwn-0x5000c500b2b39f7b                      ONLINE       0     0     0
	    ata-ST4000DM000-1F2168_S300GPKC             ONLINE       0     0     0
	    ata-ST4000DM000-1F2168_Z300TQVA             ONLINE       0     0     0
	    ata-ST4000DM000-1F2168_Z300TMRG             ONLINE       0     0     0
	    spare-4                                     ONLINE       0     0     0
	      ata-ST4000VN008-2DR166_ZGY31ED3           ONLINE       0     0     0
	      ata-ST4000VN008-2DR166_ZGY7PJGC           ONLINE       0     0     0  (resilvering)
	      ata-ST4000VN008-2DR166_ZGY7PJKY           ONLINE       0     0     0  (resilvering)
	      ata-WDC_WD40EZRZ-00GXCB0_WD-WCC7K6UNP786  ONLINE       0     0     0  (resilvering)
	    ata-WDC_WD40EZRZ-00GXCB0_WD-WCC7K2JYANKV    ONLINE       0     0     0
	    ata-ST4000VN008-2DR166_ZM403X3Z             ONLINE       0     0     0
	    ata-ST4000VN008-2DR166_ZM40355X             ONLINE       0     0     0
	    ata-ST4000VN008-2DR166_ZGY3CCMN             ONLINE       0     0     0
	    ata-WDC_WD40EZRZ-00GXCB0_WD-WCC7K3CC3UZ5    ONLINE       0     0     0
	logs
	  wwn-0x5002538d702bc018-part3                  ONLINE       0     0     0
	cache
	  wwn-0x5002538d702bc018-part4                  ONLINE       0     0     0
	spares
	  ata-ST4000VN008-2DR166_ZGY7PJGC               INUSE     currently in use
	  ata-ST4000VN008-2DR166_ZGY7PJKY               INUSE     currently in use
	  ata-WDC_WD40EZRZ-00GXCB0_WD-WCC7K6UNP786      INUSE     currently in use
	  ata-ST4000DM000-1F2168_S300J2JV               UNAVAIL 

errors: Permanent errors have been detected in the following files:

        /data/Archive/home-2014/home/xxxxx/backup.tar.gz
        /data/Archive/xxxxxx.rar

I notice there are now also some files that are marked as permanently damaged. How permanent is that?

Edit: Yeah, never mind. It took a while longer but the exact same errors are back.

A few years ago I bought a couple of Supermicro controllers, new for 100-150eu, and flashed them to IT mode.
https://www.supermicro.com/en/products/accessories/addon/AOC-USAS2-L8i.php?TYP=E
It’s UIO, not ATX standard, but it’s pretty easy to fasten them inside a rack case.
Also, now you can get various used ones based on the LSI2008 for like 30eu. It can be a RAID card that you flash to HBA/IT mode, just check beforehand that there’s an IT BIOS available.

So no need to spend tons of money if you just have mechanical drives. Cheap previous gen 6Gb/s is fine.

Well, there’s your clue. If moving the cables magnified the problem, it’s pretty obvious they’re probably the culprit.

You can check SMART on the disks one by one by connecting them to another box with a single SATA cable.
Or even just unplug the backplane, take out the controllers and use onboard SATA if the mobo has it. Boot from USB if your boot partition was going through the backplane too.

Yeah, I know. I had the same problem like 10 years ago. Even when I found a proper fanout it was way too expensive, or the store couldn’t even specify if it was normal or reverse.

On the other hand miniSAS SFF-8087 is always the same, and now you can get a new one for 10-20 eurorubles.
Lack of headache - priceless :wink:

Don’t use partitions for the logs and cache, as it looks to be an HDD and you will quickly exhaust your IOPS.

While ZFS won’t scream at you if you do this, it works best with full access to a whole drive.

Actually those are partitions of an SSD

Just occurred to me: because it’s a few disks at once, I would also check the backplane power plug. If it’s 4-pin Molex, those aren’t the best and sometimes don’t connect properly, especially when old and oxidized. The result would be pretty much the same as what you see in the journal.

Also, do you have a redundant PSU in your box? If yes, then testing one PSU module at a time would rule out a dying PSU. Just a few months ago I had this situation, where the dying side of the PSU didn’t switch over to the good one and strange things were happening.

Yeah I was also considering that. All the connectors are brand new but perhaps the splitters I’m using aren’t the best. There’s no redundant PSU, it’s a 1000w corsair. Should be enough juice, even with a 1950x and up to 24 HDDs, right?

Sure, 1k is plenty. Without a GPU you’re probably using less than 300w.

Oh, for sure, splitters are the no. 1 problem for me when something’s wrong with power. In my desktop last month I went through 3 molex->sata splitters because one of my drives was constantly getting thrown out of the RAID.

Alright, update on this:

  • I took out the 3 ssds from the backplane and connected them with plain sata to one of the controllers
  • Only the first 3 rows on the backplane are now populated (2 spares are not connected)
  • These 3 rows have 2 things common with each other:
    1. All three of them are connected using a shorter blue cable (I also have 3 longer grey cables that I had to source from somewhere else)
    2. 4 out of 5 molex sockets (don’t ask me why there are 5 for 3x6 drives) are connected using daisy chained 1>2 molex splitters (the rest of the molex sockets use a 1>6 splitter).
  • One of the rows is connected to both controllers, so if this attempt succeeds, the problem is clearly with the longer, grey cables.
  • A slowly increasing number of files are reported by zpool status as having “permanent errors”. I hope this is not as bad as it sounds. It’s not constantly increasing; I mean it has gone from 1 to 3 as I’ve been trying to debug this.
  • Ran for dev in /dev/sd?; do echo $dev; sudo smartctl -a $dev; done which didn’t yield any interesting results, but one of the disks does have a “degraded” state with too many errors in the cksum column. But I imagine this is probably some controller issues hangover. dmesg has stayed clean for 20 minutes which is much longer than before.

I’ll leave it for while longer, try to make sure it isn’t the molexes, and if it isn’t I guess I have to either source 3 new longer reverse breakout cables, or buy 2 of these:

and 3 or 4 of the appropriate cables…

Edit: running for about an hour now without any problems. Resilver rate is also double what it was. Reasonably sure now the problem is the grey cables but just to be sure I’ll also switch the molex splitters.

Well, glad you came to grips with it.
Finish scrubbing/resilvering, it will hopefully be ok.
If that degraded drive is ok then you can clear errors later.

I usually run smartctl -A for less output, and I’m mostly interested in Reallocated Sectors/Pending Sectors.
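
To make that check repeatable across many drives, the attribute table can be filtered by script. This is a minimal Python sketch under stated assumptions: the sample text is made-up illustrative output in the usual `smartctl -A` column layout (not taken from any drive in this thread), the function name `watched_raw_values` is hypothetical, and adding `UDMA_CRC_Error_Count` to the watch list is an extra suggestion because that counter tends to climb with cabling problems.

```python
# Attributes worth a first look; the CRC counter is an added assumption.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "UDMA_CRC_Error_Count",
)

# Illustrative sample only, in the column layout smartctl -A normally prints.
SAMPLE_SMARTCTL = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       22017
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       42
"""


def watched_raw_values(smartctl_output):
    """Return {attribute name: raw value} for the watched attributes."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            try:
                values[fields[1]] = int(fields[9])
            except ValueError:
                # Some drives print non-numeric raw values; skip those.
                pass
    return values


if __name__ == "__main__":
    for name, raw in watched_raw_values(SAMPLE_SMARTCTL).items():
        flag = "  <-- check" if raw > 0 else ""
        print(f"{name}: {raw}{flag}")
```

In practice the input would be the captured output of `sudo smartctl -A /dev/sdX` for each drive rather than the sample string.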

Remember to configure SMART monitoring after you finish fixing it, because it often warns you to replace a disk way before it really goes dead.

Also configure automatic periodic scrubbing. On my prod servers I usually run it every weekend at night, because nobody is using them then anyway. But some recommend once a month. So pick somewhere between those.

So the resilver finished overnight. This is what I was looking at, afterwards:

  pool: data
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: resilvered 4.45T in 10h45m with 10 errors on Sat Nov 21 04:07:19 2020
config:

	NAME                                            STATE     READ WRITE CKSUM
	data                                            DEGRADED     0     0    18
	  raidz2-0                                      DEGRADED     0     0   134
	    wwn-0x5000c500b2b39f7b                      ONLINE       0     0     0
	    ata-ST4000DM000-1F2168_S300GPKC             ONLINE       0     0     0
	    ata-ST4000DM000-1F2168_Z300TQVA             ONLINE       0     0     0
	    ata-ST4000DM000-1F2168_Z300TMRG             REMOVED      0     0     0
	    spare-4                                     ONLINE       0     0     0
	      ata-ST4000VN008-2DR166_ZGY31ED3           ONLINE       0     0     0
	      ata-ST4000VN008-2DR166_ZGY7PJGC           ONLINE       0     0     0
	      ata-ST4000VN008-2DR166_ZGY7PJKY           ONLINE       0     0     0
	      ata-WDC_WD40EZRZ-00GXCB0_WD-WCC7K6UNP786  ONLINE       0     0     0
	    ata-WDC_WD40EZRZ-00GXCB0_WD-WCC7K2JYANKV    ONLINE       0     0     0
	    ata-ST4000VN008-2DR166_ZM403X3Z             ONLINE       0     0     0
	    ata-ST4000VN008-2DR166_ZM40355X             ONLINE       0     0     0
	    15234179876330307149                        UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST4000VN008-2DR166_ZGY3CCMN-part1
	    ata-WDC_WD40EZRZ-00GXCB0_WD-WCC7K3CC3UZ5    DEGRADED     0     0     0  too many errors
	logs
	  wwn-0x5002538d702bc018-part3                  ONLINE       0     0     0
	cache
	  wwn-0x5002538d702bc018-part4                  ONLINE       0     0     0
	spares
	  ata-ST4000VN008-2DR166_ZGY7PJGC               INUSE     currently in use
	  ata-ST4000VN008-2DR166_ZGY7PJKY               INUSE     currently in use
	  ata-WDC_WD40EZRZ-00GXCB0_WD-WCC7K6UNP786      INUSE     currently in use
	  ata-ST4000DM000-1F2168_S300J2JV               UNAVAIL 

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x7e>
        /data/Archive/root-2020-08/usr/lib/jvm/java-11-openjdk-amd64/jmods/java.base.jmod

Interestingly, the metadata error stayed, 2 other files were removed and a new one was added to the list of files with “permanent errors”.

During the resilver, one of the drives experienced some errors. It’s that one that has state “REMOVED” here, ata4 in dmesg and /dev/sdd. This is most likely an unrelated problem though: probably this desktop grade disk just didn’t like the resilver. Also it’s one of the oldest ones in there I think.

Anyway, this specific disk failing does not worry me, but since another one is absent, I currently don’t have any parity disks. I need to be able to actually use 2 of the 3 spares currently “in use”.

Unfortunately though, I seem to be unable to detach them from their current use:

sudo zpool detach data ata-ST4000VN008-2DR166_ZGY31ED3
cannot detach ata-ST4000VN008-2DR166_ZGY31ED3: no valid replicas

To make matters worse, after running sudo zpool clear data ata-WDC_WD40EZRZ-00GXCB0_WD-WCC7K3CC3UZ5 to get rid of the vdev’s DEGRADED state, a big resilvering started again:

  pool: data
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Nov 21 08:32:58 2020
	187G scanned out of 15.6T at 304M/s, 14h44m to go
	53.3G resilvered, 1.17% done
config:

	NAME                                            STATE     READ WRITE CKSUM
	data                                            DEGRADED     0     0    18
	  raidz2-0                                      DEGRADED     0     0   134
	    wwn-0x5000c500b2b39f7b                      ONLINE       0     0     0  (resilvering)
	    ata-ST4000DM000-1F2168_S300GPKC             ONLINE       0     0     0
	    ata-ST4000DM000-1F2168_Z300TQVA             ONLINE       0     0     0
	    ata-ST4000DM000-1F2168_Z300TMRG             REMOVED      0     0     0
	    spare-4                                     ONLINE       0     0     0
	      ata-ST4000VN008-2DR166_ZGY31ED3           ONLINE       0     0     0  (resilvering)
	      ata-ST4000VN008-2DR166_ZGY7PJGC           ONLINE       0     0     0  (resilvering)
	      ata-ST4000VN008-2DR166_ZGY7PJKY           ONLINE       0     0     0  (resilvering)
	      ata-WDC_WD40EZRZ-00GXCB0_WD-WCC7K6UNP786  ONLINE       0     0     0  (resilvering)
	    ata-WDC_WD40EZRZ-00GXCB0_WD-WCC7K2JYANKV    ONLINE       0     0     0
	    ata-ST4000VN008-2DR166_ZM403X3Z             ONLINE       0     0     0  (resilvering)
	    ata-ST4000VN008-2DR166_ZM40355X             ONLINE       0     0     0  (resilvering)
	    15234179876330307149                        UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST4000VN008-2DR166_ZGY3CCMN-part1
	    ata-WDC_WD40EZRZ-00GXCB0_WD-WCC7K3CC3UZ5    ONLINE       0     0     0
	logs
	  wwn-0x5002538d702bc018-part3                  ONLINE       0     0     0
	cache
	  wwn-0x5002538d702bc018-part4                  ONLINE       0     0     0
	spares
	  ata-ST4000VN008-2DR166_ZGY7PJGC               INUSE     currently in use
	  ata-ST4000VN008-2DR166_ZGY7PJKY               INUSE     currently in use
	  ata-WDC_WD40EZRZ-00GXCB0_WD-WCC7K6UNP786      INUSE     currently in use
	  ata-ST4000DM000-1F2168_S300J2JV               UNAVAIL 

What I am going to do now is shut her down and see if I can MacGyver some way to get the disks all hooked up without using the backplane. I probably should have done that in the first place.

Afterwards I still have to get rid of all of those in-use-but-not-really spares.

So that worked for about 5 minutes and then the whole system became unresponsive again. I’m just gonna wait for the new parts to arrive because I don’t think I’m making things better right now.

Yeah, when your system is touch and go, it’s a fool’s errand to resilver.

And detaching isn’t necessary; just pull it and do “zpool replace” with the numerical ID that shows up when the drive is missing.

Like in your last listing, you can add the new disk and just do:

zpool replace data 15234179876330307149 ata-YourNewDriveID

BTW, your logs are on a single drive??

	logs
	  wwn-0x5002538d702bc018-part3                  ONLINE       0     0     0

This is a tragedy waiting to happen if that’s not a mirror…
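
For what it’s worth, a minimal sketch of how the single log device could be turned into a mirror, assuming a second SSD partition is available. The pool name and current log device come from the listing above; NEW_LOG is a made-up placeholder, and the script only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: convert the single log device into a mirrored log.
# POOL and CUR_LOG come from the status listing above.
# Assumption: NEW_LOG is a hypothetical placeholder, substitute a real partition.
POOL=data
CUR_LOG=wwn-0x5002538d702bc018-part3
NEW_LOG=/dev/disk/by-id/YourSecondSsd-part3

# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$*"
    fi
}

mirror_log() {
    # Attaching a new device to an existing log device makes the log a mirror.
    run zpool attach "$POOL" "$CUR_LOG" "$NEW_LOG"
}

mirror_log
```

Once the attach finishes, zpool status should show the two partitions under a mirror entry in the logs section.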

Yeah, when your system is touch and go, it’s a fool’s errand to resilver.

It took a while due to a shipping issue, but I have the 3 SAS HBAs installed and it’s been stable with a nice, slowly increasing resilver rate for about an hour so far. We’ll see tomorrow, but I am hopeful.

And detaching isn’t necessary; just pull it and do “zpool replace” with the numerical ID that shows up when the drive is missing.

The thing is though, that currently almost all my spares are “in use”, while none of my drives are marked as failing. I’m talking about this bit:

	    spare-4                                     ONLINE       0     0     0
	      ata-ST4000VN008-2DR166_ZGY31ED3           ONLINE       0     0     0  (resilvering)
	      ata-ST4000VN008-2DR166_ZGY7PJGC           ONLINE       0     0     0  (resilvering)
	      ata-ST4000VN008-2DR166_ZGY7PJKY           ONLINE       0     0     0  (resilvering)
	      ata-WDC_WD40EZRZ-00GXCB0_WD-WCC7K6UNP786  ONLINE       0     0     0  (resilvering)

I only need one of those drives to resilver, and the rest to go back to the pool of spares.
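
A rough sketch of how the extra spares could be released, assuming the device names from the listing above. Detaching an in-use hot spare cancels its replacement and returns it to the spares list. Which spares end up in RELEASE is a hypothetical choice here, not a recommendation, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: hand surplus in-use hot spares back to the spares list.
# POOL and device IDs come from the status listing above.
# Assumption: the selection in RELEASE is hypothetical; pick the spares you
# actually want to free, and keep the one that should finish resilvering.
POOL=data
RELEASE="ata-ST4000VN008-2DR166_ZGY7PJKY ata-WDC_WD40EZRZ-00GXCB0_WD-WCC7K6UNP786"

# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$*"
    fi
}

release_spares() {
    for dev in $RELEASE; do
        # Detaching an in-use spare returns it to the available spares.
        run zpool detach "$POOL" "$dev"
    done
}

release_spares
```

After that, zpool status should list the detached drives as AVAIL under spares instead of INUSE.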

I was talking in terms of “future endeavors”, because sometimes people don’t know how to do it.

More than a few times someone has asked me for help because they removed a bad disk and then did “zpool add” :wink:
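
To illustrate the difference, a small sketch using the missing-drive ID from the earlier listing and the placeholder name ata-YourNewDriveID. “zpool add” puts the disk in as a new top-level vdev instead of filling the hole in the raidz2, while “zpool replace” swaps it in for the missing member. The -n flag on add only previews the result. The script prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: contrast "zpool add" with "zpool replace" for a missing raidz2 member.
# POOL and MISSING come from the status listing earlier in the thread.
# Assumption: NEW is a placeholder drive ID, not a real device.
POOL=data
MISSING=15234179876330307149
NEW=ata-YourNewDriveID

# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$*"
    fi
}

preview_add() {
    # -n shows the configuration that would be used without changing the pool.
    # Without -n, the disk would become a separate top-level vdev.
    run zpool add -n "$POOL" "$NEW"
}

swap_missing() {
    # Replace the missing member (referenced by its numeric ID) with the new disk.
    run zpool replace "$POOL" "$MISSING" "$NEW"
}

preview_add
swap_missing
```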

Hope everything will go well.