I’m dealing with a peculiar error at work, and I’m curious whether anyone here has any insight.
We have a storage server with about 68 drives in it (many of them in an external JBOD). Most of the disks are SAS, though there are some SATA as well. This server hosts two zpools consisting of large-ish (~11-drive) raidz2 vdevs. When we put the storage under load, a bunch of read and write errors show up in zpool status, sometimes piling up to the point that ZFS kicks a disk out (marking it as faulted). A zpool clear restores the pool to a working state, and when the errors come back a few minutes later it’s usually on different drives. All of the drives self-report no errors (smartctl).
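For reference, this is roughly the check-and-recover loop we go through (using the pool and disk names from the log below; adjust for your own devices):

# per-device read/write/checksum error counters, plus any files affected
zpool status -v bulk9

# reset the error counters and bring FAULTED disks back online once the errors pile up
zpool clear bulk9

# SMART health summary and the drive's own error log for one of the flagged disks
smartctl -a /dev/sdbd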
Here’s an example of dmesg output showing an actual I/O error:
[281421.461825] zio pool=bulk9 vdev=/dev/disk/by-id/scsi-35000c500d7d8a0db-part1 error=5 type=1 offset=10301571964928 size=741376 flags=40080cb0
[281421.848724] sd 12:0:58:0: [sdbd] tag#494 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[281421.848747] sd 12:0:58:0: [sdbd] tag#494 Sense Key : Aborted Command [current] [descriptor]
[281421.848756] sd 12:0:58:0: [sdbd] tag#494 Add. Sense: Ack/nak timeout
[281421.848763] sd 12:0:58:0: [sdbd] tag#494 CDB: Read(16) 88 00 00 00 00 04 af 43 f8 90 00 00 00 c8 00 00
[281421.848769] blk_update_request: I/O error, dev sdbd, sector 20120336528 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 0
Basically, we’re trying to figure out how to troubleshoot this. I don’t really understand what the output above means. Could the HBA be faulty? Could it be cabling? A power-quality issue? Is the problem between the HBA and the disks, or between the CPU and the HBA? Could it be that the HBA simply isn’t up to the throughput we’re pushing through it?
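If it matters, here is the sort of link-level counter check I can run on this system. I’m assuming smartctl’s sasphy log and the kernel’s sas_phy sysfs counters are the right places to look, and the device name is just the one from the log above:

# PHY error counters as reported by the drive itself (SAS drives only)
smartctl -l sasphy /dev/sdbd

# the same class of counters as seen from the HBA/expander side;
# climbing invalid-dword or disparity counts would suggest cabling or backplane trouble
grep . /sys/class/sas_phy/*/invalid_dword_count \
       /sys/class/sas_phy/*/running_disparity_error_count \
       /sys/class/sas_phy/*/loss_of_dword_sync_count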
Any advice on how to troubleshoot would be appreciated!