@bambinone I know exactly what this is.
The ZED daemon is responsible for handling, in user space, all zpool events: things like devices being added or removed, or device errors.
On the 3008-based LSI HBA chipsets (9300, 9400 series HBAs), the mpt3sas driver generates an info message every time a checksum data transfer event happens. With passive copper cables this happens a lot, and under STP transports of ATA over SAS it can happen very frequently (but that's not what is happening here).
To confirm my conjectures above, you can open up the mpt3sas.h file and see the debug flags needed to get a bit more verbosity out of the driver. I have assembled the bit mask for you here:
echo 0x0006020A >/sys/module/mpt3sas/parameters/logging_level
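If you want to see which individual flag bits that mask turns on before you apply it, here is a minimal sketch that splits a mask into its set bits so you can match each one against the debug defines in the driver headers. The `decode_mask` helper is my own name for illustration, not something shipped with the driver.

```shell
#!/bin/sh
# Print each individual bit that is set in a logging_level mask, one per line.
# decode_mask is a hypothetical helper, not part of mpt3sas or any package.
decode_mask() {
    mask=$(( $1 ))          # accepts hex (0x...) or decimal
    bit=0
    while [ "$bit" -lt 32 ]; do
        flag=$(( 1 << bit ))
        if [ $(( mask & flag )) -ne 0 ]; then
            printf '0x%08X\n' "$flag"
        fi
        bit=$(( bit + 1 ))
    done
}

decode_mask 0x0006020A
```

Each printed value should correspond to one debug flag in the header. When you are done collecting logs, writing `0` to the same `logging_level` file should, as far as I know, put the driver back to its default verbosity.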
You’ll find that the above verbosity level shows that the 3008 chipset is experiencing occasional checksum errors on CDB Write 16 as a result of an improper link equalization handshake on initialization. Reset the link and you’ll get a different rate of occurrence. It is a known problem with passive copper cabling. At scale and for Enterprise, I tended to go with optical whenever possible as it completely eliminates the problem with the HBA link equalization.
While I was still with Seagate, I tried to work with Broadcom's firmware engineers out of their Boulder, Colorado office, but they refused to fix this legacy LSI problem. In their mind this chipset is obsolete and should be replaced with the new TriMode/NVMe-capable HBAs.
The fix…
This is not an error. It is an informational message from the mpt3sas driver indicating that it had to do a partial retransmission of the SRB payload of the SCSI command.
However, the ZED daemon on Linux was never finished. It can't tell the difference between an error, a warning, or an info-level message. Just using the storcli command line tool to read the firmware revisions on the card will generate a stream of events that the ZFS ZED daemon will interpret as errors, and it will offline a device!
The fix was to fix the ZED daemon. To that end, I sponsored 9 patch sets for ZFS via Allan Jude at iXsystems. Most of them landed in OpenZFS 2.1, while the rest are landing with OpenZFS 2.2.
This resolves most of the issues, and also gives you the ability to control how frequent info/warn messages must be before ZFS considers evicting a device. This completely resolved issues I was having with my old shoddy SAS JBODs hosting SATA drives.
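As a rough illustration of that tuning: my understanding is that OpenZFS 2.2 exposes the ZED thresholds as per-vdev properties (`checksum_n`, `checksum_t`, `io_n`, `io_t`, described in vdevprops(7)), meaning N events within T seconds before ZED acts. Treat the property names as an assumption and check the man page on your build. The pool name `tank`, the device `sdb`, and the numbers are made up, and the helper below only prints the commands rather than running them.

```shell
#!/bin/sh
# Dry run: print (do not execute) the zpool commands that would raise the
# ZED thresholds on one vdev. Property names are assumed from vdevprops(7).
zed_threshold_cmds() {
    pool="$1"; vdev="$2"; count="$3"; seconds="$4"
    for prefix in checksum io; do
        printf 'zpool set %s_n=%s %s %s\n' "$prefix" "$count" "$pool" "$vdev"
        printf 'zpool set %s_t=%s %s %s\n' "$prefix" "$seconds" "$pool" "$vdev"
    done
}

# Hypothetical pool "tank", device "sdb": 50 events within 600 seconds.
zed_threshold_cmds tank sdb 50 600
```

Review the printed lines, then run them by hand as root if they look right for your pool. A noisy JBOD probably wants a higher count than a clean one, but that is a judgment call for your hardware.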
The patch sets I sponsored are attached.
I would recommend upgrading to OpenZFS 2.2 as soon as possible. If you want some pre-built packages, I can help. I have them in production for the latest Arch, Debian, and Ubuntu 22.04.
Seagate.Sponsored.Zfs.Changes.txt (7.7 KB)