Hey everyone. I’ve been down an extended rabbit hole regarding T10-PI (protection information) for SCSI and NVMe devices under Linux. Basically, a set of extensions to the SCSI and NVMe specifications, referred to variously as T10-PI, T10-DIF, and T10-DIX, allows end-to-end data integrity verification through checksums, checked at various points along the chain connecting the OS to the disk.
If my understanding is correct, in the context of SAS drives:
T10-DIF (data integrity field) is a setup where the HBA performs checksums and transmits them to the disk along with every block. The disk is supposed to verify and store the checksum, and return it, to be checked by the HBA, upon reading that block back.
T10-DIX takes this a step further and allows the OS to compute the checksum. I think the HBA is supposed to verify it, and pass it along to the disk.
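For anyone following along, here is a rough sketch of what that per-block checksum looks like for Type 1 protection, as far as I understand it: an 8-byte tuple made of a 16-bit guard tag (a CRC-16 with polynomial 0x8BB7 over the block data), a 16-bit application tag, and a 32-bit reference tag (the low 32 bits of the LBA), all big-endian. The function names crc16_t10dif and make_pi_tuple are made up for illustration; this is not how the HBA or kernel implements it.

```python
import struct

T10_DIF_POLY = 0x8BB7  # CRC-16/T10-DIF generator polynomial


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16/T10-DIF: initial value 0, no reflection, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def make_pi_tuple(block: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Build an 8-byte Type 1 tuple: guard tag, application tag, reference tag."""
    guard = crc16_t10dif(block)
    ref_tag = lba & 0xFFFFFFFF  # Type 1: low 32 bits of the target LBA
    return struct.pack(">HHI", guard, app_tag, ref_tag)


if __name__ == "__main__":
    # Hypothetical example: an all-zero 4096-byte block written to LBA 1234.
    print(make_pi_tuple(bytes(4096), 1234).hex())
```

With DIF, the HBA would generate and check this tuple; with DIX, the OS would generate it and hand it to the HBA alongside the data.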
Anyways, I’m trying to get this working on real hardware. I have:
A Seagate Exos 20TB SAS drive
An LSI card based on the SAS2008 chip, flashed in IT mode to act as an HBA
An x86_64 Linux host running kernel 6.1.0-17 (Debian)
I formatted the disk. Since (I think) DIF works by using larger blocks to store the checksum in addition to the data, at first I tried formatting it with larger sectors (4160 bytes) via sg_format --format --size=4160 --fmtpinfo=2 --pfu=0 --long /dev/sdg. According to the sg_format man page, the fmtpinfo flag is supposed to enable Type 1 DIF. But after that, the detected capacity was zero due to an unsupported block size. So I tried again with sg_format --format --size=4096 --fmtpinfo=2 --pfu=0 --long /dev/sdg, and now I have a drive that shows up with a usable capacity.
I see a note in dmesg, [sdg] Enabling DIF Type 1 protection, which sounds good, and lsscsi -p notes DIF/Type1 beside the drive. But when I cat the various parameters in /sys/block/sdg/integrity/, it seems like integrity is not actually enabled:
device_is_integrity_capable = 0
format = none
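A small sketch for dumping everything under that sysfs directory in one go, rather than cat-ing each file. It assumes nothing about which attribute files exist and simply reads whatever is present; the function name read_integrity_attrs and the sys_block parameter are made up for illustration.

```python
from pathlib import Path


def read_integrity_attrs(disk: str, sys_block: str = "/sys/block") -> dict:
    """Return {attribute name: value} for <sys_block>/<disk>/integrity/."""
    integrity_dir = Path(sys_block) / disk / "integrity"
    attrs = {}
    if not integrity_dir.is_dir():
        return attrs  # no integrity directory exposed for this disk
    for entry in sorted(integrity_dir.iterdir()):
        if entry.is_file():
            try:
                attrs[entry.name] = entry.read_text().strip()
            except OSError:
                attrs[entry.name] = None  # attribute exists but is unreadable
    return attrs


if __name__ == "__main__":
    # "sdg" is the disk name from this thread; substitute your own.
    for name, value in read_integrity_attrs("sdg").items():
        print(f"{name} = {value}")
```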
I’ll note that I did all this without a machine restart. I could restart it but the host is in use so it’s inconvenient.
My questions:
I gather DIX is not enabled, but I’m not clear on whether DIF is even working properly. Is it? How do I know?
How would I enable DIX if I wanted to?
Does anyone have any other general advice or knowledge to share?
I think what’s going on is that Type1 DIF is working, but the HBA doesn’t support DIX. Here’s why I think this:
lsscsi and dmesg say/said that Type 1 DIF is enabled
smartctl -a says
Formatted with type 1 protection
8 bytes of protection information per logical block
I started poking around in the Linux source code as well, and it looks like the path /sys/block/sdg/integrity/device_is_integrity_capable is reporting on DIX, not DIF. It’s reporting on the blk_integrity struct, which is only actually registered if DIX is being used, not DIF.
I believe my supposition that Type 1 DIF is correctly enabled (and it’s just DIX that isn’t) was correct.
Unfortunately, the reason I believe that is that the protection info seems to have caught a checksum mismatch. dmesg says the following. I don’t really have a point; I’m just noting it in case anyone has any further insight.
… upon further reading and thinking, if the “reference tag” check is failing, wouldn’t that mean that it’s not actually a checksum mismatch (that would be the “guard” tag, I think), but a misdirected read? Interesting. Perhaps the problem is not the disk itself…
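To make the guard-versus-reference distinction concrete, here is a minimal sketch of how a Type 1 check could tell the two failures apart, assuming the usual 8-byte big-endian tuple (16-bit guard CRC with polynomial 0x8BB7, 16-bit application tag, 32-bit reference tag equal to the low 32 bits of the LBA). The names crc16_t10dif and check_pi are hypothetical, and the real checks in the drive, HBA, and kernel handle more cases than this.

```python
import struct


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16/T10-DIF (poly 0x8BB7, init 0, no reflection)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def check_pi(block: bytes, pi: bytes, expected_lba: int) -> str:
    """Classify a block/tuple pair as ok, corrupted data, or wrong location."""
    guard, _app_tag, ref_tag = struct.unpack(">HHI", pi)
    if guard != crc16_t10dif(block):
        return "guard tag mismatch"  # the data does not match its checksum
    if ref_tag != (expected_lba & 0xFFFFFFFF):
        return "reference tag mismatch"  # intact data, but for a different LBA
    return "ok"


if __name__ == "__main__":
    # Hypothetical example: a block written for LBA 100 but read back as LBA 101.
    data = bytes(range(256)) * 16
    pi = struct.pack(">HHI", crc16_t10dif(data), 0, 100)
    print(check_pi(data, pi, 101))
```

A reference tag failure with a good guard tag would point at the block being intact but associated with the wrong address, which fits the misdirected-read (or bad remap) theory better than on-disk corruption.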
That is how I was reading it.
I remember there being some kind of 4k block issue in Linux that would cause this, but for the life of me I can’t remember where I read about it. It had something to do with a reftag remap being handled badly when going from the block input-output layer to the SCSI stack; the 4k bio>scsi remap had some emergent behavior that the 512b bio>scsi remap didn’t.
I barely follow, to be honest, but I appreciate your reply. If you recall any additional details, please let me know! I’m confused because this same HBA and same disk worked in a different host. Both hosts are running Debian 12, which makes this odd. I don’t know if the kernel versions were exactly the same, but they must be close.
I turned up mpt3sas’s verbosity a bit and there’s a little more information, though it still doesn’t mean that much to me.
@twin_savage thanks for the tip! Based on my half-understanding of what you were saying, I had a hunch to go disable legacy option ROMs in the BIOS, thereby disabling the ROM for my SAS2008 chip. This seems to have fixed the issue! This also explains why this was working on my other host: that host has option ROMs disabled, I believe. Thanks again!
@hsnyder: This is a very nice post and I am trying to do the same (i.e., enabling T10 PI/DIF/DIX). However, I did not get the point where you tested that both DIF and DIX are enabled. It seems that you managed to enable both on the Seagate Exos 20TB SAS drive: but how did you check that it is enabled? Were you able to display it as “enabled”? If yes, where and how? Furthermore, were you able to test it by injecting faults?
Best regards