Experiments with dm-integrity and T10-PI on Linux

Kioxia also says, in that same blog post that you linked,

T10 Protection Information is a feature first enabled on enterprise SAS HDDs and SSDs that has been included in the NVMe specification, and has been included in KIOXIA Enterprise class NVMe SSDs since CM5.

😔 We all would have liked it if Kioxia had written this clearly in the data sheet or specifications instead of making us parse a blog post to extract this information…

This problem of the expression “end-to-end protection” having no settled meaning seems to be as widespread as “AES encryption” not implying that access to the drive’s contents is protected by any sort of authentication.

Perhaps there is some implicit agreement that if you’re an enterprise buying enterprise drives, T10-PI/DIF/DIX is as implicit as water being wet—something that us plebs wouldn’t know.

Perhaps there is some implicit agreement that if you’re an enterprise buying enterprise drives, T10-PI/DIF/DIX is as implicit as water being wet—something that us plebs wouldn’t know.

But we keep finding enterprise drives that don’t actually support it (e.g. Intel P4500s).

You need their secret decoder ring, something that all enterprise SSD procurement specialists get once they are sworn in and have pledged secrecy.

I recently acquired an enterprise-grade NVMe drive with T10-PI support (a Netlist NS1952, 7.68 TB) and got PI working (I think). I figured I’d post some notes here, since there was some troubleshooting involved.

nvme id-ns listed the logical block formats that the drive supports, and LBAF 3 corresponds to 4096-byte sectors with 8 bytes of metadata. So I formatted the drive as follows:

nvme format /dev/nvme0n1 --lbaf=3 --pi=1 --pil=1 --ses=1
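(For anyone reproducing this: the LBA formats, plus the dpc/dps fields describing which PI types the namespace supports and currently has enabled, can be dumped in human-readable form with something like the following; exact output varies by drive.)

nvme id-ns /dev/nvme0n1 -H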

The --pil=1 flag was necessary on this device (the command fails, claiming lack of support, otherwise). According to the manual page for nvme format, it only determines whether the protection information occupies the first or last 8 bytes of the metadata, and with exactly 8 bytes of metadata those are the same thing, so I don’t think it makes an actual difference in this configuration.

After the format, most interactions with the SSD would trigger ref tag errors in dmesg, presumably because the format didn’t actually initialize the metadata across the disk. I had substantial trouble creating partitions and LUKS on the device until I zeroed out the whole drive with dd. Note that blkdiscard did not work in my case; I had to write zeros to the whole drive. After that, I was able to proceed using the drive as normal.
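For concreteness, the zero-fill was something along these lines (destructive, obviously; the block size is just a sensible choice, not anything the drive requires):

dd if=/dev/zero of=/dev/nvme0n1 bs=1M oflag=direct status=progress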

It appears that T10-PI is working:

$ cat /sys/class/nvme/nvme0/nvme0n1/integrity/format 
T10-DIF-TYPE1-CRC
$ cat /sys/class/nvme/nvme0/nvme0n1/integrity/read_verify 
1
$ cat /sys/class/nvme/nvme0/nvme0n1/integrity/write_generate 
1
$ cat /sys/class/nvme/nvme0/nvme0n1/integrity/device_is_integrity_capable 
1

Note that at one point (before I figured out that I needed to zero the drive), I tried a format with --ms=1. This caused the drive to show up as not integrity capable at all.

Fill the disk, then confirm that deletes/discards are working normally too? I wonder if it’s fine until it’s totally full.

Oooh, good catch! fstrim on the mount point says “the discard operation is not supported”. I have XFS on LVM on LUKS2 on the raw partition, so I may have just missed a setting somewhere… looking into it.

Edit: the issue was that I didn’t use --allow-discards in cryptsetup open. I’ve rectified that, and fstrim now works.
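For anyone else hitting this, the fix looks something like the following (the mapping name here is made up; --persistent is LUKS2-only and stores the flag in the header so future opens pick it up automatically):

cryptsetup open --allow-discards --persistent /dev/nvme0n1p1 cryptdata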

Now also cross-check with the SMART stats to confirm they make sense with the operations you’re doing?

I’m not too familiar with reading smartctl output on NVMe devices, but here’s smartctl -a. I don’t see anything that I can obviously correlate with my actions.

=== START OF INFORMATION SECTION ===
Model Number:                       NS1952UF17T6
Serial Number:                      [redacted]
Firmware Version:                   283001K0
PCI Vendor/Subsystem ID:            0x1c1b
IEEE OUI Identifier:                0x38b19e
Total NVM Capacity:                 7,681,501,126,656 [7.68 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.2
Number of Namespaces:               32
Local Time is:                      Mon Mar  3 10:25:29 2025 EST
Firmware Updates (0x17):            3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x000e):   Format Frmw_DL NS_Mngmt
Optional NVM Commands (0x0034):     DS_Mngmt Sav/Sel_Feat Resv
Log Page Attributes (0x02):         Cmd_Eff_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     75 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W       -        -    0  0  0  0      100     100
 1 +    24.00W       -        -    1  1  1  1      115     115
 2 +    23.00W       -        -    2  2  2  2      130     130
 3 +    22.00W       -        -    3  3  3  3      145     145
 4 +    21.00W       -        -    4  4  4  4      160     160
 5 +    20.00W       -        -    5  5  5  5      175     175
 6 +    19.00W       -        -    6  6  6  6      190     190
 7 +    18.00W       -        -    7  7  7  7      205     205
 8 +    17.00W       -        -    8  8  8  8      220     220
 9 +    16.00W       -        -    9  9  9  9      235     235
10 +    15.00W       -        -   10 10 10 10      250     250
11 +    14.00W       -        -   11 11 11 11      265     265
12 +    13.00W       -        -   12 12 12 12      280     280
13 +    12.00W       -        -   13 13 13 13      295     295
14 +    11.00W       -        -   14 14 14 14      310     310
15 +    10.00W       -        -   15 15 15 15      325     325

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        53 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    0%
Data Units Read:                    1,585,022 [811 GB]
Data Units Written:                 37,411,179 [19.1 TB]
Host Read Commands:                 6,452,500
Host Write Commands:                2,641,022,858
Controller Busy Time:               748
Power Cycles:                       7
Power On Hours:                     38
Unsafe Shutdowns:                   2
Media and Data Integrity Errors:    0
Error Information Log Entries:      3
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               64 Celsius
Temperature Sensor 2:               53 Celsius
Temperature Sensor 3:               48 Celsius
Temperature Sensor 4:               47 Celsius

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0          3     0  0x0000  0x4004  0x028            -     0     -  Invalid Field in Command
  1          2     0  0x0000  0x4004  0x028            -     1     -  Invalid Field in Command
  2          1     0  0x0000  0x4004  0x028            -     0     -  Invalid Field in Command

Self-tests not supported

I also tried nvme error-log; it emitted 63 entries in total, but the ones numbered 4 and higher are identical to the one numbered 3. I don’t know how to interpret this, but will do some reading. I don’t think these are the PI errors I was seeing in dmesg, but I could be wrong.
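(For reference, that was invoked as something like:)

nvme error-log /dev/nvme0 -e 63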

Error Log Entries for device:nvme0 entries:63
.................
 Entry[ 0]   
.................
error_count     : 3
sqid            : 0
cmdid           : 0
status_field    : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag       : 0
parm_err_loc    : 0x28
lba             : 0xffffffffffffffff
nsid            : 0
vs              : 0
trtype          : Fibre Channel Transport error.
csi             : 0
opcode          : 0
cs              : 0
trtype_spec_info: 0
log_page_version: 0
.................
 Entry[ 1]   
.................
error_count     : 2
sqid            : 0
cmdid           : 0
status_field    : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag       : 0
parm_err_loc    : 0x28
lba             : 0xffffffffffffffff
nsid            : 0x1
vs              : 0
trtype          : Fibre Channel Transport error.
csi             : 0
opcode          : 0
cs              : 0
trtype_spec_info: 0
log_page_version: 0
.................
 Entry[ 2]   
.................
error_count     : 1
sqid            : 0
cmdid           : 0
status_field    : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag       : 0
parm_err_loc    : 0x28
lba             : 0xffffffffffffffff
nsid            : 0
vs              : 0
trtype          : Fibre Channel Transport error.
csi             : 0
opcode          : 0
cs              : 0
trtype_spec_info: 0
log_page_version: 0
.................
 Entry[ 3]   
.................
error_count     : 0
sqid            : 0
cmdid           : 0
status_field    : 0(Successful Completion: The command completed without error)
phase_tag       : 0
parm_err_loc    : 0
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
csi             : 0
opcode          : 0
cs              : 0
trtype_spec_info: 0
log_page_version: 0

Looks like I spoke too soon about T10-PI working. After a power cycle, dmesg contained a whole lot of these:

[  962.112578] Buffer I/O error on dev nvme0n1p1, logical block 1875366128, async page read
[  962.113459] nvme0n1: ref tag error at location 15002931072 (rcvd 0)
[  962.130956] nvme0n1: ref tag error at location 15002931072 (rcvd 0)
[  962.131065] nvme0n1: ref tag error at location 15002931072 (rcvd 0)
[ 1001.063037] nvme0n1: ref tag error at location 15002931072 (rcvd 0)
[ 1001.063159] nvme0n1: ref tag error at location 15002931072 (rcvd 0)
[ 1145.538343] nvme0n1: ref tag error at location 15002931072 (rcvd 0)
[ 1145.538385] Buffer I/O error on dev nvme0n1p1, logical block 1875366128, async page read
[ 1145.539247] nvme0n1: ref tag error at location 15002931072 (rcvd 0)
[ 1145.556586] nvme0n1: ref tag error at location 15002931072 (rcvd 0)
[ 1145.556695] nvme0n1: ref tag error at location 15002931072 (rcvd 0)
[ 1149.762937] nvme0n1: ref tag error at location 15002931072 (rcvd 0)
[ 1149.763007] nvme0n1: ref tag error at location 15002931072 (rcvd 0)

The LUKS container failed to come up automatically. I was able to manually open it, and activate the LVM volume thereon, but I wasn’t able to mount the XFS filesystem (or even detect it with blkid) without disabling integrity validation via echo 0 > /sys/block/nvme0n1/integrity/read_verify. That allowed me to mount the FS.

I’m currently running an xfsdump so that I have an up-to-date backup, and will be reformatting the drive to remove T10-PI, at least for now. I don’t think this is expected behavior and I’m feeling mistrustful of the T10-PI implementation in the drive firmware.

Input is welcome.

Read one of these using:

I’m pretty sure this can be adapted to read the sector and its metadata together, though it’s been a while since I last did it.

All zeroes beyond the data bytes would mean no metadata/protection information is being written.

NVMe status: End-to-end Reference Tag Check Error: The command was aborted due to an end-to-end reference tag check failure(0x6284)
0000000

Read without checking reference tag?

If I change -p 15 to -p 14 (clearing bit 0, which per the NVMe PRINFO field is the reference tag check bit), I get the same successful read output.

Still looking for a way to get the drive to tell me what the stored reference tag actually is, without checking it.

Edit: I think I found it. Add -y 8 to read the 8 bytes of metadata along with the data.

nvme read /dev/nvme0n1 -s 0xb7e700 -r 0xb7e700 -c 0 -z 4096 -y 8 -p 14 | od -t x1

Last few lines of output are:

0007760 d7 ee d8 a8 b1 0a 14 80 e7 0b 1c 07 51 36 8f 26
0010000 00 00 00 00 00 00 00 00
0010010

At first I was confused because it looks like the protection bytes are all zero, but the man page, in the table describing the effects of the -p flag, notes that bit 3 (PRACT) causes metadata to be stripped on read. Trying again with -p 6 (check the guard and app tags only, and don’t strip the metadata) gives

0007760 d7 ee d8 a8 b1 0a 14 80 e7 0b 1c 07 51 36 8f 26
0010000 5b cb 00 00 00 b7 e7 00
0010010

Looks like the guard tag is populated (0x5bcb), the app tag is not (0x0000), and the ref tag is populated (the last four bytes, 0x00b7e700). The strange thing is that the ref tag is correct here… with Type 1 protection, the ref tag should be the lower 32 bits of the LBA, i.e. 00b7e700. So either it got generated during the read by nvme read itself, or I really don’t know what’s going on. But the behavior is the same if I omit the -r argument.

Okay, facepalm time. I wasn’t looking at the right block!

Consider this dmesg entry:

[   27.510466] Buffer I/O error on dev nvme0n1p1, logical block 1875366128, async page read

I think the appropriate read command is the following (-p 7 requests guard, app, and ref tag checks, without stripping the metadata):

nvme read /dev/nvme0n1 -s 1875366128 -c 0 -z 4096 -y 8 -p 7 | od -t x1

This yields

read: Success
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0010000 00 00 00 00 00 00 00 00
0010010

So the whole sector is full of zeros, and the metadata is also full of zeros. Therefore, the reference tag doesn’t match, and yet the nvme read completed successfully.

So, still confused, but at least I’m probably checking the right block now…
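One lead I still want to chase (pure speculation on my part): the NVMe Identify Namespace data has a DLFEAT field describing what deallocated (i.e. trimmed) blocks return on read, including how the PI guard field is handled for them, which might explain a “successful” read of an all-zero block. nvme-cli will decode it with something like:

nvme id-ns /dev/nvme0n1 -H | grep -i -A 3 dlfeat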

Okay, I have my filesystem working again. I think I have an incorrect ordering dependency between LVM and cryptsetup, so I have to manually activate the LV after booting and then mount the FS, but that’s fine for now.
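Concretely, the manual sequence after boot is roughly this (the VG/LV names here are placeholders for mine):

vgchange -ay
mount /dev/mapper/vg-lv /mnt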

Speaking of the FS: XFS was refusing to mount due to a ref tag error on the last block of the LV (probably a backup superblock?). I was able to fix this by writing zeros to that region with dd. So perhaps trimming this SSD causes the ref tags of freed space to get zeroed out, or something like that, and yet various stages of partition/filesystem probing, and even mounting, read blocks that aren’t allocated and have therefore been trimmed.
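The zeroing was along these lines; this computes the last 4 KiB block of the LV rather than hard-coding my numbers (the device path is a placeholder):

LV=/dev/mapper/vg-lv
BLOCKS=$(( $(blockdev --getsize64 $LV) / 4096 ))
dd if=/dev/zero of=$LV bs=4096 seek=$((BLOCKS - 1)) count=1 oflag=direct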

Annoying.

I’m trying to decide whether I’m better off leaving T10-PI on or off… It appears to “work”, but in ways that the rest of the stack hasn’t necessarily accounted for. In the meantime, it’s on, and I’m back to using the filesystem.