LVM raid1 with dm-integrity - How am I meant to replace disks safely?

I’m curious if anyone in the community has experience using LVM raid1 alongside its dm-integrity support (link).

For context, I have a logical volume in raid1 with integrity enabled in bitmap mode. I’ve found the standard LVM tooling rather cumbersome in several cases that, on the surface, seem like they should just work.

My setup uses HDDs for the data storage, and I also have 2 SATA SSDs in the volume group which hold the integrity data.
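
For reference, the LV was created along these lines (VG, LV, and device names below are placeholders rather than my exact commands, and option availability depends on your lvm2 version):

```
# 2-way raid1 LV with dm-integrity enabled in bitmap mode at creation time
lvcreate --type raid1 -m 1 -L 1T -n data \
    --raidintegrity y --raidintegritymode bitmap \
    vg0
```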

Getting all the integrity data to live on the SSDs wasn’t straightforward. I felt like I was essentially “tricking” LVM into putting it there rather than being able to specify dedicated integrity/checksum PVs, but I did eventually manage to extend the LV in clever ways to get the integrity metadata onto the SSD PVs.
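
You can see where the integrity metadata actually ended up by listing the hidden sub-LVs; the *_imeta sub-LVs are the ones I coaxed onto the SSD PVs:

```
# -a includes hidden sub-LVs such as data_rimage_0, data_rimage_0_imeta, etc.
lvs -a -o lv_name,segtype,devices vg0
```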

In general things seem to work fine. The bitmap mode performance hit doesn’t seem noticeable for my use cases, especially with the integrity data on the SSDs. I’m able to schedule periodic scrubs to validate my data, and presumably it should auto-repair bitrot assuming one of the mirrored disks has a valid checksum match (I haven’t had any mismatches thus far though, so I’ve never put that to the test).
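
For the scrubs I’m just using the raid sync actions, roughly like this (the reporting field names may differ between lvm2 versions):

```
# start a scrub of the mirror, then inspect the counters once it finishes
lvchange --syncaction check vg0/data
lvs -o+raid_sync_action,raid_mismatch_count vg0/data

# per-image integrity mismatch counters (newer lvm2)
lvs -a -o lv_name,integritymismatches vg0
```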

The problem comes when I try to replace drives. For example, if I try to pvmove a PV onto another, empty PV, it returns an error saying that’s not supported while integrity is enabled. I have to disable integrity first and then pvmove.
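
So a disk replacement currently ends up looking something like this (again with placeholder names):

```
lvconvert --raidintegrity n vg0/data    # drop the integrity layer first
pvmove /dev/old_disk /dev/new_disk      # now the move is allowed
vgreduce vg0 /dev/old_disk              # remove the emptied PV from the VG
lvconvert --raidintegrity y --raidintegritymode bitmap vg0/data   # re-add integrity
```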

Doesn’t this kind of defeat the point? It introduces a period during which I’m not tracking data integrity. I would’ve thought I should be able to replace an old/bad disk with integrity active, especially with the volume being raid1.

Does anyone have experience using a similar setup? If so, do you have any tips/suggestions on how to replace disks?


It does, but you’re overlooking the fact that before such a migration, any tasks writing data to the PV should be halted/stopped/paused, so there’s no data being written during that period.

(FYI: I investigated LVM a while back but I’m not actively using it, so I may have misinterpreted the man pages. Mea culpa then :facepalm: )

What concerns me in that case is that, if I haven’t scrubbed the drives recently, disabling integrity temporarily erases my chance of detecting bitrot that’s already sitting on the disks.

Whenever I’ve disabled and re-enabled integrity, it appears to recompute all the checksums from scratch. I suppose it has to, since it can’t know whether I wrote new data to the disks while integrity was disabled.

So the new checksums will treat the bitrot as correct data instead of being compared against the older, correct checksums.
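
The best mitigation I can think of is forcing a scrub immediately before the disable step, so any existing mismatch gets caught (and hopefully repaired from the good mirror leg) while the old checksums still exist, e.g.:

```
lvchange --syncaction check vg0/data    # scrub while integrity is still active
# wait for the sync action to finish and check the mismatch counters,
# and only then:
lvconvert --raidintegrity n vg0/data
```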


If you find no better leads to follow, maybe ask a mailing list where devs hang out?