Testing, fixing, and when to give up on bad drives for ZFS

Nils42 · December 5, 2021, 9:21pm

I recently picked up a set of old enterprise Toshiba SSDs*, and spent yesterday replacing all my old array drives with them. I should have tested them first, but I was excited to get off of my spinning rust and have backups. Anyway, my pool status now says degraded and I have a handful of checksum errors on my drives. SMART comes back clean and scrubbing doesn’t seem to change anything, but I can no longer run my replication backups because they fail on those checksum errors.
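A quick way to see which devices are carrying checksum errors is to filter the CKSUM column of zpool status. What follows is only a minimal sketch: cksum_errors is a made-up helper name, tank is a placeholder pool name, and it assumes the usual NAME/STATE/READ/WRITE/CKSUM table layout in the status output.

```shell
# cksum_errors: read `zpool status` output on stdin and print "NAME CKSUM"
# for every row of the config table whose CKSUM column is non-zero.
# (Hypothetical helper, assumes the standard five-column table.)
cksum_errors() {
    awk '
        $1 == "NAME" && $NF == "CKSUM"   { in_table = 1; next }  # header row
        /^errors:/                       { in_table = 0 }        # table is over
        in_table && NF >= 5 && $5 != "0" { print $1, $5 }
    '
}
```

It would be used along the lines of `zpool status tank | cksum_errors`. Pool and vdev rows are reported too if their own counters are non-zero.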

At the moment I’ve offlined the worst offender and am marching through it with a badblocks command, which so far looks clean. So my question is: where do I go from here? I would like to assume that badblocks finds whatever bad sectors there are and that after that I’m good to keep using the drives as is, but I don’t actually know what else to do, and I’m not sure at what point I should just give up and look for new drives.

*Drives in question are PX02SMQ160 1.6TB SAS SSDs.

Trooper_ish · December 6, 2021, 2:48am

Hi, I’m not sure if badblocks works the same on SSDs, might want to check their site?

SSD controllers LIE and will hide stuff.

If I got errors, I would do a full/advanced secure erase, because the lies get hidden…
SSDs do go bad, but they hide it until it gets real bad… If a couple of drives gave errors, and one that’s put back in starts erroring again after a clear, I’d look elsewhere.

But a couple of drives at once could be a displaced cable to the backplane or a memory stick that has worked a bit loose.

I’d re-jig the cables, press on the memory DIMMs and double-check that the SSDs don’t have any lint on the pins.

Nils42 · December 6, 2021, 5:37am

Well, it finished the badblocks -ns run without any errors, and I reran a long SMART test. That’s taking forever, so I’ll report back when it’s done.
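For anyone wanting to repeat the same checks, the sequence could be wrapped up roughly as below. This is a sketch under assumptions, not the exact commands that were run: check_disk is a made-up function name, and tank, sdX and /dev/sdX are placeholders for the pool, the device name as zpool status shows it, and its device node. The -v flag is an addition here so that badblocks reports what it finds.

```shell
# check_disk POOL VDEV DEVICE
# Hypothetical helper: take a disk out of the pool, then test it.
check_disk() {
    pool=$1
    vdev=$2
    dev=$3
    # Offline first, so the pool is not using the disk while badblocks writes to it.
    zpool offline "$pool" "$vdev" || return 1
    # -n: non-destructive read-write test, -s: show progress, -v: verbose
    badblocks -nsv "$dev"
    # Starts the long self-test on the drive itself; the command returns right away.
    smartctl -t long "$dev"
}
```

A call would look like `check_disk tank sdX /dev/sdX`. Once the self-test has finished, `smartctl -a /dev/sdX` shows the result, and `zpool online tank sdX` puts the disk back into the pool.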

Now that you mention it, this is on an as of yet unused part of my backplane so there very well could be something funky going on with those cables/ports/card.

I guess my other question (assuming one of those things fixes it) is: how do I go about clearing the checksum errors and restoring the health of the pool?

Trooper_ish · December 6, 2021, 9:06am

zpool clear <poolname>

And then might be worth a scrub?
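Put together, the clear-then-scrub sequence might look something like this. It is only a sketch: clear_and_scrub is a made-up name and tank stands in for the real pool name.

```shell
# clear_and_scrub POOL
# Hypothetical helper: reset the error counters, then re-verify the pool.
clear_and_scrub() {
    pool=$1
    zpool clear "$pool" || return 1   # reset the READ/WRITE/CKSUM counters
    zpool scrub "$pool" || return 1   # start re-reading and verifying every block
    zpool status -v "$pool"           # scrub progress, plus any files with errors
}
```

Run it as `clear_and_scrub tank`. The scrub keeps going in the background, so zpool status needs checking again later. If the checksum counts climb back up afterwards, whatever caused them is still there.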

Nils42 · December 6, 2021, 8:56pm

Thanks.

SMART finished and indeed showed no issues, so I’ll switch my drives over to the other side of my backplane, do that clear and see what happens.

Nils42 · December 8, 2021, 4:56am

Well, I switched my drives over to the other side of my backplane but was still getting errors. Then I realized that I could see which files were causing problems with zpool status -v, and discovered that the only files with errors belonged to my VM, which had already been acting strangely.
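To pull just the affected files out of that output, something like the following could work. It is a rough sketch: errored_files is a made-up helper name, tank is a placeholder pool name, and it assumes the list sits under the usual "errors: Permanent errors have been detected in the following files:" heading.

```shell
# errored_files: read `zpool status -v` output on stdin and print the paths
# listed under the "Permanent errors" heading, one per line.
# (Hypothetical helper, assumes the standard status layout.)
errored_files() {
    awk '
        /^errors: Permanent errors/ { listing = 1; next }  # the file list follows
        listing && /^[ \t]*pool:/   { listing = 0 }        # another pool begins
        listing && NF               { sub(/^[ \t]+/, ""); print }
    '
}
```

Usage would be `zpool status -v tank | errored_files`. If everything listed belongs to one dataset or one VM image, that points at damaged data rather than at the drives or cabling.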

I tried to pull data out of the VM while it was running, but it started throwing way more errors. So I cut my losses, deleted the VM, ran a zpool clear and a scrub, and now everything is nice and clean.

So it looks like I’m all in the clear now. Remember kids, start with seeing what files are actually throwing errors before you go about changing everything and assuming bad hardware.
