Anyone have any suggestions for a way of testing hard drives that will reliably reveal faults? I have a bunch of disks that I pulled out because they’ve had consistent issues, but I’m now fairly sure I’ve had a PSU problem as well, which has been causing disks to drop out. So I want to test all these questionable disks and see if it’s worth using them again.
Run a SMART test, badblocks, and then put them in a temporary ZFS pool, fill it completely with anything, and run a scrub. Run another long SMART test for good measure and I’d feel pretty confident in the drives.
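If it helps, that whole sequence could be sketched roughly like this (the device name is a placeholder, and every step here destroys whatever is on the disk):

```shell
DISK=/dev/sdX   # placeholder -- substitute the real device. ALL DATA WILL BE LOST.

# 1. Long SMART self-test (runs in the drive's firmware, in the background)
smartctl -t long "$DISK"
smartctl -l selftest "$DISK"    # check the result once it finishes
smartctl -a "$DISK"             # watch reallocated/pending sector counts

# 2. Destructive badblocks write+verify pass
badblocks -wsv "$DISK"

# 3. Temporary one-disk pool: fill it, scrub it, read the error counters
zpool create -f testpool "$DISK"
dd if=/dev/urandom of=/testpool/fill bs=1M || true   # runs until the pool is full
zpool scrub testpool
zpool status -v testpool        # look for read/write/cksum errors
zpool destroy testpool

# 4. One more long SMART test for good measure
smartctl -t long "$DISK"
```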
Sounds like a plan
Generally I recommend zeroing a drive and then checking the SMART status, but if you’ve had problems in the past it’s a good idea to replace the SATA/SAS cables as well, since they do fail. “Disk I/O errors” under SMART is an important value to watch/monitor. Stress testing via a ZFS pool could work for pushing the drives until I/O errors occur.
There are some IT folks who prefer running a multi-pass zeroing to catch larger runs of bad blocks/sectors. Most HDDs follow a bathtub curve: if you’ve reached close to the half-way point of the drive’s MTBF in write volume and errors are slowly occurring, it’s easier to “retire” the drive or use it as a scratch disk. I once had a problematic drive; it turned out the SATA cable was bad, and the drive was reused for a few more years until SMART complained about the number of hours it had racked up.
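A minimal sketch of the zero-then-check approach (the device name is a placeholder; this wipes the drive):

```shell
DISK=/dev/sdX   # placeholder -- this zeroes the entire drive

# Zero the whole surface; a failing write shows up as an I/O error here
dd if=/dev/zero of="$DISK" bs=1M status=progress

# Then read the SMART counters that matter
smartctl -a "$DISK" | grep -iE 'realloc|pending|uncorrect|crc'
```

A rising UDMA CRC error count in particular usually points at the cable or connector rather than the platters, which matches the bad-cable story above.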
Does scrub test all copies of the data now? Like, both sides of mirrors?
It’s my understanding that it steps through every record in the pool and verifies it via checksum. I don’t think it cares whether a record is redundant or parity until it needs to heal a bad one.
I think your original suggestion of badblocks should do the same, but for every sector on each drive, several times over? And then use ZFS for safe data storage after the drives check out…
I think it’s prudent to do a real-world test after smart and badblocks, but I also never actually looked into what badblocks does under the hood.
I could be wrong, but I got the impression it wrote a pattern again and again for the entire drive, then read the whole lot back, then wrote a different pattern, then read that, and so on: four passes by default?
And that you could provide the pattern or let it go ahead.
If so, probably not great to do too often on flash drives, but doing it once when you first get one might let the onboard controller block off any bad pages?
I’m not sure it’s appropriate to run badblocks on an ssd at all tbh. I only use it on hdds.
Any kind of write-then-read-to-verify test should work. The drive was already tested at the factory; all you are doing is checking for very rare shipping damage or bad cable connections (which is prudent IMO). After that just use the drive; you can’t ever test it enough to keep data loss from happening. badblocks works, but it’s overkill and takes days on the large-capacity drives I use.
If you need help sleeping at night, a checksumming file system like ZFS or BTRFS with proper redundancy, and incremental backups are required.
These aren’t new disks, they’re disks I pulled out because of errors but now I think the errors may have been because of a dud power supply. I just want to test them before deciding to use them again or not.
That’s exactly how badblocks works, and I agree with @oO.o , I wouldn’t ever use it on an SSD.
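For reference, the invocations look like this (the device path is a placeholder, and `-w` destroys all data on it):

```shell
# Default destructive test: four write+read passes over the whole drive,
# using the patterns 0xaa, 0x55, 0xff and 0x00 in turn
badblocks -wsv /dev/sdX     # -w write-mode, -s show progress, -v verbose

# Or supply your own pattern; "random" does a single write+read pass
badblocks -wsv -t random /dev/sdX

# Very large drives may need a bigger block size to stay within
# badblocks' 32-bit block counter
badblocks -wsv -b 4096 /dev/sdX
```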
I mostly follow the steps from this guide, especially when dealing with 2nd-hand drives:
- SMART short test
- SMART conveyance test
- SMART long test
- badblocks (destructive write-mode pass)
- SMART long test
For new drives I might do only 1, 2 and 3, depending on time.
This script (from TheArtOfServer) is useful for testing big batches, btw. It can send you an email when it’s done, which is nice as badblocks runs can take a while…
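I haven’t copied the linked script here, but the core idea of running badblocks on a batch of drives in parallel and mailing yourself when each one finishes can be sketched like this (the device list, log location and address are all placeholders, and it assumes a working `mail` command):

```shell
DRIVES="/dev/sdb /dev/sdc /dev/sdd"   # placeholders -- all data on them destroyed
MAILTO="you@example.com"              # placeholder address

for d in $DRIVES; do
  (
    log="/tmp/badblocks-$(basename "$d").log"
    badblocks -wsv "$d" > "$log" 2>&1
    mail -s "badblocks finished on $d" "$MAILTO" < "$log"
  ) &
done
wait    # block until every drive's test has completed
```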
Is there any reason to run short if you’re running long also? I guess you could abort sooner on a bad drive.
Ah that makes sense. Yeah I would do badblocks in that case if I suspected the drive was having issues.
I once saw someone on r/datahoarder do some testing on his drives that involved recording and examining the I/O speed when writing and reading across sectors, with the theory that a drive with marginal areas that aren’t “bad” yet would show higher variance from the norm. Never saw anything after that post, but it was definitely an interesting concept.
That’s exactly the reason
badblocks plus the others’ advice is usually sufficient.
Here’s an alternative to badblocks that additionally records the read/write speed; any inexplicable drops in performance are probably surface-level damage (poor, not-yet-bad blocks): disk-filltest - Simple Tool to Detect Bad Disks by Filling with Random Data - panthema.net
I actually couldn’t find anything saying badblocks on an ssd is a bad idea. It’s not like one pass would be that terrible, although I wouldn’t run it on a schedule or anything.
FreeNAS will alert when a specific drive is hurting pool performance, but I’m not sure what the underlying mechanism for that is.
Well, SSDs come with spare capacity built in, since they expect cells to go bad as part of normal operation; as long as there’s enough life left on the SSD, testing for bad blocks won’t really do much, I’d imagine. Indeed, the amount of data badblocks writes (four times the entire capacity of the drive) is probably not going to do any favours for the drive’s life expectancy (especially when using a “bad” block size).
Note that a non-destructive badblocks still writes to the drive, it just attempts to write the original data back after (meaning it isn’t exactly risk-free, and writes even more data).
A long SMART self-test is already supposed to scan the entire capacity of the drive (per my understanding, this is read-only). Not sure there’s much to gain by writing the entire capacity of the drive; I guess it could flush out rapidly developing bad sectors on a fresh SSD…?