How To Handle SSD Reporting Read Failure?

programster · January 28, 2023, 9:58am

Hello there,
I wasn’t sure whether to put this under Linux or hardware, as it’s both. I decided on Linux because I want to know what I need to do to handle this in the OS and its tools.

I’m running KVM on Debian 11, and I noticed that I got a disk I/O error when trying to copy one of my qcow2 disk files to a backup drive. After noticing this, I immediately started shifting all the other stuff off the drive wherever possible, and have backed up the system that was using the qcow2 file in question through other means.

Filesystem Check

After doing this, I tried to force a filesystem check on the root partition and the data drives by running:

sudo tune2fs -c 1 /dev/sda2
sudo tune2fs -c 1 /dev/sdb1

This sets the maximum mount count to 1, so the filesystems should be checked on every boot. I did see a check flash past on the next boot, but it was so fast that I don’t really believe it ran. If there is anything else I should have done, please let me know.
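One way to confirm whether the check really ran is to read the filesystem's own bookkeeping back with `tune2fs -l`. The sketch below is not from the original post: `fsck_summary` is a hypothetical helper name, and it only filters the standard `tune2fs -l` fields that record the last check time and the mount counters.

```shell
# Hypothetical helper: keep only the tune2fs -l fields that show
# whether (and when) a filesystem check last ran.
fsck_summary() {
    grep -E '^(Filesystem state|Mount count|Maximum mount count|Last checked):'
}

# Usage, with the devices from the post (requires root):
#   sudo tune2fs -l /dev/sda2 | fsck_summary
#   sudo tune2fs -l /dev/sdb1 | fsck_summary
```

If "Last checked" shows the time of the most recent boot, the check did run; a fast check is normal on a clean ext4 filesystem.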

SmartCTL Tests

The next thing I did was run SMART tests against the drives by executing:

sudo smartctl -t long /dev/disk/by-id/ata-Samsung_SSD_870_QVO_1TB_S5SVNG0N875075P
sudo smartctl -t long /dev/disk/by-id/ata-Samsung_SSD_870_EVO_1TB_S626NX0RA09514F

… and I set an alarm to go off when the output said the tests would be finished. When I run the commands to get the test results, I get the following:

sudo smartctl -l selftest /dev/disk/by-id/ata-Samsung_SSD_870_QVO_1TB_S5SVNG0N875075P
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-20-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     13958         -
# 2  Extended offline    Completed without error       00%      6883         -
# 3  Short offline       Completed without error       00%      6882         -
# 4  Short offline       Completed without error       00%      5974         -

…and…

sudo smartctl -l selftest /dev/disk/by-id/ata-Samsung_SSD_870_EVO_1TB_S626NX0RA09514F
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-20-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      7197         115010096
# 2  Extended offline    Completed without error       00%       123         -
# 3  Short offline       Completed without error       00%       122         -

So at least this means that SMART has detected the problem too. My question at this point is: what should I do with the drive? If I want to continue using it, do I just need to run a smart erase and set it up again with fresh partitions, and will the drive then know not to use the faulty blocks? Should I just bin the drive, as it’s very likely to have other failures in the future? Are there any repair jobs or tools I should run?
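The `LBA_of_first_error` value in the self-test log can be turned into a byte offset on the disk, which helps when re-reading just that sector. This is a minimal sketch, not something from the original post: `lba_to_bytes` is a hypothetical helper, and the 512-byte logical sector size is an assumption that should be confirmed with `sudo blockdev --getss <device>`.

```shell
# Hypothetical helper: convert an LBA to a byte offset on the whole disk.
# Second argument is the logical sector size (assumed 512 if omitted).
lba_to_bytes() {
    lba=$1
    sector_size=${2:-512}
    echo $((lba * sector_size))
}

lba_to_bytes 115010096   # prints 58885169152

# Read-only probe of that single sector (requires root, assumes 512-byte sectors):
#   sudo dd if=/dev/disk/by-id/ata-Samsung_SSD_870_EVO_1TB_S626NX0RA09514F \
#        of=/dev/null bs=512 skip=115010096 count=1 iflag=direct
```

If the `dd` read fails with an I/O error, the sector is still unreadable. The probe only reads, so it does not change anything on the drive.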

Gnuuser · January 28, 2023, 2:57pm

Rule out the simplest causes
1st, remove the drive, clean the contacts, and inspect them for damage.
Carefully reseat the drive, making sure it’s squarely in the socket. This is to rule out drift and intermittent contact.
Then run the tests under various loads.

SSDs are reliable and seldom fail unless they have suffered a power surge or static discharge.
The fact that it is functional rules out static and surge damage.
But we can’t rule out memory block failure on the SSD.
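To help tell a connection problem from failing memory blocks, the SMART attribute table from `smartctl -A` can be filtered for the relevant counters. This is a rough sketch under assumptions: `smart_error_attrs` is a hypothetical helper, and attribute names vary by vendor and smartctl version, so the pattern list may need adjusting for this drive.

```shell
# Hypothetical helper: keep SMART attributes that usually point at either
# bad blocks (reallocated/uncorrectable/pending) or a link problem (CRC errors).
# The name patterns are assumptions; vendors label attributes differently.
smart_error_attrs() {
    grep -Ei 'Reallocated|Uncorrect|CRC_Error|Bad_Block|Pending'
}

# Usage (requires root):
#   sudo smartctl -A /dev/disk/by-id/ata-Samsung_SSD_870_EVO_1TB_S626NX0RA09514F | smart_error_attrs
```

A rising CRC error count tends to suggest cabling or contact trouble, while reallocated or uncorrectable counts suggest failing blocks on the drive itself.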

You should be able to see the partition with any drive management software in Windows or Linux.
If your system does not see it, try using a different machine.