
SMART and dmesg errors - is this HDD dying?

Recently I’ve started getting occasional temporary I/O freezes, together with dmesg errors similar to these:


[12295.723447] ata15.00: exception Emask 0x0 SAct 0x200000 SErr 0x0 action 0x0
[12295.723450] ata15.00: irq_stat 0x40000001
[12295.723452] ata15.00: failed command: READ FPDMA QUEUED
[12295.723455] ata15.00: cmd 60/20:a8:c0:39:18/00:00:00:00:00/40 tag 21 ncq 16384 in
         res 51/40:05:db:39:18/00:00:00:00:00/40 Emask 0x409 (media error) <F>
[12295.723457] ata15.00: status: { DRDY ERR }
[12295.723458] ata15.00: error: { UNC }
[12295.729220] ata15.00: configured for UDMA/133
[12295.729232] sd 14:0:0:0: [sdi] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[12295.729234] sd 14:0:0:0: [sdi] tag#21 Sense Key : Medium Error [current] [descriptor] 
[12295.729236] sd 14:0:0:0: [sdi] tag#21 Add. Sense: Unrecovered read error - auto reallocate failed
[12295.729238] sd 14:0:0:0: [sdi] tag#21 CDB: Read(10) 28 00 00 18 39 c0 00 00 20 00
[12295.729240] blk_update_request: I/O error, dev sdi, sector 1587675
[12295.729246] BTRFS error (device dm-3): bdev /dev/mapper/hitachi1 errs: wr 0, rd 453, flush 0, corrupt 11, gen 0
[12295.729266] ata15: EH complete
[12297.005459] BTRFS info (device dm-3): read error corrected: ino 1 off 831750144 (dev /dev/mapper/hitachi1 sector 1585600)
[12298.913768] BTRFS info (device dm-3): read error corrected: ino 1 off 831754240 (dev /dev/mapper/hitachi1 sector 1585608)
[12298.922124] BTRFS info (device dm-3): read error corrected: ino 1 off 831758336 (dev /dev/mapper/hitachi1 sector 1585616)
[12299.113862] BTRFS info (device dm-3): read error corrected: ino 1 off 831762432 (dev /dev/mapper/hitachi1 sector 1585624)
[12312.056208] ata15.00: exception Emask 0x0 SAct 0x10000000 SErr 0x0 action 0x0
[12312.056211] ata15.00: irq_stat 0x40000001
[12312.056213] ata15.00: failed command: READ FPDMA QUEUED
[12312.056216] ata15.00: cmd 60/20:e0:20:f0:16/00:00:00:00:00/40 tag 28 ncq 16384 in
         res 51/40:20:20:f0:16/00:00:00:00:00/40 Emask 0x409 (media error) <F>
[12312.056217] ata15.00: status: { DRDY ERR }
[12312.056218] ata15.00: error: { UNC }
[12312.061908] ata15.00: configured for UDMA/133
[12312.061920] sd 14:0:0:0: [sdi] tag#28 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[12312.061924] sd 14:0:0:0: [sdi] tag#28 Sense Key : Medium Error [current] [descriptor] 
[12312.061934] sd 14:0:0:0: [sdi] tag#28 Add. Sense: Unrecovered read error - auto reallocate failed
[12312.061937] sd 14:0:0:0: [sdi] tag#28 CDB: Read(10) 28 00 00 16 f0 20 00 00 20 00
[12312.061939] blk_update_request: I/O error, dev sdi, sector 1503264
[12312.061947] BTRFS error (device dm-3): bdev /dev/mapper/hitachi1 errs: wr 0, rd 454, flush 0, corrupt 11, gen 0
[12312.061966] ata15: EH complete
[12314.693762] IPv4: martian source 255.255.255.255 from 10.237.235.225, on dev brnat
[12314.693768] ll header: 00000000: ff ff ff ff ff ff 08 00 27 6e f6 e6 08 00        ........'n....
[12319.714302] ata15.00: exception Emask 0x0 SAct 0x2000000 SErr 0x0 action 0x0
[12319.714307] ata15.00: irq_stat 0x40000001
[12319.714311] ata15.00: failed command: READ FPDMA QUEUED
[12319.714318] ata15.00: cmd 60/20:c8:20:0b:17/00:00:00:00:00/40 tag 25 ncq 16384 in
         res 51/40:1e:22:0b:17/00:00:00:00:00/40 Emask 0x409 (media error) <F>
[12319.714321] ata15.00: status: { DRDY ERR }
[12319.714323] ata15.00: error: { UNC }
[12319.720014] ata15.00: configured for UDMA/133
[12319.720038] sd 14:0:0:0: [sdi] tag#25 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[12319.720043] sd 14:0:0:0: [sdi] tag#25 Sense Key : Medium Error [current] [descriptor] 
[12319.720046] sd 14:0:0:0: [sdi] tag#25 Add. Sense: Unrecovered read error - auto reallocate failed
[12319.720050] sd 14:0:0:0: [sdi] tag#25 CDB: Read(10) 28 00 00 17 0b 20 00 00 20 00
[12319.720053] blk_update_request: I/O error, dev sdi, sector 1510178
[12319.720063] BTRFS error (device dm-3): bdev /dev/mapper/hitachi1 errs: wr 0, rd 455, flush 0, corrupt 11, gen 0
[12319.720089] ata15: EH complete
[12320.040862] BTRFS info (device dm-3): read error corrected: ino 1 off 788545536 (dev /dev/mapper/hitachi1 sector 1501216)
[12320.048136] BTRFS info (device dm-3): read error corrected: ino 1 off 792084480 (dev /dev/mapper/hitachi1 sector 1508128)
[12320.049188] BTRFS info (device dm-3): read error corrected: ino 1 off 788549632 (dev /dev/mapper/hitachi1 sector 1501224)
[12320.056486] BTRFS info (device dm-3): read error corrected: ino 1 off 792088576 (dev /dev/mapper/hitachi1 sector 1508136)
[12320.057540] BTRFS info (device dm-3): read error corrected: ino 1 off 788553728 (dev /dev/mapper/hitachi1 sector 1501232)
[12320.064841] BTRFS info (device dm-3): read error corrected: ino 1 off 792092672 (dev /dev/mapper/hitachi1 sector 1508144)
[12320.065898] BTRFS info (device dm-3): read error corrected: ino 1 off 788557824 (dev /dev/mapper/hitachi1 sector 1501240)
[12320.074513] BTRFS info (device dm-3): read error corrected: ino 1 off 792096768 (dev /dev/mapper/hitachi1 sector 1508152)
[12334.805508] ata15.00: exception Emask 0x0 SAct 0xfc0000 SErr 0x0 action 0x0
[12334.805512] ata15.00: irq_stat 0x40000008
[12334.805514] ata15.00: failed command: READ FPDMA QUEUED
[12334.805518] ata15.00: cmd 60/20:b8:00:d1:17/00:00:00:00:00/40 tag 23 ncq 16384 in
         res 51/40:11:0f:d1:17/00:00:00:00:00/40 Emask 0x409 (media error) <F>
[12334.805520] ata15.00: status: { DRDY ERR }
[12334.805521] ata15.00: error: { UNC }
[12334.811207] ata15.00: configured for UDMA/133
[12334.811232] sd 14:0:0:0: [sdi] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[12334.811234] sd 14:0:0:0: [sdi] tag#23 Sense Key : Medium Error [current] [descriptor] 
[12334.811236] sd 14:0:0:0: [sdi] tag#23 Add. Sense: Unrecovered read error - auto reallocate failed
[12334.811238] sd 14:0:0:0: [sdi] tag#23 CDB: Read(10) 28 00 00 17 d1 00 00 00 20 00
[12334.811240] blk_update_request: I/O error, dev sdi, sector 1560847
[12334.811246] BTRFS error (device dm-3): bdev /dev/mapper/hitachi1 errs: wr 0, rd 456, flush 0, corrupt 11, gen 0
[12334.811260] ata15: EH complete
[12334.904234] BTRFS info (device dm-3): read error corrected: ino 1 off 818020352 (dev /dev/mapper/hitachi1 sector 1558784)
[12335.123438] BTRFS info (device dm-3): read error corrected: ino 1 off 818024448 (dev /dev/mapper/hitachi1 sector 1558792)
[12335.145937] BTRFS info (device dm-3): read error corrected: ino 1 off 818028544 (dev /dev/mapper/hitachi1 sector 1558800)
[12335.154283] BTRFS info (device dm-3): read error corrected: ino 1 off 818032640 (dev /dev/mapper/hitachi1 sector 1558808)
[12357.904701] ata15: failed to read log page 10h (errno=-5)
[12357.904706] ata15.00: exception Emask 0x1 SAct 0x1efc0 SErr 0x0 action 0x0
[12357.904707] ata15.00: irq_stat 0x40000008
[12357.904709] ata15.00: failed command: WRITE FPDMA QUEUED
[12357.904712] ata15.00: cmd 61/50:30:b0:fe:e6/00:00:cd:00:00/40 tag 6 ncq 40960 out
         res 50/00:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
[12357.904713] ata15.00: status: { DRDY }
[12357.904714] ata15.00: failed command: WRITE FPDMA QUEUED
[12357.904717] ata15.00: cmd 61/80:38:00:ff:e6/00:00:cd:00:00/40 tag 7 ncq 65536 out
         res 50/00:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
[12357.904718] ata15.00: status: { DRDY }
[12357.904719] ata15.00: failed command: WRITE FPDMA QUEUED
[12357.904729] ata15.00: cmd 61/80:40:80:ff:e6/00:00:cd:00:00/40 tag 8 ncq 65536 out
         res 50/00:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
[12357.904730] ata15.00: status: { DRDY }
[12357.904731] ata15.00: failed command: WRITE FPDMA QUEUED
[12357.904734] ata15.00: cmd 61/80:48:00:00:e7/00:00:cd:00:00/40 tag 9 ncq 65536 out
         res 50/00:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
[12357.904735] ata15.00: status: { DRDY }
[12357.904736] ata15.00: failed command: WRITE FPDMA QUEUED
[12357.904738] ata15.00: cmd 61/30:50:80:00:e7/00:00:cd:00:00/40 tag 10 ncq 24576 out
         res 50/00:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
[12357.904739] ata15.00: status: { DRDY }
[12357.904740] ata15.00: failed command: READ FPDMA QUEUED
[12357.904742] ata15.00: cmd 60/20:58:40:29:18/00:00:00:00:00/40 tag 11 ncq 16384 in
         res 50/00:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
[12357.904743] ata15.00: status: { DRDY }
[12357.904744] ata15.00: failed command: READ FPDMA QUEUED
[12357.904747] ata15.00: cmd 60/20:68:60:49:7e/00:00:2d:00:00/40 tag 13 ncq 16384 in
         res 50/00:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
[12357.904748] ata15.00: status: { DRDY }
[12357.904749] ata15.00: failed command: READ FPDMA QUEUED
[12357.904751] ata15.00: cmd 60/18:70:68:ab:f0/03:00:8a:00:00/40 tag 14 ncq 405504 in
         res 50/00:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
[12357.904752] ata15.00: status: { DRDY }
[12357.904753] ata15.00: failed command: READ FPDMA QUEUED
[12357.904755] ata15.00: cmd 60/20:78:80:e6:39/00:00:a4:00:00/40 tag 15 ncq 16384 in
         res 50/00:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
[12357.904756] ata15.00: status: { DRDY }
[12357.904757] ata15.00: failed command: READ FPDMA QUEUED
[12357.904759] ata15.00: cmd 60/40:80:c0:e6:39/00:00:a4:00:00/40 tag 16 ncq 32768 in
         res 50/00:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
[12357.904760] ata15.00: status: { DRDY }
[12357.911007] ata15.00: failed to IDENTIFY (I/O error, err_mask=0x1)
[12357.911009] ata15.00: revalidation failed (errno=-5)
[12357.911012] ata15: hard resetting link
[12358.221692] ata15: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[12358.227478] ata15.00: configured for UDMA/133
[12358.227531] ata15: EH complete
[12363.204572] ata15.00: exception Emask 0x0 SAct 0x803c00 SErr 0x0 action 0x0
[12363.204575] ata15.00: irq_stat 0x40000008
[12363.204577] ata15.00: failed command: READ FPDMA QUEUED
[12363.204580] ata15.00: cmd 60/20:b8:40:29:18/00:00:00:00:00/40 tag 23 ncq 16384 in
         res 51/40:13:4d:29:18/00:00:00:00:00/40 Emask 0x409 (media error) <F>
[12363.204581] ata15.00: status: { DRDY ERR }
[12363.204582] ata15.00: error: { UNC }
[12363.210278] ata15.00: configured for UDMA/133
[12363.210299] sd 14:0:0:0: [sdi] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[12363.210302] sd 14:0:0:0: [sdi] tag#23 Sense Key : Medium Error [current] [descriptor] 
[12363.210304] sd 14:0:0:0: [sdi] tag#23 Add. Sense: Unrecovered read error - auto reallocate failed
[12363.210306] sd 14:0:0:0: [sdi] tag#23 CDB: Read(10) 28 00 00 18 29 40 00 00 20 00
[12363.210307] blk_update_request: I/O error, dev sdi, sector 1583437
[12363.210313] BTRFS error (device dm-3): bdev /dev/mapper/hitachi1 errs: wr 0, rd 457, flush 0, corrupt 11, gen 0
[12363.210328] ata15: EH complete

It looks to me a bit like a disk failure at the hardware level. smartctl shows the following:


smartctl 6.3 2014-07-26 r3976 [x86_64-linux-4.6.4-1-default] (SUSE RPM)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 7K3000
Device Model:     Hitachi HDS723020BLA642
Serial Number:    MN5220F31RXSKK
LU WWN Device Id: 5 000cca 369d883a1
Firmware Version: MN6OA5C0
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Oct 31 01:32:48 2019 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (20235) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 338) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   076   076   016    Pre-fail  Always       -       119275578
  2 Throughput_Performance  0x0005   134   134   054    Pre-fail  Offline      -       87
  3 Spin_Up_Time            0x0007   136   136   024    Pre-fail  Always       -       420 (Average 420)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       887
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 6
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   135   135   020    Pre-fail  Offline      -       26
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       12749
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       816
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1046
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       1046
194 Temperature_Celsius     0x0002   150   150   000    Old_age   Always       -       40 (Min/Max 18/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       6
197 Current_Pending_Sector  0x0022   001   001   000    Old_age   Always       -       4999
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       20
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 48 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 48 occurred at disk power-on lifetime: 12749 hours (531 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 13 4d 29 18 00  Error: UNC at LBA = 0x0018294d = 1583437

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 68 68 00 c4 f0 40 00      03:26:18.823  READ FPDMA QUEUED
  60 00 60 00 ba f0 40 00      03:26:18.823  READ FPDMA QUEUED
  60 80 58 80 b3 f0 40 00      03:26:18.823  READ FPDMA QUEUED
  60 00 50 80 ae f0 40 00      03:26:18.823  READ FPDMA QUEUED
  61 50 e0 b0 fe e6 40 00      03:26:18.747  WRITE FPDMA QUEUED

Error 47 occurred at disk power-on lifetime: 12749 hours (531 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 13 4d 29 18 00  Error: UNC at LBA = 0x0018294d = 1583437

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 20 68 60 49 7e 40 00      03:26:12.597  READ FPDMA QUEUED
  60 20 60 e0 28 18 40 00      03:25:59.897  READ FPDMA QUEUED
  60 20 58 40 29 18 40 00      03:25:59.897  READ FPDMA QUEUED
  61 30 50 80 00 e7 40 00      03:25:59.897  WRITE FPDMA QUEUED
  61 80 48 00 00 e7 40 00      03:25:59.897  WRITE FPDMA QUEUED

Error 46 occurred at disk power-on lifetime: 12749 hours (531 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 11 0f d1 17 00  Error: UNC at LBA = 0x0017d10f = 1560847

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 20 b8 00 d1 17 40 00      03:25:52.007  READ FPDMA QUEUED
  61 80 b0 00 2b 83 40 00      03:25:52.007  WRITE FPDMA QUEUED
  61 18 a8 80 2b 83 40 00      03:25:52.007  WRITE FPDMA QUEUED
  61 80 a0 80 2a 83 40 00      03:25:52.007  WRITE FPDMA QUEUED
  61 80 98 00 2a 83 40 00      03:25:52.007  WRITE FPDMA QUEUED

Error 45 occurred at disk power-on lifetime: 12749 hours (531 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 1e 22 0b 17 00  Error: UNC at LBA = 0x00170b22 = 1510178

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 20 c8 20 0b 17 40 00      03:25:34.323  READ FPDMA QUEUED
  60 20 c0 e0 42 7e 40 00      03:25:34.315  READ FPDMA QUEUED
  60 20 b8 60 e4 68 40 00      03:25:34.300  READ FPDMA QUEUED
  60 20 b0 e0 cb 16 40 00      03:25:32.981  READ FPDMA QUEUED
  60 20 a8 80 f1 7d 40 00      03:25:32.975  READ FPDMA QUEUED

Error 44 occurred at disk power-on lifetime: 12749 hours (531 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 20 20 f0 16 00  Error: UNC at LBA = 0x0016f020 = 1503264

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 20 e0 20 f0 16 40 00      03:25:29.793  READ FPDMA QUEUED
  61 00 d8 88 53 9e 40 00      03:25:29.793  WRITE FPDMA QUEUED
  61 00 d0 80 d3 15 40 00      03:25:29.741  WRITE FPDMA QUEUED
  61 00 c8 80 7f 15 40 00      03:25:29.741  WRITE FPDMA QUEUED
  61 00 c0 80 0f 15 40 00      03:25:29.741  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I’ve never been through an actual HDD failure, so I don’t really know what the symptoms of such a thing are, but my 2 still-active brain cells tell me this is exactly what it looks like. Then again, 6 reallocation failures doesn’t look like a terrible number.
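
For what it’s worth, the `FAILING_NOW` verdict in the output above is mechanical: smartctl flags a pre-fail attribute once its normalized VALUE drops to or below the vendor THRESH (here, Reallocated_Sector_Ct at 001 against a threshold of 005). A minimal sketch of that check, with the sample rows copied from the smartctl output above:

```python
# Minimal sketch: flag SMART pre-fail attributes whose normalized VALUE
# has dropped to or below the vendor THRESH, mirroring smartctl's
# "FAILING_NOW" logic. Sample rows copied from the smartctl -A output above.

SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000b   076   076   016    Pre-fail  Always       -       119275578
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 6
197 Current_Pending_Sector  0x0022   001   001   000    Old_age   Always       -       4999
"""

def failing_attributes(table: str):
    """Return names of Pre-fail attributes with VALUE <= THRESH."""
    failing = []
    for line in table.splitlines():
        fields = line.split()
        if len(fields) < 9:
            continue
        # fields: ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, ...
        name, value, thresh = fields[1], int(fields[3]), int(fields[5])
        attr_type = fields[6]
        if attr_type == "Pre-fail" and value <= thresh:
            failing.append(name)
    return failing

print(failing_attributes(SAMPLE))  # → ['Reallocated_Sector_Ct']
```

Note the raw count is only 6 while the normalized value has cratered to 001; presumably the vendor weights this attribute very aggressively, which is why smartctl predicts failure despite the small raw number.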

Is this disk still usable, or do I need to replace it absolutely immediately, right this instant? To add some context, it’s BTRFS software RAID1, so the death of this drive shouldn’t result in data loss… Assuming BTRFS RAID1 actually works, that is…


Raid1 in BTRFS is still pretty solid, but you should really replace the drive as soon as practicable.

Raid1 half working is still better than some other systems; it will detect errors, even if it can’t heal them, and it should still snapshot etc…

I would personally treat it gently until the replacement drive is in and online, and definitely order a replacement today, assuming I had the funds.

If you have a backup, it can’t hurt to test it and check that it’s in good condition, just in case.

On the other hand, a lot of people only use 1 drive, so even at their safest, they’re at your least secure!

Tbh I have backups on another, larger BTRFS RAID5. This RAID1 is just for /home, and /home is mirrored periodically to my laptop, so it’s more or less like 2 backups. That’s why I’m a bit lazy about this and don’t really panic.

I just found out that I have a 3rd Hitachi drive sitting unplugged in this PC as a cold spare, so I guess I’ll just connect it and then remove the failing drive. Adding the new drive and then removing the old one, without a rebalance in between, should basically move all the data from the failing drive to the new one while keeping data consistency and integrity, since for the whole migration period there will be 2 copies of the data available. (I happen to have craptons of bit flips on all of my drives, idk why; maybe a strong magnetic field in the room.)
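
For the migration itself, btrfs also has a single command intended for exactly this kind of swap, which avoids the separate add/remove steps. A sketch, where the new device path and mountpoint are hypothetical:

```shell
# Hypothetical device paths and mountpoint. 'btrfs replace' copies the
# old device's data (rebuilding from the other RAID1 mirror where needed)
# onto the new device in one step.
# -r: only read from the failing source device if no good mirror exists.
btrfs replace start -r /dev/mapper/hitachi1 /dev/sdX /home
btrfs replace status /home    # progress of the running replace

# The add-then-remove route described above would instead be:
# btrfs device add /dev/sdX /home
# btrfs device remove /dev/mapper/hitachi1 /home
```

Either route keeps two copies available during the migration; `replace` just does it with less data shuffling.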


Yep, reallocated sectors are an indication it’s time to swap it out.

Lots of bit flips / corruption might also come from cable or controller issues.

I’ve heard RAID5 is the one to avoid on BTRFS, but you seem to have a fair grasp of the level of risk/reward you feel comfortable with.


On the other hand, I’m a bit skeptical about this number. I have one drive that derped heavily about 5 years ago (I remember hearing some really concerning noises, and after a reboot there were 4095 reallocated sectors). And that was it: a single event, many years ago. The drive works perfectly fine to this day.

But I guess this time it’s different, since on this drive the number keeps growing. It’s not stable like on that ancient Seagate.

Well, looks like I need to find some humble pie.

It says: “For data, it should be safe as long as a scrub is run immediately after any unclean shutdown.”
So I guess I can stop wailing on BTRFS RAID5.

https://btrfs.wiki.kernel.org/index.php/RAID56

Yeah, it’s a funny story actually, because this is my second BTRFS RAID5 backup storage. The first one I made on the initial BTRFS RAID5 implementation, before its super ultra fatal flaw was discovered and the entire implementation was rewritten from scratch with no backwards compatibility (it turned out that rebuilding RAID5 in btrfs basically didn’t work).

A few months after that information came out, my old RAID5 got corrupted at the fs level and I had to rebuild it. But since I knew by then that rebuilding didn’t work in the old implementation, I made a new BTRFS RAID5 and migrated the data to it. So you could say I don’t really learn from my mistakes lmao.

The new implementation supposedly works. So far so good. That first, flawed RAID5 broke after around 2 years; this new implementation has already worked for me for another 2 years and counting. I have a lot of faith in btrfs… probably more than it’d be wise to have… but I haven’t lost any data so far, so fingers crossed.


Besides, a fun fact regarding:

“For data, it should be safe as long as a scrub is run immediately after any unclean shutdown.”

scrub on my 16/14 TB BTRFS RAID5 (8x 2 TB drives) literally takes around 5 days (non-stop, 24h/day). So the word “immediately” here is quite a lot to ask for lmao. I love how the website says that “scrub should be run around once a week”. If I did that, I’d basically need to run scrub in a while true loop xD
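
Running the numbers on that as a rough sanity check (the ~14 TB of scrubbed data and the ~120 MB/s single-drive read rate are assumptions, not figures from the thread):

```python
# Back-of-the-envelope scrub throughput: ~14 TB scrubbed over 5 days
# nonstop. Assumed figures; decimal TB/MB throughout.
data_bytes = 14e12                      # ~14 TB of data
seconds = 5 * 24 * 3600                 # 5 days
agg_mb_s = data_bytes / seconds / 1e6
print(round(agg_mb_s, 1))               # → 32.4 (MB/s aggregate)

# For comparison, one 2 TB drive read sequentially at ~120 MB/s:
single_drive_hours = 2e12 / 120e6 / 3600
print(round(single_drive_hours, 1))     # → 4.6 (hours per drive)
```

So the whole 8-drive array is scrubbing at roughly a quarter of what a single drive can stream sequentially, which matches the "that is peculiar" reaction below.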

That is peculiar.
I would have thought a 2TB drive should scrub in a couple of hours, max, so 8 drives should be at most an overnight job?

Ironically, scrub on RAID56 is not run on all drives in parallel. It’s like running scrub on each drive individually, except even slower than a scrub of a single drive, probably because of the parity computation and verification involved. That’s why a scrub of an 8-drive RAID56 actually takes longer than scrubbing 8 drives individually, one by one.

You can see this behavior in iotop: when you run scrub on an 8-bay RAID56, only 1 or 2 drives are active at once. There’s also some massive read amplification due to this behavior.
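
Since scrub serializes anyway, one workaround is to point it at individual member devices rather than the mountpoint; `btrfs scrub start` accepts either. A sketch, with device names and mountpoint hypothetical:

```shell
# Hypothetical device names and mountpoint. -B keeps each scrub in the
# foreground, so the loop works through the drives strictly one by one.
for dev in /dev/sda /dev/sdb /dev/sdc; do
    btrfs scrub start -B "$dev"
done
btrfs scrub status /mnt/backup    # aggregate stats for the filesystem
```

This matches the "each drive individually, one by one" behavior described above, just under your control instead of the kernel's scheduling.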

It’s not a sign of imminent failure, but of a disk that is beginning to fail. Your disk from 5 years ago was likely a collision and should have been RMA’d.
