Troubleshooting IO errors with TrueNAS

BKCOM · May 16, 2022, 1:09pm

I recently installed TrueNAS Scale on a small PC and gave it five 8TB drives in raidz2. I think I am having SMART errors, but I want to be sure before I send the drives back.

Using an SMB share, I was transferring data from a Windows 10 PC to the NAS when the transfer stopped. The TrueNAS GUI indicated that the pool was unhealthy, but it now says healthy after restarting the NAS.

GUI says that the disks all failed a SMART test, but running smartctl on the disks shows they pass.

Any ideas on how to figure this out?

EDIT: Here’s the smartctl for one of the drives

root@truenas[~]# smartctl -a /dev/sdd
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.109+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HUH728080ALE601
Serial Number:    VLHJKXKY
LU WWN Device Id: 5 000cca 260d5a0b7
Firmware Version: A4GL0003
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon May 16 09:18:09 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline 
data collection:                (  101) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (   1) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   134   134   054    Pre-fail  Offline      -       104
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       145
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   140   140   020    Pre-fail  Offline      -       15
  9 Power_On_Hours          0x0012   095   095   000    Old_age   Always       -       37655
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       42
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
 45 Unknown_Attribute       0x0023   100   100   001    Pre-fail  Always       -       1095233372415
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       588
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       588
194 Temperature_Celsius     0x0002   150   150   000    Old_age   Always       -       40 (Min/Max 19/56)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
231 Temperature_Celsius     0x0032   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0012   100   100   000    Old_age   Always       -       3246211477567
242 Total_LBAs_Read         0x0012   100   100   000    Old_age   Always       -       3691552074326

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     37654         -
# 2  Extended offline    Interrupted (host reset)      90%     37643         -
# 3  Short offline       Completed without error       00%     37639         -
# 4  Short offline       Completed without error       00%     37611         -
# 5  Vendor (0x70)       Completed without error       00%     34889         -
# 6  Vendor (0x71)       Completed without error       00%     34889         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

EDIT 2:

Looks like the smartctl -t long test is hanging at 90%. Bad drives, or possibly enclosure?

anon75264233 · May 16, 2022, 2:17pm

If you put all the drives through a proper burn-in with smartctl and badblocks, then we can say the drives are good.

The bad-sector count in the SMART log was 0 (good). That is the one you want to pay close attention to.

I would check dmesg for any I/O error messages to see if the cable or enclosure is being flaky. It seems that you might be able to trigger it by putting it under a sustained load, like with your previous test.
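
A minimal sketch of that dmesg check, assuming the kernel log is available as text (for example from dmesg -T). The function name and the pattern list are illustrative assumptions, not a complete catalogue of kernel messages; they cover common link-reset and block-layer I/O error lines.

```shell
# Hypothetical helper: keep only kernel log lines that usually point at
# link or transport trouble (cable, backplane, enclosure) or I/O errors.
# The pattern list is an assumption; extend it for your controller.
filter_io_errors() {
    grep -E -i 'I/O error|hard resetting link|link is slow to respond|exception Emask|failed command|SError'
}

# Usage on a live system (usually needs root):
#   dmesg -T | filter_io_errors
```

If lines like these pile up for several drives at once while the pool is under load, the shared cable, backplane, or controller is a more likely suspect than five drives failing together.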

Fawkes · May 16, 2022, 2:37pm

Not sure if you have cables (judging by the case), but I’ve had many similar issues with both old and new SATA cables.
Especially crosstalk. Make sure the cables have as much distance between each other as possible, or at least don’t touch.

BKCOM · May 16, 2022, 2:51pm

I’m not sure if I did a proper burn-in; all I did before making the pool was check SMART with smartctl -a.

After I got an issue, I tested with smartctl -t short, which passed, but smartctl -t long seems to hang at 90%. I’m giving it 5 minutes to run on 1 drive.

Here is the part of the smartctl -a output that I get after trying to run smartctl -t long:

Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
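
A small sketch for watching that progress figure without reading the whole report each time. It assumes the smartctl 7.x text layout shown above; the function name is hypothetical.

```shell
# Hypothetical helper: read `smartctl -a` (or `smartctl -c`) output on stdin
# and print the percentage of the running self-test that is still remaining.
# Prints nothing when no self-test is in progress.
selftest_remaining() {
    sed -n 's/^[[:space:]]*\([0-9][0-9]*\)% of test remaining\..*$/\1/p'
}

# Usage: poll one drive every 10 minutes (device name is an example).
#   while sleep 600; do smartctl -a /dev/sdd | selftest_remaining; done
```

On a drive this size the number can sit at 90 for a long time, since the long test reads the entire surface.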

BKCOM · May 16, 2022, 3:11pm

Looks like one of the drives is failing part of the test, but still “passing”:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: servo/seek failure 80%     43260         0
# 2  Short offline       Completed without error       00%     43260         -
# 3  Extended offline    Interrupted (host reset)      90%     43246         -
# 4  Extended offline    Completed: servo/seek failure 90%     43245         0
# 5  Short offline       Completed: servo/seek failure 80%     43242         0
# 6  Vendor (0x70)       Completed without error       00%     43170         -
# 7  Vendor (0x71)       Completed without error       00%     43170         -

I’m going to try to return these drives, buying used probably wasn’t the best idea…

felixthecat · May 16, 2022, 3:27pm

I would consider writing all blocks on the hard drives a few times to be a proper burn-in test.
There’s some good advice in this post

It is definitely possible that the issue doesn’t show up in SMART tests unless you do a proper burn-in first.

anon75264233 · May 16, 2022, 3:50pm

^^^ this

I did a burn-in test recently for my 8TiB drive and it took 4.5 days. Keep that in mind, OP.

BKCOM · May 16, 2022, 3:50pm

Thanks for that link, I’m running badblocks now and I will rerun the smartctl test.

anon75264233 · May 16, 2022, 3:51pm

Badblocks writes the entire drive 4 times FYI.

10101010
01010101
11111111
00000000

And compares each time, then at the end of it you have a Zero’ed out drive ready to go.
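
A rough estimate of why this takes days: each of the four patterns is written across the whole drive and then read back, so badblocks -w makes about eight full passes. The sketch below assumes a sustained average throughput of 180 MB/s, which is a guess rather than a measured figure for these drives; the function name is hypothetical.

```shell
# Hypothetical helper: whole hours needed for N full-drive passes.
#   $1 = drive capacity in bytes
#   $2 = assumed average throughput in MB/s (10^6 bytes per second)
#   $3 = number of full passes (default 8: 4 patterns x write + read-back)
burnin_hours() {
    bytes=$1
    mbps=$2
    passes=${3:-8}
    echo $(( bytes * passes / (mbps * 1000000 * 3600) ))
}

# 8 TB drive from this thread, assumed 180 MB/s average:
burnin_hours 8001563222016 180
```

Under that assumption the result is 98 hours, roughly four days, which is in line with the 4.5 days reported above.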

BKCOM · May 16, 2022, 4:43pm

Oof, this is going to take a while.

TrueNAS keeps changing the drive letters of the drives, which is annoying and hopefully not part of the problem.

I got tmux working and I’m running it on all 5 disks. I’ll report back in a week when this is done, I guess!

The output of badblocks starts out normal, with a percentage done etc., but quickly turns into fast-scrolling numbers. Is this normal?

felixthecat · May 16, 2022, 5:00pm

What’s the exact command you ran?

This is a bit odd indeed, I’ve never seen that happen.

anon75264233 · May 16, 2022, 5:44pm

Did you do -wsv or just -ws? The -s flag shows progress through the drive, and the -v flag is verbose mode, which adds error counts to the output.

Keep in mind that you cannot run badblocks on a live zfs pool, you need to remove them or destroy your pool before doing badblocks.

^^^ yes please post OP.

If the drives are on a hot-swap HBA and are constantly getting reconnected due to a flaky cable, that might be why the letters keep being reassigned.

The /dev/daX name doesn’t really matter, because TrueNAS uses the /dev/disk/by-id/* path.
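
To see which physical drive currently sits behind which letter, the persistent names can be resolved to kernel names. This is a sketch assuming a Linux system with the usual /dev/disk/by-id symlinks; the function name is hypothetical.

```shell
# Hypothetical helper: print "persistent-name -> kernel-name" for every
# symlink in a by-id style directory (default: /dev/disk/by-id).
list_disk_ids() {
    dir=${1:-/dev/disk/by-id}
    for link in "$dir"/*; do
        [ -L "$link" ] || continue          # skip anything that is not a symlink
        target=$(readlink -f "$link")       # resolve to the real device node
        printf '%s -> %s\n' "${link##*/}" "${target##*/}"
    done
}

# Usage:
#   list_disk_ids | grep -v -- '-part'    # whole disks only, skip partitions
```

The persistent names include the serial number, so they can be matched against the Serial Number line in the smartctl output even after the letters move.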

BKCOM · May 16, 2022, 6:14pm

I did badblocks -b 4096 -ws /dev/sdd. I disconnected the pool with the “disconnect/export” button in the GUI, so the pool is not showing anymore. I used tmux to run the command on all the disks at once.
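
One likely reason -b 4096 is needed on drives this large: badblocks is commonly reported to keep block numbers in 32-bit values, so with its default 1024-byte block size an 8 TB drive has too many blocks to address. The check below is a sketch under that assumption; the function name is hypothetical.

```shell
# Hypothetical helper: does a drive's block count fit in an unsigned
# 32-bit number at a given block size? (Assumes badblocks' 32-bit limit.)
#   $1 = drive capacity in bytes, $2 = block size in bytes
fits_32bit() {
    blocks=$(( $1 / $2 ))
    if [ "$blocks" -lt 4294967296 ]; then
        echo yes
    else
        echo no
    fi
}

fits_32bit 8001563222016 1024   # default block size
fits_32bit 8001563222016 4096   # block size used in this thread
```

The 4096-byte size also matches the physical sector size these drives report.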

All instances of badblocks have now quit, reporting too many bad blocks to continue.

Now the drives have moved to new letters again somehow, maybe after the test completed? Anyway, I’ll try to run smartctl again, but if badblocks failed I’m assuming the drives are bad…

EDIT:

smartctl -t short /dev/sdd works, but smartctl -t long /dev/sdd is stuck at 90%. The output says it should take 1 minute, but I’m not sure how long it’s supposed to take.

anon75264233 · May 16, 2022, 7:34pm

This is a good decision.

Sorry you got bad(-ish) drives. You can still use drives with bad sectors, since there is built-in slack space, but once the rot starts to overtake that space is when bad things happen. If you’re starting out with a fresh install, then it would be a disservice to recommend the use of these drives.

Are they all that way? Do they still have a warranty?

A short test takes a few minutes, and a conveyance test might take half an hour. A long test will take several hours. For my 8 TiB drives it took about 2.5 hours.

BKCOM · May 16, 2022, 10:01pm

Yes, they’re all warrantied by the reseller, we’ll see how that goes. I’m going to send them back if possible and get new 4TB 5400rpm drives from microcenter, around $70 per drive, or see what else is available new at under $100 per drive.

anon75264233 · May 16, 2022, 10:02pm

Be careful with drives under 8TB, as they can be SMR drives. You’ll want to ensure they are CMR.

BKCOM · May 16, 2022, 11:14pm

Thanks, I will check to make sure I don’t get SMR drives; I hadn’t thought of that.

I’m going to let the smartctl long test run on all the drives tonight and check them again tomorrow, in case that gives a clearer error. I don’t want to rule out the enclosure somehow being the problem.

BKCOM · May 18, 2022, 9:13pm

I ran the smartctl long test on all the drives, which took a bit longer than 12 hours. They all pass, but one showed that it didn’t pass the self-test every time (it still says it passed, though). I’ll post the smartctl output below:

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.109+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HUH728080ALE601
Serial Number:    VKGKWM1X
LU WWN Device Id: 5 000cca 254c821f8
Firmware Version: A4GL0003
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed May 18 16:46:21 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  101) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (   1) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   131   131   054    Pre-fail  Offline      -       114
  3 Spin_Up_Time            0x0007   253   253   024    Pre-fail  Always       -       91 (Average 271)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       64
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   094   094   000    Old_age   Always       -       43313
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       23
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
 45 Unknown_Attribute       0x0023   050   050   001    Pre-fail  Always       -       16711935
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       393
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       393
194 Temperature_Celsius     0x0002   153   153   000    Old_age   Always       -       39 (Min/Max 21/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
231 Temperature_Celsius     0x0032   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0012   100   100   000    Old_age   Always       -       3396923411791
242 Total_LBAs_Read         0x0012   100   100   000    Old_age   Always       -       4666842358406

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     43287         -
# 2  Extended offline    Interrupted (host reset)      90%     43261         -
# 3  Short offline       Completed: servo/seek failure 80%     43260         0
# 4  Short offline       Completed without error       00%     43260         -
# 5  Extended offline    Interrupted (host reset)      90%     43246         -
# 6  Extended offline    Completed: servo/seek failure 90%     43245         0
# 7  Short offline       Completed: servo/seek failure 80%     43242         0
# 8  Vendor (0x70)       Completed without error       00%     43170         -
# 9  Vendor (0x71)       Completed without error       00%     43170         -
3 of 3 failed self-tests are outdated by newer successful extended offline self-test # 1

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

…but this is the part that looks bad:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     43287         -
# 2  Extended offline    Interrupted (host reset)      90%     43261         -
# 3  Short offline       Completed: servo/seek failure 80%     43260         0
# 4  Short offline       Completed without error       00%     43260         -
# 5  Extended offline    Interrupted (host reset)      90%     43246         -
# 6  Extended offline    Completed: servo/seek failure 90%     43245         0
# 7  Short offline       Completed: servo/seek failure 80%     43242         0
# 8  Vendor (0x70)       Completed without error       00%     43170         -
# 9  Vendor (0x71)       Completed without error       00%     43170         -
3 of 3 failed self-tests are outdated by newer successful extended offline self-test # 1

A couple of the self-tests completed with a failure, but the drive is still passing the overall SMART health check somehow.
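
The overall-health line and the self-test log are separate things: the PASSED verdict reflects the drive’s own attribute thresholds, while the self-test log records individual test runs, which is how both can appear in the same report. A minimal sketch for pulling only the failed entries out of the log, assuming the smartctl 7.x text layout shown above; the function name is hypothetical.

```shell
# Hypothetical helper: read `smartctl -l selftest` (or `smartctl -a`) output
# on stdin and print only self-test entries that ended in a failure.
# Failed entries read "Completed: <something> failure"; successful ones
# read "Completed without error" (no colon), so the colon is the marker.
failed_selftests() {
    grep -E '^# *[0-9]+ .*Completed: '
}

# Usage (device name is an example):
#   smartctl -l selftest /dev/sdd | failed_selftests
```

Run against the log above, this would keep the three servo/seek failure entries and drop the interrupted and successful ones.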

Should I just return all the drives and try again with new drives?

anon75264233 · May 18, 2022, 9:43pm

These two are probably the most critical in the SMART data:

Raw_Read_Error_Rate
Reallocated_Sector_Ct

Both of those were 0 which would indicate a healthy drive.

Failing SMART tests aren’t the end of the world, but I have successfully returned drives that had corrupted firmware and would refuse to run SMART tests despite a burn-in test coming back OK. If the drive controller firmware is going bad, the risk is much higher even if the rest of the drive is fine. That might be what you have here.

If it’s only doing that for this one drive, then the rest in your pool should be fine, provided they pass a proper burn-in test.

Try just working with the other drives instead of this one that is giving you trouble.

BKCOM · May 18, 2022, 10:25pm

All 5 of the drives are free from errors, but I don’t really know what exactly caused the pool to cease functioning during the SMB transfer.

None of the drives made it through the badblocks -ws program.

I think I will return these and get 6TB drives like these:

~$140 per drive, not much more expensive than the used 8TB drives I bought. That way I will know that the drives work.