I recently installed TrueNAS Scale on a small PC and gave it five 8TB drives in raidz2. I think I am having SMART errors, but I want to be sure before I send the drives back.
Using an SMB share, I was transferring data from a Windows 10 PC to the NAS when the transfer stopped. The TrueNAS GUI indicated that the pool was unhealthy, but it says healthy again after restarting the NAS.
GUI says that the disks all failed a SMART test, but running smartctl on the disks shows they pass.
If you put all the drives through a proper burn-in with smartctl and badblocks, then we can say the drives are good.
The entry for bad sectors in the SMART log was 0 (good). That is the one you want to pay close attention to.
I would check dmesg for any I/O interrupt messages to see if the cable enclosure is being flaky. It seems you might be able to trigger it by putting it under a sustained load, like with your previous test.
Not sure if you have cables (judging by the case), but I’ve had many similar issues with SATA cables, both old and new.
Especially crosstalk. Make sure the cables have as much distance between each other as possible, or at least don’t touch.
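For anyone wanting to do the dmesg check above without scrolling through the whole log, here is a minimal sketch that filters kernel log text for lines that often accompany a flaky SATA link. The pattern list and the function name suspicious_lines are my own assumptions for illustration, not an exhaustive or official list.

```python
import re

# Substrings that commonly show up in Linux kernel logs when a SATA link or
# disk misbehaves. This list is an assumption for illustration only.
SUSPECT_PATTERNS = [
    r"hard resetting link",
    r"link is slow to respond",
    r"I/O error",
    r"exception Emask",
    r"failed command",
]
SUSPECT_RE = re.compile("|".join(SUSPECT_PATTERNS), re.IGNORECASE)


def suspicious_lines(log_text):
    """Return the log lines that match any of the suspect patterns."""
    return [line for line in log_text.splitlines() if SUSPECT_RE.search(line)]


# Usage idea: save the kernel log while the transfer is running, e.g.
#   dmesg > /tmp/dmesg.txt      (usually needs root)
# and then:
#   with open("/tmp/dmesg.txt") as f:
#       for line in suspicious_lines(f.read()):
#           print(line)
```

If the same ataN port keeps showing up in the matches while the pool is under load, that would point more toward the cable, port or enclosure than toward one particular disk.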
I’m not sure if I did a proper burn-in; all I did before making the pool was check SMART with smartctl -a.
After I got an issue, I tested with smartctl -t short, which passed, but smartctl -t long seems to hang at 90%. I’m giving it 5 minutes to run on 1 drive.
Here is the part of the smartctl -a output that I get after trying to run smartctl -t long:
Self-test execution status: ( 249) Self-test routine in progress...
90% of test remaining.
I ran badblocks -b 4096 -ws /dev/sdd. I disconnected the pool with the “disconnect/export” button in the GUI, so the pool is not showing anymore. I used tmux to run the command on every disk at once.
All instances of badblocks have now quit, reporting too many bad blocks to continue.
Now the drives have moved to new letters again somehow, maybe after the test completed? Anyway, I’ll try to run smartctl again, but if badblocks failed I’m assuming the drives are bad…
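On the drive letters moving around: Linux hands out sdX names in detection order, so they can change after a reset or reboot. The symlinks under /dev/disk/by-id are built from model and serial and stay put. A small sketch (the function name stable_names is hypothetical) that prints which stable name currently points at which sdX:

```python
import os


def stable_names(by_id_dir="/dev/disk/by-id"):
    """Map each stable by-id name to the kernel device it currently points at."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        path = os.path.join(by_id_dir, name)
        # Only symlinks are of interest; each one resolves to something like /dev/sdd.
        if os.path.islink(path):
            mapping[name] = os.path.realpath(path)
    return mapping


# Usage idea on the NAS itself:
#   for name, dev in stable_names().items():
#       print(dev, "<-", name)
```

Matching on the serial number this way makes it easier to be sure a later smartctl or badblocks run is hitting the same physical disk as before.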
EDIT:
smartctl -t short /dev/sdd works, but smartctl -t long /dev/sdd is stuck at 90%. The command says it should take 1 minute, but I’m not sure how long it’s supposed to take.
Sorry you got bad(-ish) drives. You can still use drives with bad sectors, since there is built-in spare space, but once the bad sectors start to outgrow that space, bad things happen. If you’re starting out with a fresh install, it would be a disservice to recommend using these drives.
Are they all that way? Do they still have a warranty?
A short test takes a few minutes, and a conveyance test might take half an hour. A long test will take several hours. For my 8 TiB drives it took about 2.5 hours.
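As a rough cross-check on long-test times: an extended test scans the whole surface, so it cannot finish faster than capacity divided by sustained read speed. The sketch below does that arithmetic for the capacity smartctl reports for the 8 TB drives in this thread. The 200 MB/s figure is purely an assumed average, and real drives slow down toward the inner tracks.

```python
def full_read_hours(capacity_bytes, avg_bytes_per_sec):
    """Lower-bound time to read every byte once, in hours."""
    return capacity_bytes / avg_bytes_per_sec / 3600


# User Capacity as reported by smartctl for the 8 TB drive in this thread.
CAPACITY = 8_001_563_222_016
# Assumed average sequential throughput (hypothetical, not measured).
ASSUMED_RATE = 200 * 10**6  # 200 MB/s

print(round(full_read_hours(CAPACITY, ASSUMED_RATE), 1))  # prints 11.1
```

So a "1 minute" polling time for the extended test is clearly not a real estimate, and a run in the range of half a day on an 8 TB disk would not be surprising under this assumption.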
Yes, they’re all warrantied by the reseller, so we’ll see how that goes. I’m going to send them back if possible and get new 4TB 5400rpm drives from Micro Center, around $70 per drive, or see what else is available new at under $100 per drive.
Thanks, I will check to make sure I don’t get SMR drives. I hadn’t thought of that.
I’m going to let the smartctl long test run on all the drives tonight and check them again tomorrow, in case that gives a clearer error. I don’t want to rule out the enclosure somehow being the problem.
I ran the smartctl long test on all the drives, which took a bit longer than 12 hours. They all pass, but one showed that it didn’t pass the self-test every time (it still says it passed, though). I’ll post the smartctl output below:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.109+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: HUH728080ALE601
Serial Number: VKGKWM1X
LU WWN Device Id: 5 000cca 254c821f8
Firmware Version: A4GL0003
User Capacity: 8,001,563,222,016 bytes [8.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed May 18 16:46:21 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 101) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 1) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 131 131 054 Pre-fail Offline - 114
3 Spin_Up_Time 0x0007 253 253 024 Pre-fail Always - 91 (Average 271)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 64
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 128 128 020 Pre-fail Offline - 18
9 Power_On_Hours 0x0012 094 094 000 Old_age Always - 43313
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 23
22 Unknown_Attribute 0x0023 100 100 025 Pre-fail Always - 100
45 Unknown_Attribute 0x0023 050 050 001 Pre-fail Always - 16711935
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 393
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 393
194 Temperature_Celsius 0x0002 153 153 000 Old_age Always - 39 (Min/Max 21/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
231 Temperature_Celsius 0x0032 100 100 000 Old_age Always - 0
241 Total_LBAs_Written 0x0012 100 100 000 Old_age Always - 3396923411791
242 Total_LBAs_Read 0x0012 100 100 000 Old_age Always - 4666842358406
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 43287 -
# 2 Extended offline Interrupted (host reset) 90% 43261 -
# 3 Short offline Completed: servo/seek failure 80% 43260 0
# 4 Short offline Completed without error 00% 43260 -
# 5 Extended offline Interrupted (host reset) 90% 43246 -
# 6 Extended offline Completed: servo/seek failure 90% 43245 0
# 7 Short offline Completed: servo/seek failure 80% 43242 0
# 8 Vendor (0x70) Completed without error 00% 43170 -
# 9 Vendor (0x71) Completed without error 00% 43170 -
3 of 3 failed self-tests are outdated by newer successful extended offline self-test # 1
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
…but this is the part that looks bad:
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 43287 -
# 2 Extended offline Interrupted (host reset) 90% 43261 -
# 3 Short offline Completed: servo/seek failure 80% 43260 0
# 4 Short offline Completed without error 00% 43260 -
# 5 Extended offline Interrupted (host reset) 90% 43246 -
# 6 Extended offline Completed: servo/seek failure 90% 43245 0
# 7 Short offline Completed: servo/seek failure 80% 43242 0
# 8 Vendor (0x70) Completed without error 00% 43170 -
# 9 Vendor (0x71) Completed without error 00% 43170 -
3 of 3 failed self-tests are outdated by newer successful extended offline self-test # 1
A couple of the self-tests completed with a failure, but the drive is somehow still passing the overall SMART health check.
Should I just return all the drives and try again with new drives?
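On why it still says PASSED: the overall-health line is the drive's own verdict based on attribute thresholds, while the self-test log is a separate history, and smartctl notes that the three failures are "outdated" by the newer clean extended test. Here is a minimal sketch that parses the log lines posted above and separates failures that are older than the last good extended test from ones that are newer. The regex and function names are my own, and the real smartctl logic may differ in details.

```python
import re

# One self-test log row: "# N <description and status> NN% <hours> <LBA or ->"
ENTRY_RE = re.compile(r"^#\s*(\d+)\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")

# Rows copied from the smartctl output in this thread (newest first).
LOG = """\
# 1 Extended offline Completed without error 00% 43287 -
# 2 Extended offline Interrupted (host reset) 90% 43261 -
# 3 Short offline Completed: servo/seek failure 80% 43260 0
# 4 Short offline Completed without error 00% 43260 -
# 5 Extended offline Interrupted (host reset) 90% 43246 -
# 6 Extended offline Completed: servo/seek failure 90% 43245 0
# 7 Short offline Completed: servo/seek failure 80% 43242 0
# 8 Vendor (0x70) Completed without error 00% 43170 -
# 9 Vendor (0x71) Completed without error 00% 43170 -
"""


def parse_selftest_log(text):
    """Turn self-test log rows into a list of dicts, in the order given."""
    entries = []
    for line in text.splitlines():
        m = ENTRY_RE.match(line.strip())
        if not m:
            continue
        num, desc_status, remaining, hours, _lba = m.groups()
        entries.append({
            "num": int(num),
            "text": desc_status,
            "remaining_pct": int(remaining),
            "hours": int(hours),
            # Failure statuses look like "Completed: <something> failure".
            "failed": "Completed:" in desc_status,
            "passed": "Completed without error" in desc_status,
        })
    return entries


def failures_newer_than_good_extended(entries):
    """Failures listed before (newer than) the first passing extended test."""
    newer = []
    for e in entries:
        if e["passed"] and e["text"].startswith("Extended offline"):
            break
        if e["failed"]:
            newer.append(e)
    return newer


entries = parse_selftest_log(LOG)
print(sum(e["failed"] for e in entries), "failed self-tests in the log")
print(len(failures_newer_than_good_extended(entries)), "of them newer than the last good extended test")
```

Even with the failures counted as outdated, three servo/seek failures within about 20 power-on hours are still worth mentioning to the reseller.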
These two are probably the most critical in SMART data:
Raw_Read_Error_Rate
Reallocated_Sector_Ct
Both of those were 0 which would indicate a healthy drive.
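If you want to check those values on all five drives without eyeballing each table, here is a small sketch that parses the attribute rows from smartctl -a output and flags any watched attribute with a nonzero raw value. Besides the two named above, I have added 197, 198 and 199 (pending sectors, offline uncorrectable, and CRC errors, which tend to relate to cabling) as my own assumption about what is worth watching. Raw values are vendor-specific, so treat a nonzero hit as a prompt to look closer, not a verdict.

```python
# A few attribute rows copied from the smartctl output in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
3 Spin_Up_Time 0x0007 253 253 024 Pre-fail Always - 91 (Average 271)
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
"""

# Attribute IDs to watch; the choice of 197/198/199 is an assumption.
WATCH = (1, 5, 197, 198, 199)


def parse_attributes(text):
    """Parse SMART attribute rows into {id: {...}}; other lines are skipped."""
    attrs = {}
    for line in text.splitlines():
        # Ten columns; the last (RAW_VALUE) may itself contain spaces.
        parts = line.split(None, 9)
        if len(parts) == 10 and parts[0].isdigit() and parts[2].startswith("0x"):
            attrs[int(parts[0])] = {
                "name": parts[1],
                "value": int(parts[3]),
                "worst": int(parts[4]),
                "thresh": int(parts[5]),
                "raw": parts[9].strip(),
            }
    return attrs


def nonzero_watched(attrs, watch=WATCH):
    """Return the watched attributes whose raw value starts with a nonzero number."""
    flagged = {}
    for attr_id in watch:
        a = attrs.get(attr_id)
        if a is None:
            continue
        first = a["raw"].split()[0]
        if first.isdigit() and int(first) != 0:
            flagged[attr_id] = a
    return flagged


print(nonzero_watched(parse_attributes(SAMPLE)))  # prints {}
```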
SMART tests failing aren’t the end of the world, but I have successfully returned drives that had corrupted firmware and would refuse to run SMART tests despite a burn-in test coming back OK. If the drive controller firmware is going bad, the risk is much higher even if the rest of the drive is fine. That might be what you have here.
If it’s only doing that for this one drive, then the rest in your pool should be fine, provided they pass a proper burn-in test.
Try just working with the other drives instead of this one that is giving you trouble.