FreeNAS: Wonder if I've got a badly seated SATA cable?

Hi all,

Got a single error report from my FreeNAS 11-U2 box:

freenas.local kernel log messages:
(ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: ATA Status Error
(ada0:ahcich0:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 04 (ABRT )
(ada0:ahcich0:0:0:0): RES: 51 04 e0 01 8c 40 b0 00 00 00 00
(ada0:ahcich0:0:0:0): Retrying command

I’ve traced this to SATA port 0 on the Asus X99-E WS/USB3.1 board (scbus0 should be SATA0):

ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <WDC WD40EFRX-68N32N0 82.00A82> ACS-3 ATA SATA 3.x device
ada0: Serial Number WD-xxx
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 3815447MB (7814037168 512 byte sectors)
ada0: quirks=0x1<4K>

In any case, this doesn’t look like an outright failure; googling around, I’ve seen logs where the command retries a couple of times before exhausting its retries with a ‘retry failed’ message, and that never happened here.

Also, a failed flush-cache operation may not be that dangerous, say, compared to a failed physical write to disk?

Either way, there have been no warnings from the ZFS pool, and the SMART data looks good too. I run an extended SMART test once a week and a short test daily; the cron job’s output is mailed to me twice daily.
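For reference, my schedule as an old-school smartd.conf entry would look roughly like this (just a sketch; the device and mail target are placeholders, and FreeNAS normally configures all of this through the GUI):

    # short self-test daily at 02:00, extended self-test Saturdays at 03:00
    # -a monitors all attributes, -m mails warnings
    /dev/ada0 -a -o on -S on -s (S/../.././02|L/../../6/03) -m root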

Errors like this do happen on occasion. If this is the only time it’s happened, I’d just monitor the drive and see if it errors again.

Since the SMART data looks good, I think you’re good to go.

If you think it’s a bad cable, try replacing it. We don’t often get bad cables in my DC, but it’s definitely possible.
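One quick check before you go reseating things: cable and connector problems usually show up as interface CRC errors rather than media errors, so something like this (device name assumed) will tell you whether the link itself is flaky:

    smartctl -A /dev/ada0 | grep -i crc
    # a climbing UDMA_CRC_Error_Count raw value points at the cable or port,
    # not the platters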

How old is the system?


I had a very similar error on one of the drives in my own NAS. There was a 50/50 chance that I’d get such an error message the night after booting the NAS (it has this nasty habit of only mailing me at 3AM local time).

Unplugging and re-plugging the cables did the trick for me; I haven’t received a single warning since.


It’s only been a month, and this is the first time I’m seeing it. That said, my setup is terribly janky, so I just don’t touch the box.

The HDDs are held together by 3D-printed plates LOL

Well, I just got this report:

########## SMART status report for ada0 drive (Western Digital Red: WD-WCC7K7xxx) ##########
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   185   185   021    Pre-fail  Always       -       5741
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       10
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1164
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       2
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       8
194 Temperature_Celsius     0x0022   110   108   000    Old_age   Always       -       40
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Completed: read failure       60%      1163         2961965536

My first SMART error, but the array is still healthy, etc… Hmmmm…
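I’ll re-run the long test by hand and scrub the pool to see whether that LBA actually holds data (a sketch; ‘tank’ stands in for my real pool name):

    smartctl -t long /dev/ada0   # kick off another extended self-test
    zpool scrub tank             # make ZFS read and verify every allocated block
    zpool status -v tank         # any unrecoverable files would be listed here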

Update: my short SMART test ran about 5 hours ago, and it seems the short test found no issues?

########## SMART status report for ada0 drive (Western Digital Red: WD-WCC7K7xxxx) ##########
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   185   185   021    Pre-fail  Always       -       5741
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       10
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1188
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       2
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       8
194 Temperature_Celsius     0x0022   111   108   000    Old_age   Always       -       39
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%      1184         -

Looks like your drives are still new. Only 48-ish days of power-on time. I’d look at getting that one replaced under warranty.


Thanks @SgtAwesomesauce - I’ll wait a bit longer and see if this drive gives further trouble, but between the earlier SATA error and this SMART report, I have a bad feeling lol.

That’s the thing. Something about it just rubs me the wrong way, so personally, I’d absolutely make sure you get it replaced within the warranty period.


You’re on to something; I’m now seeing my first attribute error:

200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%      2174         -

So here’s a question: this zpool runs on 8x WD Red 4TB NAS drives, and I’ve got 3x WD Red Pro 4TB drives doing nothing. Could I swap in a Red Pro? From what I’ve read, it’s recommended to use the same drive size across the vdev, but what about mixing speeds? The Reds are 5400 rpm and the Red Pros 7200 rpm; would that be an issue?

It will work okay.


You won’t notice a difference.

There are also minor firmware differences in how they handle errors, if I remember correctly, but again, that’s not something you’re going to notice.


Right, that was largely what I was pondering… Since I migrated off my 5-bay Synology DS1518+ (currently sitting unused), I have 3x Red Pro 4TB left over. I can keep one as a disaster-replacement spare and add the other two to the array.

It’s going to be a tight fit inside the Corsair Air 740, which already has 8 drives held together by 3D-printed bits.

Interestingly, taking my RAIDz2 vdev from 8 to 10 drives takes the usable space from 19.97TB (68.64%) to 26.7TB (73.75%), which is very nice. I used this fantastic calculator, as it was great for finding the sweet spot between usable space and projected cost.

However, at RAIDz3, 8 drives only gave 16TB (~55%), but with 10 drives I can hit 23TB (64.35%). Since I’m after redundancy, I’ll go RAIDz3.
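Back-of-envelope, those numbers check out before the calculator’s finer corrections:

    raw:     10 drives x 4 TB         = 40 TB
    parity:  RAIDz3 reserves 3 drives = 12 TB
    data:     7 drives x 4 TB         = 28 TB (~25.5 TiB)

Padding, metadata, and the pool’s built-in reservation then eat a chunk of that ~25.5 TiB, which is roughly how the calculator lands at 23TB.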

Any downsides to RAIDz3 vs. z2 apart from less usable space? Triple parity does what the name says, so I’m just reading the label right off the ‘tin’, as it were :grin:

EDIT - I got so excited that 10 drives would let me do RAIDz3 with 64% usable space, but I forgot this means migrating my data off, destroying the vdev, recreating a fresh RAIDz3 pool across the 10 drives, and then migrating all the data back.

Ugh, no, can’t do that. BUT I can replace the failing drive for now and RMA the sucker.
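The swap itself should just be a resilver; something like the following, where the pool and device names are stand-ins (the FreeNAS Volume Status screen does the same job through the GUI):

    # new drive installed as ada8; swap out the failing ada0
    zpool replace tank ada0 ada8
    zpool status tank   # watch the resilver progress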


You’re using 4TB drives, so I’d recommend raidz2 as the baseline, and if you’re paranoid, go with raidz3. It’s going to have slightly more CPU overhead when calculating parity, and it will also do more physical reads/writes per user-visible read/write. That’s about all I can say about z3 vs z2.

If you want to increase space, you’ll want to replace the drives with larger ones. ZFS will let you upgrade a vdev in place with larger disks, one drive at a time.
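The in-place upgrade is just a rolling replace; a sketch with assumed pool/device names, done one disk at a time while letting each resilver finish:

    zpool set autoexpand=on tank   # let the vdev grow once every disk is bigger
    zpool replace tank ada0 ada8   # repeat for each drive in turn
    zpool list tank                # the extra space appears after the last resilver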


Right, so given I’ve got 8x 4TB drives, I could replace each one, resilver, and hypothetically move the whole vdev to 6TB.

However, ZFS doesn’t support adding further drives to an existing vdev, right? You can, however, add another vdev to an existing pool…?

Given I have 8x 4TB WD Red in a single vdev, what can I do with 2x 4TB WD Red Pros? Buy another 6x and create a second vdev?

The reason I ask is that I got the 80% warning today, having reached 81% usage… eek.


I’ve heard that when you add a second vdev it has to be the same size; however, I’ve tested with virtual disks and there’s no issue adding vdevs of different sizes. I imagine there’s a performance hit due to the way data is striped across vdevs, but you can add a smaller (or larger) vdev to a pool.
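For what it’s worth, the add itself is a one-liner; a sketch assuming a pool named tank and eight fresh disks:

    # stripe a second raidz2 vdev into the existing pool
    zpool add tank raidz2 da0 da1 da2 da3 da4 da5 da6 da7

ZFS then spreads new writes across both vdevs in proportion to their free space, which is where the uneven-size performance question comes from.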


Right, exactly. What I’ve also read online is to make sure the second vdev is of the same size/capacity, i.e. another 8x 4TB.

-OR-, I assume, go through the existing vdev, replacing and resilvering each drive (say to 6TB), and grow the entire vdev to 8x 6TB drives. That would be an expensive affair given the cost of 6TB drives versus buying 5x 4TB Reds (the cheapest at $137 a pop) and coupling them with the 3x WD Red Pro 4TB drives I have left from my earlier migration.

Those 5 drives would set me back $685.25, which is the cheapest option right now. The other issue is that I’ve already used 8 of the 12 SATA ports on the X99-E WS/USB3.1 board, so I’d have to use the LSI card I have to connect the additional vdev of 8 drives. You’ll notice the janky 3D-printed HDD mounting solution I’ve used in the Corsair Air 740 case.

[attached photo: the 3D-printed drive mounts inside the Corsair Air 740]

I was hoping to do this later in 2018, once I have a chance to procure a decent 24-bay rack chassis with an SGPIO backplane - we’ll see how it goes.


Everything you said is correct.

You can create a second vdev of a different size, but you might see inconsistent speeds because of it.

It’s recommended to always have the same number of physical devices in a vdev.

It’s just a best practice, rather than any sort of hard limit.
