Serious trouble with 9500-16i and Adaptec SAS Expander

A few weeks ago I combined a 9500-16i with two Adaptec AEC-82885T 36-port SAS expanders.
I took the AEC-82885T because it doesn’t require a PCIe slot. In retrospect, a SAS expander from Broadcom’s SAS3xNN series would have been a better choice, maybe…
The plan was to use the 9500 for two PM9A3 U.2 drives and 20 HDDs (TOSHIBA MG08ACA16TE) via the SAS expanders, 12 internal and 8 external.
I connected all drives to test the combination, but waited to migrate the metadata to the two U.2 drives, so the U.2 drives had no load; only the disks were configured, as production and backup pools.
I had it running for 3 weeks with TrueNAS SCALE. Everything was normal, the performance of the production pool was good, and the backup to my DIY SAS enclosure also worked.

The trouble started with the transfer of the metadata special vdev from the previous NVMe mirror to the two U.2 drives.
As soon as there was load on the pool, one or more disks failed, and almost all of them now show UDMA CRC errors.
I first updated the HBA firmware, then improved the cooling of the HBA/expander and replaced the cables, but nothing really helped. I did have the impression that the better cooling helped a bit, because the errors appeared much later.
The cooling is now as good as it can be: I have a 40mm fan directly on the HBA and a 22" fan in front of it. The HBA still shows 50 degrees Celsius, which seems high to me, because only the two U.2 drives are connected to it; the other disks are now connected to a 9300 with no SAS expander in between.

Another problem is the PCIe link speed of the 9500: lspci shows “LnkSta: Speed 16GT/s, Width x8” but storcli shows “Device Interface = PCIe-8GT/s”. Which one is right?
The PCIe slot is currently set to Auto. As soon as I can shut down the server again I’ll try to force PCIe Gen4, but a second backup is running at the moment, so that has to wait.
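
The two readings can be compared mechanically. The following is a minimal Python sketch, not a Broadcom tool: it parses the `LnkSta:` line from lspci output and the `Device Interface` line from storcli output, then estimates payload bandwidth from the line encoding. The function names and the sample strings are assumptions for illustration; the sample values are the ones quoted above.

```python
import re

# Payload efficiency of the PCIe line encoding per transfer rate (GT/s).
# 2.5 and 5 GT/s use 8b/10b; 8 and 16 GT/s use 128b/130b.
ENCODING = {2.5: 8 / 10, 5.0: 8 / 10, 8.0: 128 / 130, 16.0: 128 / 130}


def parse_lnksta(lspci_text):
    """Return (speed_gt_s, width) from an lspci 'LnkSta:' line, or None."""
    m = re.search(r"LnkSta:\s*Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)", lspci_text)
    return (float(m.group(1)), int(m.group(2))) if m else None


def parse_storcli_interface(storcli_text):
    """Return the speed in GT/s from storcli's 'Device Interface' line, or None."""
    m = re.search(r"Device Interface\s*=\s*PCIe-([\d.]+)GT/s", storcli_text)
    return float(m.group(1)) if m else None


def bandwidth_gb_s(speed_gt_s, width):
    """Approximate one-direction payload bandwidth in GB/s, before protocol overhead."""
    return speed_gt_s * ENCODING[speed_gt_s] * width / 8


lspci_sample = "LnkSta: Speed 16GT/s, Width x8"
storcli_sample = "Device Interface = PCIe-8GT/s"

link = parse_lnksta(lspci_sample)
reported = parse_storcli_interface(storcli_sample)
print("lspci:", link, "storcli:", reported)
if link and reported and link[0] != reported:
    print("mismatch: the two tools disagree about the link speed")
print("x8 at 16 GT/s:", round(bandwidth_gb_s(16.0, 8), 2), "GB/s")
print("x8 at 8 GT/s:", round(bandwidth_gb_s(8.0, 8), 2), "GB/s")
```

The sketch only detects the disagreement; it does not settle which tool is right.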

I think the idea with the SAS enclosure is nice, but unfortunately it is too much trouble.
I would like to keep the 9500, if I can even get a suitable SlimSAS SFF-8654 8i Straight to 8x SATA cable, because no dealer in Germany offers such cables and I don’t want to test the eBay/China cables after the drama of the last few days.

What discourages me is that Broadcom doesn’t have a SATA breakout cable for the 9500 in its cable guide, although that would be optimal for my case.

I could get that, but my case isn’t ideal for the angled plugs

Does anyone know a dealer who delivers suitable, high-quality SFF-8654 8i Straight to 8x SATA cables to Germany?
Or has someone already gone through this and come to the conclusion that the 9500 needs a backplane?

I just found a firmware from 2021 for the AEC-82885T, maybe…

#./storcli64 /c0 show all
CLI Version = 007.2807.0000.0000 Dec 22, 2023
Operating system = Linux 6.5.0-15-generic
Controller = 0
Status = Success
Description = None


Basics :
======
Controller = 0
Adapter Type =   SAS3816(A0)
Model = HBA 9500-16i
Serial Number = SKB5091830
Current System Date/time = 02/04/2024 23:43:24
Concurrent commands supported = 4352
SAS Address =  500062b20918c500
PCI Address = 00:42:00:00


Version :
=======
Firmware Package Build = 29.00.00.00
Firmware Version = 29.00.00.00
Bios Version = 09.57.00.00_29.00.00.00
NVDATA Version = 29.02.00.11
PSOC FW Version = 0x0064
PSOC Part Number = 14790
Driver Name = mpt3sas
Driver Version = 43.100.00.00


PCI Version :
===========
Vendor Id = 0x1000
Device Id = 0xE6
SubVendor Id = 0x1000
SubDevice Id = 0x4050
Host Interface = PCIE
Device Interface = PCIe-8GT/s
Bus Number = 66
Device Number = 0
Function Number = 0
Domain ID = 0


Pending Images in Flash :
=======================
Image name = No pending images


Status :
======
Controller Status = OK
Memory Correctable Errors = 0
Memory Uncorrectable Errors = 0
Bios was not detected during boot = No
Controller has booted into safe mode = No
Controller has booted into certificate provision mode = No
Package Stamp Mismatch = No


Supported Adapter Operations :
============================
Alarm Control = No
Cluster Support = No
Self Diagnostic = No
Deny SCSI Passthrough = No
Deny SMP Passthrough = No
Deny STP Passthrough = No
Support more than 8 Phys = Yes
FW and Event Time in GMT = No
Support Enclosure Enumeration = Yes
Support Allowed Operations = Yes
Support Multipath = Yes
Support Security = Yes
Support Config Page Model = No
Support the OCE without adding drives = No
support EKM = No
Snapshot Enabled = No
Support PFK = No
Support PI = No
Support Shield State = No
Support Set Link Speed = No
Support JBOD = No
Disable Online PFK Change = No
Real Time Scheduler = No
Support Reset Now = No
Support Emulated Drives = No
Support Secure Boot = Yes
Support Platform Security = No
Support Package Stamp Mismatch Reporting = Yes
Support PSOC Update = Yes
Support PSOC Part Information = Yes
Support PSOC Version Information = Yes


HwCfg :
=====
ChipRevision =  A0
BatteryFRU = N/A
Front End Port Count = 1
Backend Port Count = 21
Serial Debugger = Absent
NVRAM Size = 0KB
Flash Size = 16MB
On Board Memory Size = 0MB
On Board Expander = Absent
Temperature Sensor for ROC = Present
Temperature Sensor for Controller = Absent
Current Size of CacheCade (GB) = 0
Current Size of FW Cache (MB) = 0
ROC temperature(Degree Celsius) = 47


Policies :
========

Policies Table :
==============

------------------------------------------------
Policy                          Current Default 
------------------------------------------------
Predictive Fail Poll Interval   0 sec           
Interrupt Throttle Active Count 0               
Interrupt Throttle Completion   0 us            
Rebuild Rate                    0 %     30%     
PR Rate                         0 %     30%     
BGI Rate                        0 %     30%     
Check Consistency Rate          0 %     30%     
Reconstruction Rate             0 %     30%     
Cache Flush Interval            0s              
------------------------------------------------

Flush Time(Default) = 4s
Drive Coercion Mode = none
Auto Rebuild = Off
Battery Warning = Off
ECC Bucket Size = 0
ECC Bucket Leak Rate (hrs) = 0
Restore HotSpare on Insertion = Off
Expose Enclosure Devices = Off
Maintain PD Fail History = Off
Reorder Host Requests = On
Auto detect BackPlane = SGPIO/i2c SEP
Load Balance Mode = None
Security Key Assigned = Off
Disable Online Controller Reset = Off
Use drive activity for locate = Off


Boot :
====
Max Drives to Spinup at One Time = 2
Maximum number of direct attached drives to spin up in 1 min = 60
Delay Among Spinup Groups (sec) = 2
Allow Boot with Preserved Cache = On


Defaults :
========
Phy Polarity = 0
Phy PolaritySplit = 0
Cached IO = Off
Default spin down time (mins) = 0
Coercion Mode = None
ZCR Config = Unknown
Max Chained Enclosures = 0
Direct PD Mapping = No
Restore Hot Spare on Insertion = No
Expose Enclosure Devices = No
Maintain PD Fail History = No
Zero Based Enclosure Enumeration = No
Disable Puncturing = No
Un-Certified Hard Disk Drives = Block
SMART Mode = Mode 6
Enable LED Header = No
LED Show Drive Activity = No
Dirty LED Shows Drive Activity = No
EnableCrashDump = No
Disable Online Controller Reset = No
Treat Single span R1E as R10 = No
Power Saving option = Enable
TTY Log In Flash = No
Auto Enhanced Import = No
Enable Shield State = No
Time taken to detect CME = 60 sec


Capabilities :
============
Supported Drives = SAS, SATA, NVMe
Enable JBOD = Yes
Max Parallel Commands = 4352
Max SGE Count = 128
Max Data Transfer Size = 32 sectors
Max Strips PerIO = 0
Max Configurable CacheCade Size = 0
Min Strip Size = 512Bytes
Max Strip Size = 512Bytes

Scheduled Tasks = NA

Secure Boot :
===========
Secure Boot Enabled = Yes
Controller in Soft Secure Mode = No
Controller in Hard Secure Mode = Yes
Key Update Pending = No
Remaining Secure Boot Key Slots = 7


Security Protocol properties :
============================
Security Protocol = None


Enclosure Information :
=====================

------------------------------------------------------------------
EID State Slots PD PS Fans TSs Alms SIM ProdID     VendorSpecific 
------------------------------------------------------------------
  0 OK       10  2  0    0   0    0   0 VirtualSES                
------------------------------------------------------------------


Physical Device Information :
===========================

Drive /c0/e0/s4 :
===============

-----------------------------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model                                    Sp 
-----------------------------------------------------------------------------------------------
0:4       1 JBOD  -  3.492 TB NVMe SSD -   -  512B SAMSUNG MZQL23T8HCLS-00B7C               -  
-----------------------------------------------------------------------------------------------

EID-Enclosure Device ID|Slt-Slot No|DID-Device ID|DG-DriveGroup
UGood-Unconfigured Good|UBad-Unconfigured Bad|Intf-Interface
Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
SeSz-Sector Size|Sp-Spun|U-Up|D-Down|T-Transition


Drive /c0/e0/s4 - Detailed Information :
======================================

Drive /c0/e0/s4 State :
=====================
Shield Counter = N/A
Media Error Count = N/A
Other Error Count = N/A
Predictive Failure Count = N/A
S.M.A.R.T alert flagged by drive = N/A


Drive /c0/e0/s4 Device attributes :
=================================
Manufacturer Id = NVMe    
Model Number = SAMSUNG MZQL23T8HCLS-00B7C              
NAND Vendor = NA
SN = S63UNE0R409085      
WWN = 3D5C80C7D8B7C8E6
Firmware Revision = GDC51C2Q
Raw size = 3.492 TB [0x1bf1f72af Sectors]
Coerced size = 3.492 TB [0x1bf1f72af Sectors]
Non Coerced size = 3.492 TB [0x1bf1f72af Sectors]
Device Speed = 16.0GT/s
Link Speed = 16.0GT/s
Sector Size = 512B
Config ID = NA
Number of Blocks = 7501476527
Connector Name = C0.0 x4


Drive /c0/e0/s4 Policies/Settings :
=================================
Enclosure position = 0
Connected Port Number = 0(path0) 
Sequence Number = 0
Commissioned Spare = No
Emergency Spare = No
Last Predictive Failure Event Sequence Number = N/A
Successful diagnostics completion on = N/A
SED Capable = N/A
SED Enabled = N/A
Secured = N/A
Needs EKM Attention = N/A
PI Eligible = N/A
Certified = N/A
Wide Port Capable = N/A
Multipath = No

Port Information :
================

------------------------------------------
Port Status Link Speed SAS address        
------------------------------------------
   0 Active 16.0GT/s   0xe6c8b7d8c7805c4d 
------------------------------------------


Inquiry Data = 
00 00 07 12 45 00 00 02 4e 56 4d 65 20 20 20 20 
53 41 4d 53 55 4e 47 20 4d 5a 51 4c 32 33 54 38 
31 43 32 51 53 36 33 55 4e 45 30 52 34 30 39 30 
38 35 20 20 20 20 20 20 00 00 00 c0 05 c0 06 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 



Drive /c0/e0/s6 :
===============

-----------------------------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model                                    Sp 
-----------------------------------------------------------------------------------------------
0:6       2 JBOD  -  3.492 TB NVMe SSD -   -  512B SAMSUNG MZQL23T8HCLS-00B7C               -  
-----------------------------------------------------------------------------------------------

EID-Enclosure Device ID|Slt-Slot No|DID-Device ID|DG-DriveGroup
UGood-Unconfigured Good|UBad-Unconfigured Bad|Intf-Interface
Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
SeSz-Sector Size|Sp-Spun|U-Up|D-Down|T-Transition


Drive /c0/e0/s6 - Detailed Information :
======================================

Drive /c0/e0/s6 State :
=====================
Shield Counter = N/A
Media Error Count = N/A
Other Error Count = N/A
Predictive Failure Count = N/A
S.M.A.R.T alert flagged by drive = N/A


Drive /c0/e0/s6 Device attributes :
=================================
Manufacturer Id = NVMe    
Model Number = SAMSUNG MZQL23T8HCLS-00B7C              
NAND Vendor = NA
SN = S63UNE0R409724      
WWN = 3D5C7FC1DFB7C8E6
Firmware Revision = GDC51C2Q
Raw size = 3.492 TB [0x1bf1f72af Sectors]
Coerced size = 3.492 TB [0x1bf1f72af Sectors]
Non Coerced size = 3.492 TB [0x1bf1f72af Sectors]
Device Speed = 16.0GT/s
Link Speed = 16.0GT/s
Sector Size = 512B
Config ID = NA
Number of Blocks = 7501476527
Connector Name = C0.1 x4


Drive /c0/e0/s6 Policies/Settings :
=================================
Enclosure position = 0
Connected Port Number = 1(path0) 
Sequence Number = 0
Commissioned Spare = No
Emergency Spare = No
Last Predictive Failure Event Sequence Number = N/A
Successful diagnostics completion on = N/A
SED Capable = N/A
SED Enabled = N/A
Secured = N/A
Needs EKM Attention = N/A
PI Eligible = N/A
Certified = N/A
Wide Port Capable = N/A
Multipath = No

Port Information :
================

------------------------------------------
Port Status Link Speed SAS address        
------------------------------------------
   0 Active 16.0GT/s   0xe6c8b7dfc17f5c4d 
------------------------------------------


Inquiry Data = 
00 00 07 12 45 00 00 02 4e 56 4d 65 20 20 20 20 
53 41 4d 53 55 4e 47 20 4d 5a 51 4c 32 33 54 38 
31 43 32 51 53 36 33 55 4e 45 30 52 34 30 39 37 
32 34 20 20 20 20 20 20 00 00 00 c0 05 c0 06 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
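
For tracking values such as the ROC temperature over time, the `key = value` lines in the storcli output above are easy to parse. A minimal Python sketch follows, assuming the text format shown in this post; `parse_storcli` is a hypothetical helper, not part of storcli, and duplicate keys (for example in the per-drive sections) simply keep the last value seen.

```python
def parse_storcli(text):
    """Collect 'key = value' lines from storcli text output into a dict.

    Table rows, separators and hex dumps contain no ' = ' and are skipped.
    Later duplicates overwrite earlier ones.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Sample lines copied from the output above.
sample = """\
Model = HBA 9500-16i
Firmware Version = 29.00.00.00
Device Interface = PCIe-8GT/s
ROC temperature(Degree Celsius) = 47
"""

info = parse_storcli(sample)
temp_c = int(info["ROC temperature(Degree Celsius)"])
print(info["Model"], "firmware", info["Firmware Version"], "ROC", temp_c, "C")
```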

Does this setup run clean? No UDMA CRC errors?

Perhaps you’ve discovered a bug when mixing SAS/SATA and NVMe drives on the same adapter without UBM?

Broadcom 05-60006-00 is what you want if you can forgo the right angle. The SFF-8482 connector is much more robust than the 7-pin SAS/SATA connector.

Just curious here: I know the HBA needs to support NVMe to run those drives, but I have never thought about the expander requirements. Does the expander also need to be an NVMe-compatible one? Or will any SAS expander work because it is just forwarding data and not really processing it?
That was my first thought anyway when you said things started acting really weird as soon as you started using the NVMe drives with the setup.

@EniGmA1987

  • SAS expanders can only handle SAS and SATA drives, since they only speak the SAS protocols (SSP, SMP, and STP for tunnelled SATA) and not NVMe.

  • NVMe is a different protocol that both host and target electronics have to support, even if you might be able to use the same cables for SAS/SATA/NVMe when hooking them up to even a dumb backplane.

  • The latest trend is “Tri-Mode” HBAs that can talk NVMe, SAS and SATA (beginning with the Broadcom HBA 9400, for example). These Tri-Mode HBAs can be connected to U.3 backplanes, and these backplanes then function as a mixture of a SAS expander and a PCIe switch, depending on what kind of drive is inserted into the backplane.

  • I dislike these solutions in their current form since the HBAs abstract connected drives to generic block devices, similar to RAID controllers that are just using a single connected drive as a volume, which makes stuff like drive diagnostics harder and reduces performance.

  • Most “tips” on the Internet for DIY servers consist of getting an HBA, since connected drives are then handled as if they were directly connected to the motherboard and software like ZFS can properly manage the drives without any RAID controller logic interfering; with recent HBA generations this is unfortunately no longer valid advice.

@Janos

  • Your initial posting sounds like a cable quality issue with the U.2 drives: fine with no load, but actual load produces PCIe bus errors, which in turn can, after reaching a certain threshold, lead to data corruption or system instabilities (in this case the HBA shits the bed, which affects everything else connected to the HBA).

  • Sadly the darn Broadcom HBA doesn’t properly report errors on the NVMe drive connection interface to the operating system.

  • How exactly are your U.2 SSDs hooked up?

Yes, since the Adaptec is out of the equation I no longer have any UDMA errors. That doesn’t really prove that the expander is the problem, though, since the disks are now attached to the 9300.

I was thinking about getting a case with a backplane, but the problem with backplanes is the reduced airflow.
The Fantec SRC-4240X07-12G would have been a candidate, but I have the server under my desk and the noise is an issue, so I went with the Raijintek Zofos Ultra.

I have no PCIe errors; dmesg shows only disk I/O errors and smartctl shows the output below. What is strange is the line “the device was in standby mode”. The errors can easily be reproduced with a zfs scrub.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       7955
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       567
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       4329
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       176
 23 Helium_Condition_Lower  0x0023   100   100   075    Pre-fail  Always       -       0
 24 Helium_Condition_Upper  0x0023   100   100   075    Pre-fail  Always       -       0
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       4
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       132
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1180
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       23 (Min/Max 9/45)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       4
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       40
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       35651586
222 Loaded_Hours            0x0032   091   091   000    Old_age   Always       -       3984
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       600
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 37 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 37 occurred at disk power-on lifetime: 4292 hours (178 days + 20 hours)
  When the command that caused the error occurred, the device was in standby mode.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 68 bf 1d a6 40  Error: ICRC, ABRT at LBA = 0x00a61dbf = 10886591

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 88 18 2d a6 40 00      01:42:19.109  READ FPDMA QUEUED
  60 00 60 18 2c a6 40 00      01:42:19.109  READ FPDMA QUEUED
  60 00 58 18 2b a6 40 00      01:42:19.108  READ FPDMA QUEUED
  60 00 80 18 2a a6 40 00      01:42:19.108  READ FPDMA QUEUED
  60 00 50 18 29 a6 40 00      01:42:19.108  READ FPDMA QUEUED
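
The register dump in the error log above can be decoded by hand, which confirms that smartctl is reporting an interface CRC error. The sketch below is a minimal Python illustration, assuming the 28-bit register layout that this summary log prints (SN, CL and CH hold LBA bits 0 to 23, the low nibble of DH holds bits 24 to 27). Only a few error register bits are listed, and `decode_error` is a hypothetical helper.

```python
# Selected ATA error register bits (the remaining bits are omitted in this sketch).
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode_error(er, sn, cl, ch, dh):
    """Return (flag names, LBA) from the registers of a 28-bit ATA error entry."""
    flags = [name for bit, name in ERROR_BITS.items() if er & bit]
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return flags, lba


# Registers from "Error 37" above: ER=84 SN=bf CL=1d CH=a6 DH=40
flags, lba = decode_error(0x84, 0xBF, 0x1D, 0xA6, 0x40)
print(flags, hex(lba), lba)
```

For the entry above this gives the ICRC and ABRT flags and LBA 0x00a61dbf, matching the line smartctl printed. ICRC stands for an interface CRC error, the kind of error that attribute 199 (UDMA_CRC_Error_Count) counts.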

Be aware that Tri-Mode HBAs or PCIe switch HBAs generate their own PCIe lanes, which they use to interface with the NVMe drives connected to them. PCIe bus errors can happen in this chain, too, not only between the CPU or motherboard chipset PCIe and a device like a PCIe NVMe SSD installed directly on the motherboard.

But it’s hard to diagnose since the HBA chipset isn’t made with diagnostics in mind; errors here might stay silent until the entire HBA SoC crashes.

Yes, but the two U.2 devices have device aliases, sdq and sdp, and a PCIe error should come up as an I/O error, but dmesg showed I/O errors only for the spinning disks.
I’ll do a few more tests as soon as I have a second backup and the data on the current backup pool is checked. I’m currently doing the second backup on individual disks, which will probably take another three days.

Have you verified that PCIe errors with the NVMe SSDs actually get reported back to the operating system?

SATA drives have the handy C7 SMART value that can be used for error reporting (indicating physical connection issues between the drive and the host bus adapter). NVMe SSDs unfortunately don’t have such a value, meaning you’ll never know about these errors if they are masked in hardware.
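
C7 is the hexadecimal ID of attribute 199, UDMA_CRC_Error_Count, which appears in the smartctl output earlier in the thread. A minimal Python sketch for pulling its raw value out of `smartctl -A` text follows, so that the counter can be compared before and after a scrub. The helper name and the idea of diffing two readings are assumptions for illustration; the sample row is copied from that output.

```python
def crc_error_count(smartctl_text):
    """Return the raw value of SMART attribute 199 (0xC7), or None if absent."""
    for line in smartctl_text.splitlines():
        parts = line.split()
        # Attribute rows start with the numeric ID; the raw value is column 10.
        if len(parts) >= 10 and parts[0] == "199":
            return int(parts[9])
    return None


before = crc_error_count(
    "199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       40"
)
print("attribute", 0xC7, "raw value:", before)
```

On most drives this counter is cumulative, so what matters is whether it grows between two readings, not its absolute value.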

I looked for all possible errors, including PCIe errors. Although the problems only started with the migration of the special vdev, the U.2 devices are currently connected to the 9500 and the disks to the 9300, and there are no more errors.
I have now ordered a complete set of new cables and an Adaptec HBA to be able to update the firmware of the SAS expander; maxView Storage Manager does not recognize the expander when it is connected to the 9500.

I got the SFF-8654 8i Straight to 8x SATA cable yesterday; now all drives are connected to the 9500.
So far the pool is stable, with no more errors under load.
Either the cable to the SAS expander was at fault, or the expander itself is not compatible with the 9500.
I hope the new firmware for the SAS expander helps.
