Persistent HBA Reset and ZFS Expansion Issues on Dell R720XD with PCIe HBA
Description:
I’m hitting a persistent issue on my TrueNAS system (version 24.10): shortly after boot, the PCIe HBA that connects my SAS drives resets, and the pool suspends. This happens while a ZFS expansion is in progress, and after several restarts a few non-critical files now show permanent errors.
Key Details:
Hardware:
- Server Model: Dell R720XD
- Network Card: Intel X520 SFP+ single-port NIC
- HBAs: Dell H310 (flashed to IT mode) and PCIe HBA for SAS drives
- SAS Drives: Connected via PCIe HBA and onboard HBA
- SATA Drives: Connected via Dell H310 HBA (flashed to IT mode)
- External SAS Setup: Single-path SAS
- NetApp Storage: NetApp 4224 (additional storage)
- NetApp IOM6 Module: Recently replaced
Issue Summary:
- The pool imports as “ONLINE” during boot, but it suspends shortly afterward, as soon as the PCIe HBA resets.
- The issue occurs while a ZFS expansion is in progress, and the pool has been scrubbing since the last reboot.
- After the reset, ZFS operates normally, but a few non-critical files were found to be corrupted.
- Only the vdevs attached to the PCIe HBA (SAS drives) are affected; the vdev being expanded, which sits on the onboard Dell PERC H310, is unaffected.
- The issue started after upgrading from TrueNAS 24.04 to 24.10 and then upgrading the ZFS pool.
- Before the TrueNAS update and the pool upgrade, neither the pool nor any disk showed errors. Rolling back is not an option because the pool was upgraded to the new version as part of the upgrade process.
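To confirm which disks actually sit behind which controller (and so which vdevs share the resetting HBA), the sysfs device paths can be grouped by the PCI address of the owning controller. This is a minimal sketch, assuming a Linux-based TrueNAS install with the usual /sys/block layout; the function names are my own, and NVMe or virtual devices are simply skipped.

```python
import os
import re

# PCI address (domain:bus:device.function) of the controller that owns the
# SCSI host: the last PCI component before "/hostN". AHCI ports add an
# "ataN" component in between, so that segment is optional.
_HBA_RE = re.compile(
    r"([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])/(?:ata\d+/)?host\d+"
)


def hba_pci_address(sysfs_path):
    """Return the PCI address of the controller in a resolved sysfs path, or None."""
    match = _HBA_RE.search(sysfs_path)
    return match.group(1) if match else None


def disks_by_hba(sys_block="/sys/block"):
    """Group block device names by the PCI address of their controller."""
    groups = {}
    if not os.path.isdir(sys_block):
        return groups
    for name in sorted(os.listdir(sys_block)):
        # /sys/block/<name> is a symlink into /sys/devices/...; resolve it.
        real = os.path.realpath(os.path.join(sys_block, name))
        addr = hba_pci_address(real)
        if addr:
            groups.setdefault(addr, []).append(name)
    return groups


if __name__ == "__main__":
    for addr, disks in disks_by_hba().items():
        print(addr, " ".join(disks))
```

The printed PCI addresses can be matched against lspci output to tell the PCIe HBA apart from the H310.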
Steps Taken:
- BIOS Adjustment: Set the R720XD’s BIOS system profile to Performance in order to disable ASPM. I have not yet changed the GRUB boot parameters to disable ASPM in the kernel as well.
- Hardware Troubleshooting:
- Replaced the HBA and cables.
- Switched PCIe slots and replaced the IOM6 module in the NetApp system.
- Scrubbing and Pool Resuming: Once the device has reset and the pool resumes, the scrub and the expansion run normally. However, a few non-critical files developed permanent errors after several reboots.
- Observations: SATA drives attached via the Dell H310 HBA (flashed to IT mode) and SAS drives connected to the onboard HBA show no issues after the reset. The expansion vdev, which uses the PERC H310, remains unaffected.
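Since the BIOS profile and the kernel's ASPM policy are set separately, it may be worth checking what the running kernel reports. The following is a minimal sketch, assuming a Linux-based TrueNAS install that exposes /sys/module/pcie_aspm/parameters/policy and /proc/cmdline; the helper names are hypothetical, and the script only reads state without changing anything.

```python
import re


def active_aspm_policy(policy_text):
    """Return the active policy, which the kernel marks with square brackets."""
    match = re.search(r"\[(\w+)\]", policy_text)
    return match.group(1) if match else None


def aspm_disabled_on_cmdline(cmdline):
    """True if the kernel was booted with the pcie_aspm=off parameter."""
    return "pcie_aspm=off" in cmdline.split()


def read_text(path):
    """Read a small text file, returning None if it is missing or unreadable."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    policy = read_text("/sys/module/pcie_aspm/parameters/policy")
    cmdline = read_text("/proc/cmdline") or ""
    print("ASPM policy:", active_aspm_policy(policy) if policy else "unavailable")
    print("pcie_aspm=off on kernel command line:", aspm_disabled_on_cmdline(cmdline))
```

The policy file lists every available policy with the active one in brackets, so a value such as "default" means the kernel is deferring to whatever the firmware configured.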
Impact:
- Data Corruption: Only non-critical files were corrupted.
- Pool Suspension: The pool suspends after boot due to the PCIe HBA reset, impacting boot reliability.
- ZFS Expansion: ZFS expansion is still ongoing but functioning without issues on the unaffected vdev.
Additional Information:
- The issue started after upgrading to TrueNAS 24.10 from 24.04, following a ZFS pool upgrade.
- Prior to the update, there were no issues with the pool or disks.
- The expansion continues despite the resets, though it is slow given the state the pool is in while it runs.
- The problem has persisted across multiple restarts, despite changing hardware components and adjusting BIOS settings.
Suggestions / Questions:
- Has anyone encountered similar issues with PCIe HBAs and SAS drives on TrueNAS after upgrades? What solutions have you found?
- Would disabling ASPM at the kernel level through the GRUB boot parameters (for example, pcie_aspm=off) be likely to fix this issue?
- Is there a specific fix for ZFS pools during expansion that might mitigate the resets or prevent file corruption?
Thanks in advance for any insights or suggestions!