So I'm having some issues with my NAS (TrueNAS) while rebuilding an array. Let me start at the beginning.
Setup: 8x 10TB drives (half Exos, half WD Red Pro)
SAS2008 card
Ryzen 5 + Server Mobo
128 GB ECC DDR4 RAM
I started seeing checksum errors on one of my drives. I tried some scrubs and clears, and the error count kept jumping around (61, 133, 271, 45). At this point I decided to replace the drive.
I replaced the drive, but the replacement shows the same checksum errors and scrubs are not fixing them. The count still jumps around after clears and scrubs (150, 61, 23). I tried swapping SAS ports, and the issue follows the drive.
I figured the new drive was damaged during shipping, so I got a second replacement WD Red Pro and resilvered the array again. The checksum errors returned and the drive errors out. Scrubs don't fix it, and the checksum errors continue after clears and repeated scrubs.
Then I figured it was a SAS card issue, so I replaced it with a SAS3008 card plus new wiring. I wiped the drive and did a rebuild, and I'm still seeing checksum errors. I tried scrubs and clearing again, but the numbers jump around after each scrub (1, 61, 32).
** Note: none of the drives are failing SMART. The array is going into a degraded state, not failing.
I do have all this data backed up, but I'm concerned that ZFS is unable to repair this and that it points to a larger hardware issue.
I have pulled the logs and all the errors are kind of odd. They keep saying the drive is blank in places where it should have data (not normal, I'd expect to see bit rot). Logs are below.
The drives are directly attached to the HBA. I checked IPMI and there are no ECC events, only clock syncs with NTP. I also checked dmesg for any obvious hardware errors and am not seeing any. I have also set the HDDs to always on and APM to 254 (max power).
I'm going to try a new resilver with the new drive connected to the motherboard SATA ports, then a scrub, to see what happens. I also updated the firmware on the HBA last night to 16.0.12.0, based on the TrueNAS recommendations for the SAS 3008. Maybe this rebuild will fix it.
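Since the counts keep changing between scrubs, here is a rough Python sketch for logging the per-device CKSUM column after each run so the numbers can be compared later. It is only a sketch: the helper names (parse_cksum_counts, nonzero_cksum, read_zpool_status) are made up for this post, and it assumes the usual "NAME STATE READ WRITE CKSUM" table layout that `zpool status -p` prints (the -p flag asks for exact numbers instead of abbreviated ones).

```python
import subprocess


def parse_cksum_counts(status_text):
    """Return {device_name: cksum_count} from `zpool status -p` text.

    Assumes the standard config table with a NAME/STATE/READ/WRITE/CKSUM
    header; rows whose counters are not plain integers are skipped.
    """
    counts = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True          # rows for this pool start on the next line
            continue
        if not in_table:
            continue
        if not fields:               # a blank line ends the config table
            in_table = False
            continue
        if len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
            counts[fields[0]] = int(fields[4])
    return counts


def nonzero_cksum(status_text):
    """Keep only the entries that currently report checksum errors."""
    return {name: n for name, n in parse_cksum_counts(status_text).items() if n > 0}


def read_zpool_status():
    """Run `zpool status -p` and return its stdout (needs ZFS installed)."""
    result = subprocess.run(
        ["zpool", "status", "-p"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling nonzero_cksum(read_zpool_status()) after each scrub and appending the result to a file with a timestamp would give a history of which device is reporting errors and how many. If more than one pool uses the same device name, the later entry overwrites the earlier one in this sketch.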