TrueNAS SSD mirrored pool - kaputt

MadMatt · December 29, 2021, 9:26pm

Hi guys, just comparing notes with the community: one of my mirrored SSD pools just died on me.
It looks like both SSDs failed at the same time, or they probably failed over time but hung on until I shut down and rebooted the server …
I had to do some electrical rewiring on my UPS line today, so I did a controlled shutdown of my NAS (that failed) … unplugged all power from my UPS line and proceeded with the rewiring.
Once done, I powered my NAS up again (an HP MicroServer Gen8, running 2 10TB HDDs, 2 60GB SSDs - mirrored root, 2 500GB SSDs - mirrored jails, VMs and other stuff) … only to discover my 500GB SSD pool was not coming up, and rightly so, since the 500GB SSDs weren’t even initializing on the controller

I pulled the SSDs out and I can’t access either of them from my laptop with a USB to SATA interface …
For now I just plugged in a spare 500GB SSD and restored from backup, and all is running smoothly so far (albeit on a single drive), waiting for a second SSD to come in to reestablish the mirror.

The SSDs are KingDian S280 480GB models, both bought from Amazon, one in September 2017 and the other in May 2018 …
It’s the first time in years I’ve had an SSD fail on me, and I’ve never before had two fail within a span of weeks (the last time the NAS was rebooted was probably a month ago) …
Just wondering how I could have noticed them failing … no SMART errors reported, no errors reported during scrub, no bad sectors reported … they just did not recover from a reboot …

The data on it isn’t really important: I recovered from a snapshot and I am only running jails and my docker VM, with no real data lost other than 24 hours of house sensors and prometheus monitoring … just wondering if anybody has a clue as to what might have gone so horribly wrong and what I could do in the future to avoid the pain of restoring the whole pool from backup …

MadMatt · December 31, 2021, 5:23pm

replying to myself just to document what I did to recover from the double SSD failure …
I had one spare 500GB ssd

booted the system with the new SSD inserted
deleted the disappeared pool from the TrueNAS GUI, without selecting delete data/delete share/replication task config (this moved the system dataset to my other pool, and left the existing shares, iSCSI config and replication tasks untouched, other than disabling them)
created a new pool with the same name, with a single vdev, unmirrored
restored the data in the pool from a remote snapshot on my backup NAS (yes, I have two, and yes, I replicate them daily):

root@hpmicro-02[~]# zfs list -t snapshot | grep ssdmicro01@
hd01/ssdmicro01@auto-2021-12-18_00-00                                                                                         0B      -     1.38M  -
hd01/ssdmicro01@auto-2021-12-19_00-00                                                                                         0B      -     1.38M  -
hd01/ssdmicro01@auto-2021-12-20_00-00                                                                                         0B      -     1.38M  -
hd01/ssdmicro01@auto-2021-12-21_00-00                                                                                         0B      -     1.38M  -
hd01/ssdmicro01@auto-2021-12-22_00-00                                                                                         0B      -     1.38M  -
hd01/ssdmicro01@auto-2021-12-23_00-00                                                                                         0B      -     1.38M  -
hd01/ssdmicro01@auto-2021-12-24_00-00                                                                                         0B      -     1.38M  -
hd01/ssdmicro01@auto-2021-12-25_00-00                                                                                         0B      -     1.38M  -
hd01/ssdmicro01@auto-2021-12-26_00-00                                                                                         0B      -     1.38M  -
hd01/ssdmicro01@auto-2021-12-27_00-00                                                                                         0B      -     1.38M  -
hd01/ssdmicro01@auto-2021-12-28_00-00                                                                                         0B      -     1.38M  -
hd01/ssdmicro01@auto-2021-12-29_00-00                                                                                         0B      -     1.38M  -
root@hpmicro-02[~]# zfs send -R hd01/ssdmicro01@auto-2021-12-29_00-00 | ssh [email protected] zfs recv -F ssdmicro01
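For anyone scripting this step, here is a minimal sketch of picking the newest snapshot out of the listing instead of copying the name by hand. It assumes the snapshot names sort chronologically, as the `auto-YYYY-MM-DD_HH-MM` names above do; the function name `latest_snapshot` is made up for illustration.

```shell
#!/bin/sh
# Sketch: print the newest snapshot of one dataset from a `zfs list` listing.
# Assumption: snapshot names sort chronologically (auto-YYYY-MM-DD_HH-MM).
# latest_snapshot is a hypothetical helper, not part of ZFS or TrueNAS.
latest_snapshot() {
    # $1 = dataset name, e.g. hd01/ssdmicro01; the listing is read from stdin.
    # Keep only lines whose first column starts with "<dataset>@",
    # then sort by name and take the last one.
    awk -v ds="$1@" 'index($1, ds) == 1 { print $1 }' | sort | tail -n 1
}

# Usage on the backup NAS (not executed here):
#   zfs list -H -t snapshot -o name | latest_snapshot hd01/ssdmicro01
```

Matching on the `dataset@` prefix rather than a plain `grep` keeps snapshots of similarly named datasets out of the result.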

waited for the restore to finish
set the restored dataset to read-write (when backed up it is automatically set to read-only by TrueNAS)

root@hpmicro01[~]# zfs set readonly=off ssdmicro01

rebooted, moved the system dataset to the new pool, and re-enabled all dependent shares, then restarted my docker VM
ordered a new SSD to match my spare one; once it arrived, I added it to my existing pool:

gpart list
gpart create -s gpt /dev/da3
gpart add -t freebsd-zfs /dev/da3
gpart list
zpool attach ssdmicro01 /dev/gptid/66c659b3-68dc-11ec-b1c9-94188238f234 /dev/gptid/b337c61c-6a5b-11ec-bbb4-94188238f234
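The attach needs the gptid labels of both the existing and the new partition. The commands above don't show how they were looked up; one way (an assumption, not necessarily what was done here) is to read them from `glabel status` on FreeBSD, which lists each label next to its component. The helper name `gptid_for` and the partition name `da3p1` are hypothetical.

```shell
#!/bin/sh
# Sketch: find the gptid label that belongs to a given partition.
# Assumption: input is `glabel status` output with the columns
# Name, Status, Components. gptid_for is a hypothetical helper.
gptid_for() {
    # $1 = partition name, e.g. da3p1; `glabel status` output is read from stdin.
    awk -v part="$1" '$3 == part && $1 ~ /^gptid\// { print $1 }'
}

# Usage on the NAS (not executed here):
#   new=$(glabel status | gptid_for da3p1)
#   zpool attach ssdmicro01 /dev/<existing gptid label> "/dev/$new"
```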

root@hpmicro01[~]# zpool status ssdmicro01
pool: ssdmicro01
state: ONLINE
scan: resilvered 130G in 00:07:09 with 0 errors on Fri Dec 31 18:13:08 2021
config:

NAME                                            STATE     READ WRITE CKSUM
ssdmicro01                                      ONLINE       0     0     0
  mirror-0                                      ONLINE       0     0     0
    gptid/66c659b3-68dc-11ec-b1c9-94188238f234  ONLINE       0     0     0
    gptid/b337c61c-6a5b-11ec-bbb4-94188238f234  ONLINE       0     0     0

errors: No known data errors
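To keep an eye on the pool after the resilver without reading the whole status by hand, a small sketch like the following pulls out the `state:` line and turns it into an exit code. The function names `pool_state` and `pool_is_online` are made up for illustration, and this only reflects what ZFS itself reports; it would not have caught drives that show no errors until they disappear.

```shell
#!/bin/sh
# Sketch: extract the pool state from `zpool status <pool>` output.
# pool_state and pool_is_online are hypothetical helpers.
pool_state() {
    # Reads `zpool status` output on stdin and prints the value
    # of the first "state:" line (e.g. ONLINE, DEGRADED).
    awk '$1 == "state:" { print $2; exit }'
}

pool_is_online() {
    # Succeeds only when the reported state is exactly ONLINE.
    [ "$(pool_state)" = "ONLINE" ]
}

# Usage on the NAS (not executed here):
#   zpool status ssdmicro01 | pool_is_online || echo "ssdmicro01 needs attention"
```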