Server Degraded from Checksum Errors, what to do next?

Hello Everyone,

So I recently started dumping all of my DVDs and Blu-rays onto my server. By and large this had been going swimmingly. A few nights ago I was streaming via Jellyfin while more DVDs were being ripped, and the video quality and FPS started to tank. Wondering what was going on, I went to my desktop and pulled up the GUI. The data VDEVs were listed as mixed capacity (they're identical 4-disk RAIDZ1 VDEVs with a hot spare) and the pool status was Degraded with over 500 ZFS errors. When I looked at the Devices page at the time, the degraded disks were on my StarTech SATA expansion card. I stopped the two discs that were being ripped, pulled a couple of recent work files off the server and onto my desktop, let the in-progress resilver finish, and shut the server down.

I thought something was wrong with the StarTech card, so after some reading, I decided to replace it with a new LSI 9207-8i and a pair of 4-way breakout cables.

The card and cables arrived today, so I popped them in, plugged in the corresponding drives, and booted the server back up. It started resilvering, and I went into the Devices page to see what all the drives were up to.

Not much in the way of useful info there, so I went and ran zpool status in the shell.

Now, when it happened late Saturday night, I didn't know about zpool status, so I didn't think to run it. But today it was showing 48 errors. I don't understand what prompts a checksum error, but I do know I hadn't written to or read from any drives while this was happening. This is also where I realized that my spare got called off the bench.
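For anyone following along, the command I ended up running looks roughly like this (the pool name here is just a placeholder, not my actual pool):

```
# Show pool health, per-device read/write/checksum counters, and
# (with -v) a list of any files ZFS knows were hit by the errors.
zpool status -v tank
```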

For some reason TN seems to be shuffling the sd letter assignments, so I've come to a couple of wrong conclusions. I originally thought it was the StarTech card. But when I booted it up today, it had sda-d listed as the drives in trouble, and per the original configuration, a-d are on the mobo. So I thought maybe I just had too much happening for the mobo to keep up with. Now that 8 of the drives are on the LSI card, the lettering has roughly gone back to the original, so my first thought to blame the StarTech card seems to stand up again.

Resilver is done:

And here's where I don't know what to do next. I did read the doc it told me to, but I'm still not sure what to do. I can run the zpool clear command, but will that take the drives out of degraded mode? And why did my spare get pulled into this? All the drives in raidz1-1 only have about 400 hours on them, but I'm not buying the idea that three brand-new drives are having an early death. It had to be the StarTech expansion card. That aside, I don't know what the 500+ errors were that piled up before I realized something was wrong. (I have tried to set up OAuth email to alert me of stuff like this, but it REFUSES to work.) As far as I can tell, none of the files that were being read or written while this was all going sideways seem to be corrupted.

So what should I be doing here? Do I clear the pool and carry on? Is there a way to see all the previous errors? Why is my spare drive involved? If the card was the issue, how do I get the drive that the spare replaced back into the mix as the new spare? Anyone got any ideas on the email thing? This guide is what I followed and I got no love.

For reference, this is the NAS build:
Mobo: GIGABYTE Z590 AORUS Master
CPU: i5-11600K
Network: Dual 10Gb X540-AT2
NVME: Samsung 980 Pro
Pool: 9x 6TB WD Blue
(4 in a RaidZ1)
(4 in a RaidZ1)
(1 as a Hot Spare)
HBA: LSI 9207-8i
Optical: 2x LG WH16NS40 16x Blu-ray
PSU: Corsair HX1000
Case: Corsair 750D

Some of the 6TB WD Blue models are SMR (shingled magnetic recording).
SMR will not work with ZFS. In my opinion, SMR should be banned. First, make sure your drives are not SMR. If they are, then at 400 hours they may still be within the return window; return them.
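If you want a quick way to check, something like this should pull the model number from each drive so you can compare it against WD's CMR/SMR lists (device paths are just examples; if I recall correctly, the 6TB Blue models ending in EZAZ are the SMR ones):

```
# Print the model and serial for every sd* drive.
# WD60EZAZ = SMR; WD60EZRZ is the older CMR 6TB Blue.
for d in /dev/sd?; do
  echo "== $d =="
  smartctl -i "$d" | grep -E 'Device Model|Serial Number'
done
```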

According to the screenshots you provided, this is not the case. Your pool "Rust Belt" has 2 raidz1 vdevs: raidz1-0 contains 4 drives, while raidz1-1 contains 3 drives and 2 spares (spare-0) dedicated to that vdev. Finally, you have a spare for the whole pool (spares). Unfortunately, the only way you will be able to fix this is by recreating the pool. Single drives cannot be added to raidz vdevs, though there are plans to support this in the future.

If any of your drives are SMR, as @jxdking suggested, that and/or the mixed-capacity vdevs are likely explanations for the performance issues.

This is not specific to TrueNAS; device discovery on Unix-style systems works on a first-come, first-served basis. The only consistent way to identify a drive is by checking its SMART data to get its serial number. It is good practice to make a list of each drive's physical location in your case by serial number, for easy identification.
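For example, something along these lines will map the shifting sdX letters to serial numbers (run as root; output columns may vary by version):

```
# One-shot view of every block device with model and serial.
lsblk -o NAME,MODEL,SERIAL,SIZE

# Or query a single drive directly via SMART.
smartctl -i /dev/sda | grep -E 'Device Model|Serial Number'
```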

I think that at the moment the screenshot was taken, the spare (00a…ccc) was being resilvered in to replace 61b…e1e.
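In zpool status that shows up roughly like this while a spare is active (device names below are made up, just to show the shape of it):

```
        raidz1-1             DEGRADED     0     0     0
          sde2               ONLINE       0     0     0
          sdf2               ONLINE       0     0     0
          spare-0            DEGRADED     0     0     0
            sdg2             DEGRADED     0     0    48
            sdi2             ONLINE       0     0     0  (resilvering)
          sdh2               ONLINE       0     0     0
        spares
          sdi2               INUSE     currently in use
```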

Looks like the OP did a zpool clear.
I would suggest that once the spare has finished resilvering in to replace the other drive, you back up what data you can and then fire off a scrub.
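Something along these lines once the resilver reports done (pool name is just a placeholder):

```
zpool status tank    # confirm the resilver has finished
zpool scrub tank     # re-read every block and verify checksums
zpool status tank    # watch scrub progress and any new CKSUM counts
zpool clear tank     # reset the error counters once the scrub comes back clean
```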

Oops, I clearly was not caffeinated enough, you are correct.

Yep, turns out 5 of my 9 drives are SMR. I’m super f-ing thrilled about that.

Thankfully I got all the genuinely important stuff off the array, and while I wouldn't be excited about it, all the other data is still retrievable from external drives or DVD/BR.

When I started this whole project, I told myself to get IronWolfs, the LSI card, and all the other things I saw in all the videos I watched. I guess I learned a lesson there.

When I get back home from work in a couple of weeks I’m gonna try and get the rest of the data off of it. Then I’m nuking the thing. Gonna start looking at relocating this, the workstation I’m about to build, and my gaming rig into rack mounts and putting them in a different room. I’m tired of all this hot ass air blowing on me at my desk.

Thanks for the SMR info. Here’s to taking all the TN stuff I learned and doing it all over again!
