TrueNAS drive errors

Hey all, looking for some help troubleshooting errors I'm getting on my TrueNAS storage server. I'm running TrueNAS 12.0 stable with four 2TB and two 3TB drives in a RAIDZ2 pool. Yeah, I know the two different sizes aren't optimal; they were all 2TB at one point, but when a couple died I replaced them with the 3TB ones I had lying around, and since this entire system has been built from free parts, I can't complain.

I'm now seeing an increasing number of checksum errors on the drives in that pool. It started as one or two errors a month, then one or two a week. I tried swapping the cables for newer, hopefully better-quality ones, because I read online that cabling can be an issue with many drives in a small space. For about a month after that there were no errors, but now roughly 60 errors popped up overnight, ranging from 6 to 17 per drive. I'm not seeing any reported SMART errors, and scrubs still complete successfully.

Do y'all think this means the drives are all starting to fail, or could it be something else?
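
For reference, these are roughly the commands I've been using from the TrueNAS shell to watch the counters (the pool and device names below are just placeholders, not my actual ones):

```
# Per-drive read/write/checksum error counters plus the result of the last scrub
# ("tank" is a placeholder pool name)
zpool status -v tank

# Full SMART health report for one drive (ada0 is a placeholder;
# TrueNAS Core uses ada*/da* style device names)
smartctl -a /dev/ada0
```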

Could be the controller.

Are they plugged into the motherboard? If so, what motherboard? I noticed on Z170/Z270 systems running a lot of drives in a setup like this that the on-board ports would degrade after a while.

I noticed this on Sandy Bridge-era machines too, but I thought it had been fixed by the Z170/Z270 era. Not so much.


They are all plugged into the motherboard, yeah. Being all hand-me-down used hardware, it's pretty old: a GA-890GPA-UD3H running a Phenom II X6. Not exactly a spring chicken of a system, but it had been getting the job done for a good while, until now.


ZFS doesn't care which port a drive is plugged into, since it identifies drives by their own labels rather than by port. You can try moving a drive to a different port and see whether the errors follow the drive or stay with the port.

Obviously do this while powered down so you aren’t running live without redundancy.

Also check the cables, or try replacing them.

Different-sized drives shouldn't be a problem. Are you sure the new ones aren't SMR drives?
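
If you want to check from the shell, something like this should work on TrueNAS Core (ada0 is a placeholder device name; you'd look the reported model number up to see whether it's an SMR part, and the gptid labels show how ZFS tracks each disk independently of its port):

```
# Show the drive model so you can look up whether it's SMR
smartctl -i /dev/ada0

# Map the gptid labels ZFS uses back to physical device names
glabel status
```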


My vote goes to a problem with the controller, as opposed to cabling or the HDDs, since all the drives seem to be affected.

Can you measure your southbridge temperature during drive activity?
If you see it getting too hot (perhaps due to an aging thermal paste/pad), that might be a clue.


You also want to check memory, especially if you're not running ECC, but cabling and power delivery can also cause interesting issues. I see that you've replaced the cables, but I'd still check your memory (run memtest86 for 4+ hours) and the PSU. It could be the motherboard, but I somewhat doubt it.


Because it's several drives at once, it doesn't look like the actual drives (as others have mentioned).

Could it be something like a failing fan or some new source of vibration?

Also, the mixed sizes are totally fine; I would replace the 2TB drives with 4TB+ drives as they fail, and then when more go, proactively upgrade the rest (one by one) and expand the pool once everything is on new drives (see the rough sketch below).

I don't think that will help this particular issue; just don't feel bad about the mixed drives (matching speed/size does give minor performance and consistency improvements, but not much).
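
If you do go the replace-and-expand route, the per-drive swap looks roughly like this from the shell (pool, gptid, and device names are placeholders; on TrueNAS you would normally do this through the web UI instead):

```
# Let the pool grow automatically once every drive in the vdev is larger
zpool set autoexpand=on tank

# Replace an old/failed 2TB disk with a bigger one, one drive at a time,
# waiting for the resilver to finish before doing the next
zpool replace tank gptid/OLD-DISK-ID /dev/ada6
```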


Any input on how exactly I could do this? Is there a way I can view the temp from within TrueNAS, or would I be looking at trying to get a thermal probe in there somehow, haha? Currently running a memtest; if that doesn't turn up anything, this will be my next check.

Hmmm… you could try to see if lm-sensors can read the temp, but I have a feeling it won't due to the somewhat uncommon hardware.

If it were me, I'd just touch the southbridge with my hand to see if it hurt; the human threshold for pain is about 140°F (60°C), which is conveniently about where I'd draw the line for what's considered hot for silicon.

Alternatively, you could just hook up an extra fan, point it at the southbridge, and see if that affects the error rates.

diizzy's response has got me thinking it could also be a power supply (or power supply cabling) issue, where the power browns out just enough that only the HDDs are affected.
You don't by chance have an extra power supply lying around, do you?

Out of the box you can usually extract the CPU temperature (sysctl) and HDD temperatures (smartctl). I think these are present by default in TrueNAS Core?
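
Something along these lines from the TrueNAS shell, assuming the relevant driver exists for that board (the sysctl node only shows up if a temperature module like amdtemp is loaded, and ada0 is a placeholder device name):

```
# CPU temperature, if the amdtemp/coretemp sysctl node exists on this board
sysctl dev.cpu.0.temperature

# Drive temperature reported by SMART (repeat per drive)
smartctl -A /dev/ada0 | grep -i temperature
```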


Oops, scratch using lm-sensors; I had my Linux thinking cap on.

IDK if this is the best way to do this, but here's an update for everyone who replied. I ran memtest86+ and found that I was getting a decent number of errors almost immediately, so I have spent the day putzing around with memory. I put together a set of four 4GB sticks to replace the two 8GB sticks that were in there. These four all worked individually for a full memtest pass in my test bench system, worked in pairs in that same system, and worked with all four together in it as well.

I then tried two of them in the TrueNAS server: almost immediate errors again. Next I ran the other pair in there, and it ran memtest for about 3 hours with no issues. I briefly tried all four together but got errors pretty quickly. I am now back to the "good" pair and running the test again. I am in no huge rush to get the server up since it's just a home server, and I migrated the important "live" data to my backup NAS for the time being, so I am going to let the test run overnight just to be sure.


Another tip for pool configuration: if you use 3 mirror vdevs, you get 7TB out of those drives (2+2+3). You get better performance, and you can remove a vdev if other drives fail and let the remaining vdevs absorb the data. You also don't have to run the 3TB drives as 2TB. And adding 2 new disks is easy mode compared to RAIDZ.

3 striped mirrors vs. a 6-wide RAIDZ2 at the cost of 1TB? Very good trade.
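
For illustration, the layout I'm suggesting would look roughly like this at the zpool level, pairing the two 3TB drives together (device names are placeholders; on TrueNAS you'd normally build the pool through the web UI rather than the shell):

```
# Three two-way mirrors striped together: 2TB + 2TB + 3TB usable = 7TB
# (ada0..ada5 are placeholder device names; ada4/ada5 are the 3TB pair)
zpool create tank \
  mirror ada0 ada1 \
  mirror ada2 ada3 \
  mirror ada4 ada5
```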

I hope the memory was in fact the cause of it. It's always good to pinpoint the problem rather than searching in frustration. I agree with the others: sudden simultaneous errors on multiple drives point to "not the drives". I've seen some very "special" cabling from users here who had the same problem, for example. I personally have never had any trouble with on-board SATA, but many users here hate it, as it can indeed be a source of trouble.

ZFS can be very fussy when things smell fishy, and it lives by "better safe than sorry". Complaining about errors or locking down the pool to prevent user data from taking permanent damage is a nice thing, but sometimes you're just left saying "hey, what's the problem? Talk to me!".

Keep us updated :)


So 3 sets of 2 drives mirrored, if I get your meaning. I see the appeal of that: it gives me more storage, and in theory I can lose up to 3 drives and still have all my data, as long as it's the right 3 (one from each mirror). Personally, I prefer a larger RAIDZ2 vdev just because I like being able to lose any two drives and still have all my data, as opposed to losing 2 random drives and either being fine or losing data depending on which two failed. That matters especially in a system like this where most of the drives (all of them, when I started) are not NAS-rated and are all used. That's just a personal preference though, totally not something I would argue with anyone over.


Another update, hopefully the final one. After letting the system run memtest86+ for over 12 hours with the "good" pair of memory sticks in it, it presented no errors. So I booted it back into TrueNAS, got it back up and running on the network, and moved a small number of things back to running off it: a couple of VMs running off an iSCSI target on the system, plus a full backup of my desktop to it. It has been just about 2 days since then, and I am not seeing any new errors. Hopefully it stays that way for a good while now. The current plan is to try to rebuild the system in December with a new CPU/mobo/RAM/SATA controller combo. Hopefully it's all good until then. But for now, thanks everyone for the input and support!


This topic was automatically closed 273 days after the last reply. New replies are no longer allowed.