Resolving I/O errors unrelated to disk

MuffinSmith · December 29, 2019, 3:23am

Hiya,

I’m trying to learn Linux by throwing myself in at the deep end, and I’m finally starting to really struggle with how to resolve an issue. I’m trying to build myself a Proxmox server with a ZFS storage pool, and I think I’ve narrowed the I/O errors down to the backplane or the HBA I’m using, but I have no idea how to proceed.

The server is built in a SilverStone RM21-308 with a SAS 9207-8i Host Bus Adapter flashed to IT mode to connect the backplane.

The issue started when I tried to move my media collection onto it from another server. During the move it would get enough read and write errors to fault a disk and continue in a degraded state. I’ve since learned how to set the pool up properly, like forcing ashift=12 and adding disks by id, but the error isn’t resolved.
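For anyone following along, here is a rough sketch of what "forcing ashift=12 and adding disks by id" can look like. ashift is the base-2 exponent of the sector size the pool assumes, so 12 means 4096-byte sectors. The pool name, vdev type and disk ids below are placeholders I made up, not the ones from this build, and the helper only builds the command without running it.

```python
def sector_size(ashift):
    """ashift is a power-of-two exponent: ashift=12 -> 4096-byte sectors."""
    return 1 << ashift


def zpool_create_command(pool, vdev_type, disk_ids, ashift=12):
    """Build (but do not run) a `zpool create` command using by-id device paths."""
    devices = ["/dev/disk/by-id/" + disk_id for disk_id in disk_ids]
    return ["zpool", "create", "-o", "ashift=%d" % ashift, pool, vdev_type] + devices


if __name__ == "__main__":
    # Hypothetical pool name and disk ids, for illustration only.
    cmd = zpool_create_command("tank", "raidz1", ["ata-EXAMPLE_A", "ata-EXAMPLE_B", "ata-EXAMPLE_C"])
    print(sector_size(12))
    print(" ".join(cmd))
```

Using the /dev/disk/by-id paths means the pool does not depend on which /dev/sdX letter a disk happens to get after a reboot or a cable swap.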

What I’m getting currently is this message every once in a while when I initialize my zpool, and it bumps the I/O error counter in zpool status. I’ve also set initialize to write only 0s.
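To keep an eye on those counters without eyeballing the whole output each time, something like the sketch below might help. It pulls the READ/WRITE/CKSUM columns out of text shaped like zpool status output. The sample text is made up (pool name, device ids and counts are illustrative), and the parsing is an assumption about the usual column layout, so treat it as a starting point only.

```python
import re

# Made-up sample in the usual shape of `zpool status` output.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

        NAME                    STATE     READ WRITE CKSUM
        tank                    ONLINE       0     0     0
          raidz1-0              ONLINE       0     0     0
            ata-EXAMPLE_DISK_A  ONLINE       0     0     0
            ata-EXAMPLE_DISK_B  ONLINE       9   102     0

errors: No known data errors
"""

# Indented row: name, upper-case state, then three counter columns.
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\S+)\s+(\S+)\s+(\S+)")


def error_counters(status_text):
    """Return {name: (read, write, cksum)} for rows with any non-zero counter.

    Counters are kept as strings because large values may be abbreviated.
    """
    found = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match or match.group(1) == "NAME":
            continue  # not a device row, or the column header itself
        counts = match.groups()[2:]
        if any(count != "0" for count in counts):
            found[match.group(1)] = counts
    return found


if __name__ == "__main__":
    print(error_counters(SAMPLE))
```

On a real system the input would come from running zpool status and passing its text to error_counters, for example via subprocess.run with capture_output=True and text=True.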

I’ve been thinking that I could train myself to become a computer systems engineer but this is really triggering my impostor syndrome while all three of my servers are in a vulnerable state or broken.

oO.o · December 29, 2019, 3:51am

Can you remove the l2arc and zil just to eliminate some variables?

How are you sure it’s not the disks?

MuffinSmith · December 29, 2019, 4:00am

The issue was happening before I figured out how to add those devices to a pool, but I’ve gone ahead and removed them. I don’t think it’s the disks because they’re all brand new and running smartctl --all on all the disks shows no errors logged, but I have not done a self-test yet.

edit: I checked and the errors are still adding up

oO.o · December 29, 2019, 4:04am

Run a long test on all of them. It could definitely be the backplane, just want to eliminate the disks first. Do you have any other board you can plug them into? Even one at a time?
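If it helps, here is a small sketch for kicking off the long test on several disks at once. It only builds the smartctl commands and prints them by default. The device paths are placeholders, and running it for real needs root and smartmontools installed.

```python
import subprocess


def long_test_commands(devices):
    """One `smartctl -t long` command per device (starts the extended self-test)."""
    return [["smartctl", "-t", "long", dev] for dev in devices]


def selftest_log_commands(devices):
    """One `smartctl -l selftest` command per device (shows the self-test log)."""
    return [["smartctl", "-l", "selftest", dev] for dev in devices]


def run_all(commands, dry_run=True):
    """Print each command, or run it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # smartctl uses its exit status as a bitmask, so don't raise on non-zero.
            subprocess.run(cmd, check=False)


if __name__ == "__main__":
    disks = ["/dev/sda", "/dev/sdb"]  # placeholder device paths
    run_all(long_test_commands(disks))
    run_all(selftest_log_commands(disks))
```

The long test runs inside the drive in the background, so the log command is the one to come back to once the estimated time has passed.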

MuffinSmith · December 29, 2019, 4:14am

Should I run the long test on the backplane? Unfortunately my other server with a backplane is an old Dell that had a failure a week ago, and I’m waiting on a part to bring it back up. It would just be a huge pain, and I’d have to bring down my third server, which is just a mini-ITX with 6 HDDs shoved in it. It’s an Unraid box that I learned with, but the cache drive corrupted itself, so I want to put ZFS on it, but literally everything is falling apart on me.

oO.o · December 29, 2019, 4:23am

On a known good board/interface if possible (but try it on the backplane if no alternative is available). What model are the drives?

MuffinSmith · December 29, 2019, 4:28am

I think the drives are Seagate BarraCuda ST4000DM004 4TB 5400 RPM 256MB Cache

samarium · December 29, 2019, 6:12am

poor sata cables and/or poor connections can also cause problems

check the 8087 cables and make sure they are fully engaged

if you wiggle the cables while io load testing you might be able to induce errors, or also wiggle cable at the strain relief.

another set of cables might be nice to test against

sufficient power may be an issue, especially under active load

try running some io tests with minimal drives connected to both power and signal, try with just one drive, and also try connecting the same drive to the onboard controllers outside of the onboard backplane just using sata cables, and then using a 8087 to sata break out cable from your hba

eliminate as many variables as possible, and then swap equivalent configurations

make sure you have single disks working fine before adding more disks

try testing some different disks?

if you flashed the HBA with BIOS, enter the BIOS and check the configuration, maybe reset to factory defaults? I usually leave the BIOS out when I’m reflashing, and just use one of the mpt utils to interrogate from the command line, but unfortunately I don’t have access to that system at the moment, to tell you the name.

I had an issue with an SSD logging IO errors visible to smartctl -A. I swapped the SSD but issue persisted. I reseated the SATA cable and it seems to be better, but insufficient testing to be sure so far. I haven’t yet tested reinstalling the original SSD, it has 5 years uptime 24/7, so not sure I want to.

freqlabs · December 29, 2019, 6:20am

^ I second this

MuffinSmith · December 29, 2019, 7:27pm

I ran the long SMART test on all my new disks and I think nothing came up from that. I’ve got an 8087 breakout cable in the mail from another project that I’m going to use to test the HBA with the disks attached directly to the PSU. Last night I added a single spare drive to the backplane and set it to initialize on its own. It finished the process with 102 write errors, but the disk is still online at least.

I’ve just now shut down and reseated the SAS cables, and that cleared the write errors for some reason but not the 9 read errors. I’ve restarted the initialization process to test it while I wait for more cables.

I didn’t personally flash the HBA. I specifically bought a pre-flashed one, and this was the first one I got, so I’ll try swapping it next.

What would be a good way to test IO? I haven’t figured that one out yet
edit: immediately finds stress-ng after googling
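For a very basic do-it-yourself check alongside stress-ng, a write-then-read-back test is easy to sketch. This writes random blocks to a file on the pool, then reads them back and compares checksums. It is only a rough sketch: the path and sizes are up to you, and the read-back may be served from cache (ARC or page cache) rather than the disks, so a clean result does not prove the read path is healthy.

```python
import hashlib
import os
import tempfile


def write_and_verify(path, blocks=64, block_size=1024 * 1024):
    """Write random blocks to `path`, read them back, return the mismatch count."""
    digests = []
    with open(path, "wb") as f:
        for _ in range(blocks):
            data = os.urandom(block_size)
            digests.append(hashlib.sha256(data).digest())
            f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push the data out to the device before reading back

    mismatches = 0
    with open(path, "rb") as f:
        for expected in digests:
            if hashlib.sha256(f.read(block_size)).digest() != expected:
                mismatches += 1
    return mismatches


# Example (placeholder path on the pool under test):
#   write_and_verify("/tank/iotest.bin", blocks=1024)
```

Watching the zpool status counters and the kernel log (dmesg) while this runs is the real test, since I/O errors usually show up there before they show up as bad data.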

freqlabs · December 29, 2019, 7:35pm

SMART will only report on the drive, not any of the interconnect.

Check out the observability tools diagrams here: http://www.brendangregg.com/linuxperf.html

For example, this one: [diagram from the page linked above; image not preserved]

MuffinSmith · December 31, 2019, 12:39am

The errors seem to have gone away when I plugged the drives in with an 8087 to SATA breakout cable, and I’ve successfully moved most of my media collection without any I/O errors. I’ve ordered more SAS cables to try replacing the ones I have.

That’s a really neat diagram that will take me a long while to really understand, thanks!

MuffinSmith · January 2, 2020, 11:20pm

Turns out it was a bad HBA. Replacing that part has cleared up all the errors. Really frustrating that it didn’t fail outright. Thank you for the help.

samarium · July 14, 2023, 4:03am

A bad connection will often show up as UDMA CRC errors.
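A quick sketch for pulling that counter out of smartctl -A text, assuming the usual ATA attribute table layout where the raw value is the tenth column. The sample rows and values are made up for illustration.

```python
# Made-up sample in the usual shape of the `smartctl -A` attribute table.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   199   000    Old_age   Always       -       37
"""


def udma_crc_errors(smart_attr_text):
    """Return the raw UDMA_CRC_Error_Count, or None if the attribute is absent."""
    for line in smart_attr_text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == "UDMA_CRC_Error_Count":
            return int(fields[9])  # RAW_VALUE column
    return None


if __name__ == "__main__":
    print(udma_crc_errors(SAMPLE))
```

The raw count is cumulative over the drive's life, so what matters is whether it keeps climbing after a cable is reseated or swapped.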

Thanks for the graphic and the memory refresh, time to clone some tools.