While not exactly a Linux issue, I just (re)acquired the SATA bug on my system with too many HDDs (it's a hardware issue). I don't have Alertmanager set up in Prometheus yet, so I don't know how long it's been there (not more than a week; I check Proxmox status whenever I mess with anything in it).
dmesg output
[ 699.941647] ata3.00: status: { DRDY }
[ 699.942353] ata3: hard resetting link
[ 705.873287] ata3: link is slow to respond, please be patient (ready=0)
[ 708.157341] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 708.159869] ata3.00: configured for UDMA/33
[ 708.159886] ata3: EH complete
[ 708.197458] ata3.00: exception Emask 0x10 SAct 0x400 SErr 0x4890000 action 0xe frozen
[ 708.198300] ata3.00: irq_stat 0x08400040, interface fatal error, connection status changed
[ 708.198976] ata3: SError: { PHYRdyChg 10B8B LinkSeq DevExch }
[ 708.199643] ata3.00: failed command: READ FPDMA QUEUED
[ 708.200309] ata3.00: cmd 60/00:50:00:08:00/01:00:00:00:00/40 tag 10 ncq dma 131072 in
res 40/00:50:00:08:00/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
[ 708.201706] ata3.00: status: { DRDY }
[ 708.202400] ata3: hard resetting link
[ 714.145372] ata3: link is slow to respond, please be patient (ready=0)
And after a while of limping along at reduced speed, it just s***s itself
dmesg
[ 1093.944417] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 1093.945033] ata3.00: irq_stat 0x40000001
[ 1093.945577] ata3.00: failed command: READ DMA EXT
[ 1093.946125] ata3.00: cmd 25/00:00:00:48:e0/00:01:e8:00:00/e0 tag 7 dma 131072 in
res 53/04:00:00:48:e0/00:01:e8:00:00/e0 Emask 0x1 (device error)
[ 1093.947226] ata3.00: status: { DRDY SENSE ERR }
[ 1093.947780] ata3.00: error: { ABRT }
[ 1093.953194] ata3.00: configured for UDMA/33
[ 1093.953268] sd 2:0:0:0: [sdl] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 1093.953300] sd 2:0:0:0: [sdl] tag#7 Sense Key : Illegal Request [current]
[ 1093.953303] sd 2:0:0:0: [sdl] tag#7 Add. Sense: Unaligned write command
[ 1093.953321] sd 2:0:0:0: [sdl] tag#7 CDB: Read(10) 28 00 e8 e0 48 00 00 01 00 00
[ 1093.953359] blk_update_request: I/O error, dev sdl, sector 3907012608 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 0
[ 1093.958897] ata3: EH complete
I fixed this a while ago by just reseating the cables; that held for about half a year. Probably due to vibrations and the case being moved around (plus a small earthquake a while ago), something shifted just enough to mess up the connection. 11 HDDs do make the case vibrate, even with all the rubbery pads that keep them in place. Anyway, the last HDD is a hot spare and I had it there exactly for these kinds of situations (all hail ZFS - and yes, I know ZFS is not the only solution that supports hot spares). While my zpool is "degraded," that's only because the original HDD is not being detected correctly.
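For anyone who hasn't set one up: a ZFS hot spare is just attached to the pool and sits idle until a device faults, at which point the ZFS event daemon (zed) swaps it in automatically. A minimal sketch - the pool name `tank` and the disk ID are hypothetical:

```shell
# Attach a disk to the pool as a hot spare
# (pool name "tank" and the by-id path are placeholders):
zpool add tank spare /dev/disk/by-id/ata-EXAMPLE-DISK

# On Linux, automatic activation needs the ZFS event daemon running:
systemctl status zfs-zed

# The spare appears under a "spares" section; once activated it
# shows as INUSE next to the faulted device:
zpool status tank
```

The same disk can be shared as a spare between multiple pools, which is part of why it beats a cold spare sitting in a drawer when you can't physically reach the machine.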
Nothing I can do about it. And funnily enough, it's not the add-on SATA cards (3 cards with 2 ports each) that act up, but the built-in port on the motherboard. This is the reason I decommissioned this server from production at work: it used to run a bare md array with 4 disks (RAID 10) which kept corrupting data in some Oracle DBs. At first I didn't look too much into it; the server was extremely old and the HDDs had long runtimes, so it was about time they were replaced anyway. For home use, it was fine for a while. And while I could solve the issue by adding a sane controller with mini-SAS to avoid the cabling issues, I think this is just a rare occasion where the SATA connection plainly sucks. Someone on the forum told me a year ago that these kinds of issues happen pretty often with SATA when there are many disks in one system. This is the only time I've encountered a port that half works; the only other issue I've hit was a dead SATA port on an old motherboard. Yeah, mini-SAS is better and I would have preferred it, but eh, this worked for what I needed.
smartctl output
Read SMART Data failed: scsi error badly formed scsi parameters
=== START OF READ SMART DATA SECTION ===
SMART Status command failed: scsi error badly formed scsi parameters
SMART overall-health self-assessment test result: UNKNOWN!
SMART Status, Attributes and Thresholds cannot be read.
Read SMART Log Directory failed: scsi error badly formed scsi parameters
Read SMART Error Log failed: scsi error badly formed scsi parameters
Read SMART Self-test Log failed: scsi error badly formed scsi parameters
Selective Self-tests/Logging not supported
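The "badly formed scsi parameters" errors usually mean smartctl's auto-detected device type is wrong for this half-working link (or the link is too flaky to pass the ATA commands through). Before blaming the drive, it's worth forcing the SAT pass-through; a sketch, using the device name from the log above:

```shell
# Force SCSI-to-ATA Translation pass-through instead of auto-detection:
smartctl -a -d sat /dev/sdl

# If the drive answers but some mandatory checks still fail,
# relax smartctl's tolerance checks:
smartctl -a -d sat -T permissive /dev/sdl
```

If it still errors out with `-d sat`, that points at the link/port rather than the drive's SMART support - which matches what the dmesg output is saying here.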
I was messing with my Proxmox cluster and wanted to enable replication (and discovered I actually wasn't using ZFS as intended in Proxmox, but just a directory storage, so I couldn't enable replication - I usually just run VMs on separate NASes) when I saw I had a degraded array. Again, all data is fine; the hot spare is doing its job. I never thought I'd be glad to have a hot spare inside my system - I'm usually against them - but this was the only solution I could come up with, because I won't be able to touch the system for a while and if something goes wrong, I can't physically replace disks.
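For anyone hitting the same wall: Proxmox replication only works on storages of type `zfspool`, not on a directory storage that merely happens to live on a ZFS dataset. A sketch of registering the dataset properly - the storage ID `vmdata` and dataset `tank/vmdata` are hypothetical:

```shell
# Register a ZFS dataset as a proper zfspool-type storage
# (storage ID and dataset name are placeholders):
pvesm add zfspool vmdata --pool tank/vmdata --content images,rootdir

# Existing VM disks then have to be moved onto the new storage,
# via "Move disk" in the GUI or:
qm move-disk <vmid> scsi0 vmdata
```

Replication can then be configured per-guest, since the disks are now zvols that Proxmox can snapshot and send.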
For the problem I'm currently facing, I tried swapping in a new cable; it didn't help. As I feared, it's an issue with the port on the motherboard. So I have 2 more disks that could fail. And just when I wanted to replicate the data to another storage pool. Just my luck…
Update: I messed around and managed to eject the spare instead of the badly detected disk, so now it's resilvering (15% in 13 min, about 50 min left on a 2 TB disk in RAIDZ2). Can I get dumber than this? Well, that wouldn't be out of the question. Anyway, I had the VMs powered off during my debugging, so there shouldn't be any additional read/write activity during the resilver. I can't wait to move and replace my infrastructure with something saner and more portable. I might just go RAID 1/10 with SSDs if I have the money. I've recently heard of Asustor; they seem to be pretty bog-standard hardware (Intel Apollo Lake Celerons) - has anyone run Linux or FreeBSD on them? Or can you not spin your own OS on them, like you can on Synology? I might end up just building my own mini-ITX NAS if not.
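A side note on the resilver ETA above: a naive linear extrapolation from "15% in 13 min" actually gives closer to 74 minutes remaining, not 50 - `zpool status` estimates from the recent scan rate, which tends to climb as the resilver settles into sequential reads, so its number is usually the more optimistic (and often more accurate) one. The back-of-envelope version, runnable anywhere:

```shell
# Linear resilver ETA: remaining = elapsed * (1 - done) / done
done=0.15      # fraction resilvered so far
elapsed=13     # minutes elapsed
awk -v d="$done" -v e="$elapsed" \
    'BEGIN { printf "%.1f min remaining\n", e * (1 - d) / d }'
```

Either way, with the VMs powered off the pool is idle, so the resilver gets the full disk bandwidth.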