SATA disk disappears every few months, ideas?

nand · November 1, 2023, 5:28pm

I have a Supermicro X10SLL-F motherboard in a 1U chassis. After running mostly idle for a few months, I’ll find that the machine has lost a disk and is hung, unable to write any files.

This is for our community network non-profit. It’s supposed to be a datacenter hypervisor/backup edge router. We de-racked it and removed it from the DC because it’s suspect at this point. What could be going wrong?

Deets:

It’s done this more than once
It’s done this with a pair of 512 GB SSDs and with a pair of 1 TB SSDs
We changed the SATA cables
Disk enumeration is inconsistent, whether via the BIOS, the EFI shell, or Linux distro installers

I’m trying to figure out if it’s worth putting this back into service, or if it’s just e-waste at this point.

Output:
screen-scraped with phone:

Adjusting accordingly, but partition table may be garbage.
Warning: Partition table header claims that the size of partition table entries is 0 bytes, but this program supports only 128-byte entries. Adjusting accordingly, but partition table may be garbage.
The operation has completed successfully.
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you run partprobe(8) or kpartx(8)
The operation has completed successfully.
256+0 records in
256+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 1.09204 s, 246 MB/s
256+0 records in
256+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 0.109463 s, 2.5 GB/s
Creating new GPT entries in memory.
GPT data structures destroyed! You may now partition the disk using fdisk or other utilities.
entries is 0 bytes, but this program supports only 128-byte entries. Adjusting accordingly, but partition table may be garbage.
Warning: Partition table header claims that the size of partition table entries is 0 bytes, but this program supports only 128-byte entries. Adjusting accordingly, but partition table may be garbage.
Creating new GPT entries in memory.
The operation has completed successfully.
The operation has completed successfully.
256+0 records in
256+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 1.2407 s, 216 MB/s
256+0 records in
256+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 2.00209 s, 134 MB/s
/dev/disk/by-id/ata-Samsung_SSD_850_PRO_512GB_S250NIXAG912466N-parta°: Device or resource busy
cannot create 'rpool': one or more devices is currently unavailable
unable to create zfs root pool
mount: /rpool/ROOT/pve-1/run: no mount point specified.
mount: /rpool/ROOT/pve-1/mnt/hostrun: no mount point specified.
mount: /rpool/ROOT/pve-1/tmp/pkg: no mount point specified.
mount: /rpool/ROOT/pve-1/tmp: no mount point specified.
mount: /rpool/ROOT/pve-1/proc: no mount point specified.
mount: /rpool/ROOT/pve-1/sys/firmware/efi/efivars: no mount point specified.

regulareel · November 2, 2023, 12:50am

I think there was news recently about SSDs bricking after a set amount of time, and that it was a firmware issue.

Lt.Broccoli · November 2, 2023, 9:30am

Is it just one bad SATA port?

Kougar · November 2, 2023, 11:52am

This.

If it’s multiple ports then it’s probably the controller/board. If it’s just one port it could be the pins are worn to the point they are no longer making good enough contact, either physically or electrically. That board is a decade old at this point.
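One way to tell the two cases apart is to record which kernel ATA port each disk sits on, then see whether the disk that vanishes is always on the same one. Below is a minimal sketch, assuming a Linux host where `/sys/block/<disk>` resolves to a path containing an `ataN` component (typical for libata-driven SATA disks). The function names are made up for illustration, and the `ataN` number is the kernel's numbering, which may not match the port labels printed on the board.

```python
import os
import re

# Matches the "ataN" component of a resolved sysfs device path.
ATA_PORT_RE = re.compile(r"/(ata\d+)/")


def ata_port_from_sysfs_path(path):
    """Return the 'ataN' component of a sysfs path, or None if there is none."""
    match = ATA_PORT_RE.search(path)
    return match.group(1) if match else None


def map_disks_to_ports(sys_block="/sys/block"):
    """Map each block device name (e.g. 'sda') to its ATA port, if it has one."""
    ports = {}
    if not os.path.isdir(sys_block):
        return ports  # not a Linux host, or sysfs is not mounted
    for name in sorted(os.listdir(sys_block)):
        # /sys/block/<name> is a symlink into the device tree; resolve it.
        real = os.path.realpath(os.path.join(sys_block, name))
        port = ata_port_from_sysfs_path(real)
        if port:
            ports[name] = port
    return ports


if __name__ == "__main__":
    for disk, port in map_disks_to_ports().items():
        print(f"{disk}: {port}")
```

Running this after each boot and keeping the output somewhere off the machine gives a record to compare against the next time a disk drops out.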

anon7678104 · November 2, 2023, 11:25pm

Since you tried different distros and found it inconsistent, it might be a software timeout.

It might also be that you're leaving it running for so long that it's accumulating software errors, such as running out of cache and not being able to allocate/de-allocate memory properly.

To test it, you could automate a reboot: just make a cron job that reboots the OS and starts your software every day or week. A reboot will take about a minute.

Yeah, it will be a month before you know for sure, but you can set up logging for your applications too in the meantime. So at least if it goes tits up again, you have something to look at.
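For the logging part, here is a rough sketch of a watcher that notes when a disk drops out of `/dev/disk/by-id`. It assumes a Linux host with udev-style `ata-*` links there. The function names, the polling interval, and the log path are all placeholders, not anything from the original setup. Since the machine reportedly can't write files once the disk is gone, the log should live on a different device (a USB stick or a network mount, for example).

```python
import os
import time
from datetime import datetime


def list_disks(by_id_dir="/dev/disk/by-id"):
    """Return the set of whole-disk 'ata-*' links, skipping partition links."""
    if not os.path.isdir(by_id_dir):
        return set()
    return {
        name
        for name in os.listdir(by_id_dir)
        if name.startswith("ata-") and "-part" not in name
    }


def diff_disks(before, after):
    """Return (missing, added) as sorted lists of disk names."""
    return sorted(before - after), sorted(after - before)


def watch(log_path, interval_s=60):
    """Poll the disk list forever and append a line whenever it changes."""
    known = list_disks()
    while True:
        time.sleep(interval_s)
        current = list_disks()
        missing, added = diff_disks(known, current)
        if missing or added:
            stamp = datetime.now().isoformat(timespec="seconds")
            # log_path should point at storage that is NOT one of the watched disks.
            with open(log_path, "a") as log:
                log.write(f"{stamp} missing={missing} added={added}\n")
        known = current
```

Calling `watch("/mnt/usb/disk-watch.log")` (a hypothetical path) from a service or a screen session would leave a timestamp for when the disk vanished, which can then be lined up against the kernel log.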