Server troubles

Alright, so first off: I'm a hardware guy (nanoarchitecture), so when asking for readouts from something, please include short instructions on exactly what you want me to do.
Quick hardware rundown: I'm running a Supermicro 6026T-URF
https://www.supermicro.com/products/system/2U/6026/SYS-6026T-URF.cfm

Dual-socketed Intel® Xeon® X5550 processors, with an LSI HBA card. A Mellanox 10GbE card is sitting in there as well. Got 8 HGST 3TB white-label drives and a couple of SSDs to boot. I have 2 Kingston 120GB and 2 Intel 240GB drives. As it sits, there is no RAID of any sort.

This server ran fine for 2 years on Arch with no real issues. Now, after my install broke twice within a month, I can't for the life of me get an install running, be it Debian, Ubuntu, Gentoo, Sabayon, Arch, Antergos, Manjaro, or openSUSE, so this is likely a hardware problem. I figured I'd try some of the others in case magic happens and one of them works.

The issues that arise from these installs are:
In many of them, even with the configs set, the package manager cannot resolve the mirror. This is especially true of the Arch installs with Pacman; Portage and Equo work fine, as does apt-get.
In all of the installs I have tried, MySQL/MariaDB never worked. It always failed from the start. Chroot in before an actual boot and it runs, but as soon as I boot it, it'll give errors about a missing socket and no write perms for the test files (a rough sketch of the checks I was running is below this list).
Inevitably, after a short while of running, give it a day or two, something breaks so that no commands will work at all. RIP /bin/bash
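For what it's worth, the MariaDB checks I was doing were roughly along these lines (this is on the systemd-based installs; the unit name and datadir path are just the usual defaults and may differ per distro):

systemctl status mariadb                        # see whether the service died right after starting
journalctl -u mariadb -b                        # look for the socket / permission errors in its log
ls -ld /var/lib/mysql                           # confirm datadir ownership (should be mysql:mysql)
sudo -u mysql touch /var/lib/mysql/writetest    # manually test whether the datadir is writable at all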

I've checked the SSDs I've used to boot; each one reads 0 problems, and I have those running through the built-in SATA on the motherboard. I don't know whether that SATA runs through the chipset or not; I haven't dug that far into this yet. But seeing as it does not appear to be the drives, and I haven't gotten any hardware errors from the server, no abnormal POSTs, no BIOS warnings, I am at a bit of a halt. I am not sure how to diagnose this one to get working on a fix. I can easily boot into any Linux ISO to run whatever might help find the issue.
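For reference, "reads 0 problems" means the SMART checks came back clean. Roughly what I ran from a live environment (assuming smartmontools is available; device names will obviously differ):

smartctl -H /dev/sda        # overall SMART health self-assessment
smartctl -A /dev/sda        # attribute table: reallocated/pending sectors, CRC error counts
smartctl -t short /dev/sda  # kick off a short self-test, then read it back with: smartctl -l selftest /dev/sda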
As it sits, /bin/bash is borked on what I had on there. So, any ideas where to start with this?

If stuff works for a while and then dies, I would start looking real hard at hardware. Do you have the luxury of booting from a LiveCD/LiveUSB for a few days to see if anything terrible happens to that install? My initial thoughts on the culprit here are memory or the LSI card.

Oo, memory, right. I have 96GB of something in there; don't remember what, but it's easily accessed. Yes, I can boot into a LiveUSB. I've chrooted into the installation and things worked fine; then I booted and everything went to shit.

Any preference as to which one I use?
Edit: And if we think memory, maybe I should start with a memtest here.

I would recommend Ubuntu LTS. But totally agree, if you have time, run a memtest.

Alright, I'll start with that and post the results. This is a personal server, so time is no real issue. Downtime sucks, but it is what it is.


9 hours of fun later, and a single pass gave 0 errors, so memory seems unlikely.
Now on to the next step. I can run a LiveUSB for a while to see if things break, but if memory is unlikely, I am unsure what that would accomplish. Beyond that, I am also not sure where to go in terms of figuring this out. I feel like a goof for not thinking of memory sooner, but at the same time it gave no errors.

If the LiveUSB is able to run without issues like /bin/bash suddenly going away, it throws your disk controller/disks into question.
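A cheap way to spot that early is to keep an eye on the kernel log while the LiveUSB session runs; something along these lines (journalctl assumes a systemd-based live image, plain dmesg works either way):

dmesg -w | grep -iE 'i/o error|ata[0-9]'    # follow kernel messages live and flag disk/ATA errors
journalctl -kf                              # same idea via the journal on a systemd live image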


Okie, I see what we are doing here. Essentially the sniff test: smells like a controller, must be a controller. The disks are all verified in a separate system, and all signs pointed to healthy drives. So, in theory, if I run the boot through the LSI controller, things should run smoothly. I had not done that yet, so I'll start there and see what happens. Assuming it works, then in the meantime, while letting it run off a USB, I'll get in contact to see if I can get myself a replacement motherboard. Too bad that isn't socketed.

Thread revival with the same recurring issue! WOOO!
Alright, so the motherboard has been replaced (I might wager this had not been needed. Also not a big deal. Spare board now. Whatever.) More importantly, I was able to catch it early, just as /bin/bash broke, since it was already in use. sudo bash and then cat /var/log/kern.log brought up more or less what I expected with the system being in read-only mode: a whole tonne of input/output errors referring to ata5, which is the SSD with the install, and then some errors after other services time out. I dropped that into a file off-system that I can reference. I'll drop a few lines here so you get the idea. This is the start; in general it went through the process of lowering the link speed and still erroring, and then stopped with a
blk_update_request: I/O error, dev sdk, sector 131452792
at the very end of the log. The system is still live in memory as of this post.

ata5.00: exception Emask 0x40 SAct 0x200 SErr 0x880800 action 0x6 frozen
ata5: SError: { HostInt 10B8B LinkSeq }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/08:48:70:34:ed/00:00:0e:00:00/40 tag 9 ncq 4096 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x44 (timeout)
ata5.00: status: { DRDY }
ata5: hard resetting link
ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata5.00: configured for UDMA/133
ata5.00: device reported invalid CHS sector 0
ata5: EH complete
ata5.00: Enabling discard_zeroes_data
ata5.00: exception Emask 0x40 SAct 0x40 SErr 0xa80801 action 0x6
ata5.00: irq_stat 0x40000008
ata5: SError: { RecovData HostInt 10B8B BadCRC LinkSeq }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/08:30:68:60:d5/00:00:07:00:00/40 tag 6 ncq 4096 out
       res 41/84:08:68:60:d5/00:00:07:00:00/00 Emask 0x450 (ATA bus error) <F>
ata5.00: status: { DRDY ERR }
ata5.00: error: { ICRC ABRT }
ata5: hard resetting link
ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
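
For the curious, with / flipped to read-only I couldn't save the capture locally, so it went to tmpfs and then off-box, roughly like this (hostname is just a placeholder):

cat /var/log/kern.log > /dev/shm/ata5-errors.log    # /dev/shm is tmpfs, so still writable with / read-only
scp /dev/shm/ata5-errors.log me@otherbox:           # park the log on another machine for reference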

Soo, wat do? As far as I can see, having tested the memory and knowing for certain the SSDs are good, the only explanation I can come up with is maybe a bum SATA cable. Cheap, so I already ordered some new ones to arrive on Thursday; figure they will get used either way. Any other suggestions?

Edit: As a quick note as well, it does not seem to break when run through either USB or the LSI controller, though I don't wish to run it through those. The only thing I can see that stays the same between setups is that SATA cable. (The HBA runs SAS to a backplane, for reference.)

Update: It appears a bum SATA cable may indeed have been the problem. For the first time, I was able to recover the system with minimal trouble: fsck found 2 errors preventing the boot, but after that it was golden. Only time will tell if this is the actual fix, but for now it is back online.
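For anyone who finds this later, the recovery itself was nothing fancy; roughly this from a live environment, against the unmounted root partition (the device name here is a placeholder, use whatever yours actually is):

fsck -fy /dev/sdX2    # -f forces a full check even if the FS looks clean, -y auto-answers yes to the repair prompts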