SATA SSDs not working on HBA anymore

I have been messing around with Proxmox all year on this old server board, and everything had been working fine up until recently. The server had been down for about a month and a half because I had not been using it much, and it was making too much heat for the HVAC to keep up with during the summer. I went to power it back up a week ago and Proxmox booted fine, but my VMs would not start. It turns out my VM pool was down because most of the 8 drives in the pool were not showing up.

I tried restarting and got more drives to show, but they were different from the ones that showed up before. I also tried unplugging all the drives from the HBA and reconnecting them; no change. Then I realized that I was also having issues with a few of the drives connected to the SAS controller on the motherboard. I tried swapping the HBA with another known working one out of my storage server and had the same issue.

I know the HBA and drives work because I installed the HBA in my desktop with all the drives, and everything worked fine. I also have no issues when the drives are connected to the SATA ports on the motherboard. The only symptom I see is that the HBA takes a while to initialize on boot while the card is installed in the server board, and it will not detect all the drives. I even tried swapping the motherboard for another identical board, and the issue is now happening on that board too. Everything was working fine when I shut it down a month and a half ago. Has anyone ever had a similar issue?

Specs
Motherboard: Supermicro X9DR3-F
CPUs: 2x Xeon E5-2660 v2
RAM: 128GB
Boot Drives: 2x Kingston 120GB SSDs (similar issue when connected to a SAS controller)
HBA: LSI Logic SAS 9207-8i (think I ended up with a clone card though)
SSDs: 8x Patriot Burst Elite 1.92TB SATA SSDs
OS: Proxmox 8 (tried Ubuntu for troubleshooting; same issue as Proxmox)
SAS Cables used: https://www.amazon.com/gp/product/B012BPLYJC
Power Supply: Corsair RM1000x

Could this perhaps be a power delivery issue? Are you able to test with a spare PSU?

I’ve had many “weird” issues caused by a bad PSU.

I tried swapping out the power supply; no change.

My first thought would be software or configuration. I have noticed that a lot of cards that used to have their firmware blobs prepackaged recently had those blobs removed and packaged separately, at least in Arch Linux; my Broadcom 57810S adapter went down and needed a firmware file from the AUR to work again.
I haven’t noticed this with my LSI SAS 2008 cards, but it might be something to look into.

The insanity gnawing at the back of my mind is rambling something about modern SATA revisions and potential compatibility problems with the older SATA compliance of SAS HBAs. AFAIK, SATA 3 had a lot of revisions to add SSD-oriented functionality and compatibility. It could be that your drives are expecting certain commands to be run, and others not to be, and negotiations have broken down over a lack of OS-controlled TRIM commands or some such.
A bit of wiki reading says that SATA drives can behave as native SAS disks on a SAS controller, so it may be that the drives themselves have a bug preventing them from correctly assuming SAS compatibility and compliance? I had thought it was the other way around…
Other resources mention that SATA is sent over SAS via tunneling, not through native compatibility… which is more in line with what conventional wisdom was on the subject, AFAICR.

Thank you for the response, and thank you to @FunnyPossum for the earlier response. I have continued testing, and I think at least 4 of the drives are dead. Getting them all connected to my desktop must have been the last death cry for some of them. I have been going through and testing each SSD on the HBA and the onboard SATA ports while watching the logs. At this point any drive throwing errors is getting RMAed. I don’t know why I wasn’t getting errors before, but now 4 of them just time out when I plug them in. I don’t recommend anyone buy Patriot Burst Elite SSDs at this point, at least not the 1.92TB ones. This is a 50% failure rate, and I am not done testing them all yet. The 120GB Burst Elite drives I am using for boot drives seem to be holding up for now, but I am going to go through and test all of them for errors after this.
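
For anyone repeating this kind of per-drive log watching, here is a minimal Python sketch of tallying error lines per device. It is only an illustration: the keyword list, the device names, and the sample lines are placeholders I made up, not real kernel output, so adjust them to whatever your own logs actually show.

```python
import re
from collections import Counter

# Assumed keywords; tune these to the messages your own logs contain.
ERROR_KEYWORDS = ("timeout", "timed out", "i/o error", "reset")


def count_errors(lines, devices, keywords=ERROR_KEYWORDS):
    """Count log lines that name a device and contain an error keyword."""
    counts = Counter({dev: 0 for dev in devices})
    # Whole-word match so that "sdb" does not also match "sdbb".
    patterns = {dev: re.compile(r"\b" + re.escape(dev) + r"\b") for dev in devices}
    for line in lines:
        lowered = line.lower()
        if not any(key in lowered for key in keywords):
            continue
        for dev, pattern in patterns.items():
            if pattern.search(line):
                counts[dev] += 1
    return counts


if __name__ == "__main__":
    # Made-up placeholder lines, not real kernel output.
    sample = [
        "drive sdb: command timed out",
        "drive sdb: link reset",
        "drive sdc: ready",
    ]
    for dev, n in count_errors(sample, ["sdb", "sdc"]).items():
        print(f"{dev}: {n} error line(s)")
```

You could feed it the lines of a saved log file instead of the sample list. A drive whose count keeps climbing between reboots would be a candidate for RMA.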

To add to this, I’d suggest anyone looking at consumer SSDs look at the controller and avoid any DRAM-less SATA drive. They are all lemons, I think, because SATA has nothing like a host memory buffer to fall back on.
I believe the drive in question is QLC with an SMI DRAM-less controller at 1.92TB?

Just an update for anyone else who sees this. I am going to declare all of the drives dead or close to it at this point. I can actively see them getting worse every time I reboot to test.
