Strange backplane behavior, need help diagnosing

iolaus · January 24, 2023, 10:57pm

I’ve got a very strange issue and would like to tap the hive mind for ideas.

I’ve got a 24-Bay 4U chassis I picked up from Alibaba and the version I purchased has a 12Gb/s SAS/SATA backplane with an LSI 3X36 expander.

I’ve got the backplane connected to a Supermicro H12SSL-CT Motherboard (with onboard Broadcom 3008 SAS3 controller flashed to IT mode).

The whole shebang is powered by a 1000W SuperMicro PWS-1K25P-PQ PSU.

Installed in the backplane are 18 drives:

2x Intel S3510 120GB (SSDSC2BB120G6) SSD
4x SanDisk CloudSpeed Eco Gen II 2TB (SDLF1CRR-019T-1HA1) SSD
12x Seagate Exos X18 18TB (ST18000NM000J) HDD

Now for the issue…

When I hot-plug an additional spinning rust HDD into an open slot on the backplane, the 4x SanDisk SSDs disconnect and then reconnect, causing data corruption. I’ve tried multiple different HDDs (as the new hot-plug device), different combinations of drive placement on the backplane, and two different HBAs, and the behavior is the same. The other 14 drives on the backplane don’t ever seem to have any issues at all during the hot-plug procedure.

If I hot-plug an additional SSD into an open slot on the backplane (as opposed to a spinning disk), everything seems to be fine and the SanDisk drives don’t disconnect/reconnect.

Finally, if I move two of the SanDisk SSDs directly to the motherboard (leaving two on the backplane) and perform the hot-plug procedure with a spinning HDD, only the two SanDisk drives connected to the backplane disconnect/reconnect, while the two connected directly to the motherboard stay connected.

When the SanDisk SSDs disconnect I see the following entries in the syslog for each SanDisk SSD:

Log Snippet

    sd 8:0:12:0: device_block, handle(0x0016)
    sd 8:0:12:0: device_unblock and setting to running, handle(0x0016)
    sd 8:0:12:0: [sdm] Synchronizing SCSI cache
    sd 8:0:12:0: [sdm] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
    mpt3sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x500605b00001788e)
    mpt3sas_cm0: removing handle(0x0016), sas_addr(0x500605b00001788e)
    mpt3sas_cm0: enclosure logical id(0x500605b0000178bf), slot(14)
    mpt3sas_cm0: enclosure level(0x0000), connector name(     )
    mpt3sas_cm0: handle(0x16) sas_address(0x500605b00001788e) port_type(0x1)
    scsi 8:0:18:0: Direct-Access     ATA      SDLF1CRR-019T-1H RPA1 PQ: 0 ANSI: 6
    scsi 8:0:18:0: SATA: handle(0x0016), sas_addr(0x500605b00001788e), phy(14), device_name(0x0000000000000000)
    scsi 8:0:18:0: enclosure logical id (0x500605b0000178bf), slot(14) 
    scsi 8:0:18:0: enclosure level(0x0000), connector name(     )
    scsi 8:0:18:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
    scsi 8:0:18:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
    sd 8:0:18:0: Attached scsi generic sg13 type 0
    sd 8:0:18:0: Power-on or device reset occurred
    end_device-8:0:18: add: handle(0x0016), sas_addr(0x500605b00001788e)
    sd 8:0:18:0: [sdn] 3750748848 512-byte logical blocks: (1.92 TB/1.75 TiB)
    sd 8:0:18:0: [sdn] 4096-byte physical blocks
    sd 8:0:18:0: [sdn] Write Protect is off
    sd 8:0:18:0: [sdn] Mode Sense: 9b 00 10 08
    sd 8:0:18:0: [sdn] Write cache: enabled, read cache: enabled, supports DPO and FUA
    sdn: sdn1 sdn2
    sd 8:0:18:0: [sdn] Attached SCSI disk
    sdn: Process '/usr/bin/unshare -m /usr/bin/snap auto-import --mount=/dev/sdn' failed with exit code 1.
    sdn1: Process '/usr/bin/unshare -m /usr/bin/snap auto-import --mount=/dev/sdn1' failed with exit code 1.

Is it possible there is an incompatibility between the backplane and the SanDisk drives? Perhaps they’re more sensitive to some power fluctuation or a command delay that takes place during the spinning HDD hot-plug? Any ideas on how I could narrow this down further or log more useful data?
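One way to narrow this down is to tally which SAS addresses the driver removes and re-adds across repeated hot-plug tests. Below is a minimal Python sketch, assuming kernel log lines shaped like the snippet above. The function name `count_flaps`, the regular expressions, and the `SAMPLE` list are illustrative only and are not part of mpt3sas or any existing tool.

```python
import re
from collections import Counter

# Patterns modeled on the log snippet above (an assumption: other driver
# versions may word these messages differently).
REMOVE_RE = re.compile(
    r"mpt3sas_cm\d+: removing handle\((0x[0-9a-fA-F]+)\), "
    r"sas_addr\((0x[0-9a-fA-F]+)\)"
)
ADD_RE = re.compile(
    r"end_device-[\d:]+: add: handle\((0x[0-9a-fA-F]+)\), "
    r"sas_addr\((0x[0-9a-fA-F]+)\)"
)

def count_flaps(lines):
    """Count remove and add events per SAS address in an iterable of log lines."""
    removed = Counter()
    added = Counter()
    for line in lines:
        match = REMOVE_RE.search(line)
        if match:
            removed[match.group(2)] += 1
            continue
        match = ADD_RE.search(line)
        if match:
            added[match.group(2)] += 1
    return removed, added

# Three lines copied from the snippet above, used as a quick self-check.
SAMPLE = [
    "mpt3sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x500605b00001788e)",
    "mpt3sas_cm0: removing handle(0x0016), sas_addr(0x500605b00001788e)",
    "end_device-8:0:18: add: handle(0x0016), sas_addr(0x500605b00001788e)",
]

removed, added = count_flaps(SAMPLE)
for addr in sorted(removed):
    print(f"{addr}: removed {removed[addr]}x, re-added {added[addr]}x")
```

To use it on real data, pass the function the lines of a saved kernel log (for example, the text output of `dmesg`) after each round of hot-plug tests and compare which addresses show up.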

aBav.Normie-Pleb · January 24, 2023, 11:52pm

My tummy says power delivery issue, either the PSU in general or components in the backplane itself:

3.5" HDDs need 12 V and can cause a sudden current spike >=2 A when their motor is starting up.

2.5" SATA SSDs only need 5 V so they use a different rail and much less current.

My first tests would be reseating all power plugs and then powering the backplane with a different PSU.

The backplane likely has 12 V-only PCIe or CPU EPS power connectors and not Molex or SATA power inputs, is this correct?

twin_savage · January 25, 2023, 12:00am

My money is also on the power delivery being inadequate, most likely the backplane itself rather than the computer power supply.

Here’s an interesting piece of information: both the Intel SSDs and Seagate HDDs are 12V + 5V devices, while the SanDisks are 5V-only devices.

iolaus · January 25, 2023, 2:46am

To answer your question, the backplane is powered by six Molex connectors. The PSU coincidentally provides six Molex connectors (two each on three different leads) and I’m using them all. The only other thing being powered off of those leads is one Optane drive (mounted internally and powered via a SATA connector).

According to IPMI the PSU is only pulling ~200W average and has peaked at about 220W, so I would be surprised if it was failing to keep up. Still, it might be worth experimenting with a separate PSU powering just the backplane; that would at least let me rule out the SuperMicro PSU as the issue. I suppose I’ll have to jump pins to get a 2nd PSU running while not connected to a mobo, but that shouldn’t be too difficult.
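For a rough sense of scale, the numbers quoted in this thread can be put side by side. This is a back-of-the-envelope sketch, not a measurement: it takes the 2 A spin-up figure mentioned above as a lower bound, assumes it lands entirely on the 12 V rail, and uses the 1000W rating and 220W IPMI peak as given. The function name `spinup_share` is made up for the example.

```python
PSU_RATED_W = 1000.0   # PSU rating as stated in the first post
PEAK_DRAW_W = 220.0    # peak draw reported by IPMI
SPINUP_A = 2.0         # lower-bound spin-up current quoted earlier in the thread
RAIL_V = 12.0          # nominal rail voltage for a 3.5" HDD motor

def spinup_share():
    """Return (spin-up watts, PSU headroom watts, spin-up as a fraction of headroom)."""
    spinup_w = SPINUP_A * RAIL_V
    headroom_w = PSU_RATED_W - PEAK_DRAW_W
    return spinup_w, headroom_w, spinup_w / headroom_w

spinup_w, headroom_w, share = spinup_share()
print(f"spin-up: {spinup_w:.0f} W, headroom: {headroom_w:.0f} W, share: {share:.1%}")
```

On these figures one extra drive spinning up is a small fraction of the PSU's spare capacity, which fits the suspicion that total wattage is not the limit. It says nothing about short transients on the backplane's own 5V rail, which IPMI averages would not show.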

iolaus · January 25, 2023, 2:49am

This is extremely interesting and might explain the different behavior between the SanDisk, Intel, and Seagate drives. Perhaps there is just enough dip on the backplane’s 5V rail to freak out the SanDisk drives?

Can you please explain where you got the information about the power specs of the different drives? Maybe if the problem does end up being the backplane’s 5V rail I could swap out the SanDisk SSDs for an alternative that uses 12V + 5V and would be more resilient.

iolaus · January 25, 2023, 2:50am

Also, thank you both for your thoughtful replies! These are far and away the most helpful / informed responses I’ve received anywhere thus far.

iolaus · January 25, 2023, 3:30am

As I’m thinking about it more, I do have some SATA to 8-pin PCIe power adapters lying around. I could potentially hook one of those up in an empty bay, throw a multimeter on the PCIe pins, and look for a voltage drop while hot-plugging a HDD into another empty bay. I wonder if I’d be able to see the drop (assuming there is one) or if it would be too quick. Any thoughts?

Zedicus · January 25, 2023, 3:48am

You MIGHT be able to catch it with a good meter. That is borderline oscilloscope territory though. Do you have any other random SSD you could toss in to see if they drop also? Just a random 2.5" SATA SSD would suffice for testing.
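To illustrate why a meter may miss it: if an instrument takes point samples at a fixed interval and the dip starts at a random moment, the chance that at least one sample lands inside the dip is roughly the dip duration divided by the sample interval, capped at 1. The sketch below uses purely hypothetical numbers (a 1 ms dip, a meter reading twice per second, a scope sampling every microsecond). Real multimeters also average over each reading, so this is a simplification.

```python
def catch_probability(dip_s: float, sample_interval_s: float) -> float:
    """Chance that a point-sampling instrument lands at least one sample in the dip.

    Assumes evenly spaced samples and a dip that begins at a uniformly
    random time relative to the sampling clock.
    """
    if dip_s < 0 or sample_interval_s <= 0:
        raise ValueError("dip must be >= 0 and sample interval must be > 0")
    return min(1.0, dip_s / sample_interval_s)

# Hypothetical figures, chosen only for illustration.
DIP_S = 0.001             # assumed 1 ms voltage dip
METER_INTERVAL_S = 0.5    # assumed meter that updates twice per second
SCOPE_INTERVAL_S = 1e-6   # assumed scope sampling once per microsecond

print(f"meter: {catch_probability(DIP_S, METER_INTERVAL_S):.3f}")
print(f"scope: {catch_probability(DIP_S, SCOPE_INTERVAL_S):.3f}")
```

Under those assumptions the meter would see the dip only on a tiny fraction of attempts, while the scope would capture it every time, which is the gap being described here.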

iolaus · January 25, 2023, 3:49am

Good thought, I don’t have one in front of me but I bet I can scrounge one up tomorrow to test with.

twin_savage · January 25, 2023, 6:23pm

Yeah, the 5V dip would be my guess as well. Perhaps the backplane just routes in the 12V from the power supply but has its own onboard voltage regulators for the 5V? If you post a picture of the backplane we might be able to tell.

just the standard datasheet sleuthing:

iolaus · January 26, 2023, 3:45am

So, I was able to get my hands on a handful of different SSDs for testing.

1x Samsung 860 2TB
1x Samsung 860 500GB
1x Samsung 840 250GB
1x Patriot Torqx2 32GB

I left all the original drives in the system and I hot-plugged each new SSD a half-dozen or so times and also hot-plugged a spinning rust disk a dozen or so times with different combinations (sometimes all) of the new SSDs connected.

And the surprise result…

Only the SanDisk SSDs ever disconnected during a hot-plug event!

All four SanDisk SSDs disconnect/reconnect every time I hot-plug a spinning disk but all the new SSDs (as well as the original Intel SSDs and Exos HDDs) stay connected just fine.

Additionally, I did see a couple instances where one of the SanDisk SSDs would disconnect/reconnect while hot-plugging the Patriot SSD. This leads me to believe perhaps it has something to do with the timing of reading the new disk as I’m guessing the ancient Patriot 32GB SSD might take longer to initialize than the EVOs. That is just a guess, however, and I’m definitely still open to other interpretations.

For now, my fix will be to replace the SanDisk SSDs with something else. If I’m feeling ambitious once I’ve successfully replaced the SanDisk drives I may run an experiment whereby I flash each one with a different firmware version and see if the hot-plug behavior is the same (I currently have them all flashed with the latest firmware).

I’d like to extend a very sincere “thank you” to everyone who responded. I greatly appreciate all the information and ideas you provided! If anyone has ideas on how to capture more information for posterity’s sake, I’d definitely still be interested.

MarcT · February 22, 2023, 8:57pm

Open a support ticket with SanDisk? I’d almost put money on it being a drive firmware issue.
