I’ve got a very strange issue and would like to tap the hive mind for ideas.
I’ve got a 24-Bay 4U chassis I picked up from Alibaba and the version I purchased has a 12Gb/s SAS/SATA backplane with an LSI 3X36 expander.
I’ve got the backplane connected to a Supermicro H12SSL-CT Motherboard (with onboard Broadcom 3008 SAS3 controller flashed to IT mode).
The whole shebang is powered by a 1000W Supermicro PWS-1K25P-PQ PSU.
Installed in the backplane are 18 drives:
- 2x Intel S3510 120GB (SSDSC2BB120G6) SSD
- 4x SanDisk CloudSpeed Eco Gen II 2TB (SDLF1CRR-019T-1HA1) SSD
- 12x Seagate Exos X18 18TB (ST18000NM000J) HDD
Now for the issue…
When I hot-plug an additional spinning-rust HDD into an open slot on the backplane, the 4x SanDisk SSDs disconnect and then reconnect, causing data corruption. I’ve tried multiple different HDDs (as the new hot-plug device), different combinations of drive placement on the backplane, and two different HBAs, and the behavior is the same. The other 14 drives on the backplane never seem to have any issues during the hot-plug procedure.
If I hot-plug an additional SSD into an open slot on the backplane (as opposed to a spinning disk), everything seems to be fine and the SanDisk drives don’t disconnect/reconnect.
Finally, if I move two of the SanDisk SSDs directly to the motherboard (leaving two on the backplane) and perform the hot-plug procedure with a spinning HDD, only the two SanDisk drives connected to the backplane disconnect/reconnect, while the two connected directly to the motherboard stay connected.
When the SanDisk SSDs disconnect, I see the following entries in the syslog for each SanDisk SSD:
sd 8:0:12:0: device_block, handle(0x0016)
sd 8:0:12:0: device_unblock and setting to running, handle(0x0016)
sd 8:0:12:0: [sdm] Synchronizing SCSI cache
sd 8:0:12:0: [sdm] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
mpt3sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x500605b00001788e)
mpt3sas_cm0: removing handle(0x0016), sas_addr(0x500605b00001788e)
mpt3sas_cm0: enclosure logical id(0x500605b0000178bf), slot(14)
mpt3sas_cm0: enclosure level(0x0000), connector name( )
mpt3sas_cm0: handle(0x16) sas_address(0x500605b00001788e) port_type(0x1)
scsi 8:0:18:0: Direct-Access ATA SDLF1CRR-019T-1H RPA1 PQ: 0 ANSI: 6
scsi 8:0:18:0: SATA: handle(0x0016), sas_addr(0x500605b00001788e), phy(14), device_name(0x0000000000000000)
scsi 8:0:18:0: enclosure logical id (0x500605b0000178bf), slot(14)
scsi 8:0:18:0: enclosure level(0x0000), connector name( )
scsi 8:0:18:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
scsi 8:0:18:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
sd 8:0:18:0: Attached scsi generic sg13 type 0
sd 8:0:18:0: Power-on or device reset occurred
end_device-8:0:18: add: handle(0x0016), sas_addr(0x500605b00001788e)
sd 8:0:18:0: [sdn] 3750748848 512-byte logical blocks: (1.92 TB/1.75 TiB)
sd 8:0:18:0: [sdn] 4096-byte physical blocks
sd 8:0:18:0: [sdn] Write Protect is off
sd 8:0:18:0: [sdn] Mode Sense: 9b 00 10 08
sd 8:0:18:0: [sdn] Write cache: enabled, read cache: enabled, supports DPO and FUA
sdn: sdn1 sdn2
sd 8:0:18:0: [sdn] Attached SCSI disk
sdn: Process '/usr/bin/unshare -m /usr/bin/snap auto-import --mount=/dev/sdn' failed with exit code 1.
sdn1: Process '/usr/bin/unshare -m /usr/bin/snap auto-import --mount=/dev/sdn1' failed with exit code 1.
Is it possible there is an incompatibility between the backplane and the SanDisk drives? Perhaps they’re more sensitive to some power fluctuation or a command delay that takes place during the spinning HDD hot-plug? Any ideas on how I could narrow this down further or log more useful data?
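In case it helps with the "log more useful data" part, here is a minimal Python sketch of how I could pull the removal events out of the kernel log and line them up by handle, SAS address, and slot across repeated hot-plug tests. It only assumes the two mpt3sas line formats shown in the snippet above ("removing handle(...)" followed by "enclosure logical id(...), slot(...)"). The function name removal_events and the SAMPLE list are made up for illustration, and the patterns may need adjusting for other driver versions.

```python
import re

# Patterns based only on the two mpt3sas line formats seen in the log above.
REMOVE_RE = re.compile(
    r"mpt3sas_cm\d+: removing handle\((0x[0-9a-fA-F]+)\), "
    r"sas_addr\((0x[0-9a-fA-F]+)\)"
)
SLOT_RE = re.compile(
    r"mpt3sas_cm\d+: enclosure logical id\s*\((0x[0-9a-fA-F]+)\), slot\((\d+)\)"
)


def removal_events(lines):
    """Return a list of (handle, sas_addr, slot) tuples, one per removal.

    The slot comes from the first 'enclosure logical id' line that follows
    a 'removing handle' line; it is None if no such line follows.
    """
    events = []
    pending = None  # (handle, sas_addr) still waiting for its slot line
    for line in lines:
        match = REMOVE_RE.search(line)
        if match:
            if pending is not None:
                events.append(pending + (None,))
            pending = (match.group(1), match.group(2))
            continue
        match = SLOT_RE.search(line)
        if match and pending is not None:
            events.append(pending + (int(match.group(2)),))
            pending = None
    if pending is not None:
        events.append(pending + (None,))
    return events


# Hypothetical sample: three lines copied from the snippet above.
SAMPLE = [
    "mpt3sas_cm0: removing handle(0x0016), sas_addr(0x500605b00001788e)",
    "mpt3sas_cm0: enclosure logical id(0x500605b0000178bf), slot(14)",
    "scsi 8:0:18:0: enclosure logical id (0x500605b0000178bf), slot(14)",
]

for handle, sas_addr, slot in removal_events(SAMPLE):
    print(f"removed handle={handle} sas_addr={sas_addr} slot={slot}")
```

Feeding it the full kernel log from each test run (instead of SAMPLE) would show whether the same slots and SAS addresses drop every time, which might help separate a per-slot backplane problem from a per-drive one.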