Intro
So, I’m a masochistic bastard and some years ago decided to set up a Fibre Channel target using a bunch of old 4Gbps QLogic Fibre Channel cards. It has gotten to the point where all my stuff boots off this one machine, from a RAID 0 array of 2x Seagate 4TB 2.5" drives. This is cached on NVMe drives too to make it spoody, and for the most part, it is. Before I get people harping on about how RAID 0 is madness: I back stuff up, ok? Nothing on it is irreplaceable. It would just take forever to download again.
How stuff’s set up
Everything is exported to the clients using fileio, so I’ve got a bunch of raw disk images just sat on this storage that I boot from. The remainder of the space I access via Samba, and this is where I keep the Steam library and a ton of other stuff. Works out kinda nicely, though it has had some weird issues since I upgraded the machine.
The problem
Every time I write a massive file (over 20GB) via Samba, the target disconnects, then doesn’t reconnect until I power cycle the server everything’s booting from, or switch the port and reconfigure the client. Irritating, as one may imagine, since it bluescreens the affected machine. It doesn’t do it when I copy a 1TB file from the target to another one though. Just when writing to it.
Some diags already done
Initially I thought this was due to write-back caching or something, so I set it to write-through instead, with no success. I also thought this might be a disk utilization problem, though as far as I can see, the disks barely ever hit 100% since most of the load is reads, and the cache in my case is pretty huge (128GB of RAM, plus 2TB of NVMe made up of 500GB drives; long story on how I got that). LIO doesn’t really spit out any errors into the log from what I can see either.
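For anyone who wants to sanity-check the "disks barely ever hit 100%" claim on their own box, here is a rough Python sketch that works out per-device busy time from two snapshots of /proc/diskstats, roughly what iostat reports as %util. The function names are made up for illustration, and it assumes the standard Linux diskstats layout where the 10th stat column is milliseconds spent doing I/O. It is not what I actually ran, just one way to get the number.

```python
import time


def parse_io_ticks(diskstats_text):
    """Map device name -> milliseconds spent doing I/O (10th stat column)."""
    ticks = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 13:
            continue  # skip blank or truncated lines
        # fields: major, minor, name, then the stat columns
        ticks[fields[2]] = int(fields[12])
    return ticks


def utilization(before, after, interval_ms):
    """Percent of the interval each device was busy, capped at 100."""
    return {
        dev: min(100.0, 100.0 * (after[dev] - before[dev]) / interval_ms)
        for dev in after
        if dev in before
    }


def sample(interval_s=1.0, path="/proc/diskstats"):
    """Take two snapshots interval_s apart and return per-device utilization."""
    with open(path) as f:
        before = parse_io_ticks(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_io_ticks(f.read())
    return utilization(before, after, interval_s * 1000.0)
```

Calling sample() in a loop while the big Samba write is running would show whether the md members or the NVMe cache devices actually saturate right before the port drops.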
The weird thing is, it’s just that one port on the QLogic card that dies. The moment I move the machine to another port, everything springs back into life after a bit of reconfiguration in the BIOS on the client machine. What’s more bizarre is how it knows which machine is doing the writing. I can have anything else running without an issue. I might be copying stuff from a USB drive to the Samba share, not even touching the SAN, and it still only affects the one machine that’s doing the copying.
I know what you’re thinking. “Not possible”, right? Exactly what I thought, but here we are!
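Since it is only ever the one port that dies, watching the port state on the target side seems like a sensible next thing to log. Below is a minimal sketch, assuming the QLogic ports show up under the usual Linux FC transport class in /sys/class/fc_host with port_name and port_state attributes. The function name is hypothetical, and I have not confirmed what the state reads on this particular box when the port falls over.

```python
import glob
import os


def read_fc_ports(base="/sys/class/fc_host"):
    """Return {host: {"port_name": ..., "port_state": ...}} per FC host."""
    ports = {}
    for host_dir in sorted(glob.glob(os.path.join(base, "host*"))):
        info = {}
        for attr in ("port_name", "port_state"):
            try:
                with open(os.path.join(host_dir, attr)) as f:
                    info[attr] = f.read().strip()
            except OSError:
                info[attr] = None  # attribute missing or unreadable
        ports[os.path.basename(host_dir)] = info
    return ports
```

Printing read_fc_ports() once a second with a timestamp during a large write would at least show whether the target thinks the link went down, or whether it still reports the port as up while the client has given up.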
To do
I have some more diags to come. Trying a Linux client and seeing if the same thing happens is on my to-do list. That’d rule out a Windows driver issue which, at this point and given all the weirdness, is the only thing I think it could be. I’m tempted to see if it’s heat related too, as this server is kind of a fire breather (it has 4 CPUs, with a silly TDP too). That doesn’t explain the rather specific disconnection, but I’ve ordered myself some K-type probes to have a look anyway.
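While waiting on the probes, the on-board sensors can be polled as a first pass. This is a rough sketch, assuming the kernel exposes sensors through the standard hwmon sysfs interface (temp*_input files in millidegrees Celsius). Whether the HBA itself reports a temperature there is unknown, so this would likely only cover CPUs and whatever else has a driver. The function name is made up.

```python
import glob
import os


def read_temps(base="/sys/class/hwmon"):
    """Return {"hwmonN:chip/tempM": degrees_celsius} for readable sensors."""
    readings = {}
    pattern = os.path.join(base, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        chip_dir = os.path.dirname(path)
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
        except OSError:
            chip = "unknown"  # no name file for this chip
        try:
            with open(path) as f:
                millideg = int(f.read().strip())
        except (OSError, ValueError):
            continue  # sensor unreadable, skip it
        sensor = os.path.basename(path)[: -len("_input")]
        # include the hwmon directory so identical chips do not collide
        key = f"{os.path.basename(chip_dir)}:{chip}/{sensor}"
        readings[key] = millideg / 1000.0
    return readings
```

Logging this alongside the port state during a big write would show whether the disconnect lines up with a temperature spike at all.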
Target specs
-- General stuff --
Chassis - IBM X3850 X5
CPU - 4x Intel(R) Xeon(R) CPU E7- 4830 @ 2.13GHz (32 cores, 64 threads)
RAM - 128GB of 1333MHz ECC RAM. IBM branded.
OS - Fedora 29
Graphics - An actual potato (Matrox something or other. Even with drivers, text mode environments perform poorly let alone video & graphics)
-- Storage --
- Pair of 72GB 15K drives in md RAID 1 (boot storage)
- Pair of ST4000LM024-2AN17V (4TB drives) in md RAID 0 (LVM Storage)
- 4x Samsung 970 EVO 500GB drives in md RAID 0 (LVM Cache)
-- Expansion cards --
- Fibre Channel: QLogic Corp. ISP2432-based 4Gb Fibre Channel to PCI Express HBA (x4) - What I'm using as the target
- InfiniBand: Mellanox Technologies MT25408A0-FCC-QI ConnectX, Dual Port 40Gb/s InfiniBand / 10GigE Adapter - (Intend to set up SRP for this, kinda got it working, but loads more needs doing.)
Client specs
-- General stuff --
Mobo - ASRock Fatal1ty X79 Champion (aptly named)
CPU - Intel i7-3820
RAM - 16GB Corsair Vengeance @ 1600MHz
OS - Windoge 10 Pro 64 bit (Version 1903)
Graphics - 2x AMD Radeon R9 290 (not in CrossFire due to another issue I've been having)
-- Storage --
- The target above
- A metric assload of 1TB drives that probably have 30 seconds before failure - don't actually use them other than for shits and giggles though.
Any further suggestions are welcome.