Fibre Channel target dies after large file transfer over Samba

FYI, you have write back enabled.

Yep, I'm aware. I changed it back because it made no difference.
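
Assuming we're talking about the LVM cache mode here, the toggle is roughly this (VG/LV names are placeholders, not my actual config):

    # see the cache pool attached to the big LV
    lvs -a vg0
    # flip the cache between the two modes (this is the change I tried both ways)
    lvconvert --cachemode writethrough vg0/bigpool
    lvconvert --cachemode writeback vg0/bigpool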

Why do you have three initiators per target port?

Pretty sure it’s just config copied over to all target ports so that OP could connect and disconnect clients (which, I assume, are directly attached) without worrying about what connects where.

bingo

I could make it less cancer if you'd like, but the reason for it is that I don't use my Brocade switch, because I think it's sacrilege to license individual ports. That, coupled with not needing to care where I plug stuff in. Also, if my Win10 PC just dies, I can swap it out and reboot the machine without having to reconfigure it in targetcli or wait (no joke) 4 hours for this target machine to shut down and restart.
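
It's literally the same handful of initiator ACLs repeated under every target port, something like this (WWPNs below are made up, not my real ones):

    # same three initiators allowed on every qla2xxx target port,
    # so it doesn't matter which physical port a machine gets plugged into
    targetcli /qla2xxx/naa.2100001b329d0001/acls create naa.2100001b32aa1111
    targetcli /qla2xxx/naa.2100001b329d0001/acls create naa.2100001b32bb2222
    targetcli /qla2xxx/naa.2100001b329d0001/acls create naa.2100001b32cc3333
    # ...and the same three again under naa.2100001b329d0002, and so on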

No, just asking questions, getting to know the environment.

Is the Win10 machine that is having the issue(s) the machine using lun0?

Pretty much. I don't have to reboot anything under Linux, but everything hangs until the transfer is complete and flushed to disk.

Win10, after the bluescreen happens, doesn't see any of the LUNs once it comes back up. Coincidentally, the target will only take 4 hours to reboot if the Win10 machine dies during a transfer.

Both the GuestPC LUN and the DesktopPC LUN are Windows 10.

Have you tried not using lun0 for Win10?

Is the initiator using any MPIO software?

Never figured out MPIO; I'll probably look into it at some point, but it's all done via a single route. No failover.

Also, I fail to see how using a different LUN would help, but I'll juggle them around and see what happens. If you've experienced weirdness around that before, then I'm all ears too.
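
If I do shuffle them, it'd just be re-creating the LUN under the target port with an explicit number so the Win10 backstore isn't sitting on lun 0; roughly this (names and numbers are examples only):

    # drop the existing lun 0 mapping and re-add the same backstore as lun 1
    targetcli /qla2xxx/naa.2100001b329d0001/luns delete 0
    targetcli /qla2xxx/naa.2100001b329d0001/luns create /backstores/block/desktoppc 1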

Regarding MPIO: AFAIK, Win10 simply doesn't have it. I recall reading something about experiments involving taking WS16 libraries and Frankensteining them into Win10 - and, apparently, it worked - but I never tried (or had to try) doing it myself.
Regarding lun 0: ehhh, maybe? I'm not familiar with targetcli, but on many storage arrays lun 0 is either reserved or doubles as a sort of “SCSI control path”/controller lun (on some LSI-based arrays it's lun 30 for some reason, iirc). Kinda interested in the explanation for this suggestion as well.

To be honest, I'd expect the redundant paths to be resolved at the firmware level rather than by the operating system. Unless not all MPIO setups are alike. Though I suppose Ethernet doesn't do that either.

I pretty much don't have time this week to do anything with it, as work has kinda gotten in the way of things. Changing out the cables on Sunday, though, yielded no noticeable difference.

Nope. It wouldn't work with multiple cards, for starters. And even if it's a multiport card, the firmware treats every port as a separate HBA. It's probably intentional, since usually every port connects to a (physically!) separate SAN, and even if for some reason they're in the same SAN, zoning should ensure that there's only one initiator per zone, so that they don't interfere with each other during fabric login or, say, trigger SCN/RSCN on each other.

I kinda/sorta solved it.
I just ran 2 cables between each machine. No MPIO stuff, but Windows seems to know they're the same disk and doesn't mess with the redundant paths.
Boot settings are configured on each initiator port, and when things kinda just fall over, one of the target ports stops working with that particular initiator, but the other, unused one stays up.

This takes 4 hours to reset by itself though, so I can't bluescreen my PC within that window, otherwise it'll happen again.

Did some testing with disk speed. I reconfigured the NVMe stuff so it's no longer cache, but just a separate RAID 0 drive to use for booting other machines from. Kinda overkill, but this seems to have stopped the problem in its tracks.
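
Roughly what that reshuffle looked like on the LVM side; VG/LV names, sizes and device paths here are placeholders:

    # detach the NVMe cache from the big pool (flushes anything dirty back to the slow disks)
    lvconvert --uncache vg0/bigpool
    # then reuse the two NVMe drives as a plain striped LV for the boot disks
    lvcreate --type raid0 --stripes 2 -L 1.8T -n bootdisks vg0 /dev/nvme0n1 /dev/nvme1n1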

Seems to me the cache was getting full server-side due to poor disk performance, the target stopped responding because it was twiddling its thumbs while flushing to disk, and Windows bluescreened as a result. The initiator doesn't log out properly and ends up keeping its session open, presumably because Windows doesn't do this under BSOD conditions, and the target won't let it create another one until the old one times out (in 4 hours).

IIRC, without MPIO Windows should show both paths as separate entries in Disk Management, one in a normal state and the other as disabled/offline. It knows the disks are the same one because of the matching UUID. I've never actually tried to use this kind of configuration in production. Does it actually re-enable the second path all by itself the moment the first one goes down? If so, it's less stupid than I thought, providing active-standby redundancy even with no MPIO installed.
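
If you want to sanity-check that, both paths should report the same SCSI identifiers; on a Linux initiator it'd be something like this (device names are examples):

    # the WWN column should be identical for the two paths to the same backstore
    lsblk -o NAME,SIZE,WWN,VENDOR,MODEL
    # or dump the device identification VPD page for each path (sg3_utils)
    sg_inq --page=0x83 /dev/sdb
    sg_inq --page=0x83 /dev/sdc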

I thought you said you had 500GB for cache… How full is it, and is it in write-back or write-through mode?

I had 2TB of NVMe as LVM cache. However, it gets full, and stays full, after a couple of days of being up and running. Then things get promoted/demoted to/from it accordingly. So not everything being written is being cached; once it's full, it's mostly just frequently read stuff.
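
When it was still set up as cache, the occupancy was visible with something like this (LV name is a placeholder; the cache_* fields need a reasonably recent lvm2):

    # cache occupancy, dirty blocks and hit/miss counters for the cached LV
    lvs -a -o lv_name,lv_size,data_percent,cache_dirty_blocks,cache_read_hits,cache_read_misses vg0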

Linux by default caches writes in RAM, but that being said, I'm using default settings, which may limit the actual amount of caching done. The server's off for the night anyway; I'll see how to get those stats in the morning. Wouldn't surprise me if it's around the 30GB mark.
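
The knobs I mean are the kernel's dirty-page settings; the defaults are a percentage of RAM, so on a box with a lot of RAM that really can work out to tens of GB. Something like this shows the limits and what's actually sitting dirty right now:

    # percentage-of-RAM thresholds for buffering writes before flushing/throttling
    sysctl vm.dirty_background_ratio vm.dirty_ratio
    # how much dirty/writeback data is currently held in RAM
    grep -E '^(Dirty|Writeback):' /proc/meminfo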

I don't think it'll fail over in a live environment, but I'll pull a cable and see what happens.