Reporting back! But first, shoutout to lemma for chiming in on my predicament. I was getting lonely in here and appreciate your perspective - it helped!
So this is going to be the shorter and less technical post, and I’ll report back with a more detailed writeup later.
The current update is that the issue has been resolved (pausing to hear the cheering from the crowd)! Lemma’s post got me to drop my stubborn insistence on running my main displays on my 4090s, and that put me on the path towards resolving it for good.
Essentially, with lemma’s corroboration of the bus signal maybe being the culprit, I moved all my monitors to the GPUs in slots 1 and 2. With all of the other changes I’ve made to my system while trying to figure out this issue, I was able to transfer files at 5.5GB/s and had no issues doing so while other programs were open (the “other changes” mainly being: turning on PBO, changing an L3 cache setting, and disabling C-States, of which I’ll share more details later). I did notice that file transfer speed would not stay at 5.5GB/s if other compute or drive intensive tasks were running, but the speeds were still in the 2-3 GB/s range, so that is forgivable.
While everything was working great, this is not an ideal solution for me because I’d mainly like AI training to be the only thing running on the GPUs in slots 1-4 (RTX 6000 Ada GPUs), while I maybe do some video editing, 3D modeling, gaming, etc. on the main 4K monitor or my main ultrawide. So after reading lemma’s post I decided that I should go back in and mess with the redriver settings.
I started by reading this post to get a feel for what I might need to do in the BIOS with the redriver settings: Help with WRX80E-Sage SE Render server - Hardware / Build a PC - Level1Techs Forums. This post was super helpful to me and gave me enough information to jump in and start messing with things. After applying settings about 10 times, having a couple blue screens on startup, and having some temporary drive failures on my M.2s plugged into PCIe slot 7, I finally got things running stable, and I can still hit 5+GB/s.
So what I think was happening is just as lemma stated - the bus errors were causing the CPU to freak out and all threads would turn on. The more the CPU freaked out, the more the transfer speed would suffer. The CPU freakout while transferring files made my CPU usage jump to 30% for the duration of the file transfer. Kinda crazy to see 30% CPU usage on a 64-core Threadripper for just 70-300MB/s writes.
Anyway, long story long, everything is working now. I think the redriver settings are keeping the bus communication stable, and the other changes I made in the BIOS helped increase the speed of the transfers. After I confirm this by testing, I’ll post the technical results here in case somebody wants to copy my settings.
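In the meantime, for anyone who wants a rough number to compare against before and after changing BIOS settings, here is a minimal sketch of a sequential write test in Python. This is not the tool I used for the numbers above; the function name `measure_write_throughput` and its parameters are made up for illustration. It only measures write speed to one directory and does not track CPU usage.

```python
import os
import tempfile
import time


def measure_write_throughput(directory, total_bytes=1 << 30, chunk_bytes=16 << 20):
    """Write total_bytes to a temp file in `directory`; return GB/s (decimal GB).

    Hypothetical helper for illustration only. The fsync is inside the timed
    region so the OS write cache does not make the result look better than
    the drive really is.
    """
    chunk = os.urandom(chunk_bytes)  # random data, so compression can't cheat
    written = 0
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            while written < total_bytes:
                n = min(chunk_bytes, total_bytes - written)
                f.write(chunk[:n])
                written += n
            f.flush()
            os.fsync(f.fileno())  # force the data out to the device
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # clean up the test file
    return written / elapsed / 1e9
```

Calling something like `measure_write_throughput("D:/scratch")` (the path is just a placeholder) a few times with and without other programs open should show whether the speed holds up. A small test size mostly measures cache, so the default writes 1 GiB; bump it up if the drive has a big cache.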