WRX90e SSD speed - writes super slow

Dear Diary. Me again. Still having issues. I keep thinking I’m past the issue, but it keeps popping up again.

I still can’t pinpoint what the issue is, but the drive speeds keep dropping down to 70MB/s. It behaves almost like an on/off switch.

A good example right now: I turn on the machine and it allows file transfers around 2500MB/s, then I launch WSL2 and the transfer speeds go back down to 70MB/s. There are a whole bunch of apps that have an immediate effect on the speed; WSL2 is just one of many.

I started running Windows Performance Recorder and Windows Performance Analyzer to try to find some information, and what I found was very interesting. Looking at the active IO time for my file transfers and dividing the amount of data transferred by that active time, it turned out my 70MB/s transfers were actually moving data at 3800MB/s, but only in small, short bursts. I then looked at my regular file transfers of about 2500MB/s (before I opened any apps) and noticed there were far fewer of those weird spikes and much more constant communication. Lastly, I tried CrystalDiskMark, and it shows extremely high and extremely constant speeds, even when my computer is in a state where my Explorer file transfers are at 70MB/s.
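For anyone who wants to sanity-check that math, here’s a rough sketch of the duty-cycle calculation I was doing by hand in WPA. The numbers are just illustrative examples pulled from my traces; the script doesn’t measure anything itself, it’s only arithmetic:

```python
# Rough duty-cycle math for a "slow" Explorer transfer, using the numbers from my WPA trace.
# These figures are illustrative examples - the script doesn't measure anything itself.

file_size_mb = 10_000        # example: a 10GB file
wall_clock_rate_mb_s = 70    # what Explorer reports for the whole copy
burst_rate_mb_s = 3800       # throughput during the short active-IO spikes in WPA

wall_clock_time_s = file_size_mb / wall_clock_rate_mb_s  # ~143 s for the whole copy
active_io_time_s = file_size_mb / burst_rate_mb_s        # ~2.6 s actually spent on the device

duty_cycle = active_io_time_s / wall_clock_time_s
print(f"total copy time:   {wall_clock_time_s:6.1f} s")
print(f"active IO time:    {active_io_time_s:6.1f} s")
print(f"device duty cycle: {duty_cycle:.1%}")  # ~1.8%: the SSD sits idle ~98% of the copy
```

In other words, the drive itself is plenty fast; something is keeping it idle roughly 98% of the time during a “slow” transfer.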


You can see in this image first a CrystalDiskMark read at 12000MB/s, followed by a write at about the same rate. All the short spikes you see after that are a slow 70MB/s file transfer, which is accomplished through a series of very fast but short-duration bursts of IO on the device.

Without going into a ton of detail at the moment (it’s late), I’ve found that the problem seems to be related to something happening on the PCIe bus.

I noticed that whenever my drives are transferring files slowly with that spiky performance graph, the IPI events are going nuts. IPI stands for Inter-Processor Interrupt, I think. This first graph is an “unhealthy” IPI pattern as far as I can tell - this is what the IPI events look like during slow file transfers. Specifically, the section that looks like a solid line is where the file transfer is happening.

[Image: IPI events during a slow file transfer]

And this is what they look like when they’re “normal” and the file transfers are fast:

My guess is that the signals are just getting “jammed” a bit on the bus.

My test today was to remove all the monitors except 1 and see if the IPI events got cleaner - and they did! When I plug in more monitors to the same GPU, the IPI Events get busier, and when I plug in a monitor on a second GPU, things get really bad - IPI looks unhealthy and file transfers plummet.

In this test, the second GPU is in slot 6 on the motherboard, which has redriver capabilities. So I bumped up the redriver gain by one notch, and suddenly the slowdown disappeared at low resolutions on GPU2’s monitor, though at full 4K the slowdown still happened (IPI looks unhealthy, file transfers slow). So I bumped the redriver up one more notch, and voila - the IPI events look healthy again and the transfer speeds are stable around 2.5GB/s no matter what I do on my computer. Seems like a bus signal integrity issue?

I added a fourth monitor to a 3rd GPU (in slot 1 on the motherboard), and now the file transfer speed is unstable again, and this time it’s more unpredictable. IPI Events look like a solid line again.

Anyway, that’s as far as I’ve gotten tonight. Will do more testing tomorrow. If anybody has any more clues or can figure out what is causing this I’m all ears. Would love to know who needs to fix this sort of thing - is this a motherboard hardware issue? Is it a driver issue? Windows software issue? etc…


More details:

I’ve been messing with BIOS settings for the past 24 hours. I’ve disabled Global C-States, turned on PBO, changed the LCLK or “I/O Clock” from 1142 to 1200, changed APIC, ACPI, and NBIO settings, messed with redriver settings, etc. (Messing with the redriver settings definitely made a difference, oddly, but it also caused some performance issues and a couple of WHEA-error blue screens, so I ramped my changes back down.)

I’ve managed to get into a better state where my lower speed is now around 200-300MB/s instead of 70MB/s. My fast transfer speed is around 3-4GB/s (going over 4GB/s is probably the fastest I’ve ever seen a Windows file transfer that wasn’t using robocopy). BUT… I’m still getting the slow transfer speed issue.

I had been looking at CPU settings in Ryzen Master before setting all of the PBO and overclock settings in the BIOS, and I’m still using Ryzen Master to check clock speeds, core activity, etc. What I found is that when file transfers go quickly, most of the cores stay asleep. When file transfers go slowly, for some reason ALL of the cores are active and running at a low clock speed. When all of those cores come on, I believe that must be the IPI events going crazy that I’m seeing in my previous post, and it seems to present some sort of bottleneck to the file transfer.
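If anyone wants to watch for the same pattern without staring at Ryzen Master, here’s a minimal sketch that just logs per-core activity once a second. It assumes Python with the psutil package installed, and the 5% “awake” threshold is an arbitrary number I picked for readability:

```python
# Minimal per-core activity logger (assumes Python with the psutil package installed).
# Run it in one window, start a file copy in another, and watch how many cores wake up.
# The 5% "awake" threshold is arbitrary - it's just to make the core count readable.
import time
import psutil

while True:  # Ctrl+C to stop
    per_core = psutil.cpu_percent(interval=1.0, percpu=True)  # % busy per logical core
    awake = sum(1 for pct in per_core if pct > 5.0)
    freq = psutil.cpu_freq()  # may only report a package-level clock on Windows
    clock = f"{freq.current:.0f} MHz" if freq else "n/a"
    print(f"{time.strftime('%H:%M:%S')}  awake cores: {awake:3d}/{len(per_core)}  clock: {clock}")
```

During a slow copy you’d expect the awake-core count to jump to nearly every core while the clock sits low, which is exactly what Ryzen Master was showing me.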

Here’s a shot of the slow file transfer:

Here’s a shot of the fast file transfer:

There was no difference in what was happening on my computer at the time. I copy-pasted a file once and it was slow, then tried a few more times and it was fast (you can see the screenshots are only 61 seconds apart).
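Since the behavior flips between back-to-back copies of the same file, a simple repeated, timed copy makes the toggling easy to catch. Here’s a rough sketch - the paths are placeholders, not my actual drives, and it doesn’t go through Explorer’s copy engine, so treat it as a rough proxy:

```python
# Repeatedly copy the same file and log effective throughput, to catch the slow/fast
# toggling between otherwise-identical runs. SRC/DST are placeholder paths, not my drives.
import os
import shutil
import time

SRC = r"D:\test\big_test_file.bin"       # placeholder: large file on the source NVMe
DST = r"E:\test\big_test_file_copy.bin"  # placeholder: destination on another drive

size_mb = os.path.getsize(SRC) / (1024 * 1024)

for run in range(10):
    start = time.perf_counter()
    shutil.copyfile(SRC, DST)
    elapsed = time.perf_counter() - start
    print(f"run {run + 1:2d}: {size_mb / elapsed:8.1f} MB/s  ({elapsed:.1f} s)")
    os.remove(DST)
    time.sleep(5)  # short pause so the runs are easy to tell apart in WPA or Ryzen Master
```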

Not solved yet, but I feel like I’m getting closer. 24 hours ago, my 4th monitor on the 3rd GPU meant that my file transfers were NEVER faster than 70MB/s, even with no programs running, and now I can transfer at up to just over 4GB/s with a bunch of programs running… it just unfortunately doesn’t stay at that speed all the time. More sleuthing to do - will report back.

At least partially. ASUS’s documentation is unhelpfully sparse, but the appendix of the user manual does confirm redrivers are on PCIEX16(G5)_5, 6, and 7, as you’d expect based on distance from the processor socket. Perhaps you could get some insight working back from that based on which of those three slots have GPUs inserted.

My working hypothesis would be that errors from PCIe eye closure trigger the IPIs, after which performance collapses due to some form of error recovery and maybe link retraining. I’m not finding anything useful in the AMD IPI documentation I’ve looked through, though; mostly it says IPIs go to all cores for response, which is consistent with what Ryzen Master shows.

There are also redrivers on four USB links (three Ethernet, one front panel), so depending on what exactly the redriver settings do, it’s potentially not (or not just) a PCIe issue.


Reporting back! But first, shoutout to lemma for chiming in on my predicament. I was getting lonely in here and appreciate your perspective - it helped!

So this is going to be the shorter and less technical post, and I’ll report back with a more detailed writeup later.

The current update is that the issue has been resolved (pausing to hear the cheering from the crowd)! Lemma’s post made me come out of my stubborn hole of driving my main displays off my 4090s, and that put me on the path to resolving it for good.

Essentially, with lemma’s corroboration that the bus signal might be the culprit, I moved all my monitors to the GPUs in slots 1 and 2. With all of the other changes I’ve made to my system while trying to figure out this issue, I was able to transfer files at 5.5GB/s and had no issues doing so while other programs were open (the “other changes” mainly being: turning on PBO, changing an L3 cache setting, and disabling C-States - I’ll share more details on those later). I did notice that the file transfer speed would not stay at 5.5GB/s if other compute- or drive-intensive tasks were running, but the speeds were still in the 2-3GB/s range, so that is forgivable.

While everything was working great, this is not an ideal solution for me, because I’d mainly like AI training to be the only thing running on the GPUs in slots 1-4 (RTX 6000 Ada GPUs) while I do some video editing, 3D modeling, gaming, etc. on my main 4K monitor or my main ultrawide. So after reading lemma’s post, I decided I should go back in and mess with the redriver settings.

I started by reading this post to get a feel for what I might need to do in the BIOS with the redriver settings: Help with WRX80E-Sage SE Render server - Hardware / Build a PC - Level1Techs Forums. This post was super helpful to me and gave me enough information to jump in and start messing with things. After applying settings about 10 times, having a couple of blue screens on startup, and having some temporary drive failures on the M.2s plugged into PCIe slot 7, I finally got things running stable, and I can still hit 5+GB/s.

So what I think was happening is just as lemma stated - the bus errors were causing the CPU to freak out and all threads would turn on. The more the CPU freaked out, the more the transfer speed would suffer. The CPU freakout while transferring files made my CPU usage jump to 30% for the duration of the file transfer. Kinda crazy to see 30% CPU usage on a 64-core Threadripper for just 70-300MB/s writes.

Anyway, long story long, everything is working now. I think the redriver settings are keeping the bus communication stable, and the other changes I made in the BIOS helped increase the speed of the transfers. After I confirm this by testing, I’ll post the technical results here in case somebody wants to copy my settings.
