WRX90e SSD speed - writes super slow

Maybe a wild theory, but could this be electrical noise interference? It seems like the way you're getting back to higher speeds is by making the GPUs use more power instead of offloading unused voltage to ground.
Even if the GPUs are idling, this generation of GPUs uses more power than previous gens, and if motherboard manufacturers haven't adequately beefed up the communication lanes against electrical noise from these high-power GPUs, that could cause some weird problems.

Igor's Lab recently posted about the 12VHPWR/12V-2x6 connector not offloading voltage effectively through the plug's ground pins, thereby causing more voltage to be offloaded through the motherboard grounds.

I'm not an electrical expert, so someone with more experience will need to chime in on this, but I wanted to throw my theory out there just in case.

Ground on the wrong track: Why only certain power connector pins melt and we should rethink PCs | Part 1 of 2 | igor´sLAB (igorslab.de)


When you experience the slow SSD write speeds, is the CPU clocked slow?
Not exactly an apples-to-apples comparison, but Sapphire Rapids had super bad clocking behavior in Windows no matter what power plan was set; Windows kept sending the processor into a high C-state when it shouldn't have and slowed everything down.

Thanks for the ideas.

I may try removing the third power supply and just running it off two to see if it could be a power interference issue. Otherwise, my power is very very clean. It’s a dedicated 240V outlet going to a dedicated UPS, that’s creating pure sine wave power for the PC. I haven’t checked it with an oscilloscope or anything, but it should be pretty clean.

Regarding the clock speed - when I copy files the clock speed does increase to 4.77GHz, so I don't think it's a CPU clocking issue.

I saw your power setup and I'm envious. I'm building my system from scratch, so a good UPS is on the list, but not until after the system is done. I do have a brand-new line running from the power line to a new box with all new wiring, so it should be fine until that point.

What I mean by electrical interference isn't about the power being supplied, though. What I mean is that the power constantly supplied to the motherboard isn't always used completely, and is therefore fed back to the grounds on the GPU and motherboard power plugs; if the number of grounds isn't enough to offload all the unused power from the motherboard, it can cause feedback.

Think of the motherboard as a lake and the GPUs as rivers feeding water into it. Think of the GPU and motherboard power plugs as a dam, and the ground wires as gates in the dam that let water out. What happens if more water gets fed into the lake than the gates can let out? The dam overflows. In the case of a motherboard, power overflowing because the grounds can't feed enough of it back out of the board could lead to electrical interference on the PCIe lanes, since the excess isn't being bled off fast enough.

On top of that, the industry has had difficulty getting signal stability with PCIe 5.0.

" PCIe 5.0 design requires following new specifications for the space between signal traces to prevent crosstalk. The likelihood of crosstalk occurring is more significant with PCIe 5.0 designs due to the higher signaling rates . An additional challenge arises since PCB designs are increasingly smaller."

My thinking is that, with the issues these ASUS WRX90 boards have been having, it's possible they weren't designed with enough shielding between PCIe lanes to prevent crosstalk. Add that design flaw to 6 high-powered GPUs and you get very weird system behaviors.

Such as: when you're switching window size across 2 monitors or running Cinebench, your GPUs draw enough power that the left-over power that was possibly causing interference gets used up, and the signal interference stops temporarily.

Navigating PCIe 5.0 design challenges - Embedded.com

Hopefully that makes sense.


Ah, understood what you mean now. Interesting, and thanks for the read!

I’m not exactly sure how to test that theory, but I am assuming/hoping that this is something more software/driver related.

Some interesting things I’ve noted as I’ve been playing around with things:

  • When I restart the machine, transfer speeds are always acceptable when Windows first starts. Then, seemingly almost at random, it will start slowing down soon after boot-up.
  • I pulled the power from a couple GPUs so that only 4 started up - still got the same problem.
  • I reinstalled older drivers that I ran on my older Threadripper system, which also had a 4090 and an RTX 6000 Ada in it (same GPUs, just fewer of them) - still had the problem.
  • Because everything happens so randomly sometimes, I was starting to suspect that something in software is accidentally limiting the speed of the drives - the drive speeds are just caught in the crossfire of it trying to save power somewhere.

So I tried disabling anything with the efficiency mode symbol in Task Manager, and removing system processes from the services list. One interesting note along the way: when I disabled efficiency mode in Edge and killed any other processes running in efficiency mode, my baseline shot up from 70MB/s to 700MB/s.

So that's where I'm at now. This seems to bolster the idea that something weird is going on with software or drivers trying to manage power. It could still be the voltage/interference problem you suggest, I suppose, but I hope not! I'm hoping this is a simple chipset update problem.

Also @Quilt… I don’t see any weird errors of any kind in the errors list or any strange things from HWInfo64, resource monitor, or performance monitor. No indication of anything weird at all. Memtests and CPU tests all run fine too.

Will report back if I figure anything else out.


I’ll definitely be watching for any news on this issue since it seems to be pretty unique.
Wanted to drop another suggestion: have you made any changes to the Windows power plans? And if so, is "Ultimate Performance" an available option for you?

If not, this explains it:

Ultimate Performance Power Plan in Windows 11 (elluminetpress.com)

Also, there is a Windows edition called "Windows 11 Pro for Workstations" that could be something to look into as well.

Testing on Linux might be worth a try.

Is your drive using the correct power settings, as workstationbuild1087 said? Also check in Device Manager that the correct settings are there.

Checking back in here:

  • I adjusted the power settings to the "Ultimate Performance" power plan, which I had to enable using the suggestions in that link. There were no visible changes in the advanced section of the power plan where you can manually change things, but I still think this somehow made some changes.
  • I also turned off a few services that I didn’t need, like search indexing.
  • I installed older video drivers that I had been using successfully on my other Threadripper machine with a similar GPU mix. I went back to version 546.01 and installed both the RTX drivers and the 4090 drivers at the same version (RTX first, then 4090, video driver install only).
  • I disabled Widgets from the Local Group Policy Editor.
  • I disabled all startup apps - I was wondering if maybe they were causing some issues, because when I first start the machine the file transfer speeds are solid around 2GB/s for any size of transfer, but by the second or third time I try to copy files the speed dives. (I tried HUGE transfers of 30+GB to make sure it had nothing to do with the cache - a rough script for that kind of test is sketched after this list.)
  • I also disabled efficiency mode in Edge, and turned off all its caching and speed improvements, because these seemed to be starting in efficiency mode in the background.
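
For anyone who wants to repeat that cache-busting transfer test without clicking around in Explorer, here's a minimal Python sketch of the idea - the target path and sizes are placeholders, and Windows will still buffer part of the write, but the final fsync over ~48GiB keeps the sustained number reasonably honest:

```python
# Hypothetical sustained-write test: write far past any RAM cache, fsync,
# and report throughput. TARGET is a placeholder path on the SSD under test.
import os, time

TARGET = r"D:\bench\testfile.bin"
CHUNK = 64 * 1024 * 1024            # 64 MiB per write
TOTAL = 48 * 1024**3                # ~48 GiB, big enough to swamp the cache

os.makedirs(os.path.dirname(TARGET), exist_ok=True)
buf = os.urandom(CHUNK)             # incompressible data
start = time.perf_counter()
with open(TARGET, "wb") as f:
    written = 0
    while written < TOTAL:
        f.write(buf)
        written += CHUNK
    f.flush()
    os.fsync(f.fileno())            # force the data onto the device before timing stops
elapsed = time.perf_counter() - start
print(f"sustained write: {TOTAL / 1024**2 / elapsed:.0f} MB/s")
os.remove(TARGET)
```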

I may have done a few other small things, but it was difficult to tell which changes were making the biggest impact, except for the very first change of disabling efficiency mode in Edge and stopping all the Edge processes. That got me back up to 700MB/s pretty much instantly.

The rest of the adjustments seem to have gotten me back to something more reasonable. I can now transfer files at up to 2.4GB/s sustained over a 48GB file transfer.

These adjustments are not something that should need to be made, but at least I can download files, transfer files between computers, and unzip files at reasonable speeds - whatever was happening before, it was bringing my entire system to a standstill.

I would call this "resolved" for now in my situation, but I think there's something else at play here that somebody needs to fix - chipset drivers, a BIOS update, video drivers, or a Windows update.

Thanks all for the suggestions that I was able to investigate!

Hi Devin, I'm not sure if you've seen my post, but I thought I would mention this here just in case. It came up on Reddit, and while it could be a bad motherboard sensor reading, I'd still check up on it, since there were overvoltage issues on Ryzen 7000.

One reason I find this interesting for Threadripper users is this Gamers Nexus video: https://www.youtube.com/watch?v=cbGfc-JBxlY - in the first 30 seconds it mentions the 0d Q-code, and early on with these Threadrippers I saw a lot of 0d codes being talked about.

From Reddit, posted by user PXaZ:

"My WRX90 board is now giving this error every time I boot up, with a 7965WX. I’ve reloaded the default settings and reset the CMOS. The full message:

CPU Over Voltage Error!
New CPU installed! Please enter Setup to configure your system.
S3 mode has been disabled. You may configure this sleeping state in BIOS menu.
Press F1 to Run SETUP
When I enter setup and look at the voltages, they’re very concerning to my eyes at least:

12V Voltage [+24.480 V]
5V Voltage [+10.200 V]
3.3V Voltage [+4.080 V]
CPU Core0 Voltage [+2.264 V]
CPU Core1 Voltage [+2.264 V]
CPU VSOC Voltage [+2.264 V]
CPU VDDIO [+2.264 V]
VDD_11_S3 / MC Voltage [+4.080 V]
All the voltage values are displayed in red, which I interpret to mean “Something’s not right here!”

PSU is Seasonic Prime TX-1600. The board has two 12V cpu power connectors, and I have both of them connected to the “CPU / PCIE” section of the PSU.

Some of these voltages are double what it sounds like they should be. That's why I tried disconnecting either of the 12v cpu power connections. But if either is disconnected, the CPU won't initialize even though other devices power on.

I hadn’t changed anything in the overclocking settings - it was all set on “auto”.

I first noticed this after adding an NVME drive to the #3 M.2 slot.

If I ignore it all the system runs, but I’m uncomfortable going ahead with such warnings, and also it’s annoying to have to go into setup every time I boot.

Does anybody have any ideas?"

Link to the original post:

CPU Over Voltage Error! New CPU installed! : threadripper (reddit.com)

Welp, I’m back.

Been using the PC for a few weeks without much issue. I’ve had a single lock-up when coming back to my computer after it had been idle for 12 hours, but other than that no issues.

UNTIL NOW.

So I’m doing some multi-gpu AI training on my machine, and to connect the GPUs together it uses a rendezvous backend so that the GPUs can sync information with each other. This puts a little bit of bus overhead into the system with the GPUs sending some syncing information back and forth, on the order of a few hundred MB/s up to maybe a GB/s of data transfer.

The problem is that while this is happening, my computer is noticeably chugging - the CPU looks nearly completely free, but the slow SSD transfer speed problem comes back, which makes my whole computer god awful slow. Downloads, programs, etc… everything is slow. I assume that the bus is somehow just overloaded.

I’ve tried training with smaller datasets so that the amount of information being transferred between GPUs is super small, too, only a few MB/s, which seems like it should be no problem. But the issue persists. No matter the scale of the information syncing between GPUs, it slows down SSD speed to 70MB/s(ish). This leads me to think that this is a surmountable problem, and that the issue is somewhere in the BIOS/software/firmware/drivers. Something fixable, because scaling down the overhead doesn’t seem to make a difference.

Anyway - I'm really hoping for some driver updates, Windows updates, BIOS updates, or chipset updates. It doesn't seem like this should be happening.

I have more experimenting to do, but wanted to mention that there is definitely something weird going on.

This caught my attention. Is this the first time you’ve had it lock up while idle? I’ve had that happen a few times already, but with Linux. This is the first time I noticed a Windows user report a similar issue. Was it like: displays refuse to wake from power save mode, can’t ping the machine, etc, but the IPMI thinks everything’s just fine?

(I can’t really comment on your I/O issues because I built my system for the cores and memory channels rather than the connectivity, so I have very few devices on the PCIe bus.)

My Q-code display on the motherboard shows AA, as if the machine is running fine. All of my screens are black when idle (I have a blank screensaver come on after 20 minutes, but I don't shut off the displays), but I assume that whatever was on screen would have been frozen. The keyboard Num Lock and Caps Lock keys are not responsive (the lights don't change on the keyboard), so I assume it's just a system hang, or at least some sort of interrupt or bus issue.

In the past, when I was setting up the machine, I had lots of these system hangs related to plugging in a monitor (it was having trouble initializing multiple monitors and would sometimes hang while I was getting a new monitor plugged in). It makes me scared to unplug any of my monitors or plug a new monitor in.

Update on my bus issue:

I couldn't figure out Windows, so I went to WSL2. I can train on all 6 GPUs there without any slowdown in SSD transfer speeds or the rest of my computer, so I guess I'll just be running most of my training in Linux now. In Linux the method for rendezvous between GPUs is different (NCCL instead of Gloo). I don't know if that's the difference, or if there's just something wrong with Windows.
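
For context, here's a minimal, hypothetical PyTorch sketch of what that backend difference looks like in code (it assumes a launcher such as torchrun has set the usual rank/world-size environment variables); NCCL is Linux-only, which is why the Windows runs fall back to Gloo:

```python
import os
import torch
import torch.distributed as dist

# NCCL talks GPU-to-GPU (Linux/WSL2 only); Gloo, the Windows fallback, stages
# the sync traffic through host memory and the CPU/PCIe path instead.
backend = "nccl" if torch.cuda.is_available() and os.name != "nt" else "gloo"
dist.init_process_group(backend=backend)  # rendezvous via the env:// defaults

rank = dist.get_rank()
if backend == "nccl":
    torch.cuda.set_device(rank % torch.cuda.device_count())
    tensor = torch.ones(1, device="cuda")
else:
    tensor = torch.ones(1)

dist.all_reduce(tensor)  # the kind of small periodic sync the training run does
print(f"rank {rank}: all_reduce done, value={tensor.item()}")
dist.destroy_process_group()
```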


Dear Diary. Me again. Still having issues. I keep thinking I’m past the issue, but it keeps popping up again.

I can’t pinpoint what the issue is, but it is still happening where the drive speeds drop down to 70MB/s. It seems almost like an on/off switch.

Right now, a good example is that I turn on the machine and it allows file transfers around 2500MB/s, and then I launch WSL2 and the transfer speeds go back down to 70MB/s. There are a whole bunch of apps that have an immediate effect on the speed; WSL2 is just one of many.

I started running Windows Performance Recorder and Windows Performance Analyzer to try to find some information, and what I found was very interesting. When I looked at the active IO time for my file transfers and divided the size of what was transferred by it, it showed that my 70MB/s transfers were actually moving data at 3800MB/s, but only in small, short bursts. I then looked at my regular file transfers of about 2500MB/s (before I opened any apps) and noticed fewer weird spikes and more constant communication. Lastly, I tried CrystalDiskMark, and it shows extremely high and extremely constant speeds - even when my computer is in a state where my Explorer file transfers are at 70MB/s.
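
To make that arithmetic concrete, here's a quick back-of-the-envelope sketch using illustrative numbers from this thread (a 48GB copy at the observed ~70MB/s effective rate and ~3800MB/s active-IO rate; these are not actual trace values):

```python
# Effective throughput vs. burst throughput: the device is fast while IO is
# active, it just spends most of the wall-clock time doing nothing.
transferred_mb = 48 * 1024                 # a 48 GB test copy, in MB
wall_clock_s = transferred_mb / 70         # what Explorer reports (~70 MB/s)
active_io_s = transferred_mb / 3800        # time the device is actually busy (~3800 MB/s)

duty_cycle = active_io_s / wall_clock_s    # fraction of the copy spent doing real IO
print(f"effective: {transferred_mb / wall_clock_s:.0f} MB/s, "
      f"burst: {transferred_mb / active_io_s:.0f} MB/s, "
      f"IO active only {duty_cycle:.1%} of the time")
```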


You can see in this image first a CrystalDiskMark read at 12000MB/s, followed by a write at about the same rate; all the short spikes you see after that are a slow 70MB/s file transfer, which is accomplished through a series of very fast but short-duration bursts of IO on the device.

Without going into a ton of detail at the moment (it’s late), I’ve found that the problem seems to be related to something happening on the PCIe bus.

I noticed that whenever my drives are transferring files slowly, with that spiky performance graph, the IPI events are going nuts. IPIs are Inter-Processor Interrupts, I think. This first graph is an "unhealthy" IPI pattern as far as I can tell - this is what the IPI events look like during slow file transfers. Specifically, the part with the solid line is where the file transfer is happening.

[screenshot: IPI Events during a slow file transfer]

And this is what they look like when they’re “normal” and the file transfers are fast:

My guess is that the signals are just getting “jammed” a bit on the bus.

My test today was to remove all the monitors except 1 and see if the IPI events got cleaner - and they did! When I plug in more monitors to the same GPU, the IPI Events get busier, and when I plug in a monitor on a second GPU, things get really bad - IPI looks unhealthy and file transfers plummet.

In this test, the second GPU is in slot 6 on the motherboard, which is a slot with redriver capabilities. So I bumped up the redriver gain by 1 and suddenly the slowdown disappeared at low resolutions on GPU 2's monitor, though at full 4K the slowdown still happened (IPI looks unhealthy, file transfers slow). So I bumped the redriver up 1 more notch, and voila - the IPI events look healthy again and the transfer speeds are stable around 2.5GB/s no matter what I do on my computer. Seems like a bus signal issue?

I added a fourth monitor to a 3rd GPU (in slot 1 on the motherboard), and now the file transfer speed is unstable again, and this time it’s more unpredictable. IPI Events look like a solid line again.

Anyway, that’s as far as I’ve gotten tonight. Will do more testing tomorrow. If anybody has any more clues or can figure out what is causing this I’m all ears. Would love to know who needs to fix this sort of thing - is this a motherboard hardware issue? Is it a driver issue? Windows software issue? etc…


More details:

I've been messing with BIOS settings for the past 24 hours. I've disabled Global C-States, turned on PBO, changed the LCLK or "I/O clock" from 1142 to 1200, changed APIC, ACPI, and NBIO settings, messed with redriver settings, etc. (Messing with the redriver settings definitely made a difference, oddly, but it also resulted in some performance issues and a couple of WHEA-error blue screens, so I ramped those changes back down.)

I've managed to get into a better state where my low speed is now around 200-300MB/s instead of 70MB/s. My fast transfer speed is around 3-4GB/s (going over 4GB/s is probably the fastest I've ever seen a Windows file transfer that wasn't using robocopy). BUT… I'm still getting slow transfer speed issues.

I was looking at CPU settings in Ryzen Master before setting all of the PBO and overclock settings in the BIOS, but I'm still using Ryzen Master to check the clock speeds, core activity, etc. What I found is that when file transfers go quickly, most of the cores stay asleep. When file transfers go slowly, for some reason ALL of the cores are active and running at a low clock speed. When all of those cores come on, I believe that must be the IPI events going crazy that I saw in my previous post, and it seems to present some sort of bottleneck to the file transfer.
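
If you want to log that behavior without keeping Ryzen Master open, here's a rough, hypothetical Python sketch using psutil (my addition, not anything from Ryzen Master itself) that samples per-core load and average clock while a copy is running:

```python
# Sample per-core utilization (and average frequency, where readable) once a
# second for ~10 seconds; run it while a slow file copy is in progress.
# Requires: pip install psutil
import psutil

for _ in range(10):
    per_core = psutil.cpu_percent(interval=1.0, percpu=True)
    busy = sum(1 for load in per_core if load > 5.0)
    freq = psutil.cpu_freq()          # may be None or a package average on some systems
    mhz = f"{freq.current:.0f} MHz" if freq else "n/a"
    print(f"{busy}/{len(per_core)} cores >5% busy, avg freq {mhz}")
```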

Here’s a shot of the slow file transfer:

Here’s a shot of the fast file transfer:

There was no difference in what was happening on my computer at the time. I just copy-pasted a file one time and it was slow, and then I tried a few more times and it was fast (you can see that the screenshots are only 61 seconds apart).

Not solved yet, but I feel like I'm getting closer. 24 hours ago, the 4th monitor on the 3rd GPU meant that my file transfers were NEVER faster than 70MB/s, even with no programs running, and now I can transfer at up to just over 4GB/s with a bunch of programs running… it's just unfortunately not staying at that speed all the time. More sleuthing to do - will report back.

At least partially. ASUS' documentation is unhelpfully sparse, but the appendix of the user manual does confirm redrivers are on PCIEX16(G5)_5, 6, and 7, as you'd expect based on distance from the processor socket. Perhaps you could get some insight by working back from that, based on which of those three slots have GPUs inserted.

My working hypothesis would be that errors from PCIe eye closure trigger the IPIs, after which performance collapses due to some form of error recovery and maybe link retraining. I'm not finding anything useful in the AMD IPI documentation I've looked through, though; mostly it says IPIs go to all cores for a response, which is consistent with Ryzen Master.

There are also redrivers on four USB links (three Ethernet, one front panel), so depending on what exactly the redriver settings do, it's potentially not (or not just) a PCIe issue.


Reporting back! But first, shoutout to lemma for chiming in on my predicament. I was getting lonely in here and appreciate your perspective - it helped!

So this is going to be the shorter and less technical post, and I’ll report back with more detailed writeup later.

The current update is that the issue has been resolved (pausing to hear the cheering from the crowd)! Lemma's post made me come out of my stubborn hole of running my main displays on my 4090s, and that led me down the path toward resolving it for good.

Essentially, with lemma's corroboration that the bus signal might be the culprit, I moved all my monitors to the GPUs in slots 1 and 2. With all of the other changes I'd made to my system while trying to figure out this issue, I was able to transfer files at 5.5GB/s and had no issues doing so while other programs were open (the "other changes" mainly being: turning on PBO, changing an L3 cache setting, and disabling C-States, of which I'll share more details later). I did notice that file transfer speed would not stay at 5.5GB/s if other compute- or drive-intensive tasks were running, but the speeds were still in the 2-3GB/s range, so that is forgivable.

While everything was working great, this is not an ideal solution for me, because I'd mainly like AI training to be the only thing running on the GPUs in slots 1-4 (RTX 6000 Ada GPUs), while I maybe do some video editing, 3D modeling, gaming, etc. on my main 4K monitor or my main ultrawide. So after reading lemma's post, I decided I should go back in and mess with the redriver settings.

I started by reading this post to get a feel for what I might need to do in the BIOS with the redriver settings: Help with WRX80E-Sage SE Render server - Hardware / Build a PC - Level1Techs Forums. It was super helpful and gave me enough information to jump in and start messing with things. After applying settings about 10 times, having a couple of blue screens on startup, and having some temporary drive failures on my M.2s plugged into PCIe slot 7, I finally got things running stable, and I can still hit 5+GB/s.

So what I think was happening is just as lemma stated - the bus errors were causing the CPU to freak out and all threads would turn on. The more the CPU freaked out, the more the transfer speed suffered. The CPU freakout while transferring files made my CPU usage jump to 30% for the duration of the transfer. Kinda crazy to see 30% CPU usage on a 64-core Threadripper for just 70-300MB/s writes.

Anyway, long story long, everything is working now. I think the redriver settings are keeping the bus communication stable, and the other changes I made in the BIOS helped increase the speed of the transfers. After I confirm this by testing, I’ll post the technical results here in case somebody wants to copy my settings.


Alright, I hope this is the end of the saga. Here’s the summary.

After I made the previous post, the problem came back about 3 days later. So frustrating. So I spent the past week troubleshooting again. 2-3 hours every day. I spent sooo much time in Windows Performance Analyzer.

TL;DR - all of this troubleshooting was certainly a learning experience, but at the end of the day, I believe the main problem was caused by the High Definition Audio Devices of the monitors, and disabling all 7 of them in Device Manager was the key to solving the problem.

I believe these devices were causing a lot of havoc on the interrupt system and would sometimes just error out and fire tons of interrupts constantly, which would cause all of the CPU cores to come on under certain conditions (like a file transfer) as the system tried to figure out how to handle the load.

Along the way I did learn a few things that had an effect on the overall performance of my computer and my real-world SSD transfer speeds.

  1. The BIOS option "ACPI SRAT L3 Cache as NUMA Domain" = Enabled. This creates 8 NUMA nodes for Windows, and it boosted my SSD transfer speed by up to 2GB/s (!!!), which is a lot. I now get file transfers of nearly 6GB/s (I don't think I've actually seen it break 6GB/s, though). There's a great article on NUMA settings - specifically the NPS settings that let you choose between 1, 2, and 4 nodes - that you can read here: SNC/NPS Tuning For Ryzen Threadripper 7000 Series To Further Boost Performance Review - Phoronix. They do a great comparison of performance across different use cases. I found the best performance boost from this L3 cache option, though; in every other scenario the SSD transfer speeds were usually around 3GB/s. (A quick way to check that Windows actually picked up the extra NUMA nodes is sketched after this list.)

  2. Playing with the redriver settings did seem to help with bus noise, and my computer ran fine for about 3 days before it eventually hit the slow transfer speeds again. I believe the changes may have made the bus traffic slightly cleaner and helped it handle whatever weirdness was occurring. I could totally be making this up, like how putting stickers on a race car can add 5 hp on your butt dyno, but it did also seem to help with some video glitching I noticed on my highest-bandwidth 144Hz 4K monitor, which is plugged into slot 5. Setting the redriver too high did cause BSODs and WHEA errors, FYI - so if you're getting WHEA errors from devices on slots 5, 6, or 7, the redriver settings might be a thing to look into.

  3. Turning on PBO in the BIOS also seemed to help raise the floor and ceiling of my SSD transfer speed issue. I used the maximum settings of my board.

  4. I played with a TON of other BIOS settings, mostly to no effect. It was a great learning experience investigating what all the settings do, but there's mostly nothing worth reporting (and it's hard to remember, since I didn't document every change).
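
To confirm the NUMA change from item 1 actually took effect in Windows, here's a tiny sketch using a standard Win32 call via ctypes (my addition, not something from ASUS or AMD); with "L3 Cache as NUMA Domain" enabled on this CPU it should report 8 nodes:

```python
# Ask Windows how many NUMA nodes it sees. GetNumaHighestNodeNumber returns
# the highest node index, so add 1 to get the node count.
import ctypes

highest = ctypes.c_ulong(0)
if not ctypes.windll.kernel32.GetNumaHighestNodeNumber(ctypes.byref(highest)):
    raise ctypes.WinError()
print(f"Windows reports {highest.value + 1} NUMA node(s)")
```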

Thanks for following along, and I hope all of this can help somebody else at some point!
