WRX90e SSD speed - writes super slow

Hey Folks,

I have another thread that has since been solved (WRX90e won’t boot with 6 GPUs), but now I’m having another problem, this time with my SSDs.

I notice that in CrystalDiskMark I get great scores with my T700 SSDs, 11-12GB/s read and 11GB/s write, but when I transfer files in Windows, writes crawl along at 70MB/s - totally abysmal, since I’m working with giant 6-30GB files. I have been trying to investigate this in my free time over the past few days. Here’s what I’ve done:

  1. Assuming this was a driver issue (since it wasn’t happening originally), I reinstalled Windows 11 completely. Sure enough, on a fresh install with no drivers, file transfer speeds are about 1.4-1.5GB/s.
  2. While still disconnected from the internet, I installed the LAN drivers, chipset drivers, and video card drivers. In between each of those, I made sure the drive speed was still good. It actually improved slightly after installing the chipset drivers, and will peak around 2GB/s instead of 1.5GB/s. Great.
  3. With all those drivers installed, I started to enable my other monitors in Nvidia Control Panel (there are 8 monitors in total, and up until this point I was only using 1 monitor). Lo and behold - WHEN I CONNECT MORE THAN ONE DISPLAY, THE WINDOWS FILE TRANSFER SPEED DROPS TO 70MB/s! Disabling the monitors and restarting the machine puts me back in the previous state where the file speed hits that 1.5-2GB/s peak.

So something is happening when I turn on all 8 monitors, and it’s only affecting Windows - CrystalDiskMark still shows great scores. It’s just Windows file transfer that bogs down. There has got to be something going on at the software or driver level. If anybody has any tips or ideas, please let me know.
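
In case it helps, here’s roughly the kind of quick-and-dirty Python timing I’ve been using to sanity-check the numbers outside of Explorer and CrystalDiskMark. The path and sizes are just placeholders, and with 512GB of RAM the OS cache absorbs a lot, so the final fsync is what makes the number halfway honest:

```python
# Rough sequential-write timing sketch, not a real benchmark tool.
# "E:\write_test.bin" is a placeholder path on the SSD under test.
import os
import time

TARGET = r"E:\write_test.bin"
CHUNK = 64 * 1024 * 1024      # 64 MiB per write
TOTAL = 8 * 1024 ** 3         # 8 GiB total

buf = os.urandom(CHUNK)       # incompressible data
start = time.perf_counter()
with open(TARGET, "wb", buffering=0) as f:
    written = 0
    while written < TOTAL:
        f.write(buf)
        written += CHUNK
    f.flush()
    os.fsync(f.fileno())      # force the OS cache out to the drive before stopping the clock
elapsed = time.perf_counter() - start
print(f"{TOTAL / elapsed / 1e9:.2f} GB/s sustained write")
os.remove(TARGET)
```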

In the meantime, I’ll be trying out a bunch of things:

  1. Change the GPUs that are being used for monitors.
  2. Decrease the number of monitors being used.
  3. Try different drivers, older drivers, etc.
  4. Remove a couple of GPUs (the 4090s) so there is only one type of GPU, to see whether it’s the mixed GPUs or simply having fewer GPUs that solves it.

Will report back with what I find.

Edit:
System Specs:
CPU: 7985wx
RAM: 512 GB, 8 x 64GB Samsung M321r8ga0bb0-cqkmg
GPUs: 4 x 6000 ada, 2 x 4090
PSUs: 3 x 1600W Seasonic Titanium ATX 3.0
SSD: 8 x T700 (4 in slot 7 via a Hyper M.2 card)
8 monitors of various sizes.

Eight monitors…why so many?

I just do a lot of stuff on my computer at the same time - multiple coding environments, command windows, web browser, CAD programs, 3D printing software. The original thing that drove me to having so many monitors was CAD work - you always want those 3D CAD environments full screen, and you might be working on part of a project that requires a circuit board CAD file in one program, a part CAD file that the circuit board will be placed in, and the 3D printing software all open at the same time, in addition to documenting parts in a parts list while searching the web for different parts distributors. That easily takes up 5 or 6 windows, and switching between open applications is a pain when you have to go back and forth so often. I’ve been doing that for the past 7 years. I just recently added monitors 7 and 8 because I wanted more space.

3 Likes

Is the second display on another GPU? Perhaps there is some transfer of the framebuffer that is buggy and causing unreasonable load on memory/pcie?

Can you see some other symptoms when you connect the extra display? Strange CPU/memory use; WHEA errors; maybe look at latencymon?

Another thing to try: do you get the same issue with one monitor, but while keeping the GPUs busy with something?

Okay, I’ve been installing drivers and adding monitors one at a time, creating restore points along the way, etc…

Some findings:

  • Windows 11 on install with no drivers installed for LAN + Chipset + GPUs: SSDs initially write at about 1.5GB/s
  • Windows 11 + LAN+ Chipset + GPU drivers: SSDs write at about 2 GB/s (a small improvement)
  • Windows 11 + 23H2 update + LAN + chipset + GPU drivers: The SSDs can write at up to around 3GB/s - so the Windows update actually helped significantly with the potential max write speed.

Now here comes the totally strange:

  • If I open the display properties, the drive speeds slow down and don’t recover their speed until quite a while after I close that window.
  • If I restart my computer, my drive speeds are restored for a time. (I do have my power settings set to never turn anything off, performance mode for everything, so my drives shouldn’t be going into a low power state, and PCIe Link State Power Management is off.)
  • It seems that almost at random the speeds of the SSD writes will be good, but sometimes they will just drop to 70MB/s (ish).

Okay the MOST strange part (drumroll…):

  1. I start copying 30GBs of 6GB sized files…
  2. The speed is either horrible right at the start, or right after the cache runs out - but let’s assume it’s now transferring at 70MB/s.
  3. If I take a web browser window and start resizing it, such that the browser crosses between two displays that are connected to two different GPUs, the SSD speed starts to climb up until about 1.7GB/s.
  4. If I stop resizing the window, or stop resizing it across the boundary of the two displays plugged into two different GPUs, the speeds plummet back down to 70MB/s. Last thing to note here: as I resize the window, it has to be performing a render call on both displays (you have to resize it in a manner where the portion of the window being resized spans both screens).

When I first start resizing the window across two screens, it doesn’t render smoothly on both screens right away - there seems to be a little bit of negotiation happening between the GPUs so they can render parts of the same thing on two devices. While this is happening it is a little laggy, and the SSD speeds do not increase. It’s not until this negotiation has happened and the window starts resizing more smoothly that the SSD speeds increase. I’ve tried making both displays active with videos, or even playing a single video across two screens, and this does not have any effect (the SSD write speeds are still slow). It seems that I really do have to resize the window and have it redraw to make the SSD speed increase.

It almost seems like there’s some sort of “blockage” in the lanes on the board. It seems crazy that forcing the GPUs to talk to each other would make the SSD speed increase, but here we are. If I can find the time to record and post a quick video of it I will.

Anybody have any ideas?

1 Like

Update: If I run Cinebench at the same time as I transfer files, my file transfer speed rises and falls with each image render pass it takes. Every time it processes an image the speeds increase; every time it prepares for the next image, the speeds fall.
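
If anyone wants to watch the same correlation, a simple way is to log the system-wide disk write throughput once a second. This is just a rough psutil sketch (it only reads Windows’ own I/O counters): run it, start the copy, then start and stop Cinebench and watch the numbers move.

```python
# Sketch: print system-wide disk write throughput once per second using psutil
# (pip install psutil). Run it alongside a file copy while starting/stopping Cinebench.
import time
import psutil

prev = psutil.disk_io_counters().write_bytes
while True:                      # Ctrl+C to stop
    time.sleep(1.0)
    cur = psutil.disk_io_counters().write_bytes
    print(f"{(cur - prev) / 1e6:8.0f} MB/s written (all disks)")
    prev = cur
```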

Maybe a wild theory, but could this be electrical noise interference? It seems like the way you are getting back to higher speed is by making the GPUs use more power instead of offloading unused voltage to ground.
Even if the GPUs are idling, this generation of GPUs uses more power than previous gens, and if motherboard manufacturers haven’t adequately beefed up the communication lanes against electrical noise interference from these high-power GPUs, that could cause some weird problems.

Igor’s Lab recently posted about the 12VHPWR/12V-2x6 connector not offloading voltage effectively through the plug’s ground pins, thereby causing more voltage to be offloaded to the motherboard grounds.

I’m not an electrical expert, so hopefully someone with more experience can chime in on this, but I wanted to throw my theory out there just in case.

Ground on the wrong track: Why only certain power connector pins melt and we should rethink PCs | Part 1 of 2 | igor´sLAB (igorslab.de)

1 Like

When you experience the slow SSD write speeds, is the CPU clocked slow?
Not exactly an apples-to-apples comparison, but Sapphire Rapids had super bad clocking behavior in Windows no matter what power plan was set; Windows kept sending the processor into a high C-state when it shouldn’t have, which slowed everything down.

Thanks for the ideas.

I may try removing the third power supply and just running it off two to see if it could be a power interference issue. Otherwise, my power is very very clean. It’s a dedicated 240V outlet going to a dedicated UPS, that’s creating pure sine wave power for the PC. I haven’t checked it with an oscilloscope or anything, but it should be pretty clean.

Regarding the clock speed - when I copy files the clock speed does increase to 4.77GHz, so I don’t think it’s the CPU clock.
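
For reference, this is roughly how I eyeballed the clocks - just polling the reported frequency and load once a second during a copy. It’s only a rough indicator (psutil’s frequency reading on Windows can be coarse):

```python
# Sketch: poll the reported CPU clock and load once a second during a file copy.
# Requires psutil; on Windows the frequency value can be coarse, so it's only a rough check.
import time
import psutil

while True:                          # Ctrl+C to stop
    freq = psutil.cpu_freq()         # .current is in MHz
    load = psutil.cpu_percent(interval=None)
    print(f"{freq.current:7.0f} MHz   {load:5.1f}% load")
    time.sleep(1.0)
```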

I saw your power setup and I’m envious. I’m building my system from scratch, so a good UPS is on the list, but not until after the system is done. I do have a brand new line from the power line to a new box with all new wiring, so it should be fine until that point.

What I mean by electrical interference isn’t about the power being supplied, though. What I mean is that the power being constantly supplied to the motherboard isn’t always used completely and therefore is fed back to the grounds on the GPUs and motherboard power plugs, and if the number of grounds isn’t able to offload all the unused power from the motherboard, it can cause feedback.

Think of the motherboard as a lake and the GPUs as rivers feeding water into the lake; think of the GPU and motherboard power plugs as a dam and the ground wires as gates in the dam to let water out. But what happens if more water than the gates can let out gets fed into the lake? The dam overflows. In the case of a motherboard, power overflow from the grounds being unable to feed enough power back out of the motherboard could lead to electrical interference on PCIe lanes from the power not being bled off fast enough.

On top of that, the industry has had difficulties getting signal stability with PCIe 5.0.

" PCIe 5.0 design requires following new specifications for the space between signal traces to prevent crosstalk. The likelihood of crosstalk occurring is more significant with PCIe 5.0 designs due to the higher signaling rates . An additional challenge arises since PCB designs are increasingly smaller."

My thinking is that with the issues these ASUS WRX90 boards have been having, it’s possible they weren’t designed with enough shielding between PCIe lanes to prevent crosstalk. Add that as a design flaw to 6 high-powered GPUs and you get very weird system behaviors.

Such as: when you’re resizing a window across 2 monitors or running Cinebench, it makes your GPUs use enough power that the leftover power that was possibly causing interference gets used up by the GPUs, and the signal interference stops temporarily.

Navigating PCIe 5.0 design challenges - Embedded.com

hopefully that makes sense.

2 Likes

Ah, understood what you mean now. Interesting, and thanks for the read!

I’m not exactly sure how to test that theory, but I am assuming/hoping that this is something more software/driver related.

Some interesting things I’ve noted as I’ve been playing around with things:

  • When I restart the machine, transfer speeds are always acceptable when windows first starts. Seemingly almost randomly, it will start slowing down soon after boot up.
  • I pulled the power from a couple GPUs so that only 4 started up - still got the same problem.
  • I reinstalled older drivers that I ran on my older threadripper system that also had a 4090 and an rtx 6000 ada in it (same GPUs just less of them) - still had the problem.
  • Because everything happens so randomly, I was starting to suspect that something in software is internally limiting the speed of the drives by accident - the drive speeds are just caught in the crossfire of it trying to save power somewhere.

So I tried disabling anything with the efficiency mode symbol in Task Manager, and removing system processes from the services list. One interesting note along the way: when I disabled efficiency mode in Edge and killed any other processes running in efficiency mode, my baseline shot up from 70MB/s to 700MB/s.
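
(As far as I understand it, that Task Manager “efficiency mode” is Windows’ EcoQoS power throttling under the hood. Just as an illustration of what gets toggled - this is my reading of the Win32 API, not an official fix - a process can opt itself out of it like this:)

```python
# Illustration only: opt the current process out of EcoQoS ("efficiency mode")
# via SetProcessInformation. This reflects my understanding of the Win32 API,
# not a recommended system-wide fix.
import ctypes
from ctypes import wintypes

class PROCESS_POWER_THROTTLING_STATE(ctypes.Structure):
    _fields_ = [("Version", wintypes.ULONG),
                ("ControlMask", wintypes.ULONG),
                ("StateMask", wintypes.ULONG)]

PROCESS_POWER_THROTTLING_CURRENT_VERSION = 1
PROCESS_POWER_THROTTLING_EXECUTION_SPEED = 0x1
ProcessPowerThrottling = 4  # PROCESS_INFORMATION_CLASS value

state = PROCESS_POWER_THROTTLING_STATE(
    Version=PROCESS_POWER_THROTTLING_CURRENT_VERSION,
    ControlMask=PROCESS_POWER_THROTTLING_EXECUTION_SPEED,
    StateMask=0,  # 0 = explicitly disable execution-speed throttling
)

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
ok = kernel32.SetProcessInformation(
    kernel32.GetCurrentProcess(),
    ProcessPowerThrottling,
    ctypes.byref(state),
    ctypes.sizeof(state),
)
print("opted out of EcoQoS" if ok else f"failed, error {ctypes.get_last_error()}")
```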

So that’s where I’m at now. This seems to bolster the idea that something weird is going on with software or drivers trying to manage power. It could still be the voltage/interference problem you suggest, I suppose, but I hope not! I’m hoping this is a simple chipset update problem.

Also @Quilt… I don’t see any weird errors of any kind in the errors list or any strange things from HWInfo64, resource monitor, or performance monitor. No indication of anything weird at all. Memtests and CPU tests all run fine too.

Will report back if I figure anything else out.

1 Like

I’ll definitely be watching for any news on this issue since it seems to be pretty unique.
Wanted to drop another suggestion: have you made any changes to Windows power plans? And if so, is "Ultimate Performance" an available option for you?

If not, this explains it:

Ultimate Performance Power Plan in Windows 11 (elluminetpress.com)

Also, there is a Windows edition called "Windows 11 for Workstations" that could be something to look into as well.

Testing on Linux might be worth a try.

Is your drive using the correct power settings, as workstationbuild1087 said? Also check in Device Manager that the correct settings are there as well.

Checking back in here:

  • I adjusted the power settings to the "Ultimate Performance" power plan, which I had to enable using the suggestions in that link (there’s a rough sketch of what that boils down to after this list). There were no visible changes to the advanced section of the power plan where you can manually change things, but I still think it somehow made some changes.
  • I also turned off a few services that I didn’t need, like search indexing.
  • I installed older video drivers that I was using successfully on my other threadripper machine with a similar GPU mix. I went back to version 546.01 and installed both the RTX drivers, and the 4090 drivers with the same version (RTX first, then 4090, video driver install only)
  • I disabled Widgets from the Local Group Policy Editor.
  • I disabled all startup apps - I was wondering if maybe they were causing some issues, because when I first start the machine the file transfer speeds seem solid around 2GB/s for any size of file transfer, but then the second or third time I try to copy files the speed dives. (I tried HUGE transfers of 30+GB to make sure it had nothing to do with the cache.)
  • I also disabled efficiency mode in Edge, and turned off all the caching and speed improvements, because these seemed to be started in efficiency mode in the background.
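
For anyone following along, this is roughly what the "Ultimate Performance" step from that link boils down to. The GUID below is the one commonly documented for the hidden scheme, and this needs an elevated prompt - sketched here as a small Python helper just to keep it copy/paste-able:

```python
# Sketch of enabling the hidden "Ultimate Performance" power plan.
# The GUID is the one commonly documented for that scheme; run from an elevated prompt.
import re
import subprocess

ULTIMATE_GUID = "e9a42b02-d5df-448d-aa00-03f14749eb61"

# Duplicating the hidden scheme makes a visible copy; powercfg prints the copy's GUID.
out = subprocess.run(["powercfg", "-duplicatescheme", ULTIMATE_GUID],
                     capture_output=True, text=True, check=True).stdout
new_guid = re.search(r"[0-9a-fA-F-]{36}", out).group(0)

# Activate the copy and show the active scheme to confirm.
subprocess.run(["powercfg", "/setactive", new_guid], check=True)
subprocess.run(["powercfg", "/getactivescheme"], check=True)
```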

I may have done a few other small things, but it was difficult to tell which changes were making the biggest impact, except for the very first change of disabling efficiency mode in Edge and stopping all the Edge processes. That got me back up to 700MB/s pretty much instantly.

The rest of the adjustments seem to get me back to something more reasonable. I can now transfer files at up to 2.4GB/s sustained over a 48GB file transfer.

These adjustments are not something that should need to be made, but at least I can download files at reasonable speeds, transfer files between computers at reasonable speeds, and unzip files at reasonable speeds - whatever was happening before, it was bringing my entire system to a standstill.

I would call this “resolved” for now in my situation, but I think there’s something else at play here that somebody needs to fix. Chipset drivers, bios update, video drivers, or windows update.

Thanks all for the suggestions that I was able to investigate!

Hi Devin, I’m not sure if you’ve seen my post, but I thought I would say something here just in case. This came up on Reddit, and while it could be an error in the motherboard’s readings, I’d still check up on it since there were overvoltage issues on Ryzen 7000.

One reason I find this interesting for Threadripper users is this Gamers Nexus video, https://www.youtube.com/watch?v=cbGfc-JBxlY - in the first 30 seconds it mentions the 0d Q-code, and early on with these Threadrippers I saw a lot of 0d codes being talked about.

From Reddit, posted by user PXaZ:

"My WRX90 board is now giving this error every time I boot up, with a 7965WX. I’ve reloaded the default settings and reset the CMOS. The full message:

CPU Over Voltage Error!
New CPU installed! Please enter Setup to configure your system.
S3 mode has been disabled. You may configure this sleeping state in BIOS menu.
Press F1 to Run SETUP
When I enter setup and look at the voltages, they’re very concerning to my eyes at least:

12V Voltage [+24.480 V]
5V Voltage [+10.200 V ]
3.3V Voltage [+4.080 V]
CPU Core0 Voltage [+2.264 V]
CPU Core1 Voltage [+2.264 V]
CPU VSOC Voltage [+2.264 V]
CPU VDDIO [+2.264 V]
VDD_11_S3 / MC Voltage [+4.080 V]
All the voltage values are displayed in red, which I interpret to mean “Something’s not right here!”

PSU is Seasonic Prime TX-1600. The board has two 12V cpu power connectors, and I have both of them connected to the “CPU / PCIE” section of the PSU.

Some of these voltages are double what it seems like they should be. That’s why I tried disconnecting either of the 12V CPU power connections. But if either is disconnected, the CPU won’t initialize even though other devices power on.

I hadn’t changed anything in the overclocking settings - it was all set on “auto”.

I first noticed this after adding an NVME drive to the #3 M.2 slot.

If I ignore it all the system runs, but I’m uncomfortable going ahead with such warnings, and also it’s annoying to have to go into setup every time I boot.

Does anybody have any ideas?"

Link to the original post:

CPU Over Voltage Error! New CPU installed! : threadripper (reddit.com)

Welp, I’m back.

Been using the PC for a few weeks without much issue. I’ve had a single lock-up when coming back to my computer after it had been idle for 12 hours, but other than that no issues.

UNTIL NOW.

So I’m doing some multi-gpu AI training on my machine, and to connect the GPUs together it uses a rendezvous backend so that the GPUs can sync information with each other. This puts a little bit of bus overhead into the system with the GPUs sending some syncing information back and forth, on the order of a few hundred MB/s up to maybe a GB/s of data transfer.

The problem is that while this is happening, my computer is noticeably chugging - the CPU looks nearly completely free, but the slow SSD transfer speed problem comes back, which makes my whole computer god awful slow. Downloads, programs, etc… everything is slow. I assume that the bus is somehow just overloaded.

I’ve tried training with smaller datasets so that the amount of information being transferred between GPUs is super small - only a few MB/s, which seems like it should be no problem - but the issue persists. No matter the scale of the information syncing between GPUs, it slows the SSD speed down to 70MB/s(ish). This leads me to think that this is a surmountable problem, and that the issue is somewhere in the BIOS/software/firmware/drivers. Something fixable, because scaling down the overhead doesn’t seem to make a difference.

Anyway - I’m really hoping for some driver updates, windows updates, bios updates, chipset updates. It doesn’t seem like this should be happening.

I have more experimenting to do, but wanted to mention that there is definitely something weird going on.

This caught my attention. Is this the first time you’ve had it lock up while idle? I’ve had that happen a few times already, but with Linux. This is the first time I noticed a Windows user report a similar issue. Was it like: displays refuse to wake from power save mode, can’t ping the machine, etc, but the IPMI thinks everything’s just fine?

(I can’t really comment on your I/O issues because I built my system for the cores and memory channels rather than the connectivity, so I have very few devices on the PCIe bus.)

My Q-code display on the motherboard shows AA, as if the machine is running fine. All of my screens are black when idle (I have a blank screensaver come on after 20 minutes, but I don’t shut off the displays), but I assume that whatever was on screen would have been frozen. The keyboard Num Lock and Caps Lock keys are not responsive (the lights don’t change on the keyboard), so I assume it’s just a system hang, or at least some sort of interrupt or bus issue.

In the past, when I was setting up the machine, I had lots of these system hangs related to plugging in a monitor (it was having trouble initializing multiple monitors and would sometimes hang while I was getting a new monitor plugged in). It makes me scared to unplug any of my monitors or plug a new monitor in.

Update on my bus issue:

I couldn’t figure out Windows, so I went to WSL2. I can train on all 6 GPUs there and not have any slowdown in SSD transfer speeds or the rest of my computer. I guess I’ll just be running most of my training in Linux now. In Linux the method for rendezvous between GPUs is different (NCCL instead of GLOO); I don’t know if that’s the difference, or if there’s just something wrong with Windows.
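
For context, the GLOO/NCCL difference I’m talking about is just the backend string passed when the training script sets up its process group - a rough PyTorch sketch of the idea (my actual launch code differs):

```python
# Rough sketch of the backend choice: Windows builds of PyTorch end up on GLOO,
# while on Linux/WSL2 the GPUs can rendezvous over NCCL. Not my actual training code.
import os
import torch
import torch.distributed as dist

def init_distributed(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    backend = "nccl" if dist.is_nccl_available() else "gloo"
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)   # one GPU per process
```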

2 Likes