Do I need to RMA my GPU?

The_Drugs · July 26, 2021, 5:58am

Hello all! Apologies in advance for the long post, but I’m at the end of my knowledge on this. As some background, I purchased a 6900XT open box from Microcenter some weeks back. The card was bad out of the box, so I RMA’d it with Asus, and they sent back the card I have now. When I first got the card back, I started having issues, though the issues have changed throughout testing. I’ve tried to keep a good record of the troubleshooting I’ve done, so I’ll list everything in chronological order below.

tl;dr
I have tried a number of troubleshooting tests to figure out what’s going on with my system, details at the bottom. The GPU crashes under stress tests, underperforms other systems with similar specs in games, and occasionally suffers stuttering and latency issues. I’m not sure if this is a GPU issue or a system issue, though I believe the issues lie with the GPU.

To start, I initially installed it in my normal case, an ITX case that uses a riser. Once everything was installed, I immediately had huge stuttering and latency issues, which I assumed were caused by the riser. I moved everything to a test bench and had the same issues, ruling out the riser.

Taking a minute to describe the issue in a bit more depth, games would slowly start to “lag”, or run behind. The performance of the card would slowly drop, and audio would start to crackle and distort, then get stretched and slow down to the point where it was inaudible. This wasn’t isolated to a specific game, but certain games were worse than others. Cyberpunk took about 2 minutes to become unplayable, Monster Hunter World took about 10-15, and Doom Eternal remained playable throughout, though the audio issues and small lags were noticeable the whole time. Specifically, driving around in Cyberpunk at 1440p high/ultra settings, I would start around 90-100fps, then drop to under 30. Monster Hunter would drop from 200+ fps to under 80fps (Ruiner Nergigante event quest, in his little arena), and Doom would run at around 140-160 fps. Looking at power monitoring, the card was self-reporting lower than expected power draw, often in the 80-140w range as opposed to the expected 275w. Measuring with my Kill-a-watt confirmed power draw was low as well. Temperatures were well under control, often in the mid 60s to low 70s, though sometimes going as high as 80c. Benchmarks/stress tests would complete, but with hamstrung performance. I tried both WHQL and latest drivers, with a DDU driver reinstall for each, no changes.

After some googling, some folks said a Windows reinstall could help, so that was my next step. At first it seemed like it helped, but further testing confirmed the problem was still there, though to a much lesser extent. Now games were playable, though performance was still below what I would expect to see. Doom had no audio crackling issues or performance issues, though the performance was a bit lower than expected. MHW was running slower than expected, but again, no issues. Cyberpunk continued to have issues, but they were much more manageable than before.

I continued to troubleshoot, with the first step being to check if, for some reason, PCIe 3.0 vs. 4.0 might be causing some issues. I had set the card to run in 3.0 mode to avoid issues with my riser, and kept it like that when moving back to the bench. I tried flipping it back to 4.0, no changes. Next, I thought there might be some issues with the GPU not getting enough power, since this is all running on an SF600 600w PSU. I had never seen power draw go above 460w on my Kill-a-watt, but you never know. I tried an EVGA 1000w as well as a Seasonic 1250w, and both still had the performance issues. I know both of these units are good for the wattage, so I ruled out the PSU. At this point, following some other forum posts, I tried testing my latency with LatencyMon. It showed some serious latency problems, confirming there was an issue. This system had previously worked fine for years, so I didn’t expect any issues with hardware there, but I did check both of my SSDs to make sure they weren’t failing on me or throwing up issues, and they’re both fine. As for the other components, memory and CPU are both running at stock, automatic speeds, and the memory passed memtest86 overnight, so no issues there.
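As a rough sanity check on the PSU reasoning above, here is a minimal Python sketch of the headroom arithmetic. The 460w wall reading and the 600w rating come from the post. The 90% efficiency figure is purely an assumption for illustration, not a measured or quoted value for the SF600, and the function names are made up for this example.

```python
def dc_load_from_wall(wall_watts: float, efficiency: float) -> float:
    """Estimate the DC load on the PSU from a wall-meter reading.

    A wall meter measures AC input, so the DC load the PSU actually
    delivers is the wall reading multiplied by the PSU efficiency.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return wall_watts * efficiency


def headroom(psu_rating_watts: float, wall_watts: float, efficiency: float) -> float:
    """Remaining DC capacity relative to the PSU's rated output."""
    return psu_rating_watts - dc_load_from_wall(wall_watts, efficiency)


PEAK_WALL_W = 460          # highest wall reading mentioned in the post
PSU_RATING_W = 600         # SF600 rated output
ASSUMED_EFFICIENCY = 0.90  # assumption, not a measured value

if __name__ == "__main__":
    load = dc_load_from_wall(PEAK_WALL_W, ASSUMED_EFFICIENCY)
    spare = headroom(PSU_RATING_W, PEAK_WALL_W, ASSUMED_EFFICIENCY)
    print(f"estimated DC load: {load:.0f} W, headroom: {spare:.0f} W")
```

Under that assumption the sustained load sits well inside the rating. A wall meter will likely not show very short power spikes, though, so swapping in the larger PSUs was still the more convincing test.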

Continuing to troubleshoot, new drivers came out for the GPU. I figured it was worth a shot, and they have eliminated a good bit of the stuttering and audio issues. After updating drivers, the stuttering seemed to be gone, though performance remained an issue. There are many games where even at 4K, the GPU isn’t getting fully loaded, power draw is low, clock speed is low, etc. Everything “works”, but at anywhere from a 20-40% performance penalty. I wanted to quantify this, so I ran Time Spy for an easy number to compare to other systems. I don’t have any modern games with built-in benchmarks, and getting specifics on gameplay for things like MHW and Doom Eternal can be difficult since most people don’t report what level they’re testing on. The test went fine, though my system was in the bottom 7% of all similar configurations. I chalked that up to it being an uncommon configuration, and the fact that my memory was running quite slow. I re-enabled some known-good memory settings, and things improved, moving up to the 60th percentile. I wanted to confirm that everything was ok, so I ran the Time Spy stress test and my system black screened. I figured it was probably memory being finicky, maybe I set a timing wrong, so I reset to stock and ran again, which also ended in a black screen. The system wouldn’t display anything after rebooting, so I had to boot to safe mode and uninstall drivers, but I was able to recover the system.

At this point, I was starting to suspect memory, or god forbid motherboard or CPU issues. I had the opportunity to test in two separate systems, though testing was a bit limited by time and what we had available. The first system I tested on was an old X58 system, less than ideal. It had a X5650 clocked to 4.4ghz though, and we were able to run at 4K, which helped to focus the stress on the GPU. In that system, everything ran fine, even on extended tests! Though, we weren’t able to test every game, just MHW and Doom. I was also able to test on a friend’s system, a 3900x running with a decent overclock. Everything was fine there too, and ran like a dream. No issues with stuttering, power draw, or other issues in either Cyberpunk or Doom. At this point, I assumed it had to be my memory, since “it’s always memory”. I borrowed some ram, known-good, from my bud, then went home and tried it. I had the same issues, ruling that out.

I was interested to see if I could replicate the crashing under load, and with either RAM kit, both at stock and at XMP/known-good timings, the system will black screen during stress tests and can’t load the display when rebooting. Interestingly, I went to bed after one crash, and when I tried to start the computer in the morning, the display loaded up without the need to go in and fiddle in safe mode, which further complicates things.

Looking back on the two systems we tested with, I’m concerned the X58 might not have been loading the 6900XT enough, and we didn’t really test for an extended period of time on my friend’s computer because of a lack of time. I don’t have a second system on hand I can test with here, though I’m working on an unrelated X58 project that should be ready here soon. I can use that to troubleshoot the GPU, though I still have concerns that the CPU can’t keep up, even at 4K.

Now, I find myself with a question of what is going wrong. At this point, I think it’s the GPU. I’ve ruled out the PSU, riser, and RAM, and previously the system never gave me any issues. However, with the GPU seemingly working fine in two other systems, I’m left wondering if something isn’t wrong with my system. I don’t have another card with comparable power requirements or performance to the 6900XT to test with, but everything works fine with my trusty old 290. The inverse is true as well, I don’t have a system close to my 3700x to test the GPU with, the closest I can get is a clocked to snot X58 + X5650. I’m looking for any advice or anything I might have missed here, I don’t want to pull the trigger on another RMA if that isn’t the issue, though I will if that’s what needs to happen.

I’ll include detailed specs and a line by line of all the things I’ve tested below, but please let me know if you have any questions! I have some Afterburner recordings/grabs of the power issues saved somewhere, if memory serves, and I could grab some more as well if needed.

My System

CPU - 3700x run at stock throughout
Mobo - Gigabyte B550I Aorus Pro AX
RAM - Crucial Ballistix Max 4000MHz 2x8GB kit run at “stock” - whatever the motherboard decided that is, as well as a “safe” 3600 CL16 setting
Testing RAM - Corsair Vengeance 3200MHz 2x8GB kit run at “stock”, as well as XMP
GPU - Asus 6900XT TUF
PSU - Corsair SF600 600w PSU
Testing PSU - EVGA G1 1000w, Seasonic 1250-X
SSDs - Intel 660P 1TB NVME, Silicon Power 1TB NVME

X58 Test System

CPU - X5650 Clocked to 4.4ghz, known stable OC
Mobo - EVGA X58 Classified Rev1 westmere modded for X5650, known-good
RAM - 12gb Corsair 1600 clocked at 14XXmhz, due to FSB OC
SSD - Generic Samsung 256gb
PSU - EVGA G2 1000w

Buddy’s System

CPU - 3900x known good OC on this, not sure the settings
Mobo - Gigabyte X570I Aorus Pro Wifi
Ram - 2x8gb G.Skill Trident Z Royal 3200mhz not sure of the settings on this, but known-good
PSU - Silverstone SFX SX750
that’s all I know for this one, but I know it works fine with his overclocked 3080 as well

Troubleshooting tests

tested on bench, without riser
Driver reinstall with DDU
WHQL drivers, reinstalled with DDU
LatencyMon tests
Clean windows install
PCI-e 3.0, PCI-e 4.0
Disabled onboard audio, used external DAC instead
two separate PSU tests, both known-good
tested in X58 system
tested in Buddy’s system
Testing with known-good RAM

Any help would be great! I would LOVE to have just missed something obvious.

GigaBusterEXE · July 26, 2021, 8:21am

Does the card have a BIOS switch? Make sure it’s in the performance BIOS, not the silent BIOS, although you don’t want to have it in LN2 mode if it has that.

Did you try adjusting the power slider in MSI afterburner or overclocking at all? First thing I do is always crank that power slider as it will usually boost higher or longer even without manually overclocking

What are the temps like? It could be throttling. The stock fan curve is garbage, and on the 5000 and 6000 cards I have noticed that if you set a 100% manual fan speed it will slowly drop over time. With a fan curve in MSI Afterburner you can band-aid this issue, as it constantly resends the fan speed target.

I usually have it set to 65c for 100% but I don’t really care about fan noise

MetalizeYourBrain · July 26, 2021, 9:47am

Seeing all the extensive tests you went through, I’d say it’s the motherboard that might be defective at this point. Maybe it’s a power issue through the PCIe slot, or maybe some traces weren’t done absolutely properly between the layers and it’s showing issues for that reason (PCIe 4 is VERY sensitive). Or even the VRM overheating and cutting power to the CPU.
Everything is connected through it so it could influence every component deeply.

Besides the motherboard, I’d say look at the CPU and make sure it’s running stable, but it’s a long shot since you didn’t mention anything about it. Install the R9 290 you’ve got back into the system and do some extensive stress tests.
If the CPU is working fine with the R9 290 I’d say try to RMA the motherboard and see how it goes.

I don’t think it’s the GPU just for the fact that you tested it in a 3900x system and it worked like a charm. That would not happen if something was wrong with it.

This is a very tricky issue to diagnose, but you went very deep and ruled out lots of potential issues. Just go the rest of the way and see what happens. Good luck and keep us updated!

The_Drugs · July 27, 2021, 2:59am

@GigaBusterEXE

Yes, I made sure it was set to performance, though I think I only tried the silent bios once. I could try that again too, I suppose.

I’m glad you mentioned that. Early on, when I thought the PSU might be the culprit, I tried a basic undervolt, lowering the core voltage to 1150mV and setting the power slider to +10%. I also tried maxing the slider to 15%, all else stock, with no changes good or bad. Sorry, I forgot to mention that since I wasn’t taking notes at the time.

Highest I’ve seen during testing is 84c, with the junction at 102c. Typically though, mid to high 70s. That said, I wanted to rule out thermal throttling, so I’ve typically been running with the fans locked at 100% with Afterburner to keep it from being an issue.

@MetalizeYourBrain
That’s my thought as well, unfortunately. I’ll try and set the 290 up in the next day or two and leave it running stress tests for a few hours, see what that gets us. I’ll hopefully have the X58 ready to go soon too, surely I can find a game that will keep the 6900XT loaded without crushing the poor x5650 at 4K, so hopefully that can help too.

This is an interesting point, I think. Before this, I was running a 5700XT and never had any issues with the card, so I wouldn’t have expected PCIe 4 to be a problem. I’ve not manhandled the motherboard or GPU, so I don’t think I’ve broken anything, but if it really is that sensitive, that could be the issue, I suppose.

That brings up another point I failed to mention - the issues get worse on the riser, and there’s even some very slight stuttering, though it’s barely noticeable. More than anything, the performance is noticeably lower, for no good reason, I even compared to pcie3.0 directly on the board to rule out any fringe cases of pcie4 providing an advantage. I figured it had to do with a dying power delivery on the GPU and the riser being just enough to push it over, though I suppose it could also be the added interference or “load”, so to speak, on the pcie slot. Any thoughts on that?

GigaBusterEXE · July 27, 2021, 3:17am

I suspect it might have something to do with X570 having a PCIe 4 uhhhh what’s the word? It takes the signal and reamplifies it, it’s the reason why X570 is hotter than B550.
Your system is B550 and your buddy’s is X570, so maybe that’s why he had no issues.

The_Drugs · July 27, 2021, 3:34am

I’m not familiar with that at all, I’ll do some googling though and see if I can figure out what that’s all about and see if it’s caused issues for anyone else, thanks for the heads up!

MazeFrame · July 27, 2021, 6:29am

Like @MetalizeYourBrain mentioned, sounds like a motherboard issue.

Temperature wise, you could use something like HWInfo to log to file, then you have some numbers to look at after the crash.
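For anyone wanting to do this, here is a minimal Python sketch of pulling the last few samples before a crash out of a sensor log saved as CSV. The column names and the demo rows below are made up for illustration (only the 84c / 102c peak figures echo numbers from this thread), and the function name is hypothetical. The assumption is simply that the log has one header row followed by numeric sample rows. Real logs may use different column names and a different text encoding.

```python
import csv
import io


def last_samples(csv_text, wanted, n=5):
    """Return the last n numeric samples for columns matching `wanted`.

    `wanted` is a list of case-insensitive substrings to look for in the
    header row. Rows that do not parse as numbers in the selected columns
    (for example a repeated header or a truncated final line written
    during a crash) are skipped.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header = [h.strip() for h in rows[0]]
    picks = [
        i for i, name in enumerate(header)
        if any(w.lower() in name.lower() for w in wanted)
    ]
    if not picks:
        return []
    samples = []
    for row in rows[1:]:
        try:
            samples.append({header[i]: float(row[i]) for i in picks})
        except (ValueError, IndexError):
            continue  # not a clean numeric sample row
    return samples[-n:]


# Made-up demo log; the column names are assumptions, not a real export.
DEMO_LOG = """Date,Time,GPU Temperature [C],GPU Hot Spot [C],GPU Power [W]
26.7.2021,20:00:01,71,88,250
26.7.2021,20:00:03,78,97,262
26.7.2021,20:00:05,84,102,268
Date,Time,GPU Temperature [C],GPU Hot Spot [C],GPU Power [W]
"""

if __name__ == "__main__":
    for sample in last_samples(DEMO_LOG, ["temperature", "hot spot"], n=2):
        print(sample)
```

To use it on a real file, read the file into a string first and pass the column name fragments that show up in your own log's header.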

MetalizeYourBrain · July 27, 2021, 2:54pm

Reading this makes me doubt my first attempt at pinpointing the issue. I’m still pretty sure it’s the motherboard, but instead of being due to the traces in the motherboard (which I’m sure you handled with care, there’s not even a need to mention it) there might be some RF interference that’s being generated by the 6900XT somehow, and it might resonate enough to be picked up by the traces on your particular motherboard. So it might be a shielding problem at this point, but still a problem with the motherboard.

And this kinda supports my theory because the longer you make an “antenna” the worse the problem will get.

But what’s still flawed in my reasoning is that you’re encountering this problem ONLY on your motherboard and not on your friend’s one or your X58 one. Is it because Gigabyte skimped on shielding?

Since everything is solid-state on new devices, they’re either working or dead. There’s no in-between like swollen old caps that are about to blow up.

My tl;dr is that it might be an interference issue. To verify this, try to lock the framerate in game (high framerates cause coil whine, hence higher switching frequency) and see if it makes a difference. If everything plays fine at 60fps, we might be onto something.
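One way to make the capped vs. uncapped comparison less subjective is to record frame times for both runs and compare a few simple numbers. Below is a minimal Python sketch of that idea. The frame-time lists are made-up demo data, the function name is hypothetical, and the "more than 2x the median frame time" stutter threshold is an arbitrary assumption, not an established standard.

```python
import statistics


def summarize(frame_times_ms):
    """Summarize a run from its per-frame render times in milliseconds."""
    if not frame_times_ms:
        raise ValueError("need at least one frame time")
    median = statistics.median(frame_times_ms)
    # Average fps over the whole run: frames rendered / seconds elapsed.
    avg_fps = 1000.0 * len(frame_times_ms) / sum(frame_times_ms)
    # Count frames that took more than twice the median (assumed threshold).
    stutters = sum(1 for t in frame_times_ms if t > 2 * median)
    return {"avg_fps": avg_fps, "median_ms": median, "stutters": stutters}


# Made-up demo data: a steady 60fps-capped run and an uncapped run with one hitch.
CAPPED_MS = [16.7] * 10
UNCAPPED_MS = [7.0] * 9 + [40.0]

if __name__ == "__main__":
    for label, run in (("capped", CAPPED_MS), ("uncapped", UNCAPPED_MS)):
        stats = summarize(run)
        print(f"{label}: {stats['avg_fps']:.1f} fps avg, "
              f"{stats['stutters']} stutter frame(s)")
```

If the capped run shows clearly fewer stutter frames than the uncapped one across several attempts, that would be some evidence for the theory, though a frame cap also lowers GPU load and power draw, so it would not isolate interference on its own.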

P.S. I might be mistaken, but those Aorus boards seem to exhibit issues more frequently than other boards. Might I suggest switching to another manufacturer, if you can return this one?

The_Drugs · July 31, 2021, 9:41pm

@MetalizeYourBrain @MazeFrame

I have an update on our motherboard concerns. I was able to borrow a motherboard from a friend, another Gigabyte, but known to be working well. This time it’s an Aorus X470 Gaming 7. We know the board works fine: before my friend switched to a new board, he had run multiple CPUs and GPUs in it with no issues. I’m having the same problems as before, no changes.

I realize I didn’t detail the issues I’ve been having with stress testing, so here’s a rundown. Essentially, with either motherboard, each with either RAM set, RAM at auto, 2133, XMP, or safe known-good timings, the behavior is the same. When launching the Time Spy benchmark, the system will have some small hitches and flickering, which is whatever. Once the test starts to run, the system will hitch, flicker, and stutter, and will often have a display driver crash, which it can recover from. Start the test again, and everything typically works fine (though sometimes it will crash during the test). After the test, however, the system will typically crash when trying to submit to compare results. If it doesn’t crash, then there will be artifacting, flickering, crashing displays, displays not being recognized, etc. The behavior is very similar to an unstable overclock on older GPUs, where the card had enough power to be just stable enough, mostly, to not crash, but it was obvious something was wrong.

I had one or two runs testing on the new board where there were no issues with the benchmark, but the system would crash within the first 5 loops of the Time Spy stress test. Either way, it seems this sets it off. To troubleshoot on the new board, I made sure chipset drivers and others were installed correctly, checked Windows for issues, tried both WHQL and beta drivers, and made sure to uninstall with DDU. I also tried the various memory configs mentioned earlier, no luck either.

The crashes are typically black screens leading to the system restarting, where nothing will show on the display. I have to manually power cycle a handful of times until I get to the Windows recovery menu, then I can boot into safe mode, where I have to uninstall display drivers (DDU). Then I can get into Windows again. One of the times it reset, the system mentioned it was checking my C: drive for issues, though it didn’t find any. I checked the disk health again in CrystalDiskInfo just to be sure, no issues reported. I also ran some commands in the Windows command line to check (I can’t remember which right now, sorry) but there were no issues reported. Further, I don’t think it’s the SSD, since the system works fine with other GPUs and the issues only show when the GPU is loaded.

Finally, I managed to capture some videos of the issues I’m having, which might be helpful. Watching them, it looks like a gpu issue, but I’m really not sure. With switching the motherboard out now though, we’ve really narrowed it down to either a cpu or gpu issue, though I have a hard time believing the CPU is failing. There aren’t issues with other configurations, and the cpu has never been overclocked or undervolted, I haven’t even enabled PBO, at least I’ve not flipped any of the manual switches in bios. I’ll drop the videos below!

Feel free to mute videos, there’s no talking, just a fair bit of background noise

1 - https://youtu.be/0KtEcJhIoeI
2 - https://youtu.be/JebMccOhbJo
3 - https://youtu.be/T-vw8hjlaqs
4 - https://youtu.be/yP805zvfnLU

MetalizeYourBrain · August 1, 2021, 7:58am

I reviewed a couple of videos and read through your post. I’m kinda busy atm so I’ll get back to you on this post soon.

At first glance looks like memory on the GPU is toast.

The_Drugs · August 4, 2021, 3:16am

Another small update. The card is still crashing in Time Spy and other stress tests and benchmarks, but it can run Doom Eternal just fine, and can run MHW, though not great.

I’ve uploaded a few Afterburner captures to demonstrate the issue. Audio stuttering is gone, though there were a few stutters and lags in MHW.

Both at 1440p, RAM at 3600 CL 18 (known safe)
Doom - Ultra nightmare, ray tracing enabled, Doom Hunter Base
MHW - all settings maxed, Volume Rendering disabled, AA set to FXAA only, Hoarfrost Reach, hunting a Viper Tobi-Kadachi in a four-man SOS.

system · May 4, 2022, 9:16pm

This topic was automatically closed 273 days after the last reply. New replies are no longer allowed.