New Ryzen 2700x build, significant GPU stability problems

So I’m at my wits’ end with my most recent build and looking for a second set of eyes to see if I’m on the right path.

I bought the following parts for an upgrade to my gaming/eventual KVM rig:

Ryzen 2700x
32GB of G.Skill DDR4-3200 RAM (2x16GB), running at 2133 until I figure out the GPU issue
ASRock X470 Taichi
Radeon RX580
GTX1080 (from my previous build, rock solid previously)
RX Vega64 (when all the GPU issues kept showing up with 580+1080)
New PSU (1kW gold, EVGA I believe)

The issue it’s running into is GPU driver crashes; AMD and NVIDIA cards exhibit the same behavior. I’ve tested single cards, two cards, different PCIe slots, and different OSs.

RAM has passed Memtest86 with a clean bill of health after 3 full cycles.

Windows 10 RTM and 1809 exhibit the same issue: as soon as you stress the GPU, be it with Firefox or games, the display drivers crash.

Each OS attempt has used a fresh format and reinstall as well.

I’ve run Prime95 CPU stress tests for 2+ hours at a time, with normal thermals throughout.

I tried running a GPU stress test from a Xubuntu live USB this morning to rule out Windows entirely; it hard locked as well, inside 15 seconds of FurMark.

At this point, I’m down to it being either the motherboard or the CPU, but I’m not sure which one is the root cause. I’m tempted to lean towards the motherboard, since the instability seems to follow PCIe (bus or power) loading (single 580 is more stable than 580+1080, which is more stable than 580+Vega, which is more stable than Vega+1080).

Any suggestions you all might have would be greatly appreciated.

How exactly does the system fail? Instant reboot or freeze or… Anything on the screen?

If your USB is persistent check the logs in /var/log
Kernel and syslog may have something of value.
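
If scrolling through those logs by hand gets tedious, something like this rough Python sketch can pull out the lines worth a look. The keyword list and the file names (`kern.log`, `syslog`) are my assumptions about a typical Ubuntu-style install, not anything specific to this system, so adjust them to whatever your distro and drivers actually write.

```python
import re
from pathlib import Path

# Assumed keywords: GPU driver names, PCIe error reporting, NVMe, generic resets.
PATTERN = re.compile(
    r"amdgpu|nvidia|nvrm|pcieport|\baer\b|nvme|timeout|reset",
    re.IGNORECASE,
)

def suspicious_lines(lines):
    """Return the lines that mention any of the assumed keywords."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]

def scan_logs(log_dir="/var/log", names=("kern.log", "syslog")):
    """Scan the named log files, skipping any that are missing or unreadable."""
    hits = {}
    for name in names:
        path = Path(log_dir) / name
        try:
            with path.open(errors="replace") as handle:
                hits[name] = suspicious_lines(handle)
        except OSError:
            continue  # file absent or needs root; skip it
    return hits
```

Calling `scan_logs()` returns a dict of file name to matching lines; the last few entries before the lockup are usually the interesting ones.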

In Windows, it’ll be a string of “driver failed and recovered” messages, progressively getting worse until a full hard lock of the system.

I’ll have to check the logs on the stick to see if anything pops out at me, but it’s a hard lock: no mouse/keyboard responsiveness, and I can’t get out to the shell behind the GUI.

For both Windows and Linux, it’s a hard lock requiring a forced power-off by holding the button down.

Have you tried running one stick of memory at a time?
Memtest hasn’t always been accurate in my experience.

Is it a new motherboard? I had the same thing happen building mining rigs. Are you getting ‘system interrupts’ in the task manager? If interrupts overload, the system will lock up. It had nothing to do with the cards, RAM, or chip. I’m not sure there’s a fix. What worked for me when I needed to get a rig back up was to use a driver removal tool (all drivers), and reinstall the driver from scratch with only one card in (I’d try the 1080 first, then introduce the AMD card; RX 580s gave me the most trouble).

I can swap in RAM from the Intel build that’s being replaced, but I don’t think bad RAM would give me the same behavior across three operating systems: GPU instability once I start pushing the PCIe lanes hard.

As long as the GPU(s) aren’t in use, or the drivers for 3D acceleration aren’t loaded, the system is stable from what I’ve seen…memory and CPU stress tests are clean as a whistle.

Brand new board, yup…going from a 7700k to a 2700x for better KVM capabilities to pass through a GPU.

I do get slightly better stability with a single card vs multiple cards, but I still end up seeing issues with GPU drivers crashing/recovering in Windows…we’re talking every few hours with just the 580, every 45 minutes or so with just the Vega64 or 1080, every 10 to 15 minutes with 1080+Vega64 or 580+1080, or every 5 to 10 minutes with the 580+Vega64.
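
To put those intervals on one scale, here is a quick back-of-the-envelope sketch that converts each one to crashes per hour using the midpoint of the range. The figures are the ones quoted above, except that "every few hours" is not a number, so the 120 to 240 minute range for the single 580 is purely an assumption.

```python
# Crash intervals in minutes as (low, high), from the figures quoted above.
# The single-580 range is an assumption standing in for "every few hours".
INTERVALS = {
    "580": (120, 240),
    "Vega64": (45, 45),
    "1080": (45, 45),
    "1080+Vega64": (10, 15),
    "580+1080": (10, 15),
    "580+Vega64": (5, 10),
}

def crashes_per_hour(low, high):
    """Convert an interval range in minutes to a rate, using its midpoint."""
    midpoint = (low + high) / 2
    return 60 / midpoint

def ranked(intervals):
    """Return (configuration, rate) pairs, least stable first."""
    rates = {name: crashes_per_hour(*rng) for name, rng in intervals.items()}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)

for name, rate in ranked(INTERVALS):
    print(f"{name:12s} {rate:5.2f} crashes/hour")
```

On those numbers the 580+Vega64 pair comes out at about 8 crashes an hour, against well under one an hour for the 580 on its own.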

Thinking back on it, it does seem to crash the drivers of the card in the first x16 slot, with the 2nd slot being slightly less flaky.

Ultimate goal of the system is ideally to have the 1080 for Linux, with the Vega64 for KVM passthrough to a Windows VM.

…I blame this all on Linus and Wendell for showing me an even better ‘yo dog’ virtualization thing

I have a 2700X with X470 taichi with dual vegas and have no stability issues.

I’d suggest you maybe have a hardware problem with the motherboard (PCIe slot related perhaps, as you note one slot is less dodgy than the other).

You did remember to plug at least an 8-pin (can’t remember exactly) CPU power connector into the motherboard (above the CPU-ish when in a tower case)?

Yup, the 2nd motherboard power jack is plugged in.

That’s the way I was leaning as well, though…I just wanted to get some second opinions before I have to rip the whole thing apart to exchange the motherboard.

Since it’s a local purchase (Microcenter is handy), I won’t have any problems with that part, thankfully.

Should I go for a straight up replacement to another Taichi X470, or is there another board people like as well?

I’m curious how you’re installing drivers. Are you installing them one at a time or all in one go? Even with a faulty PCIe lane, you should be able to get a 1080 stable on a fresh OS install with updated drivers from NVIDIA. Otherwise, you would not have been able to run a Xubuntu stress test.

I think it’s a driver problem and likely a Windows driver problem. Try a complete reinstall, only one GPU, and only the driver for that GPU.

For each permutation of testing, I did a reformat, installed the chipset drivers, rebooted, and then installed the GPU drivers, followed by another reboot. After that, I’d install Firefox, set it to playing YouTube videos (most consistent crash trigger) and then play a game on the second monitor. The crash times from earlier are all based on that set of testing…it’s been a long weekend and week so far of trying to isolate the issue.

I’m happy with my Taichi, but the Fatal1ty is almost identical, and supposedly the ASUS ROG board is great. And memory support is much better on it.

Mentioning just in case: with Radeon graphics on a Ryzen system, don’t make a habit of installing the graphics drivers using the “Clean” mode, or be prepared to install the chipset drivers once again after that.

Too many cards, too many monitors, too many OSs…

Start with one OS, one GPU, one monitor. In Windows.
Clean format, no Win Auto install, manual install of drivers; as in manual, the kind you download in advance, on your own.

That done and all things stock (including monitor in case it can “OC”), run Unigine; free, demanding but not Furmark levels, Windows friendly.
Does the issue persist?

Swapped out the board for an Asus Crosshair VII and I’m running a pair of SSDs and the Vega by itself under Windows 10 v1803. So far, so good…I’ve got it baking with FurMark and Prime95 right now and nothing’s burped yet. I’ll do more load testing in the morning, but it’s been more stable in the hour of run time so far, with both benchmarks running simultaneously.

Ok, so there was a burp after I said that.

It’s a defective CPU. It’s sounding like the segfault bug cropped up on your CPU.

It could also be stock voltages aren’t stable.

In any case, if the memory really is fine and memtest is the only thing it passes, your CPU is definitely at fault. It’s RMA time.

Latest update…after rebooting last night, I set up Heaven running in a loop while Firefox is playing two video streams on the GPU as well. So far, so good…8 hours into it without a burp from the GPU. If it keeps running like this for another couple of hours, I’ll start adding storage back, run another round of testing and, if that passes as well, add in the other peripherals.

Added the drives back to the mix and now it’s having GPU crashes again. The Crosshair VII has a slightly different implementation of the M.2 slots, though, and this time storage device resets started popping up when the GPU crashed.

I’ve moved the M.2 SSD to the 2nd M.2 slot to see if it’s happier there, and so far it’s back to benchmarking cleanly.

Update

The GPU drivers failed again with another storage event, which followed the SSD to the port on the board it got moved to. Looks like I’ve got an M.2 SSD that doesn’t like this system on my hands.
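
For anyone chasing the same thing, a minimal sketch of how to check whether GPU driver resets and storage resets really line up in time, assuming you have copied the two sets of timestamps out of the event log by hand. The function names, the 30 second window and the sample timestamps are all made up for illustration; none of them come from my actual logs.

```python
from datetime import datetime, timedelta

def parse(stamp):
    """Parse a 'YYYY-MM-DD HH:MM:SS' timestamp string."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")

def correlate(gpu_events, storage_events, window_seconds=30):
    """Return (gpu_time, storage_time) pairs that fall within the window."""
    window = timedelta(seconds=window_seconds)
    pairs = []
    for gpu_time in gpu_events:
        for storage_time in storage_events:
            if abs(gpu_time - storage_time) <= window:
                pairs.append((gpu_time, storage_time))
    return pairs

# Made-up timestamps, purely to show the shape of the input.
gpu = [parse("2019-01-01 10:00:05"), parse("2019-01-01 11:30:00")]
storage = [parse("2019-01-01 10:00:00"), parse("2019-01-01 12:45:10")]

for gpu_time, storage_time in correlate(gpu, storage):
    print(f"GPU reset at {gpu_time} next to storage reset at {storage_time}")
```

If most GPU resets have a storage reset sitting right next to them, that points at the drive or its slot rather than the graphics card.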

I had similar troubles with my ASRock B450 Pro4: any in-game GPU activity (FurMark was fine) would cause issues either immediately or after some time (max 5 minutes). What I did not expect was that the new NVMe drive I got might be causing this trouble. I removed it, and so far GTA V has not crashed on me after 5 minutes, which is quite an improvement.

CPU is the Ryzen 3 2200G, by the way, integrated graphics turned off in UEFI.

Edit: put the NVMe SSD back in, and the problem has returned. Motherboard issue or NVMe drive issue?

Didn’t JayzTwoCents have the same issue on his system? I believe it was due to a bad cable.