How to troubleshoot multiple Nvidia GPUs failing in same way in single system?

System

Asrock X470 Taichi, Ryzen 3900X, 2x16 GB 3200 MHz RAM, EVGA 1000 G3 PSU, dual boot of Windows 10 & Ubuntu 19.10.

GPUs are 1080 Ti from EVGA: SC2 & Black edition models.
Only other PCIE device is an EVGA Nu Audio card.

Symptom

The video out of the GPU will stop working as soon as the desktop appears (or would appear) after booting, irrespective of O/S. The only way to recover is to hard reset or power off (Windows) or use the magic SysRq keys (Linux).
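(For anyone unfamiliar with the SysRq recovery mentioned above: on many distros the magic SysRq keys are partially or fully disabled by default and must be enabled via sysctl first. A minimal sketch — the paths are standard Linux, but the default bitmask and whether you want *all* functions enabled varies by distro, so check yours:)

```shell
# Check the current SysRq bitmask (0 = disabled, 1 = all functions enabled,
# other values enable only a subset of functions)
cat /proc/sys/kernel/sysrq

# Enable all SysRq functions for the current boot (requires root)
echo 1 | sudo tee /proc/sys/kernel/sysrq

# Make the setting persistent across reboots
echo "kernel.sysrq = 1" | sudo tee /etc/sysctl.d/90-sysrq.conf

# With a dead display: Alt+SysRq, then R, E, I, S, U, B (a second or two
# apart) terminates processes, syncs disks, remounts filesystems read-only,
# and reboots -- far gentler on the filesystems than holding the power button.
```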

When it happens

At its worst, as soon as the desktop appears. But it does not start out this severe — it progressively worsens. Prior to this level of severity, the desktop will randomly lose the video out after having run for some time, forcing a hard reset / shutdown. Then the crashes occur more frequently, becoming common. System logs indicate a DWM crash in Windows.

What it happens with

This occurs with different known working 1080 Tis that work fine in other systems. That is, take them out of a good system, put them in the problematic system, and then after a few days, the problem starts appearing, and gets progressively worse until the desktop never displays because the GPU crashes so quickly.

No device in the system is overclocked (not RAM, CPU, GPU, or GPU RAM).

The first time this occurred, I assumed it was the GPU. I took the GPU out of the problematic system and put it into another known working system, and the GPU did output to the display, but the image was scrambled. So I took out the GPU and put it to one side for a week. Then when another GPU started having the exact same problem, I realized it must be something in the system itself that was causing it. When I retested the original GPU that first showed the problems in the problematic system, it worked just fine… until today, when it crashed at random again.

How to troubleshoot?

What I have tried: downgrading the Asrock BIOS from 1.0.0.4 Patch B to 1.0.0.3 ABB. It makes no difference. I’m fairly certain it’s not a GPU driver issue, given that it happens under different driver versions (the Windows and Linux drivers are different versions) and the GPUs are fine in other systems with the same driver.

Causes I can think of:

  1. Power supply unit itself delivering bad power output
  2. PSU cable
  3. Motherboard
  4. The PCIE audio card somehow causing electrical problems or incompatibilities on the PCIE bus (?)

Is there anything else I need to think of to troubleshoot this? Given the symptoms, is any one of these causes more likely than the others? Is it probably an electrical problem? Is it possible that in the course of troubleshooting the problem, the GPU could be permanently damaged if I don’t figure it out quickly enough?

Thanks in advance.

@damonlynch, Hi, and welcome to the Level1Techs forums. Sorry to say I don’t know “The Answer”, but thought I’d offer to discuss the problem and act as a sounding board.

Request for Info
You give a nice, clear description of the problem - which sounds quite challenging. I would like to ask a few questions to fill out the picture. You say that recently “it worked just fine… until today” - how long was that? (In guesstimated hours of operation, and in calendar days.) And typically, how long has it taken to progress from little or no problem to immediate crashing?

When you try a test setup, how long will it take to get a sense for whether there is improvement?

Does the system usually run 24/7 (apart from crashes), or is it routinely shut down?

Thoughts on Troubleshooting
Your list of possible causes seems pretty thorough. Power supplies have a reputation for causing strange problems, and I see you have that at the top of the list. I would add the CPU as another, not-too-likely, candidate. I invite you to consider some of the following ideas for gathering more data, hopefully leading to a solution. Sounds like you already have some of these in mind.

  • While the problem is severe, shut down & unplug the system for “a long time” - 8 hrs? 24 hrs? Make no other changes. When powered up again, does the problem temporarily clear up, like it would if the GPU had been swapped?

  • While the problem is severe, boot into the UEFI rather than an OS. Leave the system running the UEFI for long enough that you expect the system to crash. Does it crash, or is UEFI operation okay?

  • Simplify the system for testing.

    • Remove the audio card.
    • After saving or writing down the UEFI settings, clear CMOS & leave UEFI settings as near default as feasible - even without XMP.
    • Consider removing some or all of the SSD & HDD drives.
    • Consider removing all but a minimum of RAM.
    • As a desperation measure, consider removing components from the case and setting up a “test bench” layout. (This takes some care to avoid static electricity & short circuits which could cause damage, but you sound able to do the research and take the appropriate steps. Ask here if in doubt.)
  • Boot from a USB stick with memory test software (e.g. Memtest86), and run long memory tests. Does the system crash? Are memory errors reported?

  • Boot from a USB stick with a Live Linux distro, and run the system from that. Does the problem recur?

  • Try swapping components, most readily available (or least expensive) first. While your current components seem reputable, introduce a change by using another make/model when practical.

    • GPU - perhaps a low-end, low-power card.
    • Power supply & cables - 500W or 650W ought to be enough, especially for test.
    • CPU - maybe a 3600 or other less-demanding unit.
    • Motherboard - but we hope things don’t go this far…
  • When you find a stable configuration, gradually work back toward the desired configuration. Test stability after each change (or a few small changes). If the problem comes back, you may be able to identify what tips the balance.

Hopefully, you don’t have to go through all those steps, which could be pretty time-consuming. Maybe the next test you try will reveal The Answer (well, it could happen…).

Good luck, and please let us know how the testing goes. Come on back if you’d like to discuss further. And maybe some Real Experts on the forum will notice your post and chime in.

Try safe mode; does it still crash then?
What do your temps look like?

Your card might be dying, or maybe you just need to reflash the GPU’s BIOS; they get corrupted sometimes.

Hey, that kind of resembles a situation I still have. For complicated reasons I can’t dismantle the system (liquid cooled), but at least the board (X99 Classified) has DIP switches, so the one card which was constantly crapping out got disabled. IMHO DIP switches to disable slots should be mandatory on any board with more than one slot.

Anyway, when I run the signal via the 1st 1080 Ti (identical EVGA FE) I get a blank screen, but… after waiting about 2-3 minutes (RAID card + expander) I can hear the W7 startup sound being played. At first it was just an intermittent problem with video, then it totally degraded: the system was booting with a Vista-style low-res loading screen. Safe mode also does not work. Even during POST the fonts were distorted. In the BIOS it was at most 480p of total mess and pixels. Now I get no screen signal at all (on all outputs). The 2nd card works just fine. Very weird, and I need another system first to take the mantle… to get to the bottom of this.

In truth I’m already a year and a bit overdue with cleaning this up, but I got around the problem by using the machine no longer than actually required. Maybe this year I’ll finally solve this puzzle.

That sounds to me like an actually defective card.
You might want to contact evga for an RMA.

Hi everyone, thanks for your comments and suggestions! To clarify, the system runs 24/7. My most recent experience with it shows that it can take up to 5 days for the problem to appear, but once it does, the interval between the crashes shortens, so it happens more and more frequently until it happens all the time (i.e. as soon as it starts). When I say “up to 5 days”, it did that once. More typical is 2 days (e.g. swap in another 1080 Ti, and it will take about 2 days for the crashes to return).

UEFI settings are very close to default, apart from enabling XMP & virtualization, disabling the onboard audio, and changing boot order.

It has never appeared while the system is in the UEFI, when running a plain terminal without X11 / Wayland (e.g. Ubuntu system recovery mode), or when running memtest86. It only appears when the O/S has shown (or is about to show) the GUI desktop. I haven’t yet tried a live Linux USB when it’s in the state of crashing all the time. (A live distro would be using Nouveau, which would be yet another variable to think about — worth a try I suppose.)

Has there ever been a case where an HDD or an SSD would cause a problem like this? While I could remove these drives one by one, it’s not something I’d be happy about. The system has 4x HDD and 1x SSD, and all of them are required for what I use the PC for (working from home).

After more than 2 weeks of this problem, thus far I have definitively ruled out the audio card and motherboard. When I asked EVGA what the problem could be, the tech I spoke with was emphatic that it was the motherboard. Turns out that’s wrong.

I’m currently using memtest86 to run a lengthy test. I guess it will take around 8 hours to complete 4 full passes. It’s error-free after 2.

That leaves the PSU and CPU. I’ll swap in an EVGA 750 G2 for the EVGA 1000 G3 after the memory test completes. If that also fails to resolve it, that leaves the CPU as a possibility I guess (assuming the HDD/SSDs are not the culprit).

As for swapping in a weaker Nvidia card, I could do that. But ultimately I’d like this system to be able to work with a 1080 Ti class card, so that’s what I’m aiming for. If it worked with a slower card, while I’d be relieved it worked, I’d be unhappy with the arrangement.

I’m thankful I’ve never overclocked any component in this system! Quite apart from preserving the CPU warranty, if it had been overclocked it may have been even more unstable.

I really hope I can figure out what the problem is soon enough, because it’s a productivity killer. :cry:

@damonlynch Wow, sorry to hear you are still seeing this strange problem.

Do you know whether the OS continues to operate “normally”, just without display? Or does the OS crash / get stuck? For instance, run Linux and set it up to allow SSH login from another machine. Then, when the display goes out, can you still log in and use the command line? Can you access all the disk drives? (Or do something similar under Windows, but I am less familiar with the tools.)
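One way to set that up on the Linux side (a hedged sketch: `user` and `problem-box` are placeholder names, and the package name assumes Ubuntu — adjust for your distro):

```shell
# On the problem machine, while the desktop still works:
sudo apt install openssh-server
sudo systemctl enable --now ssh

# From a second machine, after the display dies:
if ssh user@problem-box 'uptime; dmesg | tail -n 30; df -h'; then
    echo "Kernel and disks respond: likely a display/GPU-only failure"
else
    echo "No SSH response: the whole system may have hung"
fi
```

If `dmesg` still answers at the moment the screen goes dark, whatever it printed last is probably the most valuable clue in this whole exercise.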

I wouldn’t expect an HDD or SSD to cause this sort of problem, but the behavior seems so bizarre it is hard to rule anything out.

Oh, what about checking the log files, particularly around the time of failure? I admit that logs can be voluminous and confusing, but something might stand out.
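To make the logs less overwhelming, it helps to bracket the failure window rather than read everything. A sketch of the idea (the `journalctl` flags are standard systemd; the sample log lines below are purely illustrative, not from the affected machine):

```shell
# On systemd distros, kernel messages from the previous (crashed) boot,
# warnings and worse only:
#   journalctl -k -b -1 -p warning

# The same window-filtering idea, shown self-contained on a sample log.
# These lines are invented for illustration:
cat > /tmp/sample.log <<'EOF'
09:50:01 kernel: usb 1-1: new full-speed USB device
09:58:12 kernel: NVRM: Xid 79, GPU has fallen off the bus.
10:01:30 kernel: snd_hda_intel: azx_get_response timeout
EOF

# Keep only entries between 09:55 and 10:05, i.e. around the failure time:
awk -F: '$1":"$2 >= "09:55" && $1":"$2 <= "10:05"' /tmp/sample.log
```

With NVIDIA cards specifically, any `NVRM: Xid` line in the kernel log around the failure is worth noting down; the Xid number categorizes the fault.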

My idea of substituting another GPU is just to gather info, not as a permanent solution. Whether that does or doesn’t relieve the problem, it seems like a useful piece of information - though unfortunately not definitive. A couple of more ways to introduce change, to see if there is any effect, come to mind:

  • Underclock the CPU by, say, 10%; maybe it is on the ragged edge of failing.
  • Plug the GPU into another suitable motherboard slot.

I’d be inclined to try swapping out the PSU before either of those ideas, though. (When you do, use the cables for the other PSU. As you may know, it isn’t safe to assume they are the same, even from the same manufacturer.) Say, you might try:

  • Replace or swap power cables to the GPU. (Oooh, I like that… I see it is on your original list, too.)

Are you running any out-of-the-ordinary software that might play a role? Does it seem to matter whether the system is under load or just idling at the desktop?

I wish you luck, and that you can cling to sanity in the face of this maddening issue. :slightly_smiling_face:

When you say it starts after 2 days and progressively worsens, it makes me think temps.
I wonder if the 1080 Ti heat-soaks in some way.
Have you re-pasted the cards? How old are they? Have they done a lot of work?
It might be worth a try, and think about airflow in your case.

Given the cards are OK in another machine, I’d say your issues are possibly

  • windows install related
  • power supply related
  • motherboard PCIe slot related (damaged slot(s), defective manufacture, etc.)
  • some sort of short/electrical issue caused by the weight of the card(s) vs. flex in the slot/motherboard, etc. check nothing is likely to contact/short anything.
  • Motherboard UEFI related (i doubt this as otherwise plenty of people would have issues with such a popular board).

… in rough order of likelihood.

Try to isolate each one of those things to confirm which component is the root cause.

Heat? I’m not convinced. Most cards will just down-clock rather than crash these days, but that’s also a possibility. And it should be obvious if the card(s) are running hot in that box.

Thanks for your suggestions, everyone.

In the end I’ve narrowed it down to two causes: a failing PSU cable, or a failing EVGA powerlink, which I had previously overlooked as a potential culprit.

Having changed both the PSU cable and removed the powerlink, and having had exactly zero crashes since, it’s definitely one of those two things.

I’ve not been fastidious about figuring out which it is because (1) the powerlink is outside of its warranty and (2) having a stable system once again is so very much welcome I don’t want to mess with anything!

I suspect the problem is with the powerlink, because a little searching online turned up similar symptoms from a few other powerlink users.

So the #1 lesson I take away from this is: don’t install additional things related to electrical power unless you really have to. The fewer things to go wrong, the better. An EVGA powerlink is more of a vanity object than something that genuinely helps the system perform better.

The #2 lesson is that troubleshooting problems of this nature is not easy.

The #3 lesson is that while I’m not in a position to be able to afford a Puget-type pre-built system, I can now understand better why some organizations and small business owners do find value in them. Paying the big price premium upfront that Puget & others command could be well worth it when it comes to troubleshooting problems of this kind, should you need it. For me, having built this system myself, solving this problem was not only costly in terms of overall time, but especially disruptive in the sense that it would constantly interrupt my work.

