AMD Threadripper 3970X under heavy AVX2 load: Defective design? (No, but there is an issue)

I would say, unless you get a clear notification from Prime95 about a fatal numerical error, it’s not the same issue.

@maximlevitsky, @FranzB. Yesterday I tested under the latest Manjaro Linux (with the new stock kernel 5.6) and can confirm that Prime95 (mprime) works fine on my Aorus Master (BIOS F5c), as it did under Windows 10/Server 2019. I can also confirm that PBO works fine under the new Linux release. However, ACPI errors are still present in the Linux boot messages. I also discovered ACPI “15” errors in the Windows Event Viewer.

These ACPI errors?

[ +0.000001] ACPI BIOS Error (bug): Failure creating named object [_SB.I2CC.WT4C], AE_ALREADY_EXISTS (20200326/dswload2-326)
[ +0.000036] ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20200326/psobject-220)
[

This should be harmless according to my investigations
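For anyone who wants to check their own boot log for the same messages, here is a minimal sketch that tallies ACPI error lines by status code and collects the named objects involved. The regular expressions are an assumption inferred only from the two lines quoted above, and `summarize_acpi_errors` is a hypothetical helper name, not part of any tool.

```python
import re
from collections import Counter

# Pattern inferred only from the two log lines quoted above (an assumption,
# not a complete grammar for ACPI messages).
ACPI_ERROR = re.compile(
    r"ACPI (?:BIOS )?Error(?P<body>.*?)\b(?P<code>AE_[A-Z_]+)"
)
NAMED_OBJECT = re.compile(r"\[([^\]]+)\]")


def summarize_acpi_errors(lines):
    """Return (Counter of AE_* codes, set of named objects) for ACPI error lines."""
    codes = Counter()
    objects = set()
    for line in lines:
        match = ACPI_ERROR.search(line)
        if match is None:
            continue  # not an ACPI error line
        codes[match.group("code")] += 1
        # Look for "[object]" only in the message text, so the bracketed
        # dmesg timestamp at the start of the line is not picked up.
        named = NAMED_OBJECT.search(match.group("body"))
        if named:
            objects.add(named.group(1))
    return codes, objects


# The two complete lines from the post above.
SAMPLE_LINES = [
    "[ +0.000001] ACPI BIOS Error (bug): Failure creating named object "
    "[_SB.I2CC.WT4C], AE_ALREADY_EXISTS (20200326/dswload2-326)",
    "[ +0.000036] ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog "
    "(20200326/psobject-220)",
]

if __name__ == "__main__":
    codes, objects = summarize_acpi_errors(SAMPLE_LINES)
    print(codes)
    print(sorted(objects))
```

On a live system the input could be the saved output of `dmesg` split into lines. If the same handful of objects shows up on every boot and nothing else misbehaves, that would be consistent with the "harmless" reading, though the script itself cannot prove that.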

These “ACPI 15” errors are quite interesting, I get them too.

I’ve been wondering in the past few weeks if they were related to the following issue:

Occasionally, a fifth blank menu entry in the Power menu will appear (blank in the sense that it’s missing its label).

In the screenshot below, I don’t have the problem:

image

However what will happen sometimes (maybe in the first few minutes after boot?) is that there will be a fifth blank menu entry, and the display of that entry is causing the ACPI 15 errors (maybe because it’s looking for the label in some database and that label is missing?)

EDIT: In my case the “ACPI 15” errors are those:

The description for Event ID 56 from source Application Popup cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event: 

ACPI
15

The message resource is present but the message was not found in the message table

Yes, I have the same error on my Linux boot log

Yes that is the error I have in the event viewer.
I believe all 3rd-gen TR systems have this error. Prior to the Aorus I had an ASRock TRX40 with the same error, and I returned it to the dealer as it was unstable and dropping memory modules randomly. The Aorus has been rock-solid stable, apart from those ACPI errors, which have no visible impact.

I get the same ACPI errors as well, running Debian sid with Linux kernel 5.6.7; everything is stable. The only issue is that audio over S/PDIF doesn’t work, but I believe that’s being addressed.

Any thoughts if this is fixed on TRX40 Aorus Master or it only affected the Xtreme?

I’m debating between ASUS TRX40 Prime Pro S and TRX40 Aorus Master. The Master has more features but I really just need a solid board and ASUS is usually pretty good in the BIOS department.

I have similar ACPI errors in my kernel log on my company’s Dell PowerEdge R6515 Epyc 7002P server also on Ubuntu 19.10. I think it’s just a platform thing, I don’t see any negative impact.

Hi,
I have problems similar to yours. What was the verdict on what causes the crashes? The discussion seems to bounce around different motherboards, and I got a bit confused about whether it is the CPU, memory, or motherboard …

Here is what I have and how crashes occur.

Motherboard: Gigabyte Aorus TRX40 Xtreme rev 1.0
BIOS: F4l (released on 2020/09/25)
CPU: AMD Threadripper 3990X
RAM: 8x32GB, ECC unbuffered Micron/Crucial MTA18ASF4G72AZ-3G2B
PSU: 1500W, be quiet! Dark Power Pro 12
OS: openSUSE Leap 15.2, kernel 5.3

Crash: black screen after compiling Linux kernel 5.10.2 in parallel on 64 cores (make -j 64)

The system works fine when used under regular activity - editing files, browsing (Firefox up to a dozen open tabs), etc

When I compile the kernel on a single core, with all files compiled sequentially, everything is fine too, though it takes several hours.

When I tried to compile the kernel in parallel, loading 64 cores (make -j 64), the system crashed with a black screen within 1-2 minutes. The temperatures of the CPU (CCDs, I/O chiplets) look OK; they never go above 65-70C up to the moment of the crash. This is reproducible. The system log shows lots of “machine errors” (“mce” errors in the log file), essentially one error per core (the system sees 128 cores).
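To back up the "one error per core" observation with numbers, a minimal sketch like the following could count “mce” lines per CPU in a saved copy of the system log. The line format is an assumption: kernel machine-check messages differ between versions and drivers, so the regular expression may need adjusting. The sample lines are hypothetical, written for illustration, and are not taken from my log.

```python
import re
from collections import Counter

# Assumed shape: the line mentions "mce" and later "CPU <n>" or "CPU:<n>".
# Real kernel output varies, so treat this pattern as a starting point.
CPU_IN_MCE = re.compile(r"\bmce\b.*?\bCPU[ :]*(\d+)", re.IGNORECASE)


def mce_events_per_cpu(lines):
    """Count machine-check log lines per logical CPU number."""
    counts = Counter()
    for line in lines:
        match = CPU_IN_MCE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Hypothetical sample lines for illustration only.
SAMPLE_LINES = [
    "mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000000000108",
    "mce: [Hardware Error]: CPU 17: Machine Check: 0 Bank 5: bea0000000000108",
    "mce: [Hardware Error]: CPU 17: Machine Check: 0 Bank 1: bea0000000000108",
    "usb 1-1: new high-speed USB device number 2 using xhci_hcd",
]

if __name__ == "__main__":
    counts = mce_events_per_cpu(SAMPLE_LINES)
    print("CPUs reporting errors:", len(counts))
    for cpu, n in sorted(counts.items()):
        print(f"CPU {cpu}: {n} event(s)")
```

If the number of distinct CPUs in the result is close to the 128 the system sees, the errors are not confined to one or two cores, which would fit a shared cause better than a single bad core. That interpretation is my guess, not a diagnosis.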

I downloaded memtest86-Pro to check the memory sticks, and the story is similar. When I run the test on a single CPU, all memory checks are OK (it took about 98 hours). When I try to run memtest86 in parallel, utilizing all CPUs, I get lots of errors. The worst part is that memtest86 hangs after that.
I also removed all memory sticks except one and checked the sticks one after another. Again, when I run the checks in parallel, memtest either hangs and becomes unresponsive, or reports errors and then becomes unresponsive. I did this with 4 sticks.
It is hard to believe that so many sticks are bad, so either there is an issue with the CPU or with the motherboard/BIOS.

Also, I am confident that CPU heating is not an issue: it has good air cooling and, besides what the sensors report, the air blown out by the CPU cooler is not hot. In contrast, my old HP Z820 Xeon workstation can work as a heater in winter.

I would greatly appreciate any input as to what, in your opinion, is the culprit.

I have seen dead DIMMs and CPUs in the past, but nothing like this. Ideally I would love to have a dual-socket EPYC with lots of RAM, but it is in a different price range. Also, seeing what is going on with the current setup makes me wary of getting my hands on EPYCs if there is a high rate of “bad apples” among AMD CPUs.

Thank you and best regards,
Dima.

I had a similar experience with a Ryzen 3900X and 3,600 MHz RAM. Everything was fine until I loaded up all the cores.

I never proved it but I think it needed an increase in the SoC voltage. Be careful if you increase that. Don’t go too far.

I’ve never tried anything in this but it seems legit: https://www.techpowerup.com/review/amd-ryzen-threadripper-3000-overclocking-deep-dive-asus-rog-zenith-ii-extreme/6.html

I know you are not overclocking, but the BIOS default settings may not be enough. Also, did you check for BIOS updates? You might also try contacting Gigabyte support.

Hi,
thank you for the reply. The F4l BIOS version is the latest listed on Gigabyte’s website. I will post the same question there too.

I will try to play with SoC voltages after I do some reading, just so I don’t burn the CPU accidentally.

The problem with Gigabyte boards like the Aorus Xtreme is that they were designed for gamers, and in that world most of the load is on the GPU. So my guess is that they did not really test full CPU load, especially for the 280W 3990X. And yes, I am not overclocking it. The only BIOS settings that were changed are those related to virtualization.

I also noticed that the metal plate below the CPU that covers the M.2 drives became very hot during the Memtest86 run.

If one wanted to test whether the voltage is enough, as @zlynx mentions, one might try down-clocking the CPU while keeping the current voltage, and see if the problem persists?

It should be the same (test-wise) as raising the voltage at the same clocks?

Hello, I am scouring the internet to hopefully find out why my Threadripper is BSOD’ing on me with the stop code “DPC Watchdog Violation”.

  • OS - Windows 10, x64, Full retail
  • 3 months old hardware
  • CPU - 3960x
  • GPU - Zotac 3090
  • MotherBoard - Asrock TRX40 Taichi
  • Power Supply - Corsair AX1000
  • Desktop

Here is a rough timeline of what happened:

  • On September 24th the 3090 came out and I built a completely brand new PC with it
  • After 2 days of setting up Windows 10 exactly how I like it, I try installing the Astro C40 controller software but it hangs
  • After hanging it eventually BSODs my entire system
  • Giving up after another day of trying to use the C40, I start to use my PC controller-less
  • Over the next 2 months my PC would sometimes BSOD randomly
  • November 28th my PC BSOD’ed one final time, after which the main Sabrent SSD that Windows was on never showed up in the BIOS again (the second SSD full of video games is still OK)
  • December 2nd I pop in my brand new 980 Pro SSD and attempt to rebuild all the data I lost from the SSD failure
  • December 8th I rebuild (not recover, REBUILD) most of my lost data and get Windows almost back to where I want it to be
  • From December 2nd-December 9th I run a game called FF11, leaving it on at one point for 3 days straight without shutting the game down and I never get any BSOD or crashes
  • December 9th Cyberpunk comes out, I see that it has the “fully supports controller” tag on Steam
  • I go into Cyberpunk and mess with the graphics settings
  • I close Cyberpunk while trying to troubleshoot why it won’t show up in my streaming software
  • I update my 3090 Nvidia drivers for the Cyberpunk launch day drivers
  • I download and install the Astro C40 software
  • I run the Astro C40 software
  • The program hangs
  • I get a horrible feeling in the pit of my stomach
  • My entire PC BSODs
  • PC restarts and I immediately uninstall the Astro C40 software
  • I run Cyberpunk
  • PC BSODs
  • I restart my PC and run Cyberpunk again and everything is fine and dandy
  • I stream Cyberpunk every day since December 10th at 8 hours a day with no BSOD
  • December 16th I run Cyberpunk and it hangs at game boot up. I try to Alt-Tab but nothing happens. I immediately recognize this as a sign of me BSODing. I BSOD 1 minute later
  • I come on here asking for help
  • One would assume Cyberpunk is the cause of all of this, but as my timeline shows, I have encountered this wayyyy before Cyberpunk was released
  • My PC randomly BSODs from this point on whenever I launch Cyberpunk, and it always happens during the initial boot-up of the game

Try now; if not, I can bump your TL again

ty. I edited my post with the links to all 5 of my minidumps

One thing to note:

  • From September 28th-November 28th I used the full retail latest version of Windows 10
  • During that time my PC BSOD’ed multiple times, with November 28th being the final one before all of my data was lost permanently
  • From December 2nd-present I use an older version of Windows (1903), because I was trying to rule out Windows updates as the root of the issue (I still BSOD, so it is not)

Have you updated your drivers and such? Do you overclock? A quick Google search showed that it could be the SSD driver or the Nvidia driver. Also, I don’t know if this helps, but try running Windows in a VM under Linux and passing through the GPU.

I don’t OC.

As for the SSD drivers, is that something I download from the manufacturer of the SSD or Windows Update?

Cuz I didn’t realize SSDs had drivers from manufacturers except for encryption or data migration

Try this link :

Does anyone know if the AVX2 heavy load problem has been fixed in the Gigabyte TRX40 boards, either by the new revision 1.1 or by the new BIOS release in February?