AMD Threadripper 3970X under heavy AVX2 load: Defective design? (No, but there is an issue)

FranzB · April 28, 2020, 7:13am

I would say, unless you get a clear notification from Prime95 about a fatal numerical error, it’s not the same issue.

DrDocumentum · April 28, 2020, 8:20pm

@maximlevitsky, @FranzB. Yesterday I tested under the latest Manjaro Linux (with the new stock 5.6 kernel) and can confirm that Prime95 (mprime) is working fine on my Aorus Master (BIOS F5c), as it did under Windows 10/Server 2019. I can also confirm that PBO works fine under the new Linux release. However, ACPI errors are still present in the Linux boot messages. I also discovered ACPI “15” errors in the Windows Event Viewer.

maximlevitsky · April 29, 2020, 7:25am

These ACPI errors?

[ +0.000001] ACPI BIOS Error (bug): Failure creating named object [_SB.I2CC.WT4C], AE_ALREADY_EXISTS (20200326/dswload2-326)
[ +0.000036] ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20200326/psobject-220)
[

This should be harmless according to my investigations
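For anyone who wants to check whether their own boot log contains only this kind of message, here is a minimal sketch. It assumes the status codes always start with `AE_`, as in the two lines quoted above; the function name `count_acpi_status_codes` is made up for illustration and is not part of any tool.

```python
import re
from collections import Counter

# Matches both "ACPI BIOS Error (bug): ..." and "ACPI Error: ..." lines.
ACPI_LINE = re.compile(r"ACPI (?:BIOS )?Error")
# Assumption: status codes look like AE_ALREADY_EXISTS (prefix "AE_").
STATUS = re.compile(r"\bAE_[A-Z_]+\b")


def count_acpi_status_codes(log_text):
    """Count AE_* status codes on lines that report an ACPI error."""
    counts = Counter()
    for line in log_text.splitlines():
        if ACPI_LINE.search(line):
            counts.update(STATUS.findall(line))
    return counts
```

Feeding it the text of the kernel log (for example, saved output of `dmesg`) gives a per-code tally, so any status other than `AE_ALREADY_EXISTS` stands out.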

FranzB · April 29, 2020, 7:41am

These “ACPI 15” errors are quite interesting, I get them too.

I’ve been wondering in the past few weeks if they were related to the following issue:

Occasionally, a fifth blank menu entry in the Power menu will appear (blank in the sense that it’s missing its label).

In the screenshot below, I don’t have the problem:

However what will happen sometimes (maybe in the first few minutes after boot?) is that there will be a fifth blank menu entry, and the display of that entry is causing the ACPI 15 errors (maybe because it’s looking for the label in some database and that label is missing?)

EDIT: In my case the “ACPI 15” errors are those:

The description for Event ID 56 from source Application Popup cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event: 

ACPI
15

The message resource is present but the message was not found in the message table

DrDocumentum · April 29, 2020, 1:07pm

Yes, I have the same error on my Linux boot log

DrDocumentum · April 29, 2020, 1:12pm

Yes that is the error I have in the event viewer.
I believe all 3rd-gen TR systems have this error. Prior to the Aorus I had an ASRock TRX40 with the same errors, and I returned it to the dealer as it was unstable and randomly dropping memory modules. The Aorus has been rock-solid stable, apart from those ACPI errors, which have no visible impact.

terroreek · April 30, 2020, 2:16am

I get the same ACPI errors as well, running Debian sid with Linux kernel 5.6.7; everything is stable. The only issue is that audio over S/PDIF doesn’t work, but I believe that’s being addressed.

chris719 · June 5, 2020, 4:40am

Any thoughts if this is fixed on TRX40 Aorus Master or it only affected the Xtreme?

I’m debating between ASUS TRX40 Prime Pro S and TRX40 Aorus Master. The Master has more features but I really just need a solid board and ASUS is usually pretty good in the BIOS department.

chris719 · June 10, 2020, 4:42am

I have similar ACPI errors in my kernel log on my company’s Dell PowerEdge R6515 Epyc 7002P server also on Ubuntu 19.10. I think it’s just a platform thing, I don’t see any negative impact.

DimaB · January 4, 2021, 7:36pm

Hi,
I have problems similar to yours. What was the verdict regarding what causes the crashes? The discussion seems to bounce around different motherboards, and I got a bit confused as to whether it is the CPU, memory or motherboard …

Here is what I have and how crashes occur.

Motherboard: Gigabyte Aorus TRX40 Xtreme rev 1.0
BIOS: F4l (released on 2020/09/25)
CPU: AMD Threadripper 3990x
RAM: 8x32GB, ECC unbuffered Micron/Crucial MTA18ASF4G72AZ-3G2B
PSU: 1500w, be quiet! Dark Power Pro 12
OS: openSuSE Leap 15.2, kernel 5.3

Crash: black screen after compiling Linux kernel 5.10.2 in parallel on 64 cores (make -j 64)

The system works fine when used under regular activity - editing files, browsing (Firefox up to a dozen open tabs), etc

When I compile the kernel on a single core, with all files compiled sequentially, everything is fine too, though it takes several hours.

When I tried to compile the kernel in parallel, loading 64 cores (make -j 64), the system crashed with a black screen within 1-2 minutes. The temperatures of the CPU (CCDs, IO chiplets) look OK; they never go above 65-70C up to the moment of the crash. That is reproducible. The system log shows lots of “machine errors” (“mce” errors in the log file), essentially one error per core (the system sees 128 cores).
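To check the "one error per core" impression, the mce lines can be tallied per CPU. This is only a sketch under an assumption about the log format: it expects lines containing `mce: [Hardware Error]: CPU <n>:`, which is how the kernel usually prints machine check records, but the pattern may need adjusting for a given distribution or kernel. The function name `mce_events_per_cpu` is hypothetical.

```python
import re
from collections import Counter

# Assumption: machine check lines contain "mce: [Hardware Error]: CPU <n>:".
# Adjust this pattern if your log uses a different layout.
MCE_CPU = re.compile(r"mce: \[Hardware Error\]: CPU (\d+):")


def mce_events_per_cpu(log_text):
    """Return a Counter mapping CPU number -> number of mce lines seen."""
    counts = Counter()
    for line in log_text.splitlines():
        match = MCE_CPU.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

If the result has roughly one entry for every logical CPU, the fault is probably not tied to a single core or CCD; if only a few CPU numbers show up, that points in a different direction.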

I downloaded memtest86-Pro to check the memory sticks and the story is similar. When I run the test on a single CPU, all memory checks are OK (it took about 98 hours). When I try to run memtest86 in parallel, utilizing all CPUs, I get lots of errors. The worst part is that memtest86 hangs after that.
I also removed all memory sticks except one and checked the sticks one at a time. And again, when I do the checks in parallel, memtest either hangs and becomes unresponsive, or reports errors and then becomes unresponsive. I did this with 4 sticks.
It is hard to believe that so many sticks are bad, so either there is an issue with the CPU or motherboard/BIOS.

Also, I am confident that CPU heating is not an issue: it has good air cooling, and besides what the sensors report, the air blown out by the CPU cooler is not hot. In contrast, my old HP Z820 Xeon workstation can work as a heater in winter.

I would greatly appreciate any input as to what, in your opinion, the culprit is.

I have seen dead DIMMs and CPUs in the past but nothing like that. Ideally I would love to have a dual socket EPYC with lots of RAM, but it is in a different price range. Also, seeing what is going on with the current setup makes me wary of getting my hands on EPYCs if there is a high rate of “bad apples” in AMD CPUs.

Thank you and best regards,
Dima.

zlynx · January 4, 2021, 9:16pm

I had a similar experience with a Ryzen 3900X and 3,600 MHz RAM. Everything was fine until I loaded up all the cores.

I never proved it but I think it needed an increase in the SoC voltage. Be careful if you increase that. Don’t go too far.

I’ve never tried anything in this but it seems legit: https://www.techpowerup.com/review/amd-ryzen-threadripper-3000-overclocking-deep-dive-asus-rog-zenith-ii-extreme/6.html

I know you are not overclocking but the BIOS default settings may not be enough. And also, you did check for BIOS updates? You might also try contacting Gigabyte support.

DimaB · January 4, 2021, 9:28pm

Hi,
thank you for the reply. The f4l BIOS version is the latest listed on Gigabyte’s website. I will post the same question there too.

I will try to play with the SoC voltages once I have done some reading, so as not to burn the CPU accidentally.

The problem with Gigabyte boards like the Aorus Xtreme is that they were designed for gamers, and in that world most of the load is on the GPU. So, my guess is that they did not really test full CPU load, especially for the 280W 3990X. And, yes, I am not overclocking it. The only BIOS settings that were changed are those related to virtualization.

I also noticed that the metal plate below the CPU that covers the M.2 drives became very hot during the Memtest86 run.

Trooper_ish · January 4, 2021, 9:39pm

If one wanted to test whether the voltage is sufficient, as @zlynx mentions, one might try down-clocking the CPU while keeping the current voltage, and see if the problem persists?

It should be the same (test wise) as raising voltage with same clocks?

JugsOfHolyness · January 15, 2021, 10:56am

Hello, I am scouring the internet to hopefully find out why my Threadripper is BSOD’ing on me with the stop code “DPC Watchdog Violation”

OS - Windows 10, x64, Full retail
3 months old hardware
CPU - 3960x
GPU - Zotac 3090
MotherBoard - Asrock TRX40 Taichi
Power Supply - Corsair AX1000
Desktop

Here is a rough timeline of what happened:

On September 24th the 3090 came out and I built a completely brand new PC with it
After 2 days of setting up Windows 10 exactly how I like it, I try installing the Astro C40 controller software but it hangs
After hanging it eventually BSODs my entire system
Giving up after another day of trying to use the C40, I start to use my PC controller-less
Over the next 2 months my PC would sometimes BSOD randomly
November 28th my PC BSOD’ed one final time, after which the main Sabrent SSD (the one Windows was on) never showed up in the BIOS again (the second SSD full of video games is still ok)
December 2nd I pop in my brand new 980 Pro SSD and attempt to rebuild all the data I lost from the SSD failure
December 8th I rebuild (not recover, REBUILD) most of my lost data and get Windows almost back to where I want it to be
From December 2nd-December 9th I run a game called FF11, leaving it on at one point for 3 days straight without shutting the game down and I never get any BSOD or crashes
December 9th Cyberpunk comes out, I see that it has the “fully supports controller” tag on Steam
I go into Cyberpunk and mess with the graphics settings
I close Cyberpunk while trying to troubleshoot why it won’t show up in my streaming software
I update my 3090 Nvidia drivers for the Cyberpunk launch day drivers
I download and install the Astro C40 software
I run the Astro C40 software
The program hangs
I get a horrible feeling in the pit of my stomach
My entire PC BSODs
PC restarts and I immediately uninstall the Astro C40 software
I run Cyberpunk
PC BSODs
I restart my PC and run Cyberpunk again and everything is fine and dandy
I stream Cyberpunk every day since December 10th at 8 hours a day with no BSOD
December 16th I run Cyberpunk and it hangs at game boot up. I try to Alt-Tab but nothing happens. I immediately recognize this as a sign of me BSODing. I BSOD 1 minute later
I come on here asking for help
One would assume Cyberpunk is the cause of all of this, but as my timeline shows, I have encountered this wayyyy before Cyberpunk was released
My PC randomly BSOD’s from this point on whenever I launch Cyberpunk, and it always happens during the initial boot up of the game

mutation666 · January 15, 2021, 11:11am

Try now, if not i can bump your TL again

JugsOfHolyness · January 15, 2021, 11:46am

ty. I edited my post with the links to all 5 of my minidumps

One thing to note:

From September 28th-November 28th I used the full retail latest version of Windows 10
During that time my PC BSOD’ed multiple times, with November 28th being the final one before all of my data was lost permanently
From December 2nd-present I use an older version of Windows (1903), because I was trying to rule out Windows updates as the root of the issue (I still BSOD, so it is not)

C6H6 · January 18, 2021, 10:45am

Have you updated your drivers and stuff? Do you overclock? A quick Google search showed that it could be the SSD driver or the Nvidia driver. Also, I don’t know if this helps, but try running Windows in a VM under Linux and passing through the GPU.

JugsOfHolyness · January 20, 2021, 10:36pm

I don’t OC.

As for the SSD drivers, is that something I download from the manufacturer of the SSD or Windows Update?

Cuz I didn’t realize SSD had drivers from manufacturers except for encryption or data migration

C6H6 · January 21, 2021, 1:18am

Try this link :

nbartowski · March 18, 2021, 12:50pm

Does anyone know if the AVX2 heavy load problem has been fixed in the Gigabyte TRX40 boards, either by the new revision 1.1 or by the new BIOS release in February?