AMD Threadripper 3970X under heavy AVX2 load: Defective design? (No, but there is an issue)

An update on our investigation with AMD: [SOLVED] 3970X - Prime95 stability?

It would still be nice if ASRock and MSI boards
could also be tested for this issue.
But there seem to be enough complaints for AMD
to take a close look at it.
And they seem to have figured out the issue with Gigabyte boards.

So what was the fix in the end? I don’t see it in the thread here.

Was it just a powerstage problem with the motherboard specifically?

@Jimster480 I’m preparing a conclusion post.


To add some information for anyone searching for it…

I had issues on one core with the 16k AVX2 Prime95 load when running the CPU at stock. Running PBO or auto OC made the problem go away.

I updated my BIOS (Zenith II Extreme Alpha) to 0902 last night, and it’s completely stable at stock now.

EDIT:
Looks like that BIOS update included “02. [Q][E] Update CastlePeakPI1.0.0.3 Patch B”

I apologize for the lack of recent updates on this topic. Obviously the current health crisis has not sped up the process.

On March 7th I wrote in this thread:

It turns out that is not the case, at least on my system: I recently switched back to the GIGABYTE TRX40 Aorus Xtreme motherboard (after a few weeks on the ASUS Zenith II Extreme Alpha) and the fact is that GIGABYTE’s latest BIOS version (“F4d”, AGESA 1.0.0.3 B) does not fix the instability under Prime95.

As can be witnessed in this thread, AMD has been extremely responsive and helpful. They do have a fix for the instability that works on my system, but either GIGABYTE screwed up when merging it into their F4d BIOS, or they introduced another issue.

That’s the current situation. At this point, and per my current understanding of the situation, I believe the pressure should be put on GIGABYTE, not AMD.

Ideally GIGABYTE would wake up, get in touch with us and join our conversation with AMD. Unfortunately there’s no sign that they’re willing to do that.

I’m personally done with switching motherboards. I’ve spent far too much time on this issue.

(I’m marking this topic as unsolved again.)


Hello @FranzB, I saw this thread, and since I have a 3960X with an Aorus Master board, I went ahead and tested Prime95.

I currently can’t reproduce any issue with Prime95 under Windows 10 or Server 2019 (I have a dual-boot system). My BIOS version is F5c.

However, I tried a live Ubuntu Linux 20.04 session and downloaded Prime95. I started the torture test, but after all threads had started I got a “killed” message in the terminal window. It doesn’t seem to be related to the issue you have; it may be an issue with using the Ubuntu live session.
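A bare “killed” in the terminal usually means the process received SIGKILL rather than Prime95 reporting a numerical error. One possible sender is the kernel’s out-of-memory killer, which is only a guess for this live session. When a test is launched from a script, the two cases can be told apart from the return code. This is a minimal Python sketch, assuming POSIX semantics where `subprocess` reports a negative return code for a signal-terminated child; `describe_exit` is a hypothetical helper name, not part of Prime95 or any tool mentioned here.

```python
import signal


def describe_exit(returncode):
    """Classify a subprocess return code (POSIX semantics)."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # subprocess reports -N when the child was terminated by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with error code {returncode}"
```

It would be used as `describe_exit(subprocess.run(cmd).returncode)`, where `cmd` is whatever command line starts the torture test on your system.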

Unfortunately, Ubuntu 20.04 runs kernel 5.4, and only the newer kernel 5.6 fully supports the Ryzen 3000 series power states and PBO. I can see that Ubuntu currently defaults to the lower P-state (2,200 MHz), so I decided not to install Ubuntu on one of my hard drives. I can also see a lot of ACPI errors under Linux on boot. (Oddly, it looks like the Ubuntu support declared on AMD’s CPU page is false.)
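For anyone who wants to check whether their cores are stuck at a low P-state, the current per-core frequency can be read from the Linux cpufreq sysfs files. This is a small Python sketch, assuming the cpufreq driver exposes `scaling_cur_freq` (reported in kHz) for each core; `read_cpu_freqs_mhz` is a hypothetical helper name, and the `base` parameter exists only so the function can be pointed at a different directory.

```python
from pathlib import Path


def read_cpu_freqs_mhz(base="/sys/devices/system/cpu"):
    """Return {cpu_index: current frequency in MHz} from cpufreq sysfs files."""
    freqs = {}
    # Matches cpu0, cpu1, ... but not sibling directories such as cpufreq/.
    for path in Path(base).glob("cpu[0-9]*/cpufreq/scaling_cur_freq"):
        cpu_name = path.parent.parent.name           # e.g. "cpu12"
        cpu_index = int(cpu_name[3:])                # "cpu12" -> 12
        freqs[cpu_index] = int(path.read_text().strip()) / 1000  # kHz -> MHz
    return freqs
```

If every value stays near the base clock while a load is running, boost is probably not being applied. On a system where the files are absent, the function simply returns an empty dict.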

I had the same issue on my TRX40 Designare: Prime95 seemed to work on Windows and was dying like that on Linux.
This was fixed by the latest firmware, F4C, I think.
EDIT: maybe not the same issue. I had some torture threads failing, like the problem that started this discussion.

I would say, unless you get a clear notification from Prime95 about a fatal numerical error, it’s not the same issue.


@maximlevitsky, @FranzB. Yesterday I tested under the latest Manjaro Linux (with the new stock kernel 5.6) and can confirm that Prime95 (mprime) works fine on my Aorus Master (BIOS F5c), as it did under Windows 10/Server 2019. I can also confirm that PBO works fine under the new Linux release. However, ACPI errors are still present in the Linux boot messages. I also discovered ACPI “15” errors in the Windows Event Viewer.

These ACPI errors?

[ +0.000001] ACPI BIOS Error (bug): Failure creating named object [_SB.I2CC.WT4C], AE_ALREADY_EXISTS (20200326/dswload2-326)
[ +0.000036] ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20200326/psobject-220)
[

This should be harmless according to my investigations.
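To compare boot logs between boards or BIOS versions, it can help to reduce the ACPI noise to a short list of the objects involved. This is a rough Python sketch that matches the “Failure creating named object” line format quoted above; the function name `summarize_acpi_bios_errors` is made up for illustration, and other ACPI message formats are ignored.

```python
import re
from collections import Counter

# Matches lines such as:
#   ACPI BIOS Error (bug): Failure creating named object [_SB.I2CC.WT4C], AE_ALREADY_EXISTS (...)
ACPI_BUG = re.compile(
    r"ACPI BIOS Error \(bug\): Failure creating named object "
    r"\[(?P<obj>[^\]]+)\], (?P<code>AE_\w+)"
)


def summarize_acpi_bios_errors(log_text):
    """Count (object name, ACPI status code) pairs found in kernel log text."""
    return Counter((m["obj"], m["code"]) for m in ACPI_BUG.finditer(log_text))
```

Feeding it the first line quoted above yields a single entry for `_SB.I2CC.WT4C` with `AE_ALREADY_EXISTS`. Two machines that produce the same summary are most likely hitting the same firmware table quirk.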

These “ACPI 15” errors are quite interesting, I get them too.

I’ve been wondering in the past few weeks if they were related to the following issue:

Occasionally, a fifth blank menu entry will appear in the Power menu (blank in the sense that it’s missing its label).

In the screenshot below, I don’t have the problem:

image

However what will happen sometimes (maybe in the first few minutes after boot?) is that there will be a fifth blank menu entry, and the display of that entry is causing the ACPI 15 errors (maybe because it’s looking for the label in some database and that label is missing?)

EDIT: In my case the “ACPI 15” errors are these:

The description for Event ID 56 from source Application Popup cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event: 

ACPI
15

The message resource is present but the message was not found in the message table

Yes, I have the same error on my Linux boot log

Yes, that is the error I have in the Event Viewer.
I believe all 3rd-gen Threadrippers have this error. Before the Aorus I had an ASRock TRX40 board with the same error, and I returned it to the dealer as it was unstable and dropping memory modules randomly. The Aorus has been rock-solid stable, apart from those ACPI errors, which have no visible impact.

I get the same ACPI errors as well, running Debian sid with Linux kernel 5.6.7; everything is stable. The only issue is that audio over S/PDIF doesn’t work, but I believe that’s being addressed.

Any thoughts if this is fixed on TRX40 Aorus Master or it only affected the Xtreme?

I’m debating between ASUS TRX40 Prime Pro S and TRX40 Aorus Master. The Master has more features but I really just need a solid board and ASUS is usually pretty good in the BIOS department.

I have similar ACPI errors in my kernel log on my company’s Dell PowerEdge R6515 Epyc 7002P server also on Ubuntu 19.10. I think it’s just a platform thing, I don’t see any negative impact.

Hi,
I have problems similar to yours. What was the verdict regarding the cause of the crashes? The discussion seems to bounce around different motherboards, and I got a bit confused about whether it is the CPU, memory or motherboard …

Here is what I have and how the crashes occur.

Motherboard: Gigabyte Aorus TRX40 Xtreme rev 1.0
BIOS: F4l (released on 2020/09/25)
CPU: AMD Threadripper 3990X
RAM: 8x32 GB, ECC unbuffered Micron/Crucial MTA18ASF4G72AZ-3G2B
PSU: 1500 W, be quiet! Dark Power Pro 12
OS: openSUSE Leap 15.2, kernel 5.3

Crash: black screen after compiling Linux kernel 5.10.2 in parallel on 64 cores (make -j 64)

The system works fine under regular activity: editing files, browsing (Firefox with up to a dozen open tabs), etc.

When I compile the kernel on a single core, with all files compiled sequentially, everything is fine too, though it takes several hours.

When I tried to compile the kernel in parallel, loading 64 cores (make -j 64), the system crashed with a black screen within 1-2 minutes. The temperatures of the CPU (CCDs, I/O chiplets) look OK; they never go above 65-70 °C up to the moment of the crash. This is reproducible. The system log shows lots of “machine errors” (“mce” errors in the log file), essentially one error per core (the system sees 128 cores).
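To check whether the machine check errors really are spread over every core or cluster on a few of them, the log can be tallied per CPU. This is a hedged Python sketch: it assumes the kernel log contains lines of the form `[Hardware Error]: CPU <n>: Machine Check...`, which is how recent x86 kernels print them as far as I know. The exact wording on your kernel may differ, so treat the pattern as an assumption. The helper name `count_mce_by_cpu` is hypothetical.

```python
import re
from collections import Counter

# Assumed line format (may differ between kernel versions):
#   mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 5: <status>
MCE_LINE = re.compile(r"\[Hardware Error\]: CPU (\d+): Machine Check")


def count_mce_by_cpu(log_text):
    """Return a Counter mapping logical CPU number -> number of MCE records."""
    return Counter(int(m.group(1)) for m in MCE_LINE.finditer(log_text))
```

The input would be the saved kernel log as text, for example the contents of the system log file from the boot before the crash. An even spread over all 128 logical CPUs points more towards something shared (power delivery, memory, firmware settings) than towards a single bad core, although that is only a rule of thumb.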

I downloaded MemTest86 Pro to check the memory sticks, and the story is similar. When I run the test on a single CPU, all memory checks are OK (it took about 98 hours). When I try to run MemTest86 in parallel, using all CPUs, I get lots of errors. The worst part is that MemTest86 hangs after that.
I also removed all memory sticks except one and checked the sticks one at a time. Again, when I run the checks in parallel, MemTest86 either hangs and becomes unresponsive, or reports errors and then becomes unresponsive. I did this with 4 sticks.
It is hard to believe that so many sticks are bad, so either there is an issue with the CPU or with the motherboard/BIOS.
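The “fine on one core, errors with all cores busy” pattern can also be checked from user space. The idea is to compute a deterministic result once on a single thread, repeat the same computation on many threads at once, and compare. The Python sketch below only illustrates that principle and is not a replacement for Prime95 or MemTest86: it hashes a fixed buffer with `hashlib` (CPython releases the GIL while hashing large buffers, so the threads can occupy several cores) and does not exercise AVX2 FMA the way Prime95 does. The names `digest_chain` and `find_mismatches` are made up for this example.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = bytes(range(256)) * 4096  # 1 MiB of fixed input data


def digest_chain(rounds):
    """Hash the fixed block repeatedly, feeding each digest into the next round."""
    digest = b""
    for _ in range(rounds):
        h = hashlib.sha256()
        h.update(digest)
        h.update(BLOCK)
        digest = h.digest()
    return digest.hex()


def find_mismatches(workers, rounds):
    """Return indices of workers whose result differs from a single-threaded run."""
    reference = digest_chain(rounds)  # computed alone, before the load starts
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(digest_chain, [rounds] * workers))
    return [i for i, result in enumerate(results) if result != reference]
```

A call such as `find_mismatches(workers=os.cpu_count(), rounds=2000)` returns an empty list on a healthy machine. A non-empty list would mean some thread computed a different answer from the same input, which points at hardware or firmware rather than software. An empty list proves nothing, since this load is far lighter than a torture test.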

Also, I am confident that CPU heating is not an issue: it has good air cooling, and besides the sensor readings, the air blown out by the CPU cooler is not hot. In contrast, my old HP Z820 Xeon workstation can work as a heater in winter.

I would greatly appreciate any input as to what, in your opinion, is the culprit.

I have seen dead DIMMs and CPUs in the past but nothing like this. Ideally I would love to have a dual-socket EPYC with lots of RAM, but it is in a different price range. Also, seeing what is going on with the current setup makes me wary of getting my hands on EPYCs if there is a high rate of “bad apples” among AMD CPUs.

Thank you and best regards,
Dima.

I had a similar experience with a Ryzen 3900X and 3,600 MHz RAM. Everything was fine until I loaded up all the cores.

I never proved it but I think it needed an increase in the SoC voltage. Be careful if you increase that. Don’t go too far.

I’ve never tried anything in this but it seems legit: https://www.techpowerup.com/review/amd-ryzen-threadripper-3000-overclocking-deep-dive-asus-rog-zenith-ii-extreme/6.html

I know you are not overclocking, but the BIOS default settings may not be enough. Also, did you check for BIOS updates? You might also try contacting Gigabyte support.

Hi,
Thank you for the reply. The F4l BIOS version is the latest listed on Gigabyte’s website. I will post the same question there too.

I will try to play with the SoC voltage once I have done some reading, so as not to burn the CPU accidentally.

The problem with Gigabyte boards like the Aorus Xtreme is that they were designed for gamers, and in that world most of the load is on the GPU. So my guess is that they did not really test full CPU load, especially for the 280 W 3990X. And yes, I am not overclocking it. The only BIOS settings I changed are those related to virtualization.

I also noticed that the metal plate below the CPU that covers the M.2 drives became very hot during the MemTest86 run.