Hi,
I’m the perplexed owner of an AMD Threadripper 3970X CPU on a GIGABYTE TRX40 Aorus Xtreme motherboard.
I’m writing this thread as a kind of followup/answer to @DerAlbi’s post here (“3970X - Prime95 stability?”) and to my own thread over at Reddit (“Prime95 errors with AMD Threadripper 3970X build”). Sorry, I’m new here and I can’t use links in my post yet…
TL;DR:
It appears that several (many?) 3970X parts (and possibly 3960X and 3990X parts too) are unable to perform correct AVX2 computations under heavy load (presumably high current, certainly not high temperature) when the data set fits in L2 cache. We need to start a conversation with actual engineers at AMD and/or GIGABYTE, and we need to get precise answers: Is there a systemic issue with the Threadripper 3000 CPU series? Or a high rate of defective products?
Update #1
Let’s ask the question: Is anyone able to run Prime95 with min and max FFT sizes set to 16K, in-place FFTs, 64 threads, with AVX2 on an AMD Threadripper 3970X? If so, what motherboard are you using? What settings (frequency and voltage)?
The Story
I ordered the 3970X a few days after it was announced back in November and I built my rig in January. Full current specs of the system below [1].
I immediately had a host of issues, including errors in Prime95 and even blue screens under Windows. I reported my experience over at Reddit [2]. Long story short: I determined by trial and error that one core seemed to be defective (!?). The system became stable again when I disabled core #32 in Windows, and I was then also able to pass Prime95. I ended up RMA’ing the CPU and, in order to avoid too long of a downtime, bought a second CPU while waiting for the first one to be reimbursed…
At the time I wrote in the Reddit thread (in a post titled “Update #5”) that with the new 3970X part, my system was fully stable and that I was able to pass Prime95…
An Unfortunate Development
I recently found out that’s not entirely true. The system is fully stable, but I still can’t pass Prime95. Here’s a quick summary of all the tests I’ve done so far:
- MemTest86 v8.3, 4 passes (~8 hours), RAM at 3466 MT/s [PASS]
- MemTestPro 7.0 (paid version of HCI MemTest), 700% (~5 hours), RAM at 3466 MT/s [PASS]
- AIDA64 6.20.5300 System Stability Test (full system test except local disks) [PASS]
- IntelBurnTest v2.54 (based on Intel Linpack), maximum stress level, all available RAM, 20 runs [PASS]
- OCCT 5.5.3 test, large data set, 64 threads, AVX2 [PASS]
- Google stressapptest 1.0.9, all available RAM [PASS]
- Prime95 v29.8 build 6 torture test, 64 threads, min/max FFT = 4K, in-place FFTs, AVX2 (~1 hour) [PASS]
- Prime95, min/max FFT = 8K, in-place FFTs, AVX2 [PASS]
- Prime95, min/max FFT = 16K, in-place FFTs, without AVX2 [PASS]
- Prime95, min/max FFT = 16K (in-place FFTs or not), with AVX2 [INSTANT FAIL]
I’m also using this system fairly heavily on a daily basis, and at times even punishingly so (I’m a rendering software engineer doing large C++ builds and heavy CPU, GPU or CPU+GPU renders) and as far as I can tell, it’s completely stable: I never had a blue screen or any kind of hard lockup, no weird errors in Event Viewer, no application crashes.
Yet, the system fails on Prime95 at any FFT size >= 16K if AVX2 is enabled.
What’s interesting is that Prime95 passes when min and max FFT sizes are set to 4K or 8K, but it fails instantly (in less than a second) when they’re set to 16K.
In Prime95, an 8K FFT uses 8 * 1024 elements * 8 bytes per element (presumably sizeof(double)), i.e. 64 KB. A 16K FFT uses 128 KB. So as soon as 128 KB or more of data is accessed in place (i.e. a data set that fits comfortably in the 512 KB L2 cache) under heavy AVX2 load, the CPU produces incorrect results. Note that it passes with 8K FFTs, which also don’t fit in the 32 KB L1 data cache, so simply spilling out of L1 is not enough to trigger the failure.
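To make the arithmetic above easy to check, here is a minimal Python sketch. It assumes, as guessed above, that each FFT element is one 8-byte double, and it uses the 32 KB L1 data cache and 512 KB L2 cache sizes quoted in this post. The function names are made up for illustration, and Prime95's real memory footprint is likely somewhat larger than this (extra tables and scratch data), so treat the numbers as a lower bound.

```python
# Assumption: one FFT element = one 8-byte double (the post's guess).
BYTES_PER_ELEMENT = 8
L1D_BYTES = 32 * 1024    # per-core L1 data cache, as quoted in the post
L2_BYTES = 512 * 1024    # per-core L2 cache, as quoted in the post


def fft_working_set_bytes(fft_size_k):
    """Estimated data size for an FFT of `fft_size_k` K elements."""
    return fft_size_k * 1024 * BYTES_PER_ELEMENT


def smallest_cache_that_fits(nbytes):
    """Name the smallest cache level whose capacity is >= nbytes."""
    if nbytes <= L1D_BYTES:
        return "L1D"
    if nbytes <= L2_BYTES:
        return "L2"
    return "beyond L2"


for size_k in (4, 8, 16):
    nbytes = fft_working_set_bytes(size_k)
    print(f"{size_k}K FFT: {nbytes // 1024} KB -> {smallest_cache_that_fits(nbytes)}")
```

Under these assumptions the 4K case only nominally fits in L1 (it would fill all 32 KB), while both the 8K and 16K cases land in L2. That matches the observation that leaving L1 alone does not explain the failure.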
Provisional Conclusion
At this point my conclusion is the following:
As soon as 128 KB or more of L2 cache is accessed under heavy, all-core AVX2 load, one or several of the compute units of the 3970X (AVX2? AVX? probably not ALU) produce incorrect numerical results.
It’s not a thermal issue since it fails within a second. It seems likely to be a power delivery issue. Something Vdroop something something? This is where my hardware knowledge currently stops, sadly.
Power and Temperature Considerations
A note on power consumption: My system draws around 195-240 W at idle (as measured by my PSU) with the CPU alone drawing about 70-80 W [3]. I’ve configured Windows’ power plan for maximum performance but I’ve set the minimum processor state to 70 % in an attempt to reduce power draw at idle. It’s possible that there are better settings for maximum performance under full load with best power economy at idle: I’m all ears!
Finally, a note on CPU temperatures: At idle the CPU hovers around 39-50 °C and tops out around 72-78 °C under full load. I’m using the best air cooling setup I could think of and get my hands on, but it’s still air cooling, and my system is installed in a closed case (but with extreme attention to airflow).
Appendix
[1] Full system specs:
- AMD Ryzen Threadripper 3970X
- GIGABYTE TRX40 Aorus Xtreme (BIOS version F4a, latest as of Feb 2020)
- Noctua NH-U14S TR4-SP3 + additional NH-U14S fan
- HYPERX Fury RGB 4x16 GB DDR4 3466 MT/s (InfinityFabric at 1733 MHz)
- NVIDIA GeForce RTX 2080 Ti Founders Edition
- Corsair HX1000i PSU (1000 W)
- Dark Base Pro 900 Black
- Windows 10 Professional version 1909, build 18363.657 (fully patched)
- Everything running at stock, no overclocking (apart from the XMP profile)
[2] Reddit: “Prime95 errors with AMD Threadripper 3970X build”
[3] HWiNFO64 at idle: