Unstable Threadripper 3960x but passes stability tests

Hi All,

I have a Threadripper system that is unstable under SoundPLAN 8.2 and XMRig 6.17.0 (newer versions only; an older version is stable), yet it seems to pass stress tests like Prime95 and Memtest.

I really just want SoundPLAN to work. I don’t care about XMRig, but thought it could be a useful data point. SoundPLAN’s SPCalc.exe is a heavily multi-threaded workload.

XMRig just closes at random, anywhere between 5 seconds and 5 minutes after starting.
SoundPLAN just closes at random too.

We have spent months trying to fix this, so hopefully the community can help.

The build info is as follows:
  • Windows 10
  • Threadripper 3960x with NH-U14s
  • Asus Prime TRX40-Pro
  • Corsair Vengeance RGB Pro (CMW32GX4M4C3600C18) 2x8GB
  • MSI 3070Ti Suprim X
  • Seasonic Prime TX-850W
  • Samsung 970 EVO Plus 500GB NVMe

To date we have:
Upgraded and downgraded the BIOS and drivers, installed all updates, tried different RAM, switched power supplies, reinstalled the OS, tried a different SSD, tried an RX580, reseated the CPU, and rebuilt the whole system (many times). All component temps seem fine. We have tried so many BIOS settings I couldn’t list them all here.

It might be just me, but it also seems a tiny bit more stable using a SATA SSD rather than the NVMe drive.

Any ideas are much appreciated.

Do you have access to any memory modules with newer ICs? Older ICs were only tested against Intel because AMD had no DDR4 CPUs at the time. Anecdotal evidence, but I had a lot of trouble getting any cheaper memory to run, even at JEDEC speeds, on a 2600 when I first got it. Never mind 3200 MHz XMP speeds, it wouldn’t even do 2133 on 2400 MHz memory.
I switched it out for 32GB DIMMs at 3200 MHz JEDEC, and everything was solid as a rock from that point on. It wouldn’t OC worth a damn, but it never gave me trouble at stock speeds.
32GB UDIMMs require modules of a density that wasn’t available when first and second gen Ryzen launched, and 3200 MHz JEDEC memory also wasn’t on the market at that time afaik, so getting sticks that fit that spec is pretty much guaranteed to get you ICs that were tested against both AMD and Intel.

Not really science, but it’s my superstition.

Check the Windows event log: if programs are closing unexpectedly due to a processor exception, there should be a log entry for the exception.

Any other CPU or memory intensive benchmarks that show the same behaviour? Cinebench, CPU-Z benchmark?

Try running one of the programs that keeps closing under a debugger like WinDbg. That will show whether there was an exception, or else the program’s exit code.
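Short of a full debugger session, a small wrapper can at least record how the process ended. The following is only a sketch in Python, not anything shipped with SoundPLAN or XMRig: `as_ntstatus` and `run_once` are hypothetical helper names, and the command to run is whatever you pass on the command line. On Windows, a process killed by an unhandled exception normally exits with the exception code, so a result of 0xC0000005 would point at an access violation.

```python
import subprocess
import sys
import time


def as_ntstatus(returncode: int) -> str:
    """Format an exit code as an unsigned 32-bit hex value."""
    return f"0x{returncode & 0xFFFFFFFF:08X}"


def run_once(cmd):
    """Run cmd to completion; return (exit code, seconds it survived)."""
    start = time.monotonic()
    proc = subprocess.run(cmd)
    return proc.returncode, time.monotonic() - start


if __name__ == "__main__":
    # Usage (placeholder): python wrapper.py <program> [args...]
    cmd = sys.argv[1:]
    if cmd:
        code, elapsed = run_once(cmd)
        print(f"exit code {as_ntstatus(code)} after {elapsed:.1f} s")
```

Running it a few times in a row and noting the exit code and survival time for each run gives a rough picture of whether the failures look alike.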

Try booting a Linux live USB and running a benchmark to see if the same thing happens. If so, check the exception report in dmesg, or run the program under gdb.
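dmesg output can be long, so here is a rough Python sketch for pulling out the lines most likely to matter. The keyword patterns are my own guesses at what is worth a closer look (segfaults, machine-check reports, PCIe AER messages, general protection faults), not an exhaustive or official list, and `classify` and `scan` are hypothetical names.

```python
import re
import sys

# Patterns assumed to be worth a closer look; extend as needed.
PATTERNS = {
    "segfault": re.compile(r"segfault at", re.IGNORECASE),
    "machine_check": re.compile(r"\bmce:|Hardware Error|Machine check", re.IGNORECASE),
    "pcie_aer": re.compile(r"\bAER\b"),
    "protection_fault": re.compile(r"general protection", re.IGNORECASE),
}


def classify(line):
    """Return the names of all patterns that match this log line."""
    return [name for name, rx in PATTERNS.items() if rx.search(line)]


def scan(lines):
    """Return (tags, line) for every line that matches at least one pattern."""
    hits = []
    for line in lines:
        tags = classify(line)
        if tags:
            hits.append((tags, line.rstrip("\n")))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: dmesg > dmesg.txt && python3 scan_dmesg.py dmesg.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for tags, line in scan(fh):
            print(",".join(tags), "|", line)
```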

Try upgrading BIOS to the latest version.

Try enabling PCIe error reporting in the BIOS. That will log errors to the Windows event log or Linux dmesg. The option will be called “NBIO > Enable AER Cap”, or something mentioning “AER”.

If you can try some ECC memory, that will definitely help narrow it down.

Ok, thanks for your feedback. I just ordered some Kingston 16GB DDR4 ECC 2666MHz UDIMM (KSM26ED8/16HD). Should be here within the week so I will try it and get back to you.

There seems to be only limited data in the Windows Event Viewer.

Cinebench and CPU-Z benchmarks run with no trouble at all.

When MS Edge crashes it displays the error STATUS_ACCESS_VIOLATION.
The other programs just close, and I can’t see any errors for them in the Event Viewer.

I will try WinDbg and see if I have any luck.

I couldn’t find PCIe error reporting in the BIOS. I looked under NBIO but there was no AER Cap. I also searched for AER but couldn’t find it or anything similar.

I have tried multiple BIOS versions and am currently on the latest, 1603.

I just ordered some Kingston 16GB DDR4 ECC 2666MHz UDIMM (KSM26ED8/16HD), so hopefully it arrives quickly and lets us rule memory out.

Hi All,

The ECC memory I ordered arrived and unfortunately I’m still having the same problem.

I have installed two modules of 16GB Kingston KSM26ED8/16HD.

I managed to get some data out of the Windows Event Viewer. This is what it gave me:

Faulting application name: SPCalc.exe, version: 8.2.0.0, time stamp: 0x60927a0e
Faulting module name: RKernel7.dll, version: 8.2.0.0, time stamp: 0x60927a05
Exception code: 0xc0000005
Fault offset: 0x00000000001da6f9
Faulting process id: 0x505c
Faulting application start time: 0x01d85b9465d4d87e
Faulting application path: C:\Program Files\SoundPLAN 8.2\SPCalc.exe
Faulting module path: C:\Program Files\SoundPLAN 8.2\RKernel7.dll
Report Id: f59bda2f-cae7-4ad4-a064-dfce662714a2
Faulting package full name:
Faulting package-relative application ID:

I think it’s a memory access violation (exception code 0xc0000005), the same STATUS_ACCESS_VIOLATION that MS Edge was reporting.
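If more of these records turn up, it may help to tabulate them rather than read them one at a time. The sketch below is a hypothetical Python helper (`parse_record` and `tally` are made-up names) that pulls the faulting module, exception code and fault offset out of the text of a record like the one above, with or without colons after the field names. As a rule of thumb only, not a diagnosis: crashes that always land on the same module and offset tend to suggest a software bug, while crashes scattered across modules and offsets are more typical of a hardware fault.

```python
import re
from collections import Counter

# A few well-known NTSTATUS exception codes.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC0000094: "STATUS_INTEGER_DIVIDE_BY_ZERO",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}

FIELD_RES = {
    "module": re.compile(r"Faulting module name:?\s*([^,\r\n]+)"),
    "code": re.compile(r"Exception code:?\s*(0x[0-9a-fA-F]+)"),
    "offset": re.compile(r"Fault offset:?\s*(0x[0-9a-fA-F]+)"),
}


def parse_record(text):
    """Pull module, exception code and fault offset out of one record."""
    out = {}
    for key, rx in FIELD_RES.items():
        m = rx.search(text)
        if m:
            out[key] = m.group(1).strip()
    if "code" in out:
        value = int(out["code"], 16)
        out["code_name"] = NTSTATUS_NAMES.get(value, "unknown")
    return out


def tally(records):
    """Count crashes per (module, offset) pair."""
    return Counter((r.get("module"), r.get("offset")) for r in records)
```

Feeding each saved record through `parse_record` and then passing the list to `tally` shows at a glance whether the same module and offset keep coming up.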

I updated Cinebench to R23, and after about 5 minutes of the multi-core bench it stops responding. I had HWiNFO64 running at the same time and all temps seem good. CPU package was around 86 °C.

Any thoughts?

Do you get any ECC error reports now?

I always have to search to figure out how to do this in Windows, but apparently you can set the Windows Event viewer filter to:

  • Logged: Any time
  • Event logs: System
  • Event Sources: BugCheck, eventlog, Eventlog, MemoryDiagnostics-Results, MemoryDiagnostics-Schedule, StartupRepair, WHEA-Logger, amdsata
  • Includes Event IDs: 19,1001,6008,1002,1137,1208,1213,1101,1201,1103,11

or in Linux: edac-util -v
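The same kind of filter can be expressed as an XPath query and run from the command line with `wevtutil`, which saves clicking through the filter dialog each time. This is a sketch under assumptions: `build_xpath` is a hypothetical helper, and the provider name used here (`Microsoft-Windows-WHEA-Logger`) is the full name as it appears in an event's XML view. The short source names shown in the filter dialog may differ from the XML provider names, so check an actual event's XML before adding the other sources.

```python
def build_xpath(providers, event_ids):
    """Build an XPath selector for System-log events.

    Matches events whose provider is one of `providers` (if any are given)
    and whose event ID is one of `event_ids` (if any are given).
    """
    clauses = []
    if providers:
        names = " or ".join(f"@Name='{p}'" for p in providers)
        clauses.append(f"Provider[{names}]")
    if event_ids:
        ids = " or ".join(f"EventID={i}" for i in event_ids)
        clauses.append(f"({ids})")
    return f"*[System[{' and '.join(clauses)}]]"


# Provider name is an assumption to verify against a real event's XML view.
PROVIDERS = ["Microsoft-Windows-WHEA-Logger"]
# Event IDs taken from the filter list above.
EVENT_IDS = [19, 1001, 6008, 1002, 1137, 1208, 1213, 1101, 1201, 1103, 11]

if __name__ == "__main__":
    query = build_xpath(PROVIDERS, EVENT_IDS)
    # Prints a command to paste into an elevated prompt on the Windows box.
    print(f'wevtutil qe System /q:"{query}" /f:text /c:50')
```

Passing an empty provider list drops the provider clause and filters on event IDs alone, which is a reasonable first pass if you are unsure of the provider names.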

I re-downloaded Cinebench R23 and it completes its 10-minute test with no errors.

In Event Viewer there is a ‘Hardware Errors’ section, but I can’t find any errors regarding ECC there or anywhere else. I’m now looking for a setting in the BIOS or in Windows that turns on ECC error reporting.
