Background
There seems to be inherent instability that may be getting worse over time in certain enthusiast gaming CPUs.
Intel offers this guidance on the issues, current as of June 2024. Intel notes that its investigation is ongoing and that the guidance offers possible mitigations, but these mitigations do not always resolve the issues enthusiasts and gamers are facing.
community.intel.com – 18 Jun 24
June 2024 Guidance regarding Intel Core 13th and 14th Gen K/KF/KS instability…
Intel and its partners are continuing to investigate user reports regarding instability issues on Intel Core 13th and 14th generation (K/KF/KS) desktop processors. We appreciate the Intel community’s patience on the matter and will continue to share…
Level1Techs Has An Idea
If we need to gather a lot of data, what better way than game crash telemetry databases?
See also the video
Intel has a Pretty Big Problem
Error Rate
The game crash telemetry was interesting. A lot of systems were in the crash database, but the crash rate per unit time of play is not straightforward to estimate given the way game crashes are typically logged.
More Problems with this approach
When AMD had a similar problem – it was possible to murder AMD CPUs in some scenarios with Asus, and to a lesser extent Gigabyte, boards, with boards venturing outside recommended
Some will write off these issues with Intel CPUs as inevitable consequences of chasing the performance crown. Once a CPU has degraded, maybe it is not possible to recover stability?
A Better Approach – Datacenter Usage
Unhappy with what we found analyzing game crash databases, I decided we needed a new approach.
It would be better to have more control over the system population and the configuration of the machines experiencing issues.
These CPUs can also be leased inside a datacenter for game servers and tasks that run well with high single-core clock speeds. This typically means that you get error-correcting memory and a different motherboard chipset, W680. This is the ideal data source for further analysis.
W680 is potentially a huge help in isolating a voltage and clock problem here because W680 is much more conservative in terms of clocks and watts.
Do we still see issues with W680?
Yes. In a test population of more than 210 W680-based systems, 47.1% experienced at least one incident of instability over a 168-hour test window. The rate is the same to within 0.4% between Asus and Supermicro W680 boards.
One datacenter technician told us they no longer offer 13th gen CPUs for sale, and that they had replaced 13th gen with 14th gen CPUs for customers experiencing issues.
If this were just an eTVB issue, one would think that W680 would be immune, or at least have a lower rate of crashing.
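To get a feel for how much uncertainty sits around a figure like 47.1%, here is a rough sketch of a 95% Wilson score interval. It assumes exactly 210 systems (the post only says "more than 210") and rounds the unstable count from the percentage, so treat the inputs as illustrative rather than the actual dataset. The function name `wilson_interval` is my own.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion hits/n (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Assumption: 210 is only the lower bound on the population size.
n_systems = 210
# Assumption: count reconstructed from the reported 47.1%.
n_unstable = round(0.471 * n_systems)

low, high = wilson_interval(n_unstable, n_systems)
print(f"{n_unstable}/{n_systems} unstable, 95% interval {low:.1%} to {high:.1%}")
```

With a population this size the interval is several points wide on either side, which is worth keeping in mind when comparing a 0.4% difference between board vendors.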
What did we find that was most stable?
Our population of systems included 128GB (4x32GB) and 96GB (2x48GB) systems. The 2x48GB systems were stable with the W680 default power configuration (0x123 microcode was the latest available as of 7/10/2024 on W680) and a 125W TDP. Settings: multiplier limit of 53, and a memory speed cap of 5000 for 1DPC and 4200 for 2DPC.
Using ECC memory with W680 is also recommended.
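If you are auditing a fleet against the settings above, something like this sketch could flag machines that exceed them. The numbers come from this post (125W TDP, multiplier 53, 5000 for 1DPC, 4200 for 2DPC); the class and function names are made up for illustration, and how you actually read these values off a machine is left out entirely.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StableProfile:
    # Values taken from the post; the structure itself is hypothetical.
    tdp_watts: int = 125
    max_multiplier: int = 53
    mem_cap_1dpc: int = 5000
    mem_cap_2dpc: int = 4200

    def mem_cap(self, dimms_per_channel):
        """Memory speed cap for 1 or 2 DIMMs per channel."""
        if dimms_per_channel == 1:
            return self.mem_cap_1dpc
        if dimms_per_channel == 2:
            return self.mem_cap_2dpc
        raise ValueError("dimms_per_channel must be 1 or 2")

def violations(profile, *, tdp_watts, multiplier, mem_speed, dimms_per_channel):
    """Return a list of human-readable reasons a config exceeds the profile."""
    problems = []
    if tdp_watts > profile.tdp_watts:
        problems.append(f"TDP {tdp_watts}W exceeds {profile.tdp_watts}W")
    if multiplier > profile.max_multiplier:
        problems.append(f"multiplier {multiplier} exceeds {profile.max_multiplier}")
    cap = profile.mem_cap(dimms_per_channel)
    if mem_speed > cap:
        problems.append(f"memory speed {mem_speed} exceeds {cap} cap for {dimms_per_channel}DPC")
    return problems

profile = StableProfile()
at_spec = violations(profile, tdp_watts=125, multiplier=53, mem_speed=5200, dimms_per_channel=1)
stepped_down = violations(profile, tdp_watts=125, multiplier=53, mem_speed=4200, dimms_per_channel=2)
print(at_spec)
print(stepped_down)
```

The first example is a 1DPC system running DDR5-5200, which is above the 5000 cap; the second is a 2DPC system already stepped down to 4200.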
Some systems were stable with DDR5-4400 (2DPC) and DDR5-5200 (1DPC) per spec, but surprisingly, a lot of the systems that had been stable at these speeds months ago needed to be stepped down just a bit to regain stability.
It does seem like there is a lot of evidence here for degradation over time, even with W680's very conservative settings.
Moving Forward
Intel will need to offer warranty services; maybe something similar to what they did with OC insurance for 10th gen CPUs?
The uncertainty here no doubt frustrates many gamers.
I put together this thread for next steps