Last Time…
Intel has a pretty big problem
(youtube.com)
I’ll update this thread with more info as I get time.
TL;DR What’s the update?
Intel Core 13th and 14th Gen Desktop Instability Root Cause Update - Intel Community
Intel microcode 0x12B is expected to prevent further degradation of 13th and 14th gen CPUs, and Intel has offered a root cause explanation which fits pretty well with the available data.
Our opinion: the root cause is a combination of Intel CPU microcode, (incorrect) temperature assumptions on the part of both Intel and motherboard manufacturers, manufacturing quality variability in the CPUs, and motherboard manufacturers pushing CPUs too far.
We also think the evidence strongly suggests this problem has been known internally for a long time, but perhaps the number of affected CPUs was estimated to be lower than in reality. Even if it had not been known internally for a long time, it still took Intel too long to say "we've got you covered" in terms of warranty (both for commercial and end-user customers).
Do you need to replace your CPU?
The problems creep in at the low end of the voltage range. If your CPU does not work at stock memory settings with the Intel-recommended settings, then your CPU has degraded and needs to be replaced. If you lock the voltage and it stabilizes, that should be treated as a stop-gap measure and you should plan on replacing your CPU.
There isn't a good tool from Intel for testing this instability. Y-cruncher and other torture tests aren't a good test either. As the video shows, even just setting the performance profile (because it changes the voltages the CPU requests) can stabilize a damaged CPU.
Intel's Processor Diagnostic Tool has proven unreliable for detecting this issue because it is more akin to a torture test.
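Since there's no official tool, a rough first pass is just scanning the kernel log for the error signatures that tend to accompany a degrading part. Here's a minimal sketch; the marker list is our assumption for illustration, not an Intel-provided diagnostic, and a clean log does not prove a healthy CPU:

```python
import re

# Rough markers we associate with this degradation (MCEs, corrected
# errors, the decompression/segfault crash pattern). This list is an
# assumption, not an official diagnostic.
INSTABILITY_MARKERS = re.compile(
    r"mce|machine check|corrected error|segfault|illegal instruction",
    re.IGNORECASE,
)

def suspicious_lines(log_text: str) -> list[str]:
    """Return kernel-log lines matching any of the rough markers above."""
    return [line for line in log_text.splitlines()
            if INSTABILITY_MARKERS.search(line)]

# Typical use on a live box:
#   import subprocess
#   suspicious_lines(subprocess.check_output(["dmesg"], text=True))
```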
The Breakdown
Context: For the video / follow-up we are looking at game servers in hosted environments and their behavior over the last several months. Key takeaways:
- Microcode and BIOS versions across the board, from late 2023 until 0x12B, play it fast and loose with internally reported temperatures.
- Datacenter W680 systems with 2U air cooling and higher average thermals fail less
- Datacenter W680 systems with glycol cooling and lower average thermals fail more often
- Datacenter W680 systems mining cryptocurrency fail less often than datacenter systems running gaming workloads (e.g. Minecraft, but also others)
- The data somewhat suggests that gaming workloads which triggered boosts while the CPU was both busy and stalled, due to pipeline bubbles or cache misses, led to faster degradation (1)
(1) Gaming CPU dies faster when boosting more during… gaming? Gaming CPU can’t… game?
This is the most interesting thing to come out of the data, and I don't have absolute certainty about what I'm seeing. Here's what I think is happening: prior microcode, and possibly the eTVB algorithm, wasn't very smart about deciding when to boost. The CPU would request a higher voltage expecting the workload to pull the voltage down under load, but in reality the code was too branchy or unpredictable and the CPU could not actually boost. The result: the CPU was blasted with a high voltage it couldn't use.
Couple that with mis-calibrated load lines from the CPU, which can also skew voltages under load, and that explains a lot of the voltage-related failures.
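The two workload shapes in question can be sketched in miniature. This is purely illustrative (it does not reproduce the failure); it just contrasts "busy but stalled" against "busy and fed":

```python
import random

def pointer_chase(n: int = 1 << 16, steps: int = 200_000) -> int:
    """Busy-but-stalled: every load depends on the previous one, so on a
    large enough table the core spends most of its cycles waiting on
    cache misses rather than retiring work."""
    perm = list(range(n))
    random.shuffle(perm)
    i = 0
    for _ in range(steps):
        i = perm[i]  # serial dependent loads, hostile to caches/prefetch
    return i

def dense_math(steps: int = 200_000) -> float:
    """Busy-and-fed: predictable arithmetic keeps the pipeline full, so a
    boost request actually has load to pull the requested voltage down."""
    acc = 1.0
    for _ in range(steps):
        acc = acc * 1.0000001 + 0.5
    return acc
```

The hypothesis above is that boosting into the first shape leaves the part sitting at high requested voltage with little current draw, while the second shape behaves as the boost algorithm expects.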
We know that Intel takes temperatures into account when deciding to boost. This is, we think, why CPUs that were cooled better tended to fail more often. Perhaps not all parts of the chip were as cool as Intel thought when deciding to boost? Perhaps motherboard manufacturers did something that has the knock-on effect of causing Intel to mis-measure temperature from the internal sensors?
It is easy to test old microcode on boards in controlled scenarios and see that internally-reported temperatures will vary wildly, but root-causing this aspect of how we got here just wasn’t interesting enough for this video.
Someone I know put it more succinctly than I could: Intel plays stupid games with spec ambiguity (or with not keeping an eye on what motherboard makers are doing), Intel wins stupid prizes.
To be clear, I do think that motherboard manufacturers bear a lot more responsibility for this problem than is apparent in the public eye, and that's why I spent all this time getting down to brass tacks on W680-class motherboards. These motherboards are almost universally more conservative than their Z-series desktop counterparts (well, 99% of them; the Asus W680 is out of spec out of the box, imho). And that has proven to be an awesome datapoint.
All CPUs?
Some CPUs seem unaffected by this over a long period of time… in fact, the majority of them.
The CIO of one firm I was talking to estimates they're going to do field replacement of about 22% of this class of processor they have in the field, across a wide variety of systems. For now, between the microcode revisions and preventing the CPUs from requesting less than 1 volt, the situation has improved dramatically.
One can imagine the frustration of using a CPU for weeks on end with full stability in y-cruncher, only to have it immediately fail running apt upgrade with the ondemand performance governor 20 minutes later.
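That's also why pinning the frequency governor to performance works as a stop-gap: a marginal CPU spends less time at the low-voltage idle states where the degradation shows first. A minimal sketch, using the standard Linux cpufreq sysfs layout (the `root` parameter is our addition so the function can be exercised against a fake tree; writing the real paths needs root privileges):

```python
from pathlib import Path

def set_governor(governor: str, root: str = "/sys/devices/system/cpu") -> int:
    """Write `governor` to every cpuN/cpufreq/scaling_governor under
    `root`. Returns the number of CPUs updated. On a real system this
    must run as root, and the governor must be one the kernel offers
    (see scaling_available_governors)."""
    updated = 0
    for gov_file in Path(root).glob("cpu[0-9]*/cpufreq/scaling_governor"):
        gov_file.write_text(governor + "\n")
        updated += 1
    return updated

# e.g. set_governor("performance")  # as root, on a real machine
```

This is a stop-gap in exactly the sense above: if it stabilizes the machine, the CPU is already damaged and should still be replaced.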
What about laptops based on HX series CPUs?
Given that the data seems to indicate 1) CPUs paired with potato-class VRMs and less-than-ideal cooling fail less, and 2) laptop CPUs run at lower overall voltages, it seems likely that laptop CPUs based on the desktop dies are less affected.
Level1 Opinion: We think some of this is also rooted in a quality control issue, or at least manufacturing quality variance, which would help explain why some chips are affected and others aren't. The game crash databases show an alarming number of similar decompression failures, and across the board, failure rates on all generations of gaming laptop are significantly elevated vs their desktop counterparts.
This was not the target of our investigation, and we will likely have more of an update on this later.