Intel's Got a Problem - 13900K/KS/KF 14900K/KS/KF Crashing

Background
There seems to be inherent instability that may be getting worse over time in certain enthusiast gaming CPUs.

Intel offers this guidance around the issues, current as of June 2024. Intel notes their investigation is ongoing and the guidance offers possible mitigations, but these mitigations do not always resolve the issues enthusiasts and gamers are facing.

community.intel.com – 18 Jun 24

June 2024 Guidance regarding Intel Core 13th and 14th Gen K/KF/KS instability…
Intel and its partners are continuing to investigate user reports regarding instability issues on Intel Core 13th and 14th generation (K/KF/KS) desktop processors. We appreciate the Intel community’s patience on the matter and will continue to share…

Level1Techs Has An Idea
If we need to gather a lot of data, what better way than game crash telemetry databases?

See also the video

Intel has a Pretty Big Problem
Error Rate
The game crash telemetry was interesting; a lot of systems were in the crash database but the crash rate per unit time of play is not straightforward to estimate with the way game crashes are typically logged.
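To make the estimation problem concrete, here is a minimal sketch of a crash rate normalized by play time. The function name and the numbers are hypothetical and are not taken from any crash database; the point is only that a raw crash count says little until the hours of exposure are known.

```python
def crashes_per_1000_hours(crash_count: int, play_hours: float) -> float:
    """Crash rate normalized by exposure (hours of play)."""
    if play_hours <= 0:
        raise ValueError("play_hours must be positive")
    return 1000.0 * crash_count / play_hours


# Hypothetical numbers for illustration only.
# The same crash count implies very different rates once play time is known.
heavy_player = crashes_per_1000_hours(crash_count=10, play_hours=2000)
light_player = crashes_per_1000_hours(crash_count=10, play_hours=50)
print(heavy_player, light_player)
```

Crash logs usually record the numerator (a crash happened) but not the denominator (how long the game ran without crashing), which is why the rate is hard to pin down from telemetry alone.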

More Problems with this approach
When AMD had a similar problem – it was possible to murder AMD CPUs in some scenarios with Asus, and to a lesser extent Gigabyte, boards venturing outside recommended

Some will write off these issues with Intel cpus as inevitable consequences of chasing the performance crown. Once a CPU has degraded, maybe it is not possible to recover stability?

A Better Approach – Datacenter Usage
Unhappy with what we found analyzing game crash databases, I decided we needed a new approach.

It would be better to control both the system population and the configuration of machines experiencing issues.

These CPUs can also be leased inside a datacenter for game servers and tasks that run well with high single core clock speeds. This typically means that you get error correcting memory and a different chipset motherboard – W680. This is the ideal data source for further analysis.

W680 is potentially a huge help in isolating a voltage and clock problem here because W680 is much more conservative in terms of clocks and watts.

Do we still see issues with W680?

Yes. In a test population of more than 210 W680-based systems, 47.1% of these systems experienced at least one incident of instability over a 168-hour test window. This distribution is the same to within 0.4% between Asus brand W680 and Supermicro W680 based boards.
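As a rough way to read the 47.1% figure, the sketch below converts "at least one incident in a 168 hour window" into an implied per-hour incident rate. It assumes incidents arrive at a constant rate and independently (a Poisson model), which is an assumption of this sketch and not something the test data establishes; if CPUs degrade over time, the true rate would not be constant. The function names are hypothetical.

```python
import math


def hourly_rate_from_window(p_at_least_one: float, window_hours: float) -> float:
    """Constant incident rate implied by P(at least one incident) in a window."""
    if not 0.0 <= p_at_least_one < 1.0:
        raise ValueError("probability must be in [0, 1)")
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    # Poisson model: P(no incident in t hours) = exp(-rate * t)
    return -math.log(1.0 - p_at_least_one) / window_hours


def p_at_least_one(rate_per_hour: float, hours: float) -> float:
    """Probability of one or more incidents in a window of the given length."""
    return 1.0 - math.exp(-rate_per_hour * hours)


rate = hourly_rate_from_window(0.471, 168)
print(f"implied rate: {rate:.5f} incidents per system-hour")
print(f"implied mean time between incidents: {1 / rate:.0f} hours")
print(f"implied chance of an incident in any 24 hours: {p_at_least_one(rate, 24):.1%}")
```

Under that assumption the same rate can be rescaled to any window length, which makes it easier to compare against populations tested for a different number of hours.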

One datacenter technician told us they no longer offer for sale 13th gen CPUs, and they had replaced 13th gen with 14th gen CPUs for customers experiencing issues.

If this were just an eTVB issue, one would think that W680 would be immune, or at least, have a lower rate of crashing.

What Did We Find That Was Most Stable?
Our population of systems included 128 GB (4x32 GB) and 96 GB (2x48 GB) systems. The 2x48 GB systems were stable with the W680 default power configuration (0x123 microcode was the latest available as of 7/10/2024 on W680) and 125W TDP. These systems also used a multiplier limit of 53 and a memory speed cap of 5000 for 1dpc and 4200 for 2dpc.
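For anyone auditing a fleet against those limits, here is a minimal sketch of a config check. The limits (multiplier 53, 125W TDP, memory caps of 5000 for 1dpc and 4200 for 2dpc) come from the paragraph above; the class, field, and function names are hypothetical, and how the values are actually read from a machine is left out.

```python
from dataclasses import dataclass

# Limits quoted in the post; the names below are hypothetical.
MAX_MULTIPLIER = 53
MAX_TDP_WATTS = 125
MEMORY_CAP = {1: 5000, 2: 4200}  # DIMMs per channel -> memory speed cap


@dataclass
class SystemConfig:
    multiplier: int
    tdp_watts: int
    dimms_per_channel: int
    memory_speed: int


def violations(cfg: SystemConfig) -> list[str]:
    """List every setting that exceeds the limits quoted above."""
    problems = []
    if cfg.multiplier > MAX_MULTIPLIER:
        problems.append(f"multiplier {cfg.multiplier} > {MAX_MULTIPLIER}")
    if cfg.tdp_watts > MAX_TDP_WATTS:
        problems.append(f"TDP {cfg.tdp_watts}W > {MAX_TDP_WATTS}W")
    cap = MEMORY_CAP.get(cfg.dimms_per_channel)
    if cap is None:
        problems.append(f"no cap known for {cfg.dimms_per_channel} DIMMs per channel")
    elif cfg.memory_speed > cap:
        problems.append(f"memory speed {cfg.memory_speed} > {cap}")
    return problems


print(violations(SystemConfig(53, 125, 1, 5000)))
print(violations(SystemConfig(57, 253, 2, 5600)))
```

Passing this check does not mean a system is stable; it only means it matches the settings that were stable in this particular population.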

Using ECC memory with W680 is also recommended.

Some systems were stable with DDR5-4400 (2dpc) and DDR5-5200 (1dpc) per spec, but surprisingly a lot of these systems that had been stable at these speeds months ago needed to be stepped down just a bit to reattain stability.

It does seem like there is a lot of evidence here for degradation over time even with W680's very conservative settings.

Moving Forward
Intel will need to offer warranty services; maybe something similar to what they did with OC insurance for 10th gen CPUs?

The uncertainty here no doubt frustrates many gamers.

I put together this thread for next steps

15 Likes

We have seen this since day 1 of the 13th gen, since our SOP involves several days of burn-in (and sweet benchmark scores).

Undervolting the CPU with a mild overclock typically regained stability, but more than one has come back soft-bricked after the end user applied XMP default settings following several months of use at stable undervolted overclocks.

Undervolting along with only populating 1 DIMM per channel yielded us the most reliable configurations and allowed decent overclocks including locked all core turbo which I have running in several environments trying to avoid Threadripper builds.

There appears to be a correlation between DIMM population and overclocking but so far it’s find a stable balance and move on.

I do not have any dumps to contribute, but will start collecting from now on.

Any other consumer skus you are chasing?

EDIT: Posted before seeing the video, looks like I restated your findings
After talking to my lead tech, he said underclocking DIMMs was a quick fix for the worst systems. He just sent out a 14900KF with DDR5-6000 locked to 3100 as that was the only way he could get it to POST. The customer will be replacing it with AMD for his next build after being a lifelong Intel fanboy. As an aside, coolers do not appear to affect stability.

6 Likes

Sounds like a memory controller issue? If systems are largely stable after downclocking RAM to 3400 or so?
Gosh, this reminds me of the Pentium 3 1.13 GHz instability issues… hopefully Intel will welcome this information and get to work creating this branch of investigation in their labs.

2 Likes

This is kind of insane. Intel still has brand recognition. At the end of the day that can be what makes or breaks a company. The negative PR from this is going to scar them for a long time.

Video was cool. I hope we get more of these kinds of deep dives.

Also added

6 Likes

Before I watched your video I thought I was going crazy with the number of RMAs our company had to process specifically relating to these two Intel SKU families. Lenovo, Dell, HP, refurbished, non-refurbished. Definitely going to be bringing this up in our next meeting with my higher-ups regarding new customer purchases in the future.

4 Likes

More of this please - this was fantastic and it’s something Wendell/L1Techs is uniquely positioned to do.

3 Likes

Buildzoid has been complaining about the Intel memory controller since 11th gen. Now “silicon degradation” isn’t the same thing, but I think it’s interesting to note the quality of products being lowered and wonder if there is some correlation there.

2 Likes

Part of the reason I decommissioned my 14900K system was due to intermittent NVME errors causing silent data corruption. ZFS was catching it on linux, but Windows’ NTFS (obviously) wasn’t. I thought initially it was my RAM overclock (DDR4-4400, was stable on my 12900k in the same motherboard but who knows) or the CPU overclock (nothing crazy - undervolted, too, and would pass 48+ hours of memory tests and y-cruncher/occt/etc). So I reset back to factory settings, including DDR4-3200, but was still getting silent corruption, albeit at a slightly lower frequency. (To be clear, this is on 3 different NVME drives, 3 different brands, one gen4 two gen3, which were completely stable beforehand.)
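For anyone on a filesystem without built-in checksums who wants to look for the same kind of silent corruption, here is a minimal sketch that hashes a file before and after a copy. This is not what ZFS does internally (ZFS checksums blocks inside the filesystem); it is a user-space approximation, and the function names are hypothetical. One caveat: a read straight after a write may be served from the OS page cache, so a mismatch is strong evidence of a problem but a match does not prove the data on disk is good.

```python
import hashlib
import shutil


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src: str, dst: str) -> bool:
    """Copy src to dst, then report whether both files hash identically."""
    shutil.copyfile(src, dst)
    return sha256_of(src) == sha256_of(dst)
```

Calling `copy_and_verify("big_file.bin", "copy_of_big_file.bin")` in a loop across several drives, and logging any `False` result, would give a crude record of how often copies come out wrong; the file names here are placeholders.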

Ended up building a Threadripper system to replace it, reusing all of the problematic SSDs, and am completely problem-free now. I had wondered if it was maybe a motherboard problem, or if there was something genuinely wrong with the CPU, but couldn’t RMA it to find out since I delidded it to try to keep temps in check :confused: Apparently it’s better to just bounce off 100C for hours on end rather than run direct-die cooling to maintain a more reasonable 65C.

5 Likes

well this sucks - glad my most recent system is a 11th Gen

1 Like

I’ve been bit by this bug hard! I decided to move from x299 back to mainstream platforms, so picked up a z790 mb and a 14900K when it became available. (Even pre-ordered it). Right off the bat, bad memory. Wound up having to return my Dominator Titanium DDR5 kit for an RMA replacement. Installed it, still weird BSODs on windows 11 pro, odd things like programs closing with no errors, programs getting into a close and re-launch loop in the background, sub-par performance, the whole nine yards. Updated bios. Updated drivers, updated everything I could. Finally, Intel suggested using their new ‘revised default’ settings that the newest bios were incorporating. These settings crank the performance levels through the floor. They are WAY sub-standard settings compared to what advertised specs say the hardware is running at. That helped a LITTLE, but still had all sorts of weird issues going on. Checked for viruses, nothing. Even reinstalled windows to try and escape it. No luck. Intel finally RMA’d the cpu. I get the new replacement and it’s got markings on the contacts that made me very unwilling to risk installing it in my board. SO, I contacted them, and got email the other day that they’re going to RMA that one too.

Now I see this video and realize there are more people out there having this same issue or very similar issues. I even went and bought a new cpu locally and it was DOA. (13900K in this case). Then I bought a 14700k and it too wouldn’t post past the cpu fault on my board. Tried 3 boards, and 3 procs and not one combo would post. This is the WORST hardware reliability issue I have ever seen. This makes the Voodoo5 and the lack of drivers for it look like a fun time!

I have an AMD cpu, it sucks. It’s slow as hell and frankly I wouldn’t go back to AMD normally, but right now, I’m wondering what to do. I have a ton of money tied up in this 14900k build that I can’t even use! Right now, I’m back on the x299 box just to have a system that works.

If anything I’ve been going through can help put together a larger picture of what is going on here, please, just let me know! I’d be more than happy to help figure this out once and for all!

Thanks!

2 Likes

That reminds me: That’s another symptom I was having! Getting constant CRC errors on brand new nvme drives over and over and over and over when trying to do simple downloads or copy paste.

1 Like

Refreshing to watch videos on these puzzling issues with hard, concrete evidence and irrefutable facts that cleanly disproves a lot of the common assumptions right off. Some real investigative journalism right there. Great to watch and as a hardware enthusiast I am very curious to see where this goes and how it plays out.

That the issue has spanned two generations in a row and Intel still hasn’t outright stated what the cause is or named mitigations to minimize it is not a good sign.

1 Like

I work on the side as a freelancer beta testing games and checking game crash logs etc… helping devs to optimize their games and acting as outsourced QA for some studios.

Most of my work is under NDA so I can’t say much, but I can tell you this much: yes, Intel has an issue XD. However, I didn’t make the connection that it’s only the 13900K/14900K, so this thread and your video shed some light on my hypothesis. Thanks.

1 Like

That’s the main issue I see, if the cause is not properly addressed, no matter how many cpus they will RMA, the same thing will most likely occur again.

I cannot think how they will properly fix that. Even if they do a recall, people have invested in the platform as well, and I think the new ones will be on a different mobo generation.

It’s very very sad for the end users…

1 Like

The post was flagged, so at the end of the post I will give a VERY BRIEF description of why I think my experience with Ryzen might be applicable to the Intel problem.

From my extensive testing with Ryzen from 3rd Gen onwards, I might have some info to set you onto the right track, even though the problem you are describing is for Intel.

I have been testing along the lines that nobody else has in the Tech Media or Youtube as far as I can see, and I have written a guide on configuring Ryzen 3rd/4th/5th Gen on any motherboard that is used internally at AMD.

It is a lot easier to demonstrate (takes me about 10 seconds) than it would to explain.

It involves hard crashing my system and is reproducible 100% of the time.

The explanation - which is quite extensive - would take way longer to write in a post than to explain, because of the caveats, the misunderstandings etc.

So if there is any interest you can reach me on Discord under the username “michaelnager”

My avatar is the Hubble Telescope picture that looks like someone (God?) giving the middle finger :grin:

As for my name, the Michael part is obvious and “Nager” is the German word for a rodent and a mouse is a kind of rodent; so my name literally means, “Mickey Mouse”. That’s what I started off with and worked backwards to Michael Nager.

Just FYI, my first ever overclock was an Intel 8088 from 4.77 MHz to 6MHz.

So here is the very brief description of the problem I have found.

For the CPU to achieve the higher single core performance it needs to punt in a lot more voltage, that voltage however has to drop for multicore.

The problem however is that there is a limit to how much voltage can be dropped, i.e. how big of a jump can be made, without the CPU hard crashing.

My feeling is that something similar is happening with Intel CPUs that have been pushed to their limits.

I don’t have an Intel CPU, but I am not a bigot or a tribalist, and if I think that I have some insight that might be of aid to solving an Intel problem, then I will post it.

If this is not good enough for you, Wendell, to be classed as “relevant” then I will delete it and you can go about your merry way.

The thing is that there is a lot more to it than what I can explain in a post, and we can either do the whole posting back and forth thing, or you can give me a quick voicecall on Discord and I can SHOW YOU.

The thing is that the whole posting back and forth thing can lead to us being here until one of us dies, and I am 65 years old, so in my case that is a definite possibility.

2 Likes

I am in an RMA right now; my 14900K arrived this morning. I will test it after work and give feedback.

1 Like

I have been pulling my hair out with one W680 13700 Proxmox node over the past few months and at this point maybe this is a related issue? Seems accelerated by IOMMU usage:

What’s curious is that I get kernel errors/crashes from a different module every time, similar to the randomness mentioned in the video.

1 Like

14700K w/ Asus W680 here. Using Win 11, the Solidigm NVMe driver shows an error in Event Viewer (something detecting an error from RAID port 5; the drive is not being used in RAID). However, nothing shows in SMART, nor have I experienced any corruption.

I still have a 12900k sitting around, might put that back in to see if the error goes away.

@wendell The earlier BIOSes had settings at 4096w and 700A for this board. I always disabled MCE though.

1 Like

Fingers crossed… it’s going in now.

It has this rating; my defective one had 99 points overall.

I am in Windows now… I am going to watch YouTube… which, as harsh as it sounds, was not stable with my last 14900K… tomorrow I will try some UE5 games… with stock settings… which, as harsh as it sounds, were not stable without an underclock…

1 Like