Intel's Got a Problem - 13900K/KS/KF 14900K/KS/KF Crashing

@wendell The degradation-over-time behavior looks similar to the issues I had with multiple AM4 systems when trying to run XMP on them.

TL;DR: full XMP degraded stability over a span of anywhere from just hours after first boot with XMP to two weeks or a month. Dialing back XMP made the systems more stable, but over time their stability slowly degraded anyway. Limiting or disabling turbo on those CPUs also brought stability. Replacing kits with new ones seemed to restart the issues, and it seemed to be the memory kit that degraded and not necessarily the CPU itself. Fun fact: a few of the kits I checked could run memtest64 overnight without errors after being annoyingly unstable.

So one thing that I don’t see in the info here is how the affected systems are brought back to stability apart from dialing back settings. Did anyone try replacing the memory kits instead of the CPUs after dialing back XMP on the unstable system brought back some of the stability?

Do we have data on the memory kits’ “class” in those servers? Are those cheap’o kits or some quality stuff? If server providers are buying them in bulk and looking for the best price, maybe the specific kits that look optimal price-wise, and that worked flawlessly on previous gens and AMD systems, are not supported properly?

Would love to know what state the CPUs and memory kits from these unstable systems are in when tested separately, each working with otherwise new hardware.

2 Likes

@wendell
There is a program called sandsifter that brute-forces that undocumented stuff. It’s older, but they managed to find a root switch with it, just by pulling on one of the undocumented things.

Dunno if it will help, find anything, or even work.

regards

1 Like

I think it’s worth mentioning that I have an unraid system running on a 13700 non-k CPU that produces constant machine-level errors. Nothing that ever causes a crash. But they won’t stop no matter what I tweak. I’ve given up and told the reporting system to ignore them.

3 Likes

Back in the day I had lots of issues with a Ryzen 7 3700X, and after the RMA I switched to a 10700K, which is very stable. I was contemplating switching to 13th/14th gen, but now I’m wary of both Intel’s instability and my past bad experience with AMD.

3 Likes

I’ve always used second hand parts myself. The way I “glamorize” this is by telling myself that once a component has been used 3-5 years and ends up on the second hand market, it’s battle tested! So you end up paying less for the most stable stuff :laughing:

With all these issues, up and down the stack from desktop to workstation, from both AMD and Intel, maybe it’s true?

2 Likes

Seems like Intel’s memory issues are haunting them again. A relatively similar issue came to my attention when I was upgrading to 11th gen:
multiple generations before it had architectural cache memory issues that triggered on specific code paths and so were common only in some games. More details can be found in this Reddit or overclocker(dot)net post.
I’m unable to post a direct link, so put this into Google to reach it: asus_maximus_13_and_rocket_lake_the_rules_have

Please post the stress test parameters you used. You seem to mention at least three tools, and y-cruncher itself has a ton of parameters.

Try the y-cruncher stress test:

Stress test
1200 test duration
Enable all tests

0 to deep stress test.

That and 7-Zip’s benchmark for compress/decompress will reveal issues for most CPUs.
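For anyone who wants a rough idea of why compress/decompress runs catch this kind of fault, here is a minimal sketch in Python. It is not y-cruncher or the 7-Zip benchmark, just a stand-in using the standard-library `zlib` and `lzma` modules; the function names (`make_buffer`, `roundtrip_errors`, `run_parallel`), buffer sizes, and seeds are made up for illustration, and `random.Random.randbytes` needs Python 3.9 or newer. The idea is simply that a round trip through a compressor must return bit-identical data, so any mismatch or decoder exception on a healthy OS install points at hardware.

```python
import hashlib
import lzma
import random
import zlib
from concurrent.futures import ThreadPoolExecutor


def make_buffer(size: int, seed: int) -> bytes:
    """Build a reproducible, partly compressible buffer of `size` bytes."""
    rng = random.Random(seed)
    # A small dictionary of random "words" repeated in random order gives
    # the compressors real matching work to do.
    words = [rng.randbytes(rng.randint(4, 32)) for _ in range(256)]
    out = bytearray()
    while len(out) < size:
        out += rng.choice(words)
    return bytes(out[:size])


def roundtrip_errors(iterations: int, size: int, seed: int = 0) -> int:
    """Compress and decompress `iterations` buffers; count any mismatches."""
    errors = 0
    for i in range(iterations):
        data = make_buffer(size, seed + i)
        expected = hashlib.sha256(data).digest()
        for compress, decompress in ((zlib.compress, zlib.decompress),
                                     (lzma.compress, lzma.decompress)):
            try:
                restored = decompress(compress(data))
            except (zlib.error, lzma.LZMAError):
                # A corrupted stream usually surfaces as a decoder error.
                errors += 1
                continue
            if hashlib.sha256(restored).digest() != expected:
                errors += 1
    return errors


def run_parallel(workers: int, iterations: int, size: int) -> int:
    """Run several round-trip loops at once and return the total error count."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(roundtrip_errors, iterations, size, w * 1_000_003)
                   for w in range(workers)]
        return sum(f.result() for f in futures)


if __name__ == "__main__":
    # Small illustrative run; a real soak would use far more iterations.
    print("round-trip errors:", run_parallel(workers=4, iterations=8, size=1 << 20))
```

This loads the machine far less than the real tools do, so a clean result here proves little; it only shows the shape of the check.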

6 Likes

The folks at Digital Extremes also noticed this instability.
They don’t mention the server side of things, but the graph for a specific crash in the Nvidia driver says a lot about the state those CPUs are currently in.

1 Like

Alderon’s in as well. Might not be anything to it but their phrasing’s more general than just i9s.

2 Likes

The degradation over time is what makes this really frustrating. It’s clearly an issue during or before fabrication, IMO. The biggest change between Alder Lake and Raptor Lake was cache, so that would be where I look first. A cache issue would also lead to decompression errors becoming a common occurrence.

There’s no way Intel is going to acknowledge this for retail, or even client-side enterprise machines. It’s just going to be too expensive to fix, if all caches of the 8+16 dies are prone to fail.

The datacenter folks might just get a free upgrade to Arrow Lake, to stave off AMD.

3 Likes

Seems like Intel’s ultra-aggressive power and thermal designs finally caught up with them in an end-user-visible way. Some proverbial bucket finally spilled, and end users noticed. I wonder what combination of factors it is this time.

What surprises me is that it has not happened sooner. Upper-end i7s and i9s were truly insane in the last few generations, and getting more insane.

There was a fellow here on the forums cluelessly asking why he was getting bad temperatures and throttling on a 14900KS:

  • 24 core heterogeneous cpu
  • up to 300-400 W real-world power draw (CPU only!)
  • thermal ceiling is reached immediately even with high end air cooler or custom hydro

=> these are not reasonably binned and configured parts, even in an ideal environment on an ideal platform.

While the KS is the extreme that proves the point, the lower ones aren’t off the hook either.

Way too much power running through too little area. Now if you activate XMP and incidentally overvolt the IMC, magic smoke might happen soon. Add in the mobo vendors’ insane hidden defaults for a little bit more surprise, and sprinkle MultiCoreEnhancement on by default for shits and giggles.

I just cannot imagine how something like that, on near-cutting-edge lithography nodes, can work long term or under sustained workloads.

Also small tldr:

If a CPU cannot reliably operate at advertised clocks on stock settings, then it is a defective product, full stop. RMA aggressively and demand a full refund. Throw the motherboard in for good measure. The reason for the malfunction does not matter in the end, and the motherboard is useless for a defective CPU line.

Beancounters and management are banking on there being confusion, and will keep throwing the blame like a hot potato between motherboard vendor and system integrator for as long as it takes for the easy legal return windows to run out.

EDIT2:
It’s insane to know some people are running Frankenmonster servers on this platform and actually using them for real-world production. Game servers, yes, but still production.

It doesn’t matter if you use a W-series chipset and Supermicro boards, it’s still a consumer platform with all the relevant corners cut at the design phase.

There is enterprise-grade hardware readily available that’s not that expensive and not that significantly slower. You have to spend money to make money.

Talk about self-inflicted wounds here.

EDIT3:

Wendell: … Not so fast there Chuckles!

That quote deserves to be stolen and broadly applied elsewhere, huehue

Sidenote - is Steve’s comment about server providers losing trust in Intel due to this actually reasonable?

  • this is contained to the 13th and 14th generation consumer lineup, not the Xeon line
  • the absolute majority of hardware actually used in the enterprise market is not affected, or even theoretically impacted, at all
  • hosting and supporting this kind of setup is a niche market, no?
  • we’re talking about Chipzilla here; even if some smaller server hosters threatened to boycott them over this, what do they care? A few tens of thousands of lost sales is a drop in the bucket for them.
2 Likes

Has anyone seen shades of this degradation on the Minisforum AR900i with the 13900HX? Mine has been quiet, but I wonder if I should be stress testing it to validate, and if so, with what exactly.

@wendell If you have a how-to for stress testing and capturing data in a way that would be constructive for you, please share. I bet many tinkerer folks have this box, and it falls within the CPU blast radius, right?

2 Likes

I haven’t but have had the same thought and think this is an interesting question. The 185H might be worth a look too to see what’s happening with Intel 4. Probably there’s a good bit less H series than K series data, though, so statistical power needed for a conclusion might be lacking.

Also haven’t seen anything showing Emerald Rapids isn’t affected. Xeon 5th gen does have roughly a third the power density and clocks 1-2 GHz lower in the parts I’ve checked, which IMO makes it plausibly distinct from the desktop parts under a power-clock root cause hypothesis.

This is only anecdotal, but I’ve had a 4th gen Xeon W running full tilt at 400-500 watts 24/7 for the past 11 months; it’s been OC’d to within a fairly small margin of stability and there has been no instability encountered the past year.

4 Likes

I can second this with a sample size of two SPR workstation CPUs.

3 Likes

SPR meaning Sapphire Rapids? I wonder if the Xeon E-2488 and 2486 might be the most susceptible candidates, as they’re Raptor Lake and clocked to 5.6 GHz like the 14700K and KF.

@wendell Did you ever do detailed voltage monitoring on a known troublesome setup? Is Intel’s default freq/voltage curve sane on these high-end parts?

There was some anecdote on Reddit mentioning that some voltage domain (CPU VID?) was being pumped up to 1.5V in plain, simple single-core workloads on defaults.

Might be error in communication, but what if it weren’t?

If it was observed CPU VID, now that is positively insane. I haven’t mucked about with OCing for a long time, but in Intel’s 22nm era that was the threshold of insanity that guaranteed degradation and was used only for extreme overclocking.

And that silicon was way more forgiving than Intel 14++++++nm or whatever the 13/14 series is now.
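If someone does log VID over time, a minimal sketch of how the readings could be summarized is below. It assumes you already have a plain list of voltage samples exported from whatever monitoring tool you use; the `summarize_vid` function, the `VidSummary` type, and the example numbers are hypothetical, and the 1.5 V default threshold is just the figure from the anecdote above, not an Intel spec limit.

```python
from typing import Iterable, NamedTuple


class VidSummary(NamedTuple):
    samples: int           # number of readings examined
    peak_volts: float      # highest reading seen
    over_threshold: int    # readings at or above the threshold
    share_over: float      # fraction of readings at or above the threshold


def summarize_vid(readings: Iterable[float], threshold: float = 1.5) -> VidSummary:
    """Summarize logged VID readings (in volts) against a threshold."""
    values = list(readings)
    if not values:
        raise ValueError("no readings supplied")
    over = sum(1 for v in values if v >= threshold)
    return VidSummary(len(values), max(values), over, over / len(values))


if __name__ == "__main__":
    # Made-up sample values purely to show the call; not measured data.
    print(summarize_vid([1.20, 1.35, 1.52, 1.50]))
```

Peak and time spent above the threshold are reported separately because a brief spike and a sustained plateau would mean very different things for degradation.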

@lemma

Do Sapphire Rapids and Emerald Rapids share a design with the current consumer 13/14 gen lineup? Heterogeneous designs are clearly not entering the enterprise lineup; the question is what made the cut into workstation designs.

2 Likes

Yeah, so this is true but it’s also complicated. If your AC/DC loadlines match and are in spec, then generally (giant kvetching here) limiting the multiplier to 53 prevents the voltage blast.

However, I have small-sample-size datasets showing issues on the 13700T, soooooo
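For reference, the multiplier cap mentioned above is simple arithmetic: core clock is the multiplier times the base clock. A small sketch, assuming the nominal 100 MHz BCLK (boards can deviate slightly from that); both function names are made up for illustration.

```python
def core_clock_ghz(multiplier: int, bclk_mhz: float = 100.0) -> float:
    """Core clock in GHz for a given multiplier and base clock (BCLK)."""
    return multiplier * bclk_mhz / 1000.0


def capped_multiplier(requested: int, cap: int = 53) -> int:
    """Apply a multiplier cap: never run above `cap`, never raise a lower request."""
    return min(requested, cap)


if __name__ == "__main__":
    for requested in (48, 56, 60):
        m = capped_multiplier(requested)
        print(f"requested x{requested} -> runs x{m} = {core_clock_ghz(m):.1f} GHz")
```

So a 53 cap means nothing boosts past roughly 5.3 GHz, which is where the voltage requests at the very top of the curve get cut off.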

4 Likes

You forgot the not so fast there Chuckles !

4 Likes