Background
Watch: [embedded video]
Related: [embedded video]
What to Do About It?
Here is the Gamers Nexus survey; if you can, submit information about your system and the problems you are experiencing.
First Steps and How To Think About It
Update your BIOS. Full stop. This is the single most important thing you can do. You can save your current BIOS configuration to an OC profile first, and then it might help to load optimized defaults.
You may, or may not, be able to set your XMP profile. Don’t assume that your XMP profile will work. I have also found certain kits of memory with certain CPUs that may have started to degrade are stable at neither default JEDEC memory speeds nor the XMP profiles (which may have previously been stable). In this special case you might be able to load the XMP profile and then set the DDR speed manually to DDR5-5600 or DDR5-4400.
Technically, Intel guarantees only DDR5-4000 (1) on nearly any motherboard you might have, so you may very well have to run DDR5-4400 or even DDR5-4000, especially when we’re talking about dual-rank 16, 24, 32, or 48 GB DIMMs.
It seems to be the case that systems which had previously been stable at terrific memory clocks are no longer able to attain those clocks… so it is necessary to fall back to JEDEC speeds.
(1) This is somewhat debatable; some Intel docs say that on 2DPC motherboards you always use the slower/worse 2DPC RAM speed even when using only one DIMM per channel, whereas other documentation for the Alder Lake chipset has the expected meaning: 1DPC means one DIMM installed per channel and 2DPC means two DIMMs installed per channel.
Monitoring
Get HWiNFO64. If you run into trouble, post an HWiNFO64 report here. It’ll tell us a lot of useful info about your Intel system.
On Intel baseline defaults your system should not be overheating, crashing, or showing any serious warnings in HWiNFO64, even after running stress tests for hours (next steps).
Look at P-core and E-core voltage in HWiNFO64, particularly when your system is under load. We think, but aren’t sure yet, that 1.6 V and above is too much. Maybe 1.5 V is pushing it? We don’t really know for sure, but we do know that i7 and i5 CPUs are experiencing fewer issues, AND those CPUs tend to operate with lower boost voltages.
Stress Testing
Falcon Northwest has found that installing NVIDIA drivers over and over again, if you have an NVIDIA GPU, is a good test of stability. Install 10 times with no crashes? Good to go.
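If you’d rather not click through the installer ten times by hand, a loop like the one below can drive it. To be clear, this is my own sketch, not Falcon Northwest’s harness: the installer path is hypothetical, and the -s / -noreboot silent-install switches are my assumption about NVIDIA’s setup.exe, so verify both against your driver package first.

```python
import subprocess
import sys

# Hypothetical path to an extracted NVIDIA driver package; adjust for your system.
INSTALLER = r"C:\NVIDIA\DisplayDriver\setup.exe"

for i in range(1, 11):
    print(f"install pass {i}/10 ...")
    # -s / -noreboot are assumed silent-install switches; verify for your package.
    result = subprocess.run([INSTALLER, "-s", "-noreboot"])
    if result.returncode != 0:
        print(f"installer exited with code {result.returncode} on pass {i}; suspect instability")
        sys.exit(1)
print("10 clean installs; probably good to go")
```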
I have also found y-cruncher to be a good stress test. On Linux: enable all tests, set the test duration to 1200 seconds, and run the stress test. It’ll keep looping until you stop it. If you’re 24-hour stable, then you’ve probably got a reasonably stable system. If you get errors, skip to the next section.
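If you want to babysit that less, here is a minimal wrapper that re-launches the stress test and stops on the first failure. Everything about the invocation is an assumption: stress.cfg is a made-up name for a saved y-cruncher configuration, and I’m assuming the binary exits non-zero when a test fails, so swap in however you actually launch it.

```python
import subprocess
import time

# Placeholder invocation: replace with however you launch your saved
# y-cruncher stress-test configuration.
CMD = ["./y-cruncher", "config", "stress.cfg"]
DEADLINE = time.time() + 24 * 3600  # aiming for 24-hour stability

while time.time() < DEADLINE:
    result = subprocess.run(CMD)  # assumes each run exits when its duration elapses
    if result.returncode != 0:    # assumes a failed test produces a non-zero exit
        print(f"stress run exited with code {result.returncode}; see the next section")
        break
else:
    print("24 hours without an error; probably a reasonably stable system")
```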
Another good test is 7-Zip’s built-in benchmark, or WinRAR’s built-in benchmark.
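7-Zip’s benchmark is scriptable: `7z b` runs it from the command line. Here’s a quick loop (my own harness, not part of 7-Zip) that repeats it and flags any pass that errors out or crashes; it assumes 7z is on your PATH.

```python
import subprocess

# "7z b" runs 7-Zip's built-in benchmark; a non-zero exit (or an outright
# crash) on any of the repeated passes is a red flag.
for i in range(1, 21):
    result = subprocess.run(["7z", "b"], capture_output=True, text=True)
    if result.returncode != 0:
        print(f"benchmark pass {i} failed (exit code {result.returncode}):")
        print(result.stderr)
        break
else:
    print("20 clean benchmark passes")
```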
If your system passes y-cruncher but you still have errors, keep reading.
If you pass but the system still does weird stuff, then you may be having PCIe errors. On a small subset of systems I observed unexplainable PCIe errors from network cards and PCIe storage. I think you should reply below with more info about your system. It is also helpful to look over the Windows event logs under System to see if you have any WHEA errors or red-X hardware errors. More research is needed here.
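On Windows you can pull those WHEA entries without clicking through Event Viewer; the stock wevtutil tool takes a standard event-log XPath query. A small sketch (run it from an elevated prompt):

```python
import subprocess

# Query the System log for entries from the WHEA-Logger provider
# (hardware errors), newest first, as plain text.
query = "*[System[Provider[@Name='Microsoft-Windows-WHEA-Logger']]]"
result = subprocess.run(
    ["wevtutil", "qe", "System", f"/q:{query}", "/c:10", "/rd:true", "/f:text"],
    capture_output=True, text=True,
)
print(result.stdout or "no WHEA events returned (or the query failed)")
```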
Errors from Stress Testing?
The very first step is to load setup defaults, disable Multi-Core Enhancement if that’s on by default (on Asus it usually is), or select the Intel baseline profile. Paradoxically, you most likely do not want a failsafe profile: because so many people have had issues, these profiles seem to run the CPU at max voltages, which could be bad. I didn’t like what the failsafe profile does on the ROG Strix Z790 in particular, anyway, vs the Intel baseline.
The first real step would be to back off on memory clocks from the XMP profile: first to 5600, then to 4800, 4400, and finally 4000. If you have four sticks of memory, it may be necessary to start at 4000 and then drop down to a glacial DDR5-3600.
If that still fails, then bump your memory speed back to no faster than 5600 (or 4000 if you have four DIMMs installed) and set a multiplier limit on your CPU of 54 (in the case of an i9; 50 in the case of other CPUs). Limiting the multiplier reduces both the power consumed and the single-thread/light-threading voltage required to attain the limited boost. This could prevent (further) degradation. That’s just a guess, but the results from testing datacenter machines (including non-K 13900/14900) are encouraging about reducing/preventing further degradation.
If that’s stable, then try increasing the multiplier limit until you find where it becomes unstable. Since we’re doing the most basic of basic diagnostics, it is not necessary to change load lines, AC/DC values, or anything else.
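If you want to experiment from a running Linux box before committing to a BIOS change, clamping cpufreq’s scaling_max_freq is a rough stand-in for a multiplier limit. To be clear, this is an OS-level frequency cap, not the BIOS multiplier setting, but it similarly keeps light-load boost (and the voltage that comes with it) down. Run it as root.

```python
import glob

# Rough Linux-side stand-in for a BIOS multiplier cap: clamp cpufreq's
# scaling_max_freq on every core. 50 x 100 MHz BCLK -> 5,000,000 kHz.
MULTIPLIER = 50
max_khz = MULTIPLIER * 100_000  # sysfs expects kHz

for path in glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq"):
    with open(path, "w") as f:
        f.write(str(max_khz))
print(f"capped all cores at {max_khz} kHz")
```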
Do you have an Asus gaming motherboard? If so, under AI Tweaker find V/F Point Offset, and post a screenshot of the V/F point offset table along with what CPU you have. This tells us what voltages Asus is going to apply depending on how loaded the CPU is. This might be inferable from an HWiNFO64 dump and observing sensor data during 1T/nT Cinebench runs. However, this table isn’t present in this form in the BIOS for a lot of CPUs, like T-series (low power) LGA 1700 parts, and it’s better to have the info in this format rather than trying to piece it together from HWiNFO64 dumps and load observations. Anyway…
No errors from stress testing, but PCIe Errors?
It seems like I’ve seen so many of these in the wild, and it fits well with Steve’s contaminant theory. Do you have a Samsung SSD? It needs a firmware update. As I was exploring this issue, buggy Samsung firmware sent me chasing a number of red herrings. Fortunately, this is now resolved in the latest firmware for Samsung 970/980/990 SSDs.
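On Linux, nvme-cli will show the firmware revision currently on the drive so you can compare it against Samsung’s latest release notes; a quick sketch (assumes the nvme-cli package is installed and you’re root):

```python
import subprocess

# "nvme list" prints model and firmware revision (FW Rev column)
# for every NVMe drive in the system.
result = subprocess.run(["nvme", "list"], capture_output=True, text=True)
print(result.stdout)
```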
Next you could try disabling E-cores. In my experience, though, this is unlikely to stabilize a system as well as the changes above do.
TODO add some more here; let me know what problems you’re having.
No errors, but on Linux you get soft lockup bugs?
Everything is fine in y-cruncher, but then you see stuff like this:
[279762.794259] watchdog: BUG: soft lockup - CPU#17 stuck for 20895s! [kworker/17:2:34843]
[279762.794877] Modules linked in: tls uas usb_storage qrtr cfg80211 binfmt_misc intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi input_leds joydev soundwire_generic_allocation x86_pkg_temp_thermal snd_hda_codec_realtek soundwire_bus intel_powerclamp nouveau snd_hda_codec_generic snd_soc_core coretemp snd_hda_codec_hdmi snd_compress ac97_bus snd_pcm_dmaengine kvm_intel snd_hda_intel mxm_wmi snd_intel_dspcfg drm_gpuvm snd_intel_sdw_acpi drm_exec nls_iso8859_1 snd_hda_codec gpu_sched kvm drm_ttm_helper snd_hda_core ttm hid_generic snd_hwdep cmdlinepart snd_pcm drm_display_helper spi_nor mei_pxp mei_hdcp usbhid cec irqbypass snd_timer mtd hid rapl wmi_bmof rc_core i2c_i801 snd mei_me intel_pmc_core intel_cstate spi_intel_pci i2c_algo_bit
[279762.794910] soundcore spi_intel i2c_smbus mei intel_vsec pmt_telemetry pmt_class acpi_pad acpi_tad mac_hid dm_multipath msr efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 crct10dif_pclmul crc32_pclmul polyval_clmulni nvme polyval_generic ghash_clmulni_intel sha256_ssse3 nvme_core e1000e sha1_ssse3 igc intel_lpss_pci ahci ucsi_acpi xhci_pci nvme_auth intel_lpss typec_ucsi libahci xhci_pci_renesas idma64 typec video wmi pinctrl_alderlake aesni_intel crypto_simd cryptd [last unloaded: cpuid]
[279762.794935] CPU: 17 PID: 34843 Comm: kworker/17:2 Tainted: G S W L 6.8.0-38-generic #38-Ubuntu
[279762.794937] Hardware name: Supermicro Super Server/X13SAE, BIOS 3.3 06/24/2024
[279762.794938] Workqueue: events linkwatch_event
[279762.794940] RIP: 0010:native_queued_spin_lock_slowpath+0x83/0x300
[279762.794942] Code: 00 00 f0 0f ba 2b 08 0f 92 c2 8b 03 0f b6 d2 c1 e2 08 30 e4 09 d0 3d ff 00 00 00 77 61 85 c0 74 10 0f b6 03 84 c0 74 09 f3 90 <0f> b6 03 84 c0 75 f7 b8 01 00 00 00 66 89 03 5b 41 5c 41 5d 41 5e
[279762.794944] RSP: 0018:ffffa56e4185bb40 EFLAGS: 00000202
[279762.794945] RAX: 0000000000000001 RBX: ffff94e113cdb448 RCX: 0000000000000000
[279762.794946] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff94e113cdb448
[279762.794947] RBP: ffffa56e4185bb68 R08: 0000000000000000 R09: ffff94e1190161fc
[279762.794948] R10: 0000000000000000 R11: 0000000000000000 R12: ffff94e119016134
[279762.794949] R13: ffff94e113cdb448 R14: ffff94e119016000 R15: 0000000000000000
[279762.794950] FS: 0000000000000000(0000) GS:ffff94ec4f480000(0000) knlGS:0000000000000000
[279762.794951] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[279762.794952] CR2: 00007ffcaf588928 CR3: 000000036ba3c000 CR4: 0000000000f50ef0
[279762.794953] PKRU: 55555554
[279762.794954] Call Trace:
[279762.794954] <IRQ>
[279762.794955] ? show_regs+0x6d/0x80
[279762.794957] ? watchdog_timer_fn+0x206/0x290
[279762.794960] ? __pfx_watchdog_timer_fn+0x10/0x10
[279762.794962] ? __hrtimer_run_queues+0x10f/0x2a0
[279762.794964] ? rcu_core+0x1d2/0x390
[279762.794966] ? hrtimer_interrupt+0xf6/0x250
[279762.794969] ? __sysvec_apic_timer_interrupt+0x4e/0x150
[279762.794971] ? sysvec_apic_timer_interrupt+0x8d/0xd0
[279762.794972] </IRQ>
[279762.794973] <TASK>
[279762.794974] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[279762.794977] ? native_queued_spin_lock_slowpath+0x83/0x300
[279762.794979] _raw_spin_lock+0x3f/0x60
[279762.794981] e1000e_get_stats64+0x23/0x140 [e1000e]
[279762.794995] dev_get_stats+0x66/0x160
[279762.794997] rtnl_fill_stats+0x40/0x140
[279762.795000] rtnl_fill_ifinfo+0x92f/0xfe0
[279762.795002] rtmsg_ifinfo_build_skb+0xa4/0x120
[279762.795004] rtmsg_ifinfo+0x4d/0xb0
[279762.795006] netdev_state_change+0x95/0xa0
[279762.795008] ? dev_activate+0xd7/0x120
[279762.795010] linkwatch_do_dev+0x5b/0x70
[279762.795011] __linkwatch_run_queue+0xdf/0x200
[279762.795014] linkwatch_event+0x31/0x40
[279762.795015] process_one_work+0x16c/0x350
[279762.795018] worker_thread+0x306/0x440
[279762.795021] ? _raw_spin_lock_irqsave+0xe/0x20
[279762.795022] ? __pfx_worker_thread+0x10/0x10
[279762.795024] kthread+0xef/0x120
[279762.795026] ? __pfx_kthread+0x10/0x10
[279762.795028] ret_from_fork+0x44/0x70
[279762.795030] ? __pfx_kthread+0x10/0x10
[279762.795031] ret_from_fork_asm+0x1b/0x30
[279762.795034] </TASK>
This crash comes from a suspect 13900K that otherwise passes y-cruncher and 7-Zip compression in long-duration test runs. This crash occurred on microcode 0x125 with the July 23 BIOS on an ASRock Rack LGA 1700 board. The CPU does not show requests for more than 1.35 V (which is approximately what it seems to be receiving).
This kind of crash is frustrating because it happened between stress-test runs, while the system had been sitting idle for an hour or two waiting for instructions on which test to run next, doing nothing but normal Linux background tasks.
All fixed?
I hope this helped.
If your system is stable with y-cruncher, post a screenshot and let us know your CPU, motherboard, and other relevant info.
If this helps you stabilize, let us know, particularly if it meant you had to adjust your memory speed, multiplier, or something else.