Debian Linux Stable on Pro WS W680-ACE IPMI (application segfaults, kernel panic)

Hi all,

I have been troubleshooting an issue for a couple of weeks and I am running out of ideas… Lately, I have been seeing applications segfault [1,2] randomly after a BIOS upgrade on my Asus Pro WS W680-ACE IPMI, and yes, I updated the Intel ME firmware to the latest version as well.

I am on the latest BIOS as of 7/5/2024:
ASUSTeK COMPUTER INC. System Product Name/Pro WS W680-ACE IPMI, BIOS 3603 05/27/2024.

Also attempted: I have tried the BIOS defaults and tweaked various BIOS settings, but to no avail; the segfaults (and sometimes random kernel crashes) continue. I also built my own kernels (6.9.7 and 6.10-rc6) and saw the same random segfaults and kernel crashes.

[1] Yesterday:
Jul 4 14:50:09 machine kernel: munin-html[170056]: segfault at 9 ip 000056101420a12a sp 00007ffc2c696190 error 4 in perl[561014121000+195000] likely on CPU 8 (core 16, socket 0)

[2] Shortly after boot:
Fri Jul 5 20:05:07 machine kernel: traps: iwatch[2082] general protection fault ip:5559574ddad9 sp:7ffe67f599a0 error:0 in perl[5559573b5000+195000]
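For anyone hitting similar traces: the `ip` value in these log lines, together with the mapping base in `perl[base+size]`, is enough to locate the faulting instruction inside the binary. A minimal sketch using the addresses copied from the trace above (the addr2line step assumes you have debug symbols for perl installed):

```shell
# Values from the GPF line above: ip:5559574ddad9 in perl[5559573b5000+195000]
ip=0x5559574ddad9        # faulting instruction pointer
base=0x5559573b5000      # load address of the perl binary

# File offset of the faulting instruction inside /usr/bin/perl
offset=$(printf '0x%x\n' $(( ip - base )))
echo "$offset"           # prints 0x128ad9

# With debug symbols installed, this resolves it to a source line:
# addr2line -e /usr/bin/perl "$offset"
```

If the offsets cluster in the same spot across crashes it points at a software bug; randomly scattered offsets across unrelated binaries (as here) smell like hardware.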

I’ve reached out to the LKML and debian-user@ but I have not gotten very far (I cannot post the links here, but if you Google the following you will find the threads):

6.9.7: kernel panic: RIP: 0010:btrfs_clone_write_end_io+0x1e/0x60 [btrfs] (dmesg included)

Re: 6.1.0: NVME drive goes offline randomly even with: nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Thinking it may have been related to BTRFS, I switched from btrfs to ext4, but applications (different ones) continue to segfault. I am waiting to see if the kernel will panic again (likely).

I’ve run Intel’s processor diagnostic tool and memtest86/memtest86+ and everything comes back clean. Proof at bottom.

I have also re-seated the Linux NVMe drives (previously I was running BTRFS in RAID-1; now I am using mdadm RAID-1 with ext4 on top of it).

CPU check and Memory Check:


The firmware on my 2 x Samsung 990 Pro 4TBs is the latest:

smartctl output:

Model Number:                       Samsung SSD 990 PRO with Heatsink 4TB
Firmware Version:                   4B2QJXD7
Model Number:                       Samsung SSD 990 PRO with Heatsink 4TB
Firmware Version:                   4B2QJXD7

Aside from re-installing my entire system from scratch, which would take a while: does anyone have any thoughts or ideas on what may be causing these issues?

Thanks,
Justin

PSU and voltages also appear to be OK:

Curious if anyone has suggestions on where to go from here?


Re-enabling the following; I will give it some time to see whether more segfaults and/or kernel panics occur:
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
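For anyone following along: to make those parameters persist across reboots on Debian, they go on the kernel command line via GRUB. A sketch (the `quiet` flag is just a placeholder for whatever is already in your config):

```shell
# /etc/default/grub -- append the workarounds to the existing line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"

# Then regenerate the config and reboot:
#   sudo update-grub
# Verify the running kernel picked them up:
#   cat /proc/cmdline
```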

Does anyone else that uses this board have similar issues?

Segfaults continue:
[Fri Jul 5 21:10:08 2024] traps: munin-html[53053] general protection fault ip:555754db809e sp:7ffc20e47250 error:0 in perl[555754cd0000+195000]

There is a post on the Proxmox forums where the only thing that ended up working was swapping the board out. Quoting another person using the same/similar board:

Finally, after another complete hardware change yesterday, I have had no more crashes for more than 16 hours.

So maybe there is an issue with the Asus Pro WS W680 boards, with the chipset, and/or with the i9-13900 (non-K) as the hardware gets older.

#1: Disabled ARC using primarycache=none → still crashes
#2: Set aio=threads on all VMs → still crashes (2024-05-02, SSH possible)
#3: Set cache=none on all VMs → still crashes (no kernel logs?, 2024-05-03, SSH not possible)
#4: Set intel_iommu=off → still crashes (no kernel logs?, quite fast after only 1 hour, 2024-05-03, SSH not possible)
#5: Updated BIOS to 2008, 2024-05-03 (still crashes)
#6: Install intel-microcode from debian sid → 2024-05-03 still crashes (Current revision: 0x00000122 ← Updated early from: 0x0000011f)
#7: Disable KSM → 2024-05-03 disabled (still crashes)
#8: Deal with MSRs ← options kvm ignore_msrs=1 report_ignored_msrs=0 set in kvm.conf (still crashes)
#9: go back to kernel 6.5 and leave all the modifications in place (still crashes)
#10: Set pcie_aspm=off and pcie_port_pm=off (still crashes)
#11: Set intel_idle.max_cstate=0 and processor.max_cstate=1 (still crashes)
#12: Set intel_pstate=disable (still crashes)
#13: Turn off APST using nvme_core.default_ps_max_latency_us=0 (still crashes)
#14: Disable GPU Power Management via i915.enable_dc=0 (still crashes)
#15: /sys/block/nvmeXn1/queue/scheduler from none to mq-deadline (still crashes)

#16: Lower the RAM from DDR5-4400 to DDR5-4200 (still crashes)
#17: Revert some of the changes, disable ASPM in the BIOS (still crashes)
#18: Let Hetzner change the complete hardware, revert most of the changes ← working

Try this.

Max multiplier 52, no XMP, DDR5-4200 max memory rate.

I suggest y-cruncher or Phoronix Test Suite CPU/memory-focused benchmarks to detect instability. Hours of pts/compress-7zip or y-cruncher testing is really good.
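If you don't want to set up y-cruncher right away, a crude smoke test along the same lines is to hash the same data over and over; on this class of instability the digests start disagreeing long before anything oopses. A minimal sketch (sizes and pass counts are arbitrary; pts/compress-7zip and y-cruncher are far more thorough):

```shell
#!/bin/bash
# Crude CPU/memory instability smoke test: repeatedly hash the same
# file; on stable hardware every pass must yield an identical digest.
set -u
tmp=$(mktemp)
head -c 64M /dev/urandom > "$tmp"              # fixed test data
ref=$(sha256sum "$tmp" | cut -d' ' -f1)        # reference digest
fails=0
for i in $(seq 1 20); do
    cur=$(sha256sum "$tmp" | cut -d' ' -f1)
    [ "$cur" = "$ref" ] || { fails=$((fails + 1)); echo "MISMATCH on pass $i"; }
done
rm -f "$tmp"
echo "passes: 20, mismatches: $fails"
```

Run a copy per core for a few hours; any MISMATCH line means the CPU or memory silently corrupted data.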

Know how gamers are suffering with 13th/14th gen instability? You too.

I am troubleshooting similar errors across hundreds of machines :slight_smile:

You may need to bump the DDR voltage by 0.05 to 0.1 volts. I see you’re doing 4200, but… 4200 is probably all you can hope for.

And yes, I seem to be tracking degradation over time. Interesting, given that W680 is not only more conservative out of the box but physically cannot deliver the ~300W we have been told is degrading these CPUs.

I already have a video on patreon/floatplane. I hope that the story is picked up and investigated further. 13th/14th gen may have some deep issues.

I have a feeling this is going to be bigger news this week and going forward. I have become doubtful this kind of thing can be fixed with microcode/BIOS after seeing so many people having issues.


Quick question: I’ve always favored stability, so I’ve never overclocked anything. I have been looking for a while now and have not been able to find the option to change these values. Is this possible on the Asus Pro WS W680-ACE IPMI board?

Additionally, I performed a fresh install of Debian stable, and when I run btrfs balance (after adding the drive for a BTRFS RAID1 mirror), it crashes around 10-60 seconds after each boot now.

I found a post elsewhere that suggested enabling Full Performance for each of the SSDs to prevent them from dropping offline, but right now it is not even getting that far; the machine is kernel panicking right away.

When I boot into Windows 11 on the WD Red NVMe SSD and run various benchmarks (Cinebench, etc.), I could not get the system to fail in over 20-30 minutes. Whereas on Linux, I can trigger an almost immediate failure when I run a btrfs balance…

Are there any known issues with Linux and the Samsung 990 PRO 4TB drives? I could try some other NVMe drives, but I am not sure whether that would be a waste of time.


There are some firmware issues with Samsung NVMe drives, but these should be “patched” (worked around) in any relatively recent kernel.

What happens if you reset to default settings and disable Intel Turbo Boost Max? There also seems to be something called “ASUS Performance Enhancement 3.0” that you can try disabling.

I’d also try to set “Performance Core Ratio” to [Sync All Cores] and the same for “Efficient Core Ratio”.

Boot performance mode → Max Non-Turbo Performance

To my knowledge, the official specs say 3600 MT/s on 4 DIMMs (dual rank):
https://edc.intel.com/content/www/us/en/design/products/platforms/details/raptor-lake-s/13th-generation-core-processors-datasheet-volume-1-of-2/002/processor-sku-support-matrix/

Best regards,
Daniel


this screen:

Set the all-core ratio limit lower, 52/53, and Intel Default should be selected here, and DRAM to 4200 if running 2 DIMMs per channel.


Updated: I could set 52 for the P-core limit but not for the E-cores. Going to untar some backups to see if the crash recurs.

Someone on Intel’s forums also tried this:
“14900ks unstable” intel.com

So if it’s useful for others, my setup is basically:
- Reset BIOS to defaults, configure XMP
- P-core ratio capped at x57, all core
- E-core ratio capped at x44, all core



Thank you for the suggestions and for pointing me in the right direction; I have never experienced an issue with a CPU before…

Follow-up & questions:

  1. By enabling these features, I will gain stability (awesome!)
  2. Is it worth exploring an RMA or will another i9-14900k have the same issue due to this series of chips being defective?
  3. When setting the ALL-Core Ratio Limit to 57 and 44 respectively, approximately how much overall performance is lost?
  4. Will the CPU be stable with p57/e44 ratios moving forward or will it degrade over time such that I would need to reduce the ratios even further?

Initial testing results so far with decompression of ~1TB of backups:

Test 1:
- P-core ratio 52
- E-core ratio 44
- Changed to Max Non-Turbo Performance
Result: no crash

Test 2:
- P-core ratio 57
- E-core ratio 44
- Left Max Turbo at default (Auto)
Result: no crash

Just to confirm what I am seeing is real, I reset the BIOS to Optimized Defaults (which fall back to the Intel Default settings); this is where things are unstable.

This causes a crash, usually within a few minutes, without the above options set:

btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt

And… BOOM! Crash in less than 2 minutes, almost guaranteed!

Now for the full test:

- P-core ratio 57
- E-core ratio 44
- Max Turbo left at default (Auto for now)

Re-test by running the balance of 871GB of data:

btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt

Success! I literally forgot what a stable system was supposed to feel like after so many crashes and problems. This is very early testing, but the BTRFS balance running for 27 minutes and succeeding with no CPU errors under these settings is very promising.

BTRFS balance completed and no kernel CPU issue (wow!)


Machine is mostly stable, but there are still some crashes; I will reach out to Intel next:

HTML::Template Debug ### /etc/munin/templates/partial/navigation.tmpl : line 21 : parsed VAR r_path

HTML::Template Debug ### /etc/munin/templates/partial/bottom_navigation.tmpl : line 31 : TMPL_IF link start

HTML::Template Debug ### /etc/munin/templates/munin-categoryview.tmpl : line 7 : parsed VAR name

*** stack smashing detected ***: terminated
Aborted


Update Friday, July 19th, 2024:

  1. Updated the Asus Pro WS W680-ACE IPMI to the latest BIOS: PRO WS W680-ACE BIOS 3701
  2. Replaced my Intel i9-14900k with an RMA from Intel 2024-07-19
  3. Booted machine.
  4. Set the BIOS to Intel recommended defaults; did not reduce the P-core/E-core ratios to 57/44

Now monitoring to see if the issue is resolved.


Hi jpiszcs, I believe I may have a similar or the same issue. Newly built Asus W680 IPMI setup with an i5-13600K and 2x48GB at 5200MT running XMP.

I am not able to boot to the Windows installer at all for some testing and updating the ME firmware before I put Proxmox on the machine. I get a nice bluescreen as the installer GUI launches.

I have modified the lower and upper power limits to match the 13600K’s 125W/181W, on BIOS 3701, with the turbo boost on boot setting, but this didn’t seem to have any impact.
Ran memtest86+ for 24 hours and no errors at all.

I am slowly pulling my hair out over this issue. I have never had to do so much troubleshooting as with this motherboard (the reset switch polarity with the IPMI board: FFS!).

Don’t know what to do or test next.

Hi gobanana,

I am not sure if the 13600K is affected the same as the 13900K/14900K series; however, for the prior CPU I had that was impacted, the thing that made everything stable in my case was:

steps:
0. reboot, set bios defaults, using intel default profile

  1. reduce p-core ratio to 57
  2. reduce e-core ratio to 44
  3. save & reset

I saved a picture of this when I still had my old CPU, before I updated the BIOS; I wanted to record what it looked like so I would not forget how to set it:

One other item: I see you are running the memory at 5200MT with XMP. If the above does not help, I would also try disabling XMP and setting the memory to 4200MT, then check again.

If both of those things do not help at all, I would have to defer to @wendell @diizzy who were both very helpful with the issue I had to see if they have other ideas on this one.

According to Gamers Nexus, the 13600k is affected as well:
https://www.youtube.com/watch?v=gTeubeCIwRw (13m 54s)


Thanks for the reply.
I too saw that GN video recently. This is scary now, but thankfully it is known for the future.

I appear to have resolved my issue.
Utterly bizarre in my opinion, but I forgot I had a Mellanox CX3 in there the whole time.
I took it out when I realised, and voila, I was able to boot up instantly and continue installing Server 2019 to test the board and install the new ME firmware.

I then searched around, and it seems to be a known problem between the E-cores of Alder Lake (and newer) CPUs and the Mellanox CX3 :man_facepalming:
Can’t win!

I tested under Linux just fine, and it has thankfully been rock solid for the past day or two now.
I am currently using Proxmox, as this machine is part of my three-server cluster with identical hardware. The Mellanox works excellently with Proxmox, thankfully, so I only needed to pull it out to test Windows on bare metal.
