Hi all,
I have been troubleshooting an issue for a couple of weeks and I am running out of ideas. Lately, applications have been segfaulting randomly [1,2] since a BIOS upgrade on my Asus Pro WS W680-ACE IPMI (and yes, I updated the Intel ME software to the latest version as well).
I am on the latest BIOS as of 7/5/2024:
ASUSTeK COMPUTER INC. System Product Name/Pro WS W680-ACE IPMI, BIOS 3603 05/27/2024.
Also attempted: I have tried the BIOS defaults and tweaked various BIOS settings, but to no avail; the segfaults (and occasional random kernel crashes) still occur. I also built my own kernels (6.9.7 and 6.10-rc6) and saw the same segfaults and random kernel crashes with both.
[1] Yesterday:
Jul 4 14:50:09 machine kernel: munin-html[170056]: segfault at 9 ip 000056101420a12a sp 00007ffc2c696190 error 4 in perl[561014121000+195000] likely on CPU 8 (core 16, socket 0)
[2] Shortly after boot:
Fri Jul 5 20:05:07 machine kernel: traps: iwatch[2082] general protection fault ip:5559574ddad9 sp:7ffe67f599a0 error:0 in perl[5559573b5000+195000]
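For anyone who wants to compare the two faults, here is a minimal Python sketch that pulls the instruction pointer and the mapping base out of the two log lines above and computes how far into the perl mapping each fault landed. The helper names (fault_offset, decode_page_fault_error) are my own, not part of any tool, and the decoder assumes the standard x86 page-fault error-code bits (bit 0 = protection violation, bit 1 = write, bit 2 = user mode); it only applies to the "segfault at" line, not to the general protection fault.

```python
import re

# The two kernel log lines quoted above, copied verbatim.
LOG_LINES = [
    "Jul 4 14:50:09 machine kernel: munin-html[170056]: segfault at 9 ip 000056101420a12a sp 00007ffc2c696190 error 4 in perl[561014121000+195000] likely on CPU 8 (core 16, socket 0)",
    "Fri Jul 5 20:05:07 machine kernel: traps: iwatch[2082] general protection fault ip:5559574ddad9 sp:7ffe67f599a0 error:0 in perl[5559573b5000+195000]",
]

# Matches both "ip <hex>" and "ip:<hex>", then the "in <object>[<base>+<size>]" part.
PATTERN = re.compile(
    r"ip[: ](?P<ip>[0-9a-f]+) .*? in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (object name, ip - mapping base) for a fault line, or None if it does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return m["obj"], int(m["ip"], 16) - int(m["base"], 16)


def decode_page_fault_error(code):
    """Describe an x86 page-fault error code (assumed bit layout: P, W, U in bits 0-2)."""
    mode = "user-mode" if code & 0x4 else "kernel-mode"
    access = "write" if code & 0x2 else "read"
    cause = "protection violation" if code & 0x1 else "not-present page"
    return f"{mode} {access}, {cause}"


if __name__ == "__main__":
    for line in LOG_LINES:
        obj, offset = fault_offset(line)
        print(f"{obj}: fault at offset {offset:#x} into the mapping")
    # Error code 4 from the munin-html segfault line.
    print("error 4:", decode_page_fault_error(4))
```

The two offsets come out different (0xe912a and 0x128ad9), so the two crashes are not at the same instruction in perl. That would be consistent with something flaky underneath rather than one specific perl bug, but it does not prove it.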
I’ve reached out to the LKML and debian-user@ but I have not gotten very far. I cannot post the links here, but if you Google the following subject lines you will find the threads:
6.9.7: kernel panic: RIP: 0010:btrfs_clone_write_end_io+0x1e/0x60 [btrfs] (dmesg included)
Re: 6.1.0: NVME drive goes offline randomly even with: nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
Thinking it might be related to btrfs, I switched from btrfs to ext4, but applications (different ones each time) continue to segfault. I am waiting to see whether the kernel panics again (likely).
I’ve run Intel’s processor diagnostic tool and memtest86/memtest86+ and everything comes back clean. Proof at bottom.
I have also re-seated the Linux NVMe drives (previously I was running btrfs in RAID-1; now I am using mdadm RAID-1 with ext4 on top of it).
CPU check and Memory Check:
The firmware on my 2 x Samsung 990 Pro 4TBs is the latest:
smartctl output:
Model Number: Samsung SSD 990 PRO with Heatsink 4TB
Firmware Version: 4B2QJXD7
Model Number: Samsung SSD 990 PRO with Heatsink 4TB
Firmware Version: 4B2QJXD7
Aside from reinstalling my entire system from scratch, which would take a while, does anyone have any thoughts or ideas about what may be causing these issues?
Thanks,
Justin