perf: interrupt took too long -> causing system freeze

Back in December last year, I built myself a new “server” to run Plex, Home Assistant, Nextcloud and some other stuff.
The build is:

Ryzen 7 1700
ASRock X370 Taichi
8 GB Corsair RAM (from the motherboard's supported memory list)
Some WD drives
Samsung 960 256GB
A cheap Radeon graphics card
And a power supply, case and so on.

Running headless Debian.

The problem has been there from the start: every now and then the system just freezes. All the hardware is still powered on, but I can’t connect to the machine at all, and only a hard reset gets it back.

I’ve been through all of my kernel logs now, and I see one problem that persists:
[225733.784299] perf: interrupt took too long (3160 > 3133), lowering kernel.perf_event_max_sample_rate to 63250

Sometimes the perf message shows up a single time, sometimes three times.
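To see when and how often the message shows up, the lines can be pulled out of the kernel log with a small script. This is only a sketch: the regular expression is written against the single line quoted above, and `parse_perf_line` and its field names are made up for illustration. The bracketed number is the kernel log timestamp in seconds since boot, so 225733 s is roughly 2.6 days of uptime.

```python
import re

# Pattern written against the one log line quoted in this post (an assumption:
# other kernel versions may word the message slightly differently).
PERF_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+perf: interrupt took too long "
    r"\((?P<took>\d+) > (?P<limit>\d+)\), lowering "
    r"kernel\.perf_event_max_sample_rate to (?P<rate>\d+)"
)


def parse_perf_line(line):
    """Return the numbers from one perf warning line, or None if it does not match."""
    m = PERF_RE.search(line)
    if m is None:
        return None
    uptime_s = float(m["uptime"])
    return {
        "uptime_s": uptime_s,              # seconds since boot
        "uptime_days": uptime_s / 86400,   # same value, easier to read
        "took": int(m["took"]),            # measured interrupt duration
        "limit": int(m["limit"]),          # threshold that was exceeded
        "new_sample_rate": int(m["rate"]),
    }


def perf_events(lines):
    """Filter an iterable of kernel log lines down to the parsed perf warnings."""
    return [hit for hit in map(parse_perf_line, lines) if hit is not None]
```

Feeding it the saved kernel log (for example the output of `dmesg` written to a file) gives a list that can be compared against the times of the freezes.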

It’s driving me crazy now, and I can’t seem to get rid of the error, or find out why and when it is triggered.
I really need some tips and tricks to find out what’s causing this, and how to fix it.

I have no clue why this happens but maybe…

The second Post from chithanh

I’ve actually checked for segfault errors with the killryzen script and never found any, even though that was what I suspected in the first place.
It might still be worth contacting AMD.

Any chance you could ping your gateway every 30s (with a timeout) and store the results in a log? It could be that your network is wonky.

Edit: there’s also arping, which you could try from outside the machine. ARP has fairly few dependencies and can help you figure out if your hardware is totally frozen or if it’s something else.

Edit2: read about watchdogs. There are kernel modules you can configure and then use from a daemon, which allow the hardware to reboot the machine if the daemon stops resetting the watchdog timer (e.g. because something freezes).
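To make the watchdog idea concrete, here is a rough sketch of what such a daemon does, assuming the usual Linux watchdog character device (typically `/dev/watchdog`): opening it arms the timer, every write resets it, and writing the character `V` just before closing asks the driver to disarm, if the driver supports that. The `pet_watchdog` function and the `healthy` callback are hypothetical names for illustration; for real use an existing watchdog daemon is the safer choice.

```python
import time


def pet_watchdog(device, rounds, interval_s=10, healthy=lambda: True, disarm=True):
    """Reset the watchdog timer `rounds` times, once every `interval_s` seconds.

    `healthy` is a hypothetical check: when it returns False the timer is not
    reset, so the hardware reboots the machine once the timeout expires.
    """
    # Unbuffered binary mode so every write reaches the driver immediately.
    with open(device, "wb", buffering=0) as wd:
        for _ in range(rounds):
            if healthy():
                wd.write(b"\0")  # any write counts as a keepalive
            time.sleep(interval_s)
        if disarm:
            # "Magic close": tells the driver this is a deliberate shutdown.
            wd.write(b"V")
```

Be careful when trying this against the real device: if the process dies without the final `V`, or the driver is configured to ignore it, the machine will reboot when the timer runs out. The interval has to be shorter than the driver's timeout, which differs between drivers.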

I could set up a cron job that pings my server from my pfSense box, to see whether it is totally frozen or not.

The watchdog timer is an OK workaround, but I would rather find the actual cause of the freeze and fix that. I could still set it up though, since Nextcloud is easier to use when the server is working :wink:

Actually from server to pfSense would be more useful, that way when you reboot to regain connectivity you could look at your cron output from prior to reboot.
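A minimal sketch of such a logger, run on the server itself, could look like the following. It assumes the Linux `ping` from iputils (`-c` for the number of probes, `-W` for the reply timeout in seconds); `ping_once`, `log_pings` and the log path are made-up names. Each line is flushed and synced to disk right away, so the last entries should still be there after a hard reset.

```python
import os
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=2):
    """Send one ICMP echo via the system ping; True if a reply came back in time."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,  # backstop in case ping itself hangs
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def log_pings(host, log_path, count, interval_s=30, ping=ping_once):
    """Probe `host` `count` times and append one timestamped line per probe."""
    with open(log_path, "a") as log:
        for i in range(count):
            status = "ok" if ping(host) else "FAIL"
            stamp = datetime.now().isoformat(timespec="seconds")
            log.write(f"{stamp} {host} {status}\n")
            # Push the line to disk now; a frozen box never gets to flush later.
            log.flush()
            os.fsync(log.fileno())
            if i + 1 < count:
                time.sleep(interval_s)
```

Called as `log_pings("192.168.1.1", "/var/log/gateway-ping.log", count=2880)` (address and path are placeholders) it covers about a day at one probe every 30 seconds. Running it from cron instead works too, but cron only fires once a minute.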

If/when you see interesting things there, you could add more things to the log, like kernel stack traces.

I have the same problem here. I figured out that it happens more often with Bluetooth enabled. And checking the logs, I can verify that the same perf interrupt message occurs before each crash…