Back in December last year, I built myself a new “server” to run Plex, Home Assistant, Nextcloud and some other stuff.
The build is:
Ryzen 7 1700
ASRock X370 Taichi
8 GB Corsair RAM (from the motherboard’s supported memory list)
Some WD drives
Samsung 960 256GB
A cheap Radeon graphics card
And a power supply, case and so on.
Running headless Debian.
The problem has been there from the start: every now and then the system just freezes. All the hardware is still powered on, but I can’t connect to the machine at all, and only a hard reset brings it back.
I’ve been through all of my kernel logs now, and I see one message that keeps showing up:
[225733.784299] perf: interrupt took too long (3160 > 3133), lowering kernel.perf_event_max_sample_rate to 63250
Sometimes the perf message appears once, sometimes three times.
It’s driving me crazy now. I can’t seem to get rid of it, or find out why and when it is triggered.
I really need some tips and tricks to find out what’s causing the freezes, and how to fix them.
Any chance you could ping your gateway every 30s (with a timeout) and store the results in a log? It could be that your network is wonky.
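Something like this rough Python sketch would do for the ping log. The gateway address, the log file name and the helper names are all placeholders I made up, so adjust them to your setup. It assumes the Linux iputils `ping` that Debian ships, where `-c` is the packet count and `-W` is the reply timeout in seconds.

```python
import os
import subprocess
import time
from datetime import datetime

GATEWAY = "192.168.1.1"        # assumption: replace with your gateway's address
LOG_PATH = "gateway-ping.log"  # hypothetical log file name
INTERVAL_S = 30                # seconds between pings
TIMEOUT_S = 5                  # seconds to wait for a reply


def ping_once(host, timeout_s):
    """Return True if one ping to host is answered within timeout_s seconds."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 2,  # safety net in case ping itself hangs
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def format_line(when, host, ok):
    """Build one log line: ISO timestamp, host, and ok/FAIL."""
    status = "ok" if ok else "FAIL"
    return f"{when.isoformat(timespec='seconds')} {host} {status}"


def main():
    while True:
        line = format_line(datetime.now(), GATEWAY, ping_once(GATEWAY, TIMEOUT_S))
        with open(LOG_PATH, "a") as log:
            log.write(line + "\n")
            log.flush()
            os.fsync(log.fileno())  # push to disk so a hard freeze loses less
        time.sleep(INTERVAL_S)


if __name__ == "__main__":
    main()
```

After the next freeze, the last timestamp in the log tells you roughly when the machine stopped, and a run of FAIL lines before it would point at the network rather than a full lockup.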
Edit: there’s also arping, which you could try from outside the machine. ARP has very few dependencies, so it can help you figure out whether the hardware is totally frozen or it’s something else.
Edit2: read about watchdogs. There are kernel modules you can configure and then use from a daemon, which let the hardware reboot the machine if the daemon stops resetting the watchdog timer (e.g. because something froze).
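To give an idea of what such a daemon does, here is a minimal Python sketch, not a replacement for a real watchdog daemon such as the one in Debian’s `watchdog` package. It assumes a watchdog driver is loaded and exposes `/dev/watchdog`; the function name, the interval and the `beats` parameter are made up for illustration. Opening the device arms the timer, every write resets it, and writing the character `V` just before closing asks the driver to disarm (which only works if the driver supports it and was not loaded with `nowayout`). Be careful when trying it: once armed, the machine reboots if the writes stop.

```python
import time


def pet_watchdog(device_path="/dev/watchdog", interval_s=10, beats=None):
    """Reset the watchdog timer every interval_s seconds.

    beats=None runs until interrupted; a number stops after that many writes.
    Returns the number of resets written.
    """
    count = 0
    # Unbuffered, so every write reaches the driver immediately.
    with open(device_path, "wb", buffering=0) as wd:
        try:
            while beats is None or count < beats:
                wd.write(b"\0")  # any write resets the timer
                count += 1
                time.sleep(interval_s)
        except KeyboardInterrupt:
            pass  # fall through to a clean shutdown
        # "Magic close": ask the driver to disarm before the file is closed.
        wd.write(b"V")
    return count


if __name__ == "__main__":
    pet_watchdog()
```

The interval has to be comfortably shorter than the timeout the driver is configured with, otherwise the watchdog fires even on a healthy system.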