I have a machine running Ubuntu Server 17.10 that becomes unresponsive every few days. The machine runs several containers that host Postfix/Dovecot, nginx, MariaDB and Plex. By “unresponsive” I mean the machine and its containers disappear from the network. They don’t respond to any traffic and their DHCP leases expire without being renewed. However, the lights on the machine remain lit, the fans continue to spin and even the NIC activity lights blink. If I remove the network cable, the NIC lights go out, and plugging the cable back in brings the lights back on, but there is still no network activity from the machine.
The machine’s CPU is a Ryzen 1800X. I doubt there is anything wrong with the CPU: I swapped in a 1200 and the same problem persisted. I have run memtest with both the 1200 and the 1800X, and it doesn’t report any errors in either case.
Syslog and kern.log are useless as far as I can tell. I have included parts of each below.
An example from syslog for a hang that happened around 3:30 pm (I noticed the problem around 4:30 and physically reset the computer):
Mar 12 15:24:01 Ryzen-Server CRON[23633]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 12 15:25:01 Ryzen-Server CRON[23638]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 12 15:26:01 Ryzen-Server CRON[23643]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 12 15:27:01 Ryzen-Server CRON[23655]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 12 15:28:01 Ryzen-Server CRON[23660]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 12 16:35:29 Ryzen-Server rsyslogd: [origin software="rsyslogd" swVersion="8.16.0" x-pid="1464" x-info="http://www.rsyslog.com"] start
Mar 12 16:35:29 Ryzen-Server systemd-modules-load[552]: Inserted module 'nct6775'
Mar 12 16:35:29 Ryzen-Server systemd-modules-load[552]: Inserted module 'iscsi_tcp'
Mar 12 16:35:29 Ryzen-Server systemd-modules-load[552]: Inserted module 'ib_iser'
kern.log is basically the same; it logs nothing about an error:
Mar 11 23:59:06 Ryzen-Server kernel: [ 20.747302] FS-Cache: Loaded
Mar 11 23:59:06 Ryzen-Server kernel: [ 20.756870] Key type cifs.idmap registered
Mar 12 16:35:29 Ryzen-Server kernel: [ 0.000000] Linux version 4.15.9-041509-generic (kernel@tangerine) (gcc version 7.2.0 (Ubuntu 7.2.0-8ubuntu3.2)) #201803111231 SMP Sun Mar 11 16:34:36 UTC 2018
Mar 12 16:35:29 Ryzen-Server kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.9-041509-generic root=/dev/mapper/Ryzen--Server--vg-root ro
Mar 12 16:35:29 Ryzen-Server kernel: [ 0.000000] KERNEL supported cpus:
Mar 12 16:35:29 Ryzen-Server kernel: [ 0.000000] Intel GenuineIntel
Mar 12 16:35:29 Ryzen-Server kernel: [ 0.000000] AMD AuthenticAMD
Mar 12 16:35:29 Ryzen-Server kernel: [ 0.000000] Centaur CentaurHauls
From the logs, activity just seems to stop.
So far, the only thing that I can think to do is physically reset the machine when it “hangs.” So I attached a Raspberry Pi through an electrical relay to the motherboard’s front panel reset pins.
If the Raspberry Pi can’t ping the machine for several minutes, the Pi closes the relay and the machine resets. But that doesn’t actually diagnose what is causing the hang.
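The watchdog logic on the Pi can be sketched roughly as follows. This is a minimal illustration, not the exact script: the `Watchdog` class, the `pulse_relay` callback (which would wrap whatever GPIO call closes the relay), the 30-second interval and the six-failure threshold are all assumptions chosen to match "several minutes" of failed pings.

```python
import subprocess
import time


def host_is_up(host, timeout_s=2):
    """Return True if one ICMP echo to `host` succeeds (Linux iputils ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class Watchdog:
    """Counts consecutive ping failures and fires a reset once a threshold is hit."""

    def __init__(self, max_failures, pulse_relay):
        self.max_failures = max_failures
        self.pulse_relay = pulse_relay  # hypothetical callback that closes the relay briefly
        self.failures = 0

    def record(self, up):
        """Feed in one ping result; return True if this call triggered a reset."""
        if up:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.pulse_relay()
            # Start counting again so the machine gets a full window to boot
            # before another reset can be triggered.
            self.failures = 0
            return True
        return False


def run(host, pulse_relay, interval_s=30, max_failures=6):
    """Poll `host` forever; six misses at 30 s each is about three minutes."""
    watchdog = Watchdog(max_failures, pulse_relay)
    while True:
        watchdog.record(host_is_up(host))
        time.sleep(interval_s)
```

Calling `run("192.168.1.10", pulse_relay)` with the server's address (the address here is a placeholder) would start the loop. Requiring consecutive failures rather than a single missed ping keeps one dropped packet from resetting a healthy machine.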
The only other interesting hardware in the machine is a Hauppauge WinTV-quad card. I have had that card in other machines and never had a problem.
Any suggestions would be appreciated.