Sporadic Ubuntu Server 17.10 "crashes"

cburn11 · March 14, 2018, 7:54pm

I have a machine running ubuntu server 17.10 that every few days becomes unresponsive. The machine runs several containers that host postfix/dovecot, nginx, mariadb and plex. By “unresponsive” I mean the machine and its containers disappear from the network. They don’t respond to communication and their dhcp leases expire without being renewed. However, the lights on the machine remain lit, the fans continue to spin and even the nic activity lights blink. If I remove the network cable, the nic lights go out and plugging the cable back in brings the lights back on, but there is still no network activity from the machine.

The machine’s cpu is a ryzen 1800x. I doubt there is anything wrong with the cpu. I put in a 1200 and the same problem persisted. I have run memtest with both the 1200 and 1800x. Memtest doesn’t report any errors.

Syslog and kern.log are useless as far as I can tell. I have included parts of each below.

An example of Syslog of a crash that happened around 3:30 pm (I realized the problem around 4:30 and physically reset the computer):

Mar 12 15:24:01 Ryzen-Server CRON[23633]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 12 15:25:01 Ryzen-Server CRON[23638]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 12 15:26:01 Ryzen-Server CRON[23643]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 12 15:27:01 Ryzen-Server CRON[23655]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 12 15:28:01 Ryzen-Server CRON[23660]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 12 16:35:29 Ryzen-Server rsyslogd: [origin software="rsyslogd" swVersion="8.16.0" x-pid="1464" x-info="http://www.rsyslog.com"] start
Mar 12 16:35:29 Ryzen-Server systemd-modules-load[552]: Inserted module 'nct6775'
Mar 12 16:35:29 Ryzen-Server systemd-modules-load[552]: Inserted module 'iscsi_tcp'
Mar 12 16:35:29 Ryzen-Server systemd-modules-load[552]: Inserted module 'ib_iser'

kern.log is basically the same, it logs nothing about an error:

Mar 11 23:59:06 Ryzen-Server kernel: [   20.747302] FS-Cache: Loaded
Mar 11 23:59:06 Ryzen-Server kernel: [   20.756870] Key type cifs.idmap registered
Mar 12 16:35:29 Ryzen-Server kernel: [    0.000000] Linux version 4.15.9-041509-generic (kernel@tangerine) (gcc version 7.2.0 (Ubuntu 7.2.0-8ubuntu3.2)) #201803111231 SMP Sun Mar 11 16:34:36 UTC 2018
Mar 12 16:35:29 Ryzen-Server kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.9-041509-generic root=/dev/mapper/Ryzen--Server--vg-root ro
Mar 12 16:35:29 Ryzen-Server kernel: [    0.000000] KERNEL supported cpus:
Mar 12 16:35:29 Ryzen-Server kernel: [    0.000000]   Intel GenuineIntel
Mar 12 16:35:29 Ryzen-Server kernel: [    0.000000]   AMD AuthenticAMD
Mar 12 16:35:29 Ryzen-Server kernel: [    0.000000]   Centaur CentaurHauls

From the logs, activity just seems to stop.
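Since cron writes an entry every minute on this machine, a hang shows up as a hole in the timestamps. Here is a minimal sketch for finding those holes, assuming the traditional syslog timestamp format seen in the excerpts above (it carries no year, so one has to be supplied). The helper names and the two-minute threshold are my own illustrative choices:

```python
from datetime import datetime, timedelta

def parse_stamp(line, year=2018):
    """Parse the leading 'Mar 12 15:24:01' timestamp of a syslog line."""
    # Traditional syslog timestamps omit the year, so it is passed in.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def find_gaps(lines, threshold=timedelta(minutes=2), year=2018):
    """Return (last_seen, next_seen) pairs where logging paused longer than threshold."""
    gaps = []
    previous = None
    for line in lines:
        try:
            current = parse_stamp(line, year)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps
```

Feeding it the lines of /var/log/syslog (for example `find_gaps(open("/var/log/syslog", errors="replace"))`) would list each pause. The first timestamp of a pair is the last thing logged before the hang, which is the closest estimate of when it happened.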

So far, the only thing that I can think to do is physically reset the machine when it “hangs.” So I attached a raspberry pi through an electrical relay to the motherboard’s front panel reset pins.

If the raspberry pi can’t ping the machine for several minutes, the pi closes the relay and the machine resets. But that doesn’t actually diagnose what is causing the hang.
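The watchdog logic is roughly the following. This is a sketch rather than the exact script running on the pi: `pulse_relay` is a placeholder for whatever function toggles the GPIO pin wired to the relay, and the failure count, ping interval and boot grace period are assumed values.

```python
import subprocess
import time

def host_is_up(host):
    """Send one ping with a 2 second timeout; True if the host replied."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def watchdog(host, pulse_relay, is_up=host_is_up, max_failures=5,
             interval=60, grace=300, sleep=time.sleep, cycles=None):
    """Pulse the reset relay after max_failures consecutive missed pings.

    pulse_relay is a caller-supplied function (hypothetical here) that
    closes the relay briefly, e.g. by toggling a GPIO pin on the pi.
    cycles limits the number of loop iterations; None means run forever.
    Returns the number of resets triggered.
    """
    failures = 0
    resets = 0
    done = 0
    while cycles is None or done < cycles:
        done += 1
        if is_up(host):
            failures = 0  # any reply clears the failure streak
        else:
            failures += 1
        if failures >= max_failures:
            pulse_relay()
            resets += 1
            failures = 0
            sleep(grace)  # give the machine time to boot before pinging again
        else:
            sleep(interval)
    return resets
```

Requiring several consecutive failures keeps a single dropped ping from resetting a healthy machine. The grace period keeps the pi from hitting reset again while the server is still booting.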

The only other interesting hardware in the machine is a hauppauge wintv-quad card. I have had that card in other machines and never had a problem.

Any suggestions would be appreciated.

reikoshea · March 14, 2018, 9:07pm

How old is the PSU you are using? That sounds like a lockup caused by a 0A draw on the 12V rail. Most decent PSUs made after Haswell (2013) should support 0A. If it’s older, you might consider upgrading.

Otherwise, you might try flashing your BIOS. Disabling the C6 processor state in the BIOS might also help you out. There have been some long discussions about this on the kernel mailing lists, but no resolution in the OS as of yet.
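Before changing BIOS settings, it can be useful to see which idle states the kernel exposes and how often they are entered. A rough sketch, assuming the standard cpuidle sysfs layout under /sys/devices/system/cpu/cpuN/cpuidle. State names depend on the idle driver, so the deep state may not literally be called "C6", and `list_idle_states` is just an illustrative helper name:

```python
from pathlib import Path

def list_idle_states(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return [(state_dir, name, usage_count)] for one CPU's idle states."""
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    # Sort numerically so state10 does not land before state2.
    state_dirs = sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    states = []
    for state in state_dirs:
        name = (state / "name").read_text().strip()
        usage = int((state / "usage").read_text().strip())  # times entered
        states.append((state.name, name, usage))
    return states
```

Comparing the usage counts before and after a BIOS change would show whether the deepest state is really no longer being entered.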

cburn11 · March 14, 2018, 9:23pm

I had not considered the power supply. It may be 3 years old, but I have some others that I can swap.

The machine has hung in the middle of the night when there would have been little activity, but it has also hung when it was under heavy use while plex was actively streaming and recording. Does that tend to lessen the likelihood the problem is related to a low power processor mode?

MisteryAngel · March 15, 2018, 1:14am

I wouldn’t be surprised if it’s just Ubuntu 17.10 not playing very nice with Ryzen.
The 17.10 release of Ubuntu seems to have a lot of quirky issues,
especially with new hardware like Ryzen.
Although I noticed that you are using a later 4.15 kernel.
But the 18.04 LTS releases will be out towards the 28th of April.
I suppose that those releases will play nicer with Ryzen.

I personally have never really felt that the 17.10 release of Ubuntu was very stable at all.
They have also refreshed their 17.10 ISOs a couple of times already.
17.10.04 is the newest, if I’m correct.

Maybe you could try a different distribution like Debian, Fedora or openSUSE,
just to see if you can replicate the issue.
It might be a lot of work, though.

cburn11 · March 15, 2018, 3:18am

I have tried different kernels hoping to resolve the issue, but it persists. The problem also existed when I was using the kernel that came with 17.10.

I’m not married to 17.10 and I knew that support was ending in July. But I had not considered switching earlier than the 18.04 release.

But if it’s possible to migrate an lxc container from ubuntu to debian (I have no idea if it is off the top of my head; but from lxc’s point of view, how much difference could there be?) that would be worth a shot.

SudoSaibot · March 15, 2018, 3:49am

As I understand, Ryzen benefits from Kernel 4.10+.
Debian 9 is running Kernel 4.9 by default.

If you’re feeling daring, make a backup, then change your sources.list to the bionic repos and dist-upgrade. 18.04 uses Kernel 4.14 I believe.

Dynamic_Gravity · March 15, 2018, 5:51am

You can try the 18.04 release of Ubuntu MATE. They always ship earlier than vanilla Ubuntu, so if you really need to get version 18 I would give this a shot.

blackfire · March 17, 2018, 11:32pm

I like your pi idea. Anything in the apparmor logs?

cburn11 · March 18, 2018, 5:50pm

I was not even aware that apparmor had its own log, so I have not checked it.

If you are referring to apparmor entries in syslog, there are no entries logged near the crash times. The machine crashed last night and there seemingly was nothing relevant in syslog:

Mar 17 19:33:01 Ryzen-Server CRON[15295]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 17 19:34:01 Ryzen-Server CRON[15325]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 17 19:35:01 Ryzen-Server CRON[15343]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 17 19:36:01 Ryzen-Server CRON[15382]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 17 19:37:01 Ryzen-Server CRON[15412]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 17 19:38:01 Ryzen-Server CRON[15430]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 17 19:39:01 Ryzen-Server CRON[15449]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 17 19:40:01 Ryzen-Server CRON[15503]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 17 19:40:15 Ryzen-Server dhclient[4728]: PRC: Renewing lease on br2.
Mar 17 19:40:15 Ryzen-Server dhclient[4728]: XMT: Renew on br2, interval 10100ms.
Mar 17 19:40:15 Ryzen-Server dhclient[4728]: RCV: Reply message on br2 from fe80::6a1c:a2ff:fe12:66b5.
Mar 17 19:41:01 Ryzen-Server CRON[15537]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 17 19:42:01 Ryzen-Server CRON[15555]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 17 19:43:01 Ryzen-Server CRON[15574]: (root) CMD (/usr/persistent-log/persistent-log.sh)
Mar 17 19:45:02 Ryzen-Server rsyslogd: [origin software="rsyslogd" swVersion="8.16.0" x-pid="1546" x-info="http://www.rsyslog.com"] start

That’s 10 minutes of log prior to the crash, which happened around 7:43 pm. Logging starts back up as the machine is rebooting a couple of minutes later at 7:45.

cburn11 · March 18, 2018, 5:56pm

Last night’s crash was the first crash after I had swapped power supplies. That tends to rule out the issue being power supply related. Although it is possible that I have two bad power supplies.

But it turns out that it is trivial to move an lxc container from an ubuntu host to a debian host. I just tar’d the containers with the --numeric-owner flag, copied them to the new host, and they all spun right up without issue.
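For anyone who would rather script the move, here is a rough Python equivalent of that tar step. It is a sketch under assumptions, not what I actually ran: the function names are made up, the paths are whatever your lxc setup uses, and unpacking has to run as root for ownership to be restored.

```python
import tarfile

def _numeric_only(info):
    # Blank the user/group names so only numeric uid/gid are stored,
    # roughly what tar's --numeric-owner does when creating an archive.
    info.uname = ""
    info.gname = ""
    return info

def pack_container(src_dir, archive_path):
    """Archive src_dir, recording ownership as numeric uid/gid only."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=".", filter=_numeric_only)

def unpack_container(archive_path, dest_dir):
    """Extract using the stored numeric ids instead of looking up names.

    Ownership is only restored when this runs as root.
    """
    kwargs = {"numeric_owner": True}
    # Newer Python versions support extraction filters, and the safer
    # filters drop ownership. "fully_trusted" keeps it, so use this only
    # on archives you created yourself.
    if hasattr(tarfile, "fully_trusted_filter"):
        kwargs["filter"] = "fully_trusted"
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir, **kwargs)
```

The numeric ids matter because the files inside a container belong to the container's users, and the uid-to-name mapping on the new host may differ from the old one.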

So my next step I think is to slap in a new drive with whatever the current stable debian release happens to be and see what happens.

Thanks for all the suggestions.