Repeatable system freezes under GPU load with Threadripper & Ubuntu 18.04

Anjum_Sayed · January 29, 2019, 6:27am

Hi guys, I’m getting repeatable system freezes whenever I use PyTorch. After around 5-20 minutes of training, my system freezes (no mouse/keyboard, no tty, no remote SSH, no displayed error messages, reset button required). I’m using 4 workers in the data loaders, so the CPU sits at around 20% usage during training.

Here’s my system details: Ubuntu 18.04, Nvidia driver 410 (installed from ppa:graphics-drivers/ppa), CUDA 10.0, cuDNN 7.4.2. 3 GPUs (1080, 2x2080Ti). Thermals (from nvidia-smi) seem normal. MB: Asus Zenith Extreme, CPU: Threadripper 2950X, RAM: 128GB of Corsair Vengeance LPX (3200MHz sticks but running at 2666Mhz). PSU rated to 1500W. The images used for the training are stored on a Samsung SSD connected via SATA.

Here’s what I’ve tried: Update Linux kernel to 4.19 (didn’t help so went back to the Ubuntu default). Reseated all the GPUs & RAM. Tested each GPU individually. Fresh install of Ubuntu. Stock Python 3.6 & Anaconda 3.7. I tried 1 hour of 100% CPU usage just to make sure it’s not a CPU/RAM thing and also GPU burn seems fine.

I thinking it could be some sort of PCIe issue since it only occurs under GPU load (regardless of GPU and PCIe slot). Or a multithreading issue since I’m using 4 workers (I’m currently running with 1 worker and will see if that causes a freeze).

Any ideas to what I should try next and what logs would help diagnose the issue?

Update I still got a freeze with a single worker, so it’s unlikely to be a multithreading issue

noenken · January 29, 2019, 1:52pm

Update the BIOS to the latest version, disable any energy efficiency stuff in it. Also check the exact parts number of the memory and search for it in the motherboard’s compatibility list.

I have no clue about the software but of course searching google for other examples of that behavior might also help. (I am assuming you did that already but… )

Could it be mixing pascal and turing? Maybe only use one architecture at a time, just to test it?

Methylzero · January 29, 2019, 4:09pm

Check this Ryzen PSAs I have posted:
https://forum.level1techs.com/t/psa-testing-for-and-solving-random-freezes-on-ryzen/

I do not have access to a TR system, but in principle this could be similar.
Also you could try turning IOMMU on/off

Anjum_Sayed · January 30, 2019, 1:56pm

Thanks for the suggestions guys. Here some more stuff I tried after scouring these forums:

Set the BIOS according to VFIO BIOS settings for Zenith Extreme
Also forced the BIOS to use PCIe Gen 3
Set pcie_aspm=off in GRUB

It ran stable under load for around 24hrs before crashing again.

Currently using the latest BIOS. My memory kit isn’t listed on the QVL sheet, but there is a near identical 2666MHz kit listed, which is the speed I’m running at (I have 2 kits of 4x16GB 3200Mhz 16-18-18-36).

Right now I’m running memtester and will see if I can reproduce the issue with 1 stick of RAM. I’ll try the Pascal & Turing cards separately again, but when I tried a couple of weeks back it didn’t help

@Methylzero I’ll also try your stress test and report back

So many late nights wasted trying to figure out what is going on with this thing!

TuxKey · January 30, 2019, 3:12pm

interesting thread… Awsome hardware Anjum my compliments…
same cpu i was looking at… but without the expensive gpu’s…
just starting to learn coding and bad at math so machine learning is a universe away… still exciting stuf…

i just wanted to ask did you look at your logging? or how your system is behaving just before the crash… (dmesg, journalctl,top)
reading the pytorch github page issue 1659… i would agree with a guy expecting errors or at least a sign things are going to go bad before they do… i have bin out of the game for a while as a Unix Admin but usually unix based systems are extremely good in keeping the OS floating. doing it’s best to put processes to sleep give penalties to process all in the effort of going on… so when your system freezes… it’s because the system did not have time to manage the load. or there is a piece of software / hardware that’s not playing nice…
the Kernel update is something easy to try and easy to reverse…
i do like the way your tackling the problem… the pcie_aspm setting was a good one… in line with the comment about other power management setting.
i don’t know but isn’t one of the reason mem sticks are sold in packs that they are validated to run as a set? also don’t think that’s you problem because your running 2 kits as a pair… so any small difference should not be a problem…

On github i’m reading a reinstallation of the NVIDIA driver solved one issue. problem is he describes the issue a bit different then you do.
meaning he could be describing a different problem with the same final result…

TuxKey · January 30, 2019, 3:26pm

just remembered you need to make your journald logging is persistent. That’s the first thing i do when asked to look in to a problem… see if i have good logs on the system to diagnose…
went digging in to my own notes… here is a copy paste…
btw here is a small howto for using journalctl…
was refreshing for my hahaha…it’s bin a while…

here is the site…:

but i prefer my notes…

! First make sure you have persistant journald logging !

mkdir -p -m2775 /var/log/journal
grep journal /etc/group
chown :systemd-journal /var/log/journal
ps -ef|grep journal
killall -USR1 systemd-journald. (you could also just use systemctl restart systemd-journald)
journalctl -b-1 -p err ; query systemd journal last boot;

Anjum_Sayed · January 30, 2019, 3:55pm

Thanks @TuxKey! Definitely stick at the coding & math stuff, it seems daunting at first but trust me it gets easier and is great fun (and there’s plenty of great resources out there),

Everything in htop seems fine just before the crash (No lag, moderate CPU, RAM at about 10% and no swap usage). If it crashes while I’m using it, the mouse/keyboard stops working and the screen freezes (i.e. the clock doesn’t change). If it crashes overnight, the monitors won’t wake up when I touch the mouse/keyboard. The HDD light has no activity which means that it is no longer reading the training data from the SSD; this leads me to believe it’s not just a desktop manager crash, but something at the driver/kernel level.

I can across that GitHub issue before, which led me to think it was a num_workers thing, but I don’t think it’s what’s happening here (I get the issue with num_workers=0), and I think they fixed that issue in PyTorch 1.0. I’ve also got pretty quick at reinstalling the drivers now that I’ve done it so many time in the last few weeks One thing I haven’t tried yet is to see if I can replicate the issue using TensorFlow (i’ll give it a go as soon as they release a Python 3.7 build in a few days).

Thanks for the journalctl shout - that’s exactly the type of logging I was looking for (been using Linux for years, but my knowledge of all these type of tools is minimal). I’ll report back if I can get a useful log at the next crash.

Here’s my dmesg output:

[Wed Jan 30 15:57:23 2019] secureboot: Secure boot could not be determined (mode 0)
[Wed Jan 30 15:57:23 2019] #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31
[Wed Jan 30 15:57:23 2019] PPR NX GT IA GA PC GA_vAPIC
[Wed Jan 30 15:57:23 2019] PPR NX GT IA GA PC GA_vAPIC
[Wed Jan 30 15:57:24 2019] Couldn’t get size: 0x800000000000000e
[Wed Jan 30 15:57:24 2019] PKCS#7 signature not signed with a trusted key
[Wed Jan 30 15:57:24 2019] PKCS#7 signature not signed with a trusted key
[Wed Jan 30 15:57:24 2019] nvidia: loading out-of-tree module taints kernel.
[Wed Jan 30 15:57:24 2019] nvidia: module license ‘NVIDIA’ taints kernel.
[Wed Jan 30 15:57:24 2019] Disabling lock debugging due to kernel taint
[Wed Jan 30 15:57:24 2019] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 410.78 Sat Nov 10 22:09:04 CST 2018 (using threaded interrupts)
[Wed Jan 30 15:57:24 2019] PKCS#7 signature not signed with a trusted key
[Wed Jan 30 15:57:24 2019] PKCS#7 signature not signed with a trusted key
[Wed Jan 30 15:57:24 2019] PKCS#7 signature not signed with a trusted key
[Wed Jan 30 15:57:24 2019] PKCS#7 signature not signed with a trusted key
[Wed Jan 30 15:57:25 2019] ata1.00: supports DRM functions and may not be fully accessible
[Wed Jan 30 15:57:25 2019] ata1.00: NCQ Send/Recv Log not supported
[Wed Jan 30 15:57:25 2019] ata1.00: supports DRM functions and may not be fully accessible
[Wed Jan 30 15:57:25 2019] ata1.00: NCQ Send/Recv Log not supported
[Wed Jan 30 15:57:25 2019] ata2.00: supports DRM functions and may not be fully accessible
[Wed Jan 30 15:57:25 2019] ata2.00: supports DRM functions and may not be fully accessible
[Wed Jan 30 15:57:59 2019] PKCS#7 signature not signed with a trusted key
[Wed Jan 30 15:57:59 2019] acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
[Wed Jan 30 15:58:00 2019] ath10k_pci 0000:03:00.0: Direct firmware load for ath10k/pre-cal-pci-0000:03:00.0.bin failed with error -2
[Wed Jan 30 15:58:00 2019] ath10k_pci 0000:03:00.0: Direct firmware load for ath10k/cal-pci-0000:03:00.0.bin failed with error -2
[Wed Jan 30 15:58:00 2019] ath10k_pci 0000:03:00.0: Unknown eventid: 118809
[Wed Jan 30 15:58:00 2019] ath10k_pci 0000:03:00.0: Unknown eventid: 90118
[Wed Jan 30 15:58:01 2019] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[Wed Jan 30 15:58:01 2019] caller os_map_kernel_space.part.7+0xda/0x120 [nvidia] mapping multiple BARs
[Wed Jan 30 15:58:02 2019] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[Wed Jan 30 15:58:02 2019] caller os_map_kernel_space.part.7+0xda/0x120 [nvidia] mapping multiple BARs
[Wed Jan 30 15:58:03 2019] ath10k_pci 0000:03:00.0: Unknown eventid: 118809
[Wed Jan 30 15:58:03 2019] ath10k_pci 0000:03:00.0: Unknown eventid: 90118
[Wed Jan 30 16:04:12 2019] kauditd_printk_skb: 30 callbacks suppressed

Anjum_Sayed · January 30, 2019, 4:11pm

Ok I think I must have had persistent journald logging already set up . It was definitely hadn’t crashed by 08.05 this morning as I have a model checkpoint then, but there’s nothing after that in the log.

journalctl -b-1 -p err ; query systemd journal last boot output:

Jan 29 16:11:42 blackbox kernel: Couldn’t get size: 0x800000000000000e
Jan 29 16:11:42 blackbox kernel: PKCS#7 signature not signed with a trusted key
Jan 29 16:11:42 blackbox kernel: PKCS#7 signature not signed with a trusted key
Jan 29 16:11:42 blackbox kernel: PKCS#7 signature not signed with a trusted key
Jan 29 16:11:42 blackbox kernel: PKCS#7 signature not signed with a trusted key
Jan 29 16:11:42 blackbox kernel: PKCS#7 signature not signed with a trusted key
Jan 29 16:11:42 blackbox kernel: PKCS#7 signature not signed with a trusted key
Jan 29 16:11:42 blackbox kernel: PKCS#7 signature not signed with a trusted key
Jan 29 16:11:44 blackbox bluetoothd[1176]: Failed to set mode: Blocked through rfkill (0x12)
Jan 29 16:11:44 blackbox avahi-daemon[1236]: chroot.c: open() failed: No such file or directory
Jan 29 16:11:46 blackbox wpa_supplicant[1220]: dbus: wpa_dbus_get_object_properties: failed to get object properties: (none) none
Jan 29 16:11:46 blackbox wpa_supplicant[1220]: dbus: Failed to construct signal
Jan 29 16:11:46 blackbox wpa_supplicant[1220]: dbus: wpa_dbus_get_object_properties: failed to get object properties: (none) none
Jan 29 16:11:46 blackbox wpa_supplicant[1220]: dbus: Failed to construct signal
Jan 29 16:11:50 blackbox pulseaudio[1627]: [pulseaudio] backend-ofono.c: Failed to register as a handsfree audio agent with ofono: org.freedesktop.DBus.Error.ServiceUnknown: The name org.ofono was not pro
Jan 29 16:11:51 blackbox spice-vdagent[1789]: Cannot access vdagent virtio channel /dev/virtio-ports/com.redhat.spice.0
Jan 29 16:11:51 blackbox bluetoothd[1176]: Failed to set mode: Blocked through rfkill (0x12)
Jan 29 19:27:35 blackbox kernel: wil6210 0000:04:00.0 wlp4s0: wil_halp_vote: HALP vote timed out
Jan 29 21:10:29 blackbox kernel: wil6210 0000:04:00.0 wlp4s0: wil_halp_vote: HALP vote timed out
Jan 29 21:22:02 blackbox kernel: wil6210 0000:04:00.0 wlp4s0: wil_halp_vote: HALP vote timed out
Jan 29 21:50:23 blackbox kernel: wil6210 0000:04:00.0 wlp4s0: wil_halp_vote: HALP vote timed out
Jan 29 22:05:05 blackbox kernel: wil6210 0000:04:00.0 wlp4s0: wil_halp_vote: HALP vote timed out
Jan 29 22:37:38 blackbox kernel: wil6210 0000:04:00.0 wlp4s0: wil_halp_vote: HALP vote timed out
Jan 30 01:18:17 blackbox kernel: wil6210 0000:04:00.0 wlp4s0: wil_halp_vote: HALP vote timed out
Jan 30 03:01:11 blackbox kernel: wil6210 0000:04:00.0 wlp4s0: wil_halp_vote: HALP vote timed out
Jan 30 06:46:56 blackbox kernel: wil6210 0000:04:00.0 wlp4s0: wil_halp_vote: HALP vote timed out

Also viewing the last 20 entries before today’s crash (from journalctl -b -1 -n 20:

Jan 30 08:12:52 blackbox kernel: wlp3s0: authenticate with a0:57:e3:31:9a:88
Jan 30 08:12:52 blackbox NetworkManager[1224]: [1548825172.2685] device (wlp3s0): supplicant interface state: scanning -> authenticating
Jan 30 08:12:52 blackbox kernel: wlp3s0: send auth to a0:57:e3:31:9a:88 (try 1/3)
Jan 30 08:12:52 blackbox kernel: wlp3s0: authenticated
Jan 30 08:12:52 blackbox wpa_supplicant[1220]: wlp3s0: Trying to associate with a0:57:e3:31:9a:88 (SSID=‘Macaroni Pie’ freq=2422 MHz)
Jan 30 08:12:52 blackbox kernel: wlp3s0: associate with a0:57:e3:31:9a:88 (try 1/3)
Jan 30 08:12:52 blackbox kernel: wlp3s0: RX AssocResp from a0:57:e3:31:9a:88 (capab=0x1411 status=0 aid=3)
Jan 30 08:12:52 blackbox NetworkManager[1224]: [1548825172.2826] device (wlp3s0): supplicant interface state: authenticating -> associating
Jan 30 08:12:52 blackbox wpa_supplicant[1220]: wlp3s0: Associated with a0:57:e3:31:9a:88
Jan 30 08:12:52 blackbox wpa_supplicant[1220]: wlp3s0: CTRL-EVENT-SUBNET-STATUS-UPDATE status=0
Jan 30 08:12:52 blackbox kernel: wlp3s0: associated
Jan 30 08:12:52 blackbox NetworkManager[1224]: [1548825172.2887] device (wlp3s0): supplicant interface state: associating -> 4-way handshake
Jan 30 08:12:52 blackbox wpa_supplicant[1220]: wlp3s0: WPA: Key negotiation completed with a0:57:e3:31:9a:88 [PTK=CCMP GTK=TKIP]
Jan 30 08:12:52 blackbox wpa_supplicant[1220]: wlp3s0: CTRL-EVENT-CONNECTED - Connection to a0:57:e3:31:9a:88 completed [id=0 id_str=]
Jan 30 08:12:52 blackbox NetworkManager[1224]: [1548825172.2961] device (wlp3s0): supplicant interface state: 4-way handshake -> completed
Jan 30 08:12:58 blackbox wpa_supplicant[1220]: wlp3s0: CTRL-EVENT-SIGNAL-CHANGE above=1 signal=-73 noise=-103 txrate=1000
Jan 30 08:13:22 blackbox wpa_supplicant[1220]: wlp3s0: CTRL-EVENT-SIGNAL-CHANGE above=1 signal=-64 noise=-103 txrate=1000
Jan 30 08:17:01 blackbox CRON[56962]: pam_unix(cron:session): session opened for user root by (uid=0)
Jan 30 08:17:01 blackbox CRON[56963]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jan 30 08:17:01 blackbox CRON[56962]: pam_unix(cron:session): session closed for user root

TuxKey · January 30, 2019, 4:38pm

here are a couple of journalctl command’s i had in my notes…
that can be helpful in drilling down to the final few moments before the crash.
(just a copy of my notes translated from dutch to eng.)

journalctl -r ; reverse show the newst enteries first.

journalctl -n 10 ; show the latest 10 lines.

journalctl -p err ; filter errors via priority…

here is a list of priorities…
“debug (7), info (6), notice (5), warning (4), err (3), crit (2), alert (1), and emerg (0).”
(man journalctl)

journalctl -f. ; same as tail -f showing the last 10 lines.
(this is handy for life viewing…)

journalctl --since or --until; DateTimeFormat YYYY.MM.DD hh:mm:ss
if you don’t use date it assumes today and if you don’t use time then it’s 00:00:00 midnight.
so the whole day …

journalctl --since “2014-10-22 23:30:00” ; example.

journalctl -o verbose ;shows more detail to zoom in on a error…

example of usage…

first i determine what i want to see. not knowing what it’s called…

journalctl -p 2 -o verbose

having found it’s something to do with NetworkManager and it’s referenced by _SYSTEMD_UNIT=
i can now drill in…

journalctl _SYSTEMD_UNIT=NetworkManager.service

i will read true the logs you posted see if i can find anything that jumps out…
seeing if i can find the few moments before it happens…
if there’s an event that could shed sum light…

TuxKey · January 30, 2019, 4:49pm

to be honest i’m not seeing any problems in the logs you posted…
if i’m missing something anyone please correct me…
perhaps you can drill down a bit to show us a few moments before t he problem starts… if it hangs the logs should stop logging so a few moments before that happens should be the time your interested in.
(just saying if you leave your system running and come back to it)…
btw it’s not all a waist of time… even if this leads to nothing… considering the work you do . brushing up on diagnostic of a problem is always handy

only real error i’m seeing in your log has to do with wireless but that’s not that big of an issue… (error -2)

Anjum_Sayed · May 11, 2019, 10:59am

Sorry to reignite an old thread but after months pulling my hair out over this, I finally found the solution, so I thought I’d post it here (mostly for my own reference).

I found the problems were reduced somewhat once I down clocked my RAM (to 1833Mhz) but I was still getting freezes when both GPUs were at load. That led me to suspect it was a Ryzen/Motherboard thing not a GPU issue.

I changed the BIOS settings below and now I have a fully stable system under GPU load:
Advanced > AMD CBS > Zen Common Options > Global C-state Control set to “disabled”
Extreme Tweaker > Core Performance Boost set to Off

My suspicion is that the automatic Core Performance Boost was causing the instability.

FurryJackman · May 12, 2019, 4:50am

So it’s happening on Ryzen too? I had stability issues with my 4960x and FFmpeg under Linux so it’s good to know that Precision Boost can actually cause instability with workloads like that.

Unless this isn’t precision boost and it’s actually the Mobo doing an all core overclock.

hobold · May 12, 2019, 12:00pm

“Multicore Enhancement” is the usual name for unofficial all core overclocking done by motherboard makers. This is technically out of spec and will exceed the rated TDP. I prefer to turn this off.

“Turbo Core” is AMD’s old name for increased clock speeds when few cores are active. This is within spec and not considered overclocking.

“Extended Frequency Range” (XFR) is AMD’s name for increased clock speeds when the currently running workload happens to require lower than “average” power and dissipates lower than expected heat. In that case, the extra headroom can be leveraged for slightly increased clock speed regardless of how many cores are active. This is considered clever use of the spec’ed limits and is meant to be safe, official overclocking.

“Core Performance Boost” is a usual BIOS name for the option controlling Turbo Core and XFR. It should be safe to turn this on.

The only potential pitfall I can imagine is with the computer’s power supply unit. If all CPU cores and multiple GPUs are running at full load, the PSU might be overloaded even if its rated wattage seems sufficient on paper. It is normal that very brief peaks in power draw can occur during normal operation of CPUs and GPUs. If those are short enough, the PSU can absorb those. But the more CPU cores and the more GPUs you have, the larger the statistical chance that a few of these brief peaks coincide, eventually overloading the PSU.

(“Precision Boost” is a newer AMD name for technologies like Turbo Core, XFR and XFR2. This seems to be more of a trademark name, that isn’t used much in BIOSes or technical documents.)

noenken · May 12, 2019, 1:27pm

C-States, again. That crap is coming up sooner or later in most cases of “Ryzen freezing for apparently no reason”. It’s either memory or that. Every time. I disable C-State management on everything and haven’t had instability since.

I would not be surprised if the system stayed perfectly stable now even with boost enabled.

FurryJackman · May 12, 2019, 1:30pm

You gotta do that as a kernel argument too because sometimes the kernel bypasses the UEFI in precedence for C-state control.

FaunCB · May 12, 2019, 2:07pm

the next time you post like this OP please space stuff out better.

Thanks.

For your issue my main suggestion is a 19.04 upgrade as theres constantly more and more code that builds up for TR.

Anjum_Sayed · May 14, 2019, 3:42pm

Thanks! TBH I didn’t try both those settings independently (mostly out of desperation so I could get on with my work as soon as possible). I’ll try turning Core Performance Boost back on again and see what happens.

Regarding PSU wattage, I have a 1500W supply so hopefully that isn’t the issue.

I also switched to Ubuntu 19.04 a while back but it made no difference

UPDATE: After turning core performance boost back on, the system has been stable under load for 48 hours. So it looks like the culprit was the C-states

FurryJackman · May 16, 2019, 10:13am

Yup, Ryzen + Linux + C-states is not fun. The power management for Zen in general has been broken for quite some time when using the Linux kernel.