Return to Level1Techs.com

Repeatable system freezes under GPU load with Threadripper & Ubuntu 18.04


#1

Hi guys, I’m getting repeatable system freezes whenever I use PyTorch. After around 5-20 minutes of training, my system freezes (no mouse/keyboard, no tty, no remote SSH, no displayed error messages, reset button required). I’m using 4 workers in the data loaders, so the CPU sits at around 20% usage during training.

Here’s my system details: Ubuntu 18.04, Nvidia driver 410 (installed from ppa:graphics-drivers/ppa), CUDA 10.0, cuDNN 7.4.2. 3 GPUs (1080, 2x2080Ti). Thermals (from nvidia-smi) seem normal. MB: Asus Zenith Extreme, CPU: Threadripper 2950X, RAM: 128GB of Corsair Vengeance LPX (3200MHz sticks but running at 2666Mhz). PSU rated to 1500W. The images used for the training are stored on a Samsung SSD connected via SATA.

Here’s what I’ve tried: Update Linux kernel to 4.19 (didn’t help so went back to the Ubuntu default). Reseated all the GPUs & RAM. Tested each GPU individually. Fresh install of Ubuntu. Stock Python 3.6 & Anaconda 3.7. I tried 1 hour of 100% CPU usage just to make sure it’s not a CPU/RAM thing and also GPU burn seems fine.

I thinking it could be some sort of PCIe issue since it only occurs under GPU load (regardless of GPU and PCIe slot). Or a multithreading issue since I’m using 4 workers (I’m currently running with 1 worker and will see if that causes a freeze).

Any ideas to what I should try next and what logs would help diagnose the issue?

Update I still got a freeze with a single worker, so it’s unlikely to be a multithreading issue


#2

Update the BIOS to the latest version, disable any energy efficiency stuff in it. Also check the exact parts number of the memory and search for it in the motherboard’s compatibility list.

I have no clue about the software but of course searching google for other examples of that behavior might also help. (I am assuming you did that already but… :wink: )

Could it be mixing pascal and turing? Maybe only use one architecture at a time, just to test it?


#3

Check this Ryzen PSAs I have posted:
https://forum.level1techs.com/t/psa-testing-for-and-solving-random-freezes-on-ryzen/

I do not have access to a TR system, but in principle this could be similar.
Also you could try turning IOMMU on/off


#4

Thanks for the suggestions guys. Here some more stuff I tried after scouring these forums:

  1. Set the BIOS according to VFIO BIOS settings for Zenith Extreme
  2. Also forced the BIOS to use PCIe Gen 3
  3. Set pcie_aspm=off in GRUB

It ran stable under load for around 24hrs before crashing again.

Currently using the latest BIOS. My memory kit isn’t listed on the QVL sheet, but there is a near identical 2666MHz kit listed, which is the speed I’m running at (I have 2 kits of 4x16GB 3200Mhz 16-18-18-36).

Right now I’m running memtester and will see if I can reproduce the issue with 1 stick of RAM. I’ll try the Pascal & Turing cards separately again, but when I tried a couple of weeks back it didn’t help

@Methylzero I’ll also try your stress test and report back

So many late nights wasted trying to figure out what is going on with this thing!


#5

interesting thread… Awsome hardware Anjum my compliments…
same cpu i was looking at… but without the expensive gpu’s…
just starting to learn coding and bad at math so machine learning is a universe away… still exciting stuf…

i just wanted to ask did you look at your logging? or how your system is behaving just before the crash… (dmesg, journalctl,top)
reading the pytorch github page issue 1659… i would agree with a guy expecting errors or at least a sign things are going to go bad before they do… i have bin out of the game for a while as a Unix Admin but usually unix based systems are extremely good in keeping the OS floating. doing it’s best to put processes to sleep give penalties to process all in the effort of going on… so when your system freezes… it’s because the system did not have time to manage the load. or there is a piece of software / hardware that’s not playing nice…
the Kernel update is something easy to try and easy to reverse…
i do like the way your tackling the problem… the pcie_aspm setting was a good one… in line with the comment about other power management setting.
i don’t know but isn’t one of the reason mem sticks are sold in packs that they are validated to run as a set? also don’t think that’s you problem because your running 2 kits as a pair… so any small difference should not be a problem…

On github i’m reading a reinstallation of the NVIDIA driver solved one issue. problem is he describes the issue a bit different then you do.
meaning he could be describing a different problem with the same final result…


#6

just remembered you need to make your journald logging is persistent. That’s the first thing i do when asked to look in to a problem… see if i have good logs on the system to diagnose…
went digging in to my own notes… here is a copy paste…
btw here is a small howto for using journalctl…
was refreshing for my hahaha…it’s bin a while…

here is the site…:

but i prefer my notes…

! First make sure you have persistant journald logging !

  • mkdir -p -m2775 /var/log/journal

  • grep journal /etc/group

  • chown :systemd-journal /var/log/journal

  • ps -ef|grep journal

  • killall -USR1 systemd-journald. (you could also just use systemctl restart systemd-journald)

  • journalctl -b-1 -p err ; query systemd journal last boot;


#7

Thanks @TuxKey! Definitely stick at the coding & math stuff, it seems daunting at first but trust me it gets easier and is great fun (and there’s plenty of great resources out there),

Everything in htop seems fine just before the crash (No lag, moderate CPU, RAM at about 10% and no swap usage). If it crashes while I’m using it, the mouse/keyboard stops working and the screen freezes (i.e. the clock doesn’t change). If it crashes overnight, the monitors won’t wake up when I touch the mouse/keyboard. The HDD light has no activity which means that it is no longer reading the training data from the SSD; this leads me to believe it’s not just a desktop manager crash, but something at the driver/kernel level.

I can across that GitHub issue before, which led me to think it was a num_workers thing, but I don’t think it’s what’s happening here (I get the issue with num_workers=0), and I think they fixed that issue in PyTorch 1.0. I’ve also got pretty quick at reinstalling the drivers now that I’ve done it so many time in the last few weeks :slight_smile: One thing I haven’t tried yet is to see if I can replicate the issue using TensorFlow (i’ll give it a go as soon as they release a Python 3.7 build in a few days).

Thanks for the journalctl shout - that’s exactly the type of logging I was looking for (been using Linux for years, but my knowledge of all these type of tools is minimal). I’ll report back if I can get a useful log at the next crash.

Here’s my dmesg output:

[Wed Jan 30 15:57:23 2019] secureboot: Secure boot could not be determined (mode 0)
[Wed Jan 30 15:57:23 2019] #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31
[Wed Jan 30 15:57:23 2019] PPR NX GT IA GA PC GA_vAPIC
[Wed Jan 30 15:57:23 2019] PPR NX GT IA GA PC GA_vAPIC
[Wed Jan 30 15:57:24 2019] Couldn’t get size: 0x800000000000000e
[Wed Jan 30 15:57:24 2019] PKCS#7 signature not signed with a trusted key
[Wed Jan 30 15:57:24 2019] PKCS#7 signature not signed with a trusted key
[Wed Jan 30 15:57:24 2019] nvidia: loading out-of-tree module taints kernel.
[Wed Jan 30 15:57:24 2019] nvidia: module license ‘NVIDIA’ taints kernel.
[Wed Jan 30 15:57:24 2019] Disabling lock debugging due to kernel taint
[Wed Jan 30 15:57:24 2019] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 410.78 Sat Nov 10 22:09:04 CST 2018 (using threaded interrupts)
[Wed Jan 30 15:57:24 2019] PKCS#7 signature not signed with a trusted key
[Wed Jan 30 15:57:24 2019] PKCS#7 signature not signed with a trusted key
[Wed Jan 30 15:57:24 2019] PKCS#7 signature not signed with a trusted key
[Wed Jan 30 15:57:24 2019] PKCS#7 signature not signed with a trusted key
[Wed Jan 30 15:57:25 2019] ata1.00: supports DRM functions and may not be fully accessible
[Wed Jan 30 15:57:25 2019] ata1.00: NCQ Send/Recv Log not supported
[Wed Jan 30 15:57:25 2019] ata1.00: supports DRM functions and may not be fully accessible
[Wed Jan 30 15:57:25 2019] ata1.00: NCQ Send/Recv Log not supported
[Wed Jan 30 15:57:25 2019] ata2.00: supports DRM functions and may not be fully accessible
[Wed Jan 30 15:57:25 2019] ata2.00: supports DRM functions and may not be fully accessible
[Wed Jan 30 15:57:59 2019] PKCS#7 signature not signed with a trusted key
[Wed Jan 30 15:57:59 2019] acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
[Wed Jan 30 15:58:00 2019] ath10k_pci 0000:03:00.0: Direct firmware load for ath10k/pre-cal-pci-0000:03:00.0.bin failed with error -2
[Wed Jan 30 15:58:00 2019] ath10k_pci 0000:03:00.0: Direct firmware load for ath10k/cal-pci-0000:03:00.0.bin failed with error -2
[Wed Jan 30 15:58:00 2019] ath10k_pci 0000:03:00.0: Unknown eventid: 118809
[Wed Jan 30 15:58:00 2019] ath10k_pci 0000:03:00.0: Unknown eventid: 90118
[Wed Jan 30 15:58:01 2019] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[Wed Jan 30 15:58:01 2019] caller os_map_kernel_space.part.7+0xda/0x120 [nvidia] mapping multiple BARs
[Wed Jan 30 15:58:02 2019] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[Wed Jan 30 15:58:02 2019] caller os_map_kernel_space.part.7+0xda/0x120 [nvidia] mapping multiple BARs
[Wed Jan 30 15:58:03 2019] ath10k_pci 0000:03:00.0: Unknown eventid: 118809
[Wed Jan 30 15:58:03 2019] ath10k_pci 0000:03:00.0: Unknown eventid: 90118
[Wed Jan 30 16:04:12 2019] kauditd_printk_skb: 30 callbacks suppressed


#8

Ok I think I must have had persistent journald logging already set up :slight_smile: . It was definitely hadn’t crashed by 08.05 this morning as I have a model checkpoint then, but there’s nothing after that in the log.

journalctl -b-1 -p err ; query systemd journal last boot output:

Jan 29 16:11:42 blackbox kernel: Couldn’t get size: 0x800000000000000e
Jan 29 16:11:42 blackbox kernel: PKCS#7 signature not signed with a trusted key
Jan 29 16:11:42 blackbox kernel: PKCS#7 signature not signed with a trusted key
Jan 29 16:11:42 blackbox kernel: PKCS#7 signature not signed with a trusted key
Jan 29 16:11:42 blackbox kernel: PKCS#7 signature not signed with a trusted key
Jan 29 16:11:42 blackbox kernel: PKCS#7 signature not signed with a trusted key
Jan 29 16:11:42 blackbox kernel: PKCS#7 signature not signed with a trusted key
Jan 29 16:11:42 blackbox kernel: PKCS#7 signature not signed with a trusted key
Jan 29 16:11:44 blackbox bluetoothd[1176]: Failed to set mode: Blocked through rfkill (0x12)
Jan 29 16:11:44 blackbox avahi-daemon[1236]: chroot.c: open() failed: No such file or directory
Jan 29 16:11:46 blackbox wpa_supplicant[1220]: dbus: wpa_dbus_get_object_properties: failed to get object properties: (none) none
Jan 29 16:11:46 blackbox wpa_supplicant[1220]: dbus: Failed to construct signal
Jan 29 16:11:46 blackbox wpa_supplicant[1220]: dbus: wpa_dbus_get_object_properties: failed to get object properties: (none) none
Jan 29 16:11:46 blackbox wpa_supplicant[1220]: dbus: Failed to construct signal
Jan 29 16:11:50 blackbox pulseaudio[1627]: [pulseaudio] backend-ofono.c: Failed to register as a handsfree audio agent with ofono: org.freedesktop.DBus.Error.ServiceUnknown: The name org.ofono was not pro
Jan 29 16:11:51 blackbox spice-vdagent[1789]: Cannot access vdagent virtio channel /dev/virtio-ports/com.redhat.spice.0
Jan 29 16:11:51 blackbox bluetoothd[1176]: Failed to set mode: Blocked through rfkill (0x12)
Jan 29 19:27:35 blackbox kernel: wil6210 0000:04:00.0 wlp4s0: wil_halp_vote: HALP vote timed out
Jan 29 21:10:29 blackbox kernel: wil6210 0000:04:00.0 wlp4s0: wil_halp_vote: HALP vote timed out
Jan 29 21:22:02 blackbox kernel: wil6210 0000:04:00.0 wlp4s0: wil_halp_vote: HALP vote timed out
Jan 29 21:50:23 blackbox kernel: wil6210 0000:04:00.0 wlp4s0: wil_halp_vote: HALP vote timed out
Jan 29 22:05:05 blackbox kernel: wil6210 0000:04:00.0 wlp4s0: wil_halp_vote: HALP vote timed out
Jan 29 22:37:38 blackbox kernel: wil6210 0000:04:00.0 wlp4s0: wil_halp_vote: HALP vote timed out
Jan 30 01:18:17 blackbox kernel: wil6210 0000:04:00.0 wlp4s0: wil_halp_vote: HALP vote timed out
Jan 30 03:01:11 blackbox kernel: wil6210 0000:04:00.0 wlp4s0: wil_halp_vote: HALP vote timed out
Jan 30 06:46:56 blackbox kernel: wil6210 0000:04:00.0 wlp4s0: wil_halp_vote: HALP vote timed out

Also viewing the last 20 entries before today’s crash (from journalctl -b -1 -n 20:

Jan 30 08:12:52 blackbox kernel: wlp3s0: authenticate with a0:57:e3:31:9a:88
Jan 30 08:12:52 blackbox NetworkManager[1224]: [1548825172.2685] device (wlp3s0): supplicant interface state: scanning -> authenticating
Jan 30 08:12:52 blackbox kernel: wlp3s0: send auth to a0:57:e3:31:9a:88 (try 1/3)
Jan 30 08:12:52 blackbox kernel: wlp3s0: authenticated
Jan 30 08:12:52 blackbox wpa_supplicant[1220]: wlp3s0: Trying to associate with a0:57:e3:31:9a:88 (SSID=‘Macaroni Pie’ freq=2422 MHz)
Jan 30 08:12:52 blackbox kernel: wlp3s0: associate with a0:57:e3:31:9a:88 (try 1/3)
Jan 30 08:12:52 blackbox kernel: wlp3s0: RX AssocResp from a0:57:e3:31:9a:88 (capab=0x1411 status=0 aid=3)
Jan 30 08:12:52 blackbox NetworkManager[1224]: [1548825172.2826] device (wlp3s0): supplicant interface state: authenticating -> associating
Jan 30 08:12:52 blackbox wpa_supplicant[1220]: wlp3s0: Associated with a0:57:e3:31:9a:88
Jan 30 08:12:52 blackbox wpa_supplicant[1220]: wlp3s0: CTRL-EVENT-SUBNET-STATUS-UPDATE status=0
Jan 30 08:12:52 blackbox kernel: wlp3s0: associated
Jan 30 08:12:52 blackbox NetworkManager[1224]: [1548825172.2887] device (wlp3s0): supplicant interface state: associating -> 4-way handshake
Jan 30 08:12:52 blackbox wpa_supplicant[1220]: wlp3s0: WPA: Key negotiation completed with a0:57:e3:31:9a:88 [PTK=CCMP GTK=TKIP]
Jan 30 08:12:52 blackbox wpa_supplicant[1220]: wlp3s0: CTRL-EVENT-CONNECTED - Connection to a0:57:e3:31:9a:88 completed [id=0 id_str=]
Jan 30 08:12:52 blackbox NetworkManager[1224]: [1548825172.2961] device (wlp3s0): supplicant interface state: 4-way handshake -> completed
Jan 30 08:12:58 blackbox wpa_supplicant[1220]: wlp3s0: CTRL-EVENT-SIGNAL-CHANGE above=1 signal=-73 noise=-103 txrate=1000
Jan 30 08:13:22 blackbox wpa_supplicant[1220]: wlp3s0: CTRL-EVENT-SIGNAL-CHANGE above=1 signal=-64 noise=-103 txrate=1000
Jan 30 08:17:01 blackbox CRON[56962]: pam_unix(cron:session): session opened for user root by (uid=0)
Jan 30 08:17:01 blackbox CRON[56963]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jan 30 08:17:01 blackbox CRON[56962]: pam_unix(cron:session): session closed for user root


#9

here are a couple of journalctl command’s i had in my notes…
that can be helpful in drilling down to the final few moments before the crash.
(just a copy of my notes translated from dutch to eng.)

journalctl -r ; reverse show the newst enteries first.

journalctl -n 10 ; show the latest 10 lines.

journalctl -p err ; filter errors via priority…

here is a list of priorities…
“debug (7), info (6), notice (5), warning (4), err (3), crit (2), alert (1), and emerg (0).”
(man journalctl)

journalctl -f. ; same as tail -f showing the last 10 lines.
(this is handy for life viewing…)

journalctl --since or --until; DateTimeFormat YYYY.MM.DD hh:mm:ss
if you don’t use date it assumes today and if you don’t use time then it’s 00:00:00 midnight.
so the whole day …

journalctl --since “2014-10-22 23:30:00” ; example.

journalctl -o verbose ;shows more detail to zoom in on a error…

example of usage…

first i determine what i want to see. not knowing what it’s called…

journalctl -p 2 -o verbose

having found it’s something to do with NetworkManager and it’s referenced by _SYSTEMD_UNIT=
i can now drill in…

journalctl _SYSTEMD_UNIT=NetworkManager.service

i will read true the logs you posted see if i can find anything that jumps out…
seeing if i can find the few moments before it happens…
if there’s an event that could shed sum light…


#10

to be honest i’m not seeing any problems in the logs you posted…
if i’m missing something anyone please correct me… :wink:
perhaps you can drill down a bit to show us a few moments before t he problem starts… if it hangs the logs should stop logging so a few moments before that happens should be the time your interested in.
(just saying if you leave your system running and come back to it)…
btw it’s not all a waist of time… even if this leads to nothing… considering the work you do . brushing up on diagnostic of a problem is always handy :wink:

only real error i’m seeing in your log has to do with wireless but that’s not that big of an issue… (error -2)