Laptop (Nvidia GPU) powering off randomly while gaming

biolinguist · September 9, 2024, 7:36pm

So, I have this weird problem where my laptop randomly shuts down while gaming. This issue seems to occur mostly with older games (e.g. Painkiller, Half Life) or some indie games. The laptop isn’t running particularly hot, and there don’t seem to be any power supply issues either. Also, most games (Diablo 3, 4, Wolfenstein, Metro, Stalcraft, Tomb Raider, RE6 etc.) don’t seem to have this issue.

I am wondering if this could be a Wine/Proton issue? I am running Ubuntu 24.04 with Nvidia driver 535.183.01. The laptop’s specs are i7-9700KF and RTX 2080. Here are some of the logs… I can provide more if necessary. Thanks guys!

00:54:06 kernel: index 1 is out of range for type 'uvm_gpu_chunk_t *[*]'
00:54:06 kernel: UBSAN: array-index-out-of-bounds in build/nvidia/535.183.01/build/nvidia-uvm/uvm_pmm_gpu.c:2044:63
00:54:06 kernel: index 0 is out of range for type 'uvm_gpu_chunk_t *[*]'
00:54:06 kernel: UBSAN: array-index-out-of-bounds in build/nvidia/535.183.01/build/nvidia-uvm/uvm_pmm_gpu.c:746:68
00:54:06 kernel: index 0 is out of range for type 'uvm_gpu_chunk_t *[*]'
00:54:06 kernel: UBSAN: array-index-out-of-bounds in build/nvidia/535.183.01/build/nvidia-uvm/uvm_pmm_gpu.c:2038:44
00:54:06 kernel: index 0 is out of range for type 'uvm_gpu_chunk_t *[*]'
00:54:06 kernel: UBSAN: array-index-out-of-bounds in build/nvidia/535.183.01/build/nvidia-uvm/uvm_pmm_gpu.c:2364:28
00:49:27 kernel: [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to grab modeset ownership
00:49:25 systemd: Failed to start app-gnome-user\x2ddirs\x2dupdate\x2dgtk-2803.scope - Application launched by gnome-session-binary.
00:49:25 kernel: [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to grab modeset ownership
00:49:24 systemd: Failed to start app-gnome-gnome\x2dkeyring\x2dssh-2555.scope - Application launched by gnome-session-binary.
00:49:20 (uidsynth): fluidsynth.service: Failed to drop capabilities: Operation not permitted
00:49:19 gdm-session-wor: gkr-pam: unable to locate daemon control file
00:47:02 kernel: [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to grab modeset ownership
00:46:58 systemd: Failed to start fluidsynth.service - FluidSynth Daemon.
00:46:58 (uidsynth): fluidsynth.service: Failed to drop capabilities: Operation not permitted
00:46:56 canonical-livep: Task "refresh" returned an error: livepatch check failed: POST request to "https://livepatch.canonical.com/v1/client/8692696bb0db4082abb7f4f8dec99d8f/updates" failed, retrying in 30s.
00:46:54 (uetoothd): src/device.c:set_wake_allowed_complete() Set device flags return status: Invalid Parameters
00:46:53 kernel: 
00:46:53 kernel: ucsi_ccg 0-0008: ucsi_ccg_init failed - -6
00:46:53 kernel: xpad 1-1:1.0: xpad_try_sending_next_out_packet - usb_submit_urb failed with result -19
00:46:53 kernel: x86/cpu: SGX disabled by BIOS.

TryTwiceMedia · September 9, 2024, 10:42pm

Are you binding the GPU to the Wine/Proton instance?

biolinguist · September 10, 2024, 9:58am

Can you explain? I am not sure I follow…

DavieDavieDavie · September 10, 2024, 10:33am

Random shutoffs on laptops, at least in my experience, are caused by the GPU overheating. How are you checking temperatures or monitoring voltages?

biolinguist · September 10, 2024, 3:29pm

Freon, coupled with mangohud in-game. This is a 200 watt RTX 2080 inside a Clevo chassis. It’s typically the CPU that runs on the hotter side at around 90C. The GPU usually hovers between 80C and 86C. I am not quite sure I’ve been monitoring voltages closely though… I will keep an eye on that! Thanks!

SgtAwesomesauce · September 10, 2024, 3:49pm

All of those kernel level errors seem to be normal failures. I can’t speak for fluidsynth, but that looks like some setcap error.

Is this a hard-stop situation? If so, this might be tough to diagnose because the most relevant logs tend to get wiped when the system goes down.

Do you have logs that are closer to the shutdown? Maybe kernel errors thrown during game launch? If you can collect some proton logs, that would be helpful as well.

It’s a 2080, so it shouldn’t be old enough to give me hardware-failure suspicions.

I know Nvidia likes to run its GPUs around 87C max. If you’re running those older games at 80-86C, I’d check the voltages. That seems like a lot of heat for OG Half Life.

biolinguist · September 13, 2024, 12:00pm

Hmmm… I have quite a few more taxing games installed as well. Frankly, I don’t want to say the GPU runs 80+ with the older games. But I am gonna keep an eye on the temps across games for a while.

To answer your question, yes these are usually hard stops. The laptop simply shuts down.

thunderysteak · September 15, 2024, 5:10pm

Use nvidia-smi to check whether the process itself is running on the GPU, and to check the temperature of the GPU itself.
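For the temperature part, something like the rough Python sketch below could poll nvidia-smi once a second while the game runs. It assumes nvidia-smi is on the PATH and that the driver reports all three queried fields; on some laptops power.draw comes back as "[N/A]", which the parser turns into None. The function names are only illustrative. For the process part, running plain nvidia-smi prints a table of the processes currently using the GPU.

```python
import subprocess
import time

# Fields passed to nvidia-smi's --query-gpu option.
QUERY = "temperature.gpu,power.draw,utilization.gpu"


def parse_gpu_line(line):
    """Turn one CSV line from nvidia-smi into (temp_c, power_w, util_pct).

    Fields the driver cannot report (e.g. "[N/A]") become None.
    """
    values = []
    for field in line.split(","):
        try:
            values.append(float(field.strip()))
        except ValueError:
            values.append(None)
    return tuple(values)


def read_gpu():
    """Run nvidia-smi once and return the parsed readings for the first GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_line(out.splitlines()[0])


def poll(interval_s=1.0):
    """Print a timestamped reading every interval_s seconds until interrupted."""
    while True:
        print(time.strftime("%H:%M:%S"), *read_gpu(), flush=True)
        time.sleep(interval_s)
```

Calling poll() in a terminal (ideally over SSH from another machine) before launching the game leaves the last readings on screen if the laptop powers off.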

A workaround for losing the logs is to have an additional system SSH’d into the affected system and collect logs until the system crashes. I’ve been diagnosing SteamVR and Nvidia GPU driver issues like this; when the system crashes and terminates the SSH session, you will see the last logs displayed right before the crash.
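A minimal sketch of that SSH approach, run on the second machine, might look like the Python below. It assumes key-based SSH access to the laptop and that the remote user is allowed to read the kernel journal; the host string and output path in the usage note are placeholders, and the function names are made up for illustration. It follows the kernel log with journalctl -k -f and appends every line to a local file, so whatever arrived before the crash stays on the machine that is still up.

```python
import subprocess


def copy_lines(stream, sink):
    """Copy lines from stream to sink, flushing after each one.

    Returns the number of lines copied once the stream ends
    (for SSH, that is when the remote side goes away).
    """
    count = 0
    for line in stream:
        sink.write(line)
        sink.flush()
        count += 1
    return count


def follow_remote_kernel_log(host, out_path):
    """Follow the remote kernel journal over SSH and append it to out_path."""
    cmd = ["ssh", host, "journalctl", "-k", "-f", "-o", "short-precise"]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc, \
            open(out_path, "a") as sink:
        return copy_lines(proc.stdout, sink)
```

For example, follow_remote_kernel_log("user@laptop", "laptop-kernel.log") would block until the SSH session drops, and the tail of laptop-kernel.log then holds the last kernel messages that made it over the wire.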