I have been having random issues with my AMD GPU (RX 6800 non-XT). At first, I brushed them off as software bugs, because the only major issues I had were random reboots when using ROCm (basically CUDA but for AMD) on Linux to run AI, and reboots when using ROCm with Blender on Windows, specifically with a low render output resolution for some reason. Yeah, super confused on that one. I discovered it when I was bored and set the output resolution to something like 180x144 to see how fast it would be; spamming F12 to re-render would consistently cause a reboot. All of these involved ROCm, so it seemed reasonable to assume it was just software bugs; after all, I had heard ROCm is a bit (a lot) buggy. I got an AMD GPU expecting this to some degree, hoping that cheap VRAM would make it worth it in the end. At the time, my GPU was perfectly stable apart from these weird reboots when doing specific things. And maybe some minor graphical weirdness? But I just thought those were bugs in the games, because they were so minor: not textures going crazy or lines everywhere, just the kind of thing that happened even with my old card. Games played mostly perfectly; in fact, I’ve played through multiple games at this point without noticing the slightest thing.
All of these crashes had something in common, though: after the reboot, I get an MCE (Machine Check Exception) in the logs. Even so, I still assumed it was software. It’s also kind of weird, since as far as I understand MCEs are reported by the CPU and usually point to CPU or system RAM problems, not VRAM. I never had these before switching from a GTX 1080, and I ran memtest86+ for hours and got no errors whatsoever.
So before this, my only issues were with ROCm or Vulkan compute (on Linux they go through the same kernel driver anyway). However, as of late, I have been having random freezes (no reboot), and weird glitchy lines appear on my monitor. I assumed an update must have broken something, but I was getting suspicious at this point, so I switched to Windows to be 1000000% sure. After switching to Windows, the weird lines were still there, but seemingly only when I played games. So far I haven’t had any freezes on Windows, but those were quite rare and random anyway. On Linux, the lines were always there, whatever I was doing.
I am pretty convinced I have faulty GPU memory, and that it has degraded over time to the point where it’s always noticeable; after all, the things I talked about before are VRAM intensive. It seems extremely subtle, though, which is why it has taken me so long to suspect it’s hardware related. My explanation for the MCE errors is that maybe they happened while transferring data from CPU to GPU, meaning it was hardware after all.
Anyway, I’m mostly just asking if there is even the slightest chance it could be something other than hardware failure before I completely throw in the towel. I’d also appreciate general advice, because I have never had to make a warranty claim on a GPU or really anything. For reference, I bought my GPU new off Amazon from XFX in 2023, but I don’t have the original box anymore. And maybe someone else has had a similar experience.
Here are the MCE errors:
Feb 27 15:09:35 nixos kernel: mce: [Hardware Error]: Machine check events logged
Feb 27 15:09:35 nixos kernel: mce: [Hardware Error]: CPU 9: Machine Check: 0 Bank 5: bea0000000000108
Feb 27 15:09:35 nixos kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffc0754f4a MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Feb 27 15:09:35 nixos kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1709068166 SOCKET 0 APIC 3 microcode 8701030
Feb 27 15:09:36 nixos kernel: MCE: In-kernel MCE decoding enabled.
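For anyone who wants to read the status value on the "Bank 5" line above, here is a rough Python sketch that picks out the flag bits. It assumes the standard x86 machine-check status layout (bits 63 to 57 are the VAL, OVER, UC, EN, MISCV, ADDRV and PCC flags, and the low 16 bits are the error code); the function name `decode_mca_status` is just something I made up for illustration. It only shows which flags are set, not which component is at fault.

```python
# Flag bits of the x86 machine-check status register.
# Assumption: standard architectural layout for bits 63..57.
MCA_STATUS_FLAGS = {
    63: "VAL",    # the register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # the error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # the MISC register holds valid data
    58: "ADDRV",  # the ADDR register holds valid data
    57: "PCC",    # processor context may be corrupt
}


def decode_mca_status(status_hex):
    """Return the set flag names and the 16-bit error code (hypothetical helper)."""
    status = int(status_hex, 16)
    flags = [
        name
        for bit, name in sorted(MCA_STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {"flags": flags, "error_code": status & 0xFFFF}


if __name__ == "__main__":
    # Status value copied from the "Bank 5" log line.
    decoded = decode_mca_status("bea0000000000108")
    print(decoded["flags"], hex(decoded["error_code"]))
```

For the value in my log, UC and PCC both come out set, meaning an uncorrected error with possibly corrupt processor context, which at least fits with the machine rebooting rather than recovering.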
Errors from the Linux driver itself before the reboot:
Feb 27 22:08:15 nixos systemd[1]: Starting Cleanup of Temporary Directories...
Feb 27 22:08:15 nixos systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
Feb 27 22:08:15 nixos systemd[1]: Finished Cleanup of Temporary Directories.
Feb 27 22:08:24 nixos kernel: amdgpu 0000:09:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000028 SMN_C2PMSG_82:0x00000000
Feb 27 22:08:24 nixos kernel: amdgpu 0000:09:00.0: amdgpu: Failed to enable gfxoff!
Feb 27 22:08:29 nixos kernel: amdgpu 0000:09:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000028 SMN_C2PMSG_82:0x00000000
Feb 27 22:08:29 nixos kernel: amdgpu 0000:09:00.0: amdgpu: Failed to enable gfxoff!
Feb 27 22:08:30 nixos kernel: amdgpu 0000:09:00.0: [drm] *ERROR* [CRTC:95:crtc-1] flip_done timed out
Feb 27 22:08:31 nixos kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=3400, emitted seq=3401
Feb 27 22:08:31 nixos kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
Feb 27 22:08:31 nixos kernel: amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
Feb 27 22:08:31 nixos kernel: amdgpu: Failed to suspend process 0x800c
This kind of looks like a race condition in the driver, but again, I’ve been having other issues now too. I was going to attach a video of the weird lines, but it seems I can’t upload anything yet since I just made an account :/