Random reboots after BIOS update/new graphics card

I have a problem with random reboots after I got my AMD RX 6900 XT.
I had a system running with an RX 590 and never experienced an issue, but when I upgraded to the RX 6900 XT I also upgraded my BIOS, and I think that is when my troubles started.
The motherboard is a Gigabyte Aorus X570 Master with a Ryzen 9 3950X. It was on BIOS F31l and is now running the latest BIOS version, F33.
Sometimes it can run for 8 hours with no problem and then reboot in the middle of the night while idle; at other times it reboots while opening Firefox right after boot.
The only thing I have to go on is the following log entries after the reboot:
`> … node #0, CPUs: #1 #2

mce: [Hardware Error]: Machine check events logged
[Hardware Error]: System Fatal error.
[Hardware Error]: CPU:2 (17:71:0) MC5_STATUS[-|UE|MiscV|AddrV|PCC|TCC|SyndV|-|-|-]: 0xbea0000000000108
[Hardware Error]: Error Addr: 0x0001ffffb24a6fbc
[Hardware Error]: IPID: 0x000500b000000000, Syndrome: 0x000000004d000000
[Hardware Error]: Execution Unit Ext. Error Code: 0, Watchdog Timeout error.
[Hardware Error]: cache level: RESV, tx: GEN, mem-tx: GEN
#3 #4 #5 #6 #7
mce: [Hardware Error]: Machine check events logged
[Hardware Error]: System Fatal error.
[Hardware Error]: CPU:7 (17:71:0) MC5_STATUS[-|UE|MiscV|AddrV|PCC|TCC|SyndV|-|-|-]: 0xbea0000000000108
[Hardware Error]: Error Addr: 0x0001ffffc0390422
[Hardware Error]: IPID: 0x000500b000000000, Syndrome: 0x000000004d000000
[Hardware Error]: Execution Unit Ext. Error Code: 0, Watchdog Timeout error.
[Hardware Error]: cache level: RESV, tx: GEN, mem-tx: GEN
#8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24
[Hardware Error]: System Fatal error.
[Hardware Error]: CPU:24 (17:71:0) MC5_STATUS[-|UE|MiscV|AddrV|PCC|TCC|SyndV|-|-|-]: 0xbea0000000000108
[Hardware Error]: Error Addr: 0x0001ffffb2f827b4
[Hardware Error]: IPID: 0x000500b000000000, Syndrome: 0x000000004d000000
[Hardware Error]: Execution Unit Ext. Error Code: 0, Watchdog Timeout error.
[Hardware Error]: cache level: RESV, tx: GEN, mem-tx: GEN
#25 #26 #27 #28 #29 #30 #31`
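
For anyone who wants to read the status word by hand, here is a minimal sketch that decodes the `MC5_STATUS` and `IPID` values from the log above. It assumes the bit positions used by the Linux kernel's `MCI_STATUS_*` masks for AMD Scalable MCA (Val=63, Over=62, UC=61, En=60, MiscV=59, AddrV=58, PCC=57, TCC=55, SyndV=53, extended error code in bits 21:16). The function names `decode_mca_status` and `decode_ipid` are made up for illustration and are not part of any tool.

```python
# Assumed bit positions, matching the kernel's MCI_STATUS_* masks
# for AMD Scalable MCA banks. Names are illustrative only.
STATUS_FLAGS = {
    "Val": 63,    # register contains a valid error
    "Over": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error (printed as "UE" by the kernel)
    "En": 60,     # error reporting was enabled
    "MiscV": 59,  # MISC register is valid
    "AddrV": 58,  # ADDR register is valid
    "PCC": 57,    # processor context corrupt
    "TCC": 55,    # task context corrupt
    "SyndV": 53,  # syndrome register is valid
}


def decode_mca_status(status: int) -> dict:
    """Split a 64-bit MCA status value into flags and error codes."""
    flags = [name for name, bit in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "ext_error_code": (status >> 16) & 0x3F,  # bits 21:16
        "error_code": status & 0xFFFF,            # bits 15:0
    }


def decode_ipid(ipid: int) -> dict:
    """Extract the bank's hardware ID and MCA type from the IPID value."""
    return {
        "mca_type": (ipid >> 48) & 0xFFFF,    # bits 63:48
        "hardware_id": (ipid >> 32) & 0xFFF,  # bits 43:32
    }


# Values copied from the log entries above.
print(decode_mca_status(0xBEA0000000000108))
print(decode_ipid(0x000500B000000000))
```

With the values from the log, this reports UC and PCC set and an extended error code of 0, which lines up with what the kernel already printed. It does not reveal anything new, but it makes it easier to compare several dumps.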

The system is running Gentoo with a “home made” kernel 5.10.13, but the issue has been present since kernel 5.10.1.

Can anyone give me some pointers on where to look before I go completely crazy?


suffle

First of all, you have made sure to reset any kind of overclocking that might have been in place, right? Historically, instability often results from some kind of power delivery issue. Are you sure your PSU is sized right and in proper condition?

Do you want to join this thread maybe?

All is reset to default in BIOS, except for the RAM, which is set to XMP.
The PSU is a brand new be Quiet! Dark Power Pro 11 1000W, so it shouldn’t be a problem.


suffle

I’ve seen that thread, and I’m on that BIOS already. I have tried most, if not all, of the suggestions there with no luck, but the base error looks the same.


suffle

Could be a bad reset switch on your case.

Just a small update.
I have switched the graphics card back to my old RX 590 and have not had a single reboot. The RX 6900 XT has been running for 3 days on another platform (z290 chipset, Windows) without any reboots.


suffle

I just got an RX 6700 XT to try with, and the system is still rebooting, but now I get a kernel dump. If anyone knows anything about this I would like to hear it. This is what is being spammed in my logfile:

   ------------[ cut here ]------------
WARNING: CPU: 30 PID: 5913 at drivers/gpu/drm/ttm/ttm_bo.c:517 ttm_bo_release+0x293/0x2f0 [ttm]
Modules linked in: md4 sha512_generic cifs fuse xt_MASQUERADE xt_addrtype iptable_nat bnep btusb amdgpu btrtl btbcm btintel k10temp bluetooth ecdh_generic iwlmvm ecc drm_ttm_helper it87 ttm hwmon_vid backlight gpu_sched iwlwifi r8169 efivarfs
CPU: 30 PID: 5913 Comm: Xwayland Tainted: G        W         5.11.9-gentoo #13
Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS MASTER/X570 AORUS MASTER, BIOS F33f 03/12/2021
RIP: 0010:ttm_bo_release+0x293/0x2f0 [ttm]
Code: e8 f2 1c 4d d2 e9 f2 fd ff ff 48 8b 7d 88 b9 30 75 00 00 31 d2 be 01 00 00 00 e8 f8 42 4d d2 48 8b 45 d8 eb 9d 4c 89 e0 eb 98 <0f> 0b c7 85 9c 00 00 00 00 00 00 00 4c 89 ef e8 d9 f1 ff ff 48 8d
RSP: 0000:ffff9c6dc2dbbba8 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff995dd3638500 RCX: 0000000000000007
RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffffffffc017fa28
RBP: ffff995dd3638570 R08: ffff995dabe6b1b8 R09: ffff9c6dc2dbba48
R10: 0000000000000001 R11: ffff9c6dc2dbbc70 R12: ffff995d53725588
R13: ffff995dd3638400 R14: 0000000000000000 R15: ffff995d53725f48
FS:  00007f9768d8fd00(0000) GS:ffff99646f180000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f97342f9000 CR3: 0000000146fca000 CR4: 0000000000350ee0
Call Trace:
 ttm_bo_move_accel_cleanup+0x1f2/0x3e0 [ttm]
 amdgpu_bo_move+0x15c/0x6f0 [amdgpu]
 ? amdgpu_vram_mgr_new+0x2c7/0x3b0 [amdgpu]
 ttm_bo_handle_move_mem+0x8b/0x170 [ttm]
 ttm_bo_validate+0x147/0x180 [ttm]
 amdgpu_bo_fault_reserve_notify+0xbf/0x140 [amdgpu]
 amdgpu_ttm_fault+0x31/0x80 [amdgpu]
 __do_fault+0x32/0x90
 handle_mm_fault+0x955/0x1000
 do_user_addr_fault+0x191/0x470
 exc_page_fault+0x4f/0xf0
 ? asm_exc_page_fault+0x8/0x30
 asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7f97692d9bdb
Code: 80 fa 01 77 3b 72 05 0f b6 0e 88 0f c3 c5 fa 6f 06 c5 fa 6f 4c 16 f0 c5 fa 7f 07 c5 fa 7f 4c 17 f0 c3 48 8b 4c 16 f8 48 8b 36 <48> 89 4c 17 f8 48 89 37 c3 8b 4c 16 fc 8b 36 89 4c 17 fc 89 37 c3
RSP: 002b:00007ffc612f0a68 EFLAGS: 00010246
RAX: 00007f97342f9000 RBX: 00007f97696725a0 RCX: 00010000017e00e3
RDX: 0000000000000008 RSI: 00010000017e00e3 RDI: 00007f97342f9000
RBP: 0000563790b428a0 R08: 0000000000000004 R09: 0000000000000008
R10: 000056379010e2d0 R11: 0000000000000000 R12: 00007f97342f9000
R13: 00000000000000e3 R14: 000056378ff949d0 R15: 0000000000000001
---[ end trace 6b31dbd706eb4c54 ]---
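
Since the same warning is repeated many times, a short sketch like the following can summarize which source locations the kernel warnings come from. It only assumes the `WARNING: CPU: … PID: … at <file>:<line> <symbol>` layout seen in the dump above; `count_warnings` is a made-up helper name, and the sample lines are copied from that dump.

```python
import re
from collections import Counter

# Matches the header line of a kernel WARNING splat as seen in the dump above.
WARNING_RE = re.compile(r"WARNING: CPU: \d+ PID: \d+ at (\S+) (\S+)")


def count_warnings(lines):
    """Count kernel WARNING headers, grouped by (source location, symbol)."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Sample input: the warning header from the dump, repeated as it is in the log.
sample = [
    "WARNING: CPU: 30 PID: 5913 at drivers/gpu/drm/ttm/ttm_bo.c:517 ttm_bo_release+0x293/0x2f0 [ttm]",
    "Call Trace:",
    "WARNING: CPU: 30 PID: 5913 at drivers/gpu/drm/ttm/ttm_bo.c:517 ttm_bo_release+0x293/0x2f0 [ttm]",
]

for (location, symbol), n in count_warnings(sample).most_common():
    print(f"{n:5d}  {location}  {symbol}")
```

To run it on a real log, pass the lines of the log file (or of the `dmesg` output) to `count_warnings` instead of the sample list.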


suffle