Proxmox reboot loop / strange behaviour

Hi all, so i posted on the Proxmox forum, Reddit and the LTT forum but either got nothing or the thread died off with no answers. So here i am, before i just give up on Proxmox.

Hi all, new to Proxmox and fancied setting up a cluster. 1 node so far and already failing. Apologies for the long post but it adds context.

So i got a HP ProDesk G3 400 mini - 8GB RAM (to be upgraded), i5-8500T. It should at least run the Proxmox system itself without VMs, but im getting an issue with random reboots.

So it came with W11 installed and i ran that as a test, web browsing, light gaming etc etc for a good few hours and was fine, pretty solid. So i used the VE iso and rufus, got proxmox installed following a couple of guides (craftcomputing, jimsgarage) and all went smooth.

Went to bed and woke up to the pc cooking itself and unresponsive. Power cycle and all good. Started messing with containers and a VM and all seemed ok (i hadnt touched anything beyond the config of those specific items), then the web UI dropped again and the chassis started to warm up. Another power cycle and all good.

Put W11 back on, left it running all day and night, not a hiccup, CPU at 40C and pretty stable all the way through. Unraid also appears to be stable after running for the best part of a day (loaded up the immich docker and let it process over 100k image files as a workout).

Put proxmox back on, web ui up and running, did nothing, went to bed, woke up still fine. Loaded the web ui started having a look around and boom, lock up.

Checked using ping and was unreachable, then a few minutes later it came back, so i looked around and it might be the realtek 8111 ethernet connection not playing nice. Checked the sys log however and im currently sat at 10 reboots. Ive powered it off now as it wasnt coming back and started to heat up.

Id point at hardware but 2 out of 3 OS were fine. Anyway, i couldn’t pull the syslog from the drive however this is what was loaded and running the web ui during the few reboots this morning.

Any help is much appreciated. The log is attached (linked, trimmed for size).

Update: i tried this with the new release of proxmox ve 8.2 but still getting the same behaviour

Update 2: i put W11 back on the pc and its been up for over 2 days now. ive been remoting in and running stuff periodically. Even did as was suggested in the LTT forum and hit it with Prime95. It went fine, no reboot or bluescreen. At a massive loss with this now.

As a new user i cant upload the syslog but here is a snippet from before and after a reboot

Apr 13 11:53:54 apollo systemd: Inserted module ‘autofs4’
Apr 13 11:53:54 apollo systemd: systemd 252.22-1~deb12u1 running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -BPF_FRAMEWORK -XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Apr 13 11:53:54 apollo systemd: Detected architecture x86-64.
Apr 13 11:53:54 apollo systemd: Hostname set to .
Apr 13 11:53:54 apollo kernel: Lockdown: systemd: /dev/mem,kmem,port is restricted; see man kernel_lockdown.7
Apr 13 11:53:54 apollo systemd: Queued start job for default target graphical.target.
Apr 13 11:53:54 apollo systemd: Created slice system-getty.slice - Slice /system/getty.
Apr 13 11:53:54 apollo systemd: Created slice system-modprobe.slice - Slice /system/modprobe.
Apr 13 11:53:54 apollo systemd: Created slice system-postfix.slice - Slice /system/postfix.
Apr 13 11:53:54 apollo systemd: Created slice system-systemd\x2dfsck.slice - Slice /system/systemd-fsck.
Apr 13 11:53:54 apollo systemd: Created slice user.slice - User and Session Slice.
Apr 13 11:53:54 apollo systemd: Started systemd-ask-password-console.path - Dispatch Password Requests to Console Directory Watch.
Apr 13 11:53:54 apollo systemd: Started systemd-ask-password-wall.path - Forward Password Requests to Wall Directory Watch.
Apr 13 11:53:54 apollo systemd: Set up automount proc-sys-fs-binfmt_misc.automount - Arbitrary Executable File Formats File System Automount Point.
Apr 13 11:53:54 apollo systemd: Expecting device dev-disk-by\x2duuid-B47C\x2dE1DF.device - /dev/disk/by-uuid/B47C-E1DF…
Apr 13 11:53:54 apollo systemd: Expecting device dev-pve-swap.device - /dev/pve/swap…
Apr 13 11:53:54 apollo systemd: Reached target ceph-fuse.target - ceph target allowing to start/stop all ceph-fuse@.service instances at once.
Apr 13 11:53:54 apollo systemd: Reached target ceph.target - ceph target allowing to start/stop all ceph*@.service instances at once.
Apr 13 11:53:54 apollo systemd: Reached target cryptsetup.target - Local Encrypted Volumes.
Apr 13 11:53:54 apollo systemd: Reached target integritysetup.target - Local Integrity Protected Volumes.
Apr 13 11:53:54 apollo systemd: Reached target paths.target - Path Units.
Apr 13 11:53:54 apollo systemd: Reached target slices.target - Slice Units.
Apr 13 11:53:54 apollo systemd: Reached target time-set.target - System Time Set.
Apr 13 11:53:54 apollo systemd: Reached target veritysetup.target - Local Verity Protected Volumes.
Apr 13 11:53:54 apollo systemd: Listening on dm-event.socket - Device-mapper event daemon FIFOs.
Apr 13 11:53:54 apollo systemd: Listening on lvm2-lvmpolld.socket - LVM2 poll daemon socket.
Apr 13 11:53:54 apollo systemd: Listening on rpcbind.socket - RPCbind Server Activation Socket.
Apr 13 11:53:54 apollo systemd: Listening on systemd-fsckd.socket - fsck to fsckd communication Socket.
Apr 13 11:53:54 apollo systemd: Listening on systemd-initctl.socket - initctl Compatibility Named Pipe.
Apr 13 11:53:54 apollo systemd: Listening on systemd-journald-audit.socket - Journal Audit Socket.
Apr 13 11:53:54 apollo systemd: Listening on systemd-journald-dev-log.socket - Journal Socket (/dev/log).
Apr 13 11:53:54 apollo systemd: Listening on systemd-journald.socket - Journal Socket.
Apr 13 11:53:54 apollo systemd: Listening on systemd-udevd-control.socket - udev Control Socket.
Apr 13 11:53:54 apollo systemd: Listening on systemd-udevd-kernel.socket - udev Kernel Socket.
Apr 13 11:53:54 apollo systemd: Mounting dev-hugepages.mount - Huge Pages File System…
Apr 13 11:53:54 apollo systemd: Mounting dev-mqueue.mount - POSIX Message Queue File System…
Apr 13 11:53:54 apollo systemd: Mounting sys-kernel-debug.mount - Kernel Debug File System…
Apr 13 11:53:54 apollo systemd: Mounting sys-kernel-tracing.mount - Kernel Trace File System…
Apr 13 11:53:54 apollo systemd: auth-rpcgss-module.service - Kernel Module supporting RPCSEC_GSS was skipped because of an unmet condition check (ConditionPathExists=/etc/krb5.keytab).
Apr 13 11:53:54 apollo systemd: Starting keyboard-setup.service - Set the console keyboard layout…
Apr 13 11:53:54 apollo systemd: Starting kmod-static-nodes.service - Create List of Static Device Nodes…
Apr 13 11:53:54 apollo systemd: Starting lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling…
Apr 13 11:53:54 apollo systemd: Starting modprobe@configfs.service - Load Kernel Module configfs…
Apr 13 11:53:54 apollo systemd: Starting modprobe@dm_mod.service - Load Kernel Module dm_mod…
Apr 13 11:53:54 apollo systemd: Starting modprobe@drm.service - Load Kernel Module drm…
Apr 13 11:53:54 apollo systemd: Starting modprobe@efi_pstore.service - Load Kernel Module efi_pstore…
Apr 13 11:53:54 apollo systemd: Starting modprobe@fuse.service - Load Kernel Module fuse…
Apr 13 11:53:54 apollo systemd: Starting modprobe@loop.service - Load Kernel Module loop…
Apr 13 11:53:54 apollo systemd: systemd-fsck-root.service - File System Check on Root Device was skipped because of an unmet condition check (ConditionPathExists=!/run/initramfs/fsck-root).
Apr 13 11:53:54 apollo systemd: Starting systemd-journald.service - Journal Service…
Apr 13 11:53:54 apollo systemd: Starting systemd-modules-load.service - Load Kernel Modules…
Apr 13 11:53:54 apollo systemd: Starting systemd-remount-fs.service - Remount Root and Kernel File Systems…
Apr 13 11:53:54 apollo systemd: Starting systemd-udev-trigger.service - Coldplug All udev Devices…
Apr 13 11:53:54 apollo systemd: Mounted dev-hugepages.mount - Huge Pages File System.
Apr 13 11:53:54 apollo systemd: Mounted dev-mqueue.mount - POSIX Message Queue File System.
Apr 13 11:53:54 apollo systemd: Mounted sys-kernel-debug.mount - Kernel Debug File System.
Apr 13 11:53:54 apollo systemd: Mounted sys-kernel-tracing.mount - Kernel Trace File System.
Apr 13 11:53:54 apollo kernel: pstore: Using crash dump compression: deflate
Apr 13 11:53:54 apollo systemd: Finished keyboard-setup.service - Set the console keyboard layout.
Apr 13 11:53:54 apollo systemd: Finished kmod-static-nodes.service - Create List of Static Device Nodes.
Apr 13 11:53:54 apollo systemd: modprobe@configfs.service: Deactivated successfully.
Apr 13 11:53:54 apollo systemd: Finished modprobe@configfs.service - Load Kernel Module configfs.
Apr 13 11:53:54 apollo systemd: modprobe@dm_mod.service: Deactivated successfully.
Apr 13 11:53:54 apollo systemd: Finished modprobe@dm_mod.service - Load Kernel Module dm_mod.
Apr 13 11:53:54 apollo systemd: modprobe@fuse.service: Deactivated successfully.
Apr 13 11:53:54 apollo systemd: Finished modprobe@fuse.service - Load Kernel Module fuse.
Apr 13 11:53:54 apollo systemd: modprobe@loop.service: Deactivated successfully.
Apr 13 11:53:54 apollo systemd: Finished modprobe@loop.service - Load Kernel Module loop.
Apr 13 11:53:54 apollo systemd: Mounting sys-fs-fuse-connections.mount - FUSE Control File System…
Apr 13 11:53:54 apollo systemd: Mounting sys-kernel-config.mount - Kernel Configuration File System…
Apr 13 11:53:54 apollo systemd: systemd-repart.service - Repartition Root Disk was skipped because no trigger condition checks were met.
Apr 13 11:53:54 apollo kernel: EXT4-fs (dm-1): re-mounted 5aab04a6-ff53-4c73-980b-2747e9d639ed r/w. Quota mode: none.
Apr 13 11:53:54 apollo systemd: Finished systemd-remount-fs.service - Remount Root and Kernel File Systems.
Apr 13 11:53:54 apollo systemd: Mounted sys-fs-fuse-connections.mount - FUSE Control File System.
Apr 13 11:53:54 apollo systemd: systemd-firstboot.service - First Boot Wizard was skipped because of an unmet condition check (ConditionFirstBoot=yes).
Apr 13 11:53:54 apollo systemd: Starting systemd-random-seed.service - Load/Save Random Seed…
Apr 13 11:53:54 apollo systemd: Starting systemd-sysusers.service - Create System Users…
Apr 13 11:53:54 apollo systemd: Mounted sys-kernel-config.mount - Kernel Configuration File System.
Apr 13 11:53:54 apollo systemd: Started dm-event.service - Device-mapper event daemon.
Apr 13 11:53:54 apollo kernel: pstore: Registered efi_pstore as persistent store backend
Apr 13 11:53:54 apollo systemd: Finished systemd-random-seed.service - Load/Save Random Seed.
Apr 13 11:53:54 apollo systemd: first-boot-complete.target - First Boot Complete was skipped because of an unmet condition check (ConditionFirstBoot=yes).
Apr 13 11:53:54 apollo systemd: modprobe@efi_pstore.service: Deactivated successfully.
Apr 13 11:53:54 apollo systemd: Finished modprobe@efi_pstore.service - Load Kernel Module efi_pstore.
Apr 13 11:53:54 apollo systemd: systemd-pstore.service - Platform Persistent Storage Archival was skipped because of an unmet condition check (ConditionDirectoryNotEmpty=/sys/fs/pstore).
Apr 13 11:53:54 apollo kernel: ACPI: bus type drm_connector registered
Apr 13 11:53:54 apollo systemd: modprobe@drm.service: Deactivated successfully.
Apr 13 11:53:54 apollo systemd: Finished modprobe@drm.service - Load Kernel Module drm.
Apr 13 11:53:54 apollo systemd: Finished systemd-sysusers.service - Create System Users.
Apr 13 11:53:54 apollo systemd: Starting systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev…
Apr 13 11:53:54 apollo systemd-journald[324]: Journal started
Apr 13 11:53:54 apollo systemd-journald[324]: Runtime Journal (/run/log/journal/ebd36095ef8e40018464ee7246e62b79) is 8.0M, max 76.4M, 68.4M free.
Apr 13 11:53:54 apollo dmeventd[336]: dmeventd ready for processing.
Apr 13 11:53:54 apollo systemd-modules-load[325]: Inserted module ‘vhost_net’
Apr 13 11:53:54 apollo dmeventd[336]: Monitoring thin pool pve-data.
Apr 13 11:53:54 apollo systemd[1]: Starting systemd-journal-flush.service - Flush Journal to Persistent Storage…
Apr 13 11:53:54 apollo systemd: Started systemd-journald.service - Journal Service.
Apr 13 11:53:54 apollo systemd-journald[324]: Time spent on flushing to /var/log/journal/ebd36095ef8e40018464ee7246e62b79 is 6.102ms for 829 entries.
– Reboot –
Apr 13 11:55:06 apollo kernel: Linux version 6.5.11-8-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-8 (2024-01-30T12:27Z) ()
Apr 13 11:55:06 apollo kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.5.11-8-pve root=/dev/mapper/pve-root ro quiet
Apr 13 11:55:06 apollo kernel: KERNEL supported cpus:
Apr 13 11:55:06 apollo kernel: Intel GenuineIntel
Apr 13 11:55:06 apollo kernel: AMD AuthenticAMD
Apr 13 11:55:06 apollo kernel: Hygon HygonGenuine
Apr 13 11:55:06 apollo kernel: Centaur CentaurHauls
Apr 13 11:55:06 apollo kernel: zhaoxin Shanghai
Apr 13 11:55:06 apollo kernel: BIOS-provided physical RAM map:
Apr 13 11:55:06 apollo kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009efff] usable
Apr 13 11:55:06 apollo kernel: BIOS-e820: [mem 0x000000000009f000-0x00000000000fffff] reserved
Apr 13 11:55:06 apollo kernel: BIOS-e820: [mem 0x0000000000100000-0x000000008e82cfff] usable
Apr 13 11:55:06 apollo kernel: BIOS-e820: [mem 0x000000008e82d000-0x000000008fa8efff] reserved
Apr 13 11:55:06 apollo kernel: BIOS-e820: [mem 0x000000008fa8f000-0x000000008fc8efff] ACPI NVS
Apr 13 11:55:06 apollo kernel: BIOS-e820: [mem 0x000000008fc8f000-0x000000008fd0efff] ACPI data
Apr 13 11:55:06 apollo kernel: BIOS-e820: [mem 0x000000008fd0f000-0x000000008fd0ffff] usable
Apr 13 11:55:06 apollo kernel: BIOS-e820: [mem 0x000000008fd10000-0x000000009e7fffff] reserved
Apr 13 11:55:06 apollo kernel: BIOS-e820: [mem 0x00000000fe010000-0x00000000fe010fff] reserved
Apr 13 11:55:06 apollo kernel: BIOS-e820: [mem 0x00000000ff400000-0x00000000ffffffff] reserved
Apr 13 11:55:06 apollo kernel: BIOS-e820: [mem 0x0000000100000000-0x000000025f7fffff] usable
Apr 13 11:55:06 apollo kernel: NX (Execute Disable) protection: active
Apr 13 11:55:06 apollo kernel: efi: EFI v2.6 by HP

2 things you can try:

  1. see if the Proxmox VE 6.4 ISO works correctly. That release ships different driver versions, so the result might point us to where the specific issue is.

  2. some people have reported that a different AC adapter with a higher wattage worked better for them on those units. i think it was because of a linux kernel power state issue, maybe. It has been a while.

Ok, thank you for that im finally making progress.

After a few hours of playing around i have proxmox 6.4 installed and running as well as a ubuntu server up and running in a VM with docker and portainer all up.

Would this suggest a driver change in the versions has caused an issue?

could still be either, but if you are not too far along there are some more things to try if you want. do an upgrade following the Proxmox 6 to 7 guide, and keep both kernels installed. then you will have a version 7 install with its own kernel and the older kernel from 6.4 still installed. as long as you have a fairly vanilla system the older kernel will probably still work, and you can boot back and forth and see if the problem comes back with one kernel or the other.

you could always just run 6.4, but at some point you will need to deal with creating and testing an upgrade path. by the time that becomes urgent, the problem may be fixed in the newer versions.

Awesome, ill give that a go tonight. Proxmox is all stock and just the 1 VM so should be easy enough.

Yeah that was my hope after trying 8.1 when 8.2 came out but obviously its still missing something.

Phew!

Sounded more like thermal issues, where windows might have had better temp throttling

I did consider it. Checking the temps on 6.4, its at a nice 34C with the single VM, so definitely a software incompatibility issue. But at least i now have some progress


Just as an update. after a couple of days with 6.4 running rock solid i did a fresh install of 7.4.

This is still up and running just fine 24 hours on, so it looks like ill be waiting for v8 to get fixed for my hardware and just keep testing it on each version release


This does not sound like a simple software incompatibility, more like a preexisting issue that is being tripped more often by newer versions than by older ones.

If it were driver related, there should be a kernel panic logged, but there is nothing. And a reboot would only ensue if the system is configured for it (/proc/sys/kernel/panic == 1)
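
To make the panic setting above concrete, here is a minimal sketch of how that value can be read. The function name `describe_panic_setting` is made up for illustration. The meanings follow the kernel's sysctl documentation as far as I know it: 0 means the kernel just hangs after a panic, a negative value means an immediate reboot, and a positive value is the delay in seconds before rebooting.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): explain a kernel.panic value.
# Assumed semantics: 0 = hang forever, negative = reboot immediately,
# positive = reboot after that many seconds.
describe_panic_setting() {
    case "$1" in
        0)  echo "no automatic reboot on panic" ;;
        -*) echo "immediate reboot on panic" ;;
        *)  echo "reboot after ${1}s on panic" ;;
    esac
}

# Usage on a live system:
#   describe_panic_setting "$(cat /proc/sys/kernel/panic)"
```

If the box reports "no automatic reboot on panic" and still restarts by itself, a plain kernel panic becomes a less likely explanation.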

Go through all historical journalctl logs and look for anomalies happening before an unscheduled reboot. Writing from memory, so the switches might be slightly different

# list available historical logs (if there are fewer than expected, journald has to be reconfigured to keep them)
journalctl --list-boots
# list current boot session logs
journalctl -b 0
# list previous boot session logs
journalctl -b -1
# and so on: -2, -3, ... for older boots

If there is consistently nothing there (as in your snippet) I would check:

  • power supply (run the newest proxmox with a power brick from one of the other, solid units)
  • look inside the case for shorts or failing electronics (scorch marks, burst SMDs, etc)
  • do a full stress test (is it triggered by load? does it fail during deep idle?)
    • in case of extreme overheating the unit might restart as a last resort, but I have never seen it in the real world. It would have to be catastrophic, cpu throttling usually suffices
    • voltage issues under load might also be the cause, but that would probably be related to the psu
  • do a full memtest battery overnight, also just in case

Random reboots are generally not a linux issue, even less so on a non-exotic platform like the consumer intel 7000 series

appreciate the write up.

since 7.4 has been fine, i.e. literally no issues at all. im down for testing 8.4 some more. ill get it installed tonight and let it run until morning. at least then ill have newer logs and potentially other logs i can recover for any diagnosis

so it didnt take long for the symptoms to return. got 8.2.2 (wrong version in my last post) installed, did an update check, all fine.

after about 20 minutes we are into the loop again. id just downloaded and installed ubuntu into a vm, then disconnected the DisplayPort cable and keyboard so it could run headless

i got a syslog from the web gui showing 6 reboots within a 10 minute span but i cant upload the file here yet (new user) and its too large for pastebin. i also grabbed the /var/log folder but im not entirely sure where to start looking
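
One way to put numbers on a reboot loop like this from a saved syslog is to pull out the timestamp of every boot. This is only a sketch: the function name `boot_times` is made up, and it assumes each boot starts with a `kernel: Linux version` banner and that timestamps use the three-field `Mon DD HH:MM:SS` form seen in the logs in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the timestamp of each boot found in a syslog
# read from stdin. Assumes one "kernel: Linux version" banner per boot and
# a "Mon DD HH:MM:SS" timestamp in the first three space-separated fields.
boot_times() {
    grep 'kernel: Linux version' | cut -d' ' -f1-3
}

# Usage (syslog.txt is a placeholder file name):
#   boot_times < syslog.txt          # one line per boot
#   boot_times < syslog.txt | wc -l  # number of boots in the file
```

The gaps between consecutive timestamps show how long each boot survived, which helps tell "dies under load" apart from "dies at idle".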

I am running proxmox 8.1 in uncustomized setup, so my result should be broadly applicable to your setup too:

  • default kernel panic value is 0, i.e system will not restart automatically if kernel panics (scratch that)
  • journald is logging persistently, so there are historical boot data available
root@pve:~# journalctl --list-boots
IDX BOOT ID                          FIRST ENTRY                  LAST ENTRY
-26 6cd24f429ed04271ab5ee147dd3cfa20 Wed 2021-11-10 21:11:01 CET  Wed 2021-11-10 21:21:13 CET
-25 d18900e3f83a41918b33b4b3642a1709 Wed 2021-11-10 21:22:47 CET  Fri 2021-11-12 15:41:55 CET
-24 ab17f06a0ab94a08a8ab517fc20b86b1 Wed 2021-11-17 20:30:06 CET  Wed 2021-12-01 13:35:33 CET
-23 1f15b37ceb654eda89d41ab66d74c15a Wed 2021-12-01 13:35:55 CET  Fri 2021-12-10 00:14:01 CET
-22 20e1a534aa9448dea4e7afda7b63d07e Fri 2021-12-10 00:14:22 CET  Sat 2021-12-18 14:13:29 CET
-21 dd14a4b0194444d89b68d4fe1c40f948 Fri 2022-05-27 17:15:30 CEST Sun 2022-08-14 10:22:14 CEST
-20 529fe1c6c6cb4a9bba53b7ad319cc919 Sun 2022-08-14 10:30:45 CEST Sun 2022-09-04 13:09:26 CEST
-19 eb0c31c9f8c84036bf4cb719cae1f06e Sun 2022-09-04 13:09:50 CEST Fri 2022-11-11 09:15:33 CET
-18 b45fa67a94a8498f8ad39847114a03ea Fri 2022-11-11 09:15:57 CET  Mon 2022-12-19 12:34:29 CET
-17 3951edbe82054e33932e366d9bd0d214 Mon 2022-12-19 12:34:54 CET  Mon 2022-12-19 12:37:35 CET
-16 bb71193f519e459faa29d4f885715586 Mon 2022-12-19 12:37:54 CET  Wed 2023-01-25 16:10:34 CET
-15 4e26183ca12d4547b98b83edb816e33a Wed 2023-01-25 16:10:58 CET  Wed 2023-01-25 16:14:06 CET
-14 f61f45cbd6754dc98610eea85454f621 Wed 2023-01-25 16:14:25 CET  Wed 2023-01-25 16:14:47 CET
-13 05199d55e44f4e67a5dfbdde2e1442db Wed 2023-01-25 16:15:06 CET  Wed 2023-01-25 16:21:51 CET
-12 73563725fa484d2ca64712725b5d7ea9 Wed 2023-01-25 16:22:10 CET  Wed 2023-03-22 16:07:14 CET
-11 053153e553724b3e8b3bcb0239334fac Sat 2023-03-25 11:41:45 CET  Sun 2023-03-26 14:51:05 CEST
-10 56efdcbeb8714538a25b6cce5d1ac182 Sun 2023-03-26 14:51:29 CEST Thu 2023-04-13 16:41:28 CEST
 -9 a73ba2c6b5b34fd691356f797659598f Thu 2023-04-13 23:45:10 CEST Tue 2023-05-16 09:41:39 CEST
 -8 b460eb557aa34530b0ca9d93a089ad19 Tue 2023-05-16 09:42:03 CEST Sat 2023-06-10 15:50:28 CEST
 -7 b0112d2d3860497fba133fa4803f107c Sat 2023-06-10 15:54:29 CEST Sat 2023-06-10 15:59:06 CEST
 -6 6cecbbe8740f47bbb178454f44da21ee Sat 2023-06-10 16:27:17 CEST Thu 2023-06-22 22:32:49 CEST
 -5 131b98d70c094d8ba57adab5b51b3d05 Thu 2023-06-22 22:33:12 CEST Thu 2023-11-30 15:09:50 CET
 -4 444c67a6c0b54c28b0b68405e7ab7f10 Thu 2023-11-30 15:10:15 CET  Sun 2024-01-07 11:14:45 CET
 -3 f123d5668ec4451890629a59dde54ac4 Sun 2024-01-07 11:50:31 CET  Wed 2024-01-10 14:40:02 CET
 -2 652d6187096542de9b5683388eafbb18 Wed 2024-01-10 14:40:25 CET  Sun 2024-03-03 21:18:55 CET
 -1 ca1109257a4c48ed8ac1e200b73fa5f6 Sun 2024-03-03 21:19:19 CET  Sat 2024-04-06 15:13:30 CEST
  0 7398903fc7e74dfcb1a49bd4a89f4523 Sat 2024-04-06 15:13:53 CEST Sun 2024-05-05 09:29:34 CEST

Rock stable installation on HP ProDesk 405 G6 DM , very similar to yours.

Journald aggregates all relevant log data, so manually exploring /var/log is not strictly necessary here, only if you want to manually check the source streams in a more condensed view. The Proxmox stack itself logs to /var/log/pve*, but nothing really interesting is there.

So go through e.g. the last 100 lines of each boot's log and look for interesting errors. If they are there, they should be obvious.
You can install lnav utility for comfortable work with logs.

apt install lnav -y
journalctl -b 0 --no-pager -n 100 | lnav
journalctl -b -1 --no-pager -n 100 | lnav
...
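
If reading each boot by eye gets tedious, a rough keyword filter can narrow things down. This is a sketch only: the function name `suspicious_lines` and the keyword list are my own assumptions and certainly not exhaustive. The phrases are ones the kernel commonly prints for panics, oopses, machine check errors and thermal throttling.

```shell
#!/bin/sh
# Hypothetical helper: keep only log lines (from stdin) that contain phrases
# typically printed for serious kernel or hardware trouble.
# The keyword list is an assumption, not a complete catalogue.
suspicious_lines() {
    grep -Ei 'kernel panic|oops:|call trace|hardware error|machine check|temperature above threshold'
}

# Usage on a live system, previous boot:
#   journalctl -b -1 --no-pager | suspicious_lines
# journalctl can also filter by priority on its own:
#   journalctl -b -1 -p err --no-pager
```

An empty result is not proof of a healthy boot. It only means none of these phrases reached the journal before the machine went down.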

If there is anything relevant there, post it or investigate. If not, focus on hardware itself:

  • PSU,
  • memory stability
  • cooling …

EDIT: catastrophic error like kernel panic looks like this

so i managed to parse the logs into a text file in between one of the reboots. and also got the last 100 lines loaded as per the commands provided.

ive had a look through and other than the errors for the dirty bit due to unclean shutdown nothing is jumping out.

below is the previous to last boot (last 100 lines) so the log is as complete as it would be.

May 06 12:43:36 apollo systemd[1]: Started pve-daily-update.timer - Daily PVE download activities.
May 06 12:43:36 apollo systemd[1]: Reached target timers.target - Timer Units.
May 06 12:43:36 apollo systemd[1]: Starting rrdcached.service - LSB: start or stop rrdcached...
May 06 12:43:36 apollo systemd[1]: Finished lxc.service - LXC Container Initialization and Autoboot Code.
May 06 12:43:36 apollo rrdcached[742]: rrdcached started.
May 06 12:43:36 apollo systemd[1]: Started rrdcached.service - LSB: start or stop rrdcached.
May 06 12:43:36 apollo systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
May 06 12:43:36 apollo postfix[779]: Postfix is using backwards-compatible default settings
May 06 12:43:36 apollo postfix[779]: See redacted for forum post - for details
May 06 12:43:36 apollo postfix[779]: To disable backwards compatibility use "postconf compatibility_level=3.6" and "postfix reload"
May 06 12:43:36 apollo pmxcfs[778]: [main] notice: resolved node name 'apollo' to '10.10.10.13' for default node IP address
May 06 12:43:36 apollo pmxcfs[778]: [main] notice: resolved node name 'apollo' to '10.10.10.13' for default node IP address
May 06 12:43:36 apollo postfix/postfix-script[876]: starting the Postfix mail system
May 06 12:43:36 apollo postfix/master[878]: daemon started -- version 3.7.10, configuration /etc/postfix
May 06 12:43:36 apollo systemd[1]: Started postfix@-.service - Postfix Mail Transport Agent (instance -).
May 06 12:43:36 apollo systemd[1]: Starting postfix.service - Postfix Mail Transport Agent...
May 06 12:43:36 apollo systemd[1]: Finished postfix.service - Postfix Mail Transport Agent.
May 06 12:43:36 apollo kernel: vmbr0: port 1(enp2s0) entered disabled state
May 06 12:43:37 apollo systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.
May 06 12:43:37 apollo systemd[1]: corosync.service - Corosync Cluster Engine was skipped because of an unmet condition check (ConditionPathExists=/etc/corosync/corosync.conf).
May 06 12:43:37 apollo systemd[1]: Started cron.service - Regular background program processing daemon.
May 06 12:43:37 apollo systemd[1]: Started proxmox-firewall.service - Proxmox nftables firewall.
May 06 12:43:37 apollo cron[885]: (CRON) INFO (pidfile fd = 3)
May 06 12:43:37 apollo systemd[1]: Starting pve-firewall.service - Proxmox VE firewall...
May 06 12:43:37 apollo cron[885]: (CRON) INFO (Running @reboot jobs)
May 06 12:43:37 apollo systemd[1]: Starting pvedaemon.service - PVE API Daemon...
May 06 12:43:37 apollo systemd[1]: Starting pvestatd.service - PVE Status Daemon...
May 06 12:43:37 apollo pve-firewall[895]: starting server
May 06 12:43:37 apollo systemd[1]: Started pve-firewall.service - Proxmox VE firewall.
May 06 12:43:37 apollo pvestatd[903]: starting server
May 06 12:43:37 apollo systemd[1]: Started pvestatd.service - PVE Status Daemon.
May 06 12:43:38 apollo pvedaemon[921]: starting server
May 06 12:43:38 apollo pvedaemon[921]: starting 3 worker(s)
May 06 12:43:38 apollo pvedaemon[921]: worker 922 started
May 06 12:43:38 apollo pvedaemon[921]: worker 923 started
May 06 12:43:38 apollo pvedaemon[921]: worker 924 started
May 06 12:43:38 apollo systemd[1]: Started pvedaemon.service - PVE API Daemon.
May 06 12:43:38 apollo systemd[1]: Starting pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon...
May 06 12:43:38 apollo systemd[1]: Starting pveproxy.service - PVE API Proxy Server...
May 06 12:43:38 apollo kernel: r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control rx/tx
May 06 12:43:38 apollo kernel: vmbr0: port 1(enp2s0) entered blocking state
May 06 12:43:38 apollo kernel: vmbr0: port 1(enp2s0) entered forwarding state
May 06 12:43:38 apollo pve-ha-crm[929]: starting server
May 06 12:43:38 apollo pve-ha-crm[929]: status change startup => wait_for_quorum
May 06 12:43:38 apollo systemd[1]: Started pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon.
May 06 12:43:39 apollo pveproxy[930]: starting server
May 06 12:43:39 apollo pveproxy[930]: starting 3 worker(s)
May 06 12:43:39 apollo pveproxy[930]: worker 931 started
May 06 12:43:39 apollo pveproxy[930]: worker 932 started
May 06 12:43:39 apollo pveproxy[930]: worker 933 started
May 06 12:43:39 apollo systemd[1]: Started pveproxy.service - PVE API Proxy Server.
May 06 12:43:39 apollo systemd[1]: Starting pve-ha-lrm.service - PVE Local HA Resource Manager Daemon...
May 06 12:43:39 apollo systemd[1]: Starting spiceproxy.service - PVE SPICE Proxy Server...
May 06 12:43:39 apollo spiceproxy[936]: starting server
May 06 12:43:39 apollo spiceproxy[936]: starting 1 worker(s)
May 06 12:43:39 apollo spiceproxy[936]: worker 937 started
May 06 12:43:39 apollo systemd[1]: Started spiceproxy.service - PVE SPICE Proxy Server.
May 06 12:43:40 apollo pve-ha-lrm[938]: starting server
May 06 12:43:40 apollo pve-ha-lrm[938]: status change startup => wait_for_agent_lock
May 06 12:43:40 apollo systemd[1]: Started pve-ha-lrm.service - PVE Local HA Resource Manager Daemon.
May 06 12:43:40 apollo systemd[1]: Starting pve-guests.service - PVE guests...
May 06 12:43:41 apollo pve-guests[940]: <root@pam> starting task UPID:apollo:000003AD:00000466:6638C26D:startall::root@pam:
May 06 12:43:41 apollo pve-guests[940]: <root@pam> end task UPID:apollo:000003AD:00000466:6638C26D:startall::root@pam: OK
May 06 12:43:41 apollo systemd[1]: Finished pve-guests.service - PVE guests.
May 06 12:43:41 apollo systemd[1]: Starting pvescheduler.service - Proxmox VE scheduler...
May 06 12:43:42 apollo pvescheduler[943]: starting server
May 06 12:43:42 apollo systemd[1]: Started pvescheduler.service - Proxmox VE scheduler.
May 06 12:43:42 apollo systemd[1]: Reached target multi-user.target - Multi-User System.
May 06 12:43:42 apollo systemd[1]: Reached target graphical.target - Graphical Interface.
May 06 12:43:42 apollo systemd[1]: Starting systemd-update-utmp-runlevel.service - Record Runlevel Change in UTMP...
May 06 12:43:42 apollo systemd[1]: systemd-update-utmp-runlevel.service: Deactivated successfully.
May 06 12:43:42 apollo systemd[1]: Finished systemd-update-utmp-runlevel.service - Record Runlevel Change in UTMP.
May 06 12:43:42 apollo systemd[1]: Startup finished in 3.460s (firmware) + 7.381s (loader) + 2.093s (kernel) + 9.975s (userspace) = 22.911s.
May 06 12:43:42 apollo pvedaemon[923]: <root@pam> successful auth for user 'root@pam'
May 06 12:43:43 apollo pvedaemon[950]: starting termproxy UPID:apollo:000003B6:00000517:6638C26F:vncshell::root@pam:
May 06 12:43:43 apollo pvedaemon[922]: <root@pam> starting task UPID:apollo:000003B6:00000517:6638C26F:vncshell::root@pam:
May 06 12:43:43 apollo pvedaemon[923]: <root@pam> successful auth for user 'root@pam'
May 06 12:43:43 apollo login[957]: pam_unix(login:session): session opened for user root(uid=0) by (uid=0)
May 06 12:43:43 apollo systemd[1]: Created slice user-0.slice - User Slice of UID 0.
May 06 12:43:43 apollo systemd[1]: Starting user-runtime-dir@0.service - User Runtime Directory /run/user/0...
May 06 12:43:43 apollo systemd-logind[584]: New session 1 of user root.
May 06 12:43:43 apollo systemd[1]: Finished user-runtime-dir@0.service - User Runtime Directory /run/user/0.
May 06 12:43:43 apollo systemd[1]: Starting user@0.service - User Manager for UID 0...
May 06 12:43:43 apollo (systemd)[963]: pam_unix(systemd-user:session): session opened for user root(uid=0) by (uid=0)
May 06 12:43:43 apollo systemd[963]: Queued start job for default target default.target.
May 06 12:43:43 apollo systemd[963]: Created slice app.slice - User Application Slice.
May 06 12:43:43 apollo systemd[963]: Reached target paths.target - Paths.
May 06 12:43:43 apollo systemd[963]: Reached target timers.target - Timers.
May 06 12:43:43 apollo systemd[963]: Listening on dirmngr.socket - GnuPG network certificate management daemon.
May 06 12:43:43 apollo systemd[963]: Listening on gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
May 06 12:43:43 apollo systemd[963]: Listening on gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
May 06 12:43:43 apollo systemd[963]: Listening on gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
May 06 12:43:43 apollo systemd[963]: Listening on gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
May 06 12:43:43 apollo systemd[963]: Reached target sockets.target - Sockets.
May 06 12:43:43 apollo systemd[963]: Reached target basic.target - Basic System.
May 06 12:43:43 apollo systemd[963]: Reached target default.target - Main User Target.
May 06 12:43:43 apollo systemd[963]: Startup finished in 110ms.
May 06 12:43:43 apollo systemd[1]: Started user@0.service - User Manager for UID 0.
May 06 12:43:43 apollo systemd[1]: Started session-1.scope - Session 1 of User root.
May 06 12:43:43 apollo login[978]: ROOT LOGIN  on '/dev/pts/0'
May 06 12:43:46 apollo chronyd[721]: Selected source 131.111.8.60 (2.debian.pool.ntp.org)
May 06 12:43:46 apollo chronyd[721]: System clock TAI offset set to 37 seconds
May 06 12:43:47 apollo chronyd[721]: Selected source 162.159.200.1 (2.debian.pool.ntp.org)
May 06 12:44:03 apollo systemd[1]: systemd-fsckd.service: Deactivated successfully.
May 06 12:44:11 apollo sshd[1079]: Accepted password for root from 10.10.10.135 port 53600 ssh2
May 06 12:44:11 apollo sshd[1079]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
May 06 12:44:11 apollo systemd-logind[584]: New session 3 of user root.
May 06 12:44:11 apollo systemd[1]: Started session-3.scope - Session 3 of User root.
May 06 12:44:11 apollo sshd[1079]: pam_env(sshd:session): deprecated reading of user environment enabled
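
The excerpt above stops right after an SSH login, with no shutdown messages at all, which fits a hard reset rather than an orderly reboot. A rough way to check that for any boot is sketched below. The function name `ended_cleanly` is made up, and the check rests on an assumption: that an orderly shutdown leaves a `Journal stopped` message from systemd-journald among the last lines of that boot's log.

```shell
#!/bin/sh
# Hypothetical helper: guess whether a boot's log (read from stdin) ended
# with an orderly shutdown. Heuristic assumption: a clean shutdown leaves a
# "Journal stopped" line within the last 20 lines.
ended_cleanly() {
    if tail -n 20 | grep -q 'Journal stopped'; then
        echo "clean"
    else
        echo "abrupt"
    fi
}

# Usage on a live system, previous boot:
#   journalctl -b -1 --no-pager | ended_cleanly
```

A run of "abrupt" results with nothing alarming before the cut-off points at the machine losing power or being reset underneath the OS.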

ive also put the full log for the above loop on pastebin ‘yWtwPiqz’ - as no links allowed

i did a search over it all for ‘panic’ but only found this line for each loop

May 05 00:09:12 apollo kernel: softdog: initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
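
That hit is only a module parameter that happens to contain the word "panic", not an actual kernel panic. A small sketch for splitting such a line into its settings follows. The function name `softdog_params` is made up. As far as I know the softdog parameters, `soft_margin` is the watchdog timeout in seconds, `soft_noboot=0` means the watchdog reboots the machine when it expires, and `soft_panic=0` means it does so without raising a panic.

```shell
#!/bin/sh
# Hypothetical helper: print each name=number setting found on stdin,
# one per line, e.g. from the softdog init message.
softdog_params() {
    grep -oE '[a-z_]+=[0-9]+'
}

# Usage on a live system:
#   journalctl -k -b 0 --no-pager | grep 'softdog: initialized' | softdog_params
```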

im going to try the change in power supplies now to be sure but wanted to get this bit posted at least.

thank you for the help so far too

so it was a quick turnaround, but i tried the power supply from my other unit (same make/model/specs etc) and like clockwork got the boot loop.

so im still fairly convinced now that its clearly something in ve 8 thats causing it.

ive taken the cover off and given the board a look over: nothing blown, burnt, loose or otherwise likely to cause a short. id also have expected a hardware fault to cause issues in the earlier proxmox versions and in windows/unraid when i tested.

so yeah im 99% sure now its software related from ve 8. unless you can see anything in the logs, guess ill play the waiting game until its had a few more releases

As suspected, zero diagnostic output generated, damn.

I dimly remember a somewhat similar issue with Skylake-based office SFFs I had at work way back: some computers just froze randomly and had to be hard power cycled.

No diagnostics ever showed anything, even the service techs gave up.

We chalked it up to either faulty silicon or deep firmware fuckery going on. It was also random: we had hundreds of identical machines and only a few of them were affected.

I would try a full cold reset, including BIOS + CMOS. Do a BIOS upgrade and check whether fwupd has something to offer.

Yeah this is just a weird edge case i think.

Im of a mind to just give up with this one. Im in some auctions to try to get some more modern USFF machines so maybe those will have better luck. In the meantime this will be a little test machine until a few more VE8 releases come out.

Many thanks for jumping in and giving me a hand, it was much appreciated

Jumping back in with just a potential update. After trying to install ubuntu server on the bare metal, i got the same results. After reading around the topic and doing some more searches related more to ubuntu, its possible that hardware of this age is incompatible with the newer 6.x kernels.
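
To line a theory like this up against the logs, it helps to note which kernel each boot actually ran. A minimal sketch, assuming the boot banner looks like the `Linux version 6.5.11-8-pve` line in the log near the top of the thread. The function name `kernel_versions` is made up.

```shell
#!/bin/sh
# Hypothetical helper: print the kernel release from every
# "kernel: Linux version ..." banner found on stdin.
kernel_versions() {
    sed -n 's/.*kernel: Linux version \([^ ]*\) .*/\1/p'
}

# Usage on a live system, one boot at a time:
#   journalctl -k -b -1 --no-pager | kernel_versions
# or simply, for the running kernel:
#   uname -r
```

Noting the kernel release next to "stable" or "reboot loop" for each install tested would show whether the split really falls on the kernel series.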