WD Ultrastar DC SN620 U.2 drive issues on Supermicro H12DSi-N6 Dual EPYC 7742 server

I have a pretty dumb problem: some U.2 drives just keep disappearing from an EPYC-based Supermicro server running Linux.

Specs and Setup:

CPU: Dual Socket 64-core EPYC 7742
MOBO: Supermicro H12DSi-N6 board
MEM: 512GB DDR4 2400 Registered ECC Micron
GPU: Nvidia RTX A2000 12GB (irrelevant; it wasn't always installed)
Backplane: I assume a Supermicro split SAS/SATA and U.2 NVMe Backplane. The NVMe part goes to OcuLink
Drives: 4x WD Ultrastar DC SN620 U.2 1.6TB drives configured in ZFS RAIDZ1
PSU: Dual Redundant 1200W Platinum 1U Supermicro PSUs

OS: Proxmox 8.x latest (Debian 12 Bookworm based)
Kernel: 6.5.11-7-pve
BIOS: 2.7 - Latest for board

At boot it is a coin flip whether the drives show up in the OS. When they don't, they stay missing until I physically re-insert them in the bay.

Under load, one of the 4 U.2 drives in the ZFS array will just disappear and show as removed. Again, I have to physically remove and reinsert it.

This has been happening since day one of this server, back on Proxmox 7.x with kernel 5.15 and an older BIOS. Now ZFS is also reporting checksum issues on the drives.

dmesg/syslog provide no clues; they only show the NVMe device appearing or disappearing.

This is happening on 2 servers with identical specs.

Short of just straight up replacing the U.2 NVMe drives, I am at a loss as to what the problem could be. I am wondering if someone else has had a similar experience, or knows of something faulty about the Supermicro backplane or these WD SN620 drives. As a side note, the root pool is on RAID1 SATA SSDs on the SAS/SATA side of the backplane, connected to an HBA, and that has no issues.

lspci output
c1:00.0 Non-Volatile memory controller: Sandisk Corp Skyhawk Series NVME SSD (rev 01) (prog-if 02 [NVM Express])
        Subsystem: Marvell Technology Group Ltd. Skyhawk Series NVME SSD
        Physical Slot: 0-1
        Flags: bus master, fast devsel, latency 0, IRQ 213, NUMA node 1, IOMMU group 79
        Memory at ba500000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [80] Power Management version 3
        Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
        Capabilities: [b0] MSI-X: Enable+ Count=128 Masked-
        Capabilities: [60] Vital Product Data
        Capabilities: [c0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [300] Secondary PCI Express
        Kernel driver in use: nvme
        Kernel modules: nvme

Here's the SMART data for one of the drives:
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.11-7-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SDLC2CLR-016T-3NA1
Serial Number:                      A045C049
Firmware Version:                   N00A
PCI Vendor ID:                      0x15b7
PCI Vendor Subsystem ID:            0x11ab
IEEE OUI Identifier:                0x731100
Controller ID:                      0
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,600,321,314,816 [1.60 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            672c02 0010731100
Local Time is:                      Wed Dec 27 22:23:40 2023 EST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0006):   Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     70 Celsius
Critical Comp. Temp. Threshold:     75 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    12.00W       -        -    0  0  0  0    15000   15000
 1 +    11.00W       -        -    1  1  1  1    15000   15000
 2 +     9.00W       -        -    2  2  2  2    15000   15000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        39 Celsius
Available Spare:                    100%
Available Spare Threshold:          1%
Percentage Used:                    0%
Data Units Read:                    25,550,068 [13.0 TB]
Data Units Written:                 1,169,530 [598 GB]
Host Read Commands:                 782,993,412
Host Write Commands:                27,196,494
Controller Busy Time:               0
Power Cycles:                       168
Power On Hours:                     14,357
Unsafe Shutdowns:                   49
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               39 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
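
One figure that stands out in the SMART data above is 49 unsafe shutdowns against 168 power cycles, roughly 29%. Whether those line up with the drop-outs is not established here, but the ratio is easy to track per drive. Below is a minimal sketch, assuming the `smartctl` text layout shown above; the helper names `parse_smart_counters` and `unsafe_shutdown_ratio` are made up for illustration and are not part of smartmontools.

```python
import re

# Matches "Name:   1,234" or "Name:   1,234 [5.67 TB]"; lines whose value is
# not a plain integer (e.g. "39 Celsius", "0x00", "100%") are skipped.
COUNTER_LINE = re.compile(r"^([^:]+):\s+([\d,]+)\s*(?:\[.*\])?\s*$")


def parse_smart_counters(text):
    """Return the integer-valued 'Name: value' fields from smartctl NVMe output."""
    counters = {}
    for line in text.splitlines():
        m = COUNTER_LINE.match(line)
        if m:
            counters[m.group(1).strip()] = int(m.group(2).replace(",", ""))
    return counters


def unsafe_shutdown_ratio(counters):
    """Fraction of power cycles that the drive recorded as unsafe shutdowns."""
    cycles = counters["Power Cycles"]
    return counters["Unsafe Shutdowns"] / cycles if cycles else 0.0


# Lines copied from the smartctl output above.
SAMPLE = """\
Power Cycles:                       168
Power On Hours:                     14,357
Unsafe Shutdowns:                   49
"""

if __name__ == "__main__":
    c = parse_smart_counters(SAMPLE)
    print(f"{c['Unsafe Shutdowns']} of {c['Power Cycles']} power cycles were unsafe "
          f"({unsafe_shutdown_ratio(c):.0%})")
```

Running the same parse on all four drives, before and after a drop-out, would show whether the unsafe shutdown counter increments each time a drive disappears.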

Semi-full dmesg output:
[Dec27 18:04] Linux version 6.5.11-7-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-7 (2023-12-05T09:44Z) ()
[  +0.000000] Command line: initrd=\EFI\proxmox\6.5.11-7-pve\initrd.img-6.5.11-7-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs
. . .
[  +0.000523] xhci_hcd 0000:82:00.3: hcc params 0x0260ffe5 hci version 0x110 quirks 0x0000000000000410
[  +0.002007] nvme nvme3: 64/0/0 default/read/poll queues
[  +0.008010] nvme nvme2: 64/0/0 default/read/poll queues
[  +0.002702] megaraid_sas 0000:a1:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
[  +0.000012] megaraid_sas 0000:a1:00.0: INIT adapter done
[  +0.000577] nvme nvme1: 64/0/0 default/read/poll queues
[  +0.000082] megaraid_sas 0000:a1:00.0: Snap dump wait time    : 15
[  +0.000008] megaraid_sas 0000:a1:00.0: pci id         : (0x1000)/(0x005d)/(0x15d9)/(0x0a09)
[  +0.000009] megaraid_sas 0000:a1:00.0: unevenspan support     : no
[  +0.000007] megaraid_sas 0000:a1:00.0: firmware crash dump    : no
[  +0.000006] megaraid_sas 0000:a1:00.0: JBOD sequence map      : enabled
[  +0.000447] megaraid_sas 0000:a1:00.0: Max firmware commands: 4063 shared with default hw_queues = 95 poll_queues 0
[  +0.000013] scsi host0: Avago SAS based MegaRAID driver
[  +0.002330] scsi 0:0:0:0: Enclosure         LSI      SAS3x28          0601 PQ: 0 ANSI: 5
[  +0.019347] xhci_hcd 0000:82:00.3: xHCI Host Controller
[  +0.000029] xhci_hcd 0000:82:00.3: new USB bus registered, assigned bus number 8
[  +0.000015] xhci_hcd 0000:82:00.3: Host supports USB 3.1 Enhanced SuperSpeed
[  +0.000079] usb usb7: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 6.05
[  +0.000013] usb usb7: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[  +0.000011] usb usb7: Product: xHCI Host Controller
[  +0.000008] usb usb7: Manufacturer: Linux 6.5.11-7-pve xhci-hcd
[  +0.000009] usb usb7: SerialNumber: 0000:82:00.3
[  +0.000917] hub 7-0:1.0: USB hub found
[  +0.000020] hub 7-0:1.0: 2 ports detected
[  +0.001754] usb usb8: We don't know the algorithms for LPM for this host, disabling LPM.
[  +0.000051] usb usb8: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 6.05
[  +0.000012] usb usb8: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[  +0.000010] usb usb8: Product: xHCI Host Controller
[  +0.000008] usb usb8: Manufacturer: Linux 6.5.11-7-pve xhci-hcd
[  +0.000008] usb usb8: SerialNumber: 0000:82:00.3
[  +0.000733] ahci 0000:c5:00.0: AHCI 0001.0301 32 slots 8 ports 6 Gbps 0xff impl SATA mode
[  +0.000015] ahci 0000:c5:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part ems sxs
[  +0.000927] hub 8-0:1.0: USB hub found
[  +0.000235] hub 8-0:1.0: 2 ports detected
[  +0.006317]  nvme3n1: p1 p9
[  +0.004416] scsi host12: ahci
[  +0.000827] scsi host13: ahci
[  +0.000538] scsi host14: ahci
[  +0.000627] scsi host15: ahci
[  +0.000517] scsi host16: ahci
[  +0.000247]  nvme2n1: p1 p9
[  +0.000354] scsi host17: ahci
[  +0.000564] scsi host18: ahci
[  +0.000068]  nvme1n1: p1 p9
[  +0.000656] scsi host19: ahci
[  +0.000220] ata12: SATA max UDMA/133 abar m2048@0xba100000 port 0xba100100 irq 498
[  +0.000012] ata13: SATA max UDMA/133 abar m2048@0xba100000 port 0xba100180 irq 499
[  +0.000011] ata14: SATA max UDMA/133 abar m2048@0xba100000 port 0xba100200 irq 500
[  +0.000011] ata15: SATA max UDMA/133 abar m2048@0xba100000 port 0xba100280 irq 501
[  +0.000010] ata16: SATA max UDMA/133 abar m2048@0xba100000 port 0xba100300 irq 502
[  +0.000011] ata17: SATA max UDMA/133 abar m2048@0xba100000 port 0xba100380 irq 503
[  +0.000010] ata18: SATA max UDMA/133 abar m2048@0xba100000 port 0xba100400 irq 504
[  +0.000011] ata19: SATA max UDMA/133 abar m2048@0xba100000 port 0xba100480 irq 505
[  +0.000518] ahci 0000:c6:00.0: AHCI 0001.0301 32 slots 1 ports 6 Gbps 0x1 impl SATA mode
[  +0.000015] ahci 0000:c6:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part
[  +0.000684] scsi host20: ahci
[  +0.000249] ata20: SATA max UDMA/133 abar m2048@0xba000000 port 0xba000100 irq 522
[  +0.000321] ahci 0000:a4:00.0: AHCI 0001.0301 32 slots 1 ports 6 Gbps 0x1 impl SATA mode
[  +0.000014] ahci 0000:a4:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part
[  +0.000636] scsi host21: ahci
[  +0.000229] ata21: SATA max UDMA/133 abar m2048@0xb6700000 port 0xb6700100 irq 524
[  +0.000319] ahci 0000:a5:00.0: AHCI 0001.0301 32 slots 1 ports 6 Gbps 0x1 impl SATA mode
[  +0.000014] ahci 0000:a5:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part
[  +0.000690] scsi host22: ahci
[  +0.000276] ata22: SATA max UDMA/133 abar m2048@0xb6600000 port 0xb6600100 irq 526
[  +0.046373] ixgbe 0000:01:00.0: Multiqueue Enabled: Rx Queue count = 63, Tx Queue count = 63 XDP Queue count = 0
[  +0.000315] ixgbe 0000:01:00.0: 32.000 Gb/s available PCIe bandwidth (5.0 GT/s PCIe x8 link)
[  +0.000337] ixgbe 0000:01:00.0: MAC: 2, PHY: 14, SFP+: 3, PBA No: G24940-002
[  +0.000008] ixgbe 0000:01:00.0: 00:1b:21:c0:1b:38
[  +0.003062] ixgbe 0000:01:00.0: Intel(R) 10 Gigabit Network Connection
[  +0.000392] ixgbe 0000:01:00.1: enabling device (0000 -> 0002)
[  +0.061069] usb 1-1: new high-speed USB device number 2 using xhci_hcd
[  +0.035287] scsi 0:2:0:0: Direct-Access     ATA      Hitachi HUS72404 A5F0 PQ: 0 ANSI: 6
[  +0.020485] ata7: SATA link down (SStatus 0 SControl 300)
[  +0.000069] ata3: SATA link down (SStatus 0 SControl 300)
[  +0.000018] ata9: SATA link down (SStatus 0 SControl 300)
[  +0.000033] ata2: SATA link down (SStatus 0 SControl 300)
[  +0.000053] ata4: SATA link down (SStatus 0 SControl 300)
[  +0.001635] ata1: SATA link down (SStatus 0 SControl 300)
[  +0.001376] ata8: SATA link down (SStatus 0 SControl 300)
[  +0.020478] usb 5-1: new high-speed USB device number 2 using xhci_hcd
[  +0.035007] ata10: SATA link down (SStatus 0 SControl 300)
[  +0.013438] ata11: SATA link down (SStatus 0 SControl 300)
[  +0.022032] usb 1-1: New USB device found, idVendor=1d6b, idProduct=0107, bcdDevice= 1.00
[  +0.000015] usb 1-1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[  +0.000008] usb 1-1: Product: USB Virtual Hub
[  +0.000007] usb 1-1: Manufacturer: Aspeed
[  +0.000006] usb 1-1: SerialNumber: 00000000
[  +0.034926] hub 1-1:1.0: USB hub found
[  +0.000295] hub 1-1:1.0: 7 ports detected
[  +0.012949] ata16: SATA link down (SStatus 0 SControl 300)
[  +0.000085] ata12: SATA link down (SStatus 0 SControl 300)
[  +0.000390] ata13: SATA link down (SStatus 0 SControl 300)
[  +0.000070] ata17: SATA link down (SStatus 0 SControl 300)
[  +0.000289] ata20: SATA link down (SStatus 0 SControl 300)
[  +0.000005] ata15: SATA link down (SStatus 0 SControl 300)
[  +0.000134] ata18: SATA link down (SStatus 0 SControl 300)
[  +0.000194] ata14: SATA link down (SStatus 0 SControl 300)
[  +0.000068] ata19: SATA link down (SStatus 0 SControl 300)
[  +0.002838] ata22: SATA link down (SStatus 0 SControl 300)
[  +0.000267] ata21: SATA link down (SStatus 0 SControl 300)
[  +0.016913] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  +0.000053] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  +0.000376] ata6.00: ATA-11: TEAM T253512GB, SBFM61.5, max UDMA/133
[  +0.000026] ata5.00: ATA-11: TEAM T253512GB, SBFM61.5, max UDMA/133
[  +0.000061] ata6.00: 1000215216 sectors, multi 16: LBA48 NCQ (depth 32), AA
[  +0.000030] ata5.00: 1000215216 sectors, multi 16: LBA48 NCQ (depth 32), AA
[  +0.000401] ata6.00: configured for UDMA/133
[  +0.000024] ata5.00: configured for UDMA/133
[  +0.010564] scsi: waiting for bus probes to complete ...
[  +0.006103] usb 5-1: New USB device found, idVendor=1a40, idProduct=0101, bcdDevice= 1.00
[  +0.000019] usb 5-1: New USB device strings: Mfr=0, Product=1, SerialNumber=0
[  +0.000010] usb 5-1: Product: USB2.0 HUB
[  +0.045013] scsi 0:0:0:0: Attached scsi generic sg0 type 13
[  +0.000261] sd 0:2:0:0: Attached scsi generic sg1 type 0
[  +0.000674] scsi 5:0:0:0: Direct-Access     ATA      TEAM T253512GB   61.5 PQ: 0 ANSI: 5
[  +0.001083] sd 5:0:0:0: Attached scsi generic sg2 type 0
[  +0.000170] sd 0:2:0:0: [sda] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
[  +0.000014] sd 0:2:0:0: [sda] 4096-byte physical blocks
[  +0.001082] sd 5:0:0:0: [sdb] 1000215216 512-byte logical blocks: (512 GB/477 GiB)
[  +0.000039] sd 5:0:0:0: [sdb] Write Protect is off
[  +0.000011] sd 5:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[  +0.000024] scsi 6:0:0:0: Direct-Access     ATA      TEAM T253512GB   61.5 PQ: 0 ANSI: 5
[  +0.000050] sd 5:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  +0.000131] sd 5:0:0:0: [sdb] Preferred minimum I/O size 512 bytes
[  +0.000199] sd 6:0:0:0: Attached scsi generic sg3 type 0
[  +0.000499] sd 6:0:0:0: [sdc] 1000215216 512-byte logical blocks: (512 GB/477 GiB)
[  +0.000069] sd 6:0:0:0: [sdc] Write Protect is off
[  +0.000017] sd 6:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[  +0.000095] sd 6:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  +0.000152] sd 6:0:0:0: [sdc] Preferred minimum I/O size 512 bytes
[  +0.000711] sd 0:2:0:0: [sda] Write Protect is off
[  +0.000011] sd 0:2:0:0: [sda] Mode Sense: 9b 00 10 08
[  +0.002834] sd 0:2:0:0: [sda] Write cache: enabled, read cache: enabled, supports DPO and FUA
[  +0.001102]  sdb: sdb1 sdb2 sdb3
[  +0.000774]  sdc: sdc1 sdc2 sdc3
[  +0.000013] sd 5:0:0:0: [sdb] Attached SCSI removable disk
[  +0.000487] sd 6:0:0:0: [sdc] Attached SCSI removable disk
[  +0.001790] hub 5-1:1.0: USB hub found
[  +0.000310] hub 5-1:1.0: 4 ports detected
[  +0.043763] sd 0:2:0:0: [sda] Attached SCSI disk
[  +0.018091] ses 0:0:0:0: Attached Enclosure device
[  +0.062899] tg3 0000:02:00.0 eno1: renamed from eth0
[  +0.051992] tg3 0000:02:00.1 eno2: renamed from eth1
[  +0.007969] usb 1-1.1: new high-speed USB device number 3 using xhci_hcd
[  +0.102595] usb 1-1.1: New USB device found, idVendor=0557, idProduct=9241, bcdDevice= 5.04
[  +0.000014] usb 1-1.1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[  +0.000009] usb 1-1.1: Product: SMCI HID KM
[  +0.000006] usb 1-1.1: Manufacturer: Linux 5.4.62 with aspeed_vhub
[  +0.091258] hid: raw HID events driver (C) Jiri Kosina
[  +0.011135] usbcore: registered new interface driver usbhid
[  +0.000008] usbhid: USB HID core driver
[  +0.002940] usb 5-1.4: new low-speed USB device number 3 using xhci_hcd
[  +0.000745] usbcore: registered new interface driver usbmouse
[  +0.004750] input: Linux 5.4.62 with aspeed_vhub SMCI HID KM as /devices/pci0000:20/0000:20:08.1/0000:22:00.3/usb1/1-1/1-1.1/1-1.1:1.0/0003:0557:9241.0001/input/input1
[  +0.050536] usb 1-1.2: new high-speed USB device number 4 using xhci_hcd
[  +0.008138] hid-generic 0003:0557:9241.0001: input,hidraw0: USB HID v1.00 Keyboard [Linux 5.4.62 with aspeed_vhub SMCI HID KM] on usb-0000:22:00.3-1.1/input0
[  +0.000133] input: Linux 5.4.62 with aspeed_vhub SMCI HID KM as /devices/pci0000:20/0000:20:08.1/0000:22:00.3/usb1/1-1/1-1.1/1-1.1:1.1/0003:0557:9241.0002/input/input2
[  +0.000210] hid-generic 0003:0557:9241.0002: input,hidraw1: USB HID v1.00 Mouse [Linux 5.4.62 with aspeed_vhub SMCI HID KM] on usb-0000:22:00.3-1.1/input1
[  +0.094723] usb 1-1.2: New USB device found, idVendor=0b1f, idProduct=03ee, bcdDevice= 5.04
[  +0.000014] usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[  +0.000009] usb 1-1.2: Product: RNDIS/Ethernet Gadget
[  +0.000007] usb 1-1.2: Manufacturer: Linux 5.4.62 with aspeed_vhub
[  +0.054911] usbcore: registered new interface driver cdc_ether
[  +0.003741] usb 5-1.4: New USB device found, idVendor=046d, idProduct=c31c, bcdDevice=64.00
[  +0.000013] usb 5-1.4: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[  +0.000009] usb 5-1.4: Product: USB Keyboard
[  +0.000006] usb 5-1.4: Manufacturer: Logitech
[  +0.002568] rndis_host 1-1.2:2.0 usb0: register 'rndis_host' at usb-0000:22:00.3-1.2, RNDIS device, be:3a:f2:b6:05:9f
[  +0.000036] usbcore: registered new interface driver rndis_host
[  +0.003644] rndis_host 1-1.2:2.0 enxbe3af2b6059f: renamed from usb0
[  +0.100268] input: Logitech USB Keyboard as /devices/pci0000:a0/0000:a0:08.1/0000:a3:00.3/usb5/5-1/5-1.4/5-1.4:1.0/0003:046D:C31C.0003/input/input3
[  +0.059944] hid-generic 0003:046D:C31C.0003: input,hidraw2: USB HID v1.10 Keyboard [Logitech USB Keyboard] on usb-0000:a3:00.3-1.4/input0
[  +0.010104] input: Logitech USB Keyboard Consumer Control as /devices/pci0000:a0/0000:a0:08.1/0000:a3:00.3/usb5/5-1/5-1.4/5-1.4:1.1/0003:046D:C31C.0004/input/input4
[  +0.052887] ixgbe 0000:01:00.1: Multiqueue Enabled: Rx Queue count = 63, Tx Queue count = 63 XDP Queue count = 0
[  +0.000319] ixgbe 0000:01:00.1: 32.000 Gb/s available PCIe bandwidth (5.0 GT/s PCIe x8 link)
[  +0.000340] ixgbe 0000:01:00.1: MAC: 2, PHY: 1, PBA No: G24940-002
[  +0.000008] ixgbe 0000:01:00.1: 00:1b:21:c0:1b:39
[  +0.002136] ixgbe 0000:01:00.1: Intel(R) 10 Gigabit Network Connection
[  +0.001935] input: Logitech USB Keyboard System Control as /devices/pci0000:a0/0000:a0:08.1/0000:a3:00.3/usb5/5-1/5-1.4/5-1.4:1.1/0003:046D:C31C.0004/input/input5
[  +0.000173] hid-generic 0003:046D:C31C.0004: input,hidraw3: USB HID v1.10 Device [Logitech USB Keyboard] on usb-0000:a3:00.3-1.4/input1
[  +0.004845] usbcore: registered new interface driver usbkbd
[  +0.043056] ixgbe 0000:01:00.1 enp1s0f1: renamed from eth0
[  +0.020196] ixgbe 0000:01:00.0 enp1s0f0: renamed from eth2
[  +0.055102] simple-framebuffer simple-framebuffer.0: framebuffer at 0xf6000000, 0x300000 bytes
[  +0.000017] simple-framebuffer simple-framebuffer.0: format=x8r8g8b8, mode=1024x768x32, linelength=4096
[  +0.000223] Console: switching to colour frame buffer device 128x48
[  +0.008413] simple-framebuffer simple-framebuffer.0: fb0: simplefb registered!
[  +0.247847] raid6: avx2x4   gen() 29622 MB/s
[  +0.068000] raid6: avx2x2   gen() 31889 MB/s
[  +0.067999] raid6: avx2x1   gen() 24213 MB/s
[  +0.000069] raid6: using algorithm avx2x2 gen() 31889 MB/s
[  +0.067932] raid6: .... xor() 18331 MB/s, rmw enabled
[  +0.000082] raid6: using avx2x2 recovery algorithm
[  +0.004664] xor: automatically using best checksumming function   avx
[  +0.206451] Btrfs loaded, zoned=yes, fsverity=yes
[  +0.086089] spl: loading out-of-tree module taints kernel.
[  +0.033539] zfs: module license 'CDDL' taints kernel.
[  +0.000100] Disabling lock debugging due to kernel taint
[  +0.000108] zfs: module license taints kernel.
[  +0.583576] Large kmem_alloc(73728, 0x1000), please file an issue at:
              https://github.com/openzfs/zfs/issues/new
[  +0.000194] CPU: 135 PID: 2371 Comm: modprobe Tainted: P           O       6.5.11-7-pve #1
[  +0.004081] Hardware name: Supermicro Super Server/H12DSi-N6, BIOS 2.7 10/25/2023
[  +0.004067] Call Trace:
[  +0.003975]  <TASK>
[  +0.003879]  dump_stack_lvl+0x48/0x70
[  +0.003846]  dump_stack+0x10/0x20
[  +0.003764]  spl_kmem_zalloc+0x11b/0x130 [spl]
[  +0.003741]  zstd_init+0x38/0x1f0 [zfs]
[  +0.003811]  openzfs_init+0x3f/0xbe0 [zfs]
[  +0.003759]  ? __pfx_openzfs_init+0x10/0x10 [zfs]
[  +0.003733]  do_one_initcall+0x5e/0x340
[  +0.003598]  do_init_module+0x68/0x260
[  +0.003574]  load_module+0x213a/0x22a0
[  +0.003526]  init_module_from_file+0x96/0x100
[  +0.003459]  ? init_module_from_file+0x96/0x100
[  +0.003410]  idempotent_init_module+0x11c/0x2b0
[  +0.003354]  __x64_sys_finit_module+0x64/0xd0
[  +0.003282]  do_syscall_64+0x5b/0x90
[  +0.003240]  ? srso_return_thunk+0x5/0x10
[  +0.003214]  ? exit_to_user_mode_prepare+0x39/0x190
[  +0.003268]  ? srso_return_thunk+0x5/0x10
[  +0.003201]  ? syscall_exit_to_user_mode+0x37/0x60
[  +0.003165]  ? srso_return_thunk+0x5/0x10
[  +0.003098]  ? do_syscall_64+0x67/0x90
[  +0.003029]  ? srso_return_thunk+0x5/0x10
[  +0.002986]  ? do_syscall_64+0x67/0x90
[  +0.002926]  ? exc_page_fault+0x94/0x1b0
[  +0.002867]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  +0.002887] RIP: 0033:0x7f4f02abf559
[  +0.002845] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 77 08 0d 00 f7 d8 64 89 01 48
[  +0.006020] RSP: 002b:00007ffeba0bc138 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  +0.003071] RAX: ffffffffffffffda RBX: 0000557276835e40 RCX: 00007f4f02abf559
[  +0.003040] RDX: 0000000000000000 RSI: 00005572763814a0 RDI: 0000000000000004
[  +0.003079] RBP: 00005572763814a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.003073] R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000060000
[  +0.003051] R13: 0000000000000000 R14: 0000557276835f70 R15: 0000000000000000
[  +0.003037]  </TASK>
[  +0.002980] Large kmem_alloc(73728, 0x1000), please file an issue at:
              https://github.com/openzfs/zfs/issues/new
[  +0.006041] CPU: 135 PID: 2371 Comm: modprobe Tainted: P           O       6.5.11-7-pve #1
[  +0.003111] Hardware name: Supermicro Super Server/H12DSi-N6, BIOS 2.7 10/25/2023
[  +0.003127] Call Trace:
[  +0.003070]  <TASK>
[  +0.003023]  dump_stack_lvl+0x48/0x70
[  +0.003026]  dump_stack+0x10/0x20
[  +0.002962]  spl_kmem_zalloc+0x11b/0x130 [spl]
[  +0.002963]  zstd_init+0x61/0x1f0 [zfs]
[  +0.003049]  openzfs_init+0x3f/0xbe0 [zfs]
[  +0.002994]  ? __pfx_openzfs_init+0x10/0x10 [zfs]
[  +0.002963]  do_one_initcall+0x5e/0x340
[  +0.002845]  do_init_module+0x68/0x260
[  +0.002837]  load_module+0x213a/0x22a0
[  +0.002857]  init_module_from_file+0x96/0x100
[  +0.002851]  ? init_module_from_file+0x96/0x100
[  +0.002859]  idempotent_init_module+0x11c/0x2b0
[  +0.002860]  __x64_sys_finit_module+0x64/0xd0
[  +0.002834]  do_syscall_64+0x5b/0x90
[  +0.002820]  ? srso_return_thunk+0x5/0x10
[  +0.002833]  ? exit_to_user_mode_prepare+0x39/0x190
[  +0.002862]  ? srso_return_thunk+0x5/0x10
[  +0.002861]  ? syscall_exit_to_user_mode+0x37/0x60
[  +0.002876]  ? srso_return_thunk+0x5/0x10
[  +0.002868]  ? do_syscall_64+0x67/0x90
[  +0.002870]  ? srso_return_thunk+0x5/0x10
[  +0.002863]  ? do_syscall_64+0x67/0x90
[  +0.002856]  ? exc_page_fault+0x94/0x1b0
[  +0.002849]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  +0.002872] RIP: 0033:0x7f4f02abf559
[  +0.002860] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 77 08 0d 00 f7 d8 64 89 01 48
[  +0.006139] RSP: 002b:00007ffeba0bc138 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  +0.003155] RAX: ffffffffffffffda RBX: 0000557276835e40 RCX: 00007f4f02abf559
[  +0.003197] RDX: 0000000000000000 RSI: 00005572763814a0 RDI: 0000000000000004
[  +0.003213] RBP: 00005572763814a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.003199] R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000060000
[  +0.003175] R13: 0000000000000000 R14: 0000557276835f70 R15: 0000000000000000
[  +0.003082]  </TASK>
[  +1.275291] ZFS: Loaded module v2.2.2-pve1, ZFS pool version 5000, ZFS filesystem version 5
[  +0.873451] systemd[1]: Inserted module 'autofs4'
[  +0.335159] systemd[1]: systemd 252.19-1~deb12u1 running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -BPF_FRAMEWORK -XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
[  +0.009946] systemd[1]: Detected architecture x86-64.
[  +0.015491] systemd[1]: Hostname set to <zeus>.
[  +0.293414] systemd[1]: Queued start job for default target graphical.target.
[  +0.058468] systemd[1]: Created slice system-getty.slice - Slice /system/getty.
[  +0.009571] systemd[1]: Created slice system-modprobe.slice - Slice /system/modprobe.
[  +0.009029] systemd[1]: Created slice system-postfix.slice - Slice /system/postfix.
[  +0.009278] systemd[1]: Created slice system-systemd\x2djournald.slice - Slice /system/systemd-journald.
[  +0.009643] systemd[1]: Created slice system-systemd\x2djournald\x2dvarlink.slice - Slice /system/systemd-journald-varlink.
[  +0.010703] systemd[1]: Created slice user.slice - User and Session Slice.
[  +0.009122] systemd[1]: Started systemd-ask-password-console.path - Dispatch Password Requests to Console Directory Watch.
[  +0.009566] systemd[1]: Started systemd-ask-password-wall.path - Forward Password Requests to Wall Directory Watch.
[  +0.010261] systemd[1]: Set up automount proc-sys-fs-binfmt_misc.automount - Arbitrary Executable File Formats File System Automount Point.
[  +0.015758] systemd[1]: Reached target ceph-fuse.target - ceph target allowing to start/stop all ceph-fuse@.service instances at once.
[  +0.016495] systemd[1]: Reached target ceph.target - ceph target allowing to start/stop all ceph*@.service instances at once.
[  +0.011301] systemd[1]: Reached target cryptsetup.target - Local Encrypted Volumes.
[  +0.011513] systemd[1]: Reached target integritysetup.target - Local Integrity Protected Volumes.
[  +0.011906] systemd[1]: Reached target paths.target - Path Units.
[  +0.011940] systemd[1]: Reached target slices.target - Slice Units.
[  +0.011810] systemd[1]: Reached target swap.target - Swaps.
[  +0.011795] systemd[1]: Reached target time-set.target - System Time Set.
[  +0.011990] systemd[1]: Reached target veritysetup.target - Local Verity Protected Volumes.
[  +0.012584] systemd[1]: Listening on dm-event.socket - Device-mapper event daemon FIFOs.
[  +0.014760] systemd[1]: Listening on lvm2-lvmpolld.socket - LVM2 poll daemon socket.
[  +0.040483] systemd[1]: Listening on rpcbind.socket - RPCbind Server Activation Socket.
[  +0.012846] systemd[1]: Listening on syslog.socket - Syslog Socket.
[  +0.012391] systemd[1]: Listening on systemd-initctl.socket - initctl Compatibility Named Pipe.
[  +0.013503] systemd[1]: Listening on systemd-journald-audit.socket - Journal Audit Socket.
[  +0.012783] systemd[1]: Listening on systemd-journald-dev-log.socket - Journal Socket (/dev/log).
[  +0.013045] systemd[1]: Listening on systemd-journald.socket - Journal Socket.
[  +0.012871] systemd[1]: Listening on systemd-udevd-control.socket - udev Control Socket.
[  +0.012646] systemd[1]: Listening on systemd-udevd-kernel.socket - udev Kernel Socket.
[  +0.051827] systemd[1]: Mounting dev-hugepages.mount - Huge Pages File System...
[  +0.014567] systemd[1]: Mounting dev-mqueue.mount - POSIX Message Queue File System...
[  +0.014304] systemd[1]: Mounting sys-kernel-debug.mount - Kernel Debug File System...
[  +0.014856] systemd[1]: Mounting sys-kernel-tracing.mount - Kernel Trace File System...
[  +0.011824] systemd[1]: auth-rpcgss-module.service - Kernel Module supporting RPCSEC_GSS was skipped because of an unmet condition check (ConditionPathExists=/etc/krb5.keytab).
[  +0.016475] systemd[1]: Starting keyboard-setup.service - Set the console keyboard layout...
[  +0.014023] systemd[1]: Starting kmod-static-nodes.service - Create List of Static Device Nodes...
[  +0.013663] systemd[1]: Starting lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
[  +0.019559] systemd[1]: Starting modprobe@configfs.service - Load Kernel Module configfs...
[  +0.013368] systemd[1]: Starting modprobe@dm_mod.service - Load Kernel Module dm_mod...
[  +0.013314] systemd[1]: Starting modprobe@drm.service - Load Kernel Module drm...
[  +0.013573] systemd[1]: Starting modprobe@efi_pstore.service - Load Kernel Module efi_pstore...
[  +0.015204] systemd[1]: Starting modprobe@fuse.service - Load Kernel Module fuse...
[  +0.014182] systemd[1]: Starting modprobe@loop.service - Load Kernel Module loop...
[  +0.010236] pstore: backend 'erst' already in use: ignoring 'efi_pstore'
[  +0.001036] ACPI: bus type drm_connector registered
[  +0.017673] systemd[1]: Starting systemd-journald.service - Journal Service...
[  +0.015079] systemd[1]: Starting systemd-modules-load.service - Load Kernel Modules...
[  +0.016273] systemd[1]: Starting systemd-remount-fs.service - Remount Root and Kernel File Systems...
[  +0.015244] systemd[1]: Starting systemd-udev-trigger.service - Coldplug All udev Devices...
[  +0.014135] systemd[1]: Mounted dev-hugepages.mount - Huge Pages File System.
[  +0.012043] systemd[1]: Mounted dev-mqueue.mount - POSIX Message Queue File System.
[  +0.011854] systemd[1]: Mounted sys-kernel-debug.mount - Kernel Debug File System.
[  +0.011840] systemd[1]: Mounted sys-kernel-tracing.mount - Kernel Trace File System.
[  +0.012289] systemd[1]: Finished keyboard-setup.service - Set the console keyboard layout.
[  +0.012394] systemd[1]: Finished kmod-static-nodes.service - Create List of Static Device Nodes.
[  +0.012485] systemd[1]: Finished lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
[  +0.019040] systemd[1]: modprobe@configfs.service: Deactivated successfully.
[  +0.006355] systemd[1]: Finished modprobe@configfs.service - Load Kernel Module configfs.
[  +0.012432] systemd[1]: modprobe@dm_mod.service: Deactivated successfully.
[  +0.006223] systemd[1]: Finished modprobe@dm_mod.service - Load Kernel Module dm_mod.
[  +0.012171] systemd[1]: modprobe@drm.service: Deactivated successfully.
[  +0.006141] systemd[1]: Finished modprobe@drm.service - Load Kernel Module drm.
[  +0.011872] systemd[1]: modprobe@efi_pstore.service: Deactivated successfully.
[  +0.005966] systemd[1]: Finished modprobe@efi_pstore.service - Load Kernel Module efi_pstore.
[  +0.011701] systemd[1]: modprobe@fuse.service: Deactivated successfully.
[  +0.005914] systemd[1]: Finished modprobe@fuse.service - Load Kernel Module fuse.
[  +0.011640] systemd[1]: modprobe@loop.service: Deactivated successfully.
[  +0.005871] systemd[1]: Finished modprobe@loop.service - Load Kernel Module loop.
[  +0.011255] systemd[1]: Started systemd-journald.service - Journal Service.
[  +0.047291] systemd-journald[3639]: Received client request to flush runtime journal.
[  +0.078967] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[  +0.133621] nvidia-nvlink: Nvlink Core is being initialized, major device number 236

[  +0.009204] nvidia 0000:41:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[  +0.061218] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.147.05  Wed Oct 25 20:27:35 UTC 2023
[  +0.363750] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[  +1.165505] nvidia-uvm: Loaded the UVM driver, major device number 234.
[  +0.154597] ptdma 0000:61:00.2: enabling device (0000 -> 0002)
[  +0.000639] IPMI message handler: version 39.2
[  +0.002405] ptdma 0000:62:00.2: enabling device (0000 -> 0002)
[  +0.004765] ccp 0000:22:00.1: enabling device (0000 -> 0002)
[  +0.000521] ptdma 0000:42:00.2: enabling device (0000 -> 0002)
[  +0.000982] ccp 0000:22:00.1: no command queues available
[  +0.000250] ccp 0000:22:00.1: sev enabled
[  +0.000007] ccp 0000:22:00.1: psp enabled
[  +0.000186] ccp 0000:a3:00.1: enabling device (0000 -> 0002)
[  +0.000171] ptdma 0000:43:00.2: enabling device (0000 -> 0002)
[  +0.000357] ccp 0000:a3:00.1: no command queues available
[  +0.000043] ccp 0000:a3:00.1: psp enabled
[  +0.000268] ptdma 0000:21:00.2: enabling device (0000 -> 0002)
[  +0.001837] ptdma 0000:22:00.2: enabling device (0000 -> 0002)
[  +0.000760] ptdma 0000:05:00.2: enabling device (0000 -> 0002)
[  +0.000573] ptdma 0000:06:00.2: enabling device (0000 -> 0002)
[  +0.000679] ptdma 0000:e3:00.2: enabling device (0000 -> 0002)
[  +0.000746] ptdma 0000:e4:00.2: enabling device (0000 -> 0002)
[  +0.000746] ptdma 0000:c3:00.2: enabling device (0000 -> 0002)
[  +0.000767] ptdma 0000:c4:00.2: enabling device (0000 -> 0002)
[  +0.000746] ptdma 0000:a2:00.2: enabling device (0000 -> 0002)
[  +0.000681] ptdma 0000:a3:00.2: enabling device (0000 -> 0002)
[  +0.000696] ptdma 0000:81:00.2: enabling device (0000 -> 0002)
[  +0.000682] ptdma 0000:82:00.2: enabling device (0000 -> 0002)
[  +0.004295] ipmi device interface
[  +0.010615] ipmi_si: IPMI System Interface driver
[  +0.000038] ipmi_si dmi-ipmi-si.0: ipmi_platform: probing via SMBIOS
[  +0.000004] ipmi_platform: ipmi_si: SMBIOS: io 0xca2 regsize 1 spacing 1 irq 0
[  +0.000003] ipmi_si: Adding SMBIOS-specified kcs state machine
[  +0.007763] ipmi_si IPI0001:00: ipmi_platform: probing via ACPI
[  +0.000296] ipmi_si IPI0001:00: ipmi_platform: [io  0x0ca4] regsize 1 spacing 1 irq 0
[  +0.016156] ipmi_si: Adding ACPI-specified kcs state machine
[  +0.000192] ipmi_si: Trying SMBIOS-specified kcs state machine at i/o address 0xca2, slave address 0x20, irq 0
[  +0.022209] ccp 0000:22:00.1: SEV API:0.24 build:18
[  +0.142870] Console: switching to colour dummy device 80x25
[  +0.012679] snd_hda_intel 0000:41:00.1: Disabling MSI
[  +0.000048] snd_hda_intel 0000:41:00.1: Handle vga_switcheroo audio client
[  +0.061841] ast 0000:04:00.0: vgaarb: deactivate vga console
[  +0.000456] ast 0000:04:00.0: [drm] P2A bridge disabled, using default configuration
[  +0.000013] ast 0000:04:00.0: [drm] AST 2600 detected
[  +0.000014] ast 0000:04:00.0: [drm] Using analog VGA
[  +0.000008] ast 0000:04:00.0: [drm] dram MCLK=396 Mhz type=1 bus_width=16
[  +0.000681] ipmi_si dmi-ipmi-si.0: The BMC does not support clearing the recv irq bit, compensating, but the BMC needs to be fixed.
[  +0.001104] [drm] Initialized ast 0.1.0 20120228 for 0000:04:00.0 on minor 0
[  +0.088996] ipmi_si dmi-ipmi-si.0: IPMI message handler: Found new BMC (man_id: 0x002a7c, prod_id: 0x1c22, dev_id: 0x20)
[  +0.003235] fbcon: astdrmfb (fb0) is primary device
[  +0.013641] Console: switching to colour frame buffer device 128x48
[  +0.022194] ast 0000:04:00.0: [drm] fb0: astdrmfb frame buffer device
[  +0.001927] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:40/0000:40:01.1/0000:41:00.1/sound/card0/input6
[  +0.000285] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:40/0000:40:01.1/0000:41:00.1/sound/card0/input7
[  +0.000205] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:40/0000:40:01.1/0000:41:00.1/sound/card0/input8
[  +0.000214] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:40/0000:40:01.1/0000:41:00.1/sound/card0/input9
[  +0.144053] input: PC Speaker as /devices/platform/pcspkr/input/input10
[  +0.033757] ipmi_si dmi-ipmi-si.0: IPMI kcs interface initialized
[  +0.004544] ipmi_ssif: IPMI SSIF Interface driver
[  +0.059525] RAPL PMU: API unit is 2^-32 Joules, 1 fixed counters, 163840 ms ovfl timer
[  +0.000062] RAPL PMU: hw unit of domain package 2^-16 Joules
[  +0.010169] cryptd: max_cpu_qlen set to 1000
[  +0.219226] AVX2 version of gcm_enc/dec engaged.
[  +0.001645] AES CTR mode by8 optimization enabled
[  +0.438765] kvm_amd: TSC scaling supported
[  +0.000006] kvm_amd: Nested Virtualization enabled
[  +0.000002] kvm_amd: Nested Paging enabled
[  +0.000004] kvm_amd: SEV enabled (ASIDs 1 - 509)
[  +0.000001] kvm_amd: SEV-ES disabled (ASIDs 0 - 0)
[  +0.000375] kvm_amd: Virtual VMLOAD VMSAVE supported
[  +0.000001] kvm_amd: Virtual GIF supported
[  +0.000001] kvm_amd: LBR virtualization supported
[  +0.216907] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.147.05  Wed Oct 25 20:21:31 UTC 2023
[  +0.071909] MCE: In-kernel MCE decoding enabled.
[  +0.006303] EDAC MC0: Giving out device to module amd64_edac controller F17h_M30h: DEV 0000:00:18.3 (INTERRUPT)
[  +0.000006] EDAC amd64: F17h_M30h detected (node 0).
[  +0.000005] EDAC MC: UMC0 chip selects:
[  +0.000001] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[  +0.000002] EDAC amd64: MC: 2:     0MB 3:     0MB
[  +0.000005] EDAC MC: UMC1 chip selects:
[  +0.000001] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[  +0.000002] EDAC amd64: MC: 2:     0MB 3:     0MB
[  +0.000004] EDAC MC: UMC2 chip selects:
[  +0.000001] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[  +0.000002] EDAC amd64: MC: 2:     0MB 3:     0MB
[  +0.000004] EDAC MC: UMC3 chip selects:
[  +0.000001] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[  +0.000001] EDAC amd64: MC: 2:     0MB 3:     0MB
[  +0.000004] EDAC MC: UMC4 chip selects:
[  +0.000001] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[  +0.000001] EDAC amd64: MC: 2:     0MB 3:     0MB
[  +0.000004] EDAC MC: UMC5 chip selects:
[  +0.000001] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[  +0.000002] EDAC amd64: MC: 2:     0MB 3:     0MB
[  +0.000004] EDAC MC: UMC6 chip selects:
[  +0.000001] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[  +0.000001] EDAC amd64: MC: 2:     0MB 3:     0MB
[  +0.000004] EDAC MC: UMC7 chip selects:
[  +0.000001] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[  +0.000001] EDAC amd64: MC: 2:     0MB 3:     0MB
[  +0.000473] EDAC MC1: Giving out device to module amd64_edac controller F17h_M30h: DEV 0000:00:19.3 (INTERRUPT)
[  +0.000003] EDAC amd64: F17h_M30h detected (node 1).
[  +0.000004] EDAC MC: UMC0 chip selects:
[  +0.000001] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[  +0.000001] EDAC amd64: MC: 2:     0MB 3:     0MB
[  +0.000004] EDAC MC: UMC1 chip selects:
[  +0.000001] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[  +0.000002] EDAC amd64: MC: 2:     0MB 3:     0MB
[  +0.000004] EDAC MC: UMC2 chip selects:
[  +0.000001] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[  +0.000001] EDAC amd64: MC: 2:     0MB 3:     0MB
[  +0.000005] EDAC MC: UMC3 chip selects:
[  +0.000001] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[  +0.000001] EDAC amd64: MC: 2:     0MB 3:     0MB
[  +0.000005] EDAC MC: UMC4 chip selects:
[  +0.000000] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[  +0.000002] EDAC amd64: MC: 2:     0MB 3:     0MB
[  +0.000004] EDAC MC: UMC5 chip selects:
[  +0.000001] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[  +0.000001] EDAC amd64: MC: 2:     0MB 3:     0MB
[  +0.000005] EDAC MC: UMC6 chip selects:
[  +0.000001] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[  +0.000001] EDAC amd64: MC: 2:     0MB 3:     0MB
[  +0.000005] EDAC MC: UMC7 chip selects:
[  +0.000001] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[  +0.000001] EDAC amd64: MC: 2:     0MB 3:     0MB
[  +0.023285] intel_rapl_common: Found RAPL domain package
[  +0.021361] intel_rapl_common: Found RAPL domain core
[  +0.066297] intel_rapl_common: Found RAPL domain package
[  +0.001217] intel_rapl_common: Found RAPL domain core
[  +0.282343] [drm] [nvidia-drm] [GPU ID 0x00004100] Loading driver
[  +0.001349] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:41:00.0 on minor 1
[  +1.180250] audit: type=1400 audit(1703718273.131:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/lxc-start" pid=5689 comm="apparmor_parser"
[  +0.002318] audit: type=1400 audit(1703718273.135:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="swtpm" pid=5691 comm="apparmor_parser"
[  +0.001085] audit: type=1400 audit(1703718273.135:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=5687 comm="apparmor_parser"
[  +0.000943] audit: type=1400 audit(1703718273.135:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=5687 comm="apparmor_parser"
[  +0.000938] audit: type=1400 audit(1703718273.135:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=5685 comm="apparmor_parser"
[  +0.000958] audit: type=1400 audit(1703718273.135:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=5690 comm="apparmor_parser"
[  +0.000967] audit: type=1400 audit(1703718273.135:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=5690 comm="apparmor_parser"
[  +0.001000] audit: type=1400 audit(1703718273.135:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=5690 comm="apparmor_parser"
[  +0.001089] audit: type=1400 audit(1703718273.139:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/chronyd" pid=5693 comm="apparmor_parser"
[  +0.001034] audit: type=1400 audit(1703718273.139:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default" pid=5686 comm="apparmor_parser"
[  +0.235720] softdog: initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
[  +0.000005] softdog:              soft_reboot_cmd=<not set> soft_active_on_boot=0
[  +0.004382] RPC: Registered named UNIX socket transport module.
[  +0.010136] RPC: Registered udp transport module.
[  +0.000002] RPC: Registered tcp transport module.
[  +0.000001] RPC: Registered tcp-with-tls transport module.
[  +0.000002] RPC: Registered tcp NFSv4.1 backchannel transport module.
[  +0.833850] vmbr0: port 1(enp1s0f0) entered blocking state
[  +0.000009] vmbr0: port 1(enp1s0f0) entered disabled state
[  +0.000145] ixgbe 0000:01:00.0 enp1s0f0: entered allmulticast mode
[  +0.000176] ixgbe 0000:01:00.0 enp1s0f0: entered promiscuous mode
[  +0.312528] ixgbe 0000:01:00.0: registered PHC device on enp1s0f0
[  +0.169426] ixgbe 0000:01:00.0 enp1s0f0: detected SFP+: 3
[  +0.255742] ixgbe 0000:01:00.0 enp1s0f0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[  +0.010413] vmbr0: port 1(enp1s0f0) entered blocking state
[  +0.000642] vmbr0: port 1(enp1s0f0) entered forwarding state
[  +0.405759] Loading iSCSI transport class v2.0-870.
[  +0.230313] lxc-autostart[6094]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[  +2.006551] bpfilter: Loaded bpfilter_umh pid 7761
[  +0.000329] Started bpfilter
[  +2.682462] wireguard: WireGuard 1.0.0 loaded. See www.wireguard.com for information.
[  +0.002990] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
[  +2.716182] vmbr0: port 2(fwpr265p0) entered blocking state
[  +0.000085] vmbr0: port 2(fwpr265p0) entered disabled state
[  +0.000410] fwpr265p0: entered allmulticast mode
[  +0.032836] fwpr265p0: entered promiscuous mode
[  +0.000437] vmbr0: port 2(fwpr265p0) entered blocking state
[  +0.000043] vmbr0: port 2(fwpr265p0) entered forwarding state
[  +0.015343] fwbr265i0: port 1(fwln265i0) entered blocking state
[  +0.000076] fwbr265i0: port 1(fwln265i0) entered disabled state
[  +0.000140] fwln265i0: entered allmulticast mode
[  +0.001114] fwln265i0: entered promiscuous mode
[  +0.000687] fwbr265i0: port 1(fwln265i0) entered blocking state
[  +0.000017] fwbr265i0: port 1(fwln265i0) entered forwarding state
[  +0.014326] fwbr265i0: port 2(veth265i0) entered blocking state
[  +0.000075] fwbr265i0: port 2(veth265i0) entered disabled state
[  +0.000134] veth265i0: entered allmulticast mode
[  +0.001113] veth265i0: entered promiscuous mode
[  +0.062979] eth0: renamed from vethQzSqtn
[  +1.132119] cfg80211: Loading compiled-in X.509 certificates for regulatory database
[  +0.001119] Loaded X.509 cert 'benh@debian.org: 577e021cb980e0e820821ba7b54b4961b8b4fadf'
[  +0.000819] Loaded X.509 cert 'romain.perier@gmail.com: 3abbc6ec146e09d1b6016ab9d6cf71dd233f0328'
[  +0.000820] Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7'
[  +0.000645] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
[  +0.000023] cfg80211: failed to load regulatory.db
[  +0.062399] fwbr265i0: port 2(veth265i0) entered blocking state
[  +0.000065] fwbr265i0: port 2(veth265i0) entered forwarding state
[  +1.979689] FS-Cache: Loaded
[  +0.370851] NFS: Registering the id_resolver key type
[  +0.000701] Key type id_resolver registered
[  +0.000488] Key type id_legacy registered
[Dec27 18:35] nvidia-uvm: Unloaded the UVM driver.
[  +0.049479] [drm] [nvidia-drm] [GPU ID 0x00004100] Unloading driver
[Dec27 18:36] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[  +0.060547] nvidia-uvm: Loaded the UVM driver, major device number 234.
[ +38.427577] fwbr265i0: port 2(veth265i0) entered disabled state
[  +0.002108] veth265i0 (unregistering): left allmulticast mode
[  +0.000020] veth265i0 (unregistering): left promiscuous mode
[  +0.000023] fwbr265i0: port 2(veth265i0) entered disabled state
[  +1.110478] fwbr265i0: port 1(fwln265i0) entered disabled state
[  +0.000241] vmbr0: port 2(fwpr265p0) entered disabled state
[  +0.002991] fwln265i0 (unregistering): left allmulticast mode
[  +0.002101] fwln265i0 (unregistering): left promiscuous mode
[  +0.000767] fwbr265i0: port 1(fwln265i0) entered disabled state
[  +0.033303] fwpr265p0 (unregistering): left allmulticast mode
[  +0.001995] fwpr265p0 (unregistering): left promiscuous mode
[  +0.000771] vmbr0: port 2(fwpr265p0) entered disabled state
[Dec27 18:41] nvidia-uvm: Unloaded the UVM driver.
[  +0.052024] nvidia-modeset: Unloading
[  +0.072959] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
[Dec27 18:42] VFIO - User Level meta-driver version: 0.3
[  +0.437425] nvidia-nvlink: Nvlink Core is being initialized, major device number 235

[  +0.004446] nvidia 0000:41:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[  +0.049191] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.146.02  Sun Dec  3 14:06:14 UTC 2023
[  +0.140596] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[  +0.058913] nvidia-uvm: Loaded the UVM driver, major device number 510.
[  +0.028320] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.146.02  Sun Dec  3 14:02:44 UTC 2023
[  +0.016223] [drm] [nvidia-drm] [GPU ID 0x00004100] Loading driver
[  +0.000006] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:41:00.0 on minor 1
[  +0.009702] [drm] [nvidia-drm] [GPU ID 0x00004100] Unloading driver
[  +0.046701] nvidia-modeset: Unloading
[  +0.037803] nvidia-uvm: Unloaded the UVM driver.
[  +0.051644] nvidia-nvlink: Unregistered Nvlink Core, major device number 235
[Dec27 18:45] nvidia-nvlink: Nvlink Core is being initialized, major device number 236

[  +0.003976] nvidia 0000:41:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[  +0.046584] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.146.02  Sun Dec  3 14:06:14 UTC 2023
[  +1.503748] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.146.02  Sun Dec  3 14:02:44 UTC 2023
[  +0.014940] [drm] [nvidia-drm] [GPU ID 0x00004100] Loading driver
[  +0.000005] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:41:00.0 on minor 1
[  +0.078008] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[  +0.069300] nvidia-uvm: Loaded the UVM driver, major device number 234.
[Dec27 18:46] vmbr0: port 2(fwpr265p0) entered blocking state
[  +0.001051] vmbr0: port 2(fwpr265p0) entered disabled state
[  +0.001175] fwpr265p0: entered allmulticast mode
[  +0.016175] fwpr265p0: entered promiscuous mode
[  +0.004663] vmbr0: port 2(fwpr265p0) entered blocking state
[  +0.001843] vmbr0: port 2(fwpr265p0) entered forwarding state
[  +0.020638] fwbr265i0: port 1(fwln265i0) entered blocking state
[  +0.001845] fwbr265i0: port 1(fwln265i0) entered disabled state
[  +0.000980] fwln265i0: entered allmulticast mode
[  +0.007925] fwln265i0: entered promiscuous mode
[  +0.002549] fwbr265i0: port 1(fwln265i0) entered blocking state
[  +0.004211] fwbr265i0: port 1(fwln265i0) entered forwarding state
[  +0.020964] fwbr265i0: port 2(veth265i0) entered blocking state
[  +0.001870] fwbr265i0: port 2(veth265i0) entered disabled state
[  +0.003302] veth265i0: entered allmulticast mode
[  +0.007166] veth265i0: entered promiscuous mode
[  +0.092679] eth0: renamed from vethNKtkZo
[  +0.752036] fwbr265i0: port 2(veth265i0) entered blocking state
[  +0.001646] fwbr265i0: port 2(veth265i0) entered forwarding state
[Dec27 18:48] fwbr265i0: port 2(veth265i0) entered disabled state
[  +0.006965] veth265i0 (unregistering): left allmulticast mode
[  +0.003308] veth265i0 (unregistering): left promiscuous mode
[  +0.000342] fwbr265i0: port 2(veth265i0) entered disabled state
[  +1.152533] fwbr265i0: port 1(fwln265i0) entered disabled state
[  +0.000765] vmbr0: port 2(fwpr265p0) entered disabled state
[  +0.005065] fwln265i0 (unregistering): left allmulticast mode
[  +0.002419] fwln265i0 (unregistering): left promiscuous mode
[  +0.000988] fwbr265i0: port 1(fwln265i0) entered disabled state
[  +0.032572] fwpr265p0 (unregistering): left allmulticast mode
[  +0.002073] fwpr265p0 (unregistering): left promiscuous mode
[  +0.000957] vmbr0: port 2(fwpr265p0) entered disabled state
[  +2.144634] vmbr0: port 2(fwpr265p0) entered blocking state
[  +0.001538] vmbr0: port 2(fwpr265p0) entered disabled state
[  +0.001350] fwpr265p0: entered allmulticast mode
[  +0.001907] fwpr265p0: entered promiscuous mode
[  +0.001620] vmbr0: port 2(fwpr265p0) entered blocking state
[  +0.001079] vmbr0: port 2(fwpr265p0) entered forwarding state
[  +0.051167] fwbr265i0: port 1(fwln265i0) entered blocking state
[  +0.002054] fwbr265i0: port 1(fwln265i0) entered disabled state
[  +0.001151] fwln265i0: entered allmulticast mode
[  +0.001349] fwln265i0: entered promiscuous mode
[  +0.001159] fwbr265i0: port 1(fwln265i0) entered blocking state
[  +0.008511] fwbr265i0: port 1(fwln265i0) entered forwarding state
[  +0.051236] fwbr265i0: port 2(veth265i0) entered blocking state
[  +0.002140] fwbr265i0: port 2(veth265i0) entered disabled state
[  +0.001242] veth265i0: entered allmulticast mode
[  +0.008135] veth265i0: entered promiscuous mode
[  +0.189773] eth0: renamed from vethdHtqQL
[  +1.851378] fwbr265i0: port 2(veth265i0) entered blocking state
[  +0.000473] fwbr265i0: port 2(veth265i0) entered forwarding state
[Dec27 18:50] sched: RT throttling activated
[  +0.000001] sched: RT throttling activated
[Dec27 18:57] perf: interrupt took too long (2778 > 2500), lowering kernel.perf_event_max_sample_rate to 72000
[Dec27 19:10] perf: interrupt took too long (3520 > 3472), lowering kernel.perf_event_max_sample_rate to 56750
[Dec27 19:20] perf: interrupt took too long (4621 > 4400), lowering kernel.perf_event_max_sample_rate to 43250
[Dec27 21:10] pcieport 0000:e0:03.1: pciehp: Slot(0): Link Down
[  +0.118020] zio pool=DEFIANT vdev=/dev/disk/by-id/nvme-SDLC2CLR-016T-3NA1_A045C049-part1 error=5 type=1 offset=270336 size=8192 flags=721601
[  +3.384270] pcieport 0000:e0:03.1: pciehp: Slot(0): Card present
[  +0.001300] pcieport 0000:e0:03.1: pciehp: Slot(0): Link Up
[  +0.135021] pci 0000:e1:00.0: [15b7:2001] type 00 class 0x010802
[  +0.000707] pci 0000:e1:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[  +0.000616] pci 0000:e1:00.0: PME# supported from D0 D3hot
[  +0.000550] pci 0000:e1:00.0: Adding to iommu group 68
[  +0.000446] pcieport 0000:e0:03.1: ASPM: current common clock configuration is inconsistent, reconfiguring
[  +0.009640] pcieport 0000:e0:03.1: bridge window [io  0x1000-0x0fff] to [bus e1] add_size 1000
[  +0.000359] pcieport 0000:e0:03.1: BAR 13: no space for [io  size 0x1000]
[  +0.000344] pcieport 0000:e0:03.1: BAR 13: failed to assign [io  size 0x1000]
[  +0.000336] pcieport 0000:e0:03.1: BAR 13: no space for [io  size 0x1000]
[  +0.000530] pcieport 0000:e0:03.1: BAR 13: failed to assign [io  size 0x1000]
[  +0.000332] pci 0000:e1:00.0: BAR 0: assigned [mem 0xc0300000-0xc0303fff 64bit]
[  +0.000355] pcieport 0000:e0:03.1: PCI bridge to [bus e1]
[  +0.000346] pcieport 0000:e0:03.1:   bridge window [mem 0xc0300000-0xc03fffff]
[  +0.000348] pcieport 0000:e0:03.1:   bridge window [mem 0x380e0200000-0x380e03fffff 64bit pref]
[  +0.000871] nvme nvme0: pci function 0000:e1:00.0
[  +0.002595] nvme 0000:e1:00.0: enabling device (0000 -> 0002)
[  +7.314091] nvme nvme0: 64/0/0 default/read/poll queues
[  +0.019132]  nvme0n1: p1 p9

Is it reporting checksum errors on just the one U.2 drive, or on every drive connected through the backplane? If the latter, it sounds like your backplane may be introducing the errors. Is the backplane or your HBA card overheating? I’ve read that these cards run hot, and if not properly cooled they can start introducing errors.
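
To answer the one-drive-or-all question quickly, something like the following could pull the non-zero CKSUM rows out of `zpool status` text. This is only a rough sketch: the sample output and the `nvme-EXAMPLE-*` device names are made up for illustration, and it assumes the usual `NAME STATE READ WRITE CKSUM` table layout.

```python
def checksum_errors(zpool_status_text):
    """Return {row_name: cksum} for every table row whose CKSUM column is not "0".

    Counts are kept as strings because zpool may abbreviate large values.
    """
    errors = {}
    in_table = False
    for line in zpool_status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True
            continue
        if not in_table:
            continue
        if not fields:
            break  # a blank line ends the config table
        if len(fields) >= 5 and fields[4] != "0":
            errors[fields[0]] = fields[4]
    return errors


# Hypothetical sample, not taken from the affected servers.
SAMPLE_STATUS = """\
  pool: DEFIANT
 state: ONLINE
config:

        NAME                  STATE     READ WRITE CKSUM
        DEFIANT               ONLINE       0     0     0
          raidz1-0            ONLINE       0     0     0
            nvme-EXAMPLE-A    ONLINE       0     0    12
            nvme-EXAMPLE-B    ONLINE       0     0     0

errors: No known data errors
"""

if __name__ == "__main__":
    print(checksum_errors(SAMPLE_STATUS))
```

On the real machine you would feed it the output of `zpool status DEFIANT` instead of the sample string. Errors on a single device point more toward that drive or its slot, while errors spread across all four point toward something shared.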

There are also some stacktraces from ZFS in your dmesg, but I’m uncertain if these are related and indications of errors, or if they are just warnings:

[  +0.583576] Large kmem_alloc(73728, 0x1000), please file an issue at:
              https://github.com/openzfs/zfs/issues/new
[  +0.000194] CPU: 135 PID: 2371 Comm: modprobe Tainted: P           O       6.5.11-7-pve #1
[  +0.004081] Hardware name: Supermicro Super Server/H12DSi-N6, BIOS 2.7 10/25/2023
[  +0.004067] Call Trace:
[  +0.003975]  <TASK>
[  +0.003879]  dump_stack_lvl+0x48/0x70
[  +0.003846]  dump_stack+0x10/0x20
[  +0.003764]  spl_kmem_zalloc+0x11b/0x130 [spl]
[  +0.003741]  zstd_init+0x38/0x1f0 [zfs]
[  +0.003811]  openzfs_init+0x3f/0xbe0 [zfs]
[  +0.003759]  ? __pfx_openzfs_init+0x10/0x10 [zfs]
[  +0.003733]  do_one_initcall+0x5e/0x340
[  +0.003598]  do_init_module+0x68/0x260
[  +0.003574]  load_module+0x213a/0x22a0
[  +0.003526]  init_module_from_file+0x96/0x100
[  +0.003459]  ? init_module_from_file+0x96/0x100
[  +0.003410]  idempotent_init_module+0x11c/0x2b0
[  +0.003354]  __x64_sys_finit_module+0x64/0xd0
[  +0.003282]  do_syscall_64+0x5b/0x90
[  +0.003240]  ? srso_return_thunk+0x5/0x10
[  +0.003214]  ? exit_to_user_mode_prepare+0x39/0x190
[  +0.003268]  ? srso_return_thunk+0x5/0x10
[  +0.003201]  ? syscall_exit_to_user_mode+0x37/0x60
[  +0.003165]  ? srso_return_thunk+0x5/0x10
[  +0.003098]  ? do_syscall_64+0x67/0x90
[  +0.003029]  ? srso_return_thunk+0x5/0x10
[  +0.002986]  ? do_syscall_64+0x67/0x90
[  +0.002926]  ? exc_page_fault+0x94/0x1b0
[  +0.002867]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  +0.002887] RIP: 0033:0x7f4f02abf559
[  +0.002845] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 77 08 0d 00 f7 d8 64 89 01 48
[  +0.006020] RSP: 002b:00007ffeba0bc138 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  +0.003071] RAX: ffffffffffffffda RBX: 0000557276835e40 RCX: 00007f4f02abf559
[  +0.003040] RDX: 0000000000000000 RSI: 00005572763814a0 RDI: 0000000000000004
[  +0.003079] RBP: 00005572763814a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.003073] R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000060000
[  +0.003051] R13: 0000000000000000 R14: 0000557276835f70 R15: 0000000000000000
[  +0.003037]  </TASK>

EDIT: This seems to be safe to ignore, according to openzfs/zfs issue #13085 on GitHub ("zfs call trace on boot of 'Large kmem_alloc(73728, 0x1000)'").

This is interesting and matches similar observed problems with PCIe enumeration (ref 1, 2).
Check whether the PCI address matches the missing drives, then try the following tweaks:

  • boot with kernel boot option pci=nocrs
  • boot with kernel boot option pci=realloc=off
  • check that the following UEFI options are set as follows
  1. Above 4G Decoding - ENABLED
  2. Re-Size BAR Support - AUTO
  3. SR-IOV Support - ENABLED
  4. BME DMA Mitigation - DISABLED [Edit - See below for explanation]
  5. Hot-Plug Support - DISABLED
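
For the address-matching step, a small script can list the hot-plug events and the PCI function each NVMe controller was bound to. A minimal sketch, assuming the message formats seen in the dmesg posted above (other kernel versions may word them differently); the function and variable names are my own:

```python
import re

# Patterns modelled on the lines in the posted dmesg output.
PCIEHP = re.compile(r"pcieport (\S+): pciehp: Slot\((\d+)\): (.+)$")
NVME = re.compile(r"nvme (nvme\d+): pci function (\S+)")


def hotplug_events(dmesg_text):
    """Return (events, nvme_map) parsed from dmesg text.

    events   : list of (root_port_address, slot_number, event_text)
    nvme_map : {controller_name: pci_address}
    """
    events = []
    nvme_map = {}
    for line in dmesg_text.splitlines():
        m = PCIEHP.search(line)
        if m:
            events.append((m.group(1), int(m.group(2)), m.group(3).strip()))
            continue
        m = NVME.search(line)
        if m:
            nvme_map[m.group(1)] = m.group(2)
    return events, nvme_map


# Lines copied from the log in this thread.
SAMPLE_DMESG = """\
[Dec27 21:10] pcieport 0000:e0:03.1: pciehp: Slot(0): Link Down
[  +3.384270] pcieport 0000:e0:03.1: pciehp: Slot(0): Card present
[  +0.001300] pcieport 0000:e0:03.1: pciehp: Slot(0): Link Up
[  +0.000871] nvme nvme0: pci function 0000:e1:00.0
"""

if __name__ == "__main__":
    events, nvme_map = hotplug_events(SAMPLE_DMESG)
    for port, slot, what in events:
        print(port, slot, what)
    print(nvme_map)
```

If the same root port (here `0000:e0:03.1`) shows up in every Link Down event, that narrows it to one slot or cable rather than the drives in general.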

In case of disappearing drives during runtime, check temperature of drives and controllers. If they are overheating, they might drop off.
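
One way to log drive temperatures over time is to read the kernel's hwmon entries for the NVMe controllers. A rough sketch, assuming the kernel exposes NVMe temperature sensors under `/sys/class/hwmon` with the name `nvme` and `temp1_input` in millidegrees Celsius; `nvme_temps` is just a name I picked:

```python
from pathlib import Path


def nvme_temps(hwmon_root="/sys/class/hwmon"):
    """Return {hwmon_dir_name: degrees_celsius} for hwmon devices named "nvme"."""
    temps = {}
    for hw in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = hw / "name"
        if not name_file.is_file() or name_file.read_text().strip() != "nvme":
            continue
        temp_file = hw / "temp1_input"
        if temp_file.is_file():
            # The value is reported in millidegrees Celsius.
            temps[hw.name] = int(temp_file.read_text()) / 1000.0
    return temps


if __name__ == "__main__":
    for name, celsius in nvme_temps().items():
        print(f"{name}: {celsius:.1f} C")
```

Running this from cron or a loop while the load is active would show whether a drive heats up right before it drops off.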


@branchmispredictor Yeah I looked it up and it seems to be a thing on large memory systems like on Epyc so I figured it wasn’t related

@greatnull The BIOS options for 4G/Resize and SR-IOV are already enabled. I’ll check that BME DMA option

As for the drives disappearing during runtime, it may have been related to SVT-AV1 running in RT priority and causing timeouts for a drive during operation. The actual disk load is low while this is happening

Temperatures of everything I was observing were normal

I think at this point the real mystery is why these NVMe drives sometimes don’t show up at boot. A coworker of mine said that in their experience they had to set a POST delay on Supermicros, but there’s no way to set it on this board
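
A boot-time check along these lines could at least flag which drives failed to enumerate, and optionally ask the kernel to rescan the PCI bus. This is only a sketch: the first ID is taken from the zio error in the log above, the second is a placeholder, and I have not confirmed that a rescan brings these drives back (so far only a physical re-insert has).

```python
from pathlib import Path

# First ID comes from the zio error line in dmesg; the second is a placeholder.
EXPECTED = [
    "nvme-SDLC2CLR-016T-3NA1_A045C049",
    "nvme-EXAMPLE-SECOND-DRIVE",
]


def missing_drives(expected_ids, by_id_dir="/dev/disk/by-id"):
    """Return the expected IDs that have no entry in by_id_dir."""
    root = Path(by_id_dir)
    present = {p.name for p in root.iterdir()} if root.is_dir() else set()
    return [name for name in expected_ids if name not in present]


def rescan_pci(rescan_file="/sys/bus/pci/rescan"):
    """Ask the kernel to re-enumerate the PCI bus (needs root)."""
    Path(rescan_file).write_text("1")


if __name__ == "__main__":
    gone = missing_drives(EXPECTED)
    if gone:
        print("missing at boot:", ", ".join(gone))
        # Uncomment to try a rescan; whether it helps here is untested.
        # rescan_pci()
```

Logging the result on every boot would also show whether it is always the same bay that goes missing.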
