ZFS issues after installing EPYC 7532

I was running the following config:

AMD EPYC 7351P (upgraded to an EPYC 7532)
128 GB RAM (2666 MHz) 8 * 16 GB
Asrock Rack EPYCD8-2T (bios version 2.7)
AMD S7150x2 (replaced with Zotac 1070 Ti)
Adaptec 71605 in HBA mode running 15 * 4 TB Seagate 7E8 HDDs
ZFS with an Intel 750 series SSD as SLOG device
Proxmox 7.2

Recently I decided to upgrade the processor to the 32-core AMD EPYC 7532. After I installed it, the system initially displayed a blank screen with BIOS code b2, even though the IPMI was detecting the processor, all 8 sticks of RAM and the PCIe cards. I then reseated the processor and reset the BIOS.

I was still getting stuck at code b2, so I removed the graphics card and the system booted up. The system booted up fine with the Nvidia P40 as well. I reinstalled the AMD FirePro S7150x2, set the OpROM mode to legacy, and the system booted up.

I initially got errors reporting that the system was out of memory and that ZFS had stopped rebuilding the L2ARC. Removing my cache and log devices made this error go away.

With only one or two Windows VMs running and a practically empty ZFS pool, I am getting major IO delays, random stuttering and ZFS blocked-task messages.

PS. I am getting the errors even after replacing the AMD S7150x2 with the 1070 Ti.

Dec 12 01:03:30 hoserver kernel: INFO: task txg_sync:1544 blocked for more than 241 seconds.
Dec 12 01:04:50 hoserver kernel:       Tainted: P           O      5.15.74-1-pve #1
Dec 12 01:04:50 hoserver kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 12 01:04:50 hoserver kernel: task:txg_sync        state:D stack:    0 pid: 1544 ppid:     2 flags:0x00004000
Dec 12 01:04:50 hoserver kernel: Call Trace:
Dec 12 01:04:50 hoserver kernel:  <TASK>
Dec 12 01:04:50 hoserver kernel:  __schedule+0x34e/0x1740
Dec 12 01:04:50 hoserver kernel:  ? lock_timer_base+0x3b/0xd0
Dec 12 01:04:50 hoserver kernel:  ? __mod_timer+0x271/0x440
Dec 12 01:04:50 hoserver kernel:  schedule+0x69/0x110
Dec 12 01:04:50 hoserver kernel:  schedule_timeout+0x87/0x140
Dec 12 01:04:50 hoserver kernel:  ? __bpf_trace_tick_stop+0x20/0x20
Dec 12 01:04:50 hoserver kernel:  io_schedule_timeout+0x51/0x80
Dec 12 01:04:50 hoserver kernel:  __cv_timedwait_common+0x135/0x170 [spl]
Dec 12 01:04:50 hoserver kernel:  ? wait_woken+0x70/0x70
Dec 12 01:04:50 hoserver kernel:  __cv_timedwait_io+0x19/0x20 [spl]
Dec 12 01:04:50 hoserver kernel:  zio_wait+0x137/0x300 [zfs]
Dec 12 01:04:50 hoserver kernel:  ? __cond_resched+0x1a/0x50
Dec 12 01:04:50 hoserver kernel:  dsl_pool_sync+0xcc/0x4f0 [zfs]
Dec 12 01:04:50 hoserver kernel:  ? spa_suspend_async_destroy+0x60/0x60 [zfs]
Dec 12 01:04:50 hoserver kernel:  ? add_timer+0x20/0x30
Dec 12 01:04:50 hoserver kernel:  spa_sync+0x55a/0x1020 [zfs]
Dec 12 01:04:50 hoserver kernel:  ? spa_txg_history_init_io+0x10a/0x120 [zfs]
Dec 12 01:04:50 hoserver kernel:  txg_sync_thread+0x278/0x400 [zfs]
Dec 12 01:04:50 hoserver kernel:  ? txg_init+0x2c0/0x2c0 [zfs]
Dec 12 01:04:50 hoserver kernel:  thread_generic_wrapper+0x64/0x80 [spl]
Dec 12 01:04:50 hoserver kernel:  ? __thread_exit+0x20/0x20 [spl]
Dec 12 01:04:50 hoserver kernel:  kthread+0x12a/0x150
Dec 12 01:04:50 hoserver kernel:  ? set_kthread_struct+0x50/0x50
Dec 12 01:04:50 hoserver kernel:  ret_from_fork+0x22/0x30
Dec 12 01:04:50 hoserver kernel:  </TASK>
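
For anyone who wants to scan their own kernel log for the same symptom, here is a minimal Python sketch that pulls out the hung-task warnings shown above. The function name find_hung_tasks and the regular expression are my own illustration, not part of any ZFS or Proxmox tooling, and the pattern assumes the message wording matches the line quoted in this post.

```python
import re

# Matches the kernel's hung-task warning, for example:
# "INFO: task txg_sync:1544 blocked for more than 241 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>.+?):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

def find_hung_tasks(lines):
    """Return a (task name, pid, seconds) tuple for each hung-task warning."""
    hits = []
    for line in lines:
        m = HUNG_TASK.search(line)
        if m:
            hits.append((m.group("name"), int(m.group("pid")), int(m.group("secs"))))
    return hits

# Two lines copied from the log above.
sample = [
    "Dec 12 01:03:30 hoserver kernel: INFO: task txg_sync:1544 blocked for more than 241 seconds.",
    "Dec 12 01:04:50 hoserver kernel: Call Trace:",
]
print(find_hung_tasks(sample))
```

The same function can be fed the lines of a saved kernel log (for example, the output of journalctl -k written to a file) to see which tasks are blocking and how often.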

Update: I removed the pool and reinstalled Proxmox. I tried using the single NVMe SSD pool and the SATA SSD mirror rpool, but neither fixes the situation.

You did upgrade the BIOS, correct? That is an early EPYC board and needs a current BIOS for 7002 support.

Yes, I upgraded the BIOS and tried three versions. None of them worked. I even tried an rpool on SSDs, and I am getting the same error.

Is all the ZFS stuff built into Proxmox, or are you doing passthrough and using a VM for ZFS?

Have you tried reverting CPUs as a troubleshooting step? Does your MB allow RAM speed adjustments? You could also try lowering the memory speed to see if there is any change.

if this was 1999 i would say you have an IRQ conflict. it still looks like some bits are not jiving though.

The ZFS stuff is all built into Proxmox. I am not using passthrough. I have another system with a Ryzen 3900X that threw the same errors when I tried using Proxmox on it. I found that the culprits are kernels 5.15 and 5.19; 5.13 does not throw errors. I am still looking into it.
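
A quick way to tell whether a host is on one of the kernel series I saw the errors on is sketched below. The set of suspect series is only what I observed on my two machines (5.15 and 5.19 bad, 5.13 fine), not an authoritative list, and the helper names and the second sample release string are made up for illustration.

```python
import re

# Kernel series reported as problematic in this thread; an observation, not a rule.
SUSPECT_SERIES = {(5, 15), (5, 19)}

def kernel_series(release):
    """Extract (major, minor) from a release string such as '5.15.74-1-pve'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))

def is_suspect(release):
    """True if the release belongs to one of the series listed above."""
    return kernel_series(release) in SUSPECT_SERIES

# The first string is from the log in this thread; the second is a hypothetical 5.13 release.
for rel in ("5.15.74-1-pve", "5.13.19-6-pve"):
    print(rel, "suspect" if is_suspect(rel) else "not in the suspect list")
```

On a live system the release string can be taken from uname -r, or from platform.release() in Python.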

Sounds like a bad motherboard? If you want to send me the board and CPU, I have a like board and CPU here I can put through the paces.

Could be memory, but I can try memory here and if your stuff works we’ll have our answer.

I just went through an upgrade from first gen EPYC to 2nd gen on the EPYCD8-2T and it was a major PIA. For posterity, here are things I learned - some of them you already tried:

  • Before booting with the newly installed 2nd gen EPYC, you definitely need to clear CMOS
  • I needed to go into a semi-minimum POST config prior to booting the first time. Remove all PCIe cards, but fully populated DIMM slots were okay
  • Needed to POST after inserting each PCIe card at a time. Not sure if this was strictly necessary, but adding them all back in didn’t work. Possibly adding certain combinations of cards at a time will work, though.
  • The SP3 socket is super finicky. One of my DIMM slots didn’t work, and required a reseat of the CPU. There was a tiny speck of dust in the socket that probably interfered with one of the pins.
  • The BIOS settings on the second gen SoC are laid out completely differently. Definitely need to go through and adjust as part of first time POST.
  • Remember to reset your “VGA Priority” if needed. Clearing CMOS will have it default to external GPU.
  • In BIOS enable SVM (enabled by default) and IOMMU if needed (not enabled by default, don’t forget iommu=pt kernel command line parameter for best perf). Enable PCIe Above 4G Decoding, Ten Bit Tag Support (may not be necessary), enable PCIe Advanced Error Reporting (AER, seemed to help with vfio stability, but didn’t confirm so take with grain of salt)
  • As expected, EPYCD8 does not support PCIe Gen4, BUT Gen4 capable cards will happily negotiate a Gen4 link with the SoC! This took me forever to figure out and led to random I/O stutter, hangs and crashes. You need to force a Gen3 link in BIOS on PCIe slots populated with Gen4 cards.
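
To spot links that negotiated faster than Gen3, something like the following Python sketch can help. It reads the current_link_speed attribute that Linux exposes under /sys/bus/pci/devices; the function names are mine, and the parsing assumes the value starts with a number in GT/s (for example "16.0 GT/s PCIe"), so treat it as a rough check rather than a definitive tool.

```python
from pathlib import Path

GEN3_GTS = 8.0  # PCIe Gen3 runs at 8 GT/s per lane, Gen4 at 16 GT/s

def parse_gts(text):
    """Parse a sysfs link speed such as '16.0 GT/s PCIe' into a float, or None."""
    try:
        return float(text.split()[0])
    except (IndexError, ValueError):
        return None  # empty or non-numeric value such as 'Unknown'

def links_above_gen3(sysfs_root="/sys/bus/pci/devices"):
    """Return {device address: GT/s} for links currently running faster than Gen3."""
    fast = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return fast
    for dev in sorted(root.iterdir()):
        try:
            speed = parse_gts((dev / "current_link_speed").read_text())
        except OSError:
            continue  # attribute missing or unreadable for this device
        if speed is not None and speed > GEN3_GTS:
            fast[dev.name] = speed
    return fast

if __name__ == "__main__":
    for addr, gts in links_above_gen3().items():
        print(f"{addr}: {gts} GT/s (faster than Gen3)")
```

Any device it lists on an EPYCD8 is a candidate for forcing Gen3 on that slot in the BIOS.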

I am not sure that any of this will help with your situation, but perhaps this will be helpful for anyone upgrading on ROMED8.


After much juggling, the system is working for now. I have tested it with transfers of up to 4 TB (using zfs send) and it is stable, and the SSD-based rpool seems to be stable now as well. I am also installing and playing God of War while passing my 1070 Ti through to a VM.

In order to make it work, I tried the following:

  • Disabled C states (auto also works)
  • Reinstalled the processor
  • Installed BIOS version 2.7 from ASRock Rack that I had earlier. (This fixed the bus degraded issue I had with my GPU and stabilized the Adaptec controller)

The weird bit is that I installed a different OS in order to test the SSD issues: I removed Proxmox and installed Ubuntu using XFS, and since everything was working fine, I reinstalled Proxmox on the ZFS mirror as before. Surprisingly, the rpool was stable.

I am still testing the system.
[EDIT]: I spoke too soon, the link speed on my GPU (1070ti) is only 2.5 GT/s.

[Update]: @FlorianZ, manually setting the GPU slot to Gen 3 speed seems to have fixed this issue.

@FlorianZ I inadvertently followed similar steps and did resolve some issues, but not all.

@wendell Thanks for the offer. I will test the system further, and if it still gives me trouble, I will send it to you.
