ASRock X870 Pro RS - Possible PCIe Dropouts?

System specs:

  • MB: ASRock X870 PRO RS
  • CPU: R7 7700X
  • RAM: Corsair Vengeance 48GB (2x24GB) DDR5-6000
  • ZFS DDT SSDs: 2x Crucial P3 Plus 500GB NVMe
  • Boot SSD: 1x WD SN580 250GB NVMe
  • OS: Debian 12.8
  • ZFS Version: zfs-2.1.11-1/zfs-kmod-2.1.11-1
  • Kernel Version: 6.1.0-28-amd64

ZFS configuration:

  • Mirrored zpool running atop both Crucial P3 Plus SSDs
  • Two zvols stacked on top of the zpool
  • Two data zpools on separate hard disks
  • Each zvol is attached as a DDT/metadata device to one of the data zpools (zvol 1 to data zpool 1, zvol 2 to data zpool 2)
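
For reference, the layout is roughly equivalent to the following (pool, zvol and device names below are made up, and the sizes are placeholders):

    # NVMe mirror that backs the DDT/metadata zvols
    zpool create ddtpool mirror /dev/nvme0n1 /dev/nvme1n1
    zfs create -V 200G ddtpool/meta1
    zfs create -V 200G ddtpool/meta2
    # data pools on the hard disks, each with one zvol as its metadata/DDT (special) vdev
    zpool create data1 /dev/sda
    zpool create data2 /dev/sdb
    zpool add data1 special /dev/zvol/ddtpool/meta1
    zpool add data2 special /dev/zvol/ddtpool/meta2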

The problem:

When both data zpools are scrubbed simultaneously, one of the zvols will, without fail, be marked as unavailable by ZFS and taken offline, at which point ZFS suspends the associated zpool. After this happens, the other data zpool finishes scrubbing fine. fdisk -l shows that all of the physical and logical devices are still present (including both zvols), and I can bring everything back online again with zpool clear.
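
For what it's worth, recovery is just a status check followed by a clear (pool name made up):

    zpool status -v data1    # shows the pool suspended and the zvol vdev as UNAVAIL
    zpool clear data1        # resumes I/O once the zvol is visible again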

Fault replication information:

  • Scrub one data zpool on its own = no problems.
  • Scrub the other data zpool on its own = no problems.
  • Scrub the DDT zpool + one of the data zpools simultaneously = no problems.
  • Scrub both data zpools simultaneously = zvol drops out and faults one of the data zpools.
  • Scrub both data zpools + the DDT zpool simultaneously = zvol drops out and faults one of the data zpools.
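
The failing case is nothing more exotic than two scrubs started back to back (pool names made up):

    zpool scrub data1
    zpool scrub data2    # one of the zvols drops out shortly after this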

Diagnostic/fault-finding information:

  • The logs are virtually silent when this happens - the only hints at what went wrong are in the output of dmesg and /var/log/syslog.
  • dmesg just shows the warning that ZFS prints to the kernel logs when the zpool is suspended.
  • /var/log/syslog shows ZFS marking the zvol that drops out as unavailable and then suspending the associated zpool.
  • I suspect the PCIe subsystem is responsible: the fault only occurs when the DDT SSDs are being hammered with extreme read I/O, i.e. when both data zpools are being scrubbed simultaneously. Once one of the zpools ends up suspended, no further problems occur, possibly because of the reduced load on the DDT SSDs (and, by extension, on the PCIe subsystem).
  • If there are PCIe dropouts happening, they aren’t enough for the kernel to notice - all logs are completely silent with regard to PCIe errors or dropouts.
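
For anyone wanting to dig further, these are the kinds of places that could be checked for silent PCIe/NVMe trouble (device names are examples; the last three need smartmontools/nvme-cli/pciutils):

    dmesg | grep -iE 'aer|pcie|nvme'                          # kernel-side PCIe/NVMe errors
    smartctl -a /dev/nvme0                                    # controller/media error counters
    nvme error-log /dev/nvme0                                 # NVMe error-log entries
    lspci -vvv -d ::0108 | grep -iE 'lnksta|advanced error'   # link status + AER capability of the NVMe controllers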

Does this sound like PCIe dropouts rearing their ugly head at me? If so, should I replace the motherboard with an identical one, or should I replace it with a different one from a different vendor/with a different chipset (e.g. X870E)?

If it is a motherboard and/or CPU issue, I have until the 31st of January to return everything for a refund/replacement.

Thanks in advance.

ETA: I’m tempted to move the ZFS DDT SSDs and data drives into my main workstation (R9 9900X + MSI MPG X870E CARBON WIFI + 48GB DDR5-8000) and see if I can repeat the problem on a completely different set of hardware. This should at least tell me if I need to be looking at hardware or software.
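
If I do move them, the plan would be something along these lines (pool names made up; the data pools may need -d so that their zvol-backed special vdevs are found):

    # on the current box
    zpool export data1 && zpool export data2 && zpool export ddtpool
    # on the workstation, after moving the drives over
    zpool import ddtpool
    zpool import -d /dev/zvol/ddtpool -d /dev data1
    zpool import -d /dev/zvol/ddtpool -d /dev data2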

You haven’t listed the spec of the likely culprit of your issues: the PSU.

I think you’re experiencing the symptoms of a system being starved for power at peak loads.

It’s been a while since I’ve looked at the power supply closely, but I believe it’s a good name-brand 500W unit. I’ve pulled power consumption logs from when the system kicked off ZFS scrubs: they show less than 70W of additional draw, and the system otherwise remains stable. Before I commissioned it, I performed extended burn-in testing with Prime95, which placed somewhere between 110W and 150W of additional load on the system and passed with flying colors.

How many HDDs in total?

Two hard disks and three NVMe SSDs. Just spun up Prime95 and kicked off a single-zpool scrub in parallel, and things seem to be progressing fine with the CPU at full load and around 125-150W of additional load on the power supply, suggesting that power problems aren’t the culprit.
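
For the record, the combined test was nothing fancier than this (assuming the Linux mprime build of Prime95; pool name made up):

    ./mprime -t &        # Prime95 torture test, all cores loaded
    zpool scrub data1    # single-pool scrub running alongside it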

Yeah, I tend to agree with your assessment.

The second guess is also related to power: power management settings.

Try turning off power management in UEFI/BIOS and OS and see if your troubles go away.
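
On the OS side, the usual suspects for NVMe are PCIe ASPM and NVMe APST (autonomous power state transitions); something like this on the kernel command line (e.g. in /etc/default/grub, followed by update-grub and a reboot) turns both off:

    GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off nvme_core.default_ps_max_latency_us=0"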

Could it be that power management is somehow shutting down the DDT SSDs and causing dropouts? I’d expect that if that were the case, it would only drop one drive and not the other, which would manifest as a single-drive failure in the mirror (requiring a resilver to fully resolve) rather than a zvol dropping out - the latter would require both drives to disappear at exactly the same time, which is something a flaky PCIe controller or switch could cause.

Not sure about the exact influence of power management on storage, but my troubles went away in several systems after I eased off the initial aggressive power management settings.

Would loading the system up with Prime95 effectively bypass aggressive power management? In other words, if power management is causing my troubles, should running Prime95 in parallel with the ZFS scrubs make the dropouts go away?

Prime95 stresses the CPU (only), while a ZFS scrub stresses the storage subsystem (PCIe, SATA, etc.) - totally different workloads.
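
If you want to reproduce the storage-side load without involving a scrub, something like fio pointed at both zvols would be a closer analogue (read-only random reads, so non-destructive; device paths made up to match your layout):

    fio --rw=randread --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
        --runtime=300 --time_based --group_reporting \
        --name=meta1 --filename=/dev/zvol/ddtpool/meta1 \
        --name=meta2 --filename=/dev/zvol/ddtpool/meta2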

What you could do to rule out software/kernel issues is to boot a FreeBSD 14.2 memstick image, import the pools and run a scrub to see if it also occurs under FreeBSD - with a bit of luck the kernel might be more verbose. There are also reports of PCIe devices being wonky and causing (silent) corruption when RAM speed is set too high, so you might want to try dialing it back to 5600 (within spec/JEDEC) for now.
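
Roughly, from the live environment (pool names made up; -f because the pools were last imported on another host, -R to keep mounts under /mnt):

    zpool import -f -R /mnt ddtpool
    zpool import -f -R /mnt -d /dev/zvol/ddtpool -d /dev data1
    zpool import -f -R /mnt -d /dev/zvol/ddtpool -d /dev data2
    zpool scrub data1
    zpool scrub data2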