System specs:
- MB: ASRock X870 PRO RS
- CPU: R7 7700X
- RAM: Corsair Vengeance 48GB (2x24GB) DDR5-6000
- ZFS DDT SSDs: 2x Crucial P3 Plus 500GB NVMe
- Boot SSD: 1x WD SN580 250GB NVMe
- OS: Debian 12.8
- ZFS Version: zfs-2.1.11-1/zfs-kmod-2.1.11-1
- Kernel Version: 6.1.0-28-amd64
ZFS configuration:
- Mirrored zpool running atop both Crucial P3 Plus SSDs
- Two zvols stacked on top of the zpool
- Two data zpools on separate hard disks
- Each zvol is attached as a DDT/metadata device to its own data zpool (zvol 1 attached to data zpool 1 and zvol 2 attached to data zpool 2); a command-form sketch of this layout follows below.
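To make the topology concrete, it's roughly equivalent to the following commands. All pool names, zvol names, sizes, and device paths here are illustrative placeholders, not the actual ones (shown as `special` vdevs; a `dedup` vdev would be added the same way):

```
# Mirrored pool across the two Crucial P3 Plus NVMe drives
zpool create ddtpool mirror /dev/nvme0n1 /dev/nvme1n1

# Two zvols carved out of that mirrored pool, one per data pool
zfs create -V 200G ddtpool/ddt1
zfs create -V 200G ddtpool/ddt2

# Each zvol attached to its data pool as a metadata (special) vdev
zpool add datapool1 special /dev/zvol/ddtpool/ddt1
zpool add datapool2 special /dev/zvol/ddtpool/ddt2
```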
The problem:
When both data zpools are scrubbed simultaneously, one of the zvols will, without fail, be marked as unavailable by ZFS and taken offline, at which point ZFS suspends the associated zpool. After this happens, the other data zpool finishes its scrub without issue. `fdisk -l` shows that all of the physical and logical devices are still present (including both zvols), and I can bring everything back online again with `zpool clear`.
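Recovery after the fault looks like this (pool name illustrative):

```
zpool status -v datapool1   # pool shows as SUSPENDED, with the backing zvol UNAVAIL
zpool clear datapool1       # clears the errors and resumes the pool
```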
Fault replication information:
- Scrub one data zpool on its own = no problems.
- Scrub the other data zpool on its own = no problems.
- Scrub the DDT zpool + one of the data zpools simultaneously = no problems.
- Scrub both data zpools simultaneously = zvol drops out and faults one of the data zpools.
- Scrub both data zpools + the DDT zpool simultaneously = zvol drops out and faults one of the data zpools.
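In command form, the failing case is nothing more exotic than this (pool names illustrative):

```
# zpool scrub returns as soon as the scrub is initiated,
# so issuing these back to back runs both scrubs concurrently
zpool scrub datapool1
zpool scrub datapool2
```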
Diagnostic/fault-finding information:
- Kernel logs are virtually silent when this happens; the only hints at what went wrong are in the output of `dmesg` and in `/var/log/syslog`. `dmesg` just shows the warning that ZFS prints to the kernel log when a zpool is suspended. `/var/log/syslog` shows ZFS marking the dropped zvol as unavailable and then suspending the associated zpool.
- I suspect the PCIe subsystem is responsible: the fault only occurs when the DDT SSDs are being hammered with extreme read I/O, i.e. while both data zpools are being scrubbed simultaneously. Once one of the zpools ends up suspended, no further problems occur, possibly due to the reduced load on the DDT SSDs (and, as a result, on the PCIe subsystem).
- If PCIe dropouts are happening, they're not severe enough for the kernel to notice; all logs are completely silent with regard to PCIe errors or link drops.
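A few checks that might surface errors the standard logs are missing; none of these are guaranteed to catch a link drop, and device paths are illustrative:

```
# ZFS's own event log often carries more detail than syslog
zpool events -v

# Confirm AER is present/enabled and look for logged PCIe errors
dmesg | grep -iE 'aer|pcie bus error'
lspci -vv | grep -i 'advanced error'

# NVMe controller error log and SMART data for each DDT SSD
nvme error-log /dev/nvme0
smartctl -a /dev/nvme0
```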
Does this sound like PCIe dropouts rearing their ugly head? If so, should I replace the motherboard with an identical one, or with a different model from a different vendor and/or on a different chipset (e.g. X870E)?
If it is a motherboard and/or CPU issue, I have until the 31st of January to return everything for a refund/replacement.
Thanks in advance.
ETA: I’m tempted to move the ZFS DDT SSDs and data drives into my main workstation (R9 9900X + MSI MPG X870E CARBON WIFI + 48GB DDR5-8000) and see if I can repeat the problem on a completely different set of hardware. This should at least tell me if I need to be looking at hardware or software.