[SOLVED, with zfs-2.2] Call for help: ZFS + Sapphire Rapids + VM = crash

For the past few weeks, I’ve been trying to debug ZFS crashes on my Sapphire Rapids machine (w9-3495X), which usually happen with a VM running. The crash looks something like this:

[   70.390218] BUG: unable to handle page fault for address: ffff96efe92cdcff
[   70.390221] #PF: supervisor write access in kernel mode
[   70.390222] #PF: error_code(0x0003) - permissions violation
[   70.390223] PGD 807f601067 P4D 807f601067 PUD 1085d8063 PMD 132d13063 PTE 80000001292cd161
[   70.390226] Oops: 0003 [#1] PREEMPT SMP NOPTI
[   70.390228] CPU: 0 PID: 1671 Comm: z_rd_int_8 Tainted: P           O       6.2.15_2 #1
[   70.390230] Hardware name: Supermicro Super Server/X13SWA-TF, BIOS 1.1 02/15/2023
[   70.390231] RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]
[   70.390237] Code: 00 e8 f5 46 c2 db fa 0f 1f 44 00 00 48 8b 1d 98 af 00 00 e8 c3 77 68 dc 89 c0 48 8b 0c c3 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5b c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[   70.390238] RSP: 0018:ffffbbb848b5b908 EFLAGS: 00010086
[   70.390240] RAX: 00000000ffffffff RBX: ffff96efdae5ac00 RCX: ffff96efe92cb000
[   70.390241] RDX: 00000000ffffffff RSI: ffffffff9d10adb1 RDI: ffffffff9d1191e3
[   70.390242] RBP: ffff96f11a190000 R08: ffffbbb848b5ba20 R09: ffffffffc0e288a0
[   70.390242] R10: 0000000000000000 R11: 00000000c12c4cc9 R12: ffffbbb848b5ba40
[   70.390243] R13: 0000000000010000 R14: 0000000000000000 R15: 0000000000000008
[   70.390244] FS:  0000000000000000(0000) GS:ffff976d3a200000(0000) knlGS:0000000000000000
[   70.390245] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   70.390246] CR2: ffff96efe92cdcff CR3: 000000807e810005 CR4: 0000000000772ef0
[   70.390247] PKRU: 55555554
[   70.390247] Call Trace:
[   70.390249]  <TASK>
[   70.390251]  fletcher_4_avx512f_native+0x18/0xa0 [zcommon]
[   70.390254]  ? abd_fletcher_4_iter+0x64/0xc0 [zcommon]
[   70.390259]  ? abd_iterate_func+0xe8/0x1c0 [zfs]
[   70.390321]  ? __pfx_abd_fletcher_4_iter+0x10/0x10 [zcommon]
[   70.390325]  ? abd_fletcher_4_native+0x7b/0xc0 [zfs]
[   70.390369]  ? preempt_count_add+0x47/0xa0
[   70.390373]  ? abd_iter_map+0x6c/0xb0 [zfs]
[   70.390411]  ? abd_copy_to_buf_off_cb+0x1b/0x30 [zfs]
[   70.390438]  ? abd_iter_advance+0x3f/0x70 [zfs]
[   70.390474]  ? abd_iterate_func+0x10d/0x1c0 [zfs]
[   70.390500]  ? zio_checksum_error_impl+0xfb/0x630 [zfs]
[   70.390536]  ? __pfx_abd_fletcher_4_native+0x10/0x10 [zfs]
[   70.390570]  ? update_load_avg+0x7e/0x780
[   70.390573]  ? newidle_balance+0x2eb/0x420
[   70.390575]  ? vdev_queue_io_to_issue+0x3db/0xc40 [zfs]
[   70.390622]  ? zio_checksum_error+0x64/0xd0 [zfs]
[   70.390659]  ? zio_checksum_verify+0x36/0x140 [zfs]
[   70.390698]  ? preempt_count_add+0x6a/0xa0
[   70.390699]  ? _raw_spin_lock+0x13/0x40
[   70.390702]  ? _raw_spin_unlock+0x12/0x30
[   70.390703]  ? zio_wait_for_children+0x88/0xc0 [zfs]
[   70.390740]  ? zio_vdev_io_assess+0x20/0x2d0 [zfs]
[   70.390776]  ? zio_execute+0x80/0x120 [zfs]
[   70.390812]  ? taskq_thread+0x2e0/0x510 [spl]
[   70.390817]  ? __pfx_default_wake_function+0x10/0x10
[   70.390819]  ? __pfx_zio_execute+0x10/0x10 [zfs]
[   70.390855]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[   70.390859]  ? kthread+0xe6/0x110
[   70.390861]  ? __pfx_kthread+0x10/0x10
[   70.390862]  ? ret_from_fork+0x29/0x50
[   70.390865]  </TASK>
[   70.390865] Modules linked in: snd_seq_dummy snd_hrtimer xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables bridge stp llc bnep msr nls_iso8859_1 nls_cp437 vfat fat hid_logitech_hidpp snd_usb_audio hid_logitech_dj snd_usbmidi_lib snd_rawmidi mc mei_hdcp mei_pxp ipmi_ssif snd_sof_pci_intel_tgl intel_rapl_msr intel_rapl_common snd_sof_intel_hda_common intel_uncore_frequency intel_uncore_frequency_common soundwire_intel soundwire_generic_allocation i10nm_edac soundwire_cadence nfit snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp x86_pkg_temp_thermal intel_powerclamp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_bus snd_soc_core kvm_intel snd_hda_codec_realtek snd_hda_codec_generic snd_compress i2c_designware_platform i2c_designware_core
[   70.390894]  snd_hda_codec_hdmi ledtrig_audio snd_pcm_dmaengine ac97_bus kvm snd_hda_intel snd_intel_dspcfg iTCO_wdt intel_pmc_bxt snd_intel_sdw_acpi snd_hda_codec mei_gsc irqbypass snd_hda_core intel_lpss_pci pmt_crashlog iTCO_vendor_support rndis_host pmt_telemetry mei_me rapl snd_hwdep intel_lpss acpi_ipmi i2c_i801 atlantic igb intel_sdsi idma64 cdc_ether spi_intel_pci pmt_class idxd intel_cstate pcspkr tpm_crb intel_vsec idxd_bus input_leds usbnet macsec snd_pcm aquacomputer_d5next dca virt_dma mei spi_intel i2c_smbus ipmi_si ipmi_devintf ipmi_msghandler tpm_tis tpm_tis_core tpm tiny_power_button rng_core evdev joydev button mac_hid nct6775 nct6775_core hwmon_vid coretemp sg snd_seq snd_seq_device snd_timer snd soundcore vhost_vsock vmw_vsock_virtio_transport_common vsock vhost_net vhost vhost_iotlb tap uhid hci_vhci bluetooth ecdh_generic rfkill ecc crc16 vfio_iommu_type1 vfio iommufd dm_mod uinput userio ppp_generic slhc tun loop nvram btrfs blake2b_generic xor raid6_pq libcrc32c
[   70.390936]  crc32c_generic cuse fuse usbkbd hid_generic usbmouse usbhid hid i915 zfs(PO) zunicode(PO) zzstd(O) drm_buddy intel_gtt zlua(O) video zavl(PO) wmi crct10dif_pclmul ast icp(PO) crc32_pclmul drm_display_helper drm_shmem_helper crc32c_intel i2c_algo_bit polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel cec sha512_ssse3 drm_kms_helper xhci_pci ahci xhci_pci_renesas rc_core zcommon(PO) syscopyarea aesni_intel znvpair(PO) libahci sysfillrect ttm crypto_simd sysimgblt xhci_hcd spl(O) libata cryptd drm usbcore scsi_mod agpgart scsi_common usb_common
[   70.390960] CR2: ffff96efe92cdcff
[   70.390961] ---[ end trace 0000000000000000 ]---
[   70.435252] RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]
[   70.435259] Code: 00 e8 f5 46 c2 db fa 0f 1f 44 00 00 48 8b 1d 98 af 00 00 e8 c3 77 68 dc 89 c0 48 8b 0c c3 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5b c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[   70.435261] RSP: 0018:ffffbbb848b5b908 EFLAGS: 00010086
[   70.435262] RAX: 00000000ffffffff RBX: ffff96efdae5ac00 RCX: ffff96efe92cb000
[   70.435264] RDX: 00000000ffffffff RSI: ffffffff9d10adb1 RDI: ffffffff9d1191e3
[   70.435265] RBP: ffff96f11a190000 R08: ffffbbb848b5ba20 R09: ffffffffc0e288a0
[   70.435266] R10: 0000000000000000 R11: 00000000c12c4cc9 R12: ffffbbb848b5ba40
[   70.435267] R13: 0000000000010000 R14: 0000000000000000 R15: 0000000000000008
[   70.435268] FS:  0000000000000000(0000) GS:ffff976d3a200000(0000) knlGS:0000000000000000
[   70.435269] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   70.435270] CR2: ffff96efe92cdcff CR3: 000000807e810005 CR4: 0000000000772ef0
[   70.435272] PKRU: 55555554
[   70.435273] note: z_rd_int_8[1671] exited with irqs disabled
[   70.435280] note: z_rd_int_8[1671] exited with preempt_count 2

The issue is very inconsistent: it happens after some reboots and never after others, regardless of how many times I’ve spun a VM up and down. When it happens, it happens with the first VM I start, and so far only when avx512f is selected as the fletcher4 implementation.

There are no errors in the IPMI event logs, and I’m already running ECC memory with no overclocks. My first suspect was the fletcher_4_avx512f_native implementation, so I’ve filed a bug upstream accordingly…

Well, today I saw another crash, but with a different kind of error:

[ 1696.497577] BUG: unable to handle page fault for address: ffff9e7989271cff
[ 1696.497583] #PF: supervisor write access in kernel mode
[ 1696.497585] #PF: error_code(0x0003) - permissions violation
[ 1696.497586] PGD 807f601067 P4D 807f601067 PUD 10855b063 PMD 11c71a063 PTE 8000000109271061
[ 1696.497589] Oops: 0003 [#1] PREEMPT SMP NOPTI
[ 1696.497592] CPU: 1 PID: 67531 Comm: z_wr_iss Tainted: P           O       6.2.15_2 #1
[ 1696.497594] Hardware name: Supermicro Super Server/X13SWA-TF, BIOS 1.1a 04/25/2023
[ 1696.497595] RIP: 0010:kfpu_begin+0x2f/0x70 [icp]
[ 1696.497607] Code: 00 e8 35 b7 55 e6 fa 0f 1f 44 00 00 48 8b 1d d8 0f c6 ff e8 03 e8 fb e6 89 c0 48 8b 0c c3 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5b c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[ 1696.497608] RSP: 0018:ffffad15e50075b0 EFLAGS: 00010096
[ 1696.497610] RAX: 00000000ffffffff RBX: ffff9e7984dae400 RCX: ffff9e798926f000
[ 1696.497611] RDX: 00000000ffffffff RSI: ffffffffa810adb1 RDI: ffffffffa81191e3
[ 1696.497612] RBP: ffff9e7ba31c0000 R08: 0000000000000000 R09: 0000000007af41f3
[ 1696.497613] R10: ffff9ef6fa272de8 R11: 0000000000000140 R12: 0000000000000000
[ 1696.497614] R13: 0000000000007fe0 R14: 0000000000000010 R15: ffffad15e5007980
[ 1696.497615] FS:  0000000000000000(0000) GS:ffff9ef6fa240000(0000) knlGS:0000000000000000
[ 1696.497616] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1696.497617] CR2: ffff9e7989271cff CR3: 000000807e810006 CR4: 0000000000772ee0
[ 1696.497618] PKRU: 55555554
[ 1696.497619] Call Trace:
[ 1696.497621]  <TASK>
[ 1696.497623]  gcm_mode_encrypt_contiguous_blocks+0x552/0x7c0 [icp]
[ 1696.497631]  ? __pfx_aes_encrypt_block+0x10/0x10 [icp]
[ 1696.497639]  ? __pfx_aes_encrypt_contiguous_blocks+0x10/0x10 [icp]
[ 1696.497645]  aes_encrypt_contiguous_blocks+0xac/0xc0 [icp]
[ 1696.497651]  ? __pfx_aes_copy_block+0x10/0x10 [icp]
[ 1696.497657]  ? __pfx_aes_xor_block+0x10/0x10 [icp]
[ 1696.497662]  crypto_update_uio+0xe8/0x140 [icp]
[ 1696.497671]  aes_encrypt_atomic+0x149/0x330 [icp]
[ 1696.497679]  crypto_encrypt+0x281/0x350 [icp]
[ 1696.497686]  ? zio_crypt_bp_do_hmac_updates+0x6b/0xa0 [zfs]
[ 1696.497744]  ? i_mod_hash_find_nosync+0x5f/0xa0 [icp]
[ 1696.497752]  ? preempt_count_add+0x47/0xa0
[ 1696.497755]  ? up_read+0x37/0x70
[ 1696.497759]  zio_do_crypt_uio+0x1dc/0x2d0 [zfs]
[ 1696.497800]  zio_do_crypt_data+0x273/0x10a0 [zfs]
[ 1696.497835]  ? preempt_count_add+0x47/0xa0
[ 1696.497837]  ? up_read+0x37/0x70
[ 1696.497839]  spa_do_crypt_abd+0x150/0x330 [zfs]
[ 1696.497897]  zio_encrypt+0x300/0x730 [zfs]
[ 1696.497955]  zio_execute+0x80/0x120 [zfs]
[ 1696.498007]  taskq_thread+0x2e0/0x510 [spl]
[ 1696.498014]  ? __pfx_default_wake_function+0x10/0x10
[ 1696.498016]  ? __pfx_zio_execute+0x10/0x10 [zfs]
[ 1696.498066]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[ 1696.498071]  kthread+0xe6/0x110
[ 1696.498074]  ? __pfx_kthread+0x10/0x10
[ 1696.498076]  ret_from_fork+0x29/0x50
[ 1696.498079]  </TASK>
[ 1696.498080] Modules linked in: snd_seq_dummy snd_hrtimer xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables bridge stp llc bnep msr nls_iso8859_1 nls_cp437 vfat fat hid_logitech_hidpp hid_logitech_dj mei_hdcp mei_pxp ipmi_ssif snd_sof_pci_intel_tgl snd_sof_intel_hda_common intel_rapl_msr intel_rapl_common soundwire_intel soundwire_generic_allocation intel_uncore_frequency intel_uncore_frequency_common soundwire_cadence snd_sof_intel_hda snd_sof_pci i10nm_edac snd_sof_xtensa_dsp nfit snd_sof x86_pkg_temp_thermal intel_powerclamp snd_sof_utils snd_soc_hdac_hda snd_hda_codec_realtek snd_hda_ext_core snd_soc_acpi_intel_match snd_hda_codec_generic snd_soc_acpi soundwire_bus ledtrig_audio snd_hda_codec_hdmi kvm_intel snd_soc_core snd_compress snd_pcm_dmaengine uvcvideo ac97_bus videobuf2_vmalloc
[ 1696.498116]  videobuf2_memops videobuf2_v4l2 i2c_designware_platform snd_hda_intel kvm i2c_designware_core videodev snd_intel_dspcfg snd_intel_sdw_acpi snd_usb_audio rndis_host iTCO_wdt snd_hda_codec pmt_telemetry videobuf2_common snd_usbmidi_lib cdc_ether intel_pmc_bxt irqbypass pmt_crashlog intel_sdsi pmt_class iTCO_vendor_support snd_hda_core snd_rawmidi acpi_ipmi mei_gsc rapl atlantic igb idxd intel_lpss_pci joydev input_leds mc aquacomputer_d5next usbnet pcspkr intel_vsec snd_hwdep mei_me idxd_bus intel_cstate ipmi_si intel_lpss dca macsec ipmi_devintf snd_pcm i2c_i801 idma64 evdev spi_intel_pci mei virt_dma spi_intel i2c_smbus tpm_crb mac_hid ipmi_msghandler tpm_tis tpm_tis_core tiny_power_button tpm rng_core button nct6775 nct6775_core hwmon_vid coretemp sg snd_seq snd_seq_device snd_timer snd soundcore vhost_vsock vmw_vsock_virtio_transport_common vsock vhost_net vhost vhost_iotlb tap uhid hci_vhci bluetooth ecdh_generic rfkill ecc crc16 vfio_iommu_type1 vfio iommufd dm_mod
[ 1696.498165]  uinput userio ppp_generic slhc tun loop nvram btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_generic cuse fuse usbmouse usbkbd hid_generic usbhid hid zfs(PO) i915 zunicode(PO) zzstd(O) drm_buddy intel_gtt zlua(O) video wmi zavl(PO) crct10dif_pclmul crc32_pclmul ast crc32c_intel icp(PO) polyval_clmulni drm_display_helper drm_shmem_helper polyval_generic i2c_algo_bit gf128mul ghash_clmulni_intel drm_kms_helper cec sha512_ssse3 ahci rc_core xhci_pci syscopyarea libahci aesni_intel zcommon(PO) sysfillrect xhci_pci_renesas ttm sysimgblt znvpair(PO) crypto_simd xhci_hcd spl(O) libata cryptd drm usbcore scsi_mod agpgart scsi_common usb_common
[ 1696.498201] CR2: ffff9e7989271cff
[ 1696.498203] ---[ end trace 0000000000000000 ]---
[ 1696.541839] RIP: 0010:kfpu_begin+0x2f/0x70 [icp]
[ 1696.541852] Code: 00 e8 35 b7 55 e6 fa 0f 1f 44 00 00 48 8b 1d d8 0f c6 ff e8 03 e8 fb e6 89 c0 48 8b 0c c3 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5b c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[ 1696.541854] RSP: 0018:ffffad15e50075b0 EFLAGS: 00010096
[ 1696.541855] RAX: 00000000ffffffff RBX: ffff9e7984dae400 RCX: ffff9e798926f000
[ 1696.541857] RDX: 00000000ffffffff RSI: ffffffffa810adb1 RDI: ffffffffa81191e3
[ 1696.541858] RBP: ffff9e7ba31c0000 R08: 0000000000000000 R09: 0000000007af41f3
[ 1696.541859] R10: ffff9ef6fa272de8 R11: 0000000000000140 R12: 0000000000000000
[ 1696.541860] R13: 0000000000007fe0 R14: 0000000000000010 R15: ffffad15e5007980
[ 1696.541861] FS:  0000000000000000(0000) GS:ffff9ef6fa240000(0000) knlGS:0000000000000000
[ 1696.541863] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1696.541864] CR2: ffff9e7989271cff CR3: 000000807e810006 CR4: 0000000000772ee0
[ 1696.541865] PKRU: 55555554
[ 1696.541866] note: z_wr_iss[67531] exited with irqs disabled
[ 1696.541885] note: z_wr_iss[67531] exited with preempt_count 1

This time avx512bw was selected as the fletcher4 implementation. Even though this crash is in a different module (icp versus zcommon), I think it may be related to the previous errors, as both have kfpu_begin as the RIP value.

Since I still have some kernel logs saved, I ran grep RIP: on these logs:

@40000000647888df3b850d8c.u:2023-06-01T11:55:47.87229 kern.warn: [   71.907896] RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]
@40000000647888df3b850d8c.u:2023-06-01T11:55:47.87298 kern.warn: [   71.911387] RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]
@40000000647c77942668e89c.u:2023-06-04T11:30:07.93488 kern.warn: [  206.201241] RIP: 0010:kfpu_begin+0x2f/0x70 [icp]
@40000000647c77942668e89c.u:2023-06-04T11:30:07.95871 kern.warn: [  206.712136] RIP: 0010:kfpu_begin+0x2f/0x70 [icp]
@40000000648ddeaf18f2e104.s:2023-06-17T05:56:42.01656 kern.warn: [   70.390231] RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]
@40000000648ddeaf18f2e104.s:2023-06-17T05:56:42.01711 kern.warn: [   70.435252] RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]
@4000000064904914229b9c8c.s:2023-06-18T15:12:50.62505 kern.warn: [78747.699749] RIP: 0010:i915_gem_object_pin_map+0x1e2/0x220 [i915]
@4000000064904914229b9c8c.s:2023-06-18T15:12:50.62557 kern.warn: [78747.700090] RIP: 0033:0x7fa94d2b95ab
@4000000064904914229b9c8c.s:2023-06-18T15:12:50.62587 kern.warn: [78747.700179] RIP: 0010:dma_buf_vmap+0xce/0xe0
@4000000064904914229b9c8c.s:2023-06-18T15:12:50.62636 kern.warn: [78747.700342] RIP: 0033:0x7fa94d2b95ab
@4000000064904914229b9c8c.s:2023-06-19T12:09:24.30436 kern.warn: [  143.711963] RIP: 0010:kfpu_end+0x33/0xb0 [zcommon]
@4000000064904914229b9c8c.s:2023-06-19T12:09:24.30507 kern.warn: [  143.914878] RIP: 0010:kfpu_end+0x33/0xb0 [zcommon]
@4000000064906914328f32fc.u:2023-06-19T14:24:27.53029 kern.warn: [ 1696.497595] RIP: 0010:kfpu_begin+0x2f/0x70 [icp]
@4000000064906914328f32fc.u:2023-06-19T14:24:27.53098 kern.warn: [ 1696.541839] RIP: 0010:kfpu_begin+0x2f/0x70 [icp]

(The i915 one is unrelated)

It turns out I had a crash similar to today’s back on June 4th. All of them have RIP values of either kfpu_begin or kfpu_end, which handle the XSAVES/XRSTORS FPU/SIMD context save/restore.
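The same grep output can be tallied per symbol and module, which makes the kfpu_begin/kfpu_end pattern easier to see at a glance. The Python sketch below is illustrative only: the regular expression, the `tally_rips` helper and the `(built-in)` label for symbols without a module are assumptions made for this example, and the sample lines are shortened copies of the log lines above.

```python
import re
from collections import Counter

# Matches kernel-mode RIP lines such as
#   RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]
# User-mode entries (segment 0033) carry no symbol and are skipped on purpose.
RIP_RE = re.compile(
    r"RIP: 0010:(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<mod>\w+)\])?"
)


def tally_rips(lines):
    """Count (symbol, module) pairs found in kernel log lines."""
    counts = Counter()
    for line in lines:
        match = RIP_RE.search(line)
        if match:
            module = match.group("mod") or "(built-in)"  # hypothetical label
            counts[(match.group("sym"), module)] += 1
    return counts


SAMPLE = [
    "[   71.907896] RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]",
    "[  206.201241] RIP: 0010:kfpu_begin+0x2f/0x70 [icp]",
    "[78747.699749] RIP: 0010:i915_gem_object_pin_map+0x1e2/0x220 [i915]",
    "[78747.700090] RIP: 0033:0x7fa94d2b95ab",
    "[78747.700179] RIP: 0010:dma_buf_vmap+0xce/0xe0",
    "[  143.711963] RIP: 0010:kfpu_end+0x33/0xb0 [zcommon]",
    "[ 1696.497595] RIP: 0010:kfpu_begin+0x2f/0x70 [icp]",
]

for (sym, mod), n in sorted(tally_rips(SAMPLE).items()):
    print(f"{n:2d}  {sym} [{mod}]")
```

To run it against real logs, pass an open file object (or any iterable of lines) to `tally_rips` instead of `SAMPLE`.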

Looking at the 4th Gen Xeon errata, there appear to be some issues regarding XSAVES (although this one is related to AMX TILEDATA) and XRSTORS, but I’m unsure whether they’re related to how ZFS uses them.

I’ve run out of ideas about where else to look. Has anyone else here running ZFS on Sapphire Rapids(-WS) (possibly also as a VM host) seen this crash?

(If this is indeed the issue, one workaround is probably to add noxsave to the kernel parameters, but that also disables AVX…)


You might be on the right track with the XSAVE suspicion. The disassembly of the “Code” bytes does contain the XSAVES instruction:

$ grep Code: kernel_log_{0,1} | sed 's/.*Code: \(.*\)/\1/' \
| while read line ; do ndisasm -b64 <(echo $line | xxd -r -p) ; done | grep -i xsave

0000002A  0FC729            xsaves [rcx]
0000002A  0FC729            xsaves [rcx]
0000002A  0FC729            xsaves [rcx]
0000002A  0FC729            xsaves [rcx]
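If ndisasm isn’t at hand, the faulting instruction can also be identified by hand from the “Code” line. The sketch below is a minimal illustration, not a general disassembler: it finds the byte the kernel wraps in `<..>` and decodes only the `0F C7 /r` group with a plain `[reg]` memory operand, where /3, /4 and /5 are XRSTORS, XSAVEC and XSAVES. The helper names are made up for this example; the byte string is copied from the first crash log.

```python
# Code bytes from the first oops; <0f> marks the faulting instruction.
CODE = (
    "00 e8 f5 46 c2 db fa 0f 1f 44 00 00 48 8b 1d 98 af 00 00 e8 c3 77 68 dc "
    "89 c0 48 8b 0c c3 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5b c3 "
    "cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f"
)

# Memory forms of the 0F C7 group that matter here (ModRM.reg -> mnemonic).
GROUP_0F_C7 = {3: "xrstors", 4: "xsavec", 5: "xsaves"}
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def faulting_bytes(code_line):
    """Return (offset of the <..> byte, all code bytes)."""
    raw, fault = [], None
    for index, token in enumerate(code_line.split()):
        if token.startswith("<") and token.endswith(">"):
            fault, token = index, token[1:-1]
        raw.append(int(token, 16))
    if fault is None:
        raise ValueError("no <..> marker in Code line")
    return fault, bytes(raw)


def decode_0f_c7(code, offset):
    """Decode '0F C7 /r' with a simple [reg] operand; return None otherwise."""
    if code[offset:offset + 2] != b"\x0f\xc7":
        return None
    modrm = code[offset + 2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0 or rm in (4, 5):  # SIB / disp32 forms are not handled here
        return None
    name = GROUP_0F_C7.get(reg)
    return f"{name} [{REGS[rm]}]" if name else None


offset, code = faulting_bytes(CODE)
print(f"{offset:08X}  {decode_0f_c7(code, offset)}")
```

The printed offset matches the `0000002A` column in the ndisasm output, because 42 bytes precede the marked byte.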

The logged rcx value differs from the page fault address, and the fault address isn’t at the start or end of a page (which would normally be the case if the memory access were underflowing or overflowing the page bounds). Instead, it is at offset +0x2CFF from rcx (FFFF96EFE92CDCFF-FFFF96EFE92CB000). PTE bit 0 (Present) is 1, so the page is mapped; note, though, that bit 1 (R/W) is 0 in the logged PTE, which would make the page read-only and fits the “permissions violation” error code. What I don’t see is why XSAVES would write that far past rcx.
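The arithmetic on the logged values can be redone in a few lines. The sketch below takes CR2, RCX and the PTE from the two crash logs in the first post, computes the fault offset relative to rcx, and decodes the PTE flag bits using the standard x86-64 layout (bit 0 Present, bit 1 R/W, bit 2 U/S, bit 5 Accessed, bit 6 Dirty, bit 8 Global, bit 63 NX). The flag names and the `pte_flags` helper are labels chosen for this example.

```python
# Flag bits of a 4 KiB x86-64 page-table entry (names chosen for this sketch).
PTE_FLAGS = {
    0: "present",
    1: "writable",
    2: "user",
    5: "accessed",
    6: "dirty",
    8: "global",
    63: "no-execute",
}


def pte_flags(pte):
    """Return the set of flag names whose bit is set in the PTE."""
    return {name for bit, name in PTE_FLAGS.items() if (pte >> bit) & 1}


# (module, CR2, RCX, PTE) as logged in the two oopses from the first post.
CRASHES = [
    ("zcommon", 0xFFFF96EFE92CDCFF, 0xFFFF96EFE92CB000, 0x80000001292CD161),
    ("icp", 0xFFFF9E7989271CFF, 0xFFFF9E798926F000, 0x8000000109271061),
]

for module, cr2, rcx, pte in CRASHES:
    offset = cr2 - rcx
    flags = ", ".join(sorted(pte_flags(pte)))
    print(f"{module}: fault at rcx+{offset:#x} ({offset} bytes), PTE flags: {flags}")
```

Both crashes give the same offset from rcx, and in both PTEs the Present bit is set while the R/W bit is clear.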

From the Intel SDM section 13.1, the standard x87/SSE/AVX regs only take the first 416 bytes, AVX512 takes 2kB, but it looks like AMX state can take up to 8kB (section 13.5.14), and the PF happens at +11kB, which might be close to the end of the AMX tile state.

AMX support is a couple of years old (https://lwn.net/Articles/877883/) but I wonder if there’s something new on SPR

Looks like there were some issues with the size of AMX tile state with KVM before: [PATCH v2 0/8] AMX support in Qemu - Yang Zhong

Shot in the dark: I wonder if disabling AMX tile support with kernel param clearcpuid=600 (X86_FEATURE_AMX_TILE) might help.
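For anyone wondering where 600 comes from: the kernel numbers CPU features as word * 32 + bit, and X86_FEATURE_AMX_TILE is defined as (18*32+24) in the kernel’s cpufeatures.h, where word 18 holds the EDX output of CPUID leaf 7, subleaf 0. The small sketch below just redoes that arithmetic; the helper names are made up for illustration.

```python
# Linux numbers x86 CPU features as word * 32 + bit.
# Word 18 = CPUID.(EAX=7, ECX=0):EDX; AMX-TILE is bit 24 of that register.
AMX_TILE_WORD = 18
AMX_TILE_BIT = 24


def feature_number(word, bit):
    """Feature number as accepted by the clearcpuid= kernel parameter."""
    return word * 32 + bit


def split_feature(number):
    """Inverse of feature_number: return (word, bit)."""
    return divmod(number, 32)


print(f"clearcpuid={feature_number(AMX_TILE_WORD, AMX_TILE_BIT)}")
print("600 ->", split_feature(600))
```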


Thank you for the analysis! I didn’t know how to disassemble the Code line (TIL), and seeing XSAVES there is pretty reassuring (one of my suspicions was that my CPU might be faulty).

If XSAVES/XRSTORS is indeed faulty, one fix may be applying the pending XSAVEC patch (it was originally to fix Zen 2’s XSAVES bug). Maybe that won’t trigger this bug…

I will try this. I’m not quite sure how to trigger the bug reliably, though, so it’ll probably take some time to know whether it helps. (I only get one crash every 5–6 reboots, and not even reliably.)


Just out of interest, can you paste the output of dmesg|grep x86/fpu ?

I get this on an AMD Zen3, I’m guessing your Sapphire Rapids has some more lines:

[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: xstate_offset[9]:  832, xstate_sizes[9]:    8
[    0.000000] x86/fpu: Enabled xstate features 0x207, context size is 840 bytes, using 'compacted' format.

Sure thing,

[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x400: 'PASID state'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x20000: 'AMX Tile config'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x40000: 'AMX Tile data'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: xstate_offset[5]:  832, xstate_sizes[5]:   64
[    0.000000] x86/fpu: xstate_offset[6]:  896, xstate_sizes[6]:  512
[    0.000000] x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
[    0.000000] x86/fpu: xstate_offset[9]: 2432, xstate_sizes[9]:    8
[    0.000000] x86/fpu: xstate_offset[10]: 2440, xstate_sizes[10]:    8
[    0.000000] x86/fpu: xstate_offset[17]: 2496, xstate_sizes[17]:   64
[    0.000000] x86/fpu: xstate_offset[18]: 2560, xstate_sizes[18]: 8192
[    0.000000] x86/fpu: Enabled xstate features 0x606e7, context size is 10752 bytes, using 'compacted' format.
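To put those numbers side by side, here is a small sketch that decodes the 0x606e7 feature mask, recomputes the context size from the listed offsets and sizes, and compares it with the +0x2CFF fault offset from the crash logs earlier in the thread. Everything in it is copied from the dmesg output above. The “without AMX” figure is an assumption: it simply drops the two trailing AMX components and assumes the earlier offsets stay where they are.

```python
# xstate component -> (offset, size), copied from the dmesg output above.
LAYOUT = {
    2: (576, 256),     # AVX registers
    5: (832, 64),      # AVX-512 opmask
    6: (896, 512),     # AVX-512 Hi256
    7: (1408, 1024),   # AVX-512 ZMM_Hi256
    9: (2432, 8),      # Protection Keys User registers
    10: (2440, 8),     # PASID state
    17: (2496, 64),    # AMX Tile config
    18: (2560, 8192),  # AMX Tile data
}
AMX_COMPONENTS = {17, 18}
ENABLED_MASK = 0x606E7
FAULT_OFFSET = 0x2CFF  # CR2 - RCX from the crash logs


def enabled_bits(mask):
    """List the bit positions set in the xstate feature mask."""
    return [bit for bit in range(64) if (mask >> bit) & 1]


def context_end(layout, skip=frozenset()):
    """Highest offset + size over all components not in `skip`."""
    return max(off + size for bit, (off, size) in layout.items() if bit not in skip)


full_size = context_end(LAYOUT)
no_amx_size = context_end(LAYOUT, skip=AMX_COMPONENTS)  # assumption, see above

print("enabled components:", enabled_bits(ENABLED_MASK))
print("context size from table:", full_size)
print("same, ignoring AMX:", no_amx_size)
print("fault offset:", FAULT_OFFSET, "->", FAULT_OFFSET - full_size, "bytes past the context size")
```

The AMX tile data alone accounts for 8192 of the 10752 bytes, so a save area sized without it would be far too small for an XSAVES that includes AMX.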

I’m trying the patch from the GitHub Issue that excludes AMX bits from XSAVES/XRSTORS in ZFS. Two days in, lots of reboots trying to trigger the error, no crashes so far. Looks like XSAVES/XRSTORS in ZFS including the AMX context may be what caused the crash.

I hope I’m not speaking too soon…


Hi, I’m a newbie to this stuff. I ran into the same problem on our newly rented Xeon(R) Gold 6438Y+ based server with Samsung NVMe disks (MZQL23T8HCLS-00A07). We just installed a new build of Proxmox 8, choosing ZFS RAID1 with default settings. VMs, whether Windows or FreeBSD, hang when partitioning a virtual disk in the ZFS pool:

Aug 08 16:43:14 var kernel: BUG: unable to handle page fault for address: ff1e002635bb2cff
Aug 08 16:43:14 var kernel: #PF: supervisor write access in kernel mode
Aug 08 16:43:14 var kernel: #PF: error_code(0x0003) - permissions violation
Aug 08 16:43:14 var kernel: PGD 52d1a01067 P4D 52d1a02067 PUD 60908f7063 PMD 135be1063 PTE 8000000135bb2161
Aug 08 16:43:14 var kernel: Oops: 0003 [#1] PREEMPT SMP NOPTI
Aug 08 16:43:14 var kernel: CPU: 2 PID: 71043 Comm: z_rd_int_9 Tainted: P W O 6.2.16-3-pve #1
Aug 08 16:43:14 var kernel: Hardware name: Dell Inc. PowerEdge R660/0R5JJC, BIOS 1.3.2 03/28/2023
Aug 08 16:43:14 var kernel: RIP: 0010:kfpu_begin+0x31/0xa0 [zcommon]
Aug 08 16:43:14 var kernel: Code: 3f 48 89 e5 fa 0f 1f 44 00 00 48 8b 15 88 89 00 00 65 8b 05 6d 95 65 3f 48 98 48 8b 0c c2 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5d 31 c0 31 d2 31 c9 31 f6 31 ff c3 cc cc cc cc 0f 1f 44
Aug 08 16:43:14 var kernel: RSP: 0018:ff31bce29603f8b0 EFLAGS: 00010082
Aug 08 16:43:14 var kernel: RAX: 00000000ffffffff RBX: ff1e0085adce0000 RCX: ff1e002635bb0000
Aug 08 16:43:14 var kernel: RDX: 00000000ffffffff RSI: ff1e0085adce0000 RDI: ff31bce29603fa00
Aug 08 16:43:14 var kernel: RBP: ff31bce29603f8b0 R08: 0000000000000000 R09: 000000d40f17f000
Aug 08 16:43:14 var kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ff1e0085adce2000
Aug 08 16:43:14 var kernel: R13: ff31bce29603fa00 R14: 0000000000002000 R15: 0000000000000000
Aug 08 16:43:14 var kernel: FS: 0000000000000000(0000) GS:ff1e0083ff240000(0000) knlGS:0000000000000000
Aug 08 16:43:14 var kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 08 16:43:14 var kernel: CR2: ff1e002635bb2cff CR3: 00000052d0010001 CR4: 0000000000773ee0
Aug 08 16:43:14 var kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 08 16:43:14 var kernel: DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
Aug 08 16:43:14 var kernel: PKRU: 55555554
Aug 08 16:43:14 var kernel: Call Trace:
Aug 08 16:43:14 var kernel: <TASK>
Aug 08 16:43:14 var kernel: fletcher_4_avx512f_native+0x1d/0xb0 [zcommon]
Aug 08 16:43:14 var kernel: abd_fletcher_4_iter+0x71/0xe0 [zcommon]
Aug 08 16:43:14 var kernel: abd_iterate_func+0x104/0x1e0 [zfs]
Aug 08 16:43:14 var kernel: ? __pfx_abd_fletcher_4_iter+0x10/0x10 [zcommon]
Aug 08 16:43:14 var kernel: ? __pfx_abd_fletcher_4_native+0x10/0x10 [zfs]
Aug 08 16:43:14 var kernel: abd_fletcher_4_native+0x89/0xd0 [zfs]
Aug 08 16:43:14 var kernel: zio_checksum_error_impl+0x1b3/0x800 [zfs]
Aug 08 16:43:14 var kernel: ? update_load_avg+0x82/0x810
Aug 08 16:43:14 var kernel: ? dequeue_entity+0x107/0x4b0
Aug 08 16:43:14 var kernel: ? psi_group_change+0x219/0x530
Aug 08 16:43:14 var kernel: ? newidle_balance+0x325/0x4b0
Aug 08 16:43:14 var kernel: ? vdev_queue_io_to_issue+0x41a/0xd60 [zfs]
Aug 08 16:43:14 var kernel: zio_checksum_error+0x6e/0xf0 [zfs]
Aug 08 16:43:14 var kernel: zio_checksum_verify+0x4a/0x160 [zfs]
Aug 08 16:43:14 var kernel: zio_execute+0x94/0x170 [zfs]
Aug 08 16:43:14 var kernel: taskq_thread+0x2ac/0x4d0 [spl]
Aug 08 16:43:14 var kernel: ? __pfx_default_wake_function+0x10/0x10
Aug 08 16:43:14 var kernel: ? __pfx_zio_execute+0x10/0x10 [zfs]
Aug 08 16:43:14 var kernel: ? __pfx_taskq_thread+0x10/0x10 [spl]
Aug 08 16:43:14 var kernel: kthread+0xe6/0x110
Aug 08 16:43:14 var kernel: ? __pfx_kthread+0x10/0x10
Aug 08 16:43:14 var kernel: ret_from_fork+0x29/0x50
Aug 08 16:43:14 var kernel: </TASK>
Aug 08 16:43:14 var kernel: Modules linked in: tcp_diag inet_diag ebtable_filter ebtables ip6table_raw ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables iptable_raw ipt_REJECT nf_reject_ipv4 xt_NFLOG xt_limit xt_physdev xt_addrtype xt_tcpudp xt_multiport xt_conntrack xt_set xt_comment xt_mark iptable_filter ip_set_hash_net ip_set nf_tables iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter bonding tls softdog sunrpc binfmt_misc nfnetlink_log nfnetlink ipmi_ssif intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 aesni_intel mgag200 crypto_simd qat_4xxx acpi_ipmi drm_shmem_helper cryptd cmdlinepart drm_kms_helper ipmi_si pmt_crashlog pmt_telemetry spi_nor dell_smbios i2c_algo_bit intel_qat isst_if_mmio isst_if_mbox_pci syscopyarea crc8 intel_sdsi pmt_class
Aug 08 16:43:14 var kernel: idxd rapl mei_me sysfillrect ipmi_devintf mtd pcspkr sysimgblt mei wmi_bmof dcdbas isst_if_common idxd_bus intel_vsec dell_wmi_descriptor authenc ipmi_msghandler intel_cstate acpi_power_meter input_leds joydev mac_hid vhost_net vhost vhost_iotlb tap efi_pstore drm dmi_sysfs ip_tables x_tables autofs4 hid_generic usbkbd usbmouse usbhid hid zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c simplefb xhci_pci xhci_pci_renesas nvme nvme_core crc32_pclmul i2c_i801 nvme_common ahci spi_intel_pci megaraid_sas spi_intel xhci_hcd i2c_smbus libahci i2c_ismt tg3 wmi pinctrl_emmitsburg
Aug 08 16:43:14 var kernel: CR2: ff1e002635bb2cff
Aug 08 16:43:14 var kernel: ---[ end trace 0000000000000000 ]---
Aug 08 16:43:14 var kernel: RIP: 0010:kfpu_begin+0x31/0xa0 [zcommon]
Aug 08 16:43:14 var kernel: Code: 3f 48 89 e5 fa 0f 1f 44 00 00 48 8b 15 88 89 00 00 65 8b 05 6d 95 65 3f 48 98 48 8b 0c c2 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5d 31 c0 31 d2 31 c9 31 f6 31 ff c3 cc cc cc cc 0f 1f 44
Aug 08 16:43:14 var kernel: RSP: 0018:ff31bce29603f8b0 EFLAGS: 00010082
Aug 08 16:43:14 var kernel: RAX: 00000000ffffffff RBX: ff1e0085adce0000 RCX: ff1e002635bb0000
Aug 08 16:43:14 var kernel: RDX: 00000000ffffffff RSI: ff1e0085adce0000 RDI: ff31bce29603fa00
Aug 08 16:43:14 var kernel: RBP: ff31bce29603f8b0 R08: 0000000000000000 R09: 000000d40f17f000
Aug 08 16:43:14 var kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ff1e0085adce2000
Aug 08 16:43:14 var kernel: R13: ff31bce29603fa00 R14: 0000000000002000 R15: 0000000000000000
Aug 08 16:43:14 var kernel: FS: 0000000000000000(0000) GS:ff1e0083ff240000(0000) knlGS:0000000000000000
Aug 08 16:43:14 var kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 08 16:43:14 var kernel: CR2: ff1e002635bb2cff CR3: 00000052d0010001 CR4: 0000000000773ee0
Aug 08 16:43:14 var kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 08 16:43:14 var kernel: DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
Aug 08 16:43:14 var kernel: PKRU: 55555554
Aug 08 16:43:14 var kernel: note: z_rd_int_9[71043] exited with irqs disabled
Aug 08 16:43:14 var kernel: note: z_rd_int_9[71043] exited with preempt_count 1

For the time being, try disabling AMX by adding clearcpuid=600 to /etc/kernel/cmdline and running /usr/sbin/proxmox-boot-tool refresh. I have only tested this for a few days, though, so I can’t say for sure whether it fixes the issue.
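After rebooting, one way to check whether the parameter was picked up is to look for it in /proc/cmdline and to see whether the amx_tile flag is still listed in /proc/cpuinfo. The sketch below is only a rough check under those assumptions: the helper names are made up, the sample strings are shortened stand-ins for the real file contents, and a missing flag only shows that the kernel stopped advertising the feature, not that the crash is fixed.

```python
def has_cmdline_param(cmdline, param):
    """True if `param` appears as its own token on the kernel command line."""
    return param in cmdline.split()


def cpu_flags(cpuinfo_text):
    """Return the flag set of the first CPU listed in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


# Shortened, made-up stand-ins for the real files. On a live system use:
#   cmdline = open("/proc/cmdline").read()
#   cpuinfo = open("/proc/cpuinfo").read()
cmdline = "root=ZFS=rpool/ROOT/pve-1 boot=zfs clearcpuid=600"
cpuinfo = "processor\t: 0\nflags\t\t: fpu sse avx avx512f avx512bw\n"

print("clearcpuid=600 on cmdline:", has_cmdline_param(cmdline, "clearcpuid=600"))
print("amx_tile still advertised:", "amx_tile" in cpu_flags(cpuinfo))
```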

A more reliable fix is to patch the ZFS source to disable AMX during context save/restore. I’ve been running this patch for a month now with zero crashes, but on Proxmox this can be a bit more involved.

It has been a while since I patched something on a Debian-based system, but something like this should do (untested, snapshot your / before trying):

(EDIT: The instructions below do not work with Proxmox, as Proxmox doesn’t use DKMS for ZFS. Properly patching ZFS on Proxmox involves building a kernel. It is likely better to use clearcpuid=600 for the time being.)

Check which ZFS version you’re running and take note of it:

$ sudo apt info zfs-dkms

Make sure the package list is up-to-date:

$ sudo apt update

Prepare a patch (change version as appropriate):

$ sudo apt install devscripts quilt
$ dget -u http://ftp.jp.debian.org/debian/pool/contrib/z/zfs-linux/zfs-linux_2.1.11-1.dsc
$ cd zfs-linux-2.1.11
$ export QUILT_PATCHES=debian/patches
$ quilt new fix-context-amx.patch
$ quilt add include/os/linux/kernel/linux/simd_x86.h
$ curl -sSL https://gist.githubusercontent.com/sirn/2ce3867869483dc7b65e3c03022cf40c/raw/4975be66223e4465fde833eb08d80a924acef93b/zfs-fix-amx-context.patch | patch -p1
$ quilt refresh

Patch & build:

$ sudo apt install pbuilder dh-sequence-dkms dh-python python3-sphinx
$ sudo /usr/sbin/pbuilder create
$ sudo env DEB_BUILD_OPTIONS="parallel=$(nproc)" pdebuild

Snapshot & install:

$ zfs snapshot rpool/ROOT/pve-1@$(date "+%s")
$ sudo dpkg -i /var/cache/pbuilder/result/zfs-dkms_2.1.11-1_all.deb

This is an under-rated thread.

One of my take-aways is that AMD introduced Memory Protection Keys [0] in Zen 3. It’s not a headline feature for now, and I never saw online reviews or articles mention that AMD added it in Zen 3, but I think it is going to be a crucial addition in the years to come.

[0] Memory protection keys [LWN.net]


Thank you so much for posting and the info. I just tripped over this issue for a video.


Hi all, I have the same issue with Proxmox 8.

If I create some other “drive” as ZFS, it works fine…
The clearcpuid=600 setting is not working for me, and I cannot install the patch on my Proxmox 8. Any idea how to get this working?

Aug 15 10:03:18 prox1 kernel: BUG: unable to handle page fault for address: ff460a20db923cff
Aug 15 10:03:18 prox1 kernel: #PF: supervisor write access in kernel mode
Aug 15 10:03:18 prox1 kernel: #PF: error_code(0x0003) - permissions violation
Aug 15 10:03:18 prox1 kernel: PGD 1d5801067 P4D 1d5802067 PUD 102f31063 PMD 118932063 PTE 800000011b923161
Aug 15 10:03:18 prox1 kernel: Oops: 0003 [#1] PREEMPT SMP NOPTI
Aug 15 10:03:18 prox1 kernel: CPU: 23 PID: 6479 Comm: z_wr_iss Tainted: P           O       6.2.16-6-pve #1
Aug 15 10:03:18 prox1 kernel: Hardware name: Supermicro SYS-621P-TR/X13DEI, BIOS 1.3a 06/02/2023
Aug 15 10:03:18 prox1 kernel: RIP: 0010:kfpu_begin+0x31/0xa0 [zcommon]
Aug 15 10:03:18 prox1 kernel: Code: 3f 48 89 e5 fa 0f 1f 44 00 00 48 8b 15 88 89 00 00 65 8b 05 6d b5 58 3f 48 98 48 8b 0c c2 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5d 31 c0 31 d2 31 c9 31 f6 31 ff c3 cc cc cc cc 0f 1f 44
Aug 15 10:03:18 prox1 kernel: RSP: 0018:ff5a1fca841eb930 EFLAGS: 00010082
Aug 15 10:03:18 prox1 kernel: RAX: 00000000ffffffff RBX: ff460a20eac34000 RCX: ff460a20db921000
Aug 15 10:03:18 prox1 kernel: RDX: 00000000ffffffff RSI: ff460a20eac34000 RDI: ff5a1fca841eba80
Aug 15 10:03:18 prox1 kernel: RBP: ff5a1fca841eb930 R08: 0000000000000000 R09: 0000000000000000
Aug 15 10:03:18 prox1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ff460a20eac35000
Aug 15 10:03:18 prox1 kernel: R13: ff5a1fca841eba80 R14: 0000000000001000 R15: 0000000000000000
Aug 15 10:03:18 prox1 kernel: FS:  0000000000000000(0000) GS:ff460a281ffc0000(0000) knlGS:0000000000000000
Aug 15 10:03:18 prox1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 15 10:03:18 prox1 kernel: CR2: ff460a20db923cff CR3: 00000001d3e10006 CR4: 0000000000773ee0
Aug 15 10:03:18 prox1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 15 10:03:18 prox1 kernel: DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
Aug 15 10:03:18 prox1 kernel: PKRU: 55555554
Aug 15 10:03:18 prox1 kernel: Call Trace:
Aug 15 10:03:18 prox1 kernel:  <TASK>
Aug 15 10:03:18 prox1 kernel:  fletcher_4_avx512f_native+0x1d/0xb0 [zcommon]
Aug 15 10:03:18 prox1 kernel:  abd_fletcher_4_iter+0x71/0xe0 [zcommon]
Aug 15 10:03:18 prox1 kernel:  abd_iterate_func+0x104/0x1e0 [zfs]
Aug 15 10:03:18 prox1 kernel:  ? __pfx_abd_fletcher_4_iter+0x10/0x10 [zcommon]
Aug 15 10:03:18 prox1 kernel:  ? __pfx_abd_fletcher_4_native+0x10/0x10 [zfs]
Aug 15 10:03:18 prox1 kernel:  abd_fletcher_4_native+0x89/0xd0 [zfs]
Aug 15 10:03:18 prox1 kernel:  ? kmem_cache_alloc+0x18e/0x360
Aug 15 10:03:18 prox1 kernel:  zio_checksum_compute+0x154/0x550 [zfs]
Aug 15 10:03:18 prox1 kernel:  ? __kmem_cache_alloc_node+0x19d/0x340
Aug 15 10:03:18 prox1 kernel:  ? spl_kmem_alloc+0xc3/0x120 [spl]
Aug 15 10:03:18 prox1 kernel:  ? spl_kmem_alloc+0xc3/0x120 [spl]
Aug 15 10:03:18 prox1 kernel:  ? __kmalloc_node+0x52/0xe0
Aug 15 10:03:18 prox1 kernel:  ? spl_kmem_alloc+0xc3/0x120 [spl]
Aug 15 10:03:18 prox1 kernel:  zio_checksum_generate+0x4d/0x80 [zfs]
Aug 15 10:03:18 prox1 kernel:  zio_execute+0x94/0x170 [zfs]
Aug 15 10:03:18 prox1 kernel:  taskq_thread+0x2ac/0x4d0 [spl]
Aug 15 10:03:18 prox1 kernel:  ? __pfx_default_wake_function+0x10/0x10
Aug 15 10:03:18 prox1 kernel:  ? __pfx_zio_execute+0x10/0x10 [zfs]
Aug 15 10:03:18 prox1 kernel:  ? __pfx_taskq_thread+0x10/0x10 [spl]
Aug 15 10:03:18 prox1 kernel:  kthread+0xe6/0x110
Aug 15 10:03:18 prox1 kernel:  ? __pfx_kthread+0x10/0x10
Aug 15 10:03:18 prox1 kernel:  ret_from_fork+0x29/0x50
Aug 15 10:03:18 prox1 kernel:  </TASK>
Aug 15 10:03:18 prox1 kernel: Modules linked in: tcp_diag inet_diag veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables sunrpc bonding tls softdog binfmt_misc nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd cryptd ast pmt_crashlog pmt_telemetry drm_shmem_helper intel_sdsi pmt_class cmdlinepart drm_kms_helper rapl mei_me spi_nor i2c_algo_bit syscopyarea idxd isst_if_mbox_pci isst_if_mmio sysfillrect intel_cstate pcspkr isst_if_common intel_vsec idxd_bus mtd sysimgblt mei acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad pfr_telemetry pfr_update joydev input_leds mac_hid vhost_net vhost vhost_iotlb tap efi_pstore drm dmi_sysfs ip_tables x_tables
Aug 15 10:03:18 prox1 kernel:  autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c simplefb rndis_host cdc_ether usbnet mii hid_generic usbkbd usbmouse nvme nvme_core nvme_common usbhid hid xhci_pci xhci_pci_renesas i2c_i801 ahci spi_intel_pci crc32_pclmul qlcnic tg3 vmd xhci_hcd spi_intel i2c_smbus libahci i2c_ismt wmi pinctrl_emmitsburg
Aug 15 10:03:18 prox1 kernel: CR2: ff460a20db923cff
Aug 15 10:03:18 prox1 kernel: ---[ end trace 0000000000000000 ]---
Aug 15 10:03:18 prox1 kernel: RIP: 0010:kfpu_begin+0x31/0xa0 [zcommon]
Aug 15 10:03:18 prox1 kernel: Code: 3f 48 89 e5 fa 0f 1f 44 00 00 48 8b 15 88 89 00 00 65 8b 05 6d b5 58 3f 48 98 48 8b 0c c2 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5d 31 c0 31 d2 31 c9 31 f6 31 ff c3 cc cc cc cc 0f 1f 44
Aug 15 10:03:18 prox1 kernel: RSP: 0018:ff5a1fca841eb930 EFLAGS: 00010082
Aug 15 10:03:18 prox1 kernel: RAX: 00000000ffffffff RBX: ff460a20eac34000 RCX: ff460a20db921000
Aug 15 10:03:18 prox1 kernel: RDX: 00000000ffffffff RSI: ff460a20eac34000 RDI: ff5a1fca841eba80
Aug 15 10:03:18 prox1 kernel: RBP: ff5a1fca841eb930 R08: 0000000000000000 R09: 0000000000000000
Aug 15 10:03:18 prox1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ff460a20eac35000
Aug 15 10:03:18 prox1 kernel: R13: ff5a1fca841eba80 R14: 0000000000001000 R15: 0000000000000000
Aug 15 10:03:18 prox1 kernel: FS:  0000000000000000(0000) GS:ff460a281ffc0000(0000) knlGS:0000000000000000
Aug 15 10:03:18 prox1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 15 10:03:18 prox1 kernel: CR2: ff460a20db923cff CR3: 00000001d3e10006 CR4: 0000000000773ee0
Aug 15 10:03:18 prox1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 15 10:03:18 prox1 kernel: DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
Aug 15 10:03:18 prox1 kernel: PKRU: 55555554
Aug 15 10:03:18 prox1 kernel: note: z_wr_iss[6479] exited with irqs disabled
Aug 15 10:03:18 prox1 kernel: note: z_wr_iss[6479] exited with preempt_count 1
Aug 15 10:03:31 prox1 kernel: BUG: unable to handle page fault for address: ff460a20db923cff
Aug 15 10:03:31 prox1 kernel: #PF: supervisor write access in kernel mode
Aug 15 10:03:31 prox1 kernel: #PF: error_code(0x0003) - permissions violation
Aug 15 10:03:31 prox1 kernel: PGD 1d5801067 P4D 1d5802067 PUD 102f31063 PMD 118932063 PTE 800000011b923161
Aug 15 10:03:31 prox1 kernel: Oops: 0003 [#2] PREEMPT SMP NOPTI
Aug 15 10:03:31 prox1 kernel: CPU: 23 PID: 573 Comm: z_rd_int_1 Tainted: P      D    O       6.2.16-6-pve #1
Aug 15 10:03:31 prox1 kernel: Hardware name: Supermicro SYS-621P-TR/X13DEI, BIOS 1.3a 06/02/2023
Aug 15 10:03:31 prox1 kernel: RIP: 0010:kfpu_begin+0x31/0xa0 [zcommon]
Aug 15 10:03:31 prox1 kernel: Code: 3f 48 89 e5 fa 0f 1f 44 00 00 48 8b 15 88 89 00 00 65 8b 05 6d b5 58 3f 48 98 48 8b 0c c2 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5d 31 c0 31 d2 31 c9 31 f6 31 ff c3 cc cc cc cc 0f 1f 44
Aug 15 10:03:31 prox1 kernel: RSP: 0018:ff5a1fca845fb8b0 EFLAGS: 00010082
Aug 15 10:03:31 prox1 kernel: RAX: 00000000ffffffff RBX: ff460a217aec0000 RCX: ff460a20db921000
Aug 15 10:03:31 prox1 kernel: RDX: 00000000ffffffff RSI: ff460a217aec0000 RDI: ff5a1fca845fba00
Aug 15 10:03:31 prox1 kernel: RBP: ff5a1fca845fb8b0 R08: 0000000000000000 R09: 000000182ab9e000
Aug 15 10:03:31 prox1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ff460a217aee0000
Aug 15 10:03:31 prox1 kernel: R13: ff5a1fca845fba00 R14: 0000000000020000 R15: 0000000000000000
Aug 15 10:03:31 prox1 kernel: FS:  0000000000000000(0000) GS:ff460a281ffc0000(0000) knlGS:0000000000000000
Aug 15 10:03:31 prox1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 15 10:03:31 prox1 kernel: CR2: ff460a20db923cff CR3: 000000010fb28002 CR4: 0000000000773ee0
Aug 15 10:03:31 prox1 kernel: PKRU: 55555554
Aug 15 10:03:31 prox1 kernel: Call Trace:
Aug 15 10:03:31 prox1 kernel:  <TASK>
Aug 15 10:03:31 prox1 kernel:  fletcher_4_avx512f_native+0x1d/0xb0 [zcommon]
Aug 15 10:03:31 prox1 kernel:  abd_fletcher_4_iter+0x71/0xe0 [zcommon]
Aug 15 10:03:31 prox1 kernel:  abd_iterate_func+0x104/0x1e0 [zfs]
Aug 15 10:03:31 prox1 kernel:  ? __pfx_abd_fletcher_4_iter+0x10/0x10 [zcommon]
Aug 15 10:03:31 prox1 kernel:  ? __pfx_abd_fletcher_4_native+0x10/0x10 [zfs]
Aug 15 10:03:31 prox1 kernel:  abd_fletcher_4_native+0x89/0xd0 [zfs]
Aug 15 10:03:31 prox1 kernel:  zio_checksum_error_impl+0x1b3/0x800 [zfs]
Aug 15 10:03:31 prox1 kernel:  ? update_load_avg+0x82/0x810
Aug 15 10:03:31 prox1 kernel:  ? dequeue_entity+0x107/0x4b0
Aug 15 10:03:31 prox1 kernel:  ? autoremove_wake_function+0x12/0x50
Aug 15 10:03:31 prox1 kernel:  ? psi_group_change+0x219/0x530
Aug 15 10:03:31 prox1 kernel:  ? newidle_balance+0x325/0x4b0
Aug 15 10:03:31 prox1 kernel:  ? vdev_queue_io_to_issue+0x41a/0xd60 [zfs]
Aug 15 10:03:31 prox1 kernel:  zio_checksum_error+0x6e/0xf0 [zfs]
Aug 15 10:03:31 prox1 kernel:  zio_checksum_verify+0x4a/0x160 [zfs]
Aug 15 10:03:31 prox1 kernel:  zio_execute+0x94/0x170 [zfs]
Aug 15 10:03:31 prox1 kernel:  taskq_thread+0x2ac/0x4d0 [spl]
Aug 15 10:03:31 prox1 kernel:  ? __pfx_default_wake_function+0x10/0x10
Aug 15 10:03:31 prox1 kernel:  ? __pfx_zio_execute+0x10/0x10 [zfs]
Aug 15 10:03:31 prox1 kernel:  ? __pfx_taskq_thread+0x10/0x10 [spl]
Aug 15 10:03:31 prox1 kernel:  kthread+0xe6/0x110
Aug 15 10:03:31 prox1 kernel:  ? __pfx_kthread+0x10/0x10
Aug 15 10:03:31 prox1 kernel:  ret_from_fork+0x29/0x50
Aug 15 10:03:31 prox1 kernel:  </TASK>
Aug 15 10:03:31 prox1 kernel: Modules linked in: tcp_diag inet_diag veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables sunrpc bonding tls softdog binfmt_misc nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd cryptd ast pmt_crashlog pmt_telemetry drm_shmem_helper intel_sdsi pmt_class cmdlinepart drm_kms_helper rapl mei_me spi_nor i2c_algo_bit syscopyarea idxd isst_if_mbox_pci isst_if_mmio sysfillrect intel_cstate pcspkr isst_if_common intel_vsec idxd_bus mtd sysimgblt mei acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad pfr_telemetry pfr_update joydev input_leds mac_hid vhost_net vhost vhost_iotlb tap efi_pstore drm dmi_sysfs ip_tables x_tables
Aug 15 10:03:31 prox1 kernel:  autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c simplefb rndis_host cdc_ether usbnet mii hid_generic usbkbd usbmouse nvme nvme_core nvme_common usbhid hid xhci_pci xhci_pci_renesas i2c_i801 ahci spi_intel_pci crc32_pclmul qlcnic tg3 vmd xhci_hcd spi_intel i2c_smbus libahci i2c_ismt wmi pinctrl_emmitsburg
Aug 15 10:03:31 prox1 kernel: CR2: ff460a20db923cff
Aug 15 10:03:31 prox1 kernel: ---[ end trace 0000000000000000 ]---
Aug 15 10:03:31 prox1 kernel: RIP: 0010:kfpu_begin+0x31/0xa0 [zcommon]
Aug 15 10:03:31 prox1 kernel: Code: 3f 48 89 e5 fa 0f 1f 44 00 00 48 8b 15 88 89 00 00 65 8b 05 6d b5 58 3f 48 98 48 8b 0c c2 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5d 31 c0 31 d2 31 c9 31 f6 31 ff c3 cc cc cc cc 0f 1f 44
Aug 15 10:03:31 prox1 kernel: RSP: 0018:ff5a1fca841eb930 EFLAGS: 00010082
Aug 15 10:03:31 prox1 kernel: RAX: 00000000ffffffff RBX: ff460a20eac34000 RCX: ff460a20db921000
Aug 15 10:03:31 prox1 kernel: RDX: 00000000ffffffff RSI: ff460a20eac34000 RDI: ff5a1fca841eba80
Aug 15 10:03:31 prox1 kernel: RBP: ff5a1fca841eb930 R08: 0000000000000000 R09: 0000000000000000
Aug 15 10:03:31 prox1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ff460a20eac35000
Aug 15 10:03:31 prox1 kernel: R13: ff5a1fca841eba80 R14: 0000000000001000 R15: 0000000000000000
Aug 15 10:03:31 prox1 kernel: FS:  0000000000000000(0000) GS:ff460a281ffc0000(0000) knlGS:0000000000000000
Aug 15 10:03:31 prox1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 15 10:03:31 prox1 kernel: CR2: ff460a20db923cff CR3: 000000010fb28002 CR4: 0000000000773ee0
Aug 15 10:03:31 prox1 kernel: PKRU: 55555554
Aug 15 10:03:31 prox1 kernel: note: z_rd_int_1[573] exited with irqs disabled
Aug 15 10:03:31 prox1 kernel: note: z_rd_int_1[573] exited with preempt_count 1
Aug 15 10:03:39 prox1 pvedaemon[2311]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - got timeout
Aug 15 10:03:46 prox1 pvestatd[2280]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - timeout after 51 retries
Aug 15 10:03:46 prox1 pvestatd[2280]: status update time (8.042 seconds)
Aug 15 10:03:56 prox1 pvestatd[2280]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - timeout after 51 retries
Aug 15 10:03:56 prox1 pvestatd[2280]: status update time (8.041 seconds)
Aug 15 10:04:04 prox1 pvedaemon[2311]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - timeout after 51 retries
Aug 15 10:04:06 prox1 pvestatd[2280]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - timeout after 51 retries


Welcome to the community!

Do you have any error messages when trying to make a patched deb or during installation?

Can you cat /proc/cmdline to see if clearcpuid is properly applied?
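
Something along these lines would do the check. It's only a sketch: find_clearcpuid is a made-up helper name, and it only tells you whether the parameter reached the kernel command line, not whether the kernel acted on it.

```shell
#!/bin/sh
# find_clearcpuid is a hypothetical helper for illustration only.
# It prints every clearcpuid=... token found in the given kernel
# command line and returns non-zero if there is none.
find_clearcpuid() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep '^clearcpuid='
}

# Check the command line the running kernel was booted with.
if find_clearcpuid "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "clearcpuid was passed to the kernel"
else
    echo "clearcpuid is NOT on the kernel command line"
fi
```

If it turns out to be missing on Proxmox, one thing worth checking (an assumption about your setup) is whether the host boots via systemd-boot rather than GRUB. In that case the parameter goes into /etc/kernel/cmdline, followed by proxmox-boot-tool refresh, instead of /etc/default/grub.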

Thank you for your reply. Right now we have set up Proxmox 7.4, and it works fine with this version. The ZFS version is shown below. So you mean that if we set up Proxmox 8 again and can apply your patch, it will work?

Package: zfs-dkms
Version: 2.0.3-9+deb11u1
Priority: optional
Section: contrib/kernel
Source: zfs-linux
Maintainer: Debian ZFS on Linux maintainers <[email protected]>

Interesting that it works on Proxmox 7! I guess the kernel in Proxmox 7 predates Intel’s AMX fpstate patch, so XSAVES/XRSTORS in ZFS sidestepped this entire SPR4 erratum.

To answer your question, the patch should work with Proxmox 8 (or ZFS 2.1), but the instructions may be a bit under-baked. I’ve been using the patch on my workstation for over a month (with lots of VMs) with no crashes, but admittedly, I don’t run Proxmox there. (I tested the instructions on another machine running Proxmox.)

If you can post where it fails during deb file creation or during patching, I may be able to help.

Alternatively, staying on Proxmox 7 until the upstream patch is merged and released may also be a viable option, although you need to be careful not to opt in to the 6.x LTS kernel in Proxmox, as that contains AMX fpstate support and might trigger the bug.
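
A rough way to tell which side of that line a host is on is to look at the running kernel's major version. This sketch only encodes the 6.x rule of thumb above (other opt-in kernels newer than the default may behave differently), and kernel_major is a made-up helper name:

```shell
#!/bin/sh
# kernel_major is a hypothetical helper for illustration only.
# It strips everything from the first dot onward: "6.2.16-6-pve" -> "6".
kernel_major() {
    printf '%s\n' "${1%%.*}"
}

release="$(uname -r)"
if [ "$(kernel_major "$release")" -ge 6 ]; then
    echo "$release: 6.x or newer, so it may trigger the bug"
else
    echo "$release: pre-6.x kernel"
fi
```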

Thank you. I installed Proxmox 8 again for testing and got as far as the last step:

dkms -i /var/cache/pbuilder/result/zfs-dkms_2.1.11-1_all.deb
-bash: dkms: command not found

So I installed it with
apt-get install dkms

But it still does not work:
dkms -i /var/cache/pbuilder/result/zfs-dkms_2.1.11-1_all.deb
Error! Unknown option: -i

The dkms version is
dkms-3.0.10

Is this the wrong dkms package?

Oops! It was supposed to be sudo dpkg -i /var/cache/pbuilder/result/zfs-dkms_2.1.11-1_all.deb not dkms! Sorry about that!

OK, thank you. Now I get this error:

dpkg -i /var/cache/pbuilder/result/zfs-dkms_2.1.11-1_all.deb
Selecting previously unselected package zfs-dkms.
(Reading database ... 70181 files and directories currently installed.)
Preparing to unpack .../zfs-dkms_2.1.11-1_all.deb ...
Unpacking zfs-dkms (2.1.11-1) ...
dpkg: dependency problems prevent configuration of zfs-dkms:
 zfs-dkms depends on dkms (>> 2.1.1.2-5); however:
  Package dkms is not installed.
 zfs-dkms depends on lsb-release; however:
  Package lsb-release is not installed.

dpkg: error processing package zfs-dkms (--install):
 dependency problems - leaving unconfigured
Processing triggers for initramfs-tools (0.142) ...
update-initramfs: Generating /boot/initrd.img-6.2.16-8-pve
Running hook script 'zz-proxmox-boot'..
Re-executing '/etc/kernel/postinst.d/zz-proxmox-boot' in new private mount namespace..
Copying and configuring kernels on /dev/disk/by-uuid/C49A-DEF1
	Copying kernel and creating boot-entry for 6.2.16-3-pve
	Copying kernel and creating boot-entry for 6.2.16-8-pve
Copying and configuring kernels on /dev/disk/by-uuid/C49B-247F
	Copying kernel and creating boot-entry for 6.2.16-3-pve
	Copying kernel and creating boot-entry for 6.2.16-8-pve
Errors were encountered while processing:
 zfs-dkms

Weird that it is saying dkms is not installed, given that you’ve previously installed it. Can you try apt install dkms lsb-release again?

Much better… so I think the patch is now in place?

root@prox2:~/zfs-linux-2.1.11# apt install dkms lsb-release
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Suggested packages:
  menu
Recommended packages:
  linux-headers-generic | linux-headers-686-pae | linux-headers-amd64 | linux-headers
The following NEW packages will be installed:
  dkms lsb-release
0 upgraded, 2 newly installed, 0 to remove and 0 not upgraded.
1 not fully installed or removed.
Need to get 0 B/55.1 kB of archives.
After this operation, 208 kB of additional disk space will be used.
Selecting previously unselected package lsb-release.
(Reading database ... 71117 files and directories currently installed.)
Preparing to unpack .../lsb-release_12.0-1_all.deb ...
Unpacking lsb-release (12.0-1) ...
Setting up lsb-release (12.0-1) ...
Selecting previously unselected package dkms.
(Reading database ... 71122 files and directories currently installed.)
Preparing to unpack .../dkms_3.0.10-8+deb12u1_all.deb ...
Unpacking dkms (3.0.10-8+deb12u1) ...
Setting up dkms (3.0.10-8+deb12u1) ...
Setting up zfs-dkms (2.1.11-1) ...
Loading new zfs-2.1.11 DKMS files...
Building for 6.2.16-3-pve 6.2.16-8-pve
Module build for kernel 6.2.16-3-pve was skipped since the
kernel headers for this kernel do not seem to be installed.
Module build for kernel 6.2.16-8-pve was skipped since the
kernel headers for this kernel do not seem to be installed.
Processing triggers for man-db (2.11.2-2) ...
Processing triggers for initramfs-tools (0.142) ...
update-initramfs: Generating /boot/initrd.img-6.2.16-8-pve
Running hook script 'zz-proxmox-boot'..
Re-executing '/etc/kernel/postinst.d/zz-proxmox-boot' in new private mount namespace..
Copying and configuring kernels on /dev/disk/by-uuid/C49A-DEF1
	Copying kernel and creating boot-entry for 6.2.16-3-pve
	Copying kernel and creating boot-entry for 6.2.16-8-pve
Copying and configuring kernels on /dev/disk/by-uuid/C49B-247F
	Copying kernel and creating boot-entry for 6.2.16-3-pve
	Copying kernel and creating boot-entry for 6.2.16-8-pve
root@prox2:~/zfs-linux-2.1.11# dpkg -i /var/cache/pbuilder/result/zfs-dkms_2.1.11-1_all.deb
(Reading database ... 71137 files and directories currently installed.)
Preparing to unpack .../zfs-dkms_2.1.11-1_all.deb ...
Deleting module zfs-2.1.11 completely from the DKMS tree.
Unpacking zfs-dkms (2.1.11-1) over (2.1.11-1) ...
Setting up zfs-dkms (2.1.11-1) ...
Loading new zfs-2.1.11 DKMS files...
Building for 6.2.16-3-pve 6.2.16-8-pve
Module build for kernel 6.2.16-3-pve was skipped since the
kernel headers for this kernel do not seem to be installed.
Module build for kernel 6.2.16-8-pve was skipped since the
kernel headers for this kernel do not seem to be installed.
Processing triggers for initramfs-tools (0.142) ...
update-initramfs: Generating /boot/initrd.img-6.2.16-8-pve
Running hook script 'zz-proxmox-boot'..
Re-executing '/etc/kernel/postinst.d/zz-proxmox-boot' in new private mount namespace..
Copying and configuring kernels on /dev/disk/by-uuid/C49A-DEF1
	Copying kernel and creating boot-entry for 6.2.16-3-pve
	Copying kernel and creating boot-entry for 6.2.16-8-pve
Copying and configuring kernels on /dev/disk/by-uuid/C49B-247F
	Copying kernel and creating boot-entry for 6.2.16-3-pve
	Copying kernel and creating boot-entry for 6.2.16-8-pve

Yup, I think so! The patch should work, unless there’s some issue in my build instructions (if it doesn’t, I’ll fix the instructions :) )
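
One thing worth double-checking, though: the dpkg output above says the module build was skipped for both kernels because their headers are not installed, so the patched module may not actually have been built yet. A minimal sketch for pulling the affected kernel versions out of such a log (skipped_kernels is a made-up helper name; it just pattern-matches the DKMS message shown above):

```shell
#!/bin/sh
# skipped_kernels is a hypothetical helper for illustration only.
# It reads dpkg/DKMS output on stdin and prints the kernel version from
# every "Module build for kernel ... was skipped" line.
skipped_kernels() {
    sed -n 's/^Module build for kernel \([^ ]*\) was skipped.*$/\1/p'
}

# Example using the lines from the output above.
skipped_kernels <<'EOF'
Building for 6.2.16-3-pve 6.2.16-8-pve
Module build for kernel 6.2.16-3-pve was skipped since the
kernel headers for this kernel do not seem to be installed.
Module build for kernel 6.2.16-8-pve was skipped since the
kernel headers for this kernel do not seem to be installed.
EOF
```

If that prints anything for the kernel you boot, install the matching kernel headers and run the dpkg -i step again. Afterwards, dkms status should show the zfs module as installed for that kernel.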