For the past few weeks, I’ve been trying to debug an issue regarding ZFS crashes on my Sapphire Rapids machine (w9-3495X), usually with a VM running. The crash looks something like this:
[ 70.390218] BUG: unable to handle page fault for address: ffff96efe92cdcff
[ 70.390221] #PF: supervisor write access in kernel mode
[ 70.390222] #PF: error_code(0x0003) - permissions violation
[ 70.390223] PGD 807f601067 P4D 807f601067 PUD 1085d8063 PMD 132d13063 PTE 80000001292cd161
[ 70.390226] Oops: 0003 [#1] PREEMPT SMP NOPTI
[ 70.390228] CPU: 0 PID: 1671 Comm: z_rd_int_8 Tainted: P O 6.2.15_2 #1
[ 70.390230] Hardware name: Supermicro Super Server/X13SWA-TF, BIOS 1.1 02/15/2023
[ 70.390231] RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]
[ 70.390237] Code: 00 e8 f5 46 c2 db fa 0f 1f 44 00 00 48 8b 1d 98 af 00 00 e8 c3 77 68 dc 89 c0 48 8b 0c c3 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5b c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[ 70.390238] RSP: 0018:ffffbbb848b5b908 EFLAGS: 00010086
[ 70.390240] RAX: 00000000ffffffff RBX: ffff96efdae5ac00 RCX: ffff96efe92cb000
[ 70.390241] RDX: 00000000ffffffff RSI: ffffffff9d10adb1 RDI: ffffffff9d1191e3
[ 70.390242] RBP: ffff96f11a190000 R08: ffffbbb848b5ba20 R09: ffffffffc0e288a0
[ 70.390242] R10: 0000000000000000 R11: 00000000c12c4cc9 R12: ffffbbb848b5ba40
[ 70.390243] R13: 0000000000010000 R14: 0000000000000000 R15: 0000000000000008
[ 70.390244] FS: 0000000000000000(0000) GS:ffff976d3a200000(0000) knlGS:0000000000000000
[ 70.390245] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 70.390246] CR2: ffff96efe92cdcff CR3: 000000807e810005 CR4: 0000000000772ef0
[ 70.390247] PKRU: 55555554
[ 70.390247] Call Trace:
[ 70.390249] <TASK>
[ 70.390251] fletcher_4_avx512f_native+0x18/0xa0 [zcommon]
[ 70.390254] ? abd_fletcher_4_iter+0x64/0xc0 [zcommon]
[ 70.390259] ? abd_iterate_func+0xe8/0x1c0 [zfs]
[ 70.390321] ? __pfx_abd_fletcher_4_iter+0x10/0x10 [zcommon]
[ 70.390325] ? abd_fletcher_4_native+0x7b/0xc0 [zfs]
[ 70.390369] ? preempt_count_add+0x47/0xa0
[ 70.390373] ? abd_iter_map+0x6c/0xb0 [zfs]
[ 70.390411] ? abd_copy_to_buf_off_cb+0x1b/0x30 [zfs]
[ 70.390438] ? abd_iter_advance+0x3f/0x70 [zfs]
[ 70.390474] ? abd_iterate_func+0x10d/0x1c0 [zfs]
[ 70.390500] ? zio_checksum_error_impl+0xfb/0x630 [zfs]
[ 70.390536] ? __pfx_abd_fletcher_4_native+0x10/0x10 [zfs]
[ 70.390570] ? update_load_avg+0x7e/0x780
[ 70.390573] ? newidle_balance+0x2eb/0x420
[ 70.390575] ? vdev_queue_io_to_issue+0x3db/0xc40 [zfs]
[ 70.390622] ? zio_checksum_error+0x64/0xd0 [zfs]
[ 70.390659] ? zio_checksum_verify+0x36/0x140 [zfs]
[ 70.390698] ? preempt_count_add+0x6a/0xa0
[ 70.390699] ? _raw_spin_lock+0x13/0x40
[ 70.390702] ? _raw_spin_unlock+0x12/0x30
[ 70.390703] ? zio_wait_for_children+0x88/0xc0 [zfs]
[ 70.390740] ? zio_vdev_io_assess+0x20/0x2d0 [zfs]
[ 70.390776] ? zio_execute+0x80/0x120 [zfs]
[ 70.390812] ? taskq_thread+0x2e0/0x510 [spl]
[ 70.390817] ? __pfx_default_wake_function+0x10/0x10
[ 70.390819] ? __pfx_zio_execute+0x10/0x10 [zfs]
[ 70.390855] ? __pfx_taskq_thread+0x10/0x10 [spl]
[ 70.390859] ? kthread+0xe6/0x110
[ 70.390861] ? __pfx_kthread+0x10/0x10
[ 70.390862] ? ret_from_fork+0x29/0x50
[ 70.390865] </TASK>
[ 70.390865] Modules linked in: snd_seq_dummy snd_hrtimer xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables bridge stp llc bnep msr nls_iso8859_1 nls_cp437 vfat fat hid_logitech_hidpp snd_usb_audio hid_logitech_dj snd_usbmidi_lib snd_rawmidi mc mei_hdcp mei_pxp ipmi_ssif snd_sof_pci_intel_tgl intel_rapl_msr intel_rapl_common snd_sof_intel_hda_common intel_uncore_frequency intel_uncore_frequency_common soundwire_intel soundwire_generic_allocation i10nm_edac soundwire_cadence nfit snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp x86_pkg_temp_thermal intel_powerclamp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_bus snd_soc_core kvm_intel snd_hda_codec_realtek snd_hda_codec_generic snd_compress i2c_designware_platform i2c_designware_core
[ 70.390894] snd_hda_codec_hdmi ledtrig_audio snd_pcm_dmaengine ac97_bus kvm snd_hda_intel snd_intel_dspcfg iTCO_wdt intel_pmc_bxt snd_intel_sdw_acpi snd_hda_codec mei_gsc irqbypass snd_hda_core intel_lpss_pci pmt_crashlog iTCO_vendor_support rndis_host pmt_telemetry mei_me rapl snd_hwdep intel_lpss acpi_ipmi i2c_i801 atlantic igb intel_sdsi idma64 cdc_ether spi_intel_pci pmt_class idxd intel_cstate pcspkr tpm_crb intel_vsec idxd_bus input_leds usbnet macsec snd_pcm aquacomputer_d5next dca virt_dma mei spi_intel i2c_smbus ipmi_si ipmi_devintf ipmi_msghandler tpm_tis tpm_tis_core tpm tiny_power_button rng_core evdev joydev button mac_hid nct6775 nct6775_core hwmon_vid coretemp sg snd_seq snd_seq_device snd_timer snd soundcore vhost_vsock vmw_vsock_virtio_transport_common vsock vhost_net vhost vhost_iotlb tap uhid hci_vhci bluetooth ecdh_generic rfkill ecc crc16 vfio_iommu_type1 vfio iommufd dm_mod uinput userio ppp_generic slhc tun loop nvram btrfs blake2b_generic xor raid6_pq libcrc32c
[ 70.390936] crc32c_generic cuse fuse usbkbd hid_generic usbmouse usbhid hid i915 zfs(PO) zunicode(PO) zzstd(O) drm_buddy intel_gtt zlua(O) video zavl(PO) wmi crct10dif_pclmul ast icp(PO) crc32_pclmul drm_display_helper drm_shmem_helper crc32c_intel i2c_algo_bit polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel cec sha512_ssse3 drm_kms_helper xhci_pci ahci xhci_pci_renesas rc_core zcommon(PO) syscopyarea aesni_intel znvpair(PO) libahci sysfillrect ttm crypto_simd sysimgblt xhci_hcd spl(O) libata cryptd drm usbcore scsi_mod agpgart scsi_common usb_common
[ 70.390960] CR2: ffff96efe92cdcff
[ 70.390961] ---[ end trace 0000000000000000 ]---
[ 70.435252] RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]
[ 70.435259] Code: 00 e8 f5 46 c2 db fa 0f 1f 44 00 00 48 8b 1d 98 af 00 00 e8 c3 77 68 dc 89 c0 48 8b 0c c3 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5b c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[ 70.435261] RSP: 0018:ffffbbb848b5b908 EFLAGS: 00010086
[ 70.435262] RAX: 00000000ffffffff RBX: ffff96efdae5ac00 RCX: ffff96efe92cb000
[ 70.435264] RDX: 00000000ffffffff RSI: ffffffff9d10adb1 RDI: ffffffff9d1191e3
[ 70.435265] RBP: ffff96f11a190000 R08: ffffbbb848b5ba20 R09: ffffffffc0e288a0
[ 70.435266] R10: 0000000000000000 R11: 00000000c12c4cc9 R12: ffffbbb848b5ba40
[ 70.435267] R13: 0000000000010000 R14: 0000000000000000 R15: 0000000000000008
[ 70.435268] FS: 0000000000000000(0000) GS:ffff976d3a200000(0000) knlGS:0000000000000000
[ 70.435269] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 70.435270] CR2: ffff96efe92cdcff CR3: 000000807e810005 CR4: 0000000000772ef0
[ 70.435272] PKRU: 55555554
[ 70.435273] note: z_rd_int_8[1671] exited with irqs disabled
[ 70.435280] note: z_rd_int_8[1671] exited with preempt_count 2
The issue is very inconsistent: it only happens during some reboots, and never happens in some others, regardless of how many times I’ve spun up and down a VM. When it happens, it happens with the first VM I started, and so far, only when avx512f
is selected as fletcher4 implementation.
There are no errors in IPMI event logs, and I’m already running an ECC memory with no overlocks. My first suspect would be fletcher_4_avx512f_native
implementation, so I’ve filed a bug upstream accordingly…
Well, today I saw another crash, but with a different kind of error:
[ 1696.497577] BUG: unable to handle page fault for address: ffff9e7989271cff
[ 1696.497583] #PF: supervisor write access in kernel mode
[ 1696.497585] #PF: error_code(0x0003) - permissions violation
[ 1696.497586] PGD 807f601067 P4D 807f601067 PUD 10855b063 PMD 11c71a063 PTE 8000000109271061
[ 1696.497589] Oops: 0003 [#1] PREEMPT SMP NOPTI
[ 1696.497592] CPU: 1 PID: 67531 Comm: z_wr_iss Tainted: P O 6.2.15_2 #1
[ 1696.497594] Hardware name: Supermicro Super Server/X13SWA-TF, BIOS 1.1a 04/25/2023
[ 1696.497595] RIP: 0010:kfpu_begin+0x2f/0x70 [icp]
[ 1696.497607] Code: 00 e8 35 b7 55 e6 fa 0f 1f 44 00 00 48 8b 1d d8 0f c6 ff e8 03 e8 fb e6 89 c0 48 8b 0c c3 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5b c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[ 1696.497608] RSP: 0018:ffffad15e50075b0 EFLAGS: 00010096
[ 1696.497610] RAX: 00000000ffffffff RBX: ffff9e7984dae400 RCX: ffff9e798926f000
[ 1696.497611] RDX: 00000000ffffffff RSI: ffffffffa810adb1 RDI: ffffffffa81191e3
[ 1696.497612] RBP: ffff9e7ba31c0000 R08: 0000000000000000 R09: 0000000007af41f3
[ 1696.497613] R10: ffff9ef6fa272de8 R11: 0000000000000140 R12: 0000000000000000
[ 1696.497614] R13: 0000000000007fe0 R14: 0000000000000010 R15: ffffad15e5007980
[ 1696.497615] FS: 0000000000000000(0000) GS:ffff9ef6fa240000(0000) knlGS:0000000000000000
[ 1696.497616] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1696.497617] CR2: ffff9e7989271cff CR3: 000000807e810006 CR4: 0000000000772ee0
[ 1696.497618] PKRU: 55555554
[ 1696.497619] Call Trace:
[ 1696.497621] <TASK>
[ 1696.497623] gcm_mode_encrypt_contiguous_blocks+0x552/0x7c0 [icp]
[ 1696.497631] ? __pfx_aes_encrypt_block+0x10/0x10 [icp]
[ 1696.497639] ? __pfx_aes_encrypt_contiguous_blocks+0x10/0x10 [icp]
[ 1696.497645] aes_encrypt_contiguous_blocks+0xac/0xc0 [icp]
[ 1696.497651] ? __pfx_aes_copy_block+0x10/0x10 [icp]
[ 1696.497657] ? __pfx_aes_xor_block+0x10/0x10 [icp]
[ 1696.497662] crypto_update_uio+0xe8/0x140 [icp]
[ 1696.497671] aes_encrypt_atomic+0x149/0x330 [icp]
[ 1696.497679] crypto_encrypt+0x281/0x350 [icp]
[ 1696.497686] ? zio_crypt_bp_do_hmac_updates+0x6b/0xa0 [zfs]
[ 1696.497744] ? i_mod_hash_find_nosync+0x5f/0xa0 [icp]
[ 1696.497752] ? preempt_count_add+0x47/0xa0
[ 1696.497755] ? up_read+0x37/0x70
[ 1696.497759] zio_do_crypt_uio+0x1dc/0x2d0 [zfs]
[ 1696.497800] zio_do_crypt_data+0x273/0x10a0 [zfs]
[ 1696.497835] ? preempt_count_add+0x47/0xa0
[ 1696.497837] ? up_read+0x37/0x70
[ 1696.497839] spa_do_crypt_abd+0x150/0x330 [zfs]
[ 1696.497897] zio_encrypt+0x300/0x730 [zfs]
[ 1696.497955] zio_execute+0x80/0x120 [zfs]
[ 1696.498007] taskq_thread+0x2e0/0x510 [spl]
[ 1696.498014] ? __pfx_default_wake_function+0x10/0x10
[ 1696.498016] ? __pfx_zio_execute+0x10/0x10 [zfs]
[ 1696.498066] ? __pfx_taskq_thread+0x10/0x10 [spl]
[ 1696.498071] kthread+0xe6/0x110
[ 1696.498074] ? __pfx_kthread+0x10/0x10
[ 1696.498076] ret_from_fork+0x29/0x50
[ 1696.498079] </TASK>
[ 1696.498080] Modules linked in: snd_seq_dummy snd_hrtimer xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables bridge stp llc bnep msr nls_iso8859_1 nls_cp437 vfat fat hid_logitech_hidpp hid_logitech_dj mei_hdcp mei_pxp ipmi_ssif snd_sof_pci_intel_tgl snd_sof_intel_hda_common intel_rapl_msr intel_rapl_common soundwire_intel soundwire_generic_allocation intel_uncore_frequency intel_uncore_frequency_common soundwire_cadence snd_sof_intel_hda snd_sof_pci i10nm_edac snd_sof_xtensa_dsp nfit snd_sof x86_pkg_temp_thermal intel_powerclamp snd_sof_utils snd_soc_hdac_hda snd_hda_codec_realtek snd_hda_ext_core snd_soc_acpi_intel_match snd_hda_codec_generic snd_soc_acpi soundwire_bus ledtrig_audio snd_hda_codec_hdmi kvm_intel snd_soc_core snd_compress snd_pcm_dmaengine uvcvideo ac97_bus videobuf2_vmalloc
[ 1696.498116] videobuf2_memops videobuf2_v4l2 i2c_designware_platform snd_hda_intel kvm i2c_designware_core videodev snd_intel_dspcfg snd_intel_sdw_acpi snd_usb_audio rndis_host iTCO_wdt snd_hda_codec pmt_telemetry videobuf2_common snd_usbmidi_lib cdc_ether intel_pmc_bxt irqbypass pmt_crashlog intel_sdsi pmt_class iTCO_vendor_support snd_hda_core snd_rawmidi acpi_ipmi mei_gsc rapl atlantic igb idxd intel_lpss_pci joydev input_leds mc aquacomputer_d5next usbnet pcspkr intel_vsec snd_hwdep mei_me idxd_bus intel_cstate ipmi_si intel_lpss dca macsec ipmi_devintf snd_pcm i2c_i801 idma64 evdev spi_intel_pci mei virt_dma spi_intel i2c_smbus tpm_crb mac_hid ipmi_msghandler tpm_tis tpm_tis_core tiny_power_button tpm rng_core button nct6775 nct6775_core hwmon_vid coretemp sg snd_seq snd_seq_device snd_timer snd soundcore vhost_vsock vmw_vsock_virtio_transport_common vsock vhost_net vhost vhost_iotlb tap uhid hci_vhci bluetooth ecdh_generic rfkill ecc crc16 vfio_iommu_type1 vfio iommufd dm_mod
[ 1696.498165] uinput userio ppp_generic slhc tun loop nvram btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_generic cuse fuse usbmouse usbkbd hid_generic usbhid hid zfs(PO) i915 zunicode(PO) zzstd(O) drm_buddy intel_gtt zlua(O) video wmi zavl(PO) crct10dif_pclmul crc32_pclmul ast crc32c_intel icp(PO) polyval_clmulni drm_display_helper drm_shmem_helper polyval_generic i2c_algo_bit gf128mul ghash_clmulni_intel drm_kms_helper cec sha512_ssse3 ahci rc_core xhci_pci syscopyarea libahci aesni_intel zcommon(PO) sysfillrect xhci_pci_renesas ttm sysimgblt znvpair(PO) crypto_simd xhci_hcd spl(O) libata cryptd drm usbcore scsi_mod agpgart scsi_common usb_common
[ 1696.498201] CR2: ffff9e7989271cff
[ 1696.498203] ---[ end trace 0000000000000000 ]---
[ 1696.541839] RIP: 0010:kfpu_begin+0x2f/0x70 [icp]
[ 1696.541852] Code: 00 e8 35 b7 55 e6 fa 0f 1f 44 00 00 48 8b 1d d8 0f c6 ff e8 03 e8 fb e6 89 c0 48 8b 0c c3 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5b c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[ 1696.541854] RSP: 0018:ffffad15e50075b0 EFLAGS: 00010096
[ 1696.541855] RAX: 00000000ffffffff RBX: ffff9e7984dae400 RCX: ffff9e798926f000
[ 1696.541857] RDX: 00000000ffffffff RSI: ffffffffa810adb1 RDI: ffffffffa81191e3
[ 1696.541858] RBP: ffff9e7ba31c0000 R08: 0000000000000000 R09: 0000000007af41f3
[ 1696.541859] R10: ffff9ef6fa272de8 R11: 0000000000000140 R12: 0000000000000000
[ 1696.541860] R13: 0000000000007fe0 R14: 0000000000000010 R15: ffffad15e5007980
[ 1696.541861] FS: 0000000000000000(0000) GS:ffff9ef6fa240000(0000) knlGS:0000000000000000
[ 1696.541863] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1696.541864] CR2: ffff9e7989271cff CR3: 000000807e810006 CR4: 0000000000772ee0
[ 1696.541865] PKRU: 55555554
[ 1696.541866] note: z_wr_iss[67531] exited with irqs disabled
[ 1696.541885] note: z_wr_iss[67531] exited with preempt_count 1
This time avx512bw
was selected as fletcher4 implementation. Even though this is in a different module (icp versus zcommon), I think it may be related to the previous errors, as both have kfpu_begin
as a RIP
value.
Since I still have some kernel logs saved, I ran grep RIP:
on these logs:
@40000000647888df3b850d8c.u:2023-06-01T11:55:47.87229 kern.warn: [ 71.907896] RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]
@40000000647888df3b850d8c.u:2023-06-01T11:55:47.87298 kern.warn: [ 71.911387] RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]
@40000000647c77942668e89c.u:2023-06-04T11:30:07.93488 kern.warn: [ 206.201241] RIP: 0010:kfpu_begin+0x2f/0x70 [icp]
@40000000647c77942668e89c.u:2023-06-04T11:30:07.95871 kern.warn: [ 206.712136] RIP: 0010:kfpu_begin+0x2f/0x70 [icp]
@40000000648ddeaf18f2e104.s:2023-06-17T05:56:42.01656 kern.warn: [ 70.390231] RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]
@40000000648ddeaf18f2e104.s:2023-06-17T05:56:42.01711 kern.warn: [ 70.435252] RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]
@4000000064904914229b9c8c.s:2023-06-18T15:12:50.62505 kern.warn: [78747.699749] RIP: 0010:i915_gem_object_pin_map+0x1e2/0x220 [i915]
@4000000064904914229b9c8c.s:2023-06-18T15:12:50.62557 kern.warn: [78747.700090] RIP: 0033:0x7fa94d2b95ab
@4000000064904914229b9c8c.s:2023-06-18T15:12:50.62587 kern.warn: [78747.700179] RIP: 0010:dma_buf_vmap+0xce/0xe0
@4000000064904914229b9c8c.s:2023-06-18T15:12:50.62636 kern.warn: [78747.700342] RIP: 0033:0x7fa94d2b95ab
@4000000064904914229b9c8c.s:2023-06-19T12:09:24.30436 kern.warn: [ 143.711963] RIP: 0010:kfpu_end+0x33/0xb0 [zcommon]
@4000000064904914229b9c8c.s:2023-06-19T12:09:24.30507 kern.warn: [ 143.914878] RIP: 0010:kfpu_end+0x33/0xb0 [zcommon]
@4000000064906914328f32fc.u:2023-06-19T14:24:27.53029 kern.warn: [ 1696.497595] RIP: 0010:kfpu_begin+0x2f/0x70 [icp]
@4000000064906914328f32fc.u:2023-06-19T14:24:27.53098 kern.warn: [ 1696.541839] RIP: 0010:kfpu_begin+0x2f/0x70 [icp]
(The i915 one is unrelated)
It turns out I had the similar crash to today’s back on June 4th, but all of them have RIP values of either kfpu_begin
or kfpu_end
, which handles XSAVES
/XRSTORS
for FPU/SIMD context save/restore.
Now looking at Xeon 4th Generation Errata, there appears to be some issue regarding XSAVES (although this one is related to AMX TILEDATA) and XRSTORS, but I’m unsure if they’re related to how ZFS uses them.
I ran out of ideas of where else to look. Does anyone else here running ZFS on Sapphire Rapids(-WS) (possibly also as a VM host) saw this crash?
(If this is indeed the issue, one workaround is to probably add noxsave
to kernel parameters, but that also disables AVX…)