ZFS Kernel Panic on heavy usage

I have 2 ZFS pools on an Unraid 6.8.3 server. I run an nzbget Docker container, and when it downloads (averaging around 800 Mbps) and unpacks at the same time, I get a kernel panic and the pool stops responding. Below is an example of the log output. This pool consists of two raidz vdevs of four 2 TB disks each (WD Red). From my research it seems to be triggered by high throughput? Is there something I can do to fix it, or should I move off ZFS for stability?

Apr 2 01:11:48 Tower kernel: PANIC: zfs: accessing past end of object e26/543cf (size=6656 access=6308+1033)
Apr 2 01:11:48 Tower kernel: Showing stack for process 25214
Apr 2 01:11:48 Tower kernel: CPU: 2 PID: 25214 Comm: nzbget Tainted: P O 4.19.107-Unraid #1
Apr 2 01:11:48 Tower kernel: Hardware name: ASUSTeK COMPUTER INC. Z9PE-D16 Series/Z9PE-D16 Series, BIOS 5601 06/11/2015
Apr 2 01:11:48 Tower kernel: Call Trace:
Apr 2 01:11:48 Tower kernel: dump_stack+0x67/0x83
Apr 2 01:11:48 Tower kernel: vcmn_err+0x8b/0xd4 [spl]
Apr 2 01:11:48 Tower kernel: ? spl_kmem_alloc+0xc9/0xfa [spl]
Apr 2 01:11:48 Tower kernel: ? _cond_resched+0x1b/0x1e
Apr 2 01:11:48 Tower kernel: ? mutex_lock+0xa/0x25
Apr 2 01:11:48 Tower kernel: ? dbuf_find+0x130/0x14c [zfs]
Apr 2 01:11:48 Tower kernel: ? _cond_resched+0x1b/0x1e
Apr 2 01:11:48 Tower kernel: ? mutex_lock+0xa/0x25
Apr 2 01:11:48 Tower kernel: ? arc_buf_access+0x69/0x1f4 [zfs]
Apr 2 01:11:48 Tower kernel: ? _cond_resched+0x1b/0x1e
Apr 2 01:11:48 Tower kernel: zfs_panic_recover+0x67/0x7e [zfs]
Apr 2 01:11:48 Tower kernel: ? spl_kmem_zalloc+0xd4/0x107 [spl]
Apr 2 01:11:48 Tower kernel: dmu_buf_hold_array_by_dnode+0x92/0x3b6 [zfs]
Apr 2 01:11:48 Tower kernel: dmu_write_uio_dnode+0x46/0x11d [zfs]
Apr 2 01:11:48 Tower kernel: ? txg_rele_to_quiesce+0x24/0x32 [zfs]
Apr 2 01:11:48 Tower kernel: dmu_write_uio_dbuf+0x48/0x5e [zfs]
Apr 2 01:11:48 Tower kernel: zfs_write+0x6a3/0xbe8 [zfs]
Apr 2 01:11:48 Tower kernel: zpl_write_common_iovec+0xae/0xef [zfs]
Apr 2 01:11:48 Tower kernel: zpl_iter_write+0xdc/0x10d [zfs]
Apr 2 01:11:48 Tower kernel: do_iter_readv_writev+0x110/0x146
Apr 2 01:11:48 Tower kernel: do_iter_write+0x86/0x15c
Apr 2 01:11:48 Tower kernel: vfs_writev+0x90/0xe2
Apr 2 01:11:48 Tower kernel: ? list_lru_add+0x63/0x13a
Apr 2 01:11:48 Tower kernel: ? vfs_ioctl+0x19/0x26
Apr 2 01:11:48 Tower kernel: ? do_vfs_ioctl+0x533/0x55d
Apr 2 01:11:48 Tower kernel: ? syscall_trace_enter+0x163/0x1aa
Apr 2 01:11:48 Tower kernel: do_writev+0x6b/0xe2
Apr 2 01:11:48 Tower kernel: do_syscall_64+0x57/0xf2
Apr 2 01:11:48 Tower kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr 2 01:11:48 Tower kernel: RIP: 0033:0x14c478acbf90
Apr 2 01:11:48 Tower kernel: Code: 89 74 24 10 48 89 e5 48 89 04 24 49 29 c6 48 89 54 24 18 4c 89 74 24 08 49 01 d6 48 63 7b 78 49 63 d7 4c 89 e8 48 89 ee 0f 05 <48> 89 c7 e8 1b 85 fd ff 49 39 c6 75 19 48 8b 43 58 48 8b 53 60 48
Apr 2 01:11:48 Tower kernel: RSP: 002b:000014c478347640 EFLAGS: 00000216 ORIG_RAX: 0000000000000014
Apr 2 01:11:48 Tower kernel: RAX: ffffffffffffffda RBX: 0000558040d4e920 RCX: 000014c478acbf90
Apr 2 01:11:48 Tower kernel: RDX: 0000000000000002 RSI: 000014c478347640 RDI: 0000000000000005
Apr 2 01:11:48 Tower kernel: RBP: 000014c478347640 R08: 0000000000000001 R09: 000014c478b15873
Apr 2 01:11:48 Tower kernel: R10: 0000000000000006 R11: 0000000000000216 R12: 000000000000000b
Apr 2 01:11:48 Tower kernel: R13: 0000000000000014 R14: 0000000000000409 R15: 0000000000000002


How does it compare to the "Panic: zfs: accessing past end of object" reports on the ZoL GitHub issue tracker?
It might help to post a bug report, if it is consistent?
Looks like it has been intermittent with non-ECC memory, and has been noticed on and off in the past.

There are a couple of things you might try. One is setting

zfs_recover = 1

as a temporary kludge to keep the pool up while it is hitting these errors.
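A rough sketch of how to flip it at runtime (assuming the parameter is exposed under sysfs, which it is on ZFS on Linux; note that Unraid runs from RAM, so this will not survive a reboot and would need to be re-applied, e.g. from your go file):

    # turn the zfs_panic_recover() path into a warning instead of a panic
    echo 1 > /sys/module/zfs/parameters/zfs_recover
    # confirm it took effect
    cat /sys/module/zfs/parameters/zfs_recover
    # on a regular distro you could persist it across reboots with a modprobe option:
    # echo "options zfs zfs_recover=1" > /etc/modprobe.d/zfs.conf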

Apart from that, just make sure the ZFS software on the box is up to date (not the pool itself, just the OS-side kernel module and userland tools).
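A quick way to check what is actually loaded (a hedged sketch; the zfs version subcommand only exists in 0.8 and newer):

    cat /sys/module/zfs/version     # version of the loaded ZFS kernel module
    zfs version                     # userland and kernel versions, on 0.8+
    dmesg | grep -i "zfs: loaded"   # the module load banner also prints the version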

This looks like it addresses the issue. I could test it if I knew how to update the ZFS plugin on Unraid…


Looks like the author of the plugin does update it often. Maybe leave a comment on the thread he has open, explain the issue you have and the PR you found, and see if he can/will update it, or can suggest a different course or tip in the meantime?

Posted maybe an hour ago:

https://forums.unraid.net/topic/41333-zfs-plugin-for-unraid/page/14/

I originally posted the issue in that thread (page 13). I just posted again with the PR. Thanks for your help!


Nice, I see it now, and here's hoping the guy can help.
Would you post back here if it gets fixed, for posterity?

Didn't work. Still got the panic. Will be posting to their GitHub.
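For the report I'll gather the usual environment details first (a rough checklist; exact paths may differ on Unraid):

    uname -a                            # kernel: 4.19.107-Unraid
    cat /sys/module/zfs/version         # loaded ZFS module version
    zpool status                        # pool layout and health
    grep "PANIC: zfs" /var/log/syslog   # the panic lines shown above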