Hi all,
For the past few days I’ve been struggling with XFS metadata corruption errors, and I’m not sure how to narrow the problem down, whether I should try switching to another filesystem like ext4/btrfs/zfs, or whether I’ve simply set some parameters incorrectly. I have a backup of roughly 99% of my data, as well as my previous drive array (both of which are also XFS, which may itself be cause for concern), so while I am worried about data loss, I should be able to restore everything. The filesystem holds mostly media that is accessed frequently and scanned for changes, plus some cold storage, and it is exported as an NFS share. It tends to see lots of small I/O but also large bursts, such as when I add new media and then back it up.
I am running five 6TB Seagate IronWolf drives in RAID 5 under mdadm, with the XFS filesystem on an LVM logical volume whose PV was created on the mdadm array. The drives are passed through to a QEMU/KVM VM managed by oVirt. They were previously in my backup server (physical hardware, not a VM) before I put bigger drives in that box.
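For completeness, the stack inside the VM looks like this (a rough sketch from lsblk; device names are as seen in my setup):

lsblk -o NAME,TYPE,SIZE /dev/sd[b-f]
# each of sdb1..sdf1 is a member of md127 (raid5), which holds the LVM PV,
# and the lv_raid logical volume carries the XFS filesystem mounted at /mnt/raid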
uname -a
Linux media 4.10.11-1.el7.elrepo.x86_64 #1 SMP Tue Apr 18 12:41:24 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
The array was created with mdadm --create --verbose /dev/md0 --level=5 --raid-devices=5 /dev/sd[b-f]1 (it has since been auto-assembled as /dev/md127).
mdadm --detail /dev/md127
/dev/md127:
Version : 1.2
Creation Time : Sun Oct 8 22:13:33 2017
Raid Level : raid5
Array Size : 23441561600 (22355.62 GiB 24004.16 GB)
Used Dev Size : 5860390400 (5588.90 GiB 6001.04 GB)
Raid Devices : 5
Total Devices : 5
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Thu Jan 25 10:21:30 2018
State : clean
Active Devices : 5
Working Devices : 5
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : unknown
Name : media:0
UUID : daa32da3:34b4292b:c99b9bc5:ee4bba9a
Events : 6965
Number Major Minor RaidDevice State
0 8 49 0 active sync /dev/sdd1
1 8 17 1 active sync /dev/sdb1
2 8 33 2 active sync /dev/sdc1
3 8 65 3 active sync /dev/sde1
5 8 81 4 active sync /dev/sdf1
cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sdf1[5] sdc1[2] sdd1[0] sde1[3] sdb1[1]
23441561600 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
bitmap: 0/44 pages [0KB], 65536KB chunk
unused devices: <none>
The filesystem was created with mkfs.xfs -f -d su=512k,sw=4 /dev/media_raid/lv_raid
and mounted with default options via fstab: /dev/mapper/media_raid-lv_raid /mnt/raid xfs defaults 0 0
xfs_info /mnt/raid
meta-data=/dev/mapper/media_raid-lv_raid isize=512 agcount=32, agsize=183137152 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=0 spinodes=0
data = bsize=4096 blocks=5860388864, imaxpct=5
= sunit=128 swidth=512 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal bsize=4096 blocks=521728, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
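As a sanity check on the geometry, the sunit/swidth values above (reported in 4k blocks) line up with the md chunk size and data-disk count:

xfs_info /mnt/raid | grep sunit
# sunit=128 -> 128 * 4KiB = 512KiB, matching the 512K md chunk
# swidth=512 -> 512 / 128 = 4 stripe units, matching 4 data disks (5 drives - 1 parity)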
smartctl is not indicating any errors with the drives, and by the usual metrics Backblaze tracks for failing drives they also look healthy, as they should since they’re not that old. I don’t believe this issue is related to the drives themselves.
for f in {b,c,d,e,f} ; do echo $f ; sudo smartctl -a /dev/sd${f} | egrep " 5 |187 |188 |189 |197 |198 " | grep -v "Not_testing" ; echo ; done
b
LU WWN Device Id: 5 000c50 0a4881e11
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
c
LU WWN Device Id: 5 000c50 0a4894aca
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
d
LU WWN Device Id: 5 000c50 0a487859c
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
e
LU WWN Device Id: 5 000c50 0a48ce793
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
f
LU WWN Device Id: 5 000c50 0a4887eb8
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
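Since SMART looks clean, one thing I plan to try is an md consistency check to verify parity across the array (md127 is my array name; this is just the standard sysfs interface):

echo check > /sys/block/md127/md/sync_action
# after the check completes, a non-zero count points at parity inconsistencies:
cat /sys/block/md127/md/mismatch_cnt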
Here’s a snippet of the errors that came up this morning; I don’t usually see these Buffer I/O errors.
Jan 25 10:21:27 media kernel: XFS (dm-2): metadata I/O error: block 0x165ffbde0 ("xfs_trans_read_buf_map") error 74 numblks 8
Jan 25 10:21:27 media kernel: XFS (dm-2): page discard on page ffffea0006f81bc0, inode 0x2150db0bd, offset 346996736.
Jan 25 10:21:29 media kernel: XFS (dm-2): Corruption warning: Metadata has LSN (-1614377583:-2057253873) ahead of current LSN (13:564048). Please unmount and run xfs_repair (>= v4.3) to resolve.
Jan 25 10:21:29 media kernel: XFS (dm-2): Metadata CRC error detected at xfs_allocbt_read_verify+0x6c/0xc0 [xfs], xfs_allocbt block 0x165ffbde0
Jan 25 10:21:29 media kernel: XFS (dm-2): Unmount and run xfs_repair
Jan 25 10:21:29 media kernel: XFS (dm-2): First 64 bytes of corrupted metadata buffer:
Jan 25 10:21:29 media kernel: ffff880141088000: cf db f3 3d 0c a5 e8 c9 bf 21 24 6b f4 a7 35 1c ...=.....!$k..5.
Jan 25 10:21:29 media kernel: ffff880141088010: 61 a5 16 34 94 d9 ed 5f 9f c6 8d 91 85 60 cc 0f a..4..._.....`..
Jan 25 10:21:29 media kernel: ffff880141088020: b5 19 b7 bb 2e e1 5d 42 73 37 93 c4 e3 0c af 7a ......]Bs7.....z
Jan 25 10:21:29 media kernel: ffff880141088030: 56 e0 4e cd bf db d2 11 66 f6 aa 2d db 20 b5 52 V.N.....f..-. .R
Jan 25 10:21:29 media kernel: XFS (dm-2): metadata I/O error: block 0x165ffbde0 ("xfs_trans_read_buf_map") error 74 numblks 8
Jan 25 10:21:29 media kernel: XFS (dm-2): xfs_do_force_shutdown(0x1) called from line 315 of file fs/xfs/xfs_trans_buf.c. Return address = 0xffffffffa02a9d7f
Jan 25 10:21:29 media kernel: XFS (dm-2): I/O Error Detected. Shutting down filesystem
Jan 25 10:21:29 media kernel: XFS (dm-2): Please umount the filesystem and rectify the problem(s)
Jan 25 10:21:29 media kernel: XFS (dm-2): xfs_do_force_shutdown(0x8) called from line 236 of file fs/xfs/libxfs/xfs_defer.c. Return address = 0xffffffffa024c087
Jan 25 10:21:29 media kernel: Buffer I/O error on dev dm-2, logical block 451540303, lost async page write
Jan 25 10:21:29 media kernel: Buffer I/O error on dev dm-2, logical block 451540304, lost async page write
Jan 25 10:21:29 media kernel: Buffer I/O error on dev dm-2, logical block 451540305, lost async page write
Jan 25 10:21:29 media kernel: Buffer I/O error on dev dm-2, logical block 451540306, lost async page write
Jan 25 10:21:29 media kernel: Buffer I/O error on dev dm-2, logical block 451540307, lost async page write
Jan 25 10:21:29 media kernel: Buffer I/O error on dev dm-2, logical block 451540308, lost async page write
Jan 25 10:21:29 media kernel: Buffer I/O error on dev dm-2, logical block 451540309, lost async page write
Jan 25 10:21:29 media kernel: Buffer I/O error on dev dm-2, logical block 451540310, lost async page write
Jan 25 10:21:29 media kernel: Buffer I/O error on dev dm-2, logical block 451540311, lost async page write
Jan 25 10:21:29 media kernel: Buffer I/O error on dev dm-2, logical block 451540312, lost async page write
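For reference, the "error 74" in those messages is just the errno, which decodes to EBADMSG ("Bad message"), consistent with the metadata CRC verification failures:

python -c 'import os; print(os.strerror(74))'
# Bad message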
And here’s one from last night, which is more typical of what I see.
Jan 24 20:45:04 media kernel: XFS (dm-2): Metadata corruption detected at xfs_inode_buf_verify+0x71/0xf0 [xfs], xfs_inode block 0x373069420
Jan 24 20:45:04 media kernel: XFS (dm-2): Unmount and run xfs_repair
Jan 24 20:45:04 media kernel: XFS (dm-2): First 64 bytes of corrupted metadata buffer:
Jan 24 20:45:04 media kernel: ffff88012672f000: fe ed ba b3 00 00 00 7e 02 13 be 0c 00 00 00 72 .......~.......r
Jan 24 20:45:04 media kernel: ffff88012672f010: 02 14 4d 82 00 00 01 fa 02 28 db 02 00 3f ac 72 ..M......(...?.r
Jan 24 20:45:04 media kernel: ffff88012672f020: 02 29 6a 8e 00 00 00 72 02 3e 84 0e 00 00 00 72 .)j....r.>.....r
Jan 24 20:45:04 media kernel: ffff88012672f030: 02 3f 13 8e 00 00 00 72 02 47 26 0e 00 00 00 72 .?.....r.G&....r
This is the dry-run output of xfs_repair:
xfs_repair -vn /dev/mapper/media_raid-lv_raid
Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
- block cache size set to 400312 entries
Phase 2 - using internal log
- zero log...
zero_log: head block 564096 tail block 564024
- scan filesystem freespace and inode maps...
Metadata CRC error detected at xfs_allocbt block 0x165ffbde0/0x1000
btree block 4/18229692 is suspect, error -74
bad magic # 0xcfdbf33d in btcnt block 4/18229692
agf_freeblks 127155501, counted 49257 in ag 4
agf_longest 120734092, counted 92 in ag 4
agf_btreeblks 10, counted 9 in ag 4
agf_freeblks 98206318, counted 98206328 in ag 2
sb_ifree 987, counted 1099
sb_fdblocks 3201338748, counted 3109395872
- 11:21:17: scanning filesystem freespace - 32 of 32 allocation groups done
- found root inode chunk
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
- 11:21:17: scanning agi unlinked lists - 32 of 32 allocation groups done
- process known inodes and perform inode discovery...
- agno = 15
- agno = 0
- agno = 30
- agno = 1
- agno = 16
- agno = 31
- agno = 2
- agno = 17
bad nblocks 85438 for inode 4965660805, would reset to 87743
bad nextents 156 for inode 4965660805, would reset to 159
bad nblocks 66863 for inode 4965660806, would reset to 70963
bad nextents 112 for inode 4965660806, would reset to 118
bad nblocks 80290 for inode 4965660808, would reset to 89766
bad nextents 121 for inode 4965660808, would reset to 134
bad nblocks 68171 for inode 4965660809, would reset to 77249
bad nextents 100 for inode 4965660809, would reset to 113
bad key in bmbt root (is 9471, would reset to 5375) in inode 4965660810 data fork
bad nblocks 78019 for inode 4965660810, would reset to 83142
bad nextents 122 for inode 4965660810, would reset to 130
bad nblocks 73478 for inode 4965660811, would reset to 79919
bad nextents 112 for inode 4965660811, would reset to 119
data fork in ino 4965660813 claims free block 622085049
data fork in ino 4965660813 claims free block 622085050
bad nblocks 77815 for inode 4965660813, would reset to 80960
bad nextents 121 for inode 4965660813, would reset to 125
bad nblocks 61406 for inode 4965660814, would reset to 84130
bad nextents 92 for inode 4965660814, would reset to 126
bad nblocks 76607 for inode 4965660815, would reset to 78913
bad nextents 127 for inode 4965660815, would reset to 130
data fork in ino 4965660816 claims free block 621362631
bad nblocks 67514 for inode 4965660816, would reset to 70376
bad nextents 103 for inode 4965660816, would reset to 107
bad nblocks 76790 for inode 4965660817, would reset to 82298
bad nextents 114 for inode 4965660817, would reset to 122
- agno = 3
- agno = 18
- agno = 4
- agno = 19
- agno = 5
- agno = 20
- agno = 6
- agno = 21
- agno = 7
- agno = 22
- agno = 8
- agno = 23
- agno = 9
- agno = 24
- agno = 10
- agno = 25
- agno = 11
- agno = 26
- agno = 12
- agno = 27
- agno = 13
- agno = 28
- agno = 14
- agno = 29
- 11:21:41: process known inodes and inode discovery - 666688 of 666688 inodes done
- process newly discovered inodes...
- 11:21:41: process newly discovered inodes - 32 of 32 allocation groups done
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- 11:21:41: setting up duplicate extent list - 32 of 32 allocation groups done
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 2
- agno = 1
- agno = 3
- agno = 8
- agno = 10
- agno = 11
- agno = 7
- agno = 12
- agno = 9
- agno = 4
- agno = 6
- agno = 13
- agno = 5
- agno = 14
- agno = 15
- agno = 16
- agno = 17
- agno = 18
- agno = 19
- agno = 20
- agno = 21
- agno = 22
- agno = 23
- agno = 24
- agno = 25
- agno = 26
bad nblocks 85438 for inode 4965660805, would reset to 87743
bad nextents 156 for inode 4965660805, would reset to 159
bad nblocks 66863 for inode 4965660806, would reset to 70963
bad nextents 112 for inode 4965660806, would reset to 118
bad nblocks 80290 for inode 4965660808, would reset to 89766
bad nextents 121 for inode 4965660808, would reset to 134
bad nblocks 68171 for inode 4965660809, would reset to 77249
bad nextents 100 for inode 4965660809, would reset to 113
bad key in bmbt root (is 9471, would reset to 5375) in inode 4965660810 data fork
bad nblocks 78019 for inode 4965660810, would reset to 83142
bad nextents 122 for inode 4965660810, would reset to 130
bad nblocks 73478 for inode 4965660811, would reset to 79919
bad nextents 112 for inode 4965660811, would reset to 119
bad nblocks 77815 for inode 4965660813, would reset to 80960
bad nextents 121 for inode 4965660813, would reset to 125
bad nblocks 61406 for inode 4965660814, would reset to 84130
bad nextents 92 for inode 4965660814, would reset to 126
bad nblocks 76607 for inode 4965660815, would reset to 78913
bad nextents 127 for inode 4965660815, would reset to 130
bad nblocks 67514 for inode 4965660816, would reset to 70376
bad nextents 103 for inode 4965660816, would reset to 107
bad nblocks 76790 for inode 4965660817, would reset to 82298
bad nextents 114 for inode 4965660817, would reset to 122
- agno = 27
- agno = 28
- agno = 29
- agno = 30
- agno = 31
- 11:21:41: check for inodes claiming duplicate blocks - 666688 of 666688 inodes done
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
- traversing filesystem ...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- agno = 15
- agno = 16
- agno = 17
- agno = 18
- agno = 19
- agno = 20
- agno = 21
- agno = 22
- agno = 23
- agno = 24
- agno = 25
- agno = 26
- agno = 27
- agno = 28
- agno = 29
- agno = 30
- agno = 31
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
- 11:21:58: verify and correct link counts - 32 of 32 allocation groups done
No modify flag set, skipping filesystem flush and exiting.
XFS_REPAIR Summary Thu Jan 25 11:21:58 2018
Phase Start End Duration
Phase 1: 01/25 11:21:14 01/25 11:21:16 2 seconds
Phase 2: 01/25 11:21:16 01/25 11:21:17 1 second
Phase 3: 01/25 11:21:17 01/25 11:21:41 24 seconds
Phase 4: 01/25 11:21:41 01/25 11:21:41
Phase 5: Skipped
Phase 6: 01/25 11:21:41 01/25 11:21:58 17 seconds
Phase 7: 01/25 11:21:58 01/25 11:21:58
Total run time: 44 seconds
I could not run xfs_repair this time without using the -L flag, but it appears that at least all my media is still present. If anything in cold storage is missing I can restore it from my backup server, but I first need to figure out what to do about this, because it keeps happening while I’m trying to restore the missing files. It has also thrown a lot into lost+found, so that data isn’t really gone; it just isn’t named properly anymore.
du -hs /mnt/raid/lost+found/
533G /mnt/raid/lost+found/
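Since the lost+found entries are named by inode number rather than their original paths, my plan for triaging them is something like the following (a sketch; I haven’t sorted through it all yet):

find /mnt/raid/lost+found -type f -exec file -b {} + | sort | uniq -c | sort -rn
# groups the recovered files by detected content type to see what’s actually in there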