Continual XFS Metadata Corruption

Hi all,

I’ve been struggling for the past few days with XFS metadata corruption errors and I’m not sure how to narrow the issue down, whether I should try switching to another filesystem like ext4/btrfs/zfs, or whether I’ve simply set some parameters incorrectly. I have a backup of nearly all my data (maybe 99% of it) as well as my previous drive array (both of which are also XFS, which may itself be cause for concern), so while I am worried about data loss I should be able to restore everything. The data is mainly media that is accessed frequently and scanned for changes, though some of it is cold storage, and the volume is also exported as an NFS share. It tends to see lots of small I/O along with large bursts, such as when I add new media and then back it up.

I am running 5 6TB Seagate IronWolf drives in RAID 5 with mdadm, and the XFS filesystem sits on an LVM logical volume whose PV was created on the mdadm array. The drives are passed through to a QEMU/KVM VM managed by oVirt. They were previously in my backup server (physical hardware, not a VM) before I put bigger drives in it.

uname -a

Linux media 4.10.11-1.el7.elrepo.x86_64 #1 SMP Tue Apr 18 12:41:24 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux

The array was created with mdadm --create --verbose /dev/md0 --level=5 --raid-devices=5 /dev/sd[b-f]1

mdadm --detail /dev/md127

/dev/md127:
           Version : 1.2
     Creation Time : Sun Oct  8 22:13:33 2017
        Raid Level : raid5
        Array Size : 23441561600 (22355.62 GiB 24004.16 GB)
     Used Dev Size : 5860390400 (5588.90 GiB 6001.04 GB)
      Raid Devices : 5
     Total Devices : 5
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Thu Jan 25 10:21:30 2018
             State : clean 
    Active Devices : 5
   Working Devices : 5
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : unknown

              Name : media:0
              UUID : daa32da3:34b4292b:c99b9bc5:ee4bba9a
            Events : 6965

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3       8       65        3      active sync   /dev/sde1
       5       8       81        4      active sync   /dev/sdf1

cat /proc/mdstat

Personalities : [raid6] [raid5] [raid4] 
md127 : active raid5 sdf1[5] sdc1[2] sdd1[0] sde1[3] sdb1[1]
      23441561600 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      bitmap: 0/44 pages [0KB], 65536KB chunk

unused devices: <none>

The XFS filesystem was created with mkfs.xfs -f -d su=512k,sw=4 /dev/media_raid/lv_raid and is mounted with the default options via this fstab entry: /dev/mapper/media_raid-lv_raid /mnt/raid xfs defaults 0 0

xfs_info /mnt/raid

meta-data=/dev/mapper/media_raid-lv_raid isize=512    agcount=32, agsize=183137152 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=5860388864, imaxpct=5
         =                       sunit=128    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
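
As a quick sanity check (just back-of-the-envelope arithmetic on my part), the geometry above lines up with the mdadm layout; sunit and swidth are reported in filesystem blocks (bsize=4096):

# sunit and swidth are in filesystem blocks (bsize=4096 above)
echo $((128 * 4096))   # 524288 bytes = 512 KiB, matching the 512K md chunk (su=512k)
echo $((512 / 128))    # 4 stripe units per full stripe, matching 4 data disks (sw=4)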

smartctl isn’t indicating any errors on the drives, and by the usual metrics Backblaze checks for failing drives they also look fine, as they should since they’re not that old. I don’t believe this issue is related to the drives themselves.

for f in {b,c,d,e,f} ; do echo $f ; sudo smartctl -a /dev/sd${f} | egrep " 5 |187 |188 |189 |197 |198 " | grep -v "Not_testing" ; echo ; done

b
LU WWN Device Id: 5 000c50 0a4881e11
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0

c
LU WWN Device Id: 5 000c50 0a4894aca
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0

d
LU WWN Device Id: 5 000c50 0a487859c
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0

e
LU WWN Device Id: 5 000c50 0a48ce793
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0

f
LU WWN Device Id: 5 000c50 0a4887eb8
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0

Here’s a snippet of the errors that came up this morning; the Buffer I/O errors at the end are something I don’t usually see.

Jan 25 10:21:27 media kernel: XFS (dm-2): metadata I/O error: block 0x165ffbde0 ("xfs_trans_read_buf_map") error 74 numblks 8
Jan 25 10:21:27 media kernel: XFS (dm-2): page discard on page ffffea0006f81bc0, inode 0x2150db0bd, offset 346996736.
Jan 25 10:21:29 media kernel: XFS (dm-2): Corruption warning: Metadata has LSN (-1614377583:-2057253873) ahead of current LSN (13:564048). Please unmount and run xfs_repair (>= v4.3) to resolve.
Jan 25 10:21:29 media kernel: XFS (dm-2): Metadata CRC error detected at xfs_allocbt_read_verify+0x6c/0xc0 [xfs], xfs_allocbt block 0x165ffbde0
Jan 25 10:21:29 media kernel: XFS (dm-2): Unmount and run xfs_repair
Jan 25 10:21:29 media kernel: XFS (dm-2): First 64 bytes of corrupted metadata buffer:
Jan 25 10:21:29 media kernel: ffff880141088000: cf db f3 3d 0c a5 e8 c9 bf 21 24 6b f4 a7 35 1c  ...=.....!$k..5.
Jan 25 10:21:29 media kernel: ffff880141088010: 61 a5 16 34 94 d9 ed 5f 9f c6 8d 91 85 60 cc 0f  a..4..._.....`..
Jan 25 10:21:29 media kernel: ffff880141088020: b5 19 b7 bb 2e e1 5d 42 73 37 93 c4 e3 0c af 7a  ......]Bs7.....z
Jan 25 10:21:29 media kernel: ffff880141088030: 56 e0 4e cd bf db d2 11 66 f6 aa 2d db 20 b5 52  V.N.....f..-. .R
Jan 25 10:21:29 media kernel: XFS (dm-2): metadata I/O error: block 0x165ffbde0 ("xfs_trans_read_buf_map") error 74 numblks 8
Jan 25 10:21:29 media kernel: XFS (dm-2): xfs_do_force_shutdown(0x1) called from line 315 of file fs/xfs/xfs_trans_buf.c.  Return address = 0xffffffffa02a9d7f
Jan 25 10:21:29 media kernel: XFS (dm-2): I/O Error Detected. Shutting down filesystem
Jan 25 10:21:29 media kernel: XFS (dm-2): Please umount the filesystem and rectify the problem(s)
Jan 25 10:21:29 media kernel: XFS (dm-2): xfs_do_force_shutdown(0x8) called from line 236 of file fs/xfs/libxfs/xfs_defer.c.  Return address = 0xffffffffa024c087
Jan 25 10:21:29 media kernel: Buffer I/O error on dev dm-2, logical block 451540303, lost async page write
Jan 25 10:21:29 media kernel: Buffer I/O error on dev dm-2, logical block 451540304, lost async page write
Jan 25 10:21:29 media kernel: Buffer I/O error on dev dm-2, logical block 451540305, lost async page write
Jan 25 10:21:29 media kernel: Buffer I/O error on dev dm-2, logical block 451540306, lost async page write
Jan 25 10:21:29 media kernel: Buffer I/O error on dev dm-2, logical block 451540307, lost async page write
Jan 25 10:21:29 media kernel: Buffer I/O error on dev dm-2, logical block 451540308, lost async page write
Jan 25 10:21:29 media kernel: Buffer I/O error on dev dm-2, logical block 451540309, lost async page write
Jan 25 10:21:29 media kernel: Buffer I/O error on dev dm-2, logical block 451540310, lost async page write
Jan 25 10:21:29 media kernel: Buffer I/O error on dev dm-2, logical block 451540311, lost async page write
Jan 25 10:21:29 media kernel: Buffer I/O error on dev dm-2, logical block 451540312, lost async page write

And here is one from last night, which is more typical of what I usually see.

Jan 24 20:45:04 media kernel: XFS (dm-2): Metadata corruption detected at xfs_inode_buf_verify+0x71/0xf0 [xfs], xfs_inode block 0x373069420
Jan 24 20:45:04 media kernel: XFS (dm-2): Unmount and run xfs_repair
Jan 24 20:45:04 media kernel: XFS (dm-2): First 64 bytes of corrupted metadata buffer:
Jan 24 20:45:04 media kernel: ffff88012672f000: fe ed ba b3 00 00 00 7e 02 13 be 0c 00 00 00 72  .......~.......r
Jan 24 20:45:04 media kernel: ffff88012672f010: 02 14 4d 82 00 00 01 fa 02 28 db 02 00 3f ac 72  ..M......(...?.r
Jan 24 20:45:04 media kernel: ffff88012672f020: 02 29 6a 8e 00 00 00 72 02 3e 84 0e 00 00 00 72  .)j....r.>.....r
Jan 24 20:45:04 media kernel: ffff88012672f030: 02 3f 13 8e 00 00 00 72 02 47 26 0e 00 00 00 72  .?.....r.G&....r

This is the dry-run output of xfs_repair.

xfs_repair -vn /dev/mapper/media_raid-lv_raid
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
        - block cache size set to 400312 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 564096 tail block 564024
        - scan filesystem freespace and inode maps...
Metadata CRC error detected at xfs_allocbt block 0x165ffbde0/0x1000
btree block 4/18229692 is suspect, error -74
bad magic # 0xcfdbf33d in btcnt block 4/18229692
agf_freeblks 127155501, counted 49257 in ag 4
agf_longest 120734092, counted 92 in ag 4
agf_btreeblks 10, counted 9 in ag 4
agf_freeblks 98206318, counted 98206328 in ag 2
sb_ifree 987, counted 1099
sb_fdblocks 3201338748, counted 3109395872
        - 11:21:17: scanning filesystem freespace - 32 of 32 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - 11:21:17: scanning agi unlinked lists - 32 of 32 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 15
        - agno = 0
        - agno = 30
        - agno = 1
        - agno = 16
        - agno = 31
        - agno = 2
        - agno = 17
bad nblocks 85438 for inode 4965660805, would reset to 87743
bad nextents 156 for inode 4965660805, would reset to 159
bad nblocks 66863 for inode 4965660806, would reset to 70963
bad nextents 112 for inode 4965660806, would reset to 118
bad nblocks 80290 for inode 4965660808, would reset to 89766
bad nextents 121 for inode 4965660808, would reset to 134
bad nblocks 68171 for inode 4965660809, would reset to 77249
bad nextents 100 for inode 4965660809, would reset to 113
bad key in bmbt root (is 9471, would reset to 5375) in inode 4965660810 data fork
bad nblocks 78019 for inode 4965660810, would reset to 83142
bad nextents 122 for inode 4965660810, would reset to 130
bad nblocks 73478 for inode 4965660811, would reset to 79919
bad nextents 112 for inode 4965660811, would reset to 119
data fork in ino 4965660813 claims free block 622085049
data fork in ino 4965660813 claims free block 622085050
bad nblocks 77815 for inode 4965660813, would reset to 80960
bad nextents 121 for inode 4965660813, would reset to 125
bad nblocks 61406 for inode 4965660814, would reset to 84130
bad nextents 92 for inode 4965660814, would reset to 126
bad nblocks 76607 for inode 4965660815, would reset to 78913
bad nextents 127 for inode 4965660815, would reset to 130
data fork in ino 4965660816 claims free block 621362631
bad nblocks 67514 for inode 4965660816, would reset to 70376
bad nextents 103 for inode 4965660816, would reset to 107
bad nblocks 76790 for inode 4965660817, would reset to 82298
bad nextents 114 for inode 4965660817, would reset to 122
        - agno = 3
        - agno = 18
        - agno = 4
        - agno = 19
        - agno = 5
        - agno = 20
        - agno = 6
        - agno = 21
        - agno = 7
        - agno = 22
        - agno = 8
        - agno = 23
        - agno = 9
        - agno = 24
        - agno = 10
        - agno = 25
        - agno = 11
        - agno = 26
        - agno = 12
        - agno = 27
        - agno = 13
        - agno = 28
        - agno = 14
        - agno = 29
        - 11:21:41: process known inodes and inode discovery - 666688 of 666688 inodes done
        - process newly discovered inodes...
        - 11:21:41: process newly discovered inodes - 32 of 32 allocation groups done
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - 11:21:41: setting up duplicate extent list - 32 of 32 allocation groups done
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 1
        - agno = 3
        - agno = 8
        - agno = 10
        - agno = 11
        - agno = 7
        - agno = 12
        - agno = 9
        - agno = 4
        - agno = 6
        - agno = 13
        - agno = 5
        - agno = 14
        - agno = 15
        - agno = 16
        - agno = 17
        - agno = 18
        - agno = 19
        - agno = 20
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 24
        - agno = 25
        - agno = 26
bad nblocks 85438 for inode 4965660805, would reset to 87743
bad nextents 156 for inode 4965660805, would reset to 159
bad nblocks 66863 for inode 4965660806, would reset to 70963
bad nextents 112 for inode 4965660806, would reset to 118
bad nblocks 80290 for inode 4965660808, would reset to 89766
bad nextents 121 for inode 4965660808, would reset to 134
bad nblocks 68171 for inode 4965660809, would reset to 77249
bad nextents 100 for inode 4965660809, would reset to 113
bad key in bmbt root (is 9471, would reset to 5375) in inode 4965660810 data fork
bad nblocks 78019 for inode 4965660810, would reset to 83142
bad nextents 122 for inode 4965660810, would reset to 130
bad nblocks 73478 for inode 4965660811, would reset to 79919
bad nextents 112 for inode 4965660811, would reset to 119
bad nblocks 77815 for inode 4965660813, would reset to 80960
bad nextents 121 for inode 4965660813, would reset to 125
bad nblocks 61406 for inode 4965660814, would reset to 84130
bad nextents 92 for inode 4965660814, would reset to 126
bad nblocks 76607 for inode 4965660815, would reset to 78913
bad nextents 127 for inode 4965660815, would reset to 130
bad nblocks 67514 for inode 4965660816, would reset to 70376
bad nextents 103 for inode 4965660816, would reset to 107
bad nblocks 76790 for inode 4965660817, would reset to 82298
bad nextents 114 for inode 4965660817, would reset to 122
        - agno = 27
        - agno = 28
        - agno = 29
        - agno = 30
        - agno = 31
        - 11:21:41: check for inodes claiming duplicate blocks - 666688 of 666688 inodes done
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
        - agno = 16
        - agno = 17
        - agno = 18
        - agno = 19
        - agno = 20
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 24
        - agno = 25
        - agno = 26
        - agno = 27
        - agno = 28
        - agno = 29
        - agno = 30
        - agno = 31
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
        - 11:21:58: verify and correct link counts - 32 of 32 allocation groups done
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Thu Jan 25 11:21:58 2018

Phase		Start		End		Duration
Phase 1:	01/25 11:21:14	01/25 11:21:16	2 seconds
Phase 2:	01/25 11:21:16	01/25 11:21:17	1 second
Phase 3:	01/25 11:21:17	01/25 11:21:41	24 seconds
Phase 4:	01/25 11:21:41	01/25 11:21:41	
Phase 5:	Skipped
Phase 6:	01/25 11:21:41	01/25 11:21:58	17 seconds
Phase 7:	01/25 11:21:58	01/25 11:21:58	

Total run time: 44 seconds

I could not run xfs_repair this time without using the -L flag, but it appears that at least all my media is still present. If anything in cold storage is missing I can restore it from my backup server, but I first need to figure out what to do about this, since the corruption keeps happening while I try to restore the missing files. It has also thrown a lot into lost+found, so that data isn’t really gone, it’s just not named properly anymore.

du -hs /mnt/raid/lost+found/

533G	/mnt/raid/lost+found/
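
For completeness, the rough sequence I go through each time (reaching for -L only as a last resort, since zeroing the log can throw away in-flight metadata updates):

umount /mnt/raid
xfs_repair -n /dev/mapper/media_raid-lv_raid   # dry run, report only
xfs_repair /dev/mapper/media_raid-lv_raid      # normal repair; refuses to run if the dirty log can't be replayed
xfs_repair -L /dev/mapper/media_raid-lv_raid   # last resort: zeroes the log first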

Software RAID 5? Yeah, you are trying to do way too much and it could be a problem with a whole host of things.

I would not be surprised in the slightest if this is a hardware issue. Not only could the drives be acting up, but I have NEVER had success with running raid 5 without a raid card. Something always gets messed up.

My suggestion for the moment is simply to take all the data off your drives, update your kernel to the newest version, update your BIOS to the newest version, and rebuild the array without the LVM.

Just make a plain jane raid 5 array and see if that experiences any of the same issues.

We need some way to isolate the issue down to a particular step in the procedure.

What’s the hardware platform? (CPU / Chipset / Motherboard, etc)
Do you have error correcting RAM? If so, is it enabled?

Edit: That aside, you probably want to be on a newer kernel. There have been quite a few mdraid fixes since 4.10.11.

My hypervisor is running

Mobo: SUPERMICRO MBD-X10DRL-I
CPUs: 2xE5-2683 v4 (engineering samples)
RAM: 64GB of Kingston ValueRAM (4 x 16GB) DDR4 2400 with ECC

I believe ECC should be enabled, but I would have to reboot the hypervisor and enter the BIOS to check. I did grep dmesg for ecc but it didn’t return anything, and I think it should report a message when ECC is enabled. I recently updated the BIOS to the latest available, 2.0b.
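
In case it helps, here are a couple of ways I believe ECC status can be checked from the running hypervisor without a reboot (assuming dmidecode is installed and the EDAC modules are loaded):

sudo dmidecode -t memory | grep -i "error correction"   # e.g. "Error Correction Type: Multi-bit ECC"
ls /sys/devices/system/edac/mc/                         # mc0, mc1, ... show up when an EDAC driver is active
dmesg | grep -i edac                                    # any EDAC driver messages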

4.10.11 seems to be the latest available for me, but I can see that kernel-ml-4.14.14-1 is available from elrepo-kernel, so I’m not sure why it’s not showing me any newer versions. I’ll look into that.

I probably don’t need LVM on top since this is really just mass storage and I don’t intend to have multiple volumes, but since the rest of the system uses LVM I figured why not get some experience with it. I can certainly blow it away and copy all the data back; it’ll take a couple of days since it’s over 9TB. Should I stick with XFS or change to a different filesystem as well? I’m sure XFS is fine for this application, but I’d be willing to try something different. Similarly, I’d be willing to dish out some $$$ for a RAID card; I mainly wanted to avoid that and keep everything software-based, so that if the hardware dies I can just toss the drives in another machine and be good to go.

Might also be worth posting the output of

vgdisplay -v

I have a similar-ish setup, except I use software RAID1, LVM and ext4.

If you have the capacity, RAID1 is so much easier to recover from since you have identical mirrors, any of which can even be split off and incrementally resynced. (It’s also useful for backups, with a three-way mirror where one leg is usually kept detached.)
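
Splitting off and re-adding a mirror leg looks something like this (device names are only illustrative, and the incremental resync depends on the array having a write-intent bitmap):

mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1   # detach one mirror leg
# ...use the detached copy for a backup, mount it read-only elsewhere, etc...
mdadm /dev/md0 --re-add /dev/sdc1                    # with a write-intent bitmap only changed blocks resync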

Anyhow, in your case something really bad happened, as this line indicates:
Jan 25 10:21:29 media kernel: XFS (dm-2): Corruption warning: Metadata has LSN (-1614377583:-2057253873) ahead of current LSN (13:564048). Please unmount and run xfs_repair (>= v4.3) to resolve.

The LSN values are not even close, which means that whatever XFS read is badly corrupted.

You may want to kick off a background check of the entire array as follows:

From:
http://neil.brown.name/blog/20050727141521

(I should really do another TODO list shouldn’t I…) “background check/repair” is stable in 2.6.18 and later (maybe even earlier, I’m not sure). You echo check > /sys/block/mdX/md/sync_action, where X is the number of the md array. This causes a background check to run which will read all blocks and trigger auto-recovery of any read errors. Any inconsistencies will be counted and reported in cat /sys/block/mdX/md/mismatch_cnt. If you instead echo repair > /sys/block/mdX/md/sync_action, it will do the same check process, but any inconsistencies found will automatically be corrected. Finally, you can echo idle > /sys/block/mdX/md/sync_action to abort any running check or repair.

I’d do a “check” and then see what the mismatch_cnt is.
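
For your array that would be along the lines of:

echo check > /sys/block/md127/md/sync_action   # start the background read/verify pass
cat /proc/mdstat                               # shows the check progress
cat /sys/block/md127/md/mismatch_cnt           # inconsistencies found so far
echo idle > /sys/block/md127/md/sync_action    # abort the check/repair if needed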

I figured out why it wasn’t updating to the latest available: I had exclude=kernel* in my /etc/yum.conf because whenever the base kernel (still 3.10) updated, it kept getting set as the default. I removed the * since I’m using kernel-ml, and now I’m on 4.14.15-1.
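
For anyone hitting the same thing, roughly what the change looked like (my reconstruction; kernel-ml comes from the elrepo-kernel repo):

# /etc/yum.conf
# before: exclude=kernel*   <- also blocked the kernel-ml packages
exclude=kernel              # after: only the base 3.10 "kernel" package is held back

sudo yum --enablerepo=elrepo-kernel update kernel-ml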

Sure, here is the output of vgdisplay -v. That was definitely the worst error I’ve seen thus far; I hadn’t seen any LSN errors before.

  --- Volume group ---
  VG Name               media_raid
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  3
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               21.83 TiB
  PE Size               4.00 MiB
  Total PE              5723037
  Alloc PE / Size       5723037 / 21.83 TiB
  Free  PE / Size       0 / 0   
  VG UUID               cZocpH-DdHJ-o9EZ-jFSi-z6Ua-ifXQ-drG58x
   
  --- Logical volume ---
  LV Path                /dev/media_raid/lv_raid
  LV Name                lv_raid
  VG Name                media_raid
  LV UUID                di0t03-4Rxz-cg5R-6Fhs-QlIw-mQR2-RZXa8V
  LV Write Access        read/write
  LV Creation host, time media, 2017-10-16 02:24:39 -0400
  LV Status              available
  # open                 1
  LV Size                21.83 TiB
  Current LE             5723037
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:2
   
  --- Physical volumes ---
  PV Name               /dev/md127     
  PV UUID               YM1ylp-TZio-XQ4q-PTS3-DKk8-HWQi-CIeDGP
  PV Status             allocatable
  Total PE / Free PE    5723037 / 0
   
  --- Volume group ---
  VG Name               media_coreOS
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  3
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               <29.51 GiB
  PE Size               4.00 MiB
  Total PE              7554
  Alloc PE / Size       7554 / <29.51 GiB
  Free  PE / Size       0 / 0   
  VG UUID               VWNKBg-s1w0-6ryh-9ezT-1lMV-8woV-LhuWAw
   
  --- Logical volume ---
  LV Path                /dev/media_coreOS/lv_root
  LV Name                lv_root
  VG Name                media_coreOS
  LV UUID                7c67dP-fevA-Mwwx-w1La-SfbS-ehl5-sC2MWo
  LV Write Access        read/write
  LV Creation host, time media, 2017-02-06 16:03:03 -0500
  LV Status              available
  # open                 1
  LV Size                <21.51 GiB
  Current LE             5506
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:0
   
  --- Logical volume ---
  LV Path                /dev/media_coreOS/lv_swap
  LV Name                lv_swap
  VG Name                media_coreOS
  LV UUID                A9NacS-Fs9A-sMrr-zrzs-iyTI-MTFN-JZpblX
  LV Write Access        read/write
  LV Creation host, time media, 2017-02-06 16:03:04 -0500
  LV Status              available
  # open                 2
  LV Size                8.00 GiB
  Current LE             2048
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:1
   
  --- Physical volumes ---
  PV Name               /dev/sda2     
  PV UUID               xqxD6Y-6YFE-sJh6-rdxG-20y6-cuV3-pWFdoF
  PV Status             allocatable
  Total PE / Free PE    7554 / 0

I’ve kicked off a check; it’ll take about 9 hours to complete. In the few minutes I’ve spent writing this post it has already found some mismatches.

Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sdb1[1] sdc1[2] sdd1[0] sde1[3] sdf1[5]
      23441561600 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      [>....................]  resync =  0.8% (49650556/5860390400) finish=537.2min speed=180256K/sec
      bitmap: 0/44 pages [0KB], 65536KB chunk
cat /sys/block/md127/md/mismatch_cnt 
48

Yeah, there have been quite a few XFS bug fixes between kernels 3.10 and 4.14.

And to answer your previous question, I don’t think this is an issue with the filesystem. I think if you used ext4 you would still have issues.

Good thing I was on 4.10 and not 3.10 then. :slight_smile: Probably still missing a few fixes, but we’re up to the latest now; that was my bad on the config.

It’s possible that there would still be issues with ext4, but of course we won’t know until I try it. I built this server almost a year ago and ran the same set of four 4TB drives for about 8 or 9 months without any errors at all, and I had been using those drives for about 3 years prior to that. They were in an mdadm RAID 5 during that time, but without LVM, and the filesystem was ext4. The five 6TB drives I moved in to replace them were in my backup server for 3-4 months tops; the RAID was set up the same way on that server with mdadm and LVM, just with one more drive. The backup server now has five 10TB drives.

Currently at 13.4% with 6952 mismatches. Oh boy…

A note on mismatch_cnt:

With RAID1 there are some scenarios where mismatches may be “acceptable” because the mismatch is in free space. A few things can cause this: kernel optimizations (e.g. avoiding updates to deleted files), and mirroring devices that support trim/discard (most SSDs) with those that don’t (HDDs), both of which are likely to cause mismatches in free-space areas.

…but on RAID5, mismatches probably shouldn’t occur as much (if at all). It’s possible, I suppose, that the kernel could avoid updating a stripe’s parity if it knows for sure the entire RAID5 stripe is unallocated (with parity then being updated on the next write). That could lead to a benign mismatch, but I don’t know if the Linux kernel behaves that way. I haven’t used software RAID5 on Linux to any degree so have never researched it (though we used software RAID5 extensively on Sun Solaris at my old company, back in the day…).

Anyhow, my point is mismatches may not indicate a disaster.

The background check will also have swept through your entire RAID volume and found any latent sector errors, which is a valuable test, so check your dmesg output and logs for any I/O errors.

Finally, if you decide to run an mdraid repair, it will recalculate and rewrite the parity wherever there is a mismatch and no I/O error (i.e. mdraid will trust the data more than the parity), but that could be a dangerous assumption in some unusual cases (e.g. after a drive replacement if the RAID5 hasn’t properly rebuilt).

So, prior to the repair, I would recommend taking a backup and also running md5sum or sha1sum on every single file, and again after the repair completes, to see if anything has changed.

E.g., something like this before:
find /mnt/raid -type f -exec md5sum -b {} \; | tee /tmp/raid_md5sums.txt

…and check afterwards with:
md5sum -c /tmp/raid_md5sums.txt

If a file’s checksum changes, it can be investigated.

Right - enough for one day… :slight_smile:

I will probably do a compare against my backup server, since every time I tried syncing files back I’d eventually start getting the corruption messages. I mainly care about the actual media files; the metadata files associated with them (i.e. NFOs, JPGs) are not valuable since Emby will just recreate them if needed. This will also help me find what’s missing, but ultimately I’ll just sync everything back from the backup server (they were copied with rsync). I’ll add some switches to the find command to only look at files over 100MB or so. So far nothing has been reported in dmesg since starting the check. If it doesn’t finish too late tonight I’ll reboot the hypervisor and double-check whether ECC is enabled.
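
Something along these lines for the size filter, building on the find command suggested above:

find /mnt/raid -type f -size +100M -exec md5sum -b {} \; | tee /tmp/raid_md5sums.txt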

At work I have to support an embedded Linux system with kernel 2.6 running XFS. ;_;

I cry

Oh thank god. Your previous post said something about 3.10, so I was just sitting here thinking you were some OLD OLD school Debian server guy.

And while you aren’t wrong that we won’t know until we try, I will say that XFS is a pretty solid FS. If you had BTRFS or ZFS or some other crazy FS, I would tell you to switch. But XFS is right there with ext in being an old, well-established, simple FS. There just isn’t a whole hell of a lot to go wrong.

If you want to try ext4, be my guest; it would definitely rule out the FS as a possibility. I just think that if you are still running into these issues on a basic RAID 5 array with XFS, then something else is going on. The other issue is that when I google your problem, I’m not getting a whole lot of relevant info. XFS is a hugely popular FS; if the filesystem itself were having an issue, the whole god darn Linux community would be up in arms about it.

Couple that with the fact that Seagate hard drives are known for having reliability issues, and I just really don’t feel like XFS is the issue.

Now, from what I understand, the current setup you are testing is a plain RAID 5 array with XFS. No LVM, no weird VM passthrough systems. Just a plain, blank RAID 5 setup, correct?

If you are still having issues, then it can only be one of four things: the RAID software (or whatever is controlling the RAID array), the hard drives, the FS, or possibly even mdadm.

I have never understood companies that think it’s cool to support old Linux kernels.

I don’t mind it if they want to use a slightly older LTS kernel or something like that. Heck, I have seen special cases where they hire a Linux developer to make a custom static kernel for printer kiosks and whatnot.

But if it is a device that connects to the network and has to be maintained, PLEASE update it with a kernel from the past 2 years.

The devices are Pelco video surveillance cameras. We weren’t supposed to support them since they went out of warranty in October 2016.

They run Pelco Linux which is their custom distro.

Basically, we had a capital upgrade of all the cameras in the works, but it was denied this year because the paperwork wasn’t properly submitted, so we are having to wait for it to go through the system.

Since they run Linux, I’m basically the only one who knows how to service them. Oh, and I also get to work with two RHEL 3 systems.

Can people remote into the cameras, or does the recording system back the data up to an offsite location? If so, I would be more than happy to tell your company to cram their paperwork where the sun don’t shine and buy some quality cameras.

I work with the morons who think it’s acceptable for their company to deploy RHEL 3 systems on cloud servers.

Sooooo yeah.

Someone the other day asked me if they could deploy Ubuntu 10 inside a Docker container and my brain just about melted.

Nope, sshd is disabled.

Nope. All local baby.

Meh, then that’s fine. Not ideal, but at least you aren’t opening yourself up to every script kiddie with a Kali Linux flash drive.

Yeah, it runs on restricted VLANs. Only intranet access from cameras on the same VLAN.

Dark af, but holy hell is it a honeypot if anything actually bad connects on there (which would have to be physical, and you’d have to disconnect a camera, which would prove difficult since most are analog and hook up to an encoder that is networked to the storage units).

No no, never used Debian before. :slight_smile: But CentOS still ships with 3.10, so whenever the base kernel updates it gets set as the default and I have to run grub2-set-default 0 to get it to boot my 4.x kernel; my exclude just wasn’t quite right. Heck, I just got a Linux VM at work and it’s RHEL 7.2 with 3.10.0-327. We have some pretty old systems. The server I need to help rebuild, even though it’s not my area, I’m pretty sure was running 2.4; it died because of a RAID failure, but we’re talking about a server that should have been refreshed AT LEAST 3-4 years ago. I don’t really see the point in bringing it back up as an old server; we’d be better off building a newer one, but $$$$ as usual. We have a lot of Solaris systems and a few HP-UX boxes that they never update, some with uptimes of 2-3+ years. Welcome to “we need it to be stable, but we won’t pay you to update it” corporate. :slight_smile:
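
For anyone curious, that boils down to something like this (the entry index depends on what’s in grub.cfg; the newest installed kernel is usually 0):

grep ^menuentry /boot/grub2/grub.cfg | cut -d "'" -f2   # list boot entries; index starts at 0
sudo grub2-set-default 0                                # make the newest (kernel-ml) entry the default
sudo grub2-editenv list                                 # confirm saved_entry=0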

I haven’t changed over to remove the LVM setup yet; I was waiting for the rescan to complete, which it did late last night. mismatch_cnt ended up at 9504 and no errors were reported in dmesg. I ran the repair as well, and no errors there either. I don’t think I’ll have time tonight, but before I get too deep into this I’ll try tomorrow to also update the hypervisor to 4.14 and make sure ECC is on; I’ve been a little behind on that, busy with other things. I will have to keep doing passthrough because the whole point of the setup was separate VMs for specific functions, but yes, I will try putting XFS directly onto the RAID without the LVM layer. Based on the command generated from the oVirt/libvirt domain XML, the drives are passed through as follows:

-drive file=/dev/sg4,if=none,id=drive-hostdev0 
-device scsi-generic,bus=scsi0.0,channel=0,scsi-id=0,lun=1,drive=drive-hostdev0,id=hostdev0 
-drive file=/dev/sg5,if=none,id=drive-hostdev1
-device scsi-generic,bus=scsi0.0,channel=0,scsi-id=0,lun=2,drive=drive-hostdev1,id=hostdev1 
-drive file=/dev/sg6,if=none,id=drive-hostdev2 
-device scsi-generic,bus=scsi0.0,channel=0,scsi-id=0,lun=3,drive=drive-hostdev2,id=hostdev2 
-drive file=/dev/sg2,if=none,id=drive-hostdev3 
-device scsi-generic,bus=scsi0.0,channel=0,scsi-id=0,lun=4,drive=drive-hostdev3,id=hostdev3 
-drive file=/dev/sg3,if=none,id=drive-hostdev4 
-device scsi-generic,bus=scsi0.0,channel=0,scsi-id=0,lun=5,drive=drive-hostdev4,id=hostdev4

I had heard about Seagate drives having reliability issues, but I don’t think it’s the drives. I saw a couple of these corruptions happen with my old 4-drive WD array, just not nearly this bad, so I’m thinking this is more on the software side. Those drives were only 5400 RPM and didn’t report a lot of SMART data; they didn’t even have temperature! The drives are all on the same controller on the motherboard; it has two, referred to in the BIOS as iSATA and sSATA.

lspci | grep -i sata

00:11.4 SATA controller: Intel Corporation C610/X99 series chipset sSATA Controller [AHCI mode] (rev 05)
00:1f.2 SATA controller: Intel Corporation C610/X99 series chipset 6-Port SATA Controller [AHCI mode] (rev 05)