Files do not seem to copy correctly from one filesystem to another

Hi,

I’m migrating a home media server from raid6+lvm2 to zfs; however, it seems that not all files are copied correctly…

First I tried with rsync: the copy ended without errors, but since I’m paranoid and I don’t want to lose my data, I did another run with rsync, which, to my surprise, started copying some files again.
The rsync command is the following:

rsync -a -c --progress /mnt/srv/casa/ /srv/casa/

Since I’m really paranoid and really don’t want to lose my data, I wrote a script that computes an md5 checksum file by file… it turns out some files are different, and I double-checked with sha256sum:

sha256sum /mnt/srv/casa/data_backup/navicella-linux-2018-04-04/odroid/debian-pan.img
6a5c6dd185e5526b50e5f2190a578db6ae93d44f3cbee8206678d3584b771996  /mnt/srv/casa/data_backup/navicella-linux-2018-04-04/odroid/debian-pan.img
sha256sum /srv/casa/data_backup/navicella-linux-2018-04-04/odroid/debian-pan.img
e98d53453c6a0545c8b81ae0df178b747f7d33746972e3922e0aecfdb177753c  /srv/casa/data_backup/navicella-linux-2018-04-04/odroid/debian-pan.img

So the files are really different.
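
For reference, that per-file comparison can be sketched roughly like this (the paths and loop are placeholders, not my exact script):

cd /mnt/srv/casa
find . -type f -print0 | while IFS= read -r -d '' f; do
    src=$(md5sum "$f" | cut -d' ' -f1)            # checksum on the old array
    dst=$(md5sum "/srv/casa/$f" | cut -d' ' -f1)  # checksum on the zfs copy
    [ "$src" = "$dst" ] || echo "MISMATCH: $f"
done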
Just to be sure, I also tried to copy the file with tar:

tar cf - * | pv | tar xfp - -C /srv/casa/

but the files are still different.

zpool status does not show anything in particular (apart from two errors on files that are not really important; please ignore the degraded state of the array):

pool: bpool
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: none requested
config:

    NAME                                       STATE     READ WRITE CKSUM
    bpool                                      DEGRADED     0     0     0
      raidz2-0                                 DEGRADED     0     0     0
        ata-ST2000DM006-2DM164_Z4ZAMLX9-part3  ONLINE       0     0     0
        ata-ST2000DM006-2DM164_Z4ZAQ5ZH-part3  ONLINE       0     0     0
        ata-ST2000DM008-2FR102_WFL353JL-part3  ONLINE       0     0     0
        ata-TOSHIBA_HDWA120_857N5DYKS-part3    ONLINE       0     0     0
        ata-TOSHIBA_HDWA120_86140KXGS-part3    ONLINE       0     0     0
        /tmp/fake-part3                        OFFLINE      0     0     0

errors: No known data errors

pool: rpool
state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: resilvered 150M in 0 days 00:00:12 with 0 errors on Wed Dec 23 11:34:09 2020
 config: 

    NAME                                       STATE     READ WRITE CKSUM
    rpool                                      DEGRADED     0     0     0
      raidz2-0                                 DEGRADED     0     0     0
        ata-ST2000DM006-2DM164_Z4ZAMLX9-part4  ONLINE       0     0     0
        ata-ST2000DM006-2DM164_Z4ZAQ5ZH-part4  ONLINE       0     0     0
        ata-ST2000DM008-2FR102_WFL353JL-part4  ONLINE       0     0     0
        ata-TOSHIBA_HDWA120_857N5DYKS-part4    ONLINE       0     0     0
        ata-TOSHIBA_HDWA120_86140KXGS-part4    ONLINE       0     0     0
        /tmp/fake-part4                        OFFLINE      0     0     0

errors: Permanent errors have been detected in the following files:

    /srv/casa/incoming/navicella/salome_7.5.1/salome_appli_7.5.1/bin/salome/PYLIGHTGUI.pyo
    /srv/casa/nwnx4/lib/ruby/lib

So, what could be the cause?

Time. Linux stores timestamps with files, so as to distinguish between versions. Those timestamps cause the checksums to differ.

The -a option on the rsync command should preserve the time stamps. I would use cp -a for a straight copy instead of rsync, but you don’t get the progress.
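
For example, something like this (paths taken from your post; the trailing /. copies the directory contents rather than the directory itself):

cp -a /mnt/srv/casa/. /srv/casa/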

A change in the time stamp should show up as a change in the checksum, but if the access time is the only thing changing on the files, then the data itself should be fine. I’m pretty sure you can change file system parameters to modify the access-time behavior. I would look at the default and mount options of your source and destination file systems: you want noatime or relatime on both source and destination to prevent the copy/rsync from modifying the time stamps at either end.
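
On the source side (the old raid6+lvm2 filesystem, which judging by your paths is mounted at /mnt/srv/casa), that would be something like:

mount -o remount,noatime /mnt/srv/casa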

Here is some info on ZFS atime behavior (looks like atime=on is default)
https://wiki.archlinux.org/index.php/ZFS#Tuning

You could turn off atime on the dataset you are copying to and see if the problem persists, then turn atime back on afterwards, once the copy is confirmed.
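
Something along these lines, assuming the destination dataset is called rpool/srv/casa (adjust to whatever your dataset is actually named):

zfs get atime rpool/srv/casa     # check the current setting
zfs set atime=off rpool/srv/casa
# …do the copy and re-check the checksums…
zfs set atime=on rpool/srv/casa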

Another thought: I see you are copying to a root pool. Is it possible a service on the system is accessing the files in question after copying? E.g. something accessing the image files to create thumbnails, or a running service suddenly seeing files in a directory it is watching?

Hi,

Thanks for your answer; however, I’m not sure I understand how a file’s checksum is affected by timestamps. I did a quick test and it does not seem to be the case:

andrea@atlante:~$ echo "whatever" > test
andrea@atlante:~$ stat test
  File: test
  Size: 9               Blocks: 2          IO Block: 512    regular file
Device: 3dh/61d Inode: 434         Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/  andrea)   Gid: ( 1000/  andrea)
Access: 2020-12-28 08:25:40.724563636 +0100
Modify: 2020-12-28 08:25:40.792562346 +0100
Change: 2020-12-28 08:25:40.792562346 +0100
 Birth: -
andrea@atlante:~$ sha256sum test
cd293be6cea034bd45a0352775a219ef5dc7825ce55d1f7dae9762d80ce64411  test
andrea@atlante:~$ touch test
andrea@atlante:~$ stat test
  File: test
  Size: 9               Blocks: 2          IO Block: 512    regular file
Device: 3dh/61d Inode: 434         Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/  andrea)   Gid: ( 1000/  andrea)
Access: 2020-12-28 08:26:03.752132365 +0100
Modify: 2020-12-28 08:26:03.752132365 +0100
Change: 2020-12-28 08:26:03.752132365 +0100
 Birth: -
andrea@atlante:~$ sha256sum test
cd293be6cea034bd45a0352775a219ef5dc7825ce55d1f7dae9762d80ce64411  test

Also, some of the copied files have different timestamps but the same checksum… how does that work? Do you maybe have a link to an article explaining this?

Thanks
Andrea

sha1sum (and the related md5sum, sha256sum, etc. tools) only looks at file contents.

When you use tar, tar might store owners and timestamps and so on.

Running rsync -avPr src/ dest as root would preserve timestamps and ownership… I’m not sure about extended attributes. Check the rsync manpage for various useful flags in case you run into more differences: https://manpages.debian.org/testing/rsync/rsync.1.en.html ; specifically look at the explanations for --itemize-changes, --dry-run and --size-only.
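
For example, a checksum-based dry run that only itemizes what would be transferred, using the same source and destination as in your first post, could look like this:

rsync -a -c -n -i /mnt/srv/casa/ /srv/casa/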


Try copying again, or if it’s a large file use my script as a starting point for getting hashes of blocks:

https://raw.githubusercontent.com/google/parallel-chunks/master/pyhash.py

You can then patch up the blocks that differ with dd.
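
If you’d rather stick to standard tools, a rough equivalent using dd and sha256sum could look like this (64 MiB chunks; the paths are the ones from your sha256sum example):

SRC=/mnt/srv/casa/data_backup/navicella-linux-2018-04-04/odroid/debian-pan.img
DST=/srv/casa/data_backup/navicella-linux-2018-04-04/odroid/debian-pan.img
chunks=$(( ( $(stat -c %s "$SRC") + 64*1024*1024 - 1 ) / (64*1024*1024) ))
for i in $(seq 0 $((chunks - 1))); do
    a=$(dd if="$SRC" bs=64M skip=$i count=1 2>/dev/null | sha256sum)
    b=$(dd if="$DST" bs=64M skip=$i count=1 2>/dev/null | sha256sum)
    [ "$a" = "$b" ] || echo "chunk $i differs"
done
# patch a single differing chunk N in place, without truncating the destination:
# dd if="$SRC" of="$DST" bs=64M skip=$N seek=$N count=1 conv=notrunc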


Also, this type of corruption is not normal; most likely you have bad RAM or badly seated RAM, a badly seated CPU, or a bad power supply… it’s possible your source data is already corrupted, and it’s possible you might run into further issues. (I’d start by taking the RAM out and running the DIMM’s PCB edge connectors between my fingers with a sheet of plain white paper, as if it were sandpaper, then putting it back.) Also, reset the BIOS/UEFI firmware to optimized defaults.

I stand corrected. I thought the time stamps did affect the md5sum, but then again it wouldn’t be very useful if it did.

I would add a bad SATA cable or backplane to @risk’s list of potential hardware issues.

It turns out it was a bad memory module; after I removed the offending module, all the mismatched md5sums went away 🙂 Thank you for pointing me in the right direction.

Bye
Andrea
