Help me diagnose bad copy performance

I’m moving a bunch of files between 2 spinning rust drives using rsync.

Every 5 seconds, like clockwork, the transfer stops (goes to 0 b/s), then resumes at a rate normal for the drives.

Source drive is xfs, destination drive is btrfs.

The rsync progress indicator stops updating for a couple sec, then resumes.
I am also monitoring progress via watch -n 1 progress -q -w (which reports 0 b/s during the stall).
CPU usage also drops at that moment (got htop running).

I know fio exists, but I’m not sure it’s appropriate here or how to use it in this situation.
I don’t know how to tell if it’s some kind of filesystem problem or buggy rsync or whatever.
(It’s unlikely to be a disk/hardware problem, as the drives have been tested in other ways and can sustain 1 Gbit/s transfers consistently and indefinitely.)
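If fio is the right tool here, I’m guessing a basic sequential read test on the source and a sequential write test on the destination would look roughly like this (mount points, file names and sizes are placeholders I made up; this is only a sketch, not something I’ve run):

```bash
# Sequential read from the source drive (mount point is a placeholder)
fio --name=seqread --filename=/mnt/source/fio-test.bin \
    --rw=read --bs=1M --size=4G --ioengine=libaio --direct=1

# Sequential write to the destination drive (mount point is a placeholder)
fio --name=seqwrite --filename=/mnt/dest/fio-test.bin \
    --rw=write --bs=1M --size=4G --ioengine=libaio --direct=1

# Clean up the test files afterwards
rm /mnt/source/fio-test.bin /mnt/dest/fio-test.bin
```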

Extra observations:
htop also reports disk IO. Because of write caching, reads and writes are staggered; you essentially get hysteresis. CPU usage is high during reading; when writing from cache, CPU usage is quite low.
Also, the fan ramp up/down is sooo annoying. :sob:

  1. What process/kernel thread is using the CPU? Reading/writing at spinner speeds should be <1% CPU on anything from the last decade.

  2. How are the drives connected? SATA/USB/SAS?

  3. Run iostat -xz 10 and run the rsync for long enough that it exhibits the issue.

Check the w_await column (the average time in ms for write requests to be serviced, including time spent in the queue) and make sure the values are all under about 500. (Concrete invocations for these checks are sketched after this list.)

  4. Install bcc-tools and run biolatency.

In another terminal do the rsync again until it shows the issue.

Ctrl-c the biolatency process and it will output a histogram of the latency
of all disk block requests for each block device.

For a modern spinner all of the requests should be between 32k and 512k usec.

If any are higher than that, there is likely a drive seek/surface problem.
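Roughly, I’d run steps 1, 3 and 4 like this (the rsync arguments are placeholders, and on Ubuntu/Debian the bcc tools are installed with a -bpfcc suffix, e.g. biolatency-bpfcc):

```bash
# Step 1: per-process CPU and disk usage, 5-second samples
pidstat -u -d 5

# Step 3: extended per-device stats every 10 s, idle devices suppressed
iostat -xz 10

# Step 4: block I/O latency histogram per disk; Ctrl-C to stop and print
sudo biolatency -D

# In another terminal, reproduce the stall (paths are placeholders)
rsync --sparse -a --info=progress2 /mnt/source/data/ /mnt/dest/data/
```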

  1. rsync (tho technically the CPU is a bit older than a decade. :grin: still, the write part is effectively <1%)
  2. SATA
  3. source w_await = 0; dest w_await ~ 25
  4. Where can I find the command after I install them? (On Ubuntu I had to install bpfcc-tools instead.)
    Edit: I think I found it (biolatency-bpfcc), but I get Exception: Failed to attach BPF program b'trace_req_done' to kprobe b'blk_account_io_done'. (In any case, as I said, a hardware problem seems the least likely to me at the moment. No substitute for a new test, but still…)

Monitoring Dirty and Writeback from /proc/meminfo (the exact command is sketched after this list), the behaviour I see is as follows:

  1. Read from source.
  2. Fill up the cache: Dirty increases, Writeback = 0 (high CPU usage).
  3. Stop reading; transfer speed according to progress goes to 0.
  4. Write out the cache: Dirty decreases, Writeback > 0 (low CPU usage).
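For reference, the monitoring itself is nothing fancy, just a watch over those two /proc/meminfo fields (the 1-second interval is arbitrary):

```bash
# Print the Dirty and Writeback counters from /proc/meminfo every second
watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'
```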

Today I have completed a few more tests.

The problem is rsync --sparse. Without --sparse, or with plain cp, the copy behaves as expected and is immensely faster (transfers complete in approximately half the time).

time rsync --sparse:

```
real    4m8.791s
user    0m23.558s
sys     2m43.257s
```

time rsync (no --sparse):

```
real    2m32.607s
user    0m39.175s
sys     1m33.885s
```

time cp:

```
real    2m31.811s
user    0m0.254s
sys     0m52.627s
```

The data included some VM images, so the flag seemed appropriate.
Ah well…
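For what it’s worth, if I later want to check whether the VM images actually ended up sparse on the destination anyway (as far as I understand, cp defaults to --sparse=auto, which keeps already-sparse sources sparse but doesn’t scan for zero runs the way --sparse=always does), comparing apparent size against allocated space should tell me; the path below is just a placeholder:

```bash
# Apparent (logical) size vs space actually allocated on disk
du -h --apparent-size /mnt/dest/vm/disk.img
du -h /mnt/dest/vm/disk.img

# Same information from stat: bytes vs 512-byte blocks allocated
stat -c '%n: %s bytes, %b blocks allocated' /mnt/dest/vm/disk.img
```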