I’m moving a bunch of files between two spinning-rust drives using rsync.
Every 5 seconds, like clockwork, the transfer stops (goes to 0 b/s), then resumes at a rate normal for the drives.
Source drive is xfs, destination drive is btrfs.
The rsync progress indicator stops updating for a couple sec, then resumes.
I am also monitoring progress via watch -n 1 progress -q -w (which reports 0 b/s during the stall). CPU usage also drops at that moment (I have htop running).
I know fio exists, but I’m not sure it’s appropriate here or how to use it in this situation.
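If it would help, I could run a quick sequential-write test with it against the destination; a rough sketch of what I have in mind (the path and size here are placeholders, not my actual setup):

    # 1 MiB sequential writes, bypassing the page cache, onto the destination drive
    fio --name=seqwrite --filename=/mnt/dest/fio.test --rw=write --bs=1M --size=4G --direct=1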
I don’t know how to tell if it’s some kind of filesystem problem or buggy rsync or whatever.
(It’s unlikely to be a disk/hardware problem, as the drives have been tested in other ways and can sustain 1 Gbit/s transfers consistently and indefinitely.)
Extra observations:
htop also reports disk I/O. Because of write caching, reads and writes are staggered - you get a hysteresis, essentially. CPU usage is high during reading; when the cache is being written out, CPU usage is quite low (writeback knobs sketched below).
Also, the fan ramp up/down is sooo annoying.
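One hunch (just a guess on my part): the 5-second cadence matches the kernel's periodic writeback flusher, whose default vm.dirty_writeback_centisecs is 500 (i.e. 5 s). The relevant knobs can be checked with something like:

    # page-cache writeback settings; dirty_writeback_centisecs defaults to 500 (5 s)
    sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_writeback_centisecs vm.dirty_expire_centisecs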
- What process/kernel thread is using the CPU? Reading/writing at spinner speeds should be <1% CPU on anything in the last decade.
- How are the drives connected? SATA/USB/SAS?
- Run iostat -xz 10 and run the rsync for long enough that it exhibits the issue. Check the w_await column (average time in ms for the device to service write requests), and make sure the values are all under about 500. (See the sketch after this list.)
- Install bcc-tools and run biolatency. In another terminal do the rsync again until it shows the issue. Ctrl-C the biolatency process and it will output a histogram of the latency of all disk block requests for each block device. For a modern spinner all of the requests should be between 32k and 512k usec. If any are higher than that, there is likely a drive seek/surface problem. (Example invocation sketched after this list.)
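For the iostat step, something along these lines (10-second samples; the drives involved are whichever devices show up with traffic):

    # extended stats, skip idle devices, report every 10 seconds; watch the w_await column
    iostat -xz 10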
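And for biolatency, roughly (assuming the bcc tools are installed under their upstream names):

    # per-disk latency histograms; Ctrl-C after the stall has happened a few times
    sudo biolatency -D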
- rsync (though technically the CPU is a bit older than a decade; still, the write part is effectively <1%)
- SATA
- source w_await = 0; dest w_await ~ 25
- Where can I find the command after I install the package? (On Ubuntu I had to install bpfcc-tools instead.)
Edit: Think I found it - biolatency-bpfcc - but I get Exception: Failed to attach BPF program b'trace_req_done' to kprobe b'blk_account_io_done'
(In any case, as I said, hardware problem seems the least likely to me atm. No substitute for a new test, but still…)
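If I read the error right, the kernel probably no longer exports the symbol under that name (it was renamed in newer kernels), so the packaged tool is likely just older than the running kernel. Checking which variant exists:

    # see whether the kprobe target biolatency wants is present in this kernel
    sudo grep -E '(__)?blk_account_io_done' /proc/kallsyms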
Monitoring Dirty and Writeback in /proc/meminfo (the one-liner I’m using is sketched after the list), the behaviour I see is as follows:
- Read from source
- Fill up cache - Dirty increases, Writeback = 0 (high CPU usage)
- Stop reading; transfer speed according to progress goes to 0
- Write out cache - Dirty decreases, Writeback > 0 (low CPU usage)
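For reference, I’m watching those two counters with something like:

    # refresh the Dirty/Writeback counters every second
    watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'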
Today I have completed a few more tests.
The problem is rsync --sparse. In contrast, rsync without --sparse, or plain cp, works as expected and is immensely faster (transfers complete in approximately half the time).
time rsync --sparse
real 4m8.791s
user 0m23.558s
sys 2m43.257s
time rsync (no -s)
real 2m32.607s
user 0m39.175s
sys 1m33.885s
time cp
real 2m31.811s
user 0m0.254s
sys 0m52.627s
The data included some VM images, so the flag seemed appropriate.
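If sparseness still matters, I can check how much actually got allocated on the destination, and plain cp can be told to create sparse output; a rough sketch (paths are placeholders):

    # apparent size vs blocks actually allocated on the destination copy
    du -h --apparent-size /mnt/dest/vm.img
    du -h /mnt/dest/vm.img
    # plain cp can also create sparse output explicitly
    cp --sparse=always /mnt/src/vm.img /mnt/dest/vm.img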
Ah well…