Debugging terrible IO performance

This happens on all my drives. File system copies start at reasonable speeds but quickly settle down to roughly 300 MB/s regardless of the device used. I feel like I'm missing something obvious.

Let's take this RAID array as an example:

4x 1TB P31 NVMe drives in RAID 0 with mdadm, on a Threadripper Pro system. Raw 4k randwrite performance:

sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
sudo fio --name=random_write --filename=/dev/md0 --filesize=100g --rw=randwrite --bs=4k --direct=1 --overwrite=0 --numjobs=10 --iodepth=32 --time_based=1 --runtime=30 --ioengine=io_uring --registerfiles --fixedbufs --gtod_reduce=1 --group_reporting
random_write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=32
...
Run status group 0 (all jobs):
  WRITE: bw=8979MiB/s (9415MB/s), 8979MiB/s-8979MiB/s (9415MB/s-9415MB/s), io=263GiB (282GB), run=30001-30001msec

So around 9 GB/s of theoretical write throughput at 4k block size (1M block size is around 12 GB/s, fwiw).
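
For reference, the 1M number presumably came from a similar run with large sequential writes; something along these lines would reproduce it (a sketch; the exact flags for that run weren't posted, so this is adapted from the 4k command above):

sudo fio --name=seq_write --filename=/dev/md0 --filesize=100g --rw=write --bs=1m --direct=1 --numjobs=10 --iodepth=32 --time_based=1 --runtime=30 --ioengine=io_uring --gtod_reduce=1 --group_reporting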

Then

sudo mkfs.ext4 /dev/md0
sudo mount -o noatime /dev/md0 /raid0
cp -r large-directory-mixed /raid0

Let's look at the progress:

% progress -m
cp [..]AAAA.MOV
        87.9% (4.2 GiB / 4.7 GiB) 321.3 MiB/s remaining 0:00:01

and iostat (nvme4n1p2 is the source drive here, an Optane):

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
md0              0.00      0.00     0.00   0.00    0.00     0.00 2322.00    565.74     0.00   0.00    0.38   249.49    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.88   6.40
md0p4            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
nvme0n1          0.00      0.00     0.00   0.00    0.00     0.00  616.00    147.82     0.00   0.00    0.35   245.73    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.22   6.40
nvme1n1          0.00      0.00     0.00   0.00    0.00     0.00  601.00    144.19     1.00   0.17    0.37   245.67    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.22   6.40
nvme2n1          0.00      0.00     0.00   0.00    0.00     0.00  592.00    142.76     0.00   0.00    0.33   246.94    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.20   6.40
nvme3n1          0.00      0.00     0.00   0.00    0.00     0.00  583.00    141.34     1.00   0.17    0.34   248.25    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.20   6.40
nvme4n1       2452.00    304.31     0.00   0.00    0.52   127.09    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.27  20.80
nvme4n1p1        0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
nvme4n1p2     2452.00    304.31     0.00   0.00    0.52   127.09    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.27  20.80
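
The iostat output above is the extended per-device view with rates in megabytes; an invocation along these lines produces it (the exact command wasn't captured here):

iostat -xm 1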

and a dstat sample:

---system---- --total-cpu-usage-- ------memory-usage----- -dsk/total----dsk/md0---dsk/nvme0n1-dsk/nvme1n1-dsk/nvme2n1-dsk/nvme3n1-dsk/nvme4n1
     time     |usr sys idl wai stl| used  free  buff  cach| read  writ: read  writ: read  writ: read  writ: read  writ: read  writ: read  writ
16-12 09:52:42|  0   3  96   1   0|3318M 1615M 23.4M  243G| 596M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 298M    0
16-12 09:52:43|  0   3  96   1   0|3317M 1572M 23.4M  243G| 608M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 304M    0
16-12 09:52:44|  0   3  96   1   0|3318M 1542M 23.4M  243G| 592M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 296M    0
16-12 09:52:45|  0   3  96   1   0|3319M 1475M 23.4M  243G| 608M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 304M    0
16-12 09:52:46|  0   4  95   1   0|3319M 1501M 23.5M  243G| 592M 1902M:   0  1866M:   0   477M:   0   476M:   0   475M:   0   474M: 296M    0
16-12 09:52:47|  0   3  96   1   0|3319M 1577M 23.5M  243G| 608M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 304M    0
16-12 09:52:48|  0   3  96   1   0|3320M 1416M 23.5M  243G| 624M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 312M    0
16-12 09:52:49|  0   3  96   1   0|3320M 1390M 23.5M  243G| 624M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 312M    0
16-12 09:52:50|  0   3  96   1   0|3322M 1580M 23.5M  243G| 608M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 304M    0
16-12 09:52:51|  0   4  95   1   0|3319M 1610M 23.6M  243G| 576M 1584M:   0  1573M:   0   397M:   0   396M:   0   396M:   0   395M: 288M    0
16-12 09:52:52|  0   5  94   1   0|3324M 1568M 23.8M  243G| 592M 5698M:   0  5662M:   0  1424M:   0  1424M:   0  1424M:   0  1425M: 296M    0
16-12 09:52:53|  0   3  96   1   0|3324M 1571M 23.8M  243G| 606M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 303M    0
16-12 09:52:54|  0   3  96   1   0|3325M 1385M 23.8M  243G| 503M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 252M    0
16-12 09:52:55|  0   3  96   1   0|3320M 1386M 23.8M  243G| 560M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 280M    0
16-12 09:52:56|  0   3  96   1   0|3323M 1370M 23.8M  243G| 576M  232k:   0   232k:   0     0 :   0   232k:   0     0 :   0     0 : 288M    0
16-12 09:52:57|  0   3  96   1   0|3322M 1424M 23.8M  243G| 624M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 312M    0
16-12 09:52:58|  0   3  96   1   0|3323M 1577M 23.8M  243G| 608M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 304M    0
16-12 09:52:59|  0   3  96   1   0|3324M 1572M 23.8M  243G| 608M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 304M    0
16-12 09:53:00|  0   3  96   1   0|3324M 1569M 23.8M  243G| 608M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 304M    0
16-12 09:53:01|  0   3  96   1   0|3323M 1526M 23.8M  243G| 608M   12k:   0    12k:   0     0 :   0    12k:   0     0 :   0     0 : 304M    0
16-12 09:53:02|  0   3  96   1   0|3325M 1625M 23.8M  243G| 592M  944k:   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 296M  472k
16-12 09:53:03|  0   3  96   1   0|3326M 1602M 23.8M  243G| 588M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 294M    0
16-12 09:53:04|  0   3  96   1   0|3327M 1560M 23.8M  243G| 592M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 296M    0
16-12 09:53:05|  0   3  96   1   0|3327M 1482M 23.8M  243G| 592M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 296M    0
16-12 09:53:06|  0   3  96   1   0|3331M 1446M 23.8M  243G| 576M 6996k:   0    28k:   0     0 :   0    28k:   0     0 :   0     0 : 288M 3484k
16-12 09:53:07|  0   3  96   1   0|3332M 1477M 23.8M  243G| 576M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 288M    0
16-12 09:53:08|  0   3  96   1   0|3331M 1454M 23.8M  243G| 592M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 296M    0
16-12 09:53:09|  0   3  96   1   0|3331M 1403M 23.8M  243G| 608M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 304M    0
16-12 09:53:10|  0   3  96   1   0|3330M 1353M 23.8M  243G| 592M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 296M    0
16-12 09:53:11|  0   3  96   1   0|3330M 1564M 23.8M  243G| 640M   12k:   0    12k:   0     0 :   0    12k:   0     0 :   0     0 : 320M    0
16-12 09:53:12|  0   3  96   1   0|3330M 1520M 23.8M  243G| 592M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 296M    0
16-12 09:53:13|  0   4  95   1   0|3332M 1436M 23.9M  243G| 592M 3493M:   0  3472M:   0   874M:   0   873M:   0   873M:   0   873M: 296M    0
16-12 09:53:14|  0   3  96   1   0|3330M 1398M 23.9M  243G| 608M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 304M    0
16-12 09:53:15|  0   3  96   1   0|3331M 1333M 23.9M  243G| 608M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 304M    0
16-12 09:53:16|  0   3  96   1   0|3331M 1615M 23.9M  243G| 608M  148k:   0   148k:   0     0 :   0   148k:   0     0 :   0     0 : 304M    0
16-12 09:53:17|  1   3  95   1   0|3306M 1394M 23.9M  243G| 608M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 304M    0
16-12 09:53:18|  0   3  96   1   0|3307M 1327M 23.9M  243G| 592M  144k:   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 296M   72k
16-12 09:53:19|  0   3  96   1   0|3306M 1372M 23.9M  243G| 560M    0 :   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 280M    0
16-12 09:53:20|  0   3  96   1   0|3308M 1323M 23.9M  243G| 592M  168k:   0     0 :   0     0 :   0     0 :   0     0 :   0     0 : 296M   84k
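
The dstat sample was captured with something like the following (a reconstruction from the column headers, not the exact command line):

dstat -tcmd -D total,md0,nvme0n1,nvme1n1,nvme2n1,nvme3n1,nvme4n1 1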

What am I missing?

On the surface it sounds like the drives' SLC or DRAM caches are being saturated after a while; however, I don't know whether your actual numbers are anywhere near reasonable even in that scenario.

The expected sustained sequential-write speed for a single P31 seems to be slightly above 1500 MB/s, which is clearly not as bad as your numbers. However, that figure is at queue depth 32, and your cp operation is probably running at queue depth 1. Still, it sounds low.
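
One way to test that theory: an fio run that roughly mimics what a single cp does, i.e. one job doing sequential writes at queue depth 1 (a sketch; the 128k buffer size is an assumption about cp, cp itself goes through the page cache rather than O_DIRECT, and /raid0/fio_qd1_test is just a hypothetical scratch file):

sudo fio --name=qd1_write --filename=/raid0/fio_qd1_test --size=20g --rw=write --bs=128k --direct=1 --numjobs=1 --iodepth=1 --ioengine=psync --group_reporting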

Good point on the queue depth. Will run more tests.

A different test: copying 184 GB from this RAID 0 array to an Optane P5800X takes 6 minutes 19 seconds, which comes out to ~485 MB/s.
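
(Sanity check on the arithmetic: 6 minutes 19 seconds is 379 s, and 184 GB / 379 s ≈ 0.49 GB/s, i.e. roughly 485 MB/s.)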

Queue depth was spot on. Thank you.

Naive me assumed cp would go as fast as possible. However, a single process is not enough to fully saturate the hardware. (Trivia: strace shows cp working in 128k blocks.)
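
For anyone wanting to reproduce that observation, something along these lines shows the syscall sizes (file names are hypothetical):

strace -e trace=read,write cp some-large-file /raid0/ 2>&1 | head

The output is a stream of alternating read()/write() calls of 131072 bytes each, i.e. one 128k block in flight at a time.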

Running the following copy instead yields a much more respectable 2.9 GB/s for the same 184 GB directory:

find the-directory -type f > files.txt
cat files.txt | parallel -j 64 \
  rsync --owner --group --archive --copy-links --whole-file --relative --no-compress --progress {} /raid0
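
For what it's worth, the speedup comes from parallelism rather than from rsync itself: parallel -j 64 keeps 64 rsync processes copying files at once, so the array sees many outstanding requests instead of one. --whole-file skips rsync's delta-transfer algorithm (pointless for a local copy), and --relative recreates each file's path from files.txt under /raid0.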