Here are some disk speed tests, run on the host, writing to an LV on the RAID 10 array mounted on the host (i.e. no VM and no network involved).
#> dd if=/dev/zero of=diskbench bs=1M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.01235 s, 356 MB/s
#> dd if=/dev/zero of=diskbench bs=1M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.32202 s, 462 MB/s
#> dd if=/dev/zero of=diskbench bs=1M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.4362 s, 441 MB/s
#> dd if=/dev/zero of=diskbench bs=2M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 4.94802 s, 434 MB/s
#> dd if=/dev/zero of=diskbench bs=20M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 39.6829 s, 541 MB/s
#> dd if=/dev/zero of=diskbench bs=20M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 37.4604 s, 573 MB/s
#> dd if=/dev/zero of=diskbench bs=20M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 37.3123 s, 576 MB/s
It seems to get faster each time I run the test. Caching?
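If caching is the suspect, a way to rule the host page cache out between runs (just a sketch, not part of the runs above; it needs root and drops all clean cached data system-wide, not just this file's) would be to flush before each dd:
#> sync && echo 3 > /proc/sys/vm/drop_caches
#> dd if=/dev/zero of=diskbench bs=20M count=1024 conv=fdatasync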
Next, same test, but from inside the NFS VM. No network protocol, just disk writing to the same disks through the virtualization layer.
[AoD.NFS]#> dd if=/dev/zero of=diskbench bs=20M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 79.4929 s, 270 MB/s
[AoD.NFS]#> dd if=/dev/zero of=diskbench bs=20M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 87.1805 s, 246 MB/s
[AoD.NFS]#> dd if=/dev/zero of=diskbench bs=20M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 88.1575 s, 244 MB/s
[AoD.NFS]#> dd if=/dev/zero of=diskbench bs=20M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 89.4389 s, 240 MB/s
Holy Bandwidth, Batman! That VM layer is cutting disk write throughput by more than half!
I’m not sure why the first bs=1M host tests came in at 356, 462, and 441 MB/s, but the bs=20M host tests have all been in the 540-580 MB/s range.
Other than the first test of 270 MB/s, all the other tests from inside the VM ran at about 240 MB/s.
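As a cross-check on dd, a tool like fio (if it's installed) can run the same kind of sequential write with direct I/O, which keeps the page cache out of the picture entirely; the file name and size here are just placeholders:
#> fio --name=seqwrite --filename=fio-testfile --rw=write --bs=1M --size=4G --ioengine=libaio --direct=1 --numjobs=1 --end_fsync=1
If fio shows the same roughly 2x host-vs-guest gap, caching isn't what's being measured.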
Next test: reboot the VM and watch free.
NFS VM, after fresh reboot:
[AoD.NFS]#> free -m
total used free shared buff/cache available
Mem: 987 95 766 2 124 758
Swap: 0 0 0
After starting dd:
[AoD.NFS]#> free -m
total used free shared buff/cache available
Mem: 987 117 70 2 799 712
Swap: 0 0 0
Upon completion of the first dd:
[AoD.NFS]#> free -m
total used free shared buff/cache available
Mem: 987 96 101 2 789 731
Swap: 0 0 0
[AoD.NFS]#> dd if=/dev/zero of=diskbench bs=20M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 98.3258 s, 218 MB/s
So it looks like it’s keeping a bunch of stuff in cache. And memory may be an issue.
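One way to take the guest page cache out of the equation entirely (a sketch, not one of the runs above) is to have dd open the file with O_DIRECT instead of relying on the fdatasync at the end:
[AoD.NFS]#> dd if=/dev/zero of=diskbench bs=20M count=1024 oflag=direct
If that still lands around 240 MB/s, guest memory and caching aren't the bottleneck.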
Second dd:
While running:
[AoD.NFS]#> free -m
total used free shared buff/cache available
Mem: 987 117 64 2 805 710
Swap: 0 0 0
[AoD.NFS]#> dd if=/dev/zero of=diskbench bs=20M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 87.1268 s, 246 MB/s
Second one was a little faster, so I think some cache has come into play.
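How much dirty data the kernel will buffer before forcing writeback is controlled by the vm.dirty_* sysctls, so if host and guest seem to cache differently it may be worth comparing them on both sides (purely a check, not a fix):
[AoD.NFS]#> sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_expire_centisecs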
Bump VM RAM from 1G to 4G:
After reboot:
[AoD.NFS]#> free -m
total used free shared buff/cache available
Mem: 3946 104 3750 5 91 3675
Swap: 0 0 0
Free RAM dropped like a fly until it bottomed out here during the transfer:
[AoD.NFS]#> free -m
total used free shared buff/cache available
Mem: 3946 125 104 5 3710 3550
Swap: 0 0 0
[AoD.NFS]#> dd if=/dev/zero of=diskbench bs=20M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 99.764 s, 215 MB/s
Still only 215 MB/s though. Same test again:
[AoD.NFS]#> dd if=/dev/zero of=diskbench bs=20M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 90.4059 s, 238 MB/s
[AoD.NFS]#> dd if=/dev/zero of=diskbench bs=20M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 86.9767 s, 247 MB/s
[AoD.NFS]#> dd if=/dev/zero of=diskbench bs=20M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 87.9707 s, 244 MB/s
OK. Now I’m mad! Allocating 32GB of RAM to the VM:
[AoD.NFS]#> free -m
total used free shared buff/cache available
Mem: 32170 152 31922 8 95 31706
Swap: 0 0 0
[AoD.NFS]#> dd if=/dev/zero of=diskbench bs=20M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 99.23 s, 216 MB/s
[AoD.NFS]#> free -m
total used free shared buff/cache available
Mem: 32170 156 10760 8 21253 31549
Swap: 0 0 0
OK, over 20 GB of buffer/cache! And still only 216 MB/s.
Again for posterity:
[AoD.NFS]#> dd if=/dev/zero of=diskbench bs=20M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 88.7613 s, 242 MB/s
[AoD.NFS]#> dd if=/dev/zero of=diskbench bs=20M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 88.3195 s, 243 MB/s
So it appears that 1 GB of RAM performs no differently than 32 GB of RAM in this test, which makes sense: conv=fdatasync forces dd to flush everything to disk before reporting, so a bigger guest cache can absorb dirty pages along the way but can't make the array itself any faster.
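To see what the cache alone can do (and why these runs deliberately exclude it), the same command without conv=fdatasync should report a number dominated by memory speed rather than disk speed, at least while the file still fits in RAM; something like:
[AoD.NFS]#> dd if=/dev/zero of=diskbench bs=20M count=1024
That figure only confirms the cache is working; it says nothing about the array.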
I also re-ran on the host after all that, just to be sure nothing changed:
#> dd if=/dev/zero of=diskbench bs=20M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 36.9425 s, 581 MB/s
I can try playing with CPU cores and other stuff, but I think something is going seriously wrong in the virtual disk layer.
Settings in virt-manager for the LV-backed disk that the VM uses for the NFS share:
Device Type: VirtIO Disk 2
Disk Bus: VirtIO
Cache Mode: none
Discard Mode: Hypervisor Default
Detect Zeros: Hypervisor Default
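One thing that dialog doesn't show is the AIO mode on the virtio driver. It can be checked in the domain XML (the domain name here is an assumption based on the prompts above):
#> virsh dumpxml AoD.NFS | grep -i '<driver'
For an LV-backed virtio disk with cache mode none, a driver line along the lines of <driver name='qemu' type='raw' cache='none' io='native'/> is the commonly recommended combination; if io= is missing or set to threads, that's one knob to try before anything more drastic.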
Again, none of the tests above used the NFS protocol, so network is not a factor.