Constant broken pipe/timeouts trying to replicate to remote Truenas via Tailscale

I have a remote Truenas Scale backup node running Tailscale, and when I try to push a replication to it, the task fails every 5GB or so. I can restart it and it will make more progress, but it soon fails again.

I can pull from the remote node to local reliably.

Errors include “Broken pipe” and “Network timeout”

A VM on the remote system and hosts on my local network can ping 1.1.1.1 continuously with no dropped packets for the entire time the connection is broken, so it doesn't look like a general network issue; it seems to be specific to the Tailscale connection.

I have tried:

Truenas Core on local system, routed via Tailscale on pfSense router
Truenas Scale on local system, routed via Tailscale on pfSense router
Truenas Scale on local system, routed via Tailscale running on Truenas itself
Deleted the receiving datasets entirely and re-created them
Scrubbed both pools (no errors reported)

With Truenas Core on the local system I was getting a wide variety of issues resuming the stream; Scale seems to resume reliably, but nothing I've tried fixes the connection dropping.

Both Truenas Scale systems are running the latest Tailscale version (1.54).

Any ideas? This really seems like a Tailscale issue, but I've done this reliably in the past and I can pull reliably, which is just weird…

weird indeed.


Can you try other things?
iperf3… or some kind of mbuffer test: mbuffer -i /dev/urandom -O other_host:port on one end and mbuffer -I port -o /dev/null on the other.
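Something along these lines, where the port number and the remote Tailscale IP are just placeholders; let it run for a few minutes so it mimics a long replication better than a short iperf3 burst:

mbuffer -I 12345 -o /dev/null                            # on the remote box: listen and discard everything it receives
mbuffer -i /dev/urandom -O <remote-tailscale-ip>:12345   # on the local box: push random data over the tailnet

If that also stalls after a few GB, the Tailscale path is the suspect; if it runs clean for tens of GB, the problem is more likely on the receive/write side.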

I ran iperf3 but I'm not really sure what to look for. The bandwidth is a little better than my actual transfer speed, and it doesn't flake out while I'm running the test. I can consistently ping, access the interface, etc., with zero interruptions for hours, until I get 5-10GB into a replication task; then the remote system goes completely unresponsive, data stops moving, and eventually the replication errors out (though increasing the number of attempts before it fails has helped move more data before I need to manually restart the task).

I did get a "device X is causing slow IO on pool" error at some point on both HDDs in the mirrored pool, more recently just on one of them. There are no other errors or SMART indications that anything is wrong with the drives… but they are molasses-slow 5400RPM spinning disks. I'm starting to suspect the remote system is accepting data faster than it can write it to disk, and for some reason goes unresponsive as it tries to flush the cached data from RAM to the pool.
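If I want to confirm that theory, I can watch the remote pool while a push is running, something like this on the remote box (with <pool> standing in for my backup pool's name), to see whether the disks are saturated while data is coming in:

zpool iostat -v <pool> 5    # per-vdev read/write throughput and IOPS, refreshed every 5 seconds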

I’m going to try a rate limit…

Well, the rate limit fixed it. My peaks were 22MB/s, but the transfer would pause about half the time, so my average was maybe 11MB/s, and it would error out before it ever got very far. Limited to 7MB/s it's rock solid, 30GB in.
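For anyone wanting the same throttle outside of a replication task, it can be approximated by hand with a manual send; the dataset, snapshot, and host names below are made up, and mbuffer's -r flag provides the ~7MB/s cap:

zfs send tank/data@snap | mbuffer -q -r 7M | ssh backup-host zfs receive -F backup/data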

Seems like a bug or a very poorly thought out feature that the system will ingest data so fast it goes unresponsive as it tries to actually write out/process that data. But at least there is a workaround!

That usually only happens if you use high compression and the CPU load goes through the roof. 22MB/s is a piece of cake for HDDs, especially with ZFS. ZFS will throttle transfers once dirty data hits its maximum and the drive queues are full.

You can try reducing or increasing txg_timeout. But send/receive is serialized, and the receiving pool gets the most favorable workload there is: sequential writes. Even a 30-year-old HDD can write 20MB/s sequentially, so unless you are writing to some junk SSD, it's probably not the drives.
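On Scale (Linux OpenZFS) those knobs are module parameters, so you can at least see where the write throttle kicks in. The paths below are the standard OpenZFS ones, and the values are just examples, not recommendations:

cat /sys/module/zfs/parameters/zfs_txg_timeout        # seconds between forced txg syncs (default 5)
cat /sys/module/zfs/parameters/zfs_dirty_data_max     # max bytes of in-flight dirty data; ZFS delays writers as this fills up
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout  # as root: stretch the txg interval to 10s, runtime only, reverts on reboot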