The small linux problem thread

Yeah the package is there for configuring the plumbing, but you need to manually trigger dpkg to turn it on. It is a safety feature, not a flaw.

1 Like

Debian tells you "To check unattended-upgrades is enabled, run this in a terminal:". That’s not the same as saying "you have to turn this package on this way". Language matters!

What I mean is I don’t want to rely on the source, basically make two copies and compare the copies themselves. If they are identical, that’s as good as it’s gonna get?

If the rsync run has nothing to do, based on checksums rather than the timestamps preserved by -a, wouldn’t that mean both copies are identical and that’s it, or am I misinterpreting something?

This doesn’t really make sense.
First of all if you run md5sum twice it reads in the entire file to do the checksum. Now either it always returns the same data, and the checksums match, or it doesn’t, and they don’t.

Second, if you make 2 separate copies and compare them, that’s in practice the same thing as running md5sum twice, just taking longer for virtually no benefit (other than you being able to look at the copies and check what’s “more correct”).

And third, even if you do it with --checksum, it still has to read the source - how else is it going to know the checksum? So you’re relying on the source anyway.

Now, if you mean running rsync on the two copies, you run into a different issue. Say you have a file where the checksums don’t match: rsync has no way of knowing which of the 2 versions is intact and which is broken. It just has a list of files to copy; it doesn’t know or care why they differ. So now you’re potentially overwriting a good copy with a bad copy, which doesn’t get you anywhere either.
I guess that’s not a concern with --dry-run, but then you might as well just run any other checksum tool for that - it doesn’t necessarily need to be rsync.
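
For what it’s worth, comparing the two copies against each other without touching the source again can be as simple as this (the /mnt/copy1 and /mnt/copy2 paths are just placeholders):
cd /mnt/copy1 && find . -type f -exec md5sum {} + | sort -k2 > /tmp/copy1.md5
cd /mnt/copy2 && md5sum -c --quiet /tmp/copy1.md5
The second command only prints files whose checksums don’t match, so no output means the two copies agree with each other - it still says nothing about whether they match the original source.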

No matter which way you look at it you’ll have to rely on the source in some way or another.

That being said, I agree with the comment above about using ddrescue or other imaging software so that you touch the physical drive as little as possible: make an image first and work off that.
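
A minimal GNU ddrescue run for that might look like the line below - /dev/sdX and the destination paths are placeholders, -d reads the source directly, -r3 retries bad sectors three times, and the mapfile lets you stop and resume without re-reading sectors already recovered:
ddrescue -d -r3 /dev/sdX /mnt/storage/drive.img /mnt/storage/drive.map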

so, i’ve got a small linux problem.

Recently set up 10G at home.
Have a truenas scale box, exporting a dataset via NFS

Figured I’d run VMs in Ubuntu over NFS from it; since I have a cache device in the Scale box, performance should be reasonable.

All good… until, at random, my network drops and becomes non-recoverable until I reboot the box.

Ubuntu 24.04, TrueNAS Scale current stable version.

10G Aquantia NIC in the ubuntu box, some intel quad port in the TrueNAS box.

Problem definitely on the client. Only seems to happen running VMs over NFS using Virt-manager. Things will be fine, even long enough to install Windows and spice guest tools.

But randomly, VM disk IO will stop (because the network, and thus NFS, goes away), the VM will freeze, and networking on the Ubuntu box (the VM host) will stop and, no matter what I do short of a reboot, will not recover.

Any ideas? Anyone seen this?

lspci shows the device as

  • 05:00.0 Ethernet controller: Aquantia Corp. AQtion AQC113 NBase-T/IEEE 802.3an Ethernet Controller [Antigua 10G] (rev 03)

I’ve not seen this without running a VM over NFS. I have done a bit of NFS testing; maybe I need to hammer it more to determine whether it is NFS itself or a combination of factors.

Not to imply that Aquantia is necessarily in the same quality tier as Realtek, but does this occur when using e.g. a normal (Intel) 1GbE NIC on the client? If your server is GbE you probably don’t gain anything from having a weird client NIC.

Errors with the module/card may be evident in dmesg | grep module_name - find the module_name with lspci -k.
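
For that AQC113 the kernel module should be atlantic, so (assuming the 05:00.0 address from your lspci output) something like:
lspci -k -s 05:00.0
dmesg | grep -i atlantic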

To change how it handles packets, you could try iSCSI from TrueNAS (though that’s not one-to-many in the same way a file share is), or possibly force the client to connect with the opposite NFS version (v3 vs v4) from what it uses now.
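
For example, if the client currently negotiates v4, a quick test would be forcing v3 on the mount - the server path and mount point below are placeholders, and the same vers= option works in fstab:
mount -t nfs -o vers=3 truenas.local:/mnt/tank/vms /mnt/vms
(Or vers=4.2 if it’s currently on v3.)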

1 Like

Oh, don’t get me wrong, I know where you’re going, and I’m beginning to think the Aquantia 10G card is on par with or worse than Realtek.

Very tempted to buy another secondhand intel 10G card for the PC and ditch this brand new Aquantia card.

My server has a quad-port 10GbE card and I’ve managed to get about 8Gb/s of throughput between the two for regular file copies over NFS.

I’m at work at the moment, will do some more digging, cheers!

I had a similar problem years ago with a BCM 5720 - it would randomly stop all comms for a few seconds, which was not healthy for VMs.
I ended up swapping it for an Intel I350, and I’ve only used Intel NICs in servers since then.

Sometimes it might help to play with various modprobe options or ethtool - especially the settings related to offloading.
However, it is a hassle, time-consuming, and can drag on forever.
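
As a rough starting point - assuming the interface is called enp5s0, which it probably isn’t on your box - this shows the current offload settings and turns off the usual suspects:
ethtool -k enp5s0
ethtool -K enp5s0 tso off gso off gro off
The -K changes don’t persist across reboots, which is handy for testing.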

2 Likes

I’m running CachyOS with cifs mounts for Unraid shares. When I move a file in Dolphin the copy itself is fast, but once the transfer completes, the delete operation on the source location waits 30-60 seconds before finishing.

The remote file system is ZFS on a mechanical drive over SMB, the local filesystem is btrfs on a gen 5 nvme.

Here’s an example fstab entry:

//192.168.1.200/cache/dl /mnt/dl cifs credentials=/home/james/.smbcredentials,uid=1000,gid=100,nofail,iocharset=utf8,x-systemd.automount,x-systemd.mount-timeout=10,_netdev 0 0

If anyone has any suggestions I’d appreciate it.

You could set up an iSCSI device and do a throughput and endurance test. iSCSI has built-in recovery with more logging than you may see on the NFS client side.
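
If you go that route, the client side with open-iscsi is roughly this - the portal IP and IQN below are placeholders, and the target has to be defined on the server side first:
iscsiadm -m discovery -t sendtargets -p 192.168.1.200
iscsiadm -m node -T iqn.2005-10.org.freenas.ctl:testlun -p 192.168.1.200 --login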

More than likely, cache on write is turned on so the parity checking and verification are happening at the end. Try turning on write-through and see if that makes a difference. Also, why samba/cifs instead of nfs?

1 Like

They started off primarily as Windows shares and I’ve only recently been making the transition to Linux. I have seen some anecdotal evidence of the same issue with NFS from googling the problem, so I’m not sure that would solve it either.

Is there an fstab parameter to modify the caching mode for the mount point specifically or does it need to be set system wide?

To check if this is it, you can fire up a network traffic graph - if it’s still sending after the GUI shows 100%, then the data is buffered on the sender’s side.
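
A quick way to watch that without installing anything (the interface name is a placeholder):
watch -n1 "grep enp5s0 /proc/net/dev"
The transmit bytes counter will keep climbing while data is still going out.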

As for mount options: https://linux.die.net/man/8/mount.cifs

  • cache=none disables all caches
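
Adapting the fstab line from earlier, that would just be the same entry with cache=none appended to the options (note it disables client-side caching entirely, so reads can get slower):
//192.168.1.200/cache/dl /mnt/dl cifs credentials=/home/james/.smbcredentials,uid=1000,gid=100,nofail,iocharset=utf8,x-systemd.automount,x-systemd.mount-timeout=10,_netdev,cache=none 0 0
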
2 Likes

No noticeable traffic during the 100% wait period. I’ll try the different cache= settings and see if any of them make a difference.

Thanks.

I don’t have experience with Unraid, but it looks like it buffers writes heavily. You can either try to reduce this via settings on the Unraid side, or try the sync mount option - though that can degrade performance significantly.

EDIT:

  • Also, try a plain cp from the terminal - just to rule out Dolphin problems.
  • Do you use any kind of encryption on the source FS, or another feature similar to secure erase?
2 Likes

None of the filesystems are encrypted and I haven’t explicitly enabled secure erase.

I tested this with mv from the array to the nvme cache and it was basically instant. Moved the same file back to the array and the delay was still present, so it doesn’t appear to be limited to Dolphin.

When not using the write cache, Unraid will write the data and the parity simultaneously, basically halving the write speed - but that should still apply when moving off the array.

When I do this same operation on Windows via the UNC path I still get the speed penalty for both operations, but the transfer does not hang at the end.

I can test with the same file on both OS to compare objectively, but my perception is it’s faster in Windows without that added wait time.

1 Like

Weird.

Sorry, but I cannot get deeper into this due to lack of time and head being full of other stuff.
However, I am eager to know the solution :slight_smile:

2 Likes

If you are using NetworkManager, do you have automatic reconnect enabled for when the connection is lost? I had this happen when my Intel X550-T2 got too hot and sometimes lost the connection. If NetworkManager was not instructed to automatically reconnect, the connection would be lost until the host or the affected VMs were rebooted.
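
If the box does use NetworkManager (Ubuntu desktop typically does; servers often use netplan with systemd-networkd instead), you can check and set this per connection with nmcli - the connection name below is a placeholder:
nmcli -f connection.autoconnect connection show "Wired connection 1"
nmcli connection modify "Wired connection 1" connection.autoconnect yes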

2 Likes

One more kinda random thought: mv, when used between different filesystems - be they local or remote - should just be cp + rm.
So, if you rm the local file, does it also lag? In particular:
time cp local.file /mnt/dl/some_existing_dir && time sync && time rm local.file && time sync

time should output 4 sets of timings:

  • 1st one for copying
  • 2nd for syncing the copy (making remote commit write)
  • 3rd for deleting
  • 4th for syncing the delete (waiting for it to actually be done)

If the 2nd has significant wall time, then the remote buffers the data before writing it, and you are waiting for the write to actually happen. There’s not much to be done on the client - besides shifting progress bars around, the time taken will be at least the same.

If the 3rd or 4th has significant wall time, then the problem lies with the local FS and its ability to delete things.

EDIT: added extra sync after cp

2 Likes

I’ll check. It’s Ubuntu.

If the card is getting too hot, though… It’s in a well-ventilated case with no other significant heat generation at the time…

1 Like