Help me tackle this Packetlossvoodoo

I always had the impression that you guys are strong with networking stuff. For me it is always just a necessary evil. I'm not stupid, but most network stuff is just not enjoyable for me.

Anyway - I have had a pretty nasty issue for the past 2-3 weeks and I am going to lose it if I cannot solve this shit. I have a Cubietruck (similar to a Raspberry Pi 2) with ArchArm and a desktop that also runs Arch Linux. Everything is up and running, and ping, ssh and web services work fine. Owncloud sync is unusually slow with some dropouts (compared to a previous setup with Debian on my Cubietruck).

BUT transfers of large amounts of data die off after a certain (for me unpredictable) amount of time. Imagine the following cases using rsync and ssh:

  1. rsync ssh-desktop local - on cubietruck - can transport some 100MB - 1GB, dies randomly
  2. rsync local ssh-desktop - on cubietruck - dies off almost immediately (after 32,765 B)
  3. rsync ssh-cubie local - on desktop - dies off almost immediately (after some KB)
  4. rsync local ssh-cubie - on desktop - can transport some 100MB - 1GB, dies randomly

Bottom line: outgoing transmission from the Cubietruck is somehow more reliable than an incoming stream. But neither direction is in an acceptable state.
Just for completeness, the command looks like this:
rsync -aP user@remoteip:/path/to/file ./
and the thrown error is not very instructive:

rsync: connection unexpectedly closed (983210406 bytes received so far) [receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(226) [receiver=3.1.2]
rsync: connection unexpectedly closed (8542 bytes received so far) [generator]
rsync error: unexplained error (code 255) at io.c(226) [generator=3.1.2]

Important things to note:

  • it is independent of rsync: it also stalls with scp
  • it is also independent of ssh: the transfer also fails with samba
  • no difference if iptables is active or not! (no firewall issue)
  • no difference if httpd or samba services are running
  • no issue with the switch (desktop and cubie are on the same switch, which is again connected to the router) - I tested this via plugging both into the router directly
  • I am talking about LAN communication - everything outside the router is not an issue

I tried so many things: measuring the bandwidth with iperf, looking at tcpdump, iftop and netstat output, but I did not find anything that would indicate an issue.
One very interesting observation though: if I trace the route to 8.8.4.4 (Google's DNS server) with mtr, I get ~30% packet loss from the very first hop once the rsync process is running. There is no loss when the rsync process isn't running.

Instead of dumping every log I could produce in here, I am asking you to take me by the hand and walk me through this step by step. It's more about checking that everything is covered than about letting someone else do my work.
I am hoping for some clarity by asking for uninvolved opinions. Maybe I was stupid and have overlooked something very obvious (I tried turning it off and on again).
I am hoping for something simple.

If anything important is missing or something is stupid/unclear - don't hesitate to speak up. I am willing to make it easy for you. Let me help you help me! It's getting late over here.

Puppytax:

[image]

Are you on a wireless connection?

If so, I would check for metallic items around and make sure you have a good line of sight. That sounds extremely similar to a problem I recently diagnosed with wireless echoes. With wireless you can have an amazing signal strength, but as soon as you put a load on the network, the packet loss is incredible.

I would also check the termination quality of the Cat5 cables. At the same office I found flat ethernet cables. These cables are notorious for having bad terminations or micro fractures in the cable that result in a poor quality connection.

Just because you can "browse the web" doesn't mean that the hardware layer is working properly.

Edit: I just re-read this

Perform a large file transfer to a local computer and ping the local computer. See if you notice the same packet loss issues. (problem could be the ethernet jack on the device)
Perform a ping test from another computer to the internet. (Could be issues upstream of your modem, or between your router and modem.)
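
The first test can be scripted so the loss figure is easy to compare between an idle link and a loaded one. A minimal sketch in Python, assuming a Linux `ping` from iputils whose summary line contains "N% packet loss"; the function names and the address `192.168.1.20` are made-up placeholders, not something from this thread:

```python
import re
import subprocess


def parse_packet_loss(ping_output: str) -> float:
    """Return the packet loss percentage from ping's summary line."""
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    if match is None:
        raise ValueError("no packet loss summary found in ping output")
    return float(match.group(1))


def measure_loss(host: str, count: int = 20) -> float:
    """Send `count` echo requests to `host` and return the loss in percent."""
    # check=False: ping may exit non-zero (for example when no reply
    # arrives at all), and we still want to read its summary line.
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True, text=True, check=False,
    )
    return parse_packet_loss(result.stdout)


if __name__ == "__main__":
    # Offline demo on a canned summary line, so nothing is sent on the network.
    sample = "20 packets transmitted, 14 received, 30% packet loss, time 19027ms"
    print(parse_packet_loss(sample))
```

Calling `measure_loss("192.168.1.20")` once while the link is idle and once while the large transfer is running should show whether the loss only appears under load.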

Sorry for missing that: it's an ethernet connection of course. I'll check the cable as soon as I get home tomorrow. Such a cliche that I haven't tried this already. - Also a good hint to look at the jack.
Just out of curiosity: is there some way to monitor the connectivity / cable quality via software?

Also I think my original post was a little bit misleading. All communication is solely within my LAN. I'll edit this in as a note.

Not that I'm aware of...typically Fluke hardware or similar is used.

Although, you can check the throughput using this: http://totusoft.com/lanspeed/

No (as the hardware usually can't do it); the equipment needed to test cables is quite sophisticated. When it comes to patch cords, "take a new one and burn the old" is fairly standard these days, because the cable testers from Fluke, Tektronix and the like that are more than just a few LEDs are pretty expensive.

Well after a nice and relaxed weekend I am ready to investigate your proposal.
You might be onto something and it is somewhat embarrassing that my confused brain did not inspect the hardware first …

You might be onto something. I activated the onboard wlan module and the transfer seems to work. I am somewhat relieved, because that points to an easy solution. I'll try to find a good cable and/or inspect the connector/socket

A little edit:
Well, another (good) cable did not solve the problem. Now it has to be the socket or something between the driver and the layer where the applications pick up. I'll rule out the socket by moving the setup to an identical Cubietruck with the exact same software, which I happen to have lying around.

I don't believe it. It really seems to be a hardware fault. I swapped the boards and the other one works fine!

Thanks for pointing me to that approach. My previous attempts were way too detailed and messy!

I am not all that surprised; I had that happen with several first-generation Raspberry Pis where the solder joint at the NIC port had micro fractures. At first I thought it was the contacts in the socket, but no, it really was at the board. When I applied force to the whole connector it worked, so I re-soldered it and it worked again.

I'll try fixing it. There seems to be a tiny battery behind the ethernet socket -

(top right, left of the HanRun socket)
Is it possible that it really is a battery and that it might be getting low in capacity? (The board in question was used for a long time before.)

but I'll check the solder joints, too. thanks for the tip.

A battery on a system-board usually means there is a real-time clock... but a real-time clock should not have any influence on networking.... except some protocols popping when the systems have different UTC times.

Well, the connection was just very unstable and setup-dependent. Could be a sync (clock) issue then, right?

Networking sometimes feels very esoteric and like magic no matter how good one is in the matter.. so

it could be.... maybe... maybe not... it's trial and error now

Hehe - packetlossvoodoo, like I said. Maybe this esoteric feeling is exactly why I always find networking somewhat repelling.

I'll investigate further when I need the second board. Maybe I can find a use case where I just need the wifi, which is working.

Nevertheless: I learned a lot & found a lot of fancy and nice network debugging tools. Thanks for the help, you two!

I just want to add a bit of information related to this topic but not strictly:
It is possible to test cables from a PC's ethernet jack as long as you're using a recent Intel NIC (the I218-V, if I remember the name of the LAN chip on my Maximus VII Hero correctly, allows it). The option is available in the device settings.
Only on Windows unfortunately.

Oh nice, which tests are available?

Cable quality, connection quality and I think there's a third. I'll update the answer tomorrow when I can check that out.

What file system are your hard drives formatted with? What are the specs of your device?

Back to this problem with new motivation. To summarize, this is the issue:
My original setup is working, but now I have a second Cubietruck with a possible hardware problem, that causes the packetloss. It is on the board, so I should look there for a fix.

Any idea how to approach this effectively? I wouldn't want to redo the soldering on the socket just as a quick guess. I'd like to find and see the issue before trying to fix it (for educational purposes).

So first off, if you do a continuous ping from one device to the other, what happens? Do you have any variance in response time with the default payload on ping? You shouldn't have any variance on a local LAN (only a switch in between, right?). You might want to add payload data just to increase the size of each packet/frame that's transmitted, to more closely emulate an actual file transfer. If those look perfectly fine and consistent then your packets are getting across just fine.
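
To put a number on that variance, the per-reply times can be pulled out of the ping output. A rough sketch in Python, assuming iputils-style reply lines of the form `time=0.412 ms`; with iputils, the payload size can be raised with `-s`, e.g. `ping -s 1400 -c 50 <host>`. The function name and the sample addresses below are invented for illustration:

```python
import re
import statistics


def rtt_summary(ping_output: str) -> dict:
    """Summarise the round-trip times (ms) found in ping's per-reply lines."""
    rtts = [float(t) for t in re.findall(r"time=(\d+(?:\.\d+)?) ms", ping_output)]
    if len(rtts) < 2:
        raise ValueError("need at least two replies to measure variance")
    return {
        "replies": len(rtts),
        "min": min(rtts),
        "mean": statistics.mean(rtts),
        "max": max(rtts),
        "stdev": statistics.stdev(rtts),  # sample standard deviation
    }


if __name__ == "__main__":
    # Canned reply lines (hypothetical values) instead of a live ping run.
    sample = (
        "1408 bytes from 192.168.1.20: icmp_seq=1 ttl=64 time=0.25 ms\n"
        "1408 bytes from 192.168.1.20: icmp_seq=2 ttl=64 time=0.50 ms\n"
        "1408 bytes from 192.168.1.20: icmp_seq=3 ttl=64 time=0.75 ms\n"
    )
    print(rtt_summary(sample))
```

A healthy wired LAN link would be expected to show a small `stdev` relative to the mean; large spikes in `max` while a transfer is running are worth a closer look.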

That doesn't mean data isn't being retransmitted though. If a frame arrives corrupted, the receiving NIC drops it when the CRC check fails, and a higher layer such as TCP has to re-send the data so it still gets to the destination. This should increment a counter in the frame statistics though (see below).
The variance in ms of response time might be too low to see on the ping output, but if there are hardware issues, there will be variance. Time variance should be higher for errors with junk data in the payload field though. You might see something there.

You can look in ifconfig for your frame statistics as reported by the kernel. These frame statistics should be exactly the same as what's seen by a switch. If they're not spot on, they're probably pretty damn close.

If you don't see any TX or RX errors/drops, then your ethernet frames are making it from Layer 2 of one device to the next without issues. Packet loss can happen above this, but you should see problems on the ping test if that were the case. That's also much less likely in normal environments. Maybe that's the issue though; it's not exactly run-of-the-mill hardware and software you're using.

If you do see issues there, you're looking at NIC problems, possibly firmware or drivers, but more likely something physical.
TX errors are on the transmit pins (pins 1 and 2), RX errors are on the receive pins (pins 3 and 6). Frame errors mean the frame was received fully but failed the CRC, so again, receive pins.
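
The same counters that ifconfig prints can also be read directly, which makes it easy to compare them before and after a transfer. A rough sketch in Python, assuming a Linux kernel that exposes `/proc/net/dev` in its usual 16-column layout; the helper names are made up for illustration:

```python
from pathlib import Path

# Column order of /proc/net/dev after the "iface:" prefix.
RX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "frame", "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "colls", "carrier", "compressed"]


def parse_net_dev(text: str) -> dict:
    """Map each interface name to its RX and TX counters."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, _, rest = line.partition(":")
        values = [int(v) for v in rest.split()]
        if len(values) != len(RX_FIELDS) + len(TX_FIELDS):
            continue  # unexpected layout, skip rather than guess
        stats[name.strip()] = {
            "rx": dict(zip(RX_FIELDS, values[:len(RX_FIELDS)])),
            "tx": dict(zip(TX_FIELDS, values[len(RX_FIELDS):])),
        }
    return stats


def error_counters(stats: dict, iface: str) -> dict:
    """Pick out the counters that hint at physical-layer trouble."""
    rx, tx = stats[iface]["rx"], stats[iface]["tx"]
    return {
        "rx_errs": rx["errs"], "rx_drop": rx["drop"], "rx_frame": rx["frame"],
        "tx_errs": tx["errs"], "tx_drop": tx["drop"], "tx_carrier": tx["carrier"],
    }


if __name__ == "__main__":
    proc = Path("/proc/net/dev")
    if proc.exists():  # only present on Linux
        parsed = parse_net_dev(proc.read_text())
        for iface in parsed:
            print(iface, error_counters(parsed, iface))
```

Taking one snapshot before the rsync run and one after it dies, then comparing the two, should show whether the error counters on the wired interface grow during the transfer.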

___and now I read the thread >.< ___

Yeah I bet you'll see some frame loss there, there's always the possibility that you got a bum network chip, or that some of the soldering got botched. Obviously look for paint and schmoo on contacts for the cables and in the jack.

Definitely not a clock thing. Swapping a cable is helpful, but it doesn't entirely rule out physical problems. If you see no errors at all (0 on TX, RX and frame errors) then your hardware is perfectly fine.

Thanks for your reply anyway!
Things took an interesting turn. To test the hardware-issue hypothesis I installed a different image, one I would consider more stable, on the Cubietruck in question. And behold! Transfers work again with the supposedly faulty hardware.

So I think I have to summarize everything again to understand where we are:
2 Cubietrucks. Cubietruck A with ArchArm had strange packetloss issues. I swapped the SD card and SATA drive over to Cubietruck B, i.e. the exact same OS on the same hardware, and then it worked.
Cubietruck A had no issues with another image.
So where is the error? The only difference should be the onboard NAND, which can hold boot and hardware configuration (.fex file) and possibly < 8GB of the root filesystem. (There are different ways to set this up, but you can only boot from SD and NAND.)

So I will try to reproduce the issue with the ALArm image, and then your steps could come in handy @toorsc. Then I'll try to flash the NAND of Cubietruck A to something similar to Cubietruck B's.
All of this rests on the assumption that some networking component, together with a different configuration in the NAND, produces errors somewhere around the driver level.