Comprehensive debugging of network issues


#31

Actually the issue has been around a few years… but you’re preaching to the choir. I go to great lengths to fix, optimize, etc.

Most people, including techies I know who said they’ve experienced similar issues, just want to minimize effort, and really won’t tolerate having devices disconnected or slowed down. Considering the disruption from the issue can be maybe just a minute every week or so, and easily worked around by reconnecting… the economics of this chart indicate I’ve already spent too much time:


#32

I think you are at the point where you need to replace the wireless NIC on the problem PC.

If I’m not mistaken, you have already:
changed out the router; the problem persisted.
changed out the access point; the problem persisted.
And it’s just this one machine that keeps dropping the connection.

So next is the NIC… hopefully that is the fix.


#33

It’s not just one machine. My previous machine at the same location experienced the same thing with a different NIC. It seems to have happened with at least one other machine in a different location, but I don’t get good debugging info from other users, often only long after the fact.

It happened with:
-router: Engenius ESR-9850; AP: TP-Link WR841N
-router: Netgear R7000; AP: TP-Link WR841N
-router: Netgear R7000; AP: Engenius ESR-9850


#34

What is the brand and model of your power line adaptors?
from what I’ve read
Engenius ESR9850 - it’s a product that’s been around since 2010
TP-Link WR841N - it’s a product that’s been around since 2014
Netgear R7000 - looks like it came out 2016/2017 but it’s had issues

I read a few forum posts and angry reviews from people who bought a TP-Link WR841N and described issues similar to the ones you are experiencing. Same with the Netgear R7000, but those were apparently resolved with newer firmware versions.
If those power line adaptors are TP-Link, I’ve got bad news for you: random internet dropouts are a known thing. Firmware updates might help, but as a matter of best practice I’d avoid using them if possible. Running a Cat6 cable and keeping it well away from existing power cabling would be best.


#35

The powerline adapters are TP-Link TL-PA511KIT and yes there are known issues with them having connection drops due to power savings issues.

I haven’t been able to confirm that my issue is due to the powerline adapters, although I’ve had my suspicions, since in my case when one device has the drop, others are still able to connect via a powerline route. Disconnecting/reconnecting the affected device from wifi seems to “fix” the drop and unless that’s a coincidence of timing I don’t see how that would work if the powerline adapters were the problem.

Also, the WR841N and R7000 are running ddwrt. The ESR9850 unfortunately has no ddwrt support, so it’s on the latest stock firmware and the R7000 was purchased to retire it.


#36

Depends if that 1 minute per week is in the middle of something important now, doesn’t it?


#37

Also depends on whether they complain about it. It seems like I’m the only one losing hair over this.


Not to bore you all, but it happened again. Does the ping output have any interesting clues? I started pinging when I noticed no internet.

$ ping 192.168.2.1
PING 192.168.2.1 (192.168.2.1) 56(84) bytes of data.
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
64 bytes from 192.168.2.1: icmp_seq=1 ttl=64 time=118531 ms
64 bytes from 192.168.2.1: icmp_seq=2 ttl=64 time=117489 ms
64 bytes from 192.168.2.1: icmp_seq=3 ttl=64 time=116465 ms
64 bytes from 192.168.2.1: icmp_seq=5 ttl=64 time=114417 ms
64 bytes from 192.168.2.1: icmp_seq=6 ttl=64 time=113394 ms
64 bytes from 192.168.2.1: icmp_seq=8 ttl=64 time=111346 ms
64 bytes from 192.168.2.1: icmp_seq=9 ttl=64 time=110322 ms
64 bytes from 192.168.2.1: icmp_seq=10 ttl=64 time=109298 ms
64 bytes from 192.168.2.1: icmp_seq=12 ttl=64 time=107250 ms
64 bytes from 192.168.2.1: icmp_seq=13 ttl=64 time=106226 ms
64 bytes from 192.168.2.1: icmp_seq=18 ttl=64 time=101106 ms
64 bytes from 192.168.2.1: icmp_seq=19 ttl=64 time=100082 ms
64 bytes from 192.168.2.1: icmp_seq=20 ttl=64 time=99058 ms
64 bytes from 192.168.2.1: icmp_seq=22 ttl=64 time=97014 ms
64 bytes from 192.168.2.1: icmp_seq=23 ttl=64 time=95990 ms
64 bytes from 192.168.2.1: icmp_seq=24 ttl=64 time=94966 ms
64 bytes from 192.168.2.1: icmp_seq=25 ttl=64 time=93942 ms
64 bytes from 192.168.2.1: icmp_seq=26 ttl=64 time=92918 ms
64 bytes from 192.168.2.1: icmp_seq=27 ttl=64 time=91894 ms
64 bytes from 192.168.2.1: icmp_seq=28 ttl=64 time=90870 ms
64 bytes from 192.168.2.1: icmp_seq=29 ttl=64 time=89846 ms
64 bytes from 192.168.2.1: icmp_seq=30 ttl=64 time=88822 ms
64 bytes from 192.168.2.1: icmp_seq=31 ttl=64 time=87798 ms
64 bytes from 192.168.2.1: icmp_seq=32 ttl=64 time=86774 ms
64 bytes from 192.168.2.1: icmp_seq=33 ttl=64 time=85750 ms
64 bytes from 192.168.2.1: icmp_seq=34 ttl=64 time=84726 ms
64 bytes from 192.168.2.1: icmp_seq=35 ttl=64 time=83702 ms
64 bytes from 192.168.2.1: icmp_seq=36 ttl=64 time=82678 ms
64 bytes from 192.168.2.1: icmp_seq=37 ttl=64 time=81654 ms
64 bytes from 192.168.2.1: icmp_seq=38 ttl=64 time=80630 ms
64 bytes from 192.168.2.1: icmp_seq=39 ttl=64 time=79606 ms
64 bytes from 192.168.2.1: icmp_seq=40 ttl=64 time=78582 ms
64 bytes from 192.168.2.1: icmp_seq=41 ttl=64 time=77558 ms
64 bytes from 192.168.2.1: icmp_seq=42 ttl=64 time=76534 ms
64 bytes from 192.168.2.1: icmp_seq=43 ttl=64 time=75510 ms
64 bytes from 192.168.2.1: icmp_seq=44 ttl=64 time=74486 ms
64 bytes from 192.168.2.1: icmp_seq=45 ttl=64 time=73462 ms
64 bytes from 192.168.2.1: icmp_seq=46 ttl=64 time=72439 ms
64 bytes from 192.168.2.1: icmp_seq=47 ttl=64 time=71415 ms
64 bytes from 192.168.2.1: icmp_seq=48 ttl=64 time=70391 ms
64 bytes from 192.168.2.1: icmp_seq=49 ttl=64 time=69367 ms
64 bytes from 192.168.2.1: icmp_seq=51 ttl=64 time=67319 ms
64 bytes from 192.168.2.1: icmp_seq=52 ttl=64 time=66295 ms
64 bytes from 192.168.2.1: icmp_seq=53 ttl=64 time=65271 ms
64 bytes from 192.168.2.1: icmp_seq=54 ttl=64 time=64247 ms
64 bytes from 192.168.2.1: icmp_seq=55 ttl=64 time=63223 ms
64 bytes from 192.168.2.1: icmp_seq=56 ttl=64 time=62198 ms
64 bytes from 192.168.2.1: icmp_seq=57 ttl=64 time=61175 ms
64 bytes from 192.168.2.1: icmp_seq=58 ttl=64 time=60151 ms
64 bytes from 192.168.2.1: icmp_seq=59 ttl=64 time=59127 ms
64 bytes from 192.168.2.1: icmp_seq=99 ttl=64 time=6.29 ms
64 bytes from 192.168.2.1: icmp_seq=100 ttl=64 time=2.99 ms
64 bytes from 192.168.2.1: icmp_seq=101 ttl=64 time=3.05 ms
64 bytes from 192.168.2.1: icmp_seq=102 ttl=64 time=2.66 ms
64 bytes from 192.168.2.1: icmp_seq=103 ttl=64 time=2.85 ms
64 bytes from 192.168.2.1: icmp_seq=104 ttl=64 time=2.94 ms
^C
--- 192.168.2.1 ping statistics ---
104 packets transmitted, 56 received, 46% packet loss, time 124361ms
rtt min/avg/max/mdev = 2.660/77202.876/118531.793/31247.224 ms, pipe 98

Observations:

  • Many lines during the outage have “No buffer space available” messages, but not all.
  • Only 46% packet loss, but I stopped pinging almost as soon as internet connectivity returned.
  • Some lines show a “response” but it is delayed by nearly 2 minutes? Delayed, not lost packets?
  • “pipe 98” in ping output? What does that mean?
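
On the “delayed, not lost” question, a quick arithmetic check on the output above may help. The sketch below adds each reply's reported round-trip time to an estimate of when its request was sent. The 1024 ms send interval is an assumption, estimated from the difference between consecutive reported times (117489 - 116465 = 1024), and `arrival_times` is a hypothetical helper name. As for “pipe 98”: as far as I know, in iputils ping that figure is the largest number of echo requests that were outstanding (sent but not yet answered) at one time.

```shell
# Four sample lines copied from the ping output above.
sample='64 bytes from 192.168.2.1: icmp_seq=1 ttl=64 time=118531 ms
64 bytes from 192.168.2.1: icmp_seq=3 ttl=64 time=116465 ms
64 bytes from 192.168.2.1: icmp_seq=30 ttl=64 time=88822 ms
64 bytes from 192.168.2.1: icmp_seq=59 ttl=64 time=59127 ms'

# Estimate when each reply arrived, in ms after the first request was sent:
# (seq - 1) * send interval + reported round-trip time.
# $1 is the assumed send interval in ms (defaults to 1000).
arrival_times() {
    awk -v interval_ms="${1:-1000}" '
        match($0, /icmp_seq=[0-9]+/) {
            seq = substr($0, RSTART + 9, RLENGTH - 9)
            match($0, /time=[0-9.]+/)
            rtt = substr($0, RSTART + 5, RLENGTH - 5)
            printf "seq=%d arrival=%d ms\n", seq, (seq - 1) * interval_ms + rtt
        }'
}

printf '%s\n' "$sample" | arrival_times 1024
```

The four estimates come out as 118531, 118513, 118518 and 118519 ms, all within about 20 ms of each other. That is consistent with the requests (or their replies) being held in a queue and released in one burst when the link recovered, rather than each packet being slow on its own.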

#38

You could try getting a Mikrotik hAP ac², a Ubiquiti AP ac lite, or a Mikrotik SXTsq Lite5.

If it doesn’t solve your problem, at least you’ll have a capable router and dual-band AP, or a decent, well-known, popular dual-band AP. You may be able to split clients across different channels for debugging, and you may be able to get rid of powerline with the SXT (its 16 dBi gain is enough to erase some walls).

At that point you’ll know for sure it’s a client issue.

Just to double check, you’ve tried various 2.4GHz and 5GHz channels as well, it’s not analog interference?


#39

“No buffer space available” generally means the kernel could not queue the packet for sending, which typically happens when the network adapter has stalled or lost its carrier. As above, maybe it’s your machine.

Really you need to start shuffling network gear around (or plugging a laptop into different network segments to confirm which parts are affected during the issue) or isolating it in order to troubleshoot.

Until you do that, you’re simply wasting your time pissing into the wind.

Spending too much time on solving a problem may be bad, but even worse is wasting time doing things that have no hope of diagnosing the problem.

If you’re not willing to do what is required to diagnose it, then just accept that it is going to happen and live with it.

edit:
Have you at least been logging the date/time of these dropouts? If not, I cannot stress this enough: DO THIS.

You may have no idea of any pattern in these drops until you have, say, a few days or maybe even a week (or whatever) of date/time logs for the problem.

If it’s happening every 2 hours or 5 minutes or whatever, then you can examine your network/environment for events that coincide with those time intervals.

As an example, on a mine site I diagnosed a problem with Wi-Fi that ended up having nothing to do with equipment: it was due to physical obstructions between the two point-to-point wireless devices. A truck loading area was stacking sea containers 3 high from time to time and blocking line of sight between the two devices…

Without accurate time/date log of the issue that would have been much more difficult to diagnose as a possibility from remote and then confirm with a site visit…


#40

Thanks for the recommendations. I’ll keep them in mind, but I hope the problem can be solved without buying more stuff.

Only 2.4 GHz is being used and the channels are fixed:

  • AP: channel = 1; channel width = 40 MHz, extension channel = upper
  • Router: channel = 11; channel width = 20 MHz

I could try changing channels (suggestions?). Years ago those channels had seemed to work best, but things might have changed.

I have a spare netbook I could leave plugged in at different points in the network. Do you have specific testing setup suggestions?

I could leave it pinging various devices, using some tips to also print timestamps. There’s also Wireshark but I don’t know how to interpret the output, and leaving it running a long time seems to fill up a lot of disk space, but maybe that’s because I was using the system at the same time.


#41

At the simplest level, even just leaving a script running that logs a date/time stamp, then say 5 pings, then another date/time stamp to a text file would be a start. Do this a few times at various points in the network (e.g., plugged directly into the router, plugged into a powerline adapter, and then on wifi or whatever) to get a few failures.
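
A minimal sketch of such a logger, assuming the iputils `ping` found on most Linux systems. `log_ping_round` is a hypothetical helper name, and the count of 5 and the 2-second reply timeout are arbitrary choices:

```shell
# Append one timestamped round of pings to a log file.
# Usage: log_ping_round HOST LOGFILE [COUNT]
log_ping_round() {
    local host=$1 logfile=$2 count=${3:-5} status
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') start $host"
        # -c: number of echo requests, -W: seconds to wait for a reply
        ping -c "$count" -W 2 "$host" 2>&1
        status=$?
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') end $host (ping exit status $status)"
    } >> "$logfile"
}
```

Called in a loop, for example `while true; do log_ping_round 192.168.2.1 pings.log; sleep 60; done`, it leaves a text file whose start/end stamps can be lined up with the times a dropout was noticed.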

Wireshark is useful for debugging issues, but I suspect not in this case; Wireshark will tell you what traffic is running over a link, but your issue here is that NO traffic is running over the link, and it looks like that is because the link layer is going down somewhere.

A Wireshark dump certainly won’t hurt, but you probably want to grab the frames from just before the issue occurs, and if you don’t know when it will occur, leaving Wireshark running for long periods of time will result in absolutely MASSIVE traffic captures.

You can set up filters in wireshark to only capture certain frames, etc. but if you don’t know what you’re looking for you won’t know what to filter.
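
One way to keep a long-running capture from filling the disk, swapping in tcpdump for Wireshark, is a ring buffer: the capture rotates through a fixed number of fixed-size files and overwrites the oldest. This is only a sketch and not something from the thread; `capture_ring`, the 50 MB file size, and the 10-file limit are illustrative assumptions, and capturing usually needs root.

```shell
# Start a ring-buffer capture on an interface.
# Usage: capture_ring IFACE FILE_PREFIX [SIZE_MB] [FILES]
capture_ring() {
    local iface=$1 prefix=$2 size_mb=${3:-50} files=${4:-10}
    # -C: rotate to a new file after SIZE_MB million bytes
    # -W: keep at most FILES files, then overwrite from the first one
    # -w: write raw packets to numbered files starting with FILE_PREFIX
    tcpdump -i "$iface" -C "$size_mb" -W "$files" -w "$prefix"
}
```

With the defaults, something like `capture_ring wlp6s0 /tmp/drop.pcap` would never hold more than roughly 500 MB. The files written around a logged dropout time can then be opened in Wireshark.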

I’d try to identify a pattern as to WHEN the issue occurs first (log with ping) at various points in the network. Then, once you’re reasonably confident you know which segment is broken, try to get a Wireshark capture set up just before you expect it to happen, on the segment you expect it to occur on.

Wireshark won’t necessarily help you with interference, but if there’s some protocol error or framing error you might see it. It won’t tell you WHY that is happening, though, and you already know the link is dropping…

At the end of the day, though, I suspect it is a case of identifying the segment serviced by broken hardware X and then junking that hardware.

OR

Identify source of interference and isolate that somehow.


#42

Finally got around to working on this problem again; sorry for the delay.

pinger script
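#!/bin/bash
# Ping a fixed set of hosts every 5 seconds with fping and append the
# results, prefixed with UNIX timestamps, to a log file.

SERVERS="192.168.2.1 192.168.2.2 192.168.2.8"
OUTPUT_FILE=pinger.log
LFLAG=false

show_help()
{
    echo "
***************************************************************************
pinger usage:
-l [LOCATION]                   LOCATION string to print in log header
***************************************************************************
"
}

# Rewrite fping's "[seconds.fraction]" prefixes in the log as readable dates.
# Relies on GNU sed: the "e" flag runs each substituted line as a shell command.
convert_datetime()
{
    sed -i 's/^\[\([0-9]*\.[0-9]*\)\]\(.*\)/ \
    echo "[$(date -d @\1 +"%Y-%m-%d %H:%M:%S")] \2"/e' "$OUTPUT_FILE"
}

while getopts "h?l:" opt; do
    case "$opt" in
    h|\?)
        show_help
        exit 0
        ;;
    l)
        LOCATION=$OPTARG
        LFLAG=true
        ;;
    esac
done

shift $((OPTIND-1))

if ! $LFLAG
then
    echo "Error: -l option must be included."
    show_help
    exit 1
fi

# Convert the timestamps when the script is interrupted or terminated.
trap convert_datetime SIGHUP SIGINT SIGQUIT SIGABRT SIGTERM

echo "
***************************************************************************
Date: $(date)
Location: $LOCATION
***************************************************************************
" >> "$OUTPUT_FILE"

# -D: timestamp each line, -u: show unreachable hosts, -l: loop forever;
# --period is in milliseconds. $SERVERS is left unquoted so it splits into hosts.
# tee -i ignores interrupts so the last lines still reach the log.
fping -Dul --period=5000 $SERVERS 2>&1 | tee -a -i "$OUTPUT_FILE"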

Here’s a script that pings the router, AP, and my machine every 5 seconds with fping, printing the UNIX timestamp, and appending to a log file. When interrupted and before exiting, the script converts the UNIX time to a regular human-readable format. I would have liked to do that on the fly, but such attempts behaved strangely.
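
For the on-the-fly conversion, one possible approach is a small line-by-line filter between fping and tee. This is a sketch under assumptions: `stamp_lines` is a hypothetical helper, it relies on GNU date (`date -d @SECONDS`, as the script above already does), and it has not been tried against the behavior described here.

```shell
# Read lines on stdin; where a line starts with "[seconds.fraction]",
# replace that prefix with a readable local date. Other lines pass through.
stamp_lines() {
    local line ts rest
    while IFS= read -r line; do
        case $line in
            \[[0-9]*\]*)
                ts=${line#\[}        # drop the leading "["
                ts=${ts%%\]*}        # keep only the text before the first "]"
                rest=${line#*\]}     # everything after the first "]"
                printf '[%s]%s\n' \
                    "$(date -d "@${ts%.*}" '+%Y-%m-%d %H:%M:%S')" "$rest"
                ;;
            *)
                printf '%s\n' "$line"
                ;;
        esac
    done
}
```

It would slot into the last line of the script as `fping -Dul --period=5000 $SERVERS 2>&1 | stamp_lines | tee -a -i "$OUTPUT_FILE"`. If output then arrives in bursts rather than line by line, buffering somewhere in the pipeline is one possible cause; running fping under `stdbuf -oL` (GNU coreutils) is one way to test that guess.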

I’ve deployed this on a netbook wired to the AP for now. I can also put it at the router later, but I already have the VOIP ATA there which didn’t lose internet during issues.

Are we sure that no traffic is running over the link? How can we check that in fact no traffic is getting through?

Also, I was wondering if the issue might be due to power saving on the wireless adapter (RTL8822BE), so for now I’ve turned off power management with wifi.powersave = 2 in a file at /etc/NetworkManager/conf.d/ and have restarted NetworkManager.
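
For reference, a sketch of what such a drop-in file can look like. `write_powersave_conf` and the file name `wifi-powersave-off.conf` are hypothetical; the setting goes under a `[connection]` section, and the value 2 means "disable" in NetworkManager's wifi.powersave scheme.

```shell
# Write a NetworkManager drop-in that disables Wi-Fi power saving.
# $1: target directory (defaults to NetworkManager's conf.d directory).
write_powersave_conf() {
    local conf_dir=${1:-/etc/NetworkManager/conf.d}
    cat > "$conf_dir/wifi-powersave-off.conf" <<'EOF'
[connection]
wifi.powersave = 2
EOF
}
```

After restarting NetworkManager and reconnecting, `iw dev wlp6s0 get power_save` should report whether the driver actually has power saving off (substitute the real interface name).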


#43

I got “lucky” with another network interruption on my machine soon after deploying the ping script on a netbook wired to the AP.

  • script pinging my machine failed with “ICMP Host Unreachable from [NETBOOK IP] for ICMP Echo sent to [MY MACHINE’S IP]”
  • script pinging AP and router (over powerline) did not fail
  • interruption lasted ~50 minutes
  • interruption ended without my intervention within seconds of the following log entry on the AP: "daemon.info hostapd: ath0: STA [MAC ADDRESS] WPA: group key handshake completed (RSN) "
  • my machine’s uptime was not interrupted on the AP’s wireless client list

What conclusions can be drawn from this?

A few other observations from the AP’s logs in case they’re relevant:

  • There are entries for "pairwise key handshake completed (RSN) " from various devices, aside from “group key handshake completed (RSN)”. Is that normal?
  • There are entries for “RADIUS: starting accounting session xxxxxxxxxxxxxxxx”. I didn’t set up a RADIUS server, not entirely sure what it is, or whether it’s relevant to my setup. Are those entries normal?

From searching, this github issue seems similar, though it’s for OpenWRT/LEDE and a different chipset. Unfortunately no solution.

Perhaps I will update ddwrt on the AP in case that helps. It’s currently on v3.0-r33555 std (10/20/17).


#44

And it happened again.

Similar to the above points, the pings to my machine failed ~2-3 minutes after a group key handshake. I don’t know if that’s just coincidence though.

This time I didn’t have the patience to wait for it to reconnect, so I restarted the NetworkManager service after 10 minutes.

To add to previous observations, during interruptions iwconfig still reports connection to my network ESSID, Access Point [MAC ADDRESS], and fluctuating Link Quality and Signal level stats.


#45

So WPA2…

Each host will have two keys: one is used for unicast traffic, the other for broadcast traffic (group). The access point initiates the group key exchanges, usually on a timer. Typical values for the timer are 5min (Mikrotik), 30min (OpenWRT), 60min (ddwrt), either never or 2h (TP-Link).
The group key needs to rotate every once in a while to prevent disconnected stations from sniffing or injecting broadcast traffic long after the initial connection. Typically you need broadcast to work unless you want a static IP and static ARP on every host in the LAN.

The encryption/decryption is hardware accelerated, keys are stored and maintained in memory of the wifi nic as (mac address, key) pairs. wpa_supplicant/hostapd (on station/ap - 2 utilities sharing the codebase) will maintain the entries via a compatible driver.

I’d say in this case it’s likely your wpa_supplicant or hostapd is having issues communicating with the driver.


On the pairwise key handshakes: they shouldn’t happen that frequently for each device while connected; keys should be exchanged on reconnection, after enough traffic, or whenever a client asks. It’s OK to see them all the time for the same client if it’s roaming between APs.

On the RADIUS entries: that depends on the hostapd.conf the ddwrt scripts end up generating; I wouldn’t worry too much.


iwconfig is the old utility using the old API (wireless extensions), and it’s needed with old drivers. iw is the currently maintained utility for mac80211-compatible wifi drivers; iwlist belongs to the same older wireless-tools package as iwconfig. Generally you need iw, wpa_supplicant, drivers and config files for wifi to work.

You can try setting the group key rotation interval to never or maxint or some ridiculously high value, if keys never change once established the broadcast traffic should work forever. It wouldn’t fix your drivers but might mitigate your issues.
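
In hostapd's own configuration the group key rotation timer is the `wpa_group_rekey` option, in seconds. A small sketch for checking what a generated config actually contains; `group_rekey_interval` is a hypothetical helper, and where dd-wrt writes its generated hostapd config is an unknown here, so the path is left to the caller:

```shell
# Print the group key rekey interval (in seconds) set in a
# hostapd.conf-style file, or a note if the option is absent.
# Usage: group_rekey_interval /path/to/hostapd.conf
group_rekey_interval() {
    awk -F= '
        $1 == "wpa_group_rekey" { print $2; found = 1 }
        END { if (!found) print "not set (hostapd default applies)" }
    ' "$1"
}
```

Comparing the printed value with the spacing of the "group key handshake completed" entries in the AP log is one way to confirm that a changed setting has actually taken effect.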


#46

Would that correspond to “Key renewal interval” (in Wireless -> Wireless Security)? Default: 3600, Range: 1 - 99999. I’ve just upgraded to the latest (r36808) ddwrt. If the problem happens again, I can try changing this next.

Would that mean the problem and solution are in the AP firmware and client kernel drivers? Client is current in updates on Fedora 28 ( 4.17.19-200.fc28.x86_64, wpa_supplicant v2.6). It would be such a relief if a firmware update solved this…

Ok, that’s possible. Some devices are cellphones that go back and forth between the AP and the router’s wireless as people walk through the house.

Long ago I learned to use ifconfig/iwconfig for quick network info and stuck with it. I like how quick and compact the output is. But I’m not using them for setting configuration, just displaying. Would you suggest using eg. iwlist wlp6s0 scan instead?


#47

I’ve been logging pings for ~3 months now.

The pinging script is running on a netbook connected by ethernet to the AP, and pinging the AP, router, client 1, and client 0 (my computer).

The AP (basement), client 1 (first floor), and client 0 (second floor) are nearly vertically lined up with each other.

Client 0 can be connected to the network by wifi, or by powerline adapter (ethernet over powerline). The backhaul between the AP and router is with powerline adapters.

Observations:

  • average pings are generally in single or sometimes double digits
  • max ping over wifi for client 0 typically reaches ~2000 ms, even 5000 ms
  • max ping over wifi for client 1 is often over ~1000 ms
  • max ping over powerline to the router can also reach ~1000 ms
  • max ping over powerline for client 0 is on the order of several hundred ms, much better than on wifi
  • max throughput between client 0 and the netbook, measured with iperf3 is:
    • over powerline: a consistent ~20 Mbps with 0 retries
    • over wifi: usually ~60 Mbps or as low as 10 Mbps, with dozens/hundreds of retries in most cases
  • internet speedtest with client 0 and a nominal 15/10 Mbps cable internet connection, measured with speedtest-cli:
    • over powerline: consistent 16.5/11 Mbps download/upload
    • over wifi: usually 14.5/8 Mbps dl/ul, dropping to 8-10 Mbps download when iperf3 shows lower (10-20 Mbps) network throughput

When the network issues occur with client 0 over wifi:

  • other clients such as client 1 don’t seem to be affected
  • resolving the network issue requires either waiting ~5 minutes or disconnecting/reconnecting with network manager

When network issues occur with client 0 over powerline, the following happens though it’s not clear if it’s always the case:

  • client 0 fails to ping the router or any device connected to the router (eg. ATA) (consistent)
  • however, client 0’s tmux session to the netbook continues to work, and can ping the router this way
  • client 0 succeeds in pinging the AP and client 1
  • the netbook continues to succeed pinging the router during this time

So we can conclude this is not (exclusively) a wifi issue since it still happens when there is no wifi segment between client 0 and the router.

The network issues do happen more frequently when using wifi. I’m not sure if it’s coincidence but it seemed like the max ping is higher when both client 0 and client 1 use wifi. Maybe interference since client 1 is directly in between?

I’m not sure how to think of powerline connections, and whether it’s a point-to-point connection or some kind of mesh network. There are 4 powerline adapters connecting the router, AP, client 0, and a PS3.

So, what could cause connectivity to drop between client 0 and the router, while not affecting other devices?


#48

At this rate I would go and get some ethernet cable, bypass the powerline adapters altogether, and see if that brings those ping times within reason.

Also adding more suspicion to the powerline adapters:
http://www.dslreports.com/forum/r30892334-powerline-adapter-network-issues

I would check whether your router has the Spanning Tree Protocol (STP) feature and, if so, enable it.

Maybe that will help. I suspect the adapters are the problem.


#49

I’m using dd-wrt on both the router and AP, and there are options for STP. Should it be used on both or just one? Is this the correct setting?

There’s also STP in Basic/WAN settings.


I don’t think the ping times are bad, but the max and lost packets might be an issue. Here are some stats from the netbook connected by ethernet to the AP during a quiet time recently, which includes a short network interruption for my computer (client 0):

[router] : xmt/rcv/%loss = 60268/59900/0%, min/avg/max = 1.17/2.94/141
[AP] : xmt/rcv/%loss = 60268/60266/0%, min/avg/max = 0.19/0.34/7.70
[client 0 powerline] : xmt/rcv/%loss = 60268/59660/1%, min/avg/max = 1.27/2.98/224
[client 1 wifi] : xmt/rcv/%loss = 60268/60266/0%, min/avg/max = 0.64/3.01/627

Often the maxes are higher, but it’s typical for xmt and rcv not to be equal. Because so many packets are sent over a long run, the %loss is usually reported as 0.
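
Since the %loss column is shown as a whole number, the actual loss can be recomputed from the xmt and rcv counts. A sketch using the four summary lines above; `exact_loss` is a hypothetical helper, and the bracketed labels are the ones used in this post rather than real host names:

```shell
stats='[router] : xmt/rcv/%loss = 60268/59900/0%, min/avg/max = 1.17/2.94/141
[AP] : xmt/rcv/%loss = 60268/60266/0%, min/avg/max = 0.19/0.34/7.70
[client 0 powerline] : xmt/rcv/%loss = 60268/59660/1%, min/avg/max = 1.27/2.98/224
[client 1 wifi] : xmt/rcv/%loss = 60268/60266/0%, min/avg/max = 0.64/3.01/627'

# For each fping summary line, print the number of lost packets and the
# loss percentage to three decimal places.
exact_loss() {
    awk -F' : ' '{
        split($2, parts, " = ")   # parts[2] looks like "60268/59900/0%, min/avg/max"
        split(parts[2], n, "/")   # n[1] = packets sent, n[2] = packets received
        lost = n[1] - n[2]
        printf "%s lost=%d loss=%.3f%%\n", $1, lost, 100 * lost / n[1]
    }'
}

printf '%s\n' "$stats" | exact_loss
```

This gives 368 lost packets (0.611%) to the router, 608 (1.009%) to client 0 over powerline, and 2 (0.003%) each to the AP and client 1. So the router's "0%" still hides a few hundred lost pings, at 5 seconds apiece.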


Maybe I could test with an ethernet backhaul between the router and AP for a while, but it’s very inconvenient to pass wire to literally the opposite corners of the house. A permanent wired solution would be even harder.


#50

Yeah, I realize the cable will be a pain, but bypassing the powerline adapters may be the key to solving this, so it may be worth the inconvenience.

As for STP, it should be on the router, as that should be the main device controlling the network.

But in this case, who knows anymore; enabling STP on the access point may work better.