
Comprehensive debugging of network issues


#27

Sorry for the diffuse line of “run debugging”. What I mean is that you collect data in the form of pings by IP and by DNS name, and check your routes (can you reach the specific machines?). However, you stated that you already have machines with static addresses…

Verify the following:

Does your static address space collide with the DHCP range?

Have you seen any logs indicating that one machine has multiple leases?

Have you run out of leases?

With the above checked, do you believe the problem could be related to the lease time?

How many Windows 10 machines are there in total? Are any of them set to “hide their MAC” (random hardware addresses)? That can screw up address leases, fill the DHCP server with requests, and sometimes leave leases unreleased…

The last point applies if you are using equipment that decides to change its MAC address from time to time.

What does the ARP cache on your router look like? It can hint at whether the above is true.
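
A rough sketch of one way to read the ARP cache for this, assuming shell access to a Linux-based router where `ip neigh show` is available (an assumption; some firmware only shows this table in the web UI). The helper name `macs_with_multiple_ips` is made up for illustration. It lists any MAC address that currently maps to more than one IP, which would suggest one machine holding several leases.

```shell
#!/bin/sh
# Hypothetical helper: read `ip neigh show` style lines on stdin and print
# every MAC address that appears with more than one IP address.
macs_with_multiple_ips() {
    awk '
        {
            mac = ""
            # The MAC is the field that follows the keyword "lladdr".
            for (i = 1; i < NF; i++) if ($i == "lladdr") mac = $(i + 1)
            if (mac == "") next          # FAILED/INCOMPLETE entries carry no MAC
            if (!seen[mac, $1]++) { count[mac]++; ips[mac] = ips[mac] " " $1 }
        }
        END { for (m in count) if (count[m] > 1) print m ":" ips[m] }
    '
}
```

On the router it could be run as `ip neigh show | macs_with_multiple_ips`; no output means no MAC was seen with two addresses at that moment.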


#28

They are seeing disruptions right now due to the drop outs that haven’t been fixed for over a month. :)

Disruptive testing can be scheduled, unlike random drop outs.

Sometimes to bake a cake you gotta break some eggs.


#29

The drop happened yet again. I had swapped the AP for a spare unit (wifi not as good) just a few hours earlier.


@oO.o My internet is only 15/10 Mbps, so it should be nowhere near the point of overloading the router, QOS or not.

What should I be looking for with iperf?

Adaptive ping. Interpacket interval adapts to round-trip time, so that effectively not more than one (or more, if preload is set) unanswered probes present in the network. Minimal interval is 200msec for not super-user. On networks with low rtt this mode is essentially equivalent to flood mode.

Sorry, I’m not too familiar with networks. What is the reason for using ping -A ?

“Luckily” the latest drop happened while I had wireshark running. Does anything catch the eye?


#30

DHCP is from 192.168.2.100 and up. Static IPs are 192.168.2.2 through 192.168.2.9.

No.

I don’t think so.

Lease time with DHCP is 1440 minutes. The machine I use on which I notice the drops has a static IP.

Just one windows 10 machine, and it’s rarely ever on. People who buy new computers but use their phone 99% of the time… But in any case there are typically at most about a dozen IP leases on the router status page, usually half that.

What is this and where can I check it?


#31

Actually the issue has been around a few years… but you’re preaching to the choir. I go to great lengths to fix, optimize, etc.

Most people, including techies I know who said they’ve experienced similar issues, just want to minimize effort, and really won’t tolerate having devices disconnected or slowed down. Considering the disruption from the issue can be maybe just a minute every week or so, and easily worked around by reconnecting… the economics of this chart indicate I’ve already spent too much time:


#32

I think you are at the point where you need to replace the wireless NIC on the problem PC.

If I’m not mistaken, you have already:
changed out the router, problem persisted;
changed out the access point, problem persisted;
and it’s just this one machine that keeps dropping its connection.

So next is the NIC… hopefully that is the fix.


#33

It’s not just one machine. My previous machine at the same location experienced the same thing with a different NIC. It seems to have happened with at least one other machine in a different location, but I don’t get good debugging info from other users, often only long after the fact.

It happened with:
-router: Engenius ESR-9850; AP: TP-Link WR841N
-router: Netgear R7000; AP: TP-link WR841N
-router: Netgear R7000; AP: Engenius ESR-9850


#34

What is the brand and model of your power line adaptors?
From what I’ve read:
Engenius ESR9850 - it’s a product that’s been around since 2010
TP-Link WR841N - it’s a product that’s been around since 2014
Netgear R7000 - looks like it came out 2016/2017 but it’s had issues

I read a few forum posts and angry reviews from people who bought a TP-Link WR841N and described issues similar to the ones you are experiencing. Same with the Netgear R7000, but those were apparently resolved with newer firmware versions.
If those powerline adapters are TP-Link, I’ve got bad news for you: random internet dropouts are a known thing that firmware updates might help with, but as a matter of best practice I’d avoid using them if possible. Running a Cat6 cable and keeping it well away from existing power cabling would be best.


#35

The powerline adapters are TP-Link TL-PA511KIT and yes there are known issues with them having connection drops due to power savings issues.

I haven’t been able to confirm that my issue is due to the powerline adapters, although I’ve had my suspicions, since in my case when one device has the drop, others are still able to connect via a powerline route. Disconnecting/reconnecting the affected device from wifi seems to “fix” the drop and unless that’s a coincidence of timing I don’t see how that would work if the powerline adapters were the problem.

Also, the WR841N and R7000 are running ddwrt. The ESR9850 unfortunately has no ddwrt support, so it’s on the latest stock firmware and the R7000 was purchased to retire it.


#36

Depends if that 1 minute per week is in the middle of something important now, doesn’t it?


#37

Also depends on whether they complain about it. It seems like I’m the only one losing hair over this.


Not to bore you all, but it happened again. Does the ping output have any interesting clues? I started pinging when I noticed no internet.

$ ping 192.168.2.1
PING 192.168.2.1 (192.168.2.1) 56(84) bytes of data.
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
64 bytes from 192.168.2.1: icmp_seq=1 ttl=64 time=118531 ms
64 bytes from 192.168.2.1: icmp_seq=2 ttl=64 time=117489 ms
64 bytes from 192.168.2.1: icmp_seq=3 ttl=64 time=116465 ms
64 bytes from 192.168.2.1: icmp_seq=5 ttl=64 time=114417 ms
64 bytes from 192.168.2.1: icmp_seq=6 ttl=64 time=113394 ms
64 bytes from 192.168.2.1: icmp_seq=8 ttl=64 time=111346 ms
64 bytes from 192.168.2.1: icmp_seq=9 ttl=64 time=110322 ms
64 bytes from 192.168.2.1: icmp_seq=10 ttl=64 time=109298 ms
64 bytes from 192.168.2.1: icmp_seq=12 ttl=64 time=107250 ms
64 bytes from 192.168.2.1: icmp_seq=13 ttl=64 time=106226 ms
64 bytes from 192.168.2.1: icmp_seq=18 ttl=64 time=101106 ms
64 bytes from 192.168.2.1: icmp_seq=19 ttl=64 time=100082 ms
64 bytes from 192.168.2.1: icmp_seq=20 ttl=64 time=99058 ms
64 bytes from 192.168.2.1: icmp_seq=22 ttl=64 time=97014 ms
64 bytes from 192.168.2.1: icmp_seq=23 ttl=64 time=95990 ms
64 bytes from 192.168.2.1: icmp_seq=24 ttl=64 time=94966 ms
64 bytes from 192.168.2.1: icmp_seq=25 ttl=64 time=93942 ms
64 bytes from 192.168.2.1: icmp_seq=26 ttl=64 time=92918 ms
64 bytes from 192.168.2.1: icmp_seq=27 ttl=64 time=91894 ms
64 bytes from 192.168.2.1: icmp_seq=28 ttl=64 time=90870 ms
64 bytes from 192.168.2.1: icmp_seq=29 ttl=64 time=89846 ms
64 bytes from 192.168.2.1: icmp_seq=30 ttl=64 time=88822 ms
64 bytes from 192.168.2.1: icmp_seq=31 ttl=64 time=87798 ms
64 bytes from 192.168.2.1: icmp_seq=32 ttl=64 time=86774 ms
64 bytes from 192.168.2.1: icmp_seq=33 ttl=64 time=85750 ms
64 bytes from 192.168.2.1: icmp_seq=34 ttl=64 time=84726 ms
64 bytes from 192.168.2.1: icmp_seq=35 ttl=64 time=83702 ms
64 bytes from 192.168.2.1: icmp_seq=36 ttl=64 time=82678 ms
64 bytes from 192.168.2.1: icmp_seq=37 ttl=64 time=81654 ms
64 bytes from 192.168.2.1: icmp_seq=38 ttl=64 time=80630 ms
64 bytes from 192.168.2.1: icmp_seq=39 ttl=64 time=79606 ms
64 bytes from 192.168.2.1: icmp_seq=40 ttl=64 time=78582 ms
64 bytes from 192.168.2.1: icmp_seq=41 ttl=64 time=77558 ms
64 bytes from 192.168.2.1: icmp_seq=42 ttl=64 time=76534 ms
64 bytes from 192.168.2.1: icmp_seq=43 ttl=64 time=75510 ms
64 bytes from 192.168.2.1: icmp_seq=44 ttl=64 time=74486 ms
64 bytes from 192.168.2.1: icmp_seq=45 ttl=64 time=73462 ms
64 bytes from 192.168.2.1: icmp_seq=46 ttl=64 time=72439 ms
64 bytes from 192.168.2.1: icmp_seq=47 ttl=64 time=71415 ms
64 bytes from 192.168.2.1: icmp_seq=48 ttl=64 time=70391 ms
64 bytes from 192.168.2.1: icmp_seq=49 ttl=64 time=69367 ms
64 bytes from 192.168.2.1: icmp_seq=51 ttl=64 time=67319 ms
64 bytes from 192.168.2.1: icmp_seq=52 ttl=64 time=66295 ms
64 bytes from 192.168.2.1: icmp_seq=53 ttl=64 time=65271 ms
64 bytes from 192.168.2.1: icmp_seq=54 ttl=64 time=64247 ms
64 bytes from 192.168.2.1: icmp_seq=55 ttl=64 time=63223 ms
64 bytes from 192.168.2.1: icmp_seq=56 ttl=64 time=62198 ms
64 bytes from 192.168.2.1: icmp_seq=57 ttl=64 time=61175 ms
64 bytes from 192.168.2.1: icmp_seq=58 ttl=64 time=60151 ms
64 bytes from 192.168.2.1: icmp_seq=59 ttl=64 time=59127 ms
64 bytes from 192.168.2.1: icmp_seq=99 ttl=64 time=6.29 ms
64 bytes from 192.168.2.1: icmp_seq=100 ttl=64 time=2.99 ms
64 bytes from 192.168.2.1: icmp_seq=101 ttl=64 time=3.05 ms
64 bytes from 192.168.2.1: icmp_seq=102 ttl=64 time=2.66 ms
64 bytes from 192.168.2.1: icmp_seq=103 ttl=64 time=2.85 ms
64 bytes from 192.168.2.1: icmp_seq=104 ttl=64 time=2.94 ms
^C
--- 192.168.2.1 ping statistics ---
104 packets transmitted, 56 received, 46% packet loss, time 124361ms
rtt min/avg/max/mdev = 2.660/77202.876/118531.793/31247.224 ms, pipe 98

Observations:

  • Many lines during the outage have “No buffer space available” messages, but not all.
  • Only 46% packet loss, but I stopped pinging almost as soon as internet connectivity returned.
  • Some lines show a “response” but it is delayed by nearly 2 minutes? Delayed, not lost packets?
  • “pipe 98” in ping output? What does that mean?
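
On the “delayed, not lost” question, a quick arithmetic check on the output above may help. This is only a sketch: it assumes probes were sent at a fixed spacing (1000 ms is ping's default interval; the RTT differences in the log suggest about 1024 ms in practice), and the function name `reply_arrival_ms` is made up. It adds each probe's approximate send offset to its reported RTT. If the results come out nearly equal, the late replies all arrived at about the same moment, which would be consistent with the requests being queued somewhere and released in one burst.

```shell
#!/bin/sh
# Hypothetical helper: read ping reply lines on stdin and print, per reply,
# (icmp_seq * send interval) + reported RTT, i.e. roughly when it arrived.
# $1 = assumed send interval in ms (defaults to ping's 1000 ms default).
reply_arrival_ms() {
    interval_ms=${1:-1000}
    awk -v step="$interval_ms" '
        /icmp_seq=/ && /time=/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^icmp_seq=/) seq = substr($i, 10)   # text after "icmp_seq="
                if ($i ~ /^time=/)     rtt = substr($i, 6)    # text after "time="
            }
            printf "seq=%d arrived_at_ms=%d\n", seq, seq * step + rtt
        }
    '
}
```

With the saved output in a file, `reply_arrival_ms 1024 < ping.txt` puts seq 1 through 59 within a few tens of milliseconds of each other, about 119.5 seconds after the first probe was sent.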

#38

You could try getting a Mikrotik hAP ac² or a Ubiquiti AP ac lite or a Mikrotik SXTsq Lite5 .

If it doesn’t solve your problem, at least you’ll have a capable router and dual-band AP, or a decent, well-known, popular dual-band AP. You’ll maybe be able to split clients across different channels for debugging, and maybe you’ll be able to get rid of powerline with the SXT (its 16 dBi gain is enough to erase some walls).

At that point you’ll know for sure it’s a client issue.

Just to double check, you’ve tried various 2.4GHz and 5GHz channels as well, it’s not analog interference?


#39

No buffer space available generally means the network adapter lost its IP or carrier. As above, maybe it’s your machine.

Really you need to start shuffling network gear around (or plugging a laptop into different network segments to confirm which parts are affected during the issue) or isolating it in order to troubleshoot.

Until you do that, you’re simply wasting your time pissing into the wind.

Spending too much time on solving a problem may be bad, but even worse is wasting time doing things that have no hope of diagnosing the problem.

If you’re not willing to do what is required to diagnose it, then just accept that it is going to happen and live with it.

edit:
Have you at least been logging the date/time of these drop outs? If not, I cannot stress this enough: DO THIS.

You may have no idea of any pattern in these drops that may exist, until you have say a few days or maybe even a week (or whatever) of date/time logs for the problem.

If it’s happening every 2 hours or 5 minutes or whatever, then you can examine your network/environment for events that coincide with those time intervals.

As an example, on a mine site I diagnosed a problem with wifi that ended up having nothing to do with equipment. It was due to physical obstructions between the two p2p wireless devices: a truck loading area was stacking sea containers 3 high from time to time and blocking line of sight between the two devices…

Without accurate time/date log of the issue that would have been much more difficult to diagnose as a possibility from remote and then confirm with a site visit…


#40

Thanks for the recommendations. I’ll keep them in mind, but I hope the problem can be solved without buying more stuff.

Only 2.4 GHz is being used and the channels are fixed:

  • AP: channel = 1; channel width = 40 MHz, extension channel = upper
  • Router: channel = 11; channel width = 20 MHz

I could try changing channels (suggestions?). Years ago those channels had seemed to work best, but things might have changed.

I have a spare netbook I could leave plugged in at different points in the network. Do you have specific testing setup suggestions?

I could leave it pinging various devices, using some tips to also print timestamps. There’s also Wireshark but I don’t know how to interpret the output, and leaving it running a long time seems to fill up a lot of disk space, but maybe that’s because I was using the system at the same time.


#41

At the simplest level, even just leaving a script running a datestamp and say 5 pings (then time/datestamp) logging to a text file would be a start. Do this a few times at various points in the network (i.e., plugged directly into router, plugged into powerline adapter and then on wifi or whatever to get a few failures).
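
A minimal sketch of that idea, assuming an iputils-style `ping` that accepts `-c` and exits non-zero when no reply comes back. The function name `probe_once` and the `PING_CMD` override (only there so the function can be tried without a network) are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: probe_once HOST LOGFILE
# Appends a start datestamp, the output of 5 pings, and an end datestamp
# with an ok/FAIL verdict to LOGFILE.
probe_once() {
    host=$1
    log=$2
    {
        printf 'start %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$host"
        # PING_CMD is an illustrative override; it defaults to the real ping.
        if ${PING_CMD:-ping} -c 5 "$host" 2>&1; then
            result=ok
        else
            result=FAIL
        fi
        printf 'end   %s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$host" "$result"
    } >> "$log"
}
```

Left running as something like `while true; do probe_once 192.168.2.1 ping.log; sleep 60; done`, the log can later be searched for `FAIL` to see when the drops cluster.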

Wireshark is useful for debugging issues, but I suspect not in this case; Wireshark will tell you what traffic is running over a link, but your issue here is that NO traffic is running over the link, and it looks like that is because the link layer is going down somewhere.

A wireshark dump certainly won’t hurt, but the issue is that you’re probably wanting to grab the frames from just before the issue occurs, and if you don’t know when it will occur, leaving wireshark running for long periods of time will result in absolutely MASSIVE traffic captures.

You can set up filters in wireshark to only capture certain frames, etc. but if you don’t know what you’re looking for you won’t know what to filter.
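
One way around the size problem, sketched here under the assumption that `dumpcap` (the capture tool shipped with Wireshark) is installed and the user has capture privileges, is a ring buffer: the capture rotates through a fixed number of files so disk usage stays bounded. The function name, interface, path and sizes below are illustrative values only.

```shell
#!/bin/sh
# Hypothetical helper: start_ring_capture IFACE PREFIX
# Captures on IFACE into a bounded set of files derived from PREFIX.
start_ring_capture() {
    iface=$1
    prefix=$2
    # -b filesize:10240  switch to a new file after about 10 MB (value is in kB)
    # -b files:20        keep at most 20 files, overwriting the oldest
    # A capture filter could be added with -f once it is known what to look for.
    dumpcap -i "$iface" \
        -b filesize:10240 -b files:20 \
        -w "$prefix.pcapng"
}
```

For example, `start_ring_capture wlp6s0 /tmp/drop` would keep roughly the last 200 MB of traffic; after a drop, only the newest few files need to be opened in Wireshark.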

I’d try to identify a pattern as to WHEN the issue occurs first (log with ping) at various points in the network, and then once you’re reasonably confident you know which segment is broken, try to get a Wireshark capture set up just before you expect it to happen, on the segment you expect it to occur on.

Wireshark won’t help you with interference necessarily but if there’s some protocol error or framing error you might see it. But it won’t tell you WHY that is happening - and you already know the link is dropping…

At the end of the day though, I suspect it is a case of identifying the segment serviced by broken hardware X and then junking that hardware.

OR

Identify source of interference and isolate that somehow.


#42

Finally got around to working on this problem again; sorry for the delay.

pinger script
#!/bin/bash

SERVERS="192.168.2.1 192.168.2.2 192.168.2.8"
OUTPUT_FILE=pinger.log
LFLAG=false

show_help()
{
  echo "
***************************************************************************
pinger usage:
-l [LOCATION]                   LOCATION string to print in log header
***************************************************************************
"
}

convert_datetime()
{
  sed -i 's/^\[\([0-9]*\.[0-9]*\)\]\(.*\)/ \
    echo "[$(date -d @\1 +"%Y-%m-%d %H:%M:%S")] \2"/e' $OUTPUT_FILE
}

while getopts "h?l:" opt; do
    case "$opt" in
    h|\?)
        show_help
        exit 0
        ;;
    l)
        LOCATION=$OPTARG
        LFLAG=true
        ;;
    esac
done

shift $((OPTIND-1))

if ! $LFLAG
then
        echo "Error: -l option must be included."
        show_help
        exit 1
fi

trap convert_datetime SIGHUP SIGINT SIGQUIT SIGABRT SIGTERM

echo "
***************************************************************************
Date: $(date)
Location: $LOCATION
***************************************************************************
" >> $OUTPUT_FILE

fping -Dul --period=5000 $SERVERS 2>&1 | tee -a -i $OUTPUT_FILE

Here’s a script that pings the router, AP, and my machine every 5 seconds with fping, printing the UNIX datestamp, and appending to a log file. When interrupted and before exiting, the script converts the UNIX time to a regular human-readable format. I would have liked to do that on the fly, but such attempts behaved strangely.
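
For converting on the fly, one possible approach is a plain read loop placed between fping and tee, sketched below. It assumes GNU `date` (for `-d @EPOCH`) and lines of the form `[EPOCH.FRACTION] rest`, as fping's `-D` option prints; the function name `stamp_lines` is made up. A shell loop writes each line as soon as it is read, whereas a piped `sed` may buffer its output, which could be one explanation for the strange behaviour (that is a guess, not something confirmed here).

```shell
#!/bin/sh
# Hypothetical helper: copy stdin to stdout, rewriting a leading
# "[EPOCH.FRACTION]" into "[YYYY-MM-DD HH:MM:SS]" line by line.
stamp_lines() {
    while IFS= read -r line; do
        case "$line" in
            \[[0-9]*.[0-9]*\]*)
                ts=${line#\[}        # drop the leading "["
                epoch=${ts%%.*}      # keep the whole seconds only
                rest=${line#*\]}     # everything after the first "]"
                case "$epoch" in
                    *[!0-9]*) printf '%s\n' "$line" ;;   # not a plain number: pass through
                    *) printf '[%s]%s\n' \
                           "$(date -d "@$epoch" '+%Y-%m-%d %H:%M:%S')" "$rest" ;;
                esac
                ;;
            *) printf '%s\n' "$line" ;;                   # headers and other lines unchanged
        esac
    done
}
```

In the script it would sit in the pipeline as `fping -Dul --period=5000 $SERVERS 2>&1 | stamp_lines | tee -a -i $OUTPUT_FILE`. If lines still arrive in bursts, the buffering is probably on the producer side, and running fping under `stdbuf -oL` might help.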

I’ve deployed this on a netbook wired to the AP for now. I can also put it at the router later, but I already have the VOIP ATA there which didn’t lose internet during issues.

Are we sure of this? How can we check that in fact no traffic is working?

Also, I was wondering if the issue might be due to powersavings on the wireless adapter (RTL8822BE), so for now I’ve turned off power management with wifi.powersave = 2 in a file at /etc/NetworkManager/conf.d/ and have restarted NetworkManager.


#43

I got “lucky” with another network interruption on my machine soon after deploying the ping script on a netbook wired to the AP.

  • script pinging my machine failed with “ICMP Host Unreachable from [NETBOOK IP] for ICMP Echo sent to [MY MACHINE’S IP]”
  • script pinging AP and router (over powerline) did not fail
  • interruption lasted ~50 minutes
  • interruption ended without my intervention within seconds of the following log entry on the AP: "daemon.info hostapd: ath0: STA [MAC ADDRESS] WPA: group key handshake completed (RSN) "
  • my machine’s uptime was not interrupted on the AP’s wireless client list

What conclusions can be drawn from this?

A few other observations from the AP’s logs in case they’re relevant:

  • There are entries for "pairwise key handshake completed (RSN) " from various devices, aside from “group key handshake completed (RSN)”. Is that normal?
  • There are entries for “RADIUS: starting accounting session xxxxxxxxxxxxxxxx”. I didn’t set up a RADIUS server, not entirely sure what it is, or whether it’s relevant to my setup. Are those entries normal?

From searching, this github issue seems similar, though it’s for OpenWRT/LEDE and a different chipset. Unfortunately no solution.

Perhaps I will update ddwrt on the AP in case that helps. It’s currently on v3.0-r33555 std (10/20/17).


#44

And it happened again.

Similar to the above points, the pings to my machine failed ~2-3 minutes after a group key handshake. I don’t know if that’s just coincidence though.

This time I didn’t have the patience to wait for it to reconnect, so I restarted the NetworkManager service after 10 minutes.

To add to previous observations, during interruptions iwconfig still reports connection to my network ESSID, Access Point [MAC ADDRESS], and fluctuating Link Quality and Signal level stats.


#45

So wpa2…

Each host will have two keys: one is used for unicast traffic, the other for broadcast traffic (the group key). The access point is what initiates the group key exchanges, usually on a timer. Typical values for the timer are 5 min (Mikrotik), 30 min (OpenWRT), 60 min (ddwrt), and either never or 2 h (TP-Link).
The group key needs to rotate every once in a while to prevent disconnected stations from sniffing or injecting broadcast traffic long after their initial connection. Typically you need broadcast to work unless you want static IPs and static ARP entries on every host in the LAN.

The encryption/decryption is hardware accelerated, keys are stored and maintained in memory of the wifi nic as (mac address, key) pairs. wpa_supplicant/hostapd (on station/ap - 2 utilities sharing the codebase) will maintain the entries via a compatible driver.

I’d say in this case it’s likely your wpa_supplicant or hostapd is having issues communicating with the driver.


Yes, they shouldn’t happen that frequently for each of the devices while connected; keys should be exchanged on reconnection, after enough traffic, or whenever a client asks. It’s OK to see it all the time for the same client if it’s roaming between APs.

Depends on the hostapd.conf that the ddwrt scripts end up generating; I wouldn’t worry too much.


iwconfig is the old utility using the old API (wireless extensions), and it’s only needed with old drivers. iw is the currently maintained utility for mac80211-compatible wifi drivers; iwlist belongs to the same older toolset as iwconfig. Generally you need iw, wpa_supplicant, drivers and config files for wifi to work.

You can try setting the group key rotation interval to never or maxint or some ridiculously high value, if keys never change once established the broadcast traffic should work forever. It wouldn’t fix your drivers but might mitigate your issues.


#46

Would that correspond to “Key renewal interval” (in Wireless -> Wireless Security)? Default: 3600, Range: 1 - 99999. Right now I upgraded to the latest (r36808) ddwrt. If the problem happens again, I can try changing this next.

Would that mean the problem and solution are in the AP firmware and client kernel drivers? Client is current in updates on Fedora 28 ( 4.17.19-200.fc28.x86_64, wpa_supplicant v2.6). It would be such a relief if a firmware update solved this…

Ok, that’s possible. Some devices are cellphones that go back and forth between the AP and the router’s wireless as people walk through the house.

Long ago I learned to use ifconfig/iwconfig for quick network info and stuck with it. I like how quick and compact the output is. But I’m not using them for setting configuration, just displaying. Would you suggest using eg. iwlist wlp6s0 scan instead?