Comprehensive debugging of network issues

Is there a diagnostic tool for identifying exactly where (which hardware or software) a network issue is originating? Or at least get some decent clues?

I have a situation where network connectivity randomly gets interrupted for a few minutes then comes back on its own.

  • Happens with multiple devices, though not necessarily at the same time (confirmed by checking that other devices, e.g. the wired VOIP ATA, still have a connection)
  • Happens over wifi, but the wifi doesn’t drop (confirmed by client uptime on wifi access point)
  • Usually can force connection to come back by disconnecting and reconnecting client to network.
  • Seems to happen with different OSes, but I don’t have much detail on this.

Network setup:

client <—(wifi)—> wireless access point (ddwrt) <—(powerline adapters)—> router (ddwrt) <—(ethernet)—> cable modem

I have some suspicion of the powerline adapters, but I think I still experienced the issue when connected directly to the router’s wifi.

I can’t think of any such tool. Every time I had an issue like that at home, I had to come up with some kind of logging myself, and that always helped (e.g. running mtr --raw and letting the file build up, or increasing the verbosity of the DHCP or wifi logs and then having a look days later).
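
A home-grown logger of this kind can be very small. The following is only a sketch of one way to do it, not what was actually run here: it pings a few hosts once each and appends a timestamped row per host to a CSV file. The host list, the file name `netlog.csv`, and all function names are made up for illustration; the `-c` and `-W` flags are those of Linux iputils ping.

```python
import csv
import subprocess
import time
from datetime import datetime

# Hypothetical targets: access point, router, and a public address.
HOSTS = ["192.168.2.2", "192.168.2.1", "8.8.8.8"]


def ping_once(host, timeout_s=2):
    """Return True if a single ping gets a reply (Linux iputils ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def format_row(when, host, ok):
    """Build one log row: ISO timestamp, host, and 'up' or 'down'."""
    return [when.isoformat(timespec="seconds"), host, "up" if ok else "down"]


def log_forever(path="netlog.csv", interval_s=10, probe=ping_once):
    """Append one row per host every interval_s seconds until interrupted."""
    while True:
        with open(path, "a", newline="") as fh:
            writer = csv.writer(fh)
            for host in HOSTS:
                writer.writerow(format_row(datetime.now(), host, probe(host)))
        time.sleep(interval_s)
```

Leaving `log_forever()` running in a terminal and reading the file after the next drop should show which hosts went down and when.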


And how is mtr's output interpreted?

This network issue has been around for years and it bugs my optimizing mind, but most don’t bother looking for solutions since the workaround of disconnecting/reconnecting is so easy.

On the other hand, just trying to describe the issue in a useful way is difficult.

The best diagnostic tool is your brain and proper process.

General process for debugging network issues is either top down or bottom up (in terms of the protocol stack) or an intelligent guess and work from there.

Given you’re using a mix of wifi and power line adapters, which are layer 2 devices and both susceptible to interference, I’d be starting at layer 1 or layer 2 with those devices.

i.e.,

  • are the cables all good (layer 1)
  • are there any devices using the same medium that may be causing interference (i.e. devices on the 2.4 or 5 GHz radio bands, or large appliances on the power circuit which may be causing a drop in output or noise on the circuit)
  • if there are such devices, attempt to isolate them (i.e., remove/turn off/disable/etc.) and see if the problem persists.

Keep a log (can not stress this enough!) of exact date/time of your drop outs and also the date/time of changes you make, and see if you can establish a pattern that may coincide with something else going on in your house.
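
As a rough illustration of what such a log makes possible, here is a minimal sketch. It assumes the log has already been reduced to (timestamp, reachable) pairs plus a list of (timestamp, note) entries for changes; both structures and both function names are hypothetical.

```python
from datetime import datetime


def find_outages(samples):
    """Collapse (timestamp, reachable) samples into (start, end) outage intervals.

    samples must be sorted by time. An outage starts at the first failed
    sample and ends at the next successful one; an outage still in progress
    at the end of the data is reported with end=None.
    """
    outages = []
    start = None
    for when, ok in samples:
        if not ok and start is None:
            start = when
        elif ok and start is not None:
            outages.append((start, when))
            start = None
    if start is not None:
        outages.append((start, None))
    return outages


def changes_before(outage_start, changes, window_s=3600):
    """List logged changes made within window_s seconds before an outage."""
    return [
        note
        for when, note in changes
        if 0 <= (outage_start - when).total_seconds() <= window_s
    ]
```

Lining up the outage start times against the change log (or against anything else with a timestamp, such as when an appliance was switched on) is what turns a vague "it drops sometimes" into a pattern.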

Biggest thing with network troubleshooting in my experience is figuring out how to properly rule something out as being a factor. I.e., you need to test and confirm behaviour.

If you can properly rule something out as being a factor (i.e., devise a reliable test for that) then you can cross that off the list and move onto something else.

There are tools which may help you perform tests and snoop traffic, etc. But they won’t magically debug this for you. They will however help you test theories and observe what is going on in more detail.

Bear in mind with WIFI and power line stuff, it may be something entirely unrelated to your network. A classic case of intermittent issues I was having on a mine site for work was due to heavy machinery moving around sea containers and blocking line of sight periodically on a point to point wifi link.


Ok, so it doesn’t look like there’s a magical network debugging tool. Unfortunately my knowledge of networking is limited, which makes the ‘brain’ approach more difficult. Maybe other brains here can help.

Some clues about the issue:

  • intermittent drop (could be a few times in a day or no observed problems for a week) of network communication for a given client device, i.e. browser can’t load web pages and pinging even local computers fails
  • but during a drop, the home phone (VOIP) still works (connected by ethernet to router), therefore it’s probably not a (complete) failure of the ISP or modem or router
  • during a drop, the client’s wifi connection is not dropped, therefore either the network manager and wireless access point’s client uptime stats are lying or it’s not necessarily a problem with the wifi connection

I wondered if it might be a DHCP issue, but I dunno… my client devices have static local IPs. For example, maybe a clash of DHCP servers on the router and wireless access point (whose DHCP server is disabled in the web admin interface).

The router was upgraded, and the issue existed before and after this change. I suppose it could still be a configuration issue.

It could be an issue with the powerline adapters, but the issue was experienced even when connected directly to the router by wifi. Could the powerline adapters still cause a problem in that case?

Could it be due to the wireless access point, whether its hardware or software (ddwrt) configuration?

I doubt it is a DHCP issue (I am assuming you are only using ONE DHCP server and ONE subnet, and not doing something stupid like running a second router piggybacked off another wifi router with 2 instances of DHCP and 2 subnets; I’ve seen people do that before. APs and routers are different things…). A machine will generally hold its DHCP lease until it expires, and periodically renew it prior to expiry.

I’d suggest interference. Interference won’t necessarily result in wifi dropping entirely, but traffic on it might not reach the destination. Repeat: your wireless network can stop passing traffic without the client ever “dropping” off the SSID.

The VOIP phone, being plugged directly into the router, isn’t on wifi. If DHCP was the issue, it would also affect any hard-wired devices that don’t have static IPs (like, I’d suggest, your phone).

So i think you can isolate it to the WIFI or possibly (if your AP is connected via power line) the power line network.

I’d also suggest drawing yourself out a network diagram and identifying the path that dropping-out devices are taking to the internet. Then investigate the devices on that path, and attempt to verify whether or not each segment or step in the path is also dropping out.

If you can, you could isolate the powerline adapters by plugging a wifi AP directly into your router. Sure, signal might not be the best elsewhere in the house, but you can at least monitor for drop outs that may be getting caused by the powerline segment. If there are no more drop outs, try the reverse: run without wifi, cabled directly to the end of a power line adapter.
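
The ruling-out logic can be written down quite mechanically. The sketch below is just an illustration of the idea: each path is a list of segment labels (the labels are my own shorthand for the layout in the first post), and anything that carried traffic successfully during a drop is cleared.

```python
def suspect_segments(failing_paths, working_paths):
    """Return segments that appear on a failing path but on no working path.

    Each path is a list of segment labels. A segment that carried traffic
    successfully during the drop is ruled out; the rest remain suspects.
    """
    cleared = {seg for path in working_paths for seg in path}
    suspects = []
    for path in failing_paths:
        for seg in path:
            if seg not in cleared and seg not in suspects:
                suspects.append(seg)
    return suspects


# Hypothetical labels for the two paths discussed so far.
wifi_client_to_router = ["client NIC", "wifi link to AP", "AP", "powerline", "router"]
ata_to_internet = ["ATA", "ethernet to router", "router", "modem"]

print(suspect_segments([wifi_client_to_router], [ata_to_internet]))
```

With only the ATA as a known-good path, four segments remain on the list. Every additional device that is confirmed working during a drop adds a working path and shrinks it.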


If you have the time when the connection drops, see what devices you can ping on the network, and WAN. As simple as a ping is, it can verify that a lot of things are or are not working.

This may be things you already know; I apologize if they are. These are my troubleshooting methods when I lose network connectivity.

Ping 127.0.0.1. This will ping your local machine, and confirms that its IPv4 stack is working.

Next, ping your router’s IP address. This confirms connectivity to the router. If this works, you know the physical connection to the router, and all hardware in between, is doing its job. In addition, this confirms the machine has a usable IP configuration (whether from DHCP or static).

Next, ping something on the WAN. 8.8.8.8 is a good choice. This confirms your modem has connectivity to your ISP, and that your ISP’s connection to Google is working also.

Lastly, ping www.google.com. This confirms the same as the last step, but also confirms DNS is working.

Hopefully this is helpful.
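
If it helps, the four steps above can be scripted so they are quick to run in the middle of a drop. This is a minimal sketch, not a definitive tool: the router address is a placeholder to replace with your own, the labels and function names are invented, and the ping flags are for Linux iputils ping.

```python
import subprocess

# The four steps in order. The router address is a placeholder.
LADDER = [
    ("127.0.0.1", "local IP stack"),
    ("192.168.2.1", "link to the router"),
    ("8.8.8.8", "route to the internet"),
    ("www.google.com", "DNS resolution"),
]


def ping_ok(target, timeout_s=2):
    """Return True if one ping succeeds (flags are for Linux iputils ping)."""
    cmd = ["ping", "-c", "1", "-W", str(timeout_s), target]
    return subprocess.run(cmd, capture_output=True).returncode == 0


def first_failure(probe=ping_ok, ladder=LADDER):
    """Walk the ladder and return the label of the first failing step, or None."""
    for target, label in ladder:
        if not probe(target):
            return label
    return None
```

Calling `first_failure()` returns `None` when every step answers, otherwise the name of the first step that did not, which is where to start looking.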


Settings as described here, but in summary, IP of wireless access point (WAP) is 192.168.2.2 and DHCP server is disabled (router is 192.168.2.1).

I was afraid this might be a possibility… harder to debug.

Is this necessarily true? Eg. if the WAP somehow messed up DHCP (or didn’t turn it off) would it also affect other devices not connected to it? The home phone’s ATA is wired to the router directly, while most other devices are connected to the AP either by wire (printer, one computer) or wifi (several computers, smartphones, ps3, tablet).

I tried doing this in the first post. One thing I’m not sure about: what is the path a client of the WAP would take when pinging another client on the WAP? Would it go directly (i.e. device A -> WAP -> device B) or does it always need to go through the router (device A -> WAP -> powerline -> router -> powerline -> WAP -> device B)? If it’s the latter, it would be difficult to isolate segments.

Stopping wifi/network is a good way to get murdered, lol. I do have a 100’ ethernet cable I could use to connect the WAP and router without the powerline adapters, but the path of the wire would annoy people who don’t care as much about debugging.

Note that the router also has wifi, but it’s literally at the opposite, most distant corner of the building from my computer, yet my computer’s wifi often prefers to connect to the router than the WAP which has a much stronger signal to my location (both use same SSID). The same sometimes happens with a computer in the same room as the WAP…

I didn’t try pinging 127.0.0.1 (next time) but anything else I tried times out (WAP, router, other computers, google/websites, etc)

The drop happened again, during which:

  • pinging 127.0.0.1 worked
  • pinging other clients or the AP resulted in “Destination Host Unreachable” and 100% packet loss
  • iwconfig showed Link Quality=56/70 Signal level=-54 dBm during the drop, same as usual
  • pinging the router resulted in no output for several seconds, then “ping: sendmsg: No buffer space available” and 100% packet loss

Is this message a useful clue?

ping: sendmsg: No buffer space available

And it happened again.

The affected system sees “Destination Host Unreachable” and later “ping: sendmsg: No buffer space available” when pinging the router or other devices.

This time I booted another computer while the first system was still without internet. The second computer had no connectivity or internet problems. Pinging the affected system from this second computer also resulted in “Destination Host Unreachable”.

The AP reported the affected computer’s wifi connection throughout the issue. The uptime was not interrupted.
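
One more thing that might be worth capturing during a drop is the neighbour (ARP) table. On Linux, “Destination Host Unreachable” for a host on the same subnet usually means the address could not be resolved to a MAC address, and `ip neigh show` lists such entries as FAILED or INCOMPLETE. The sketch below parses that output; the sample text is made up for illustration and was not captured from the affected machine.

```python
def parse_neighbors(text):
    """Parse `ip neigh show` output into {ip: (mac_or_None, state)}."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        mac = None
        if "lladdr" in parts:
            mac = parts[parts.index("lladdr") + 1]
        table[parts[0]] = (mac, parts[-1])
    return table


def unresolved(table):
    """Return IPs whose MAC address could not be resolved."""
    return sorted(
        ip
        for ip, (mac, state) in table.items()
        if mac is None or state in ("FAILED", "INCOMPLETE")
    )


# Illustrative sample only, in the format printed by `ip neigh show`.
SAMPLE = """\
192.168.2.1 dev wlp6s0  FAILED
192.168.2.2 dev wlp6s0 lladdr 00:11:22:33:44:55 STALE
"""

print(unresolved(parse_neighbors(SAMPLE)))
```

If the router and AP both show up as unresolved while the wifi association is still up, that would suggest frames are not getting across the wireless link in at least one direction, rather than an IP-level problem.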

I would bring in a second access point and isolate the problem machine/machines to it.

This will give you flexibility in tracking down the issue, which may be one of the following.

Client-side hardware: if the client also disconnects from the second access point, it may be a hardware issue on that machine. Resolving it could be as easy as adding a USB wifi adapter and retesting to see if it happens again.

Router side: if the second access point resolves the issue, it could be the router dropping connections under too much load, or an incompatibility between the router’s wifi chipset and the client’s adapter. Or, if we’re talking 5 GHz, possibly shadowing/reflection of the signal giving you a false signal strength, which can be a pain to track down.

Access point: if the second access point somehow makes the problem worse, that may mean there is too much noise in the environment,

which may be resolved by placing your router on the recommended channels.

Either way, maybe this will help in figuring out the issue or give you some more clues as to what’s happening.

And yes, ping is a powerful testing tool, especially when watching the latency.

I would use ping and see if there is a huge difference between the client’s access times to the router vs the access point. This may be a big clue as to what’s happening.


Such a simple idea, why didn’t I think of that…

When pinging the access point, what route do packets take? Is it just between the client and AP, or does traffic always go through the router?

Results from pinging overnight.

Affected client to access point:

--- 192.168.2.2 ping statistics ---
36448 packets transmitted, 36448 received, 0% packet loss, time 36527978ms
rtt min/avg/max/mdev = 0.546/3.460/263.056/5.382 ms

Affected client to router:

--- 192.168.2.1 ping statistics ---
36426 packets transmitted, 36426 received, 0% packet loss, time 36483277ms
rtt min/avg/max/mdev = 1.591/5.992/206.959/5.569 ms

Second computer (same location) to access point:

--- 192.168.2.2 ping statistics ---
36580 packets transmitted, 36576 received, 0% packet loss, time 36629948ms
rtt min/avg/max/mdev = 1.009/4.050/188.861/4.154 ms

Second computer (same location) to router:

--- 192.168.2.1 ping statistics ---
36379 packets transmitted, 36374 received, 0% packet loss, time 36437601ms
rtt min/avg/max/mdev = 2.737/7.136/252.288/4.963 ms
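
Note that the “0% packet loss” figure is rounded: the second computer’s runs received 4 and 5 fewer packets than were sent. A small sketch for pulling the exact numbers out of a summary line (the function name is my own):

```python
import re

SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) received")


def exact_loss(summary_line):
    """Return (lost_packets, loss_percent) from a ping summary line."""
    match = SUMMARY_RE.search(summary_line)
    if match is None:
        raise ValueError("not a ping summary line")
    sent, received = int(match.group(1)), int(match.group(2))
    lost = sent - received
    return lost, 100.0 * lost / sent


line = "36580 packets transmitted, 36576 received, 0% packet loss, time 36629948ms"
lost, pct = exact_loss(line)
print(f"{lost} lost, {pct:.3f}% loss")
```

A handful of lost packets over ten hours is tiny, so none of these runs seems to have caught an actual drop.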

Also, what does “Invalid misc” mean?

Is a count of 4568 normal over a day of uptime? If not, does it point to any specific part of the network?

wlp6s0    IEEE 802.11  ESSID: [SSID]  
          Mode:Managed  Frequency:2.437 GHz  Access Point: [MAC ADDRESS]   
          Bit Rate=300 Mb/s   Tx-Power=20 dBm   
          Retry short limit:7   RTS thr=2347 B   Fragment thr:off
          Power Management:on
          Link Quality=57/70  Signal level=-53 dBm  
          Rx invalid nwid:0  Rx invalid crypt:0  Rx invalid frag:0
          Tx excessive retries:0  Invalid misc:4568   Missed beacon:0
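
A single cumulative count is hard to judge on its own; how fast it grows, and whether it jumps around the time of a drop, is probably more telling. A minimal sketch for extracting the counter from two iwconfig samples and turning the difference into a rate (function names are hypothetical):

```python
import re


def invalid_misc(iwconfig_text):
    """Extract the 'Invalid misc' counter from iwconfig output, or None."""
    match = re.search(r"Invalid misc:(\d+)", iwconfig_text)
    return int(match.group(1)) if match else None


def rate_per_hour(count_before, count_after, seconds_elapsed):
    """Average counter increase per hour between two samples."""
    return (count_after - count_before) * 3600.0 / seconds_elapsed
```

Sampling this every few minutes alongside a ping log would show whether the counter climbs steadily or in bursts that line up with the drops.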

Well, in my mind it should go through the access point first,
but in reality it may actually travel via whatever network layer to the router first and then the access point, regardless of it physically going through the access point first.

Basically, does the access point recognize the ping when it is sent from the client, or does it recognize it after it comes from the router?

I can’t answer that personally, as I’m unsure if it actually matters, since it’s all supposed to look like a wired network regardless of setup.
The best I could find was this from Stack Overflow.

https://stackoverflow.com/questions/8574213/how-does-a-icmp-packet-ping-command-work-in-wireless-network

As for “Invalid misc”, the definition given is: “Other packets lost in relation with specific wireless operations.”

YaY !! real helpful i know … But

https://unix.stackexchange.com/questions/337488/wifi-too-much-invalid-misc-how-to-fix-it

It seems like it may be due to other traffic on the wireless channel, which may or may not be a problem.

And this forum post may be of some actual use:

http://forums.cabling-design.com/wireless/excessive-invalid-misc-9087-.htm

So I’m not sure 4500 packets of ‘noise’ is a cause for concern, but it does indicate there is some interference on the network.

Why this may be causing the issue I’m unsure, but one thing of note: if I read your posts correctly, you said you were using the same SSID on both the router and the access point.

For some reason, some people who do the whole wifi thing treat this as taboo,
while others do this by design.
I’m sure there are reasons both for and against doing this, and the forum at ubnt may give some helpful clues in getting you straightened out.

https://community.ubnt.com/t5/UniFi-Wireless/Multiple-APs-should-use-different-channels-but-the-same-SSID/td-p/1828117


And oh yeah, as for the ping test:
next to no packet loss, which is good, and the latency seems about average, maybe a ms or 2 off, but that may be accounted for by the structure the signal is passing through.

So far, interference is the likely target to try and fix.

Perhaps I can try reducing the AP transmit power and see if that changes invalid misc.

Yes I use the same SSID for the router and AP, but on different channels (1 and 11 I think), as recommended in that link. Currently only using 2.4 GHz.

“Invalid misc” to me sounds like a garbled ethernet frame. e.g., in Cisco terms, a runt or giant frame.

which could be interference either on the power line segment OR the WIFI. What device is reporting those invalid misc?

I suspect if you reduce transmit power you’ll just be more susceptible to interference, not less.

I’d really try to hard-wire a device into each network segment and see if that goes down when the other stuff does. Yes, this will take some time, but it will be quicker than random guessing :)

e.g.,

  • plug directly into the router. see if it works when other stuff doesn’t
  • plug directly into the power line extender. as above.

You can speculate on the reasons all day, but the first thing you need to do IMHO is figure out what components work and what don’t during the time the problem is happening. That will give you a good indication of where to start pointing fingers.

“interference” could also be caused by a dodgy network cable, over-length network cable or a flaky NIC.

It can also be duplex problems. Check that the speed and duplex of all your connections are what you think they are, i.e. both devices on the same cable are using the same speed and duplex. Sometimes autonegotiation fails, especially when you’re dealing with 100 Mb or 10 Mb ethernet, where it was not a mandatory part of the standard.
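
On a Linux box, `ethtool <interface>` prints the negotiated speed and duplex. The sketch below pulls those fields out of its output; the sample text is illustrative only, and treating half duplex as suspicious is a rule of thumb, not a definitive test.

```python
import re


def link_settings(ethtool_text):
    """Extract speed, duplex and autonegotiation from `ethtool <iface>` output."""
    fields = {}
    for key in ("Speed", "Duplex", "Auto-negotiation"):
        match = re.search(rf"^\s*{key}:\s*(.+?)\s*$", ethtool_text, re.MULTILINE)
        fields[key] = match.group(1) if match else None
    return fields


def looks_mismatched(settings):
    """Flag a link that came up at half duplex, a common sign of failed negotiation."""
    return settings.get("Duplex") == "Half"


# Illustrative sample only, trimmed to the relevant lines.
SAMPLE = """\
Settings for eth0:
    Speed: 100Mb/s
    Duplex: Half
    Auto-negotiation: on
    Link detected: yes
"""

print(link_settings(SAMPLE), looks_mismatched(link_settings(SAMPLE)))
```

The same check is worth doing on each wired hop, including the ports the power line adapters plug into, if the router and AP firmware expose that information.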

Check the client systems too. Look at the event logs and see if there is something silly like a master browser fight going on; all versions of Windows do this. Long story short: on a non-domain network, like a home or small business, the PCs decide amongst themselves which one is the glorious “master browser”. In a busy computer shop, with lots of random PCs being connected to the network for a short period of time and then leaving, master browser fights were common. If you have a mixture of Windows versions on the network, like a few Vista machines and a bunch of Windows 7 and Windows 10, then instead of an orderly election to decide who is the one, it might be more like a 12-way fist fight.
Linux machines wouldn’t be affected by that; it’s just something between Windows machines, I think.
Edit: even if the master browser election goes in an orderly fashion, it still makes the internet drop out briefly, so if master browser elections are being triggered too often it’s noticeable, and there would be event log entries at the time it happens on all affected machines.

Two linux devices (a desktop and netbook) at the same location, connected by wifi to the AP.

You’re right, reducing transmit power didn’t get rid of the invalid misc. I didn’t do throughput tests.

This already exists: the ATA for voip. It works when the issue occurs, so I know the problem is not the ISP, modem, and at least some of the router’s functionality.

As for plugging into the power line adapter, it has just one port, so that has to be the AP. I do have a network printer plugged into the AP’s switch, but not sure how to get useful info from that.

Good point about dodgy network cables, etc. I could try replacing the ethernet cables connecting the power line adapters with the router or AP. I don’t have a way to know if the replacement is any better, but it’s one of the easiest steps I can take.

Would the fallout from this affect linux systems too? I use only linux machines but there are Windows 7 and a Windows 10 machine on the network, plus a bunch of android devices and an iPad. And a PS3. In addition to the ATA and network printer. I keep forgetting about network population booms.

No, for troubleshooting, you can take down the AP, plug a machine into that port, and test to see if it is stable for some long enough period of time.

If you do not see issues on the powerline adapter without the wifi, you then know the problem is your wifi and not your powerline network segment. Because right now, interference or network drop could be coming from both… and you’re taking a guess that the powerline adapter isn’t the cause.

Number one rule of troubleshooting: don’t just guess. Take a guess (as a starting point), sure. But CONFIRM whether or not your guess is correct (i.e. devise a method to rule it out) before assuming 100% that it is.

Otherwise you’re going to go around in circles forever, possibly based on assumptions that aren’t true.

Diagnosing this stuff isn’t rocket science, but you do need to make 100% sure any assumptions that you are making are correct. If you haven’t confirmed that, or can’t devise a test, rule everything else out that you can.

I’ve been bitten in the ass so many times earlier in my career trying to diagnose stuff for huge amounts of time because i made one or more assumptions that proved not to be true. Had I actually tested my assumption(s) instead of just assuming without test, i would have fixed the issue(s) in way less time.

Check the event logs on the Windows machines for events logged at the time the dropouts occur.
As for the PS3, no idea. If it doesn’t need to be connected, I wouldn’t leave it permanently connected; seeing whether the problem only happens when it’s connected is the best way to troubleshoot that.
iPads and Android devices don’t have any internal system logging like the Windows event log that I know about.

Maybe a pfSense box, set up so your entire network has to go through it to get to the cable modem, would be the easiest and cheapest way to really monitor what’s going on with the network traffic.

Have you explored whether it’s just a crappy driver on some machines making their WLAN card work, but unreliably? I’ve seen that a bunch of times.