Comprehensive debugging of network issues


#17

Check the client systems too. Look at the event logs and see if there is something silly like a master browser fight going on; all versions of Windows do this. Long story short: PCs on a non-domain network, like a home or small business, decide among themselves which PC is the glorious “master browser”. In a busy computer shop, with lots of random PCs being connected to the network for a short time and then leaving, master browser fights were common. If you have a mixture of Windows versions on the network, like a few Vista machines and a bunch of Windows 7 and Windows 10, then instead of an orderly election to decide who is the one, it might be more like a 12-way fist fight.
Linux machines wouldn’t be affected by that; it’s just something between Windows machines, I think.
-edit: even if the master browser election goes in an orderly fashion, it still makes the internet drop out briefly. So if master browser elections are being triggered too often, it’s noticeable, and there would be event log entries at the time it happens on all affected machines.


#18

Two linux devices (a desktop and netbook) at the same location, connected by wifi to the AP.

You’re right, reducing transmit power didn’t get rid of the invalid misc. I didn’t do throughput tests.

This already exists: the ATA for VoIP. It keeps working when the issue occurs, so I know the problem is not the ISP, the modem, or at least some of the router’s functionality.

As for plugging into the power line adapter, it has just one port, so that has to be the AP. I do have a network printer plugged into the AP’s switch, but not sure how to get useful info from that.

Good point about dodgy network cables, etc. I could try replacing the ethernet cables connecting the power line adapters with the router or AP. I don’t have a way to know if the replacement is any better, but it’s one of the easiest steps I can take.

Would the fallout from this affect linux systems too? I use only linux machines but there are Windows 7 and a Windows 10 machine on the network, plus a bunch of android devices and an iPad. And a PS3. In addition to the ATA and network printer. I keep forgetting about network population booms.


#19

No, for troubleshooting, you can take down the AP, plug a machine into that port, and test to see if it is stable for some long enough period of time.

If you do not see issues on the powerline adapter without the wifi, you then know the problem is your wifi and not your powerline network segment. Because right now, interference or network drop could be coming from both… and you’re taking a guess that the powerline adapter isn’t the cause.

Number one rule of troubleshooting: don’t just guess. Take a guess (as a starting point), sure. But CONFIRM whether or not your guess is correct (i.e. devise a method to rule it out) before assuming 100% that it is.

Otherwise you’re going to go around in circles forever, possibly based on assumptions that aren’t true.

Diagnosing this stuff isn’t rocket science, but you do need to make 100% sure any assumptions that you are making are correct. If you haven’t confirmed that, or can’t devise a test, rule everything else out that you can.

I’ve been bitten in the ass so many times earlier in my career, spending huge amounts of time trying to diagnose stuff because I made one or more assumptions that proved not to be true. Had I actually tested my assumption(s) instead of just assuming, I would have fixed the issue(s) in way less time.


#20

Check the event logs on the Windows machines for events logged at the time the dropouts occur.
PS3: no idea. If it doesn’t need to be connected, I wouldn’t leave it permanently connected; seeing whether the problem only happens when it’s connected is the best way to troubleshoot that.
iPads and Android devices don’t have any internal system logging like the Windows event log that I know of.

Maybe a pfSense box, placed so your entire network has to go through it to get to the cable modem, would be the easiest and cheapest way to really monitor what’s going on with the network traffic.

Have you explored whether it’s just a crappy driver on some machines, making their WLAN card work but unreliably? I’ve seen that a bunch of times.


#21

https://www.bufferbloat.net/projects/bloat/wiki/Introduction/

Is QOS configurable on your router?

Practical Packet Analysis: Using Wireshark to Solve Real-World Network Problems https://www.amazon.com/dp/1593272669/ref=cm_sw_r_cp_api_6VABBbX85V7B3


#22

My guess is your problem is here: hostapd and its affiliates tend to freak out at times. On my home network, where I use hostapd as a wifi hotspot, if I connect a “too slow” device it basically halts my whole network, to the point where I can’t even watch YouTube. But try attaching a computer running Wireshark, and see what happens to the traffic when it happens.
My setup is a Raspberry Pi which bridges to my modem, running DHCP, iptables, and all that jazz. But if I connect a slow wifi device and start downloading, let’s say, a Steam game, my whole network, even the wired part, grinds to a halt.
If I disconnect the slow device and give the network a minute or two to cool down, everything returns to normal.


#23

I tried to scroll through the responses above, but your issue sounds like something is a bit bonkers with DHCP, perhaps other parts as well. If you haven’t already tried it, increase the lease time and see whether the timing of the problem shifts to match the new setting.

Second, lock in your leases as static entries.

Third, attach a machine or a few with static IPs set. Run ‘debugging’ all the time, and when other parts misbehave, check whether these machines misbehave as well.


#24

It’s normal for some of the invalid-frame counters to be non-zero; that’s OK.

Do you happen to have a wired computer or a USB drive where you could redirect verbose logs from hostapd on ddwrt?

You could leave a ping or MTR running in the other direction as well.


#25

Thanks to everyone for all the input and ideas.

Of course you’re right but unfortunately, I have the added constraint of non-destructive testing. Network users won’t be happy with disruptions, so ideally diagnostics should be done with devices I control.

I could switch the AP (TP-Link WR841N + ddwrt vs Engenius ESR-9850 + stock firmware) with another one though at least temporarily. Maybe better than nothing?

It continued happening after I replaced my old system with a new one, using different WLAN cards (Atheros on the old, Realtek on the new) and different distros (Ubuntu vs Fedora).

Yes, router is a Netgear R7000. I was using QOS with FQ_CODEL, but recently disabled QOS since things seem fine (and faster) without it and I wanted to eliminate a variable.

Is there a shorter tutorial you’d recommend for this issue?

My issue is that even ping fails, so it’s not just about reduced throughput. Is hostapd still suspect?

Leases are 1440 minutes by default. My affected machine has a static IP, as do several others.

I confirmed that when one machine has the network drop and outgoing pings fail, another machine (netbook) was able to connect to the AP/router/internet without problems, but pinging the affected machine failed.

What do you mean by "run ‘debugging’ " ?

I could leave a running netbook wired to the AP, would that help? Where do I set verbose logs from hostapd on ddwrt? I already have logging enabled on both the AP and router, though not redirected to another system at the moment. The AP has entries from hostapd but they’re just about authenticating, handshakes, etc. I haven’t noticed any hostapd messages on the router although it also has some (fewer) wifi clients.


#26

2 things there.

  1. It is expected that you’ll get slower results from a speed test with QOS/FQ_CODEL on because it will always try to keep some bandwidth open for new connections. However, it is processor intensive (and I think single-threaded?), so:

  2. at some point it will hit a brick wall. For instance, my Ubiquiti router peaks at around 100 Mbps with “Smart QOS” enabled. The good news is that (in my experience at least) it’s not necessary once you have 3-digit speeds.

You don’t really need to order and read through that whole book, but Wireshark is a go-to for troubleshooting networks.

I haven’t read through every post here, but have you tested each connection with iperf? I would eliminate layer 1 issues before anything else, especially since you’ve got powerline adapters in the mix.

Whoever mentioned starting at the top or bottom and incrementally moving through the network is correct, although I’d almost always recommend starting at the bottom. I doubt your problem is in layers 4, 5, 6 or 7.


ping -A can be your friend if you want to get some work done (or relax) while you wait for something bad to happen.
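
If you would rather have timestamps to line up against the router and AP logs, the same wait-for-it-to-break idea can be sketched in Python. This is only an illustration, not a replacement for ping -A: it assumes a Linux machine with the iputils `ping` (the `-c` and `-W` flags), and the function names and the addresses in the usage note are hypothetical.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo with the system ping and report whether it was answered.

    Assumes Linux iputils flags: -c (count) and -W (reply timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_outages(samples):
    """Collapse (timestamp, ok) samples into (first_fail, last_fail, missed) spans."""
    outages = []
    start = None
    last_fail = None
    missed = 0
    for ts, ok in samples:
        if not ok:
            if start is None:  # a new outage begins with this failed probe
                start = ts
                missed = 0
            missed += 1
            last_fail = ts
        elif start is not None:  # first success after a run of failures
            outages.append((start, last_fail, missed))
            start = None
    if start is not None:  # still down when sampling stopped
        outages.append((start, last_fail, missed))
    return outages


def monitor(hosts, interval_s: float = 1.0, duration_s: float = 60.0):
    """Ping each host in turn and return {host: [(timestamp, ok), ...]}."""
    samples = {h: [] for h in hosts}
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        for h in hosts:
            samples[h].append((datetime.now(), ping_once(h)))
        time.sleep(interval_s)
    return samples
```

For example, `monitor(["192.168.2.3", "192.168.2.1", "8.8.8.8"], duration_s=3600)` with placeholder addresses for the AP, the router, and an internet host, followed by `find_outages(...)` on each host's samples, shows which hop stopped answering first and when.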


#27

Sorry for the vague “run debugging” line. What I mean is that you collect data in the form of pings by IP and by DNS name, and check your routes (can you reach the specific machines?). However, you stated that you already have machines with static addresses…

Verify the following:

Does your static address space collide with the DHCP space?

Have you seen any logs indicating that one machine has multiple leases?

Have you run out of leases?

With the above checked, do you believe the problem could be related to the lease time?

How many Windows 10 machines in total? Are they set to “hide their MAC” (i.e. use random hardware addresses), which screws up address leases, fills the DHCP server with requests, and sometimes does not release leases correctly…?

The last point applies if you are using equipment that decides to change its MAC address, etc.

What does the ARP cache look like on your router? It can give hints as to whether the above is true.
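
As a rough illustration of what to look for: on a Linux-based router (ddwrt included) the ARP cache can usually be read with `cat /proc/net/arp` over SSH, and the text can then be checked on another machine. The sketch below assumes that file's usual six-column layout; the function names and the sample table are made up for the example and are not from this thread.

```python
from collections import defaultdict


def parse_proc_net_arp(text):
    """Return (ip, mac) pairs from text in the Linux /proc/net/arp layout."""
    entries = []
    for line in text.strip().splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 6:
            continue
        ip, flags, mac = fields[0], fields[2], fields[3].lower()
        if flags == "0x0" or mac == "00:00:00:00:00:00":
            continue  # incomplete entry: nothing answered for this IP
        entries.append((ip, mac))
    return entries


def macs_with_multiple_ips(entries):
    """Map each MAC that appears under more than one IP to its sorted IP list."""
    by_mac = defaultdict(set)
    for ip, mac in entries:
        by_mac[mac].add(ip)
    return {mac: sorted(ips) for mac, ips in by_mac.items() if len(ips) > 1}


# Made-up sample data, only to show the expected input format.
SAMPLE = """IP address       HW type     Flags       HW address            Mask     Device
192.168.2.2      0x1         0x2         aa:aa:aa:aa:aa:01     *        br0
192.168.2.101    0x1         0x2         aa:aa:aa:aa:aa:02     *        br0
192.168.2.117    0x1         0x2         aa:aa:aa:aa:aa:02     *        br0
192.168.2.120    0x1         0x0         00:00:00:00:00:00     *        br0
"""

if __name__ == "__main__":
    print(macs_with_multiple_ips(parse_proc_net_arp(SAMPLE)))
```

A MAC listed under several IPs may point to one device holding multiple leases, and many short-lived MACs may point to address randomization. Neither is proof on its own, just a hint about where to look next.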


#28

They are seeing disruptions right now due to the dropouts that haven’t been fixed for over a month. :)

Disruptive testing can be scheduled, unlike random drop outs.

Sometimes to bake a cake you gotta break some eggs.


#29

The drop happened yet again. I had swapped the AP for a spare unit (wifi not as good) just a few hours earlier.


@oO.o My internet is only 15/10 Mbps, so it should be nowhere near the point of overloading the router, QOS or not.

What should I be looking for with iperf?

Adaptive ping. Interpacket interval adapts to round-trip time, so that effectively not more than one (or more, if preload is set) unanswered probes present in the network. Minimal interval is 200msec for not super-user. On networks with low rtt this mode is essentially equivalent to flood mode.

Sorry, I’m not too familiar with networks. What is the reason for using ping -A ?

“Luckily” the latest drop happened while I had wireshark running. Does anything catch the eye?


#30

DHCP is from 192.168.2.100 and up. Static IPs are 192.168.2.2 through 192.168.2.9.
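
For what it's worth, that kind of overlap check is easy to script with Python's standard `ipaddress` module. A minimal sketch: the pool's upper end (.254) is an assumption, since only "192.168.2.100 and up" is known here, and the helper names are hypothetical.

```python
import ipaddress


def ip_range(first, last):
    """Inclusive list of IPv4 addresses from first to last."""
    lo = int(ipaddress.IPv4Address(first))
    hi = int(ipaddress.IPv4Address(last))
    return [ipaddress.IPv4Address(n) for n in range(lo, hi + 1)]


def statics_inside_pool(statics, pool_first, pool_last):
    """Return the static addresses that fall inside the DHCP pool."""
    lo = ipaddress.IPv4Address(pool_first)
    hi = ipaddress.IPv4Address(pool_last)
    return [ip for ip in statics if lo <= ip <= hi]


statics = ip_range("192.168.2.2", "192.168.2.9")
# The pool end of .254 is assumed; adjust it to the router's actual setting.
overlap = statics_inside_pool(statics, "192.168.2.100", "192.168.2.254")
print(overlap)  # prints [] because none of the static addresses are in the pool
```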

No.

I don’t think so.

Lease time with DHCP is 1440 minutes. The machine I use on which I notice the drops has a static IP.

Just one windows 10 machine, and it’s rarely ever on. People who buy new computers but use their phone 99% of the time… But in any case there are typically at most about a dozen IP leases on the router status page, usually half that.

What is the ARP cache, and where can I check it?


#31

Actually the issue has been around a few years… but you’re preaching to the choir. I go to great lengths to fix, optimize, etc.

Most people, including techies I know who said they’ve experienced similar issues, just want to minimize effort, and really won’t tolerate having devices disconnected or slowed down. Considering the disruption from the issue can be maybe just a minute every week or so, and easily worked around by reconnecting… the economics of this chart indicate I’ve already spent too much time:


#32

I think you are at the point where you need to replace the wireless NIC on the problem PC.

If I’m not mistaken, you have already:
changed out the router; problem persisted.
changed out the access point; problem persisted.
And it’s just this one machine that keeps dropping its connection.

So next is the NIC… hopefully that is the fix.


#33

It’s not just one machine. My previous machine at the same location experienced the same thing with a different NIC. It seems to have happened with at least one other machine in a different location, but I don’t get good debugging info from other users, often only long after the fact.

It happened with:
-router: Engenius ESR-9850; AP: TP-Link WR841N
-router: Netgear R7000; AP: TP-link WR841N
-router: Netgear R7000; AP: Engenius ESR-9850


#34

What is the brand and model of your powerline adapters?
From what I’ve read:
Engenius ESR-9850 - a product that’s been around since 2010
TP-Link WR841N - a product that’s been around since 2014
Netgear R7000 - looks like it came out in 2016/2017, but it’s had issues

I read a few forum posts and angry reviews from people who bought a TP-Link WR841N and described issues similar to the ones you are experiencing. Same with the Netgear R7000, but those were apparently resolved with newer firmware versions.
If those powerline adapters are TP-Link, I’ve got bad news for you: random internet dropouts are a known problem that firmware updates might help with, but as a matter of best practice I’d avoid using them if possible. Running a Cat6 cable, kept well away from existing power cabling, would be best.


#35

The powerline adapters are TP-Link TL-PA511KIT, and yes, there are known issues with them dropping connections due to their power-saving feature.

I haven’t been able to confirm that my issue is due to the powerline adapters, although I’ve had my suspicions, since in my case when one device has the drop, others are still able to connect via a powerline route. Disconnecting/reconnecting the affected device from wifi seems to “fix” the drop and unless that’s a coincidence of timing I don’t see how that would work if the powerline adapters were the problem.

Also, the WR841N and R7000 are running ddwrt. The ESR9850 unfortunately has no ddwrt support, so it’s on the latest stock firmware and the R7000 was purchased to retire it.


#36

Depends if that 1 minute per week is in the middle of something important now, doesn’t it?