I have a Supermicro E200-9B as my pfSense router (v2.6.0, the latest as of this writing), and occasionally it just stops responding to any traffic whatsoever on one of the interfaces (usually LAN, but occasionally WAN too). I wrote a stupid script that pings several known hosts on my network to detect this and resets the interface (literally just ifconfig $iface down; sleep 5; ifconfig $iface up fixes the problem). I run it from cron every minute, but this sucks and is stupid.
Anyway, the NICs are Intel i210. I don’t know if these are known to be awful or something, but generally I have a good impression of Intel NICs, so I don’t expect them to be the problem.
As far as pfSense goes, it apparently has no clue that something is wrong. There’s no log message, the interface still thinks it’s up and happy, everything seems fine…except it’s not passing any traffic.
I have even tried multiple NICs on the system (it has 4, I use 2, LAN and WAN) and it doesn’t really seem to matter which ones I use. They all seem to have this problem.
This seems to happen maybe once a week, though I haven’t exactly been keeping statistics. I do log the output of my script, so I have logs of when it fails…of course, there are also logs of when it succeeds, every minute, so it’s spammy and annoying. Maybe I will fix that to only report when it detects a broken interface, if no one can suggest what might be wrong.
Any ideas on what horrible thing I may have done to curse my router so?
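For reference, the watchdog described above, changed so that it only writes to the log when it actually has to reset something, could look roughly like the sketch below. The interface name igb1, the host addresses, and the log path are made-up placeholders, not values from my setup, and the ping flags assume FreeBSD's ping (where -t is an overall timeout in seconds).

```shell
#!/bin/sh
# Sketch of a quiet interface watchdog. All defaults below are
# hypothetical placeholders; override them via the environment.
IFACE="${IFACE:-igb1}"
HOSTS="${HOSTS:-192.168.1.10 192.168.1.11}"
LOG="${LOG:-/var/log/iface_watchdog.log}"
# FreeBSD ping: -c 1 sends one packet, -t 2 gives up after 2 seconds.
PING="${PING:-ping -c 1 -t 2}"

# Succeeds as soon as any one of the known hosts answers.
any_host_alive() {
    for h in $HOSTS; do
        $PING "$h" >/dev/null 2>&1 && return 0
    done
    return 1
}

# The same down / sleep / up sequence as the one-liner in the post.
reset_iface() {
    ifconfig "$IFACE" down
    sleep 5
    ifconfig "$IFACE" up
}

# Stays silent while things work; logs a timestamp only on failure.
check_once() {
    if any_host_alive; then
        return 0
    fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') $IFACE unresponsive, resetting" >> "$LOG"
    reset_iface
}

# Only act when invoked as "script.sh run", e.g. from cron.
if [ "${1:-}" = "run" ]; then
    check_once
fi
```

With this, the log contains one line per detected failure, which makes it much easier to line the failures up against the system log.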
I’ve experienced something similar on another (much jankier) device. The scenario reads the same, though the cause could be different: it works fine until an interface just seems to stop.
In my case it was a watchdog timer in the driver itself. Under high load it would time out and kill the interface, so not exactly random, but sporadic. You can at least check for that problem in /var/log/system.log, if you haven’t already, by looking for lines like these (for your Ethernet interface, which I’m going to guess is ‘igb’).
kernel: re0: watchdog timeout
kernel: re0: link state changed to DOWN
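A quick way to pull those lines out of the log is something like the sketch below. The function name is made up, and the pattern is only a guess based on the two lines above; the exact wording differs between drivers, so the match is case-insensitive and you may still need to loosen it.

```shell
# Print watchdog timeouts and link state changes for one driver.
# $1: driver prefix such as re or igb
# $2: log file (defaults to /var/log/system.log)
find_iface_events() {
    driver="$1"
    logfile="${2:-/var/log/system.log}"
    grep -iE "kernel: ${driver}[0-9]+: (watchdog timeout|link state changed to (UP|DOWN))" "$logfile"
}

# Example: find_iface_events igb
```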
The log from your script should tell you which time windows to narrow the review down to.
Obviously my case was Realtek, and I couldn’t change the card in that particular unit; normally I’d accept the received wisdom that Intel cards run best. Judging from Supermicro’s picture of the board, can you fit a PCI card into your unit, even if it has to sit with the case open and the card bracket removed, just as a test?
There were also a couple of recommended options that I set:
System > Advanced > Networking
Hardware Checksum offloading (disable)
Hardware TCP Segmentation Offloading (disable)
Hardware Large Receive Offloading (disable)
I assume these were aimed at the Realtek drivers, but I’d consider trying them anyway.
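To see what is actually active on an interface, regardless of what the GUI checkboxes say, you can look at the options=...<...> line that ifconfig prints. A minimal sketch that filters that line down to the offload-related flags is below; the function name is made up, and it assumes the usual FreeBSD ifconfig output format.

```shell
# Read ifconfig output on stdin and print the offload-related
# capability flags that are currently enabled, one per line.
offload_flags() {
    sed -n 's/^[[:space:]]*options=[0-9a-f]*<\(.*\)>.*/\1/p' \
        | tr ',' '\n' \
        | grep -E '^(RXCSUM|TXCSUM|RXCSUM_IPV6|TXCSUM_IPV6|TSO4|TSO6|LRO)$'
}

# Example: ifconfig igb0 | offload_flags
```

If the three settings above are all disabled and have taken effect, this should print nothing for that interface.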
If it’s none of these, consider how you can simplify the firewall rules.
Ensuring Hardware TCP Segmentation Offload (TSO) and Hardware Large Receive Offload (LRO) under System > Advanced on the Networking tab are set to their default (disabled)?
Increasing the network memory buffer?
Disabling the different types of message signaled interrupts (MSI/MSI-X)?
Disabling flow control?
There’s a full list of interface troubleshooting tips available here.
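On the memory buffer point, it can help to check how close the system is to its limit before raising anything. The sketch below parses the "mbuf clusters in use" line from netstat -m; the function name is made up, and it assumes the FreeBSD line format of current/cache/total/max.

```shell
# Read `netstat -m` output on stdin and print total/max mbuf
# clusters together with the percentage of the limit in use.
mbuf_cluster_usage() {
    awk '/mbuf clusters in use/ {
        split($1, n, "/")            # current/cache/total/max
        if (n[4] > 0)
            printf "%d/%d (%d%%)\n", n[3], n[4], n[3] * 100 / n[4]
    }'
}

# Example: netstat -m | mbuf_cluster_usage
```

If the percentage is regularly high, that would be an argument for raising the cluster limit (the kern.ipc.nmbclusters tunable on FreeBSD); if it stays low, the buffers are probably not the problem.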
Thanks for the input. I think I had one of these options set and toggling it seems to have fixed my problem. Unfortunately, I can’t remember exactly which one it was.