CentOS 8 graphical programs lose network connectivity after resume

I have an issue with a CentOS 8 laptop where after closing the lid to suspend and waking up the laptop later, desktop applications are not able to open any network connections. CLI programs such as ping, dig, curl, and dnf continue to work as expected, but graphical programs such as Firefox, Chrome, and Microsoft Teams do not. The problem does not occur on every suspend, but most of the time. When it happens I have not been able to resolve it without a reboot.

I’m curious if any others have experienced a similar problem. I’m just about out of ideas.

What is your Display environment?

Like GNOME, KDE, XFCE, etc?

That will help with troubleshooting.

Man, that sucks. It looks like this is an issue across several distributions, including the previous CentOS release.

When you resume from suspend what does dmesg say? Anything about kernel or network interface issues?

nmcli networking on

Seems to work for some people on CentOS after resume from suspend.
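If that alone doesn’t help, it may also be worth checking what NetworkManager reports after a resume, for example:

nmcli general status
nmcli networking connectivity check
nmcli device status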


You can try putting this script in /lib/systemd/system-sleep/00restart_networking.sh

#!/bin/sh

case "$1" in
        pre)
                echo "Do something before sleep"
                ;;
        post)
                echo "I'm awake now"
                nmcli networking on
                ;;
esac

Also check the permissions on the script; it needs to be executable.
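For example (systemctl suspend is a quick way to test without closing the lid):

sudo chmod 755 /lib/systemd/system-sleep/00restart_networking.sh
sudo systemctl suspend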

It is a GNOME display environment, chosen from the Workstation package set at install time.

I’ve attached the dmesg after suspend and resume.
resume-dmesg.txt (5.0 KB)

Unfortunately, nmcli networking on did not help. I also tried toggling it off/on a couple times.

After the machine comes up from sleep are there any failed services?

sudo systemctl list-units --failed

kdump.service was failed before going to sleep. No change after waking.

Launch your programs (like Firefox) from a terminal window. Then you might have some output to look at.

The output of netstat might show something strange going on, or you can run tcpdump on the interface and see what your apps are sending over the wire.
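For example, something like this while you reproduce it (wlp2s0 is just a placeholder for whichever interface is active):

sudo tcpdump -n -i wlp2s0 not port 22                    # watch live, minus your own SSH traffic
sudo tcpdump -n -i wlp2s0 -w ~/resume.pcap not port 22   # or save it to open in Wireshark later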

This may already have been resolved or dropped in priority, but I figured I’d toss in some other thoughts in case they help. There’s no rush; this is just here in case you have time later and need more things to try.

Huh…

A couple of the points below should be fairly quick to check or clarify, though a few others may not be. I’m just tossing out thoughts for some initial poking in case an obvious error or solution shows up (and trying not to ask for too much data). Feel free to try some or all of them, or ask if I’ve made a mistake or something is unclear. There are other ways to gather verbose data once we know what to target, but I hesitate to ask for a lot of output that may or may not be related.

  1. First, in case it has not been tried: if it’s kernel related (and it may not be), have you tried booting an older (or the oldest) installed kernel to see if it behaves better?

    That has sometimes helped me in a pinch when a sudden wireless-related issue landed in a release (long ago, in my case). It may not help here, but it’s something I tend to try before hunting for new updates.
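    A rough way to do that from a terminal, if catching the GRUB menu is awkward (grubby ships by default on CentOS 8; grub2-reboot assumes GRUB_DEFAULT=saved, which is the default there, and the index below is only an example):

    $ sudo grubby --info=ALL | grep -E '^(index|title)'
    $ sudo grub2-reboot 1     # one-shot: use boot entry 1 on the next reboot only
    $ sudo reboot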

  2. Do you have a system proxy configured (or proxies in those applications), or are you using a VPN prior to suspending?

    This may be unrelated, but I thought I would check, since command-line tools (or newly launched tools that have not fully exited) appear to be working.

  3. What type of messages do you see from Firefox/Chrome/Teams? Are they saying ‘no route to host’, reporting a DNS issue, eventually timing out with a specific error message, or something else?

  4. Also, as a test in case it has not been done (though it may not be feasible for you): if you disable wireless (maybe Bluetooth too, though I’m not sure that is related) and use a wired network cable, does that help?

  5. If (4) happens to help, then a simple initial test would be to boot back into the state just before things stop working (wireless on), collect some of the output below in a terminal prior to suspending, suspend and resume, wait a bit, reproduce a failed connection in the browser, then return to the terminal, press Ctrl-C to stop the output, and collect a bit more.

    $ sudo netstat -tunap &> ~/netstat.out
    $ sudo rfkill list |& tee ~/rfkill.out; sudo rfkill event | tee -a ~/rfkill.out
    0: phy0: Wireless LAN
    	Soft blocked: no
    	Hard blocked: no
    1600564002.350385: idx 0 type 1 op 0 soft 0 hard 0
    ....
    <suspend, resume, confirm the issue occurred in the browser>
    <return to this terminal and press ctrl-c>
    $ echo "======== after resume =========" | tee -a ~/rfkill.out ~/netstat.out
    $ sudo rfkill list |& tee -a ~/rfkill.out
    $ sudo netstat -tunap &>> ~/netstat.out
    $ sudo ethtool -S wlp2s0 &>> ~/ethtoolS.out
    

    (you can remove the ‘n’ flag from netstat if seeing domain names instead of IP addresses is easier)
    You/we would then be looking for any error messages, or for the interface still being blocked in the rfkill output. If the network is still working for other things this may not show much, but the netstat output about existing/past connection states might help clarify whether anything is established but not receiving data.

    This may only produce a small amount of detail, or many events, and may not be too helpful unless it shows that the interface still has not come back up or prints odd messages worth checking on. There may be more to do for wireless troubleshooting, but those were what came to mind first. It seemed there was not much in dmesg, so we could bump up verbosity elsewhere, though I’m unsure it would reveal anything more.

  6. If rfkill list shows anything blocked when you had not disabled wireless, you could try sudo rfkill unblock NUM, where NUM is the number at the front of the line for your normal wireless interface. Another option after that, if something was blocked, is systemctl restart NetworkManager (if that is the network service you use on the system). The only thought here is that restarting NM, or otherwise shaking things up, might get the graphical applications working again.
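    For example (the 0 below is just a placeholder for whatever index rfkill list shows for your wireless device):

    $ sudo rfkill list
    $ sudo rfkill unblock 0
    $ sudo systemctl restart NetworkManager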

If it were a graphical-application-only issue (and not about already-connected applications), I can only think of things like a proxy, or browsers like Chrome/Firefox checking the state of NM over D-Bus, hitting an issue, and deciding they are in offline mode; or something related to the checks a browser does before reboot/suspend to inhibit hibernation, but that is more about stopping you from hibernating than causing issues after resuming (in my experience).

The first few, like booting the oldest kernel and trying wired versus wireless, are just ways to narrow it down if there is a difference. If it were a DNS issue not caused by losing networking entirely, I’d expect it to hit the command-line tools at the same time (unless it’s a cached DNS server address that is gone/changed, or some other odd issue from the VPN changing those details). That is mostly vague guessing that may not be related, since it’s odd that commands work but graphical applications don’t.

  1. You don’t have a remote/NFS home directory, do you? I’d expect a lot of other issues if that went away (not just graphical ones; more complex applications using dconf or lots of files in home would probably lock up more obviously).

  2. You could start the Wireshark GUI without capturing, do the suspend/resume to trigger the issue, then start the capture and try to access a couple of websites: one you can remember in the browser, and a different site with curl. Then stop the trace and look for the warning colors (red/black) and messages that appeared while it was capturing during the problem window. That might be simple thanks to the easy-to-see colors, but it gets more complex to dig into without sending the capture, and I do -not- want that shared with the public internet, so don’t upload the capture itself; the messages in a screenshot or typed out should be fine if you trust your judgement. (Though sometimes it requires looking over the whole connection to come to a conclusion that way.)
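    If juggling the GUI around a suspend is awkward, tshark can do roughly the same thing from a terminal (assuming the wireshark-cli package is installed; wlp2s0 is a placeholder):

    $ sudo tshark -i wlp2s0 -Y 'tcp.analysis.flags'     # prints retransmissions, dup ACKs, resets, etc.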

As @rcxb noted, the netstat recommendation is to check the state of any previous (or recently started) connections, to see what type of problem, say, a TCP connection might be having.

There are no proxies in my environment and I’ve never configured any settings for them, so those settings will be the installation defaults.

I do often use a VPN (OpenVPN via GNOME integration; packages from EPEL), but I get no difference in behavior under the following three scenarios:

  1. VPN not used between last reboot and suspend
  2. VPN used and then disconnected before suspend
  3. VPN remains connected at time of suspend

Chrome gives ERR_TIMED_OUT when navigating to a page. Teams gives max_reload_exceeded after attempting to log in if it was not previously logged in, and gives a button to restart the application. Otherwise it displays a banner in the application with a message to the effect of “please try again when you are connected to the Internet.” Firefox doesn’t give an error, it just spins forever. Or at least, it takes longer to give an error than I care to wait for one.

I get no difference in behavior between the following network adapters:

  1. Integrated WiFi
  2. Integrated Ethernet port
  3. Ethernet port on USB type C dock
  4. Ethernet port on Thunderbolt 3 dock

Also, changing the active network adapter after resume (either disabling/enabling via NetworkManager or plugging/unplugging the cable/dock) does not seem to kick things into working.

No.

So, a packet capture did eventually give me some insight. I was able to see successful DNS lookups coming from the browsers and bidirectional TCP/TLS flows between the browser and servers, but the browser still timed out. Then I visited a known HTTP-only site and it worked fine (in Chrome at least). Since so much of the web is delivered over HTTPS these days, I had gotten the impression that the browser wasn’t working at all.

I double-checked that curl is still able to make HTTPS connections after resume. Also, my VPN still works.

So, oddly, it seems that this is some kind of cryptographic issue which only affects desktop applications and only after resuming from sleep.


Strange… that is an interesting association (TLS, and graphical versus command line) if it continues to hold up.

It randomly comes to mind that curl is using NSS while things like wget use OpenSSL, though usually that matters for the trust stores; I’m not sure whether it ends up being related here. If we could get a simpler command-line tool to fail (in what we think is the same way), it would be easier to look into.

  • Could test with wget over HTTPS when you return to the other tests, or the next time you see the issue (quick example after this list).
  • Or for client side debug (if openssl command fails at all):
    $ (echo -e "GET / HTTP/1.1\n"; sleep 4 ) | openssl s_client -servername www.ssllabs.com  -connect ssllabs.com:443
    
    (note that a 400 error from ssllabs there is OK, since it at least connected; it also produces less response output than google etc.)
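For the wget comparison in the first bullet, something as simple as this should exercise its TLS path (any HTTPS site works; ssllabs just matches the example above):

    $ wget -O /dev/null https://www.ssllabs.com/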

Other thoughts on that are things like:

  • if LD_LIBRARY_PATH/LD_PRELOAD happen to include any third-party or differently versioned libraries on the graphical side but not at the command line, leading it to load odd/different SSL libraries.

    • checking lsof for any /opt or /usr/local or other differences (example after this list).
    • a specific running process’s environment variables are in /proc/<pid>/environ if checking for LD_* vars.
    • (though if you pop open a graphical terminal to run the command, that environment may already be inherited by that bash session, unless the variable is set somewhere specific enough that it only applies to the graphical session while a login shell in the terminal gets a fresh environment; mostly rambling here).
  • Since it’s both browsers I’m not sure it will make a difference, but in case a common plugin is causing issues, one test (now, and again later when the issue recurs) is trying Firefox in safe mode.

  • I do wonder why it’s related to suspending. It’s not configured as a FIPS-enforced kernel/system, is it?


  • DNS might go through a local caching daemon (I forget whether CentOS 8 ships anything for that by default; /etc/nsswitch.conf or the network configuration would show it). I do get that the request you saw was for the domain opened in the browser. That’s not to say it was pulled from a cache, just that the query might ultimately be going out from somewhere other than the browser. (This may be less relevant now that HTTP does seem to work from the browser.)
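For the lsof / environ checks a few bullets up, something along these lines should work (pgrep -of picks the oldest matching firefox process; no sudo is needed for environ since the browser runs as your user):

    $ pid=$(pgrep -of firefox)
    $ sudo lsof -p "$pid" | grep -E '/opt/|/usr/local/'
    $ tr '\0' '\n' < "/proc/$pid/environ" | grep '^LD_'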

Regarding the above and the packet capture: I’m guessing it doesn’t appear to be showing retransmits for that TLS stream or others related to the browser? Otherwise it may be hard to tell much for an SSL connection, since it’s normally wrapped up (without interception you mostly only see the outer layers).

Another test might be to connect to a local SSL web server process (running on the laptop) and see if that works, out of curiosity. There may be too many variables (certificate/TLS types etc.), but it would be another data point on whether things work over the loopback 127.0.0.1 address.

  • If cockpit is installed, it creates its own self-signed cert for SSL:

    # dnf install cockpit
    # systemctl start cockpit
    # curl -k https://127.0.0.1:9090
    

    And if that works then try out https://127.0.0.1:9090 in the web browser.

  • OR, better: there are some quick commands for openssl ‘s_server’, though it requires creating a set of temporary certs before running it. (Actual debug output can be seen from this test too.)

    • Digging up some things I think I’ve tested with:
      # openssl req -new -newkey rsa:2048 -days 90 -nodes -x509 -keyout server-fftest.key -out server-fftest.crt -subj "/C=XX/ST=State/L=City/O=Company/CN=localhost"
      # openssl s_server -cert server-fftest.crt -key server-fftest.key -CAfile server-fftest.crt -accept 43832 -www -state
      
    • Example s_server output for a curl -k https://127.0.0.1:43832
      verify depth is 1
      Using default temp DH parameters
      ACCEPT
      SSL_accept:before SSL initialization
      SSL_accept:before SSL initialization
      SSL_accept:SSLv3/TLS read client hello
      SSL_accept:SSLv3/TLS write server hello
      SSL_accept:SSLv3/TLS write certificate
      SSL_accept:SSLv3/TLS write key exchange
      SSL_accept:SSLv3/TLS write certificate request
      SSL_accept:SSLv3/TLS write server done
      SSL_accept:SSLv3/TLS write server done
      SSL_accept:SSLv3/TLS read client certificate
      SSL_accept:SSLv3/TLS read client key exchange
      SSL_accept:SSLv3/TLS read change cipher spec
      SSL_accept:SSLv3/TLS read finished
      SSL_accept:SSLv3/TLS write change cipher spec
      SSL_accept:SSLv3/TLS write finished
      
      That could also be run on a separate host, but the -subj may need to be correct if you use a domain name for the connection (to see what the other side sees from an SSL perspective). There is also a -debug flag, but it will be very verbose. Note that output like “alert unknown ca:ssl/record/rec_layer_s3.c:1528:SSL alert number 48” can be normal if the client connects, complains the certificate is untrusted (CA unknown), and disconnects (like curl without -k).
      EDIT2: I removed the -verify 1 from s_server; it came along from me adapting an over-complicated Makefile I had handy from past testing of Node.js mutual-TLS issues.
  • There may be something else quick that is installed by default on CentOS 8 which I’m forgetting, but man s_server will at least provide debug details.

I’ll give this some more thought, and if anything more specific comes up that is worth checking or considering, I’ll let you know…

This is interesting. I did briefly have FIPS mode enabled (via the fips-mode-setup script). I disabled it shortly after because OpenVPN is not yet compatible (segfaults in FIPS mode), and I can’t drop OpenVPN yet. Perhaps fips-mode-setup did not fully clean up after itself. But the system crypto policy is set to DEFAULT and the kernel command line does have fips=0 so I’m not sure where else I need to look.

I don’t want to send you too far off in the wrong direction with that thought, as I’m not aware of a specific known issue relating FIPS and suspend (though we can check around and test; I’ve not tried suspending with it enabled before). But since this might be cipher/SSL related, the reasoning is that some libraries/applications take different actions in FIPS mode (beyond just the ciphers used). As an example, gnupg can go to extra lengths not to keep things in memory for very long, and some libraries try to un-prelink themselves on the fly (or something similar) when loaded. That’s to say, I wonder if there could be a similar relation here.

With fips=0 and the policy you report, I’d hope that would be enough, though I do remember that at least in past CentOS versions a rebuild of the initramfs was needed when first setting it up; there are modules and dracut scripts that take action. Per the notes below, I’d assume at first that those stages are fine since you added fips=0, but I will need to check on CentOS 8.

The CentOS 7 dracut modules:
  • will move aside the current fips.conf modprobe/blacklist file, modprobe each module from /etc/fipsmodules except tcrypt, then put the original fips.conf back and test by loading/unloading tcrypt. With CentOS 8 I think this changed and it uses a different interface, since /proc/crypto is available.

    • I believe the CentOS 7 dracut-fips module just mounts /boot and checks the HMAC for the kernel image (which is why it needed the extra boot= kernel parameter). And I know that on CentOS 7 it just checks for the kernel arg fips=0 (or unspecified) and skips the checks; otherwise it tries to load and verify the FIPS modules/images, and aborts immediately if that fails. If the arg isn’t there or is fips=0, it cleans out the modprobe config in the (in-memory) initramfs.

    • The dracut module in CentOS 7 does include libssl.so, libfreebl* and a few others in the initramfs. That itself may not be a concern, though.

So I would not expect you to need to rebuild the initramfs (it would just place the dracut module back in again if the packages are still installed), but it might not hurt to back up the initramfs, remove the FIPS-related packages, and rebuild. (Again, my knowledge on that is CentOS 7 based at the moment.)
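Before going further it is probably worth confirming what the CentOS 8 tooling itself thinks the FIPS state is; these should all report it as disabled if the cleanup was complete (fips-mode-setup and update-crypto-policies are the same crypto-policies tooling you already used):

    $ fips-mode-setup --check
    $ cat /proc/sys/crypto/fips_enabled
    $ update-crypto-policies --show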

  • When you set it up did you use modutil to set FIPS mode in any NSS related db? (trying to think of other places fips modes may be set)

    Maybe having something like this leftover here or at another nssdb location:

    # modutil -list -dbdir /etc/pki/nssdb | grep -i fips
      1. NSS Internal FIPS PKCS #11 Module
             slot: NSS FIPS 140-2 User Private Key Services
            token: NSS FIPS 140-2 Certificate DB
              uri: pkcs11:token=NSS%20FIPS%20140-2%20Certificate%20DB;manufacturer=Mozilla%20Foundation;serial=0000000000000000;model=NSS%203
    ### versus (example of false/unset):
    # cp -r /etc/pki/nssdb nssdb_test
    # modutil -fips false -dbdir nssdb_test/
    # modutil -list -dbdir nssdb_test/ | grep -i FIPS -c
    0
    
    ### Another more general check (example of enabled):
    # modutil -chkfips false -dbdir /etc/pki/nssdb; echo $?
    FIPS mode enabled.
    11
    # modutil -chkfips true -dbdir /etc/pki/nssdb; echo $?
    FIPS mode enabled.
    0
    

    (I’m unsure how much the CentOS 8 fips-mode-setup helper sets/unsets at the moment.)
    Possibly also within the user home directory (the Firefox profile), though you probably didn’t go out of your way to set that, and I’d guess you would remember doing so. (That would also be Firefox specific, not Chrome.)

  • Granted, as curl uses libnss and is working, I’m not sure this can be related. As for Firefox/Chrome, I believe they use NSS as well, though they might behave differently on the application side than a simpler curl (and wget will be different from curl again).

CentOS 8/RHEL 8 made things a little more convenient with update-crypto-policies, though I’m not familiar with every place it changes things at the moment, so I’m unsure whether it covers things that might affect Firefox and Chrome (the nssdb) or just the OpenSSL configuration etc. I wish I had a spare laptop at the moment to test some of this out; I might be able to get it set up as a VM just to see.


  • It may be worth checking on the nssdb with the -list and -chkfips commands above, in case you took any separate actions there when you initially set up FIPS enforcement.

    Otherwise, at the moment I really don’t want to suggest anything like backing up and reinstalling (or testing another similar system) while skipping the FIPS setup just to see if it happens to work, since I can’t be sure it is FIPS; it’s just a thought, given that the problem seems crypto specific (and also seems limited to Chrome/Firefox or possibly other graphical applications).

  • It may be worth seeing where in the SSL connection it fails by testing with openssl s_server locally, and with a connection to another host on your network running s_server (or just to see whether that happens to work while the issue is occurring; it will report some OpenSSL details in the browser if it does).

  • Maybe another random test is to log in as a different user over SSH to the laptop while the issue is occurring, and use X11 forwarding to run Firefox and confirm whether it can connect to anything (quick sketch below). That way it is definitely a new instance, avoids profile settings, and avoids a full graphical session (though it may load dbus or require dbus-launch). I suppose logging in locally as a different user without a reboot could be an option too. None of that affects global settings like the system nssdb state, but it would at least rule out the specific user’s configuration in home (assuming no related global Firefox/Chrome settings are in place).
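    A minimal sketch of that test, assuming sshd is running locally and using a throwaway “tester” account name purely as an example:

    $ sudo useradd tester && sudo passwd tester
    $ ssh -X tester@localhost
    $ firefox https://example.com    # run inside the SSH session: fresh profile, tunnelled over X11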

I’ll check out the FIPS configuration tools on CentOS 8 when I have a chance, since I’ll need to be familiar with them myself eventually anyway. I’m not sure whether I’ll be able to test sleeping to reproduce the issue, but if I get a chance to try that I will let you know the results.


The value of LD_LIBRARY_PATH for Firefox is /usr/lib64/firefox:/usr/lib64/firefox/plugins:/usr/lib64/firefox. LD_PRELOAD is not set.

I don’t think any of these libraries are relevant to the problem.

$ find /usr/lib64/firefox/ -name '*.so*'
/usr/lib64/firefox/gtk2/libmozgtk.so
/usr/lib64/firefox/gmp-clearkey/0.1/libclearkey.so
/usr/lib64/firefox/liblgpllibs.so
/usr/lib64/firefox/libmozavcodec.so
/usr/lib64/firefox/libmozavutil.so
/usr/lib64/firefox/libmozgtk.so
/usr/lib64/firefox/libmozsandbox.so
/usr/lib64/firefox/libmozsqlite3.so
/usr/lib64/firefox/libmozwayland.so
/usr/lib64/firefox/libxul.so

/proc/<pid>/environ is an empty file for Chrome processes.

Chrome libraries in /opt are mostly symlinks to system libraries.

$ find /opt/google/chrome/ -name '*.so*' -exec ls -l {} +
-rwxr-xr-x. 1 root root  237152 Sep 18 20:48 /opt/google/chrome/libEGL.so
-rwxr-xr-x. 1 root root 9265904 Sep 18 20:48 /opt/google/chrome/libGLESv2.so
lrwxrwxrwx. 1 root root      18 Sep 23 20:09 /opt/google/chrome/libnspr4.so.0d -> /lib64/libnspr4.so
lrwxrwxrwx. 1 root root      17 Sep 23 20:09 /opt/google/chrome/libnss3.so.1d -> /lib64/libnss3.so
lrwxrwxrwx. 1 root root      21 Sep 23 20:09 /opt/google/chrome/libnssutil3.so.1d -> /lib64/libnssutil3.so
lrwxrwxrwx. 1 root root      17 Sep 23 20:09 /opt/google/chrome/libplc4.so.0d -> /lib64/libplc4.so
lrwxrwxrwx. 1 root root      18 Sep 23 20:09 /opt/google/chrome/libplds4.so.0d -> /lib64/libplds4.so
lrwxrwxrwx. 1 root root      19 Sep 23 20:09 /opt/google/chrome/libsmime3.so.1d -> /lib64/libsmime3.so
lrwxrwxrwx. 1 root root      17 Sep 23 20:09 /opt/google/chrome/libssl3.so.1d -> /lib64/libssl3.so
-rwxr-xr-x. 1 root root  261760 Sep 18 20:48 /opt/google/chrome/swiftshader/libEGL.so
-rwxr-xr-x. 1 root root 2616536 Sep 18 20:48 /opt/google/chrome/swiftshader/libGLESv2.so
-rwxr-xr-x. 1 root root 9334980 Aug 26 22:19 /opt/google/chrome/WidevineCdm/_platform_specific/linux_x64/libwidevinecdm.so

There are no libraries in /usr/local.

$ find /usr/local/ -type f
/usr/local/share/applications/mimeinfo.cache

CentOS 8 does include systemd-resolved, but /etc/resolv.conf is managed by NetworkManager and is not configured to use the stub resolver. Also, the default configuration in /etc/nsswitch.conf does not include the resolve service:

hosts:      files dns myhostname

I believe this means that resolved will only be used over its D-Bus interface. TBH I don’t know a general way to determine which mechanism an application is using.
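For reference, these are the sorts of quick checks that show which resolver path is in play (the paths are the CentOS 8 defaults):

$ readlink -f /etc/resolv.conf       # a symlink into /run/systemd/resolve/ would mean the stub is in use
$ grep '^hosts:' /etc/nsswitch.conf
$ systemctl is-active systemd-resolved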

Not that I recall.

# modutil -chkfips false -dbdir /etc/pki/nssdb; echo $?
FIPS mode disabled.
0
$ modutil -chkfips false -dbdir ~/.pki/nssdb; echo $?
FIPS mode disabled.
0

I’m not aware of any other such DBs on this system.

I will update with more information as time allows.

@spiders I think I discovered the problem. This seems to be related to PKCS#11 support.

I noticed similar behavior in the web browser without suspending, and it turned out that I had a window in the background requesting a client certificate. The browser was stalling all HTTPS requests while prompting for the PIN of the YubiKey Nano that I leave plugged into the laptop. Dismissing the prompt resolved the issue.

When the problem occurs after resuming from sleep, there is no visible PIN prompt on any window, but removing and reinserting the Yubikey resolves the issue.
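In case it helps anyone who hits something similar, these show which PKCS#11 modules are registered and whether the smart card daemon is in the picture (p11-kit is the CentOS 8 default; pcscd is only present if pcsc-lite is installed; ~/.pki/nssdb is the per-user NSS DB Chrome uses):

$ p11-kit list-modules
$ systemctl status pcscd
$ modutil -list -dbdir ~/.pki/nssdb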