KubeADM for Kubernetes: Chicken and egg problem during setup, what am I doing wrong?

virtualization

#1

Documentation I’m following:

It’s not working as expected. I’m doing the normal install of a single master, and kubeadm init keeps failing. The specific error is “timed out waiting for the condition”, and the output says this is usually because the kubelet isn’t running or is unhealthy.

Checking the kubelet shows it’s Loaded in systemd, with a bunch of “connection refused” lines in the journal. When I google this, it says that’s normal: the kubelet keeps trying (and failing) to reach the API server until the control plane is up, so it flaps between Loaded and Active in systemd every 8 seconds until kubeadm init is run to bring the control plane online.

So then I see a pod network needs to be configured for that, and I go down to the network section and follow those instructions, but those instructions require kubeadm init to have been run first, so they just fail with a connection refused error too.

TL;DR: Chicken-and-egg problem. To initialize the cluster, I seem to need the network; to configure the network, I need to initialize the cluster.

What do I even do?
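(For anyone landing here later: the order kubeadm actually expects is init first, then the network add-on. The control plane runs on the host network, so it doesn’t need the pod network to come up. A minimal sketch of the intended sequence, reusing the CIDR from the command below; the CNI manifest path depends on which plugin you pick:

```shell
# 1. Bring up the control plane first; it does not depend on the pod network.
sudo kubeadm init --pod-network-cidr=192.168.0.0/16

# 2. Point kubectl at the new cluster's admin credentials.
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# 3. Only now apply the pod network manifest for your chosen CNI plugin
#    (placeholder path; substitute the manifest from your CNI's docs).
kubectl apply -f your-cni-manifest.yaml
```

So the “connection refused” from the network step isn’t the chicken-and-egg loop it looks like; it just means step 1 never finished.)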

Specific errors:

[[email protected] swarmadmin]# kubeadm init --pod-network-cidr=192.168.0.0/16
I0709 08:56:35.129428   22236 feature_gate.go:230] feature gates: &{map[]}
[init] using Kubernetes version: v1.11.0
[preflight] running pre-flight checks
        [WARNING Firewalld]: firewalld is active, please ensure ports [6443 10250] are open or your cluster may not function correctly
I0709 08:56:35.153974   22236 kernel_validator.go:81] Validating kernel version
I0709 08:56:35.154040   22236 kernel_validator.go:96] Validating kernel config
[preflight/images] Pulling images required for setting up a Kubernetes cluster
[preflight/images] This might take a minute or two, depending on the speed of your internet connection
[preflight/images] You can also perform this action in beforehand using 'kubeadm config images pull'
[kubelet] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[preflight] Activating the kubelet service
[certificates] Generated ca certificate and key.
[certificates] Generated apiserver certificate and key.
[certificates] apiserver serving cert is signed for DNS names [node1.domain.com kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 10.0.204.73]
[certificates] Generated apiserver-kubelet-client certificate and key.
[certificates] Generated sa key and public key.
[certificates] Generated front-proxy-ca certificate and key.
[certificates] Generated front-proxy-client certificate and key.
[certificates] Generated etcd/ca certificate and key.
[certificates] Generated etcd/server certificate and key.
[certificates] etcd/server serving cert is signed for DNS names [node1.domain.com localhost] and IPs [127.0.0.1 ::1]
[certificates] Generated etcd/peer certificate and key.
[certificates] etcd/peer serving cert is signed for DNS names [node1.domain.com localhost] and IPs [10.10.204.73 127.0.0.1 ::1]
[certificates] Generated etcd/healthcheck-client certificate and key.
[certificates] Generated apiserver-etcd-client certificate and key.
[certificates] valid certificates and keys now exist in "/etc/kubernetes/pki"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/admin.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/kubelet.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/controller-manager.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/scheduler.conf"
[controlplane] wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/manifests/kube-apiserver.yaml"
[controlplane] wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
[controlplane] wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/manifests/etcd.yaml"
[init] waiting for the kubelet to boot up the control plane as Static Pods from directory "/etc/kubernetes/manifests"
[init] this might take a minute or longer if the control plane images have to be pulled

                Unfortunately, an error has occurred:
                        timed out waiting for the condition

                This error is likely caused by:
                        - The kubelet is not running
                        - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
                        - No internet connection is available so the kubelet cannot pull or find the following control plane images:
                                - k8s.gcr.io/kube-apiserver-amd64:v1.11.0
                                - k8s.gcr.io/kube-controller-manager-amd64:v1.11.0
                                - k8s.gcr.io/kube-scheduler-amd64:v1.11.0
                                - k8s.gcr.io/etcd-amd64:3.2.18
                                - You can check or miligate this in beforehand with "kubeadm config images pull" to make sure the images
                                  are downloaded locally and cached.

                If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
                        - 'systemctl status kubelet'
                        - 'journalctl -xeu kubelet'

                Additionally, a control plane component may have crashed or exited when started by the container runtime.
                To troubleshoot, list all containers using your preferred container runtimes CLI, e.g. docker.
                Here is one example how you may list all Kubernetes containers running in docker:
                        - 'docker ps -a | grep kube | grep -v pause'
                        Once you have found the failing container, you can inspect its logs with:
                        - 'docker logs CONTAINERID'
couldn't initialize a Kubernetes cluster
[[email protected] swarmadmin]# systemctl status kubelet -l | less
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Mon 2018-07-09 08:56:46 CDT; 9min ago
     Docs: http://kubernetes.io/docs/
 Main PID: 22358 (kubelet)
   CGroup: /system.slice/kubelet.service
           └─22358 /usr/bin/kubelet --cgroup-driver=systemd --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=systemd --cni-bin-dir=/opt/cni/bin --cni-conf-dir=/etc/cni/net.d --network-plugin=cni

Jul 09 09:06:37 node1.domain.com kubelet[22358]: E0709 09:06:37.445983   22358 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://10.0.204.73:6443/api/v1/nodes?fieldSelector=metadata.name%3Dnode1.domain.com&limit=500&resourceVersion=0: dial tcp 10.0.204.73:6443: connect: connection refused
Jul 09 09:06:37 node1.domain.com kubelet[22358]: E0709 09:06:37.447072   22358 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to list *v1.Service: Get https://10.0.204.73:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.0.204.73:6443: connect: connection refused
Jul 09 09:06:37 node1.domain.com kubelet[22358]: I0709 09:06:37.482451   22358 kuberuntime_manager.go:513] Container {Name:kube-apiserver Image:k8s.gcr.io/kube-apiserver-amd64:v1.11.0 Command:[kube-apiserver --authorization-mode=Node,RBAC --advertise-address=10.0.204.73 --allow-privileged=true --client-ca-file=/etc/kubernetes/pki/ca.crt --disable-admission-plugins=PersistentVolumeLabel --enable-admission-plugins=NodeRestriction --enable-bootstrap-token-auth=true --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key --etcd-servers=https://127.0.0.1:2379 --insecure-port=0 --kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt --kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt --proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key --requestheader-allowed-names=front-proxy-client --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6443 --service-account-key-file=/etc/kubernetes/pki/sa.pub --service-cluster-ip-range=10.96.0.0/12 --tls-cert-file=/etc/kubernetes/pki/apiserver.crt --tls-private-key-file=/etc/kubernetes/pki/apiserver.key] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[] Resources:{Limits:map[] Requests:map[cpu:{i:{value:250 scale:-3} d:{Dec:<nil>} s:250m Format:DecimalSI}]} VolumeMounts:[{Name:ca-certs ReadOnly:true MountPath:/etc/ssl/certs SubPath: MountPropagation:<nil>} {Name:etc-pki ReadOnly:true MountPath:/etc/pki SubPath: MountPropagation:<nil>} {Name:k8s-certs ReadOnly:true MountPath:/etc/kubernetes/pki SubPath: MountPropagation:<nil>}] VolumeDevices:[] 
LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/healthz,Port:6443,Host:10.0.204.73,Scheme:HTTPS,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:15,TimeoutSeconds:15,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:8,} ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Jul 09 09:06:37 node1.domain.com kubelet[22358]: E0709 09:06:37.482542   22358 dns.go:131] Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.0.204.10 207.172.3.8 207.172.3.9
Jul 09 09:06:37 node1.domain.com kubelet[22358]: I0709 09:06:37.482577   22358 kuberuntime_manager.go:757] checking backoff for container "kube-apiserver" in pod "kube-apiserver-node1.domain.com_kube-system(9c7fb87cd378884611587ccc63c45e01)"
Jul 09 09:06:37 node1.domain.com kubelet[22358]: I0709 09:06:37.482668   22358 kuberuntime_manager.go:767] Back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-node1.domain.com_kube-system(9c7fb87cd378884611587ccc63c45e01)
Jul 09 09:06:37 node1.domain.com kubelet[22358]: E0709 09:06:37.482695   22358 pod_workers.go:186] Error syncing pod 9c7fb87cd378884611587ccc63c45e01 ("kube-apiserver-node1.domain.com_kube-system(9c7fb87cd378884611587ccc63c45e01)"), skipping: failed to "StartContainer" for "kube-apiserver" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-node1.domain.com_kube-system(9c7fb87cd378884611587ccc63c45e01)"
Jul 09 09:06:38 node1.domain.com kubelet[22358]: E0709 09:06:38.445557   22358 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://10.0.204.73:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dnode1.domain.com&limit=500&resourceVersion=0: dial tcp 10.0.204.73:6443: connect: connection refused
[[email protected] swarmadmin]# docker ps -a | grep kube | grep -v pause
3cb3071a5054        214c48e87f58           "kube-apiserver --..."   13 seconds ago      Exited (1) 12 seconds ago                       k8s_kube-apiserver_kube-apiserver-node1.cdmmedia.com_kube-system_9c7fb87cd378884611587ccc63c45e01_9
e27276793c93        b8df3b177be2           "etcd --advertise-..."   20 seconds ago      Exited (1) 20 seconds ago                       k8s_etcd_etcd-node1.cdmmedia.com_kube-system_23ad1ff3e78edc7e6836c83093a2a8ac_9
6096a71a935a        55b70b420785           "kube-controller-m..."   21 minutes ago      Up 21 minutes                                   k8s_kube-controller-manager_kube-controller-manager-node1.cdmmedia.com_kube-system_c4d676022d756bd9b2043cb9aeddbdab_0
c58d7d40948e        0e4a34a3b0e6           "kube-scheduler --..."   21 minutes ago      Up 21 minutes                                   k8s_kube-scheduler_kube-scheduler-node1.cdmmedia.com_kube-system_31eabaff7d89a40d8f7e05dfc971cdbd_0
[[email protected] swarmadmin]# docker logs 3cb3071a5054 # The API Server Container
Flag --insecure-port has been deprecated, This flag will be removed in a future version.
I0709 14:18:22.723150       1 server.go:703] external host was not specified, using 10.0.204.73
I0709 14:18:22.723457       1 server.go:145] Version: v1.11.0
Error: unable to load server certificate: open /etc/kubernetes/pki/apiserver.crt: permission denied
Usage:
  kube-apiserver [flags]
Flags:

... etc etc listing all the flags from the help/manual

format: resource[.group]#size, where resource is lowercase plural (no version), group is optional, and size is a number. It takes effect when watch-cache is enabled. Some resources (replicationcontrollers, endpoints, nodes, pods, services, apiservices.apiregistration.k8s.io) have system defaults set by heuristics, others default to default-watch-cache-size
error: unable to load server certificate: open /etc/kubernetes/pki/apiserver.crt: permission denied

This all points to the API server failing to start, hence all the connection-refused errors.

But if it’s not starting due to a permission error, and I’m running everything as root, then … why?
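(When root gets “permission denied” on a file whose Unix permissions look fine, a mandatory access control layer like SELinux is the usual suspect on CentOS/RHEL. A quick first check:

```shell
# DAC says root can read the file, so check mandatory access control instead.
getenforce                                  # prints Enforcing, Permissive, or Disabled
ls -Z /etc/kubernetes/pki/apiserver.crt     # show the file's SELinux security context
```

If getenforce says Enforcing, the denial is likely coming from SELinux policy, not from the permission bits.)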


#2

I tried running the kubeadm init command and checking the permissions on the file it can’t access.

[[email protected] pki]# ls -al /etc/kubernetes/pki/
total 60
drwxr-xr-x. 3 root root 4096 Jul  9 13:21 .
drwxr-x--x. 4 root root  125 Jul  9 13:21 ..
-rw-r--r--. 1 root root 1237 Jul  9 13:20 apiserver.crt

IDK what else it needs. :thinking: Tried the nuke:

[[email protected] pki]# chmod 777 /etc/kubernetes/pki/apiserver.crt /etc/kubernetes/pki/ /etc/kubernetes/
[[email protected] pki]# ls -al /etc/kubernetes/pki/
total 60
drwxrwxrwx. 3 root root 4096 Jul  9 13:21 .
drwxrwxrwx. 4 root root  125 Jul  9 13:21 ..
-rwxrwxrwx. 1 root root 1237 Jul  9 13:20 apiserver.crt

Didn’t work. Still permission denied.
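(That chmod having no effect is itself a clue: chmod only changes the DAC permission bits, and SELinux denials ignore those entirely. Assuming auditd is running, the denial should show up in the audit log:

```shell
# SELinux AVC denials are recorded by the audit subsystem, not in file modes.
sudo ausearch -m avc -ts recent             # recent AVC denial records
# Or grep the raw audit log directly:
sudo grep -i avc /var/log/audit/audit.log | tail
```

A denial record naming the apiserver container and the pki files would confirm it.)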


#3

Turns out it was SELinux. This is what was needed:

setenforce 0

doh
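(For what it’s worth, the kubeadm install docs for CentOS/RHEL at the time recommended exactly this, plus making it persist across reboots; a sketch, assuming the standard /etc/selinux/config layout:

```shell
# Put SELinux in permissive mode for the current boot...
sudo setenforce 0
# ...and persist the setting so the kubelet survives a reboot.
sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config
```

Permissive mode still logs would-be denials, so it leaves a trail for writing a proper policy later.)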


#4

You should figure out what selinux policy is not happy and fix it, rather than disabling selinux altogether.
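A sketch of that targeted approach, assuming the policycoreutils tools (ausearch, audit2allow, semodule) are installed and the failure has been reproduced while enforcing:

```shell
# Re-enable enforcement, reproduce the failure, then build a local policy
# module from the logged denials instead of disabling SELinux globally.
sudo setenforce 1
sudo ausearch -m avc -ts recent | audit2allow -M k8s-local
sudo semodule -i k8s-local.pp
```

The generated k8s-local.te file is human-readable, so the allowed rules can be reviewed before the module is loaded.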


#5

Yeah, I’ve opted to forsake Kubernetes.

It is the opposite of user-friendly. I’m fine with using it in the cloud, where other people maintain the actual back-end, but I have other things to do, and Docker Swarm is infinitely easier to deploy and has better documentation.

Thanks though. That’s good advice.


#6

Really, that’s the best option. Docker and kubernetes are crutches.


#7

Then what would you use to manage multiple apps in containers?


#8

Nix or Ansible with LXC.

Relying on a single image from Docker Hub is a major security hole.


#9

What if you use your own repo/images?


#10

What base image are you using?


#11

Alpine usually. Or if not Alpine, one based on Alpine. And if not one based on Alpine, one based on CentOS or Debian.

Basically, anything sourced from the “official” Docker Hub library, and I only use images from there.

Usually I read through the Dockerfile for an image before using it as a base and follow it back to scratch. Most follow a chain like this:

Alpine:

FROM scratch
ADD rootfs.tar.xz /
CMD ["/bin/sh"]

Apache-PHP-Alpine:

FROM alpine:3.7
<env variables>
<runs commands to install and setup the named programs>

etc.

So following a chain, I can see where everything comes into whatever base image I would use.

Same thing with CentOS.

AFAIK, it’s only insecure if you haven’t bothered to do that and verify everything that comes in with an image.

Specifically, spinning up each source image to see what’s actually in it. For example, rootfs.tar.xz could technically contain anything, but I can find out what it holds by running a container from alpine:3.7 and looking at what’s inside and what it’s doing.

It’s only really insecure if you do something careless like alpine:latest, because latest could end up being anything. 3.7 is generally 3.7 (even though it’s technically possible to re-point tags).
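Since tags are mutable, one way to guard against a re-pointed tag is to pin by digest, which is content-addressed and immutable; a sketch (the digest shown by the command is specific to your pull, so the FROM line below uses a placeholder):

```shell
# Resolve the tag to its immutable content digest.
docker pull alpine:3.7
docker images --digests alpine      # shows a sha256:... digest per image

# Then reference the digest instead of the tag in a Dockerfile:
#   FROM alpine@sha256:<digest-from-above>
```

A digest reference keeps building the exact same bytes even if someone later re-points the 3.7 tag.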

If I need something that only exists in the general public repos and not the “official” ones, I read their Dockerfile and build my own custom image from their commands and source image, tweaking as I go. Then I push that to my personal repo and move on with whatever I’m doing.
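That rebuild-and-push workflow is just standard docker build/tag/push; a sketch, with registry.example.com and the image name standing in as placeholders for the personal registry:

```shell
# Rebuild the reviewed Dockerfile as your own image...
docker build -t registry.example.com/myname/sometool:1.0 .
# ...authenticate to the registry you control, and push it there.
docker login registry.example.com
docker push registry.example.com/myname/sometool:1.0
```

From then on, deployments pull from the personal registry rather than trusting the upstream publisher to never change the image.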


#12

That should be acceptable then.

If you build from scratch and don’t rely on existing files/binaries/code, you should be safe.