Saving top nodes
Check out the metrics for all nodes across all clusters:
kubectl top node --context cluster1 --no-headers | sort -nr -k2 | head -1
cluster1-controlplane   127m   1%   703Mi   1%

kubectl top node --context cluster2 --no-headers | sort -nr -k2 | head -1
cluster2-controlplane   126m   1%   675Mi   1%

kubectl top node --context cluster3 --no-headers | sort -nr -k2 | head -1
cluster3-controlplane   577m   7%   1081Mi   1%

kubectl top node --context cluster4 --no-headers | sort -nr -k2 | head -1
cluster4-controlplane   130m   1%   679Mi   1%
Using this, find the node that uses the most CPU. In this case, it is cluster3-controlplane on cluster3, so save it to the output file:
echo cluster3,cluster3-controlplane > /opt/high_cpu_node
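If you prefer not to compare the four outputs by eye, a small loop like this sketch (assuming the contexts are named cluster1 through cluster4 and metrics-server is available in each cluster) prints the top CPU node per cluster:

for ctx in cluster1 cluster2 cluster3 cluster4; do
  # print the context followed by its highest-CPU node and that node's CPU usage
  echo -n "$ctx: "
  kubectl top node --context "$ctx" --no-headers | sort -nr -k2 | head -1 | awk '{print $1, $2}'
done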
Install etcdctl
Install the etcd utility:
cd /tmp
export RELEASE=$(curl -s https://api.github.com/repos/etcd-io/etcd/releases/latest | grep tag_name | cut -d '"' -f 4)
wget https://github.com/etcd-io/etcd/releases/download/${RELEASE}/etcd-${RELEASE}-linux-amd64.tar.gz
tar xvf etcd-${RELEASE}-linux-amd64.tar.gz
cd etcd-${RELEASE}-linux-amd64
mv etcd etcdctl /usr/local/bin/
Network policy to make a pod accessible from a specific pod in a specific namespace
ingress:
- from:
  - namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: default
    podSelector:
      matchLabels:
        app: cyan-white-cka28-trb
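For context, this ingress block sits inside a full NetworkPolicy object. A minimal sketch is shown below; the policy name, target namespace, and the label selecting the destination pod are illustrative assumptions, only the ingress rule above comes from the original:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-cyan-white      # hypothetical name
  namespace: default               # hypothetical namespace of the destination pod
spec:
  podSelector:
    matchLabels:
      app: destination-pod         # hypothetical label of the pod being protected
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: default
      podSelector:
        matchLabels:
          app: cyan-white-cka28-trb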
Deployment scaling out
kubectl get deployment
We can see the DESIRED count for pink-depl-cka14-trb is 2 but the CURRENT count is still 1. As we know, the Kube Controller Manager is responsible for monitoring the status of replica sets/deployments and ensuring that the desired number of PODs is available, so let's check if it's running fine:
kubectl get pod -n kube-system
So kube-controller-manager-cluster4-controlplane is crashing. Let's check the events to figure out what's happening:

student-node ~ ✖ kubectl get event --field-selector involvedObject.name=kube-controller-manager-cluster4-controlplane -n kube-system
LAST SEEN   TYPE      REASON         OBJECT                                               MESSAGE
10m         Warning   NodeNotReady   pod/kube-controller-manager-cluster4-controlplane   Node is not ready
3m25s       Normal    Killing        pod/kube-controller-manager-cluster4-controlplane   Stopping container kube-controller-manager
2m18s       Normal    Pulled         pod/kube-controller-manager-cluster4-controlplane   Container image "k8s.gcr.io/kube-controller-manager:v1.24.0" already present on machine
2m18s       Normal    Created        pod/kube-controller-manager-cluster4-controlplane   Created container kube-controller-manager
2m18s       Warning   Failed         pod/kube-controller-manager-cluster4-controlplane   Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "kube-controller-manage": executable file not found in $PATH: unknown
108s        Warning   BackOff        pod/kube-controller-manager-cluster4-controlplane   Back-off restarting failed container
You will see some errors as below
Warning   Failed   pod/kube-controller-manager-cluster4-controlplane   Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "kube-controller-manage": executable file not found in $PATH: unknown
Seems like it's trying to run the kube-controller-manage command, but it is supposed to run the kube-controller-manager command. So let's look into the kube-controller-manager manifest, which is present under /etc/kubernetes/manifests/kube-controller-manager.yaml on the cluster4-controlplane node. Let's SSH into cluster4-controlplane:
ssh cluster4-controlplane
vi /etc/kubernetes/manifests/kube-controller-manager.yaml
- Under containers: -> command: change kube-controller-manage to kube-controller-manager (a sketch of the corrected stanza is shown below) and restart the kube-controller-manager-cluster4-controlplane POD
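For reference, after the edit the top of the container spec in the manifest should look roughly like this; only the first command entry is the point here, the flag shown is illustrative since the full flag list varies per cluster:

spec:
  containers:
  - command:
    - kube-controller-manager        # was "kube-controller-manage"
    - --bind-address=127.0.0.1       # illustrative flag; keep the existing flags unchanged
    image: k8s.gcr.io/kube-controller-manager:v1.24.0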
kubectl delete pod kube-controller-manager-cluster4-controlplane -n kube-system
Now check the deployment again:
kubectl get deployment
The CURRENT count should now be equal to the DESIRED count for pink-depl-cka14-trb.
PV & PVC with Storage Class
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: coconut-pv-cka01-str
  labels:
    storage-tier: gold
spec:
  capacity:
    storage: 100Mi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /opt/coconut-stc-cka01-str
  storageClassName: coconut-stc-cka01-str
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - cluster1-node01
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: coconut-pvc-cka01-str
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Mi
  storageClassName: coconut-stc-cka01-str
  selector:
    matchLabels:
      storage-tier: gold
Troubleshooting: service not reachable from a pod
- Test if the service curlme-cka01-svcn is accessible from the pod curlpod-cka01-svcn or not:
kubectl exec curlpod-cka01-svcn -- curl curlme-cka01-svcn
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- 0:00:10 --:--:--     0
- We did not get any response. Check if the service is properly configured or not.
kubectl describe svc curlme-cka01-svcn
....
Name:              curlme-cka01-svcn
Namespace:         default
Labels:            <none>
Annotations:       <none>
Selector:          run=curlme-ckaO1-svcn
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.109.45.180
IPs:               10.109.45.180
Port:              <unset>  80/TCP
TargetPort:        80/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>
- The service has no endpoints configured. Since we are allowed to delete the resource, let's delete the service and create it again.
- To delete the service, use the command kubectl delete svc curlme-cka01-svcn.
- You can create the service the imperative way or the declarative way (a declarative sketch follows the imperative command below).
Using the imperative command:
kubectl expose pod curlme-cka01-svcn --port=80
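Alternatively, a declarative manifest along these lines should work; it assumes the pod still carries the default run=curlme-cka01-svcn label that kubectl run applies, so verify the pod's labels before applying:

apiVersion: v1
kind: Service
metadata:
  name: curlme-cka01-svcn
spec:
  type: ClusterIP
  selector:
    run: curlme-cka01-svcn   # must match the pod's actual labels
  ports:
  - port: 80
    targetPort: 80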
Possible reason for pod restarting due to liveness check fail
Notice the command sleep 3; touch /healthcheck; sleep 30;sleep 30000: it creates the /healthcheck file only after a delay of 3 seconds, but the liveness probe initialDelaySeconds is set to 1 and failureThreshold is also 1. Which means the POD will fail right after the first liveness check, which happens just 1 second after pod start. So to make it stable we must increase initialDelaySeconds to at least 5. A sketch of the adjusted probe is shown below.
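For illustration, the probe could end up looking roughly like this; the exec command checking /healthcheck is an assumption based on the container command above, and only the initialDelaySeconds value is the point:

livenessProbe:
  exec:
    command:
    - cat
    - /healthcheck          # assumed check; the probe only needs the file created by the container
  initialDelaySeconds: 5    # was 1; must exceed the 3-second delay before /healthcheck exists
  failureThreshold: 1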
Out of Memory error causing pod restarts
kubectl logs -f green-deployment-cka15-trb-xxxx
You will see some logs like these
2022-09-18 17:13:25 98 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2022-09-18 17:13:25 98 [Note] InnoDB: Memory barrier is not used
2022-09-18 17:13:25 98 [Note] InnoDB: Compressed tables use zlib 1.2.11
2022-09-18 17:13:25 98 [Note] InnoDB: Using Linux native AIO
2022-09-18 17:13:25 98 [Note] InnoDB: Using CPU crc32 instructions
2022-09-18 17:13:25 98 [Note] InnoDB: Initializing buffer pool, size = 128.0M
Killed
This might be due to a resources issue, especially memory, so let's try to recreate the POD to see if that helps.
kubectl delete pod green-deployment-cka15-trb-xxxx
Now watch closely the POD status
kubectl get pod
Pretty soon you will see the POD status change to OOMKilled, which confirms it's a memory issue. So let's look into the resources assigned to this deployment:
kubectl get deploy
kubectl edit deploy green-deployment-cka15-trb
- Under resources: -> limits: change memory from 256Mi to 512Mi (see the sketch below) and save the changes.
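A minimal sketch of the adjusted container resources, assuming only the memory limit changes and the other values stay as they were:

resources:
  limits:
    memory: 512Mi      # was 256Mi
  requests:
    memory: 256Mi      # illustrative; keep whatever the deployment already requests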
Now watch closely the POD status again
kubectl get pod
It should be stable now.
Troubleshooting cluster
ssh cluster4-controlplane
Let's take an etcd backup:
ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key snapshot save /opt/etcd-boot-cka18-trb.db
It might get stuck forever, so let's see why that would happen. Try to list the PODs first:
kubectl get pod -A
There might be an error like:
The connection to the server cluster4-controlplane:6443 was refused - did you specify the right host or port?
There seems to be some issue with the cluster so let's look into the logs
journalctl -u kubelet -f
You will see a lot of connect: connection refused errors, but that must be because the different cluster components are not able to connect to the API server, so filter out these logs to look more closely:
journalctl -u kubelet -f | grep -v 'connect: connection refused'
You should see some errors as below
cluster4-controlplane kubelet[2240]: E0923 04:38:15.630925 2240 file.go:187] "Could not process manifest file" err="invalid pod: [spec.containers[0].volumeMounts[1].name: Not found: \"etcd-cert\"]" path="/etc/kubernetes/manifests/etcd.yaml"
So it seems like there is some incorrect volume which etcd is trying to mount. Let's look into the etcd manifest:
vi /etc/kubernetes/manifests/etcd.yaml
Search for etcd-cert; you will notice that the volume name is etcd-certs, but the volume mount is trying to mount an etcd-cert volume, which is incorrect. Fix the volume mount name (a sketch of the matching entries is shown below) and save the changes. Let's restart the kubelet service after that:
systemctl restart kubelet
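For reference, the volumeMount name must match the volume name; after the fix the relevant parts of /etc/kubernetes/manifests/etcd.yaml should look roughly like this (the mount path shown is the usual kubeadm default and is an assumption here):

    volumeMounts:
    - mountPath: /etc/kubernetes/pki/etcd    # usual kubeadm default path
      name: etcd-certs                       # was etcd-cert; must match the volume name below
  volumes:
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs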
Wait a few minutes to see if it's good now.
kubectl get pod -A
You should be able to list the PODs now, so let's try to take the etcd backup again:
ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key snapshot save /opt/etcd-boot-cka18-trb.db
It should work now.
Rearranging kubectl outputs in columns
The easiest way to route traffic to a specific pod is by the use of labels and selectors. List the pods along with their labels:

student-node ~ ➜ kubectl get pods --show-labels -n spectra-1267
NAME     READY   STATUS    RESTARTS   AGE     LABELS
pod-12   1/1     Running   0          5m21s   env=dev,mode=standard,type=external
pod-34   1/1     Running   0          5m20s   env=dev,mode=standard,type=internal
pod-43   1/1     Running   0          5m20s   env=prod,mode=exam,type=internal
pod-23   1/1     Running   0          5m21s   env=dev,mode=exam,type=external
pod-32   1/1     Running   0          5m20s   env=prod,mode=standard,type=internal
pod-21   1/1     Running   0          5m20s   env=prod,mode=exam,type=external
Looks like there are a lot of pods created to confuse us, but we are only concerned with the labels of pod-23 and pod-21. As we can see, both of the required pods have the labels mode=exam,type=external in common. Let's confirm that using kubectl too:

student-node ~ ➜ kubectl get pod -l mode=exam,type=external -n spectra-1267
NAME     READY   STATUS    RESTARTS   AGE
pod-23   1/1     Running   0          9m18s
pod-21   1/1     Running   0          9m17s
Nice!! Now as we have figured out the labels, we can proceed further with the creation of the service:
student-node ~ ➜ kubectl create service clusterip service-3421-svcn -n spectra-1267 --tcp=8080:80 --dry-run=client -o yaml > service-3421-svcn.yaml
Now modify the service definition with selectors as required before applying to k8s cluster:
student-node ~ ➜ cat service-3421-svcn.yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    app: service-3421-svcn
  name: service-3421-svcn
  namespace: spectra-1267
spec:
  ports:
  - name: 8080-80
    port: 8080
    protocol: TCP
    targetPort: 80
  selector:
    app: service-3421-svcn   # delete
    mode: exam               # add
    type: external           # add
  type: ClusterIP
status:
  loadBalancer: {}
Finally let's apply the service definition:
student-node ~ ➜ kubectl apply -f service-3421-svcn.yaml
service/service-3421 created

student-node ~ ➜ k get ep service-3421-svcn -n spectra-1267
NAME           ENDPOINTS                     AGE
service-3421   10.42.0.15:80,10.42.0.17:80   52s
To store all the pod names along with their IPs, we could use an imperative command as shown below:
student-node ~ ➜ kubectl get pods -n spectra-1267 -o=custom-columns='POD_NAME:metadata.name,IP_ADDR:status.podIP' --sort-by=.status.podIP
POD_NAME   IP_ADDR
pod-12     10.42.0.18
pod-23     10.42.0.19
pod-34     10.42.0.20
pod-21     10.42.0.21
...

# store the output to /root/pod_ips
student-node ~ ➜ kubectl get pods -n spec
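The trailing command above is cut off in the source; the idea is simply to redirect the same custom-columns output into the file mentioned in the comment, roughly like this sketch:

kubectl get pods -n spectra-1267 -o=custom-columns='POD_NAME:metadata.name,IP_ADDR:status.podIP' --sort-by=.status.podIP > /root/pod_ips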
External endpoint for a service
- Let's check if the webserver is working or not:
student-node ~ ➜ curl student-node:9999
...
<h1>Welcome to nginx!</h1>
...
Now we will check if the service is correctly defined:

student-node ~ ➜ kubectl describe svc external-webserver-cka03-svcn
Name:        external-webserver-cka03-svcn
Namespace:   default
.
.
Endpoints:   <none>    # there are no endpoints for the service
...
As we can see, there are no endpoints specified for the service, hence we won't get any output. Since we cannot destroy any k8s object, let's create the endpoint manually for this service as shown below:
student-node ~ ➜ export IP_ADDR=$(ifconfig eth0 | grep inet | awk '{print $2}')

student-node ~ ➜ kubectl --context cluster3 apply -f - <<EOF
apiVersion: v1
kind: Endpoints
metadata:
  # the name here should match the name of the Service
  name: external-webserver-cka03-svcn
subsets:
  - addresses:
      - ip: $IP_ADDR
    ports:
      - port: 9999
EOF
Finally, check if the curl test works now:

student-node ~ ➜ kubectl --context cluster3 run --rm -i test-curl-pod --image=curlimages/curl --restart=Never -- curl -m 2 external-webserver-cka03-svcn
...
<title>Welcome to nginx!</title>
...
Running nslookup in pods
Switching to cluster1:
kubectl config use-context cluster1
Create the pod nginx-resolver-cka06-svcn and expose it internally:

student-node ~ ➜ kubectl run nginx-resolver-cka06-svcn --image=nginx
student-node ~ ➜ kubectl expose pod/nginx-resolver-cka06-svcn --name=nginx-resolver-service-cka06-svcn --port=80 --target-port=80 --type=ClusterIP
Create a pod test-nslookup and test that you are able to look up the service and pod names from within the cluster:

student-node ~ ➜ kubectl run test-nslookup --image=busybox:1.28 --rm -it --restart=Never -- nslookup nginx-resolver-service-cka06-svcn
student-node ~ ➜ kubectl run test-nslookup --image=busybox:1.28 --rm -it --restart=Never -- nslookup nginx-resolver-service-cka06-svcn > /root/CKA/nginx.svc.cka06.svcn
Get the IP of the nginx-resolver-cka06-svcn pod and replace the dots (.) with hyphens (-), which will be used below:

student-node ~ ➜ kubectl get pod nginx-resolver-cka06-svcn -o wide
student-node ~ ➜ IP=`kubectl get pod nginx-resolver-cka06-svcn -o wide --no-headers | awk '{print $6}' | tr '.' '-'`
student-node ~ ➜ kubectl run test-nslooku
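The last command above is truncated in the source; the pod lookup typically uses the <ip-with-hyphens>.<namespace>.pod DNS form, so a sketch of the remaining step (the output file name is an assumption) would be:

kubectl run test-nslookup --image=busybox:1.28 --rm -it --restart=Never -- nslookup $IP.default.pod > /root/CKA/nginx.pod.cka06.svcn   # output file name assumed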
Possible reason for pod stuck in pending state due to PVC stuck in pending state
kubectl get event --field-selector involvedObject.name=demo-pod-cka29-trb
You will see some Warnings like:
Warning FailedScheduling pod/demo-pod-cka29-trb 0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
This seems to be something related to PersistentVolumeClaims, so let's check that:
kubectl get pvc
You will notice that demo-pvc-cka29-trb is stuck in the Pending state. Let's dig into it:
kubectl get event --field-selector involvedObject.name=demo-pvc-cka29-trb
You will notice this error:
Warning   VolumeMismatch   persistentvolumeclaim/demo-pvc-cka29-trb   Cannot bind to requested volume "demo-pv-cka29-trb": incompatible accessMode
Which means the PVC is using an incompatible accessMode, so let's check it out:
kubectl get pvc demo-pvc-cka29-trb -o yaml
kubectl get pv demo-pv-cka29-trb -o yaml
Let's re-create the PVC with the correct access mode, i.e. ReadWriteMany:
kubectl get pvc demo-pvc-cka29-trb -o yaml > /tmp/pvc.yaml
vi /tmp/pvc.yaml
- Under spec: change accessModes: from ReadWriteOnce to ReadWriteMany (see the sketch below)
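A minimal sketch of how the relevant part of /tmp/pvc.yaml should read after the change (the storage request shown is a placeholder; keep whatever the exported PVC already contains):

spec:
  accessModes:
  - ReadWriteMany        # was ReadWriteOnce
  resources:
    requests:
      storage: 50Mi      # placeholder; keep the original value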
Delete the old PVC and create a new one:
kubectl delete pvc demo-pvc-cka29-trb
kubectl apply -f /tmp/pvc.yaml
Check the POD now
kubectl get pod demo-pod-cka29-trb
It should be good now.
Worker node not in ready state
SSH into cluster4-node01 and check if the kubelet service is running:
ssh cluster4-node01
systemctl status kubelet
You will see it's inactive, so try to start it:
systemctl start kubelet
Check again the status
systemctl status kubelet
It's still failing, so let's look into some recent error logs:
journalctl -u kubelet --since "30 min ago" | grep 'Error:'
You will see some errors as below:
cluster4-node01 kubelet[6301]: Error: failed to construct kubelet dependencies: unable to load client CA file /etc/kubernetes/pki/CA.crt: open /etc/kubernetes/pki/CA.crt: no such file or directory
Check if the /etc/kubernetes/pki/CA.crt file exists:
ls /etc/kubernetes/pki/
You will notice that the file name is ca.crt instead of CA.crt, so kubelet is probably looking for the wrong file. Let's fix the config:
vi /var/lib/kubelet/config.yaml
- Change clientCAFile from /etc/kubernetes/pki/CA.crt to /etc/kubernetes/pki/ca.crt (see the sketch below)
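In the kubelet config this field lives under the x509 authentication block, so after the fix it should look roughly like this (the surrounding field shown is the usual kubeadm default and is an assumption here):

authentication:
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt   # was CA.crt
  anonymous:
    enabled: false                             # usual kubeadm default, shown for context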
Try to start it again
systemctl start kubelet
Service should start now but there might be an error as below
ReportingInstance:""}': 'Post "https://cluster4-controlplane:6334/api/v1/namespaces/default/events": dial tcp 10.9.63.18:6334: connect: connection refused' (may retry after sleeping)
Sep 18 09:21:47 cluster4-node01 kubelet[6803]: E0918 09:21:47.641184 6803 kubelet.go:2419] "Error getting node" err="node \"cluster4-node01\" not found"
You must have noticed that it's trying to connect to the API server on port 6334, but the default port for kube-apiserver is 6443. Let's fix this. Edit the kubelet config:
vi /etc/kubernetes/kubelet.conf
- Change the server line from
  server: https://cluster4-controlplane:6334
to
  server: https://cluster4-controlplane:6443
Finally, restart the kubelet service:
systemctl restart kubelet
Check from the student-node now; cluster4-node01 should be in Ready state:
kubectl get node --context=cluster4
Daemonset not getting scheduled on the controlplane node
Check the status of the DaemonSet:
kubectl --context cluster2 get ds logs-cka26-trb -n kube-system
You will find that DESIRED, CURRENT, READY etc. have the value 2, which means two pods have been created. You can check the same by listing the PODs:
kubectl --context cluster2 get pod -n kube-system
You can check which nodes these are created on:
kubectl --context cluster2 get pod <pod-name> -n kube-system -o wide
Under NODE you will find the node name, so we can see that it's not scheduled on the controlplane node, which must be because it is missing the required tolerations. Let's edit the DaemonSet to fix the tolerations:
kubectl --context cluster2 edit ds logs-cka26-trb -n kube-system
Under tolerations: add the below toleration as well:
- key: node-role.kubernetes.io/control-plane
  operator: Exists
  effect: NoSchedule
Wait for some time and the PODs should get scheduled on all nodes now, including the controlplane node.
Several pods in kube-system namespace are restarting after some time
In such questions, a common reason is a failing health check: it fails after an interval and leads to the pod crashing.
You will see that the kube-controller-manager-cluster4-controlplane pod is crashing or restarting, so let's watch the logs:
kubectl logs -f kube-controller-manager-cluster4-controlplane --context=cluster4 -n kube-system
You will see some logs as below:
leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.10.129.21:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 10.10.129.21:6443: connect: connection refused
You will notice that somehow the connection to the kube api is breaking, let's check if kube api pod is healthy.
kubectl get pod --context=cluster4 -n kube-system
Now you might notice that the kube-apiserver-cluster4-controlplane pod is also restarting, so we should dig into its logs or relevant events:
kubectl logs -f kube-apiserver-cluster4-controlplane -n kube-system
kubectl get event --field-selector involvedObject.name=kube-apiserver-cluster4-controlplane -n kube-system
In events you will see this error
Warning Unhealthy pod/kube-apiserver-cluster4-controlplane Liveness probe failed: Get "https://10.10.132.25:6444/livez": dial tcp 10.10.132.25:6444: connect: connection refused
From this we can see that the liveness probe is failing for the kube-apiserver-cluster4-controlplane pod: it's trying to connect to port 6444, but the default API server port is 6443. So let's look into the kube-apiserver manifest:
ssh cluster4-controlplane
vi /etc/kubernetes/manifests/kube-apiserver.yaml
Under livenessProbe: you will see the port: value is 6444; change it to 6443 and save. Now wait a few seconds for the kube api pod to come up:
kubectl get pod -n kube-system
Watch the PODs status for some time and make sure these are not restarting now.
Pod not getting scheduled on any node (cannot edit pod config to fix it)
You will see that the cat-cka22-trb pod is stuck in the Pending state, so let's look into the events:
kubectl --context cluster2 get event --field-selector involvedObject.name=cat-cka22-trb
You will see some logs as below
Warning   FailedScheduling   pod/cat-cka22-trb   0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 3 Preemption is not helpful for scheduling.
So it seems like this POD is using node affinity; let's look into the POD to understand the affinity it's using:
kubectl --context cluster2 get pod cat-cka22-trb -o yaml
Under affinity: you will see it's looking for key: node and values: cluster2-node02, so let's verify if cluster2-node01 has this label applied:
kubectl --context cluster2 get node cluster2-node01 -o yaml
Look under labels: and you will not find any such label, so let's add the label the affinity expects to this node:
kubectl --context cluster2 label node cluster2-node01 node=cluster2-node02
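For context, the affinity block inside the pod spec that drives this behaviour looks roughly like the following sketch (the exact structure in the pod may differ slightly), which is why labelling a node with node=cluster2-node02 is enough to unblock scheduling without editing the pod:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node
          operator: In
          values:
          - cluster2-node02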