Wrong questions

Saving top nodes
Check out the metrics for all nodes across all clusters:
kubectl top node --context cluster1 --no-headers | sort -nr -k2 | head -1
cluster1-controlplane   127m   1%   703Mi   1%

kubectl top node --context cluster2 --no-headers | sort -nr -k2 | head -1
cluster2-controlplane   126m   1%   675Mi   1%

kubectl top node --context cluster3 --no-headers | sort -nr -k2 | head -1
cluster3-controlplane   577m   7%   1081Mi   1%

kubectl top node --context cluster4 --no-headers | sort -nr -k2 | head -1
cluster4-controlplane   130m   1%   679Mi   1%
Using this, find the node that uses the most CPU. In this case, it is cluster3-controlplane on cluster3.
echo cluster3,cluster3-controlplane > /opt/high_cpu_node
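If you prefer, you can compare all four clusters in one go; a small sketch, assuming the four context names used above:
for ctx in cluster1 cluster2 cluster3 cluster4; do
  echo -n "$ctx: "
  # print the busiest node (by CPU) for this context
  kubectl top node --context "$ctx" --no-headers | sort -nr -k2 | head -1
done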
Install etcdctl
Install the etcd utility:
cd /tmp
export RELEASE=$(curl -s https://api.github.com/repos/etcd-io/etcd/releases/latest | grep tag_name | cut -d '"' -f 4)
wget https://github.com/etcd-io/etcd/releases/download/${RELEASE}/etcd-${RELEASE}-linux-amd64.tar.gz
tar xvf etcd-${RELEASE}-linux-amd64.tar.gz
cd etcd-${RELEASE}-linux-amd64
mv etcd etcdctl /usr/local/bin/
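To confirm the binaries are on the PATH, you can check the client version:
etcdctl version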
Network policy to make a pod accessible from a specific pod in a specific namespace
ingress:
- from:
  - namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: default
    podSelector:
      matchLabels:
        app: cyan-white-cka28-trb
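For context, the ingress snippet above sits inside a full NetworkPolicy object; a minimal sketch, where the policy name, namespace, and the labels of the target pod are placeholders that need to match the actual question:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-default-ns       # hypothetical name
  namespace: target-namespace       # hypothetical namespace of the pod being exposed
spec:
  podSelector:
    matchLabels:
      app: target-pod               # hypothetical label of the pod being exposed
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: default
      podSelector:
        matchLabels:
          app: cyan-white-cka28-trb
Note that podSelector is part of the same from entry as namespaceSelector (no leading dash), so both conditions must match; a separate - podSelector: entry would allow traffic from either.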
Deployment scaling out
kubectl get deployment
We can see DESIRED count for pink-depl-cka14-trb is 2 but the CURRENT count is still 1
As we know, the Kube Controller Manager is responsible for monitoring the status of replica sets/deployments and ensuring that the desired number of PODs is available, so let's check if it is running fine.
kubectl get pod -n kube-system
So kube-controller-manager-cluster4-controlplane is crashing. Let's check the events to figure out what's happening:
student-node ~ ✖ kubectl get event --field-selector involvedObject.name=kube-controller-manager-cluster4-controlplane -n kube-system
LAST SEEN   TYPE      REASON         OBJECT                                               MESSAGE
10m         Warning   NodeNotReady   pod/kube-controller-manager-cluster4-controlplane   Node is not ready
3m25s       Normal    Killing        pod/kube-controller-manager-cluster4-controlplane   Stopping container kube-controller-manager
2m18s       Normal    Pulled         pod/kube-controller-manager-cluster4-controlplane   Container image "k8s.gcr.io/kube-controller-manager:v1.24.0" already present on machine
2m18s       Normal    Created        pod/kube-controller-manager-cluster4-controlplane   Created container kube-controller-manager
2m18s       Warning   Failed         pod/kube-controller-manager-cluster4-controlplane   Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "kube-controller-manage": executable file not found in $PATH: unknown
108s        Warning   BackOff        pod/kube-controller-manager-cluster4-controlplane   Back-off restarting failed container
You will see an error like the one below:
Warning Failed pod/kube-controller-manager-cluster4-controlplane Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "kube-controller-manage": executable file not found in $PATH: unknown
It seems like it's trying to run the kube-controller-manage command, but it is supposed to run the kube-controller-manager command. So let's look into the kube-controller-manager manifest, which is present under /etc/kubernetes/manifests/kube-controller-manager.yaml on the cluster4-controlplane node. Let's SSH into cluster4-controlplane:
ssh cluster4-controlplane
vi /etc/kubernetes/manifests/kube-controller-manager.yaml
  • Under containers: -> command:, change kube-controller-manage to kube-controller-manager, then restart the kube-controller-manager-cluster4-controlplane POD:
kubectl delete pod kube-controller-manager-cluster4-controlplane -n kube-system

Now check the deployment again

kubectl get deployment
CURRENT count should be equal to the DESIRED count now for pink-depl-cka14-trb.
PV & PVC with Storage Class
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: coconut-pv-cka01-str
  labels:
    storage-tier: gold
spec:
  capacity:
    storage: 100Mi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /opt/coconut-stc-cka01-str
  storageClassName: coconut-stc-cka01-str
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - cluster1-node01
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: coconut-pvc-cka01-str
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Mi
  storageClassName: coconut-stc-cka01-str
  selector:
    matchLabels:
      storage-tier: gold
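Both manifests reference the storage class coconut-stc-cka01-str. If the class itself also has to be created, a minimal sketch could look like this (the provisioner and binding mode are assumptions for a manually provisioned hostPath volume; the question may specify different values):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: coconut-stc-cka01-str
provisioner: kubernetes.io/no-provisioner   # no dynamic provisioning; the PV is created manually
volumeBindingMode: WaitForFirstConsumer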
Troubleshooting: service not reachable from a pod
  • Test if the service curlme-cka01-svcn is accessible from pod curlpod-cka01-svcn or not.
kubectl exec curlpod-cka01-svcn -- curl curlme-cka01-svcn
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:10 --:--:--     0
  • We did not get any response. Check if the service is properly configured or not.
kubectl describe svc curlme-cka01-svcn
....
Name:              curlme-cka01-svcn
Namespace:         default
Labels:            <none>
Annotations:       <none>
Selector:          run=curlme-ckaO1-svcn
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.109.45.180
IPs:               10.109.45.180
Port:              <unset>  80/TCP
TargetPort:        80/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>
  • The service has no endpoints configured (note that the selector above does not match the pod's label). Since we are allowed to delete the resource, let's delete the service and create it again.
  • To delete the service, use the command kubectl delete svc curlme-cka01-svcn.
  • You can create the service the imperative way or the declarative way.
Using imperative command:
kubectl expose pod curlme-cka01-svcn --port=80
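Alternatively, a declarative manifest could look roughly like this; it assumes the pod was created with kubectl run, which labels it run=curlme-cka01-svcn (verify the actual pod labels first):
apiVersion: v1
kind: Service
metadata:
  name: curlme-cka01-svcn
spec:
  type: ClusterIP
  selector:
    run: curlme-cka01-svcn   # must match the pod's labels exactly
  ports:
  - port: 80
    targetPort: 80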
Possible reason for pod restarting due to liveness check fail
Notice the container command: sleep 3; touch /healthcheck; sleep 30; sleep 30000. It creates the /healthcheck file only after a delay of 3 seconds, but the liveness probe's initialDelaySeconds is set to 1 and failureThreshold is also 1. This means the POD fails right after the first liveness check, which happens just 1 second after the pod starts. To make it stable, we must increase initialDelaySeconds to at least 5.
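For reference, the corrected probe could look roughly like this; the probe command and period are assumptions for illustration, the initialDelaySeconds change is the actual fix:
livenessProbe:
  exec:
    command:                 # assumed probe; the real one may differ
    - cat
    - /healthcheck
  initialDelaySeconds: 5     # was 1; the file only exists after ~3 seconds
  periodSeconds: 60          # assumption
  failureThreshold: 1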
Out of Memory error causing pod restarts
kubectl logs -f green-deployment-cka15-trb-xxxx
You will see some logs like these
2022-09-18 17:13:25 98 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2022-09-18 17:13:25 98 [Note] InnoDB: Memory barrier is not used
2022-09-18 17:13:25 98 [Note] InnoDB: Compressed tables use zlib 1.2.11
2022-09-18 17:13:25 98 [Note] InnoDB: Using Linux native AIO
2022-09-18 17:13:25 98 [Note] InnoDB: Using CPU crc32 instructions
2022-09-18 17:13:25 98 [Note] InnoDB: Initializing buffer pool, size = 128.0M
Killed
This might be due to a resource issue, especially memory, so let's recreate the POD to see if that helps.
kubectl delete pod green-deployment-cka15-trb-xxxx
Now watch closely the POD status
kubectl get pod
Pretty soon you will see the POD status change to OOMKilled, which confirms it's a memory issue. So let's look into the resources assigned to this deployment:
kubectl get deploy
kubectl edit deploy green-deployment-cka15-trb
  • Under resources: -> limits:, change memory from 256Mi to 512Mi and save the changes (see the snippet below).
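The relevant part of the deployment should end up looking roughly like this (the requests value is an assumption and may differ in the actual deployment):
resources:
  requests:
    memory: 256Mi    # assumption
  limits:
    memory: 512Mi    # raised from 256Mi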
Now watch closely the POD status again
kubectl get pod
It should be stable now.
Troubleshooting cluster
  • ssh cluster4-controlplane
    • Let's take etcd backup
      ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key snapshot save /opt/etcd-boot-cka18-trb.db
      It might get stuck forever; let's see why that would happen. Try to list the PODs first:
      kubectl get pod -A
      There might be an error like:
      The connection to the server cluster4-controlplane:6443 was refused - did you specify the right host or port?
      There seems to be some issue with the cluster so let's look into the logs
      journalctl -u kubelet -f
      You will see a lot of connect: connection refused errors, but that must be because the different cluster components are not able to connect to the api server, so filter out these logs to look more closely:
      journalctl -u kubelet -f | grep -v 'connect: connection refused'
      You should see some errors as below:
      cluster4-controlplane kubelet[2240]: E0923 04:38:15.630925 2240 file.go:187] "Could not process manifest file" err="invalid pod: [spec.containers[0].volumeMounts[1].name: Not found: \"etcd-cert\"]" path="/etc/kubernetes/manifests/etcd.yaml"
      So it seems like etcd is trying to mount a volume that does not exist; let's look into the etcd manifest:
      vi /etc/kubernetes/manifests/etcd.yaml
      Search for etcd-cert; you will notice that the volume name is etcd-certs but the volume mount is trying to mount an etcd-cert volume, which is incorrect. Fix the volume mount name and save the changes.
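      After the fix, the relevant parts of the manifest should look roughly like this (paths as in a default kubeadm setup; they may differ):
      volumeMounts:
      - mountPath: /var/lib/etcd
        name: etcd-data
      - mountPath: /etc/kubernetes/pki/etcd
        name: etcd-certs          # must match the volume name below (was etcd-cert)
      ...
      volumes:
      - hostPath:
          path: /etc/kubernetes/pki/etcd
          type: DirectoryOrCreate
        name: etcd-certs
      Then restart the kubelet service: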
      systemctl restart kubelet
      Wait for a few minutes to see if it's good now.
      kubectl get pod -A
      You should be able to list the PODs now, let's try to take etcd backup now:
      ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key snapshot save /opt/etcd-boot-cka18-trb.db
      It should work now.
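      Optionally, you can verify the snapshot that was just written:
      ETCDCTL_API=3 etcdctl --write-out=table snapshot status /opt/etcd-boot-cka18-trb.db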
Rearranging kubectl outputs in columns
The easiest way to route traffic to a specific pod is by the use of labels and selectors. List the pods along with their labels:
student-node ~ ➜ kubectl get pods --show-labels -n spectra-1267
NAME     READY   STATUS    RESTARTS   AGE     LABELS
pod-12   1/1     Running   0          5m21s   env=dev,mode=standard,type=external
pod-34   1/1     Running   0          5m20s   env=dev,mode=standard,type=internal
pod-43   1/1     Running   0          5m20s   env=prod,mode=exam,type=internal
pod-23   1/1     Running   0          5m21s   env=dev,mode=exam,type=external
pod-32   1/1     Running   0          5m20s   env=prod,mode=standard,type=internal
pod-21   1/1     Running   0          5m20s   env=prod,mode=exam,type=external
Looks like there are a lot of pods created to confuse us. But we are only concerned with the labels of pod-23 and pod-21.
As we can see both the required pods have labels mode=exam,type=external in common. Let's confirm that using kubectl too:
student-node ~ ➜ kubectl get pod -l mode=exam,type=external -n spectra-1267
NAME     READY   STATUS    RESTARTS   AGE
pod-23   1/1     Running   0          9m18s
pod-21   1/1     Running   0          9m17s
Nice!! Now as we have figured out the labels, we can proceed further with the creation of the service:
student-node ~ ➜ kubectl create service clusterip service-3421-svcn -n spectra-1267 --tcp=8080:80 --dry-run=client -o yaml > service-3421-svcn.yaml
Now modify the service definition with selectors as required before applying to k8s cluster:
student-node ~ ➜ cat service-3421-svcn.yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    app: service-3421-svcn
  name: service-3421-svcn
  namespace: spectra-1267
spec:
  ports:
  - name: 8080-80
    port: 8080
    protocol: TCP
    targetPort: 80
  selector:
    app: service-3421-svcn   # delete
    mode: exam               # add
    type: external           # add
  type: ClusterIP
status:
  loadBalancer: {}
Finally let's apply the service definition:
student-node ~ ➜ kubectl apply -f service-3421-svcn.yaml
service/service-3421-svcn created

student-node ~ ➜ k get ep service-3421-svcn -n spectra-1267
NAME                ENDPOINTS                     AGE
service-3421-svcn   10.42.0.15:80,10.42.0.17:80   52s
To store all the pod names along with their IPs, we can use an imperative command as shown below:
student-node ~ ➜ kubectl get pods -n spectra-1267 -o=custom-columns='POD_NAME:metadata.name,IP_ADDR:status.podIP' --sort-by=.status.podIP
POD_NAME   IP_ADDR
pod-12     10.42.0.18
pod-23     10.42.0.19
pod-34     10.42.0.20
pod-21     10.42.0.21
...

# store the output to /root/pod_ips
student-node ~ ➜ kubectl get pods -n spec
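The last command above is cut off; presumably it is the same query redirected to the requested file, roughly:
kubectl get pods -n spectra-1267 -o=custom-columns='POD_NAME:metadata.name,IP_ADDR:status.podIP' --sort-by=.status.podIP > /root/pod_ips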
External endpoint for a service
  • Let's check if the webserver is working or not:
    • student-node ~ ➜ curl student-node:9999
      ...
      <h1>Welcome to nginx!</h1>
      ...
      Now we will check if the service is correctly defined:
      student-node ~ ➜ kubectl describe svc external-webserver-cka03-svcn
      Name:         external-webserver-cka03-svcn
      Namespace:    default
      .
      .
      Endpoints:    <none>   # there are no endpoints for the service
      ...
      As we can see, there are no endpoints specified for the service, hence we won't get any output. Since we cannot destroy any k8s object, let's create the endpoint manually for this service as shown below:
      student-node ~ ➜ export IP_ADDR=$(ifconfig eth0 | grep inet | awk '{print $2}')

      student-node ~ ➜ kubectl --context cluster3 apply -f - <<EOF
      apiVersion: v1
      kind: Endpoints
      metadata:
        # the name here should match the name of the Service
        name: external-webserver-cka03-svcn
      subsets:
      - addresses:
        - ip: $IP_ADDR
        ports:
        - port: 9999
      EOF
      Finally check if the curl test works now:
      student-node ~ ➜ kubectl --context cluster3 run --rm -i test-curl-pod --image=curlimages/curl --restart=Never -- curl -m 2 external-webserver-cka03-svcn
      ...
      <title>Welcome to nginx!</title>
      ...
Running nslookup in pods
Switching to cluster1:
kubectl config use-context cluster1
To create a pod nginx-resolver-cka06-svcn and expose it internally:
student-node ~ ➜ kubectl run nginx-resolver-cka06-svcn --image=nginx
student-node ~ ➜ kubectl expose pod/nginx-resolver-cka06-svcn --name=nginx-resolver-service-cka06-svcn --port=80 --target-port=80 --type=ClusterIP
Create a pod test-nslookup and test that you are able to look up the service and pod names from within the cluster:
student-node ~ ➜ kubectl run test-nslookup --image=busybox:1.28 --rm -it --restart=Never -- nslookup nginx-resolver-service-cka06-svcn
student-node ~ ➜ kubectl run test-nslookup --image=busybox:1.28 --rm -it --restart=Never -- nslookup nginx-resolver-service-cka06-svcn > /root/CKA/nginx.svc.cka06.svcn
Get the IP of the nginx-resolver-cka06-svcn pod and replace the dots (.) with hyphens (-); this will be used below.
student-node ~ ➜ kubectl get pod nginx-resolver-cka06-svcn -o wide
student-node ~ ➜ IP=`kubectl get pod nginx-resolver-cka06-svcn -o wide --no-headers | awk '{print $6}' | tr '.' '-'`
student-node ~ ➜ kubectl run test-nslooku
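The last command above is also cut off. A sketch of what the pod-name lookup would look like, assuming the pod runs in the default namespace and that the output file name follows the same pattern as the service lookup (adjust both to the actual question):
student-node ~ ➜ kubectl run test-nslookup --image=busybox:1.28 --rm -it --restart=Never -- nslookup $IP.default.pod > /root/CKA/nginx.pod.cka06.svcn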
Possible reason for pod stuck in pending state due to PVC stuck in pending state
kubectl get event --field-selector involvedObject.name=demo-pod-cka29-trb
You will see some Warnings like:
Warning FailedScheduling pod/demo-pod-cka29-trb 0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
This seems to be related to PersistentVolumeClaims. Let's check that:
kubectl get pvc
You will notice that demo-pvc-cka29-trb is stuck in Pending state. Let's dig into it
kubectl get event --field-selector involvedObject.name=demo-pvc-cka29-trb
You will notice this error:
Warning VolumeMismatch persistentvolumeclaim/demo-pvc-cka29-trb Cannot bind to requested volume "demo-pv-cka29-trb": incompatible accessMode
Which means the PVC is using an incompatible accessMode; let's check it out:
kubectl get pvc demo-pvc-cka29-trb -o yaml kubectl get pv demo-pv-cka29-trb -o yaml
Let's re-create the PVC with the correct access mode, i.e., ReadWriteMany:
kubectl get pvc demo-pvc-cka29-trb -o yaml > /tmp/pvc.yaml
vi /tmp/pvc.yaml
  • Under spec:, change accessModes: from ReadWriteOnce to ReadWriteMany (see the snippet below).
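The relevant part of /tmp/pvc.yaml should then look roughly like this:
spec:
  accessModes:
  - ReadWriteMany      # was ReadWriteOnce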
Delete the old PVC and create a new one:
kubectl delete pvc demo-pvc-cka29-trb
kubectl apply -f /tmp/pvc.yaml
Check the POD now
kubectl get pod demo-pod-cka29-trb
It should be good now.
Worker node not in ready state
SSH into cluster4-node01 and check if the kubelet service is running:
ssh cluster4-node01
systemctl status kubelet
You will see it's inactive, so try to start it.
systemctl start kubelet
Check again the status
systemctl status kubelet
It's still failing, so let's look into the latest error logs:
journalctl -u kubelet --since "30 min ago" | grep 'Error:'
You will see some errors as below:
cluster4-node01 kubelet[6301]: Error: failed to construct kubelet dependencies: unable to load client CA file /etc/kubernetes/pki/CA.crt: open /etc/kubernetes/pki/CA.crt: no such file or directory
Check if /etc/kubernetes/pki/CA.crt file exists:
ls /etc/kubernetes/pki/
You will notice that the file name is ca.crt instead of CA.crt, so kubelet is probably looking for the wrong file name. Let's fix the config:
vi /var/lib/kubelet/config.yaml
  • Change clientCAFile from /etc/kubernetes/pki/CA.crt to /etc/kubernetes/pki/ca.crt (see the snippet below).
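The relevant part of /var/lib/kubelet/config.yaml should then look roughly like this:
authentication:
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt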
Try to start it again
systemctl start kubelet
Service should start now but there might be an error as below
ReportingInstance:""}': 'Post "https://cluster4-controlplane:6334/api/v1/namespaces/default/events": dial tcp 10.9.63.18:6334: connect: connection refused'(may retry after sleeping)
Sep 18 09:21:47 cluster4-node01 kubelet[6803]: E0918 09:21:47.641184 6803 kubelet.go:2419] "Error getting node" err="node \"cluster4-node01\" not found"
You must have noticed that it's trying to connect to the api server on port 6334, but the default port for kube-apiserver is 6443. Let's fix this:

Edit the kubelet config

vi /etc/kubernetes/kubelet.conf
  • Change server
server: https://cluster4-controlplane:6334
to
server: https://cluster4-controlplane:6443

Finally restart the kubelet service

systemctl restart kubelet
Check from the student-node now; cluster4-node01 should be in Ready state.
kubectl get node --context=cluster4
Daemonset not getting scheduled on the controlplane node
Check the status of DaemonSet
kubectl --context cluster2 get ds logs-cka26-trb -n kube-system
You will find that DESIRED, CURRENT, READY, etc. have the value 2, which means two pods have been created. You can check the same by listing the PODs:
kubectl --context cluster2 get pod -n kube-system
You can check which nodes these are created on:
kubectl --context cluster2 get pod <pod-name> -n kube-system -o wide
Under NODE you will find the node name, so we can see that it's not scheduled on the controlplane node, which is because it must be missing the required tolerations. Let's edit the DaemonSet to fix the tolerations:
kubectl --context cluster2 edit ds logs-cka26-trb -n kube-system
Under tolerations:, add the toleration given below as well:
- key: node-role.kubernetes.io/control-plane
  operator: Exists
  effect: NoSchedule
Wait for some time; the PODs should now be scheduled on all nodes, including the controlplane node.
Several pods in kube-system namespace are restarting after some time
In such questions, a common cause is a failing health check: it fails after an interval and leads to the pod crashing.
You will see that kube-controller-manager-cluster4-controlplane pod is crashing or restarting. So let's try to watch the logs.
kubectl logs -f kube-controller-manager-cluster4-controlplane --context=cluster4 -n kube-system
You will see some logs as below:
leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.10.129.21:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 10.10.129.21:6443: connect: connection refused
You will notice that somehow the connection to the kube api is breaking, so let's check if the kube api pod is healthy:
kubectl get pod --context=cluster4 -n kube-system
Now you might notice that kube-apiserver-cluster4-controlplane pod is also restarting, so we should dig into its logs or relevant events.
kubectl logs -f kube-apiserver-cluster4-controlplane -n kube-system
kubectl get event --field-selector involvedObject.name=kube-apiserver-cluster4-controlplane -n kube-system
In events you will see this error
Warning Unhealthy pod/kube-apiserver-cluster4-controlplane Liveness probe failed: Get "https://10.10.132.25:6444/livez": dial tcp 10.10.132.25:6444: connect: connection refused
From this we can see that the liveness probe is failing for the kube-apiserver-cluster4-controlplane pod: it's trying to connect to port 6444, but the default api server port is 6443. So let's look into the kube-apiserver manifest:
ssh cluster4-controlplane
vi /etc/kubernetes/manifests/kube-apiserver.yaml
Under livenessProbe: you will see the port: value is 6444; change it to 6443 and save.
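For reference, the corrected probe section should look roughly like this (the host is the advertise address seen in the event above; other fields follow the default kubeadm manifest and may differ):
livenessProbe:
  httpGet:
    host: 10.10.132.25
    path: /livez
    port: 6443          # was 6444
    scheme: HTTPS
Now wait a few seconds for the kube api pod to come up.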
kubectl get pod -n kube-system
Watch the PODs status for some time and make sure these are not restarting now.
Pod not getting scheduled on any node (cannot edit pod config to fix it)
You will see that the cat-cka22-trb pod is stuck in the Pending state. So let's look into the events:
kubectl --context cluster2 get event --field-selector involvedObject.name=cat-cka22-trb
You will see some logs as below
Warning FailedScheduling pod/cat-cka22-trb 0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
So it seems like this POD is using node affinity; let's look into the POD to understand the node affinity it's using:
kubectl --context cluster2 get pod cat-cka22-trb -o yaml
Under affinity: you will see it's looking for the key node with the value cluster2-node02, so let's verify whether cluster2-node01 has this label applied.
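For reference, the affinity section in the pod spec looks roughly like this (the operator is assumed to be In):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node
          operator: In
          values:
          - cluster2-node02
Check the node's labels: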
kubectl --context cluster2 get node cluster2-node01 -o yaml
Look under labels: and you will not find any such label. Since the pod's config cannot be edited, let's add the required label to this node instead.
kubectl --context cluster2 label node cluster2-node01 node=cluster2-node02