Saving top nodes
Check out the metrics for all nodes across all clusters:
kubectl top node --context cluster1 --no-headers | sort -nr -k2 | head -1
cluster1-controlplane   127m   1%   703Mi   1%

kubectl top node --context cluster2 --no-headers | sort -nr -k2 | head -1
cluster2-controlplane   126m   1%   675Mi   1%

kubectl top node --context cluster3 --no-headers | sort -nr -k2 | head -1
cluster3-controlplane   577m   7%   1081Mi   1%

kubectl top node --context cluster4 --no-headers | sort -nr -k2 | head -1
cluster4-controlplane   130m   1%   679Mi   1%
Using this, find the node that uses the most CPU. In this case, it is cluster3-controlplane on cluster3, so save it to the output file:
echo cluster3,cluster3-controlplane > /opt/high_cpu_node
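If you prefer not to compare the four outputs by eye, a small loop like this sketch (assuming the contexts are named cluster1 through cluster4 and metrics-server is available in each cluster) prints the top CPU node per cluster:

for ctx in cluster1 cluster2 cluster3 cluster4; do
  # print the context followed by its highest-CPU node and that node's CPU usage
  echo -n "$ctx: "
  kubectl top node --context "$ctx" --no-headers | sort -nr -k2 | head -1 | awk '{print $1, $2}'
done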
Install etcdctl
Install the etcd utility:
cd /tmp
export RELEASE=$(curl -s https://api.github.com/repos/etcd-io/etcd/releases/latest | grep tag_name | cut -d '"' -f 4)
wget https://github.com/etcd-io/etcd/releases/download/${RELEASE}/etcd-${RELEASE}-linux-amd64.tar.gz
tar xvf etcd-${RELEASE}-linux-amd64.tar.gz
cd etcd-${RELEASE}-linux-amd64
mv etcd etcdctl /usr/local/bin/
Network policy to make a pod accessible from a specific pod in a specific namespace
ingress:
- from:
  - namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: default
    podSelector:
      matchLabels:
        app: cyan-white-cka28-trb
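For context, this ingress block sits inside a full NetworkPolicy object. A minimal sketch is shown below; the policy name, target namespace, and the label selecting the destination pod are illustrative assumptions, only the ingress rule above comes from the original:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-cyan-white      # hypothetical name
  namespace: default               # hypothetical namespace of the destination pod
spec:
  podSelector:
    matchLabels:
      app: destination-pod         # hypothetical label of the pod being protected
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: default
      podSelector:
        matchLabels:
          app: cyan-white-cka28-trb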
Deployment scaling out
kubectl get deployment
We can see the DESIRED count for pink-depl-cka14-trb is 2 but the CURRENT count is still 1. As we know, the Kube Controller Manager is responsible for monitoring the status of replica sets/deployments and ensuring that the desired number of PODs is available, so let's check if it's running fine:
kubectl get pod -n kube-system
So kube-controller-manager-cluster4-controlplane is crashing. Let's check the events to figure out what's happening:

student-node ~ ✖ kubectl get event --field-selector involvedObject.name=kube-controller-manager-cluster4-controlplane -n kube-system
LAST SEEN   TYPE      REASON         OBJECT                                               MESSAGE
10m         Warning   NodeNotReady   pod/kube-controller-manager-cluster4-controlplane   Node is not ready
3m25s       Normal    Killing        pod/kube-controller-manager-cluster4-controlplane   Stopping container kube-controller-manager
2m18s       Normal    Pulled         pod/kube-controller-manager-cluster4-controlplane   Container image "k8s.gcr.io/kube-controller-manager:v1.24.0" already present on machine
2m18s       Normal    Created        pod/kube-controller-manager-cluster4-controlplane   Created container kube-controller-manager
2m18s       Warning   Failed         pod/kube-controller-manager-cluster4-controlplane   Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "kube-controller-manage": executable file not found in $PATH: unknown
108s        Warning   BackOff        pod/kube-controller-manager-cluster4-controlplane   Back-off restarting failed container
You will see some errors as below
Warning   Failed   pod/kube-controller-manager-cluster4-controlplane   Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "kube-controller-manage": executable file not found in $PATH: unknown
Seems like it's trying to run the kube-controller-manage command, but it is supposed to run the kube-controller-manager command. So let's look into the kube-controller-manager manifest, which is present under /etc/kubernetes/manifests/kube-controller-manager.yaml on the cluster4-controlplane node. Let's SSH into cluster4-controlplane:
ssh cluster4-controlplane
vi /etc/kubernetes/manifests/kube-controller-manager.yaml
- Under containers: -> command: change kube-controller-manage to kube-controller-manager (a sketch of the corrected stanza is shown below) and restart the kube-controller-manager-cluster4-controlplane POD
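For reference, after the edit the top of the container spec in the manifest should look roughly like this; only the first command entry is the point here, the flag shown is illustrative since the full flag list varies per cluster:

spec:
  containers:
  - command:
    - kube-controller-manager        # was "kube-controller-manage"
    - --bind-address=127.0.0.1       # illustrative flag; keep the existing flags unchanged
    image: k8s.gcr.io/kube-controller-manager:v1.24.0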
kubectl delete pod kube-controller-manager-cluster4-controlplane -n kube-system
Now check the deployment again:
kubectl get deployment
The CURRENT count should now be equal to the DESIRED count for pink-depl-cka14-trb.
PV & PVC with Storage Class
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: coconut-pv-cka01-str
  labels:
    storage-tier: gold
spec:
  capacity:
    storage: 100Mi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /opt/coconut-stc-cka01-str
  storageClassName: coconut-stc-cka01-str
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - cluster1-node01
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: coconut-pvc-cka01-str
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Mi
  storageClassName: coconut-stc-cka01-str
  selector:
    matchLabels:
      storage-tier: gold
Troubleshooting: service not reachable from a pod
- Test if the service curlme-cka01-svcn is accessible from the pod curlpod-cka01-svcn or not:
kubectl exec curlpod-cka01-svcn -- curl curlme-cka01-svcn
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- 0:00:10 --:--:--     0
- We did not get any response. Check if the service is properly configured or not.
kubectl describe svc curlme-cka01-svcn
....
Name:              curlme-cka01-svcn
Namespace:         default
Labels:            <none>
Annotations:       <none>
Selector:          run=curlme-ckaO1-svcn
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.109.45.180
IPs:               10.109.45.180
Port:              <unset>  80/TCP
TargetPort:        80/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>
- The service has no endpoints configured. Since we are allowed to delete the resource, let's delete the service and create it again.
- To delete the service, use the command kubectl delete svc curlme-cka01-svcn.
- You can create the service the imperative way or the declarative way (a declarative sketch follows the imperative command below).
Using the imperative command:
kubectl expose pod curlme-cka01-svcn --port=80
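Alternatively, a declarative manifest along these lines should work; it assumes the pod still carries the default run=curlme-cka01-svcn label that kubectl run applies, so verify the pod's labels before applying:

apiVersion: v1
kind: Service
metadata:
  name: curlme-cka01-svcn
spec:
  type: ClusterIP
  selector:
    run: curlme-cka01-svcn   # must match the pod's actual labels
  ports:
  - port: 80
    targetPort: 80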
Possible reason for pod restarting due to liveness check fail
Notice the command sleep 3; touch /healthcheck; sleep 30;sleep 30000: it creates the /healthcheck file only after a delay of 3 seconds, but the liveness probe initialDelaySeconds is set to 1 and failureThreshold is also 1. Which means the POD will fail right after the first liveness check, which happens just 1 second after pod start. So to make it stable we must increase initialDelaySeconds to at least 5. A sketch of the adjusted probe is shown below.
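For illustration, the probe could end up looking roughly like this; the exec command checking /healthcheck is an assumption based on the container command above, and only the initialDelaySeconds value is the point:

livenessProbe:
  exec:
    command:
    - cat
    - /healthcheck          # assumed check; the probe only needs the file created by the container
  initialDelaySeconds: 5    # was 1; must exceed the 3-second delay before /healthcheck exists
  failureThreshold: 1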
Out of Memory error causing pod restarts
kubectl logs -f green-deployment-cka15-trb-xxxx
You will see some logs like these
2022-09-18 17:13:25 98 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2022-09-18 17:13:25 98 [Note] InnoDB: Memory barrier is not used
2022-09-18 17:13:25 98 [Note] InnoDB: Compressed tables use zlib 1.2.11
2022-09-18 17:13:25 98 [Note] InnoDB: Using Linux native AIO
2022-09-18 17:13:25 98 [Note] InnoDB: Using CPU crc32 instructions
2022-09-18 17:13:25 98 [Note] InnoDB: Initializing buffer pool, size = 128.0M
Killed
This might be due to a resources issue, especially memory, so let's try to recreate the POD to see if that helps.
kubectl delete pod green-deployment-cka15-trb-xxxx
Now watch closely the POD status
kubectl get pod
Pretty soon you will see the POD status change to OOMKilled, which confirms it's a memory issue. So let's look into the resources assigned to this deployment:
kubectl get deploy
kubectl edit deploy green-deployment-cka15-trb
- Under resources: -> limits: change memory from 256Mi to 512Mi (see the sketch below) and save the changes.
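A minimal sketch of the adjusted container resources, assuming only the memory limit changes and the other values stay as they were:

resources:
  limits:
    memory: 512Mi      # was 256Mi
  requests:
    memory: 256Mi      # illustrative; keep whatever the deployment already requests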
Now watch closely the POD status again
kubectl get pod
It should be stable now.
Troubleshooting cluster
ssh cluster4-controlplane
Let's take an etcd backup:
ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key snapshot save /opt/etcd-boot-cka18-trb.db
It might get stuck forever, so let's see why that would happen. Try to list the PODs first:
kubectl get pod -A
There might be an error like:
The connection to the server cluster4-controlplane:6443 was refused - did you specify the right host or port?
There seems to be some issue with the cluster so let's look into the logs
journalctl -u kubelet -f
You will see a lot of connect: connection refused errors, but that must be because the different cluster components are not able to connect to the API server, so filter out these logs to look more closely:
journalctl -u kubelet -f | grep -v 'connect: connection refused'
You should see some errors as below
cluster4-controlplane kubelet[2240]: E0923 04:38:15.630925 2240 file.go:187] "Could not process manifest file" err="invalid pod: [spec.containers[0].volumeMounts[1].name: Not found: \"etcd-cert\"]" path="/etc/kubernetes/manifests/etcd.yaml"
So it seems like there is some incorrect volume which etcd is trying to mount. Let's look into the etcd manifest:
vi /etc/kubernetes/manifests/etcd.yaml
Search for etcd-cert; you will notice that the volume name is etcd-certs, but the volume mount is trying to mount an etcd-cert volume, which is incorrect. Fix the volume mount name (a sketch of the matching entries is shown below) and save the changes. Let's restart the kubelet service after that:
systemctl restart kubelet
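For reference, the volumeMount name must match the volume name; after the fix the relevant parts of /etc/kubernetes/manifests/etcd.yaml should look roughly like this (the mount path shown is the usual kubeadm default and is an assumption here):

    volumeMounts:
    - mountPath: /etc/kubernetes/pki/etcd    # usual kubeadm default path
      name: etcd-certs                       # was etcd-cert; must match the volume name below
  volumes:
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs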
Wait a few minutes to see if it's good now.
kubectl get pod -A
You should be able to list the PODs now, so let's try to take the etcd backup again:
ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key snapshot save /opt/etcd-boot-cka18-trb.db
It should work now.
Rearranging kubectl outputs in columns
The easiest way to route traffic to a specific pod is by the use of labels and selectors. List the pods along with their labels:

student-node ~ ➜ kubectl get pods --show-labels -n spectra-1267
NAME     READY   STATUS    RESTARTS   AGE     LABELS
pod-12   1/1     Running   0          5m21s   env=dev,mode=standard,type=external
pod-34   1/1     Running   0          5m20s   env=dev,mode=standard,type=internal
pod-43   1/1     Running   0          5m20s   env=prod,mode=exam,type=internal
pod-23   1/1     Running   0          5m21s   env=dev,mode=exam,type=external
pod-32   1/1     Running   0          5m20s   env=prod,mode=standard,type=internal
pod-21   1/1     Running   0          5m20s   env=prod,mode=exam,type=external
Looks like there are a lot of pods created to confuse us, but we are only concerned with the labels of pod-23 and pod-21. As we can see, both of the required pods have the labels mode=exam,type=external in common. Let's confirm that using kubectl too:

student-node ~ ➜ kubectl get pod -l mode=exam,type=external -n spectra-1267
NAME     READY   STATUS    RESTARTS   AGE
pod-23   1/1     Running   0          9m18s
pod-21   1/1     Running   0          9m17s
Nice!! Now as we have figured out the labels, we can proceed further with the creation of the service:
student-node ~ ➜ kubectl create service clusterip service-3421-svcn -n spectra-1267 --tcp=8080:80 --dry-run=client -o yaml > service-3421-svcn.yaml
Now modify the service definition with selectors as required before applying to k8s cluster:
student-node ~ ➜ cat service-3421-svcn.yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    app: service-3421-svcn
  name: service-3421-svcn
  namespace: spectra-1267
spec:
  ports:
  - name: 8080-80
    port: 8080
    protocol: TCP
    targetPort: 80
  selector:
    app: service-3421-svcn   # delete
    mode: exam               # add
    type: external           # add
  type: ClusterIP
status:
  loadBalancer: {}
Finally let's apply the service definition:
student-node ~ ➜ kubectl apply -f service-3421-svcn.yaml
service/service-3421 created

student-node ~ ➜ k get ep service-3421-svcn -n spectra-1267
NAME           ENDPOINTS                     AGE
service-3421   10.42.0.15:80,10.42.0.17:80   52s
To store all the pod names along with their IPs, we could use an imperative command as shown below:
student-node ~ ➜ kubectl get pods -n spectra-1267 -o=custom-columns='POD_NAME:metadata.name,IP_ADDR:status.podIP' --sort-by=.status.podIP
POD_NAME   IP_ADDR
pod-12     10.42.0.18
pod-23     10.42.0.19
pod-34     10.42.0.20
pod-21     10.42.0.21
...

# store the output to /root/pod_ips
student-node ~ ➜ kubectl get pods -n spec
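The trailing command above is cut off in the source; the idea is simply to redirect the same custom-columns output into the file mentioned in the comment, roughly like this sketch:

kubectl get pods -n spectra-1267 -o=custom-columns='POD_NAME:metadata.name,IP_ADDR:status.podIP' --sort-by=.status.podIP > /root/pod_ips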
External endpoint for a service
- Let's check if the webserver is working or not:
student-node ~ ➜ curl student-node:9999
...
<h1>Welcome to nginx!</h1>
...
Now we will check if the service is correctly defined:

student-node ~ ➜ kubectl describe svc external-webserver-cka03-svcn
Name:        external-webserver-cka03-svcn
Namespace:   default
.
.
Endpoints:   <none>    # there are no endpoints for the service
...
As we can see, there are no endpoints specified for the service, hence we won't get any output. Since we cannot destroy any k8s object, let's create the endpoint manually for this service as shown below:
student-node ~ ➜ export IP_ADDR=$(ifconfig eth0 | grep inet | awk '{print $2}')

student-node ~ ➜ kubectl --context cluster3 apply -f - <<EOF
apiVersion: v1
kind: Endpoints
metadata:
  # the name here should match the name of the Service
  name: external-webserver-cka03-svcn
subsets:
  - addresses:
      - ip: $IP_ADDR
    ports:
      - port: 9999
EOF
Finally, check if the curl test works now:

student-node ~ ➜ kubectl --context cluster3 run --rm -i test-curl-pod --image=curlimages/curl --restart=Never -- curl -m 2 external-webserver-cka03-svcn
...
<title>Welcome to nginx!</title>
...
Running nslookup in pods
Switching to cluster1:
kubectl config use-context cluster1
Create the pod nginx-resolver-cka06-svcn and expose it internally:

student-node ~ ➜ kubectl run nginx-resolver-cka06-svcn --image=nginx
student-node ~ ➜ kubectl expose pod/nginx-resolver-cka06-svcn --name=nginx-resolver-service-cka06-svcn --port=80 --target-port=80 --type=ClusterIP
Create a pod test-nslookup and test that you are able to look up the service and pod names from within the cluster:

student-node ~ ➜ kubectl run test-nslookup --image=busybox:1.28 --rm -it --restart=Never -- nslookup nginx-resolver-service-cka06-svcn
student-node ~ ➜ kubectl run test-nslookup --image=busybox:1.28 --rm -it --restart=Never -- nslookup nginx-resolver-service-cka06-svcn > /root/CKA/nginx.svc.cka06.svcn
Get the IP of the nginx-resolver-cka06-svcn pod and replace the dots (.) with hyphens (-), which will be used below:

student-node ~ ➜ kubectl get pod nginx-resolver-cka06-svcn -o wide
student-node ~ ➜ IP=`kubectl get pod nginx-resolver-cka06-svcn -o wide --no-headers | awk '{print $6}' | tr '.' '-'`
student-node ~ ➜ kubectl run test-nslooku
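The last command above is truncated in the source; the pod lookup typically uses the <ip-with-hyphens>.<namespace>.pod DNS form, so a sketch of the remaining step (the output file name is an assumption) would be:

kubectl run test-nslookup --image=busybox:1.28 --rm -it --restart=Never -- nslookup $IP.default.pod > /root/CKA/nginx.pod.cka06.svcn   # output file name assumed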
Possible reason for pod stuck in pending state due to PVC stuck in pending state
kubectl get event --field-selector involvedObject.name=demo-pod-cka29-trb
You will see some Warnings like:
Warning FailedScheduling pod/demo-pod-cka29-trb 0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
This seems to be something related to PersistentVolumeClaims, so let's check that:
kubectl get pvc
You will notice that demo-pvc-cka29-trb is stuck in the Pending state. Let's dig into it:
kubectl get event --field-selector involvedObject.name=demo-pvc-cka29-trb
You will notice this error:
Warning   VolumeMismatch   persistentvolumeclaim/demo-pvc-cka29-trb   Cannot bind to requested volume "demo-pv-cka29-trb": incompatible accessMode
Which means the PVC is using an incompatible accessMode, so let's check it out:
kubectl get pvc demo-pvc-cka29-trb -o yaml
kubectl get pv demo-pv-cka29-trb -o yaml
Let's re-create the PVC with the correct access mode, i.e. ReadWriteMany:
kubectl get pvc demo-pvc-cka29-trb -o yaml > /tmp/pvc.yaml
vi /tmp/pvc.yaml
- Under spec: change accessModes: from ReadWriteOnce to ReadWriteMany (see the sketch below)
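A minimal sketch of how the relevant part of /tmp/pvc.yaml should read after the change (the storage request shown is a placeholder; keep whatever the exported PVC already contains):

spec:
  accessModes:
  - ReadWriteMany        # was ReadWriteOnce
  resources:
    requests:
      storage: 50Mi      # placeholder; keep the original value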
Delete the old PVC and create a new one:
kubectl delete pvc demo-pvc-cka29-trb
kubectl apply -f /tmp/pvc.yaml
Check the POD now
kubectl get pod demo-pod-cka29-trb
It should be good now.
Worker node not in ready state
SSH into cluster4-node01 and check if the kubelet service is running:
ssh cluster4-node01
systemctl status kubelet
You will see it's inactive, so try to start it:
systemctl start kubelet
Check again the status
systemctl status kubelet
It's still failing, so let's look into some recent error logs:
journalctl -u kubelet --since "30 min ago" | grep 'Error:'
You will see some errors as below:
cluster4-node01 kubelet[6301]: Error: failed to construct kubelet dependencies: unable to load client CA file /etc/kubernetes/pki/CA.crt: open /etc/kubernetes/pki/CA.crt: no such file or directory
Check if the /etc/kubernetes/pki/CA.crt file exists:
ls /etc/kubernetes/pki/
You will notice that the file name is ca.crt instead of CA.crt, so kubelet is probably looking for the wrong file. Let's fix the config:
vi /var/lib/kubelet/config.yaml
- Change clientCAFile from /etc/kubernetes/pki/CA.crt to /etc/kubernetes/pki/ca.crt (see the sketch below)
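In the kubelet config this field lives under the x509 authentication block, so after the fix it should look roughly like this (the surrounding field shown is the usual kubeadm default and is an assumption here):

authentication:
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt   # was CA.crt
  anonymous:
    enabled: false                             # usual kubeadm default, shown for context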
Try to start it again
systemctl start kubelet
Service should start now but there might be an error as below
ReportingInstance:""}': 'Post "https://cluster4-controlplane:6334/api/v1/namespaces/default/events": dial tcp 10.9.63.18:6334: connect: connection refused' (may retry after sleeping)
Sep 18 09:21:47 cluster4-node01 kubelet[6803]: E0918 09:21:47.641184 6803 kubelet.go:2419] "Error getting node" err="node \"cluster4-node01\" not found"
You must have noticed that it's trying to connect to the API server on port 6334, but the default port for kube-apiserver is 6443. Let's fix this. Edit the kubelet config:
vi /etc/kubernetes/kubelet.conf
- Change the server line from
  server: https://cluster4-controlplane:6334
to
  server: https://cluster4-controlplane:6443
Finally, restart the kubelet service:
systemctl restart kubelet
Check from the student-node now; cluster4-node01 should be in Ready state:
kubectl get node --context=cluster4
Daemonset not getting scheduled on the controlplane node
Check the status of the DaemonSet:
kubectl --context cluster2 get ds logs-cka26-trb -n kube-system
You will find that DESIRED, CURRENT, READY etc. have the value 2, which means two pods have been created. You can check the same by listing the PODs:
kubectl --context cluster2 get pod -n kube-system
You can check which nodes these are created on:
kubectl --context cluster2 get pod <pod-name> -n kube-system -o wide
Under NODE you will find the node name, so we can see that it's not scheduled on the controlplane node, which must be because it is missing the required tolerations. Let's edit the DaemonSet to fix the tolerations:
kubectl --context cluster2 edit ds logs-cka26-trb -n kube-system
Under tolerations: add the below toleration as well:
- key: node-role.kubernetes.io/control-plane
  operator: Exists
  effect: NoSchedule
Wait for some time and the PODs should get scheduled on all nodes now, including the controlplane node.
Several pods in kube-system namespace are restarting after some time
In such questions, a common reason is a failing health check: it fails after an interval and leads to the pod crashing.
You will see that the kube-controller-manager-cluster4-controlplane pod is crashing or restarting, so let's watch the logs:
kubectl logs -f kube-controller-manager-cluster4-controlplane --context=cluster4 -n kube-system
You will see some logs as below:
leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://10.10.129.21:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 10.10.129.21:6443: connect: connection refused
You will notice that somehow the connection to the kube api is breaking, let's check if kube api pod is healthy.
kubectl get pod --context=cluster4 -n kube-system
Now you might notice that the kube-apiserver-cluster4-controlplane pod is also restarting, so we should dig into its logs or relevant events:
kubectl logs -f kube-apiserver-cluster4-controlplane -n kube-system
kubectl get event --field-selector involvedObject.name=kube-apiserver-cluster4-controlplane -n kube-system
In events you will see this error
Warning Unhealthy pod/kube-apiserver-cluster4-controlplane Liveness probe failed: Get "https://10.10.132.25:6444/livez": dial tcp 10.10.132.25:6444: connect: connection refused
From this we can see that the liveness probe is failing for the kube-apiserver-cluster4-controlplane pod: it's trying to connect to port 6444, but the default API server port is 6443. So let's look into the kube-apiserver manifest:
ssh cluster4-controlplane
vi /etc/kubernetes/manifests/kube-apiserver.yaml
Under livenessProbe: you will see the port: value is 6444; change it to 6443 and save. Now wait a few seconds for the kube api pod to come up:
kubectl get pod -n kube-system
Watch the PODs status for some time and make sure these are not restarting now.
Pod not getting scheduled on any node (cannot edit pod config to fix it)
You will see that the cat-cka22-trb pod is stuck in the Pending state, so let's look into the events:
kubectl --context cluster2 get event --field-selector involvedObject.name=cat-cka22-trb
You will see some logs as below
Warning   FailedScheduling   pod/cat-cka22-trb   0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 3 Preemption is not helpful for scheduling.
So it seems like this POD is using node affinity; let's look into the POD to understand the affinity it's using:
kubectl --context cluster2 get pod cat-cka22-trb -o yaml
Under affinity: you will see it's looking for key: node and values: cluster2-node02, so let's verify if cluster2-node01 has this label applied:
kubectl --context cluster2 get node cluster2-node01 -o yaml
Look under labels: and you will not find any such label, so let's add the label the affinity expects to this node:
kubectl --context cluster2 label node cluster2-node01 node=cluster2-node02
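For context, the affinity block inside the pod spec that drives this behaviour looks roughly like the following sketch (the exact structure in the pod may differ slightly), which is why labelling a node with node=cluster2-node02 is enough to unblock scheduling without editing the pod:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node
          operator: In
          values:
          - cluster2-node02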