CKA Prep - Troubleshooting
This post is part of a series which contains my study notes for the Certified Kubernetes Administrator (CKA) exam.
Note: Unless specifically indicated, text and examples in this post all come directly from the official Kubernetes documentation. I attempted to locate and extract the relevant portions of the kubernetes.io documentation that applied to the exam objective. However, I encourage you to do your own reading. I cannot guarantee that I got all of the important sections.
Troubleshooting
The Exam Curriculum breaks down the fifth exam topic into the following objectives:
- Evaluate cluster and node logging
- Understand how to monitor applications
- Manage container stdout & stderr logs
- Troubleshoot application failure
- Troubleshoot cluster component failure
- Troubleshoot networking
Evaluate Cluster and Node Logging
Relevant search terms for Kubernetes Documentation: logging
Kubernetes Documentation Links
Concepts
- Run `kubectl get nodes` to check the nodes in the cluster.
- Run `kubectl -n kube-system get pods` to check the pods in the `kube-system` namespace.
- Run `kubectl -n kube-system logs <pod-name>` to view the logs for a pod.
- Check kubelet logs by running `journalctl --unit=kubelet` on a system with systemd installed. If systemd is not installed, the components that are not running in containers log to `/var/log`.
- Pod logs are usually in `/var/log/pods`.
- Check syslog for messages from cluster components. For example, to find kube-apiserver entries: `grep kube-apiserver /var/log/syslog`
- Check the container logs directly from the container runtime using `crictl ps` and then `crictl logs <container-id>`.
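The checks above can be grouped into a few small shell helpers. This is only a sketch: the function names (`cluster_overview`, `node_kubelet_logs`, `container_runtime_logs`) are hypothetical, and the last two assume you are logged in to the node itself, that it runs systemd, and that `crictl` is installed.

```shell
# Hypothetical helper: node and kube-system pod status, run from any machine with kubectl access.
cluster_overview() {
  kubectl get nodes -o wide
  kubectl -n kube-system get pods -o wide
}

# Hypothetical helper: recent kubelet logs from journald. Run on the node itself.
# $1 = how far back to look, in journalctl --since syntax (default: "10 minutes ago").
node_kubelet_logs() {
  journalctl --unit=kubelet --since "${1:-10 minutes ago}" --no-pager
}

# Hypothetical helper: last 50 log lines of one container, read straight from the runtime.
# $1 = container ID as printed by `crictl ps -a`.
container_runtime_logs() {
  crictl logs --tail 50 "$1"
}
```

For example, `node_kubelet_logs "1 hour ago"` widens the window when the failure happened earlier.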
Understand how to Monitor Applications
Relevant search terms for Kubernetes Documentation: monitor
Kubernetes Documentation Links
Concepts
If the metrics server has been deployed on the Kubernetes cluster, then the Horizontal Pod Autoscaler will pull metrics data to scale pods based on resource utilization, and the `kubectl top` command can be used to determine which pods are using the most memory and CPU resources on the cluster.
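A minimal sketch of how `kubectl top` might be used to find the heaviest consumers. The function names `top_pods` and `top_containers` are hypothetical, and both calls only return data if the metrics server is installed.

```shell
# Hypothetical helper: pods across all namespaces, sorted by resource usage.
# $1 = "cpu" or "memory" (default: cpu).
top_pods() {
  kubectl top pods --all-namespaces --sort-by="${1:-cpu}"
}

# Hypothetical helper: per-container usage for a single pod.
# $1 = pod name, $2 = namespace (default: default).
top_containers() {
  kubectl top pod "$1" --namespace "${2:-default}" --containers
}
```

Calling `top_pods memory` lists the pods in descending order of memory usage.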
Manage Container Stdout & Stderr Logs
Relevant search terms for Kubernetes Documentation: logs
Kubernetes Documentation Links
Concepts
- To view the logs from a container running in a pod, use the `kubectl logs` command: `kubectl logs <Pod-Name>`
- If there is more than one container running in the pod, you need to specify the container when running the `kubectl logs` command: `kubectl logs <Pod-Name> -c <Container-Name>`
- “You can use `kubectl logs --previous` to retrieve logs from a previous instantiation of a container.” For example: `kubectl logs <Pod-Name> -c <Container-Name> --previous`
- “You can use `kubectl logs` to view logs from a pod that is part of a deployment.” For example: `kubectl logs deploy/<Deployment-Name>`
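The log commands above could be wrapped as follows. This is a sketch, not an official workflow: the function names are hypothetical, and the `--tail=100` limit is an arbitrary choice to keep the output short.

```shell
# Hypothetical helper: current and previous logs for one container.
# $1 = pod name, $2 = container name.
container_logs() {
  echo "--- current logs for $1/$2 ---"
  kubectl logs "$1" -c "$2" --tail=100
  echo "--- previous logs for $1/$2 ---"
  # --previous fails when the container has never restarted, so tolerate the error.
  kubectl logs "$1" -c "$2" --previous --tail=100 || echo "(no previous instance found)"
}

# Hypothetical helper: logs from every container in a pod,
# with each line prefixed by the pod and container it came from.
all_container_logs() {
  kubectl logs "$1" --all-containers=true --prefix=true
}
```

The previous-instance logs are usually the interesting ones for a container stuck in a crash loop, since the current instance may not have logged anything yet.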
Troubleshoot Application Failure
Relevant search terms for Kubernetes Documentation: troubleshoot, debug pods
Kubernetes Documentation Links
- Debug Pods
- Resource Management for Pods and Containers - Troubleshooting
- Troubleshooting Applications
Concepts
- “My Pods are pending with event message FailedScheduling”
  - “If the scheduler cannot find any node where a Pod can fit, the Pod remains unscheduled until a place can be found. An Event is produced each time the scheduler fails to find a place for the Pod. You can use `kubectl` to view the events for a Pod; for example:” `kubectl describe pod frontend | grep -A 9999999999 Events`
  - “In general, if a Pod is pending with a message of this type, there are several things to try:”
    - “Add more nodes to the cluster.”
    - “Terminate unneeded Pods to make room for pending Pods.”
    - “Check that the Pod is not larger than all the nodes. For example, if all the nodes have a capacity of `cpu: 1`, then a Pod with a request of `cpu: 1.1` will never be scheduled.”
    - “Check for node taints. If most of your nodes are tainted, and the new Pod does not tolerate that taint, the scheduler only considers placements onto the remaining nodes that don’t have that taint.”
  - You can check node capacities and amounts allocated with the `kubectl describe nodes` command.
- “My container is terminated”
  - “Your container might get terminated because it is resource-starved. To check whether a container is being killed because it is hitting a resource limit, call `kubectl describe pod <Pod_Name>` on the Pod of interest.”
  - If the container was terminated previously, the termination reason is indicated in the output from the describe command.
  - For example, you might find that the container was terminated due to excessive memory utilization. “Your next step might be to check the application code for a memory leak. If you find that the application is behaving how you expect, consider setting a higher memory limit (and possibly request) for that container.”
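As an alternative to scanning the full `kubectl describe` output, the same information can be pulled out with a field selector and a JSONPath query. This is a sketch under assumptions: the function names are hypothetical, and the namespace defaults to `default` when none is given.

```shell
# Hypothetical helper: only the FailedScheduling events for one pod.
# $1 = pod name, $2 = namespace (default: default).
scheduling_events() {
  kubectl get events --namespace "${2:-default}" \
    --field-selector "involvedObject.name=$1,reason=FailedScheduling"
}

# Hypothetical helper: name of each container and the reason its last instance
# terminated (for example OOMKilled). The reason is empty if it never terminated.
# $1 = pod name, $2 = namespace (default: default).
last_termination_reason() {
  kubectl get pod "$1" --namespace "${2:-default}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}
```

For the pod from the quoted example, `scheduling_events frontend` shows why the scheduler could not place it.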
Troubleshoot Cluster Component Failure
Relevant search terms for Kubernetes Documentation: troubleshoot cluster
Kubernetes Documentation Links
Concepts
- Check the status of all the nodes in the cluster by running `kubectl get nodes`.
- To get detailed information about the health of the cluster, run `kubectl cluster-info dump`.
- “As with Pods, you can use `kubectl describe node` and `kubectl get node -o yaml` to retrieve detailed information about nodes.”
- Reviewing Kubernetes Logs
  - “Here are the locations of the relevant log files. On systemd-based systems, you may need to use `journalctl` instead of examining log files.”
  - NOTE: Depending on the Kubernetes deployment approach, some Kubernetes components may be running as static pods. In this case, the logs will be in `/var/log/pods`.
  - Control Plane Nodes
    - “`/var/log/kube-apiserver.log` - API Server, responsible for serving the API”
    - “`/var/log/kube-scheduler.log` - Scheduler, responsible for making scheduling decisions”
    - “`/var/log/kube-controller-manager.log` - a component that runs most Kubernetes built-in controllers, with the notable exception of scheduling (the kube-scheduler handles scheduling)”
  - Worker Nodes
    - “`/var/log/kubelet.log` - logs from the kubelet, responsible for running containers on the node”
    - “`/var/log/kube-proxy.log` - logs from `kube-proxy`, which is responsible for directing traffic to Service endpoints”
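When the control plane components run as static pods, their logs can also be read through the API server as long as it is still reachable. The sketch below assumes a kubeadm-style cluster, where static pods carry a `component=<name>` label; the function names and the default dump directory are hypothetical.

```shell
# Hypothetical helper: quick health pass over nodes and control plane pods.
cluster_health() {
  kubectl get nodes
  kubectl cluster-info
  kubectl -n kube-system get pods
}

# Hypothetical helper: recent logs for one control plane component.
# $1 = component name, e.g. kube-apiserver, kube-scheduler, kube-controller-manager.
# Assumes the kubeadm convention of labeling static pods with component=<name>.
control_plane_logs() {
  kubectl -n kube-system logs -l "component=$1" --tail=50
}

# Hypothetical helper: write the full cluster dump to files instead of stdout.
# $1 = target directory (default: ./cluster-state, an arbitrary choice).
cluster_dump() {
  kubectl cluster-info dump --output-directory="${1:-./cluster-state}"
}
```

If the API server itself is down, these calls fail, and the log files on the node or `crictl logs` are the fallback.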
Troubleshoot Networking
Relevant search terms for Kubernetes Documentation: troubleshoot
Kubernetes Documentation Links
Concepts
- Service Troubleshooting
  - Make sure that the port the container listens on (`containerPort` in the Pod spec) matches the `targetPort` in the Service.
  - Verify that the selectors in the Service match the labels in the Pod spec.
  - Check to confirm that the Service endpoints are correct.
  - Verify that kube-proxy is running.
- Review the cluster troubleshooting tips outlined above.
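The Service checks above might be scripted like this. It is a sketch under assumptions: `service_check` is a hypothetical name, and the `k8s-app=kube-proxy` label is the one kubeadm applies, so other installers may use a different label.

```shell
# Hypothetical helper: show what a Service selects and whether anything backs it.
# $1 = service name, $2 = namespace (default: default).
service_check() {
  # Selector and targetPort(s) declared on the Service; compare with the pod labels and containerPort.
  kubectl get service "$1" --namespace "${2:-default}" \
    -o jsonpath='{.spec.selector}{"\n"}{.spec.ports[*].targetPort}{"\n"}'
  # Endpoints should list one address per ready pod that matches the selector.
  kubectl get endpoints "$1" --namespace "${2:-default}"
  # kube-proxy pods; assumes the kubeadm label k8s-app=kube-proxy.
  kubectl get pods --namespace kube-system -l k8s-app=kube-proxy
}
```

An empty endpoints list usually points to a selector that does not match any pod labels, or to pods that are not ready.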
- DNS Troubleshooting
  - Verify the `/etc/resolv.conf` file on a pod to confirm that the DNS configuration is correct.
  - Perform a DNS lookup on a pod using the `nslookup kubernetes.default` command to confirm that DNS is resolving correctly.
  - Check the `kube-system` namespace to confirm that the `coredns` pods are running. Often the `coredns` pods are part of a deployment.
  - If the `coredns` pods are running, check their logs for errors.
  - Check to confirm that the Kubernetes DNS Service is running in the `kube-system` namespace. “Note: The service name is kube-dns for both CoreDNS and kube-dns deployments.”
  - Verify that the Pods are exposed to the service as endpoints by running `kubectl get endpoints kube-dns --namespace=kube-system`.
  - Enable logging for the `coredns` pods by modifying the `coredns` configmap using the command `kubectl -n kube-system edit configmap coredns` and adding `log` to the Corefile.
  - After logging has been enabled, make some queries and view the logs.
  - The service account used by the `coredns` pods must be able to list service and endpoint related resources to properly resolve service names.
  - Check the cluster role `system:coredns` to confirm that it has the correct permissions.
  - Confirm that you are in the correct namespace when you are querying DNS. “DNS queries that don’t specify a namespace are limited to the pod’s namespace. If the namespace of the pod and service differ, the DNS query must include the namespace of the service.”
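The DNS checks above can be run in sequence with helpers like these. This is a sketch, not a definitive procedure: the function names are hypothetical, the test pod must already be running with an image that includes `nslookup` (many minimal images do not), and the `k8s-app=kube-dns` label is the one used for the DNS pods in the Kubernetes DNS debugging documentation.

```shell
# Hypothetical helper: checks run from inside an existing pod.
# $1 = name of a running pod whose image includes nslookup.
dns_check_from_pod() {
  kubectl exec "$1" -- cat /etc/resolv.conf
  kubectl exec "$1" -- nslookup kubernetes.default
}

# Hypothetical helper: checks against the DNS components in kube-system.
dns_check_cluster() {
  kubectl get pods --namespace=kube-system -l k8s-app=kube-dns    # are the coredns pods running?
  kubectl logs --namespace=kube-system -l k8s-app=kube-dns        # any errors in their logs?
  kubectl get svc --namespace=kube-system kube-dns                # does the DNS Service exist?
  kubectl get endpoints kube-dns --namespace=kube-system          # are pods exposed as endpoints?
  kubectl describe clusterrole system:coredns                     # can CoreDNS list services and endpoints?
}
```

To test a Service in another namespace, the lookup needs the namespace in the name, for example `nslookup <service>.<namespace>`.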
And that’s a wrap for this topic.