TODO for smoothly upgrading your Kubernetes version

#kubernetes #devops

Over the last few months, I have tried to come up with a simple TODO list that optimizes the upgrade process and makes it as smooth as possible for your workloads, with no downtime (by avoiding too many pods starting at the same time, hitting your container registry rate limit, etc.).

I am sharing it here in case it helps you with your first upgrade:

Preparation

Start by reading this guide for potential breaking changes (especially removed apiVersions: you need to update them before going any further!): https://kubernetes.io/docs/reference/using-api/deprecation-guide/
If you have a testing / staging cluster (I strongly suggest having one), always upgrade it before your production cluster
Monitor it for a few days, just to be sure that the upgrade has no bad side effects on your workloads
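
To spot removed apiVersions before the upgrade, a quick scan of your manifest files can help. This is only a minimal sketch: the function name find_removed_api_versions is made up for this example, the apiVersion list is a small sample of groups that the deprecation guide lists as removed (it is not exhaustive), and it only looks at files on disk, not at objects living in your cluster.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the YAML files under a directory that still
# declare one of a few apiVersions removed in past Kubernetes releases.
# The list is a partial sample; extend it for your target version.
find_removed_api_versions() {
  local dir="$1"
  local removed='extensions/v1beta1|networking\.k8s\.io/v1beta1|batch/v1beta1|policy/v1beta1|apiextensions\.k8s\.io/v1beta1'

  # -r: recurse, -l: print file names only, -E: extended regex
  grep -rlE --include='*.yaml' --include='*.yml' \
    "apiVersion:[[:space:]]*(${removed})[[:space:]]*\$" "$dir" | sort
}
```

For example, find_removed_api_versions ./manifests prints one file path per line, and prints nothing if none of the sampled apiVersions are found.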

Upgrading master nodes

Start by upgrading the Kubernetes version of your control plane (aka your master nodes), using whichever tool manages your cluster (kops, EKS, AKS, etc.)
If you are using cluster-autoscaler, scale it down: kubectl scale --replicas=0 deployment/cluster-autoscaler -n kube-system
If you are using a GitOps agent (like flux in this example), scale it down: kubectl scale --replicas=0 deployment/flux -n flux
Create a new node group running the same new Kubernetes version as the control plane
Set a 2-hour maintenance window in your monitoring tool
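
The two scale-down steps above can be wrapped in a small helper that skips deployments you do not have. This is a sketch under assumptions: the function name pause_deployment_if_present is hypothetical, and the namespaces and deployment names are the ones used in this article, which may differ in your cluster.

```shell
#!/usr/bin/env bash

# Hypothetical helper: scale a deployment down to 0 replicas,
# but only if it exists in the given namespace.
pause_deployment_if_present() {
  local namespace="$1" name="$2"

  if kubectl get deployment "$name" -n "$namespace" >/dev/null 2>&1; then
    kubectl scale --replicas=0 "deployment/$name" -n "$namespace"
  else
    echo "deployment/$name not found in $namespace, skipping"
  fi
}
```

You would then call pause_deployment_if_present kube-system cluster-autoscaler and pause_deployment_if_present flux flux before touching the worker nodes.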

Rolling out worker nodes

Drain the nodes of the old node group one by one. Use kubectl get nodes -o wide to pick the right ones (those running the old Kubernetes version): kubectl drain node_name --ignore-daemonsets --delete-emptydir-data
After each drain, wait for all evicted pods to restart correctly by running this command, which lists unhealthy pods across the cluster: kubectl get po -A | grep "0/" | grep -v "Completed"
Wait a few minutes until only a few lines are left, then move on to the next node
After the last node, once the command above returns no results at all, you are good to continue!
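
The "wait between drains" step can be scripted. The sketch below is a variation on the grep command above, not a replacement for it: it compares the two numbers of the READY column, so it also reports pods like 1/2 and does not trip on 10/10. The function names filter_unhealthy_pods and wait_for_healthy_pods, the default threshold and the polling interval are all assumptions made for this example.

```shell
#!/usr/bin/env bash

# Reads `kubectl get po -A` output on stdin and prints namespace/name for
# pods whose READY column shows fewer ready containers than expected,
# ignoring pods in the Completed status.
filter_unhealthy_pods() {
  awk 'NR > 1 {
    split($3, ready, "/")
    if (ready[1] != ready[2] && $4 != "Completed") print $1 "/" $2
  }'
}

# Polls the cluster until at most $1 unhealthy pods remain (default 0),
# sleeping $2 seconds between checks (default 15).
wait_for_healthy_pods() {
  local max_unhealthy="${1:-0}" interval="${2:-15}" count

  while true; do
    count="$(kubectl get po -A | filter_unhealthy_pods | wc -l | tr -d '[:space:]')"
    if [ "$count" -le "$max_unhealthy" ]; then
      break
    fi
    echo "$count unhealthy pods, waiting..."
    sleep "$interval"
  done
}
```

A typical use would be to run kubectl drain on one node, then wait_for_healthy_pods 3 before moving to the next one, and wait_for_healthy_pods 0 after the last node.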

Once ALL nodes of the old node group have been drained, you can delete it
Check that the deletion is complete by running kubectl get nodes -o wide again
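
To double-check that no node is left on the old version, you can filter the node list on its VERSION column. A minimal sketch, assuming the default column layout of kubectl get nodes (VERSION in fifth position); the function name nodes_not_on_version is hypothetical and the version prefix is whatever you are upgrading to.

```shell
#!/usr/bin/env bash

# Reads `kubectl get nodes` output on stdin and prints the names of nodes
# whose VERSION column does not start with the prefix given as $1.
# A prefix is used because some providers append a suffix to the version.
nodes_not_on_version() {
  local target="$1"
  awk -v target="$target" 'NR > 1 && index($5, target) != 1 { print $1 }'
}
```

For instance, kubectl get nodes -o wide | nodes_not_on_version v1.24. should print nothing once the old node group is gone (v1.24. being a placeholder for your new version).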

Wrapping up

If you are using cluster-autoscaler, upgrade it to a matching version:
find the latest release that matches the new k8s version of your cluster: https://github.com/kubernetes/autoscaler/releases
type the version number (e.g. 1.24) in the search field at the top right to filter easily
update the Docker image tag of cluster-autoscaler: kubectl -n kube-system set image deployment.apps/cluster-autoscaler cluster-autoscaler=registry.k8s.io/autoscaling/cluster-autoscaler:v1.MINOR.PATCH
scale it back up, then make sure the pod starts correctly by checking its logs: kubectl scale --replicas=1 deployment/cluster-autoscaler -n kube-system, followed by kubectl logs deployment/cluster-autoscaler -n kube-system
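
Before setting the new image, it can be worth checking that the cluster-autoscaler tag you picked lines up with your cluster version. The sketch below only compares the first two version numbers as strings; the function names minor_of and autoscaler_matches_cluster are invented for this example.

```shell
#!/usr/bin/env bash

# Prints the first two numbers of a version string,
# e.g. "1.24" for "v1.24.3" or "v1.24.7-eks-fb459a0".
minor_of() {
  printf '%s\n' "$1" | sed -E 's/^v?([0-9]+\.[0-9]+).*/\1/'
}

# Succeeds when the autoscaler tag ($1) and the cluster version ($2)
# share the same first two version numbers.
autoscaler_matches_cluster() {
  [ "$(minor_of "$1")" = "$(minor_of "$2")" ]
}
```

You could guard the set image command with something like autoscaler_matches_cluster "$tag" "$cluster_version" || echo "version mismatch", where both variables are values you provide yourself.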

If you are using a GitOps agent (like flux in this example), scale it back up, check its logs and make sure that it syncs correctly: kubectl scale --replicas=1 deployment/flux -n flux
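
Scaling the agent back up and checking that it actually started can be chained in one helper. This is a sketch, not the only way to do it: the function name resume_deployment, the 120 second timeout and the 20 log lines are arbitrary choices for this example.

```shell
#!/usr/bin/env bash

# Hypothetical helper: scale a deployment back up, wait for the rollout
# to finish, then show its most recent log lines. Each step only runs
# if the previous one succeeded.
resume_deployment() {
  local namespace="$1" name="$2" replicas="${3:-1}"

  kubectl scale --replicas="$replicas" "deployment/$name" -n "$namespace" &&
    kubectl rollout status "deployment/$name" -n "$namespace" --timeout=120s &&
    kubectl logs "deployment/$name" -n "$namespace" --tail=20
}
```

With the names used in this article, that would be resume_deployment flux flux. You still need to read the logs yourself to confirm that the sync works.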

Check your monitoring tools and resolve muted alerts that may have been triggered by the rollout
Announce to your team that the rollout is done and all went well :)
Commit and push all the version changes you made to your cluster repo, if you have one (I strongly suggest it)
That's it!

If you have any suggestions to improve this TODO, do not hesitate to let me know in the comments below. Thanks for reading and I wish you a great day!
