Over the last few months, I tried to come up with a simple TODO list to make the upgrade process as smooth as possible for your workload, with no downtime (for example by avoiding too many pods starting at the same time, hitting your container registry rate limit, etc.).
So I am sharing it here in case it can help you with your first upgrade :
- First, read this article to check for potential breaking changes (especially removed apiVersions : you need to update them before going further!) : https://kubernetes.io/docs/reference/using-api/deprecation-guide/
- If you have one, always start by upgrading your testing / staging cluster before your production one (I strongly suggest it)
- Monitor it for a few days to make sure the upgrade has no bad side effects on your workload
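To make the first step a bit more concrete, here is a minimal sketch that lists which apiVersion values your manifests use, so you can compare them with the deprecation guide. It assumes your manifests are YAML files in a local directory and that each apiVersion key starts at the beginning of a line. The function name list_api_versions is hypothetical, it is not part of any tool.

```shell
#!/usr/bin/env bash

# list_api_versions DIR
# Prints each distinct top-level apiVersion found in the *.yaml / *.yml files
# under DIR (default: current directory), followed by its number of occurrences.
# Assumption: manifests are plain YAML with "apiVersion:" at the start of a line.
list_api_versions() {
  local dir="${1:-.}"
  grep -rhE --include='*.yaml' --include='*.yml' '^apiVersion:' "$dir" \
    | awk '{print $2}' \
    | sort \
    | uniq -c \
    | awk '{print $2, $1}'
}
```

For example, list_api_versions ./manifests (a hypothetical path) prints one line per apiVersion. This only covers files in your repo, not objects created by hand in the cluster, so treat it as a first pass.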
Upgrading master nodes
- Start by upgrading the Kubernetes version of your Control Plane (aka your master nodes), using whichever tool you rely on (kops, EKS, AKS, etc.)
- If you are using cluster-autoscaler, scale it down :
kubectl scale --replicas=0 deployment/cluster-autoscaler -n kube-system
- If you are using a GitOps agent (like flux in this example), scale it down :
kubectl scale --replicas=0 deployment/flux -n flux
- Create a new node group running the same new Kubernetes version as the Control Plane
- Set a 2-hour maintenance window in your monitoring tool
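Since the two scale-down steps above only apply if you actually run these components, here is a small sketch that scales a deployment only when it exists. The namespaces and deployment names are the ones used in the commands above. The function names scale_if_present and pause_controllers are hypothetical helpers, so adapt them to your setup.

```shell
#!/usr/bin/env bash

# scale_if_present NAMESPACE DEPLOYMENT REPLICAS
# Scales the deployment only if it exists, otherwise prints a notice and moves on.
scale_if_present() {
  local ns="$1" deploy="$2" replicas="$3"
  if kubectl get deployment "$deploy" -n "$ns" >/dev/null 2>&1; then
    kubectl scale --replicas="$replicas" "deployment/$deploy" -n "$ns"
  else
    echo "deployment/$deploy not found in namespace $ns, skipping"
  fi
}

# pause_controllers
# Scales down the components that could interfere with the rollout.
# Assumption: same names and namespaces as in the checklist commands.
pause_controllers() {
  scale_if_present kube-system cluster-autoscaler 0
  scale_if_present flux flux 0
}
```

Run pause_controllers before creating the new node group. The same scale_if_present helper can be reused with 1 replica at the end of the rollout.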
Rolling out worker nodes
- Drain the nodes of the old node group one by one. Use kubectl get nodes -o wide to pick the right ones (those running the old Kubernetes version) :
kubectl drain node_name --ignore-daemonsets --delete-emptydir-data
- After each drain, wait for all evicted pods to restart correctly. Run this command to list the unhealthy pods across the cluster :
kubectl get po -A | grep "0/" | grep -v "Completed"
- Wait a few minutes until you only have a few lines left, then move to the next node
- After the last node, once the command above returns no results at all, you are good to continue!
- Once ALL nodes of the old node group have been drained, you can delete it
- Check that the deletion is complete by running again :
kubectl get nodes -o wide
- If you are using cluster-autoscaler, upgrade the version used :
  - find the latest release that matches the new k8s version of your cluster : https://github.com/kubernetes/autoscaler/releases
  - type the version number of your cluster (1.MINOR) in the search field at the top right to filter easily
  - update the Docker image tag of cluster-autoscaler :
kubectl -n kube-system set image deployment.apps/cluster-autoscaler cluster-autoscaler=k8s.gcr.io/autoscaling/cluster-autoscaler:v1.MINOR.PATCH
  - scale it back up, then make sure the pod starts correctly by checking its logs :
kubectl scale --replicas=1 deployment/cluster-autoscaler -n kube-system
- If you are using a GitOps agent (like flux in this example), scale it back up, check its logs and make sure that it syncs correctly :
kubectl scale --replicas=1 deployment/flux -n flux
- Check your monitoring tools and resolve muted alerts that may have been triggered by the rollout
- Announce to your team that the rollout is done and all went well :)
- Commit and push all the version changes you made to your cluster repo if you have one (I strongly suggest it)
- That's it!
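If you want to script the worker rollout part, the drain-and-wait loop above could be sketched like this. It is only a starting point under a few assumptions : the old nodes are identified by an exact match on their kubelet version (the VERSION column of kubectl get nodes), "a few lines left" is translated into a threshold of 3 unhealthy pods, and there is no timeout. All function names are hypothetical.

```shell
#!/usr/bin/env bash

# count_unhealthy_pods
# Counts pods with no ready container that are not Completed,
# using the same filter as the checklist command.
count_unhealthy_pods() {
  kubectl get po -A | grep "0/" | grep -vc "Completed"
}

# wait_for_pods MAX_UNHEALTHY [INTERVAL_SECONDS]
# Blocks until at most MAX_UNHEALTHY pods are unhealthy.
# Assumption: no timeout, so interrupt it manually if a pod never recovers.
wait_for_pods() {
  local max="$1" interval="${2:-30}" count
  while true; do
    count=$(count_unhealthy_pods)
    [ "$count" -le "$max" ] && break
    echo "$count unhealthy pods left, waiting ${interval}s..."
    sleep "$interval"
  done
}

# old_nodes VERSION
# Prints the names of the nodes whose kubelet version is exactly VERSION.
old_nodes() {
  kubectl get nodes --no-headers \
    -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion \
    | awk -v v="$1" '$2 == v {print $1}'
}

# drain_old_nodes VERSION [THRESHOLD]
# Drains every node running VERSION, one by one, waiting between drains
# until at most THRESHOLD pods (default 3, an arbitrary choice) are unhealthy.
drain_old_nodes() {
  local version="$1" threshold="${2:-3}" node
  for node in $(old_nodes "$version"); do
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
    wait_for_pods "$threshold"
  done
  # After the last node, wait until nothing is unhealthy anymore.
  wait_for_pods 0
}
```

You would call it with the old version exactly as it is displayed for your nodes, for example drain_old_nodes v1.20.0 (a made-up version here). I would still keep an eye on the output during the first runs rather than trust the loop blindly.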
If you have any suggestions to improve this TODO list, do not hesitate to let me know in the comments below. Thanks for reading and I wish you a great day!