One day I found a crashed AKS cluster where I used to have a healthy one. Yes, you read that right: a cluster with 3 nodes simply stopped working on all three nodes at the same time.
At first, for some reason, I thought it was a problem with the portal, and that the pods were still working fine while I had only lost a proper response from the control plane (the side of Kubernetes you don't manage in a managed cluster). To my surprise, no: the complete cluster had crashed, with all the pods in "Terminating" as the control plane desperately tried to reschedule them on a healthy node (sadly, there were no healthy nodes).
And now what?
Well, my first attempt was to check what the portal was reporting, so I went to the portal, selected my cluster and went directly to "node pools". To my (not) surprise, the node pool was in "not ready" status, indicating 3/3 nodes were in "Failed" status... nice :)
So my second attempt was to "simply" spin up a new node_pool, super simple... well, I was on an unsupported version, so you can't do that.
Lesson learned #1
If your control plane is on a version that is no
longer available in Azure to create new clusters,
you can't create node_pools with that version
either. You need to update your control plane
first in order to create node_pools.
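A quick way to find out whether you are in this situation before an incident is to compare the control plane version with the versions Azure still offers in your region. This is only a minimal sketch: the function name `check_aks_version` and its arguments are hypothetical, it assumes the az CLI is logged in, and it relies on the table output of `az aks get-versions` containing the full version string.

```shell
# Hypothetical helper: reports whether the control plane version of a cluster
# is still offered by Azure in the given region.
# Assumes the az CLI is installed and already logged in.
check_aks_version() {
  local rg="$1" cluster="$2" location="$3"
  local current
  # Version the control plane is currently running.
  current="$(az aks show --resource-group "$rg" --name "$cluster" \
    --query kubernetesVersion --output tsv)"
  echo "control plane version: $current"
  # The table output lists the versions Azure currently offers in the region.
  if az aks get-versions --location "$location" --output table \
      | grep -qwF "$current"; then
    echo "version is still offered in $location"
  else
    echo "version is NOT offered in $location: upgrade before adding node pools"
  fi
}
```

You would call it as `check_aks_version <resource_group> <cluster_name> <location>`, for example from a scheduled job that alerts you when the second message shows up.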
Ok, no new node_pool for me (bad Javi, you should keep your clusters updated!!). Next idea? Yes, add another node to the existing node_pool. That creates another node with the same image as the others, and it is something that DOES work even when you are on an outdated version of Kubernetes.
Well... when you create your cluster you can choose between two kinds of network plugins in AKS, the regular "kubenet" and "Azure CNI". There are slight differences between them; the most notable one is that Azure CNI pre-allocates IPs in the subnet of your cluster when you create a new node in the node_pool, because it reserves one subnet IP for each pod that could be created on the node. In this case my subnet was too small to fit another node, so no, I couldn't create another node because of the IP starvation in my subnet...
Lesson learned #2
When you plan your Kubernetes environment, think
about future growth and leave extra room for
a few more nodes. If you don't create them
you don't pay for them, but you still have plenty
of space in your subnet.
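To put numbers on this, here is a rough sizing sketch. It assumes the classic Azure CNI behaviour described above (one IP for the node plus one per potential pod), that Azure reserves 5 addresses in every subnet, and a limit of 30 pods per node; the function names and the prefix lengths are made up for illustration and are not the values from my cluster.

```shell
# IPs one Azure CNI node consumes: 1 for the node itself + 1 per potential pod.
ips_per_node() {
  local max_pods="$1"
  echo $(( max_pods + 1 ))
}

# Usable addresses in a subnet with the given prefix length,
# assuming Azure reserves 5 addresses per subnet.
usable_ips() {
  local prefix="$1"
  echo $(( (1 << (32 - prefix)) - 5 ))
}

# Number of nodes that fit in the subnet for a given max-pods setting.
max_nodes() {
  local prefix="$1" max_pods="$2"
  echo $(( $(usable_ips "$prefix") / $(ips_per_node "$max_pods") ))
}

max_nodes 25 30   # prints 3: a /25 is full with three nodes
max_nodes 24 30   # prints 8
```

Keep in mind that upgrades usually need room for at least one extra (surge) node, so the number you can run day to day is one less than what `max_nodes` prints.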
Not all is darkness and desolation
At this point nothing was working, so I was not losing anything by cordoning my nodes and trying to delete one of them. It was not possible from the portal: any scaling operation took around 20 minutes only to fail.
I ran
kubectl cordon <node_name>
on my three nodes and then tried a
kubectl drain <node_name>
on one of them. It complained about the daemonsets and the stateful pods, so I forced them to be removed anyway, and then... it failed to drain the node. If you think a bit about it, it actually makes sense: who is going to drain the node (a simple pod delete in the background) if the kubelet is dead? Absolutely no one. So I simply deleted the node:
kubectl delete node <node_name>
What could I lose?
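The sequence above can be sketched as one small function. This is not exactly what I typed during the incident: the function name `remove_dead_node` and the 120 second timeout are assumptions, and on older kubectl releases the emptyDir flag is called `--delete-local-data` instead of `--delete-emptydir-data`.

```shell
# Hypothetical helper: remove a node whose kubelet is dead.
# Assumes kubectl is already pointed at the affected cluster.
remove_dead_node() {
  local node="$1"
  # Stop the scheduler from placing new pods on the node.
  kubectl cordon "$node"
  # Try a bounded drain; with a dead kubelet this is expected to time out.
  # Older kubectl releases name the emptyDir flag --delete-local-data.
  if ! kubectl drain "$node" --ignore-daemonsets --force \
      --delete-emptydir-data --timeout=120s; then
    echo "drain failed for $node, deleting the node object anyway" >&2
  fi
  # Last resort: delete the node object from the API server.
  kubectl delete node "$node"
}
```

The timeout matters here: without it, a drain against a node with no kubelet can sit there waiting for pods that will never confirm their termination.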
IT WORKED! The node was gone. I checked the Azure portal and it was actually deleted (I have the suspicion that the nodes were not even there anymore: they crashed and the Portal never updated the status).
The next thing I noticed was that my node count in the Portal was still 3, despite the fact that I only had 2 visible in there. That was weird, but I decided to wait 20 minutes and see if a new node spawned on its own... no, it never happened, so I simply scaled up to 4 nodes. My idea was that AKS would try to match the new node count by creating a new node, and that's exactly what happened!!!
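If you prefer the az CLI over the portal for this step, a sketch of the same trick could look like the following. The function name `bump_node_count` is hypothetical, and it assumes the az CLI is logged in; it reads the node count AKS has registered for the pool and asks for one more.

```shell
# Hypothetical helper: scale a node pool to "registered count + 1" so that
# AKS has to create one brand new node.
# Assumes the az CLI is installed and already logged in.
bump_node_count() {
  local rg="$1" cluster="$2" pool="$3"
  local current target
  # Node count AKS has registered for the pool (not what kubectl shows).
  current="$(az aks nodepool show --resource-group "$rg" \
    --cluster-name "$cluster" --name "$pool" --query count --output tsv)"
  target=$(( current + 1 ))
  echo "registered count is $current, scaling to $target" >&2
  az aks scale --resource-group "$rg" --name "$cluster" \
    --nodepool-name "$pool" --node-count "$target"
}
```

Comparing the registered count with the output of `kubectl get nodes` before and after is a cheap way to spot the kind of mismatch I ran into.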
Lesson learned #3
The Azure Portal is sometimes a bit buggy.
Take what it shows with care, trust your
toolkit and try to obtain as much information
as you can in different ways.
Lesson learned #4
Think outside of the box. I was at 3 nodes
already but was seeing only 2, so to create
a new one I decided to scale up to 4. This
makes no sense, but it actually worked.
The (slow) resurrection of the cluster
So now I had a new node, WITH the same version my previous ones had AND without needing to update my control plane.
As soon as this node was ready, all my deployments tried desperately to schedule their pods on it (good luck fitting all of them there). I repeated the step for the second node, this time scaling to 5, and it worked again!
So after a few minutes I had 3 nodes (1 crashed and 2 healthy) and all my pods running fine.
Here is what I learned in the end, and the knowledge I can pass on to you:
- ALWAYS send your Kubernetes logs somewhere else, in my case Log Analytics. Once a node crashes, you can't check its logs anymore.
- Set up a good set of alerts to know when something goes wrong, even for those cases where you might think you are covered because you are running a managed cluster.
- Remember, if everything is broken, you can't break it even more. Don't be afraid of deleting or breaking something a little bit more in the pursuit of fixing the problem; after all, the cluster was already not working.
- If you can, document every step you take. It is helpful for when you have a similar problem, or just to share with others (or your team).
- Have your Kubernetes cheat sheet handy, you never know when you will need some command that you rarely use (I will be posting something about this soon).
Some things to add in this case that I found useful to share are:
1) The whole problem was caused initially because the service principal used by AKS to interact with Azure resources had its secret/password expired, so part of the solution was to update the secret in AAD, update AKS to use the new one and then follow the steps described here.
2) If this service principal expires, AKS gets into a "failed" status, which blocks any future Kubernetes version update (but the workloads will keep working). To resolve this you need to use the az CLI to upgrade the AKS control plane to the same version you are currently running, as described here. This seems to do a kind of restart of the control plane, but remember to do this after you do the service principal password reset, update AKS and do the other steps in this article.
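For reference, the two az CLI parts of that fix can be sketched like this. The function names are hypothetical, the sketch assumes the az CLI is logged in with enough permissions on the service principal, and it is split in two functions on purpose because the node recovery steps from this article go in between.

```shell
# Hypothetical helper, step 1: reset the service principal secret and tell
# AKS to use the new one. Assumes the az CLI is installed and logged in.
rotate_aks_credentials() {
  local rg="$1" cluster="$2"
  local sp_id secret
  # Service principal the cluster is currently using.
  sp_id="$(az aks show --resource-group "$rg" --name "$cluster" \
    --query servicePrincipalProfile.clientId --output tsv)"
  # Generate a new secret for that service principal.
  secret="$(az ad sp credential reset --id "$sp_id" \
    --query password --output tsv)"
  # Push the new secret to the cluster.
  az aks update-credentials --resource-group "$rg" --name "$cluster" \
    --reset-service-principal --service-principal "$sp_id" \
    --client-secret "$secret"
}

# Hypothetical helper, step 2 (run it last): "upgrade" the control plane to
# the version it already runs, to get AKS out of the "failed" status.
reconcile_control_plane() {
  local rg="$1" cluster="$2"
  local version
  version="$(az aks show --resource-group "$rg" --name "$cluster" \
    --query kubernetesVersion --output tsv)"
  az aks upgrade --resource-group "$rg" --name "$cluster" \
    --kubernetes-version "$version" --yes
}
```

Usage would be `rotate_aks_credentials <resource_group> <cluster_name>` first and `reconcile_control_plane <resource_group> <cluster_name>` once the nodes are back.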
That's the update on this case so far. If I get more information in the future, I will update this article.
I am still investigating the root cause of this. It is not common at all for all the nodes in your cluster to crash suddenly, but I don't think I can get much information: the node is completely crashed, I can't even debug it, and my only hope is the node logs (which I store in another place).
I apologize for not including pictures in the post. I was trying to fix the problem while trying to extract something educational from the incident, and I totally forgot to capture some screenshots.
If you enjoyed this post/story please let me know in the comments or with a heart, and if you are interested in knowing more about my adventures in the wild, remember to follow me on my networks.
See you in another post, thank you for reading!
Top comments (2)
Thanks for the account of this harrowing experience.
I admire your composure facing this disaster. For most of us, it's not very easy to think straight under the pressure of a down system. Greatly appreciate that you took the time to document this so that others can learn from it as well.
Especially helpful to me are the recommendation take-aways. Step-by-step instructions are good if it's a repeatable process, but in this case, your recommendations are far more helpful for troubleshooting because these are sensible concepts that can help us work through a variety of different situations, rather than just this one event.
Hi Dom! Thank you for your kind words, they mean a lot to me and I am super happy that this experience can be converted into something others can learn from!
Let's keep rocking!