Disaster recovery is not an easy job: applications are no longer deployed on a "fixed" server/machine, and Kubernetes workloads cannot be backed up in the traditional manner.
We're going to talk about how we can maintain or regain the use of stateful components after a disaster occurs. By disaster we mean a non-graceful shutdown of Kubernetes nodes. If you're not familiar with Kubernetes architecture, check Probing K8s Architecture.
A non-graceful shutdown is one in which the node shuts down without the kubelet daemon (running on each worker node) knowing about it; as a result, the pods on that node also shut down ungracefully.
For stateless workloads that's often not a problem, but for stateful ones (e.g. those controlled by a StatefulSet) the consequences are more serious: a Pod might be stuck indefinitely in `Terminating`, and the StatefulSet cannot create a replacement because the pod still exists in the cluster (on the failed node). As a result, the application running on the StatefulSet may be degraded or even offline.
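Such a stuck pod is easy to spot with `kubectl`. In the sketch below, `my-db` and `worker-2` are placeholder names for a StatefulSet workload and the failed node; adjust them for your cluster:

```shell
# List the StatefulSet's pods together with the node each one runs on
# ("app=my-db" is an assumed label selector for the workload):
kubectl get pods -l app=my-db -o wide

# Illustrative output after worker-2 was powered off ungracefully --
# the pod stays in Terminating and no replacement is created:
#   NAME      READY   STATUS        RESTARTS   AGE   NODE
#   my-db-0   1/1     Terminating   0          3d    worker-2
```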
Luckily for us, Kubernetes v1.24 introduces alpha support for Non-Graceful Node Shutdown. This feature allows stateful workloads to fail over to a different node after the original node is shut down or left in a non-recoverable state, such as a hardware failure or a broken OS. Moreover, the Kubernetes team plans to promote the Non-Graceful Node Shutdown implementation to Beta in either 1.25 or 1.26.
In this particular case, the feature gate behind this behavior is `NodeOutOfServiceVolumeDetach`, which is currently in Alpha (meaning disabled by default). In order to use it, we need to enable it via the `--feature-gates` command-line flag of the kube-controller-manager component, which we can locate with `kubectl get pods -n kube-system | grep controller-manager`.
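As a rough sketch, here is how enabling the gate looks on a kubeadm-based cluster, where kube-controller-manager runs as a static pod defined in a manifest under `/etc/kubernetes/manifests` (the path and setup are assumptions; managed clusters expose feature gates differently):

```shell
# On the control-plane node, edit the kube-controller-manager static pod
# manifest and add the feature gate to the container's command:
sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml
#   spec:
#     containers:
#     - command:
#       - kube-controller-manager
#       - --feature-gates=NodeOutOfServiceVolumeDetach=true
#       ...

# The kubelet detects the manifest change and restarts the static pod;
# verify the component came back up:
kubectl get pods -n kube-system | grep controller-manager
```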
Note: after enabling the aforementioned feature, and only after the node has actually been powered off, we can apply the out-of-service taint: `kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute`.
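Putting it together, the failover step looks like this (`worker-2` is again a placeholder for the powered-off node):

```shell
# Taint the powered-off node as out of service:
kubectl taint nodes worker-2 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

# Watch the stuck pod get force-deleted and its replacement
# scheduled on another running node:
kubectl get pods -o wide --watch
```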
As a result, the PersistentVolumes attached to the shut-down node will be detached, and for StatefulSets, replacement pods will be created successfully on a different running node. Last but not least, we need to manually remove the out-of-service taint after the pods have moved to a new node and we have verified that the shut-down node has recovered, since we were the ones who originally added the taint.
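Removing a taint uses the same `kubectl taint` command with a trailing `-` appended to the taint (node name is a placeholder):

```shell
# Once the node has been recovered and rejoined the cluster,
# remove the out-of-service taint (note the trailing "-"):
kubectl taint nodes worker-2 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
```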