
Kubernetes: ensuring High Availability for Pods

We have a Kubernetes cluster where WorkerNodes are scaled by Karpenter, and the Karpenter NodePool is configured with the disruption.consolidationPolicy=WhenUnderutilized parameter. This means that Karpenter will try to "consolidate" the placement of Pods on Nodes in order to maximize the use of CPU and Memory resources.
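
For reference, the relevant part of such a NodePool looks roughly like this (a minimal fragment of a karpenter.sh/v1beta1 NodePool; the name is illustrative and the rest of the spec is omitted):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    # Karpenter may evict Pods and terminate under-utilized Nodes
    # to pack the workloads onto fewer Nodes
    consolidationPolicy: WhenUnderutilized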

In general everything works, but it means that WorkerNodes are sometimes recreated, which causes our Pods to be "migrated" to other Nodes.

So, the task now is to make sure that scaling and the consolidation process do not cause interruptions in the operation of our services.

Actually, this topic is not so much about Karpenter itself as about ensuring the stability of Pods in Kubernetes in general. But since I ran into it while using Karpenter, we will talk a little about Karpenter as well.

Karpenter Disruption Flow

To better understand what’s happening with our Pods, let’s take a quick look at how Karpenter removes a WorkerNode from the pool. See Termination Controller.

After Karpenter discovers that a Node needs to be terminated, it:

  1. adds a finalizer to the Kubernetes WorkerNode
  2. adds the karpenter.sh/disruption:NoSchedule taint to that Node so that Kubernetes does not schedule new Pods on it
  3. if necessary, creates a new Node to which it will move the Pods from the Node being taken out of service (or uses an existing Node if it can accept the additional Pods according to their requests)
  4. performs Pod Eviction of the Pods on the Node (see Safely Drain a Node and API-initiated Eviction)
  5. after all Pods except DaemonSet Pods are removed from the Node, Karpenter deletes the corresponding NodeClaim
  6. removes the finalizer from the Node, which allows Kubernetes to delete the Node
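
You can observe steps 1 and 2 directly on a Node that Karpenter is draining, for example (the Node name is a placeholder; expect to see the karpenter.sh/disruption taint and the karpenter.sh finalizer):

$ kubectl get node ip-10-1-54-144.ec2.internal -o jsonpath='{.spec.taints}'
$ kubectl get node ip-10-1-54-144.ec2.internal -o jsonpath='{.metadata.finalizers}'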

Kubernetes Pod Eviction Flow

And briefly, how Kubernetes itself performs Pod Eviction:

  1. the API Server receives an Eviction request and checks whether the Pod can be evicted (for example, whether its eviction would violate a PodDisruptionBudget — we will talk about PodDisruptionBudgets later in this post)
  2. marks the Pod resource for deletion
  3. the kubelet starts the graceful shutdown process, that is, sends the SIGTERM signal to the Pod's containers
  4. Kubernetes removes the Pod's IP from the list of endpoints
  5. if the Pod has not stopped within the termination grace period, the kubelet sends the SIGKILL signal to kill the process immediately
  6. the kubelet tells the API Server that the Pod can be removed from the list of objects
  7. the API Server removes the Pod object from the database

See How API-initiated eviction works and Pod Lifecycle — Termination of Pods.
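
Since steps 3–5 rely on the application shutting down cleanly on SIGTERM, it is worth making sure the Pod spec gives it enough time. A minimal sketch (the values here are just examples, not recommendations):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-demo
  template:
    metadata:
      labels:
        app: nginx-demo
    spec:
      # how long the kubelet waits after SIGTERM before sending SIGKILL (30 seconds by default)
      terminationGracePeriodSeconds: 60
      containers:
        - name: nginx-demo-container
          image: nginx:latest
          ports:
            - containerPort: 80
          lifecycle:
            preStop:
              exec:
                # a short pause so in-flight requests can finish while the Pod's IP is removed from the endpoints
                command: ["sleep", "5"]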

Kubernetes Pod High Availability Options

So, what can we do with our Pods to make the service work without interruption, regardless of Karpenter's activities?

  • have at least 2 Pods for critical services
  • use Pod Topology Spread Constraints so that the Pods are placed on different WorkerNodes — then if the Node with the first Pod is killed, the second Pod on another Node stays alive
  • have a PodDisruptionBudget so that at least 1 Pod is always available — this prevents Karpenter from evicting all the Pods at once, because it respects the PDB
  • and to guarantee that Pod Eviction is not performed at all, we can set the karpenter.sh/do-not-disrupt annotation on a Pod — then Karpenter will ignore this Pod (and, accordingly, the Node it is running on)

Let’s take a look at these options in more detail.

Kubernetes Deployment replicas

The simplest and most obvious solution is to have at least 2 simultaneously working Pods.

Although this does not guarantee that Kubernetes will not evict them at the same time, it is a minimum condition for further actions.

So either run kubectl scale deployment nginx-demo-deployment --replicas=2 manually, or update the replicas field of a Deployment/StatefulSet/ReplicaSet (see Workload Resources):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-demo
  template:
    metadata:
      labels:
        app: nginx-demo
    spec:
      containers:
        - name: nginx-demo-container
          image: nginx:latest
          ports:
            - containerPort: 80

Pod Topology Spread Constraints

I already wrote about this in the Pod Topology Spread Constraints post, but in short: we can set rules for placing Kubernetes Pods so that they end up on different WorkerNodes. This way, when Karpenter wants to take one Node out of service, we still have a Pod running on another Node.

However, no one can prevent Karpenter from draining both Nodes at the same time, so this is not a 100% guarantee, but it is the second condition for ensuring the stability of our service.

In addition, with the Pod Topology Spread Constraints, we can specify the placement of Pods in different Availability Zones, which is a must-have option when building a High-Availability architecture.

So we add topologySpreadConstraints to our Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-demo
  template:
    metadata:
      labels:
        app: nginx-demo
    spec:
      containers:
        - name: nginx-demo-container
          image: nginx:latest
          ports:
            - containerPort: 80
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: nginx-demo

And now the two Pods are scheduled on different WorkerNodes:

$ kk get pod -l app=nginx-demo -o json | jq '.items[].spec.nodeName'
"ip-10-1-54-144.ec2.internal"
"ip-10-1-45-7.ec2.internal"
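
To also spread the Pods across Availability Zones, as mentioned above, the same mechanism can be used with the topology.kubernetes.io/zone Node label. A sketch of such constraints for the same Pod template (assuming the Nodes carry the standard zone label):

      topologySpreadConstraints:
        # spread the Pods across Availability Zones
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: nginx-demo
        # and across WorkerNodes within a zone
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: nginx-demo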

See also Scaling Kubernetes with Karpenter: Advanced Scheduling with Pod Affinity and Volume Topology Awareness.

Kubernetes PodDisruptionBudget

With a PodDisruptionBudget, we can set a rule for the minimum number of available or the maximum number of unavailable Pods. The value can be either an absolute number or a percentage of the replicas of a Deployment/StatefulSet/ReplicaSet.

In the case of a Deployment that has two Pods spread across different WorkerNodes by topologySpreadConstraints, this ensures that Karpenter will not drain both WorkerNodes at the same time. Instead, it will "relocate" one Pod first, kill its Node, and only then repeat the process for the other Node.

See Specifying a Disruption Budget for your Application.

Let's create a PDB for our Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-demo
  template:
    metadata:
      labels:
        app: nginx-demo
    spec:
      containers:
        - name: nginx-demo-container
          image: nginx:latest
          ports:
            - containerPort: 80
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: nginx-demo
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-demo-pdb
spec:
  minAvailable: 50%
  selector:
    matchLabels:
      app: nginx-demo

Deploy and check:

$ kk get pdb
NAME             MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
nginx-demo-pdb   50%             N/A               1                     21s
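
The same budget can also be expressed the other way around with maxUnavailable instead of minAvailable; with two replicas the effect is the same — at most one Pod can be voluntarily disrupted at a time:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-demo-pdb
spec:
  # at most one Pod from this selector may be voluntarily disrupted at a time
  maxUnavailable: 1
  selector:
    matchLabels:
      app: nginx-demo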

The karpenter.sh/do-not-disrupt annotation

In addition to the settings on the Kubernetes side, we can explicitly prohibit Karpenter itself from disrupting a Pod by adding the karpenter.sh/do-not-disrupt annotation (before the Beta API, these were the karpenter.sh/do-not-evict and karpenter.sh/do-not-consolidate annotations).

This may be necessary, for example, for Pods that run as a single instance (like a VictoriaMetrics VMSingle instance) and that you do not want to be stopped.

To do this, add the annotation to the Pod template:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-demo
  template:
    metadata:
      labels:
        app: nginx-demo
      annotations:
        karpenter.sh/do-not-disrupt: "true"        
    spec:
      containers:
        - name: nginx-demo-container
          image: nginx:latest
          ports:
            - containerPort: 80
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: nginx-demo
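
After the deploy, you can check that the annotation is actually present on the running Pod, and, if needed, set it on an already running Pod directly:

$ kubectl get pod -l app=nginx-demo -o jsonpath='{.items[*].metadata.annotations}'
$ kubectl annotate pod -l app=nginx-demo karpenter.sh/do-not-disrupt=true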

See Pod-Level Controls. In general, these seem to be the main solutions that help ensure continuous operation of our Pods.

Originally published at RTFM: Linux, DevOps, and system administration.

