Daniele Polencic

Pod rebalancing and allocations in Kubernetes

Does Kubernetes rebalance your Pods?

If there's a node that has more space, does Kubernetes recompute and balance the workloads?

Let's have a look at an example.

You have a cluster with a single node that can host 2 Pods.

If the node crashes, you will experience downtime.

To prevent this, you could add a second node and run one Pod on each.

A Kubernetes cluster with a single node

You provision a second node.

What happens next?

Does Kubernetes notice that there's space for your Pod?

Does it move the second Pod and rebalance the cluster?

Does Kubernetes move the Pods to the less utilized node?

Unfortunately, it does not.

But why?

When you define a Deployment, you specify:

  • The template for the Pod.
  • The number of copies (replicas).

A Kubernetes deployment
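
As a minimal sketch, a Deployment with two replicas might look like this (the `app` name, labels, and nginx image are placeholders for illustration):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app            # hypothetical name, for illustration only
spec:
  replicas: 2          # the number of copies
  selector:
    matchLabels:
      app: app
  template:            # the template for the Pod
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: nginx:1.25  # placeholder image
```

Notice that the spec only pins the replica count; it says nothing about where those replicas should run.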

But nowhere in that file did you say that you want one replica on each node!

The ReplicaSet counts 2 Pods, and that matches the desired state.

Kubernetes won't take any further action.

In other words, Kubernetes does not rebalance your pods automatically.

But you can fix this.

There are three popular options:

  1. Pod (anti-)affinity.
  2. Pod topology spread constraints.
  3. The Descheduler.

The first option is to use pod anti-affinity.

With pod anti-affinity, your Pods repel other pods with the same label, forcing them to be on different nodes.

You can read more about pod anti-affinity in the official Kubernetes documentation.

Example of a pod anti-affinity
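
Here's a sketch, reusing the hypothetical `app` Deployment from above, of an anti-affinity rule that keeps replicas on separate nodes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      affinity:
        podAntiAffinity:
          # a Pod refuses to be scheduled on a node that
          # already runs a Pod with the label app=app
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: app
              topologyKey: kubernetes.io/hostname
      containers:
        - name: app
          image: nginx:1.25
```

The `IgnoredDuringExecution` suffix in the rule's name already hints at the caveat below.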

Notice that pod anti-affinity is only evaluated when the scheduler allocates the Pods.

It is not applied retroactively, so you might need to delete a few pods to force the scheduler to recompute the allocations.

Kubernetes does not rebalance two pods that have pod anti-affinity and are already allocated to the same node

Alternatively, you can use topology spread constraints to control how Pods are spread across your cluster among failure domains such as regions, zones, nodes, etc.

This is similar to pod anti-affinity but more powerful.

Spreading Pods across failure domains

With topology spread constraints, you can pick the topology, choose the Pod distribution (skew), decide what happens when the constraint can't be fulfilled (schedule anyway vs. don't schedule), and control how it interacts with pod affinity and taints.

Example of pod topology spread constraints
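
As a sketch, a constraint that spreads the hypothetical `app` Pods evenly across nodes could look like this:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                          # max allowed difference in Pod count between domains
          topologyKey: kubernetes.io/hostname # the failure domain: individual nodes
          whenUnsatisfiable: DoNotSchedule    # the alternative is ScheduleAnyway
          labelSelector:
            matchLabels:
              app: app
      containers:
        - name: app
          image: nginx:1.25
```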

However, even in this case, the scheduler evaluates topology spread constraints when the pod is allocated.

It is not applied retroactively, but you can still delete the Pods to force the scheduler to reallocate them.

Kubernetes does not rebalance two pods that have pod topology spread constraints and are already allocated to the same node

If you want to rebalance your pods dynamically (not just when the scheduler allocates them), you should check out the Descheduler.

The Descheduler scans your cluster at regular intervals, and if it finds a node that is more utilized than the others, it deletes a Pod on that node.

The Descheduler deletes pods

What happens when a Pod is deleted?

The ReplicaSet will create a new Pod, and the scheduler will likely place it on a less utilized node.

If your Pod has topology spread constraints or pod anti-affinity, it will be allocated accordingly.

The Kubernetes scheduler will allocate pods efficiently

The Descheduler can evict pods based on policies such as:

  • Node utilization.
  • Pod age.
  • Failed pods.
  • Duplicates.
  • Affinity or taint violations.

If your cluster has been running for a while, the resource utilization might not be balanced anymore.

The following two strategies, LowNodeUtilization and HighNodeUtilization, can be used to rebalance your cluster based on CPU, memory, or number of Pods.

Descheduler configuration to rebalance under- and overutilized nodes
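
As an example, here's a sketch of a policy in the Descheduler's v1alpha1 format using the LowNodeUtilization strategy (the percentage thresholds are illustrative, not recommendations):

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:         # nodes below all of these are considered underutilized
          cpu: 20
          memory: 20
          pods: 20
        targetThresholds:   # nodes above any of these are considered overutilized
          cpu: 50
          memory: 50
          pods: 50
```

The Descheduler evicts Pods from the overutilized nodes, in the hope that the scheduler places the replacements on the underutilized ones.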

Another practical policy is preventing developers and operators from treating pods like virtual machines.

You can use the descheduler to ensure pods only run for a fixed time (e.g. seven days).

Deleting Pods after 7 days of uptime
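
In the same v1alpha1 policy format, a sketch of that rule with the PodLifeTime strategy:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 604800  # 7 days
```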

Lastly, you can combine the Descheduler with Node Problem Detector and Cluster Autoscaler to automatically remove Nodes with problems.

The Descheduler can then be used to deschedule workloads from those Nodes.

The Descheduler is an excellent choice for keeping your cluster's efficiency in check, but it isn't installed by default.

It can be deployed as a Job, CronJob or Deployment.
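
For example, a CronJob that runs the Descheduler every couple of minutes might look roughly like this. This is a sketch modelled on the project's published manifests: the `descheduler-sa` ServiceAccount (with its RBAC) and the `descheduler-policy-configmap` ConfigMap are assumed to exist, and you should pick a current image tag.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: descheduler-cronjob
  namespace: kube-system
spec:
  schedule: "*/2 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: descheduler-sa   # assumed to exist, with the required RBAC
          restartPolicy: "Never"
          containers:
            - name: descheduler
              image: registry.k8s.io/descheduler/descheduler:v0.29.0  # pick a current tag
              command: ["/bin/descheduler"]
              args: ["--policy-config-file", "/policy-dir/policy.yaml"]
              volumeMounts:
                - mountPath: /policy-dir
                  name: policy-volume
          volumes:
            - name: policy-volume
              configMap:
                name: descheduler-policy-configmap  # holds the policy.yaml from above
```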
