The Ops Community ⚙️

Daniele Polencic

Kubernetes scheduler deep dive

The scheduler is in charge of deciding where your pods are deployed in the cluster.

It might sound like an easy job, but it's rather complicated!

Let's start with the basics.

When you submit a deployment with kubectl, the API server receives the request, and the resource is stored in etcd.

Who creates the pods?

A pod resource is stored in etcd

It's a common misconception that it's the scheduler's job to create the pods.

Instead, the controller manager creates them (and the associated ReplicaSet).

The controller manager creates the pods

At this point, the pods are stored as "Pending" in etcd and are not assigned to any node.

They are also added to the scheduler's queue, ready to be assigned.

Pods are added to the scheduler queue.

The scheduler processes Pods one by one through two phases:

  1. Scheduling phase (what node should I choose?).
  2. Binding phase (let's write to the database that this pod belongs to that node).

Pods are allocated one at a time
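
Here's a toy sketch of that loop in Python. It is not how the real scheduler is implemented: the queue, the `nodes` dictionary, and the "most free slots" rule are all made up for illustration.

```python
from collections import deque

# Hypothetical in-memory stand-ins for the scheduler queue and the cluster state.
pending = deque(["pod-a", "pod-b", "pod-c"])  # pods waiting to be scheduled
nodes = {"node-1": 2, "node-2": 1}            # node name -> free pod slots
bindings = {}                                 # pod -> node (the "database" write)


def schedule(pod, nodes):
    """Scheduling phase: choose a node (here, the one with the most free slots)."""
    candidates = [name for name, free in nodes.items() if free > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda name: nodes[name])


def bind(pod, node, nodes, bindings):
    """Binding phase: record the decision and consume one slot on the node."""
    bindings[pod] = node
    nodes[node] -= 1


def run(pending, nodes, bindings):
    """Process pods one at a time; return the pods that could not be placed."""
    unschedulable = []
    while pending:
        pod = pending.popleft()
        node = schedule(pod, nodes)
        if node is None:
            unschedulable.append(pod)  # would stay Pending in a real cluster
            continue
        bind(pod, node, nodes, bindings)
    return unschedulable


leftover = run(pending, nodes, bindings)
print(bindings, leftover)
```

Keeping `schedule` and `bind` as separate functions mirrors the two phases: the first only makes a decision, the second is the only one that changes state.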

The scheduling phase is divided into two parts. The scheduler:

  1. Filters relevant nodes (using a list of functions called predicates)
  2. Ranks the remaining nodes (using a list of functions called priorities)

Let's have a look at an example.

The scheduler filters and scores nodes

Consider the following cluster with nodes with and without GPU.

Also, a few nodes are already running at full capacity.

A collection of Kubernetes nodes

You want to deploy a Pod that requires a GPU.

You submit the pod to the cluster, and it's added to the scheduler queue.

The scheduler discards all nodes that don't have GPU (filter phase).

All non-GPU nodes are discarded

Next, the scheduler scores the remaining nodes.

In this example, the fully utilized nodes are scored lower.

In the end, the empty node is selected.

The remaining nodes are scored
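
The GPU example can be sketched in a few lines of Python. This is a simplified model, not the scheduler's actual code: the `Node` class, the node names, and the "100 minus utilization" score are assumptions chosen to match the story above.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    has_gpu: bool
    used_percent: int  # how busy the node already is, 0 to 100


def filter_nodes(nodes, needs_gpu):
    """Filter phase: discard nodes that cannot run the pod at all."""
    return [n for n in nodes if n.has_gpu or not needs_gpu]


def score_node(node):
    """Score phase: busier nodes get a lower score (a made-up heuristic)."""
    return 100 - node.used_percent


def pick_node(nodes, needs_gpu):
    """Filter first, then return the highest-scoring node (or None)."""
    feasible = filter_nodes(nodes, needs_gpu)
    if not feasible:
        return None  # no node passed the filters; the pod stays Pending
    return max(feasible, key=score_node)


cluster = [
    Node("cpu-1", has_gpu=False, used_percent=10),
    Node("cpu-2", has_gpu=False, used_percent=50),
    Node("gpu-1", has_gpu=True, used_percent=100),
    Node("gpu-2", has_gpu=True, used_percent=0),
]

print(pick_node(cluster, needs_gpu=True).name)
```

Notice that the emptiest non-GPU node never gets a chance: scoring only happens on the nodes that survive the filter.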

What are some examples of filters?

  • NodeUnschedulable prevents pods from landing on nodes marked as unschedulable.
  • VolumeBinding checks if the node can bind the requested volume.

The default filtering phase has 13 predicates.

Default predicates in the Kubernetes scheduler

Here are some examples of scoring:

  • ImageLocality prefers nodes that already have the container image downloaded locally.
  • NodeResourcesBalancedAllocation prefers nodes whose resource usage (for example, CPU and memory) would be more balanced after the pod is placed.

There are 13 functions to decide how to score and rank nodes.

Default functions to score nodes in Kubernetes

How can you influence the scheduler's decisions?

  • nodeSelector
  • Node affinity
  • Pod affinity/anti-affinity
  • Taints and tolerations
  • Topology constraints
  • Scheduler profiles

nodeSelector is the most straightforward mechanism.

You assign a label to a node and reference that label in the pod's nodeSelector field.

The pod can only be deployed on nodes with that label.

Assigning pods to nodes with the nodeSelector
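
The matching rule is simple enough to sketch in Python. The node names and labels below are invented; the only idea taken from Kubernetes is that every key/value pair in the pod's nodeSelector has to be present on the node.

```python
def matches_node_selector(node_labels, node_selector):
    """True when every key/value pair in the nodeSelector is set on the node."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Hypothetical nodes and their labels.
labelled_nodes = {
    "node-1": {"disktype": "ssd", "zone": "a"},
    "node-2": {"disktype": "hdd", "zone": "a"},
}

# What the pod would carry in its nodeSelector field.
pod_node_selector = {"disktype": "ssd"}

eligible = [
    name
    for name, labels in labelled_nodes.items()
    if matches_node_selector(labels, pod_node_selector)
]
print(eligible)
```

There is no "maybe" here: a node either carries all the requested labels or it is out.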

Node affinity extends nodeSelector with a more flexible interface.

You can still tell the scheduler where the Pod should be deployed, but you can also have soft and hard constraints.

Assigning pods to nodes with node affinity
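
To make the hard versus soft distinction concrete, here's a rough Python sketch. In the Pod spec these correspond to `requiredDuringSchedulingIgnoredDuringExecution` and `preferredDuringSchedulingIgnoredDuringExecution`; everything else (the tuple format, the node labels, only three operators) is a simplification I made up for the example.

```python
def satisfies(labels, requirement):
    """Evaluate one (key, operator, values) requirement against node labels."""
    key, operator, values = requirement
    if operator == "In":
        return labels.get(key) in values
    if operator == "NotIn":
        return labels.get(key) not in values
    if operator == "Exists":
        return key in labels
    raise ValueError(f"unsupported operator: {operator}")


def rank_nodes(nodes, required, preferred):
    """Hard constraints remove nodes; soft constraints only add to the score."""
    ranked = []
    for name, labels in nodes.items():
        if not all(satisfies(labels, req) for req in required):
            continue  # a hard constraint failed, so the node is not eligible
        score = sum(weight for weight, req in preferred if satisfies(labels, req))
        ranked.append((score, name))
    return sorted(ranked, reverse=True)


# Hypothetical cluster and constraints.
affinity_nodes = {
    "node-1": {"zone": "a", "disktype": "ssd"},
    "node-2": {"zone": "a", "disktype": "hdd"},
    "node-3": {"zone": "b", "disktype": "ssd"},
}
required = [("zone", "In", ["a"])]               # hard: must be in zone a
preferred = [(50, ("disktype", "In", ["ssd"]))]  # soft: SSD is nice to have

print(rank_nodes(affinity_nodes, required, preferred))
```

node-3 is excluded even though it has an SSD, while node-2 stays in the list with a lower score. That's the whole difference between the two kinds of constraint.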

With Pod affinity/anti-affinity, you can ask the scheduler to place a pod next to a specific pod.

Or not.

For example, you could have a deployment with anti-affinity on itself to force spreading pods.

Scheduling pods with pod affinity and anti-affinity
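
Here's a small Python sketch of the "anti-affinity on itself" trick. It assumes a hard (required) rule scoped to a single node, and it picks the first free node rather than scoring them; the function and node names are hypothetical.

```python
def place_with_anti_affinity(replicas, nodes, app_label):
    """Place replicas so that no node hosts two pods with the same app label."""
    placements = {}
    labels_on_node = {name: set() for name in nodes}  # node -> app labels running there
    for index in range(replicas):
        pod = f"{app_label}-{index}"
        free = [name for name in nodes if app_label not in labels_on_node[name]]
        if not free:
            placements[pod] = None  # no node satisfies the rule; the pod stays Pending
            continue
        chosen = free[0]
        labels_on_node[chosen].add(app_label)
        placements[pod] = chosen
    return placements


print(place_with_anti_affinity(4, ["node-1", "node-2", "node-3"], "web"))
```

With four replicas and three nodes, the fourth pod has nowhere to go. That's the trade-off of a hard anti-affinity rule: you get spreading, but you can't run more replicas than you have nodes.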

With taints and tolerations, nodes are tainted, and a tainted node repels every pod that doesn't tolerate the taint.

This is similar to node affinity, but there's a notable difference: with Node affinity, Pods are attracted to nodes.

Taints are the opposite - they allow a node to repel pods.

Scheduling pods with taints and tolerations

Moreover, taints can repel pods with three effects: evict (NoExecute), "don't schedule" (NoSchedule), and "prefer don't schedule" (PreferNoSchedule).

Personal note: this is one of the most difficult APIs I've worked with.

I always (and consistently) get it wrong as it's hard (for me) to reason in double negatives.
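
Writing the rule down as code helps me untangle the double negatives. The sketch below is a simplified model: the effect names (`NoSchedule`, `PreferNoSchedule`, `NoExecute`) and the `Equal`/`Exists` operators are the real ones, but the matching logic skips some corner cases, and it only covers scheduling (not the eviction of running pods that `NoExecute` also triggers).

```python
def tolerates(toleration, taint):
    """Simplified check: does this toleration cancel out this taint?"""
    if toleration.get("key") != taint["key"]:
        return False
    # A toleration without an effect is treated as matching any effect.
    if toleration.get("effect") not in (None, taint["effect"]):
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True  # "Exists" ignores the value
    return toleration.get("value") == taint.get("value")


def can_schedule(pod_tolerations, node_taints):
    """Return (allowed, penalised) for placing the pod on the node."""
    allowed, penalised = True, False
    for taint in node_taints:
        if any(tolerates(t, taint) for t in pod_tolerations):
            continue  # tolerated taints have no effect on this pod
        if taint["effect"] in ("NoSchedule", "NoExecute"):
            allowed = False   # hard: the node repels the pod
        elif taint["effect"] == "PreferNoSchedule":
            penalised = True  # soft: the node is only a last resort
    return allowed, penalised


# Hypothetical taint and toleration.
gpu_taint = {"key": "gpu", "value": "true", "effect": "NoSchedule"}
gpu_toleration = {"key": "gpu", "operator": "Equal", "value": "true", "effect": "NoSchedule"}

print(can_schedule([], [gpu_taint]))                # pod without tolerations
print(can_schedule([gpu_toleration], [gpu_taint]))  # pod that tolerates the taint
```

A toleration doesn't attract the pod to the node. It only removes the obstacle, so the pod may still end up somewhere else.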

You can use topology spread constraints to control how Pods are spread across your cluster.

This is convenient when you want to ensure that your pods don't all land on the same node.

Pod topology constraints
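
The key setting in a topology spread constraint is `maxSkew`. Here's a minimal Python sketch of the idea, assuming a hard constraint and zones as the topology domain; the zone names and pod counts are invented, and the real scheduler handles more cases than this.

```python
def allowed_zones(pods_per_zone, max_skew):
    """Zones where adding one pod keeps (zone count - smallest count) within max_skew."""
    smallest = min(pods_per_zone.values())
    return [
        zone
        for zone, count in pods_per_zone.items()
        if count + 1 - smallest <= max_skew
    ]


# Hypothetical distribution of matching pods across zones.
current = {"zone-a": 2, "zone-b": 1, "zone-c": 1}

print(allowed_zones(current, max_skew=1))
```

zone-a is ruled out because a third pod there would put it two ahead of the emptiest zone. Raising `max_skew` to 2 would make all three zones eligible again.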

And finally, you can use scheduler profiles to customize how the scheduler uses filters and scoring functions to assign nodes to pods.

This relatively new feature (>1.25) allows you to turn off or add new logic to the scheduler.

Scheduler profiles in Kubernetes
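
To give an idea of what this looks like, here's a scheduler configuration with two profiles, written as a Python dictionary and printed as JSON. The field names follow the `KubeSchedulerConfiguration` API as I understand it, and the second profile (which turns off all scoring) mirrors an example from the Kubernetes documentation. Treat it as a sketch and check the reference for your cluster version before using it.

```python
import json

# Two profiles served by the same scheduler: the default one, and one with scoring turned off.
scheduler_config = {
    "apiVersion": "kubescheduler.config.k8s.io/v1",
    "kind": "KubeSchedulerConfiguration",
    "profiles": [
        {"schedulerName": "default-scheduler"},
        {
            "schedulerName": "no-scoring-scheduler",
            "plugins": {
                "preScore": {"disabled": [{"name": "*"}]},
                "score": {"disabled": [{"name": "*"}]},
            },
        },
    ],
}


def profile_for(pod_spec, config):
    """Find the profile a pod asks for through its schedulerName field."""
    wanted = pod_spec.get("schedulerName", "default-scheduler")
    for profile in config["profiles"]:
        if profile["schedulerName"] == wanted:
            return profile
    return None  # no profile with that name is configured


print(json.dumps(scheduler_config, indent=2))
print(profile_for({"schedulerName": "no-scoring-scheduler"}, scheduler_config)["schedulerName"])
```

A pod selects a profile by name, so you can keep the default behaviour for most workloads and opt specific pods into the customized one.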

