The Ops Community ⚙️

Ole Markus With
Ole Markus With

Posted on

Blazing fast Kubernetes scaling with ASG warm pools

Last week, AWS launched warm pools for auto scaling groups (ASG). In short, this feature allows you to create a pool of pre-initialised EC2 instances. When the ASG needs to scale out, it will pull in Nodes from the warm pool if available. Since these are already pre-initialised, the scale-out time is reduced significantly.

The warm pool is also virtually for free. You pay for the warm pool instances as you would for any other stopped instance. You also have to pay for the time the instances spend on initialising when entering the warm pool, but this is more or less cancelled out by the reduced time spent on initialising when entering the ASG itself.

What does pre-initialised mean, you wonder? Took a while for me to understand what that means too. What happens is that the EC2 instance boots, runs for a while, and then shuts down. We'll get back to this in a bit.

Now, what intrigued me is this bit from the warm pool documentation:

alt text

This is a very interesting claim. Why on earth would this feature not be possible for Kubernetes regardless of what shape and form self-managed could Kubernetes comes in?

AWS may have made interesting choices with their Elastic Kubernetes Service precluding them from taking advantage of warm pools, but there are alternatives!

Over the last year or so I have regularly contributed to kOps, which is my preferred way of deploying and maintaining production-ready clusters on AWS. I know well how it boots a plain Ubuntu instance and configures it to become a Kubernetes node, and I could not imagine implementing warm pool support would be any challenge. And turns out it was not either.

The results

This post will describe in detail some of the inner workings of kOps, how ASG behaves, and how to observe various time spans between when a scale-out is triggered and the node is ready for Kubernetes workloads.

If you are here only for the results, here is the TL;DR:

  • Time between a scale-out is triggered and Pods start improved at least 50%.
  • It should be possible to improve this even further.
  • Most, if not all, of the functionality below will be available in kOps 1.21.

Summary

This table shows the number of seconds between CAS being triggered and Pods starting.

Configuration First Pod started Last Pod started
No warm pool 149 190
Warm pool 79 149
Warm pool plus life cycle hook 76 98
Warm pool plus life cycle hook + warm-pull images 70 79

And now for the details!

Initialising a Kubernetes node

First, let us have a look at what the process of converting a brand new EC2 instance to a Kubernetes Node means.
On a new EC2 instance, this happens on first boot:

  • cloud-init installs a configuration service called nodeup.
  • nodeup takes the cluster configuration and installs containerd, kubelet, and the necessary distro packages with their correct configurations.
  • nodeup establishes trust with the API server (control plane).
  • nodeup creates and installs a systemd service for kubelet.
  • nodeup starts the kubelet service, which is the process on each node that manages the Kubernetes workloads.
  • kubelet pulls down all the images the control plane tells it to run and starts them as defined by Pod specs and similar.

When a kOps-provisioned instance reboots, nodeup runs through all of the above again to ensure the instance is in the expected state. nodeup is smart enough not to redo already performed tasks though, so the second run is quite fast.

Doing nothing at all

The most naïve way of implementing support for warm pools is to do nothing more than creating the warm pool. Unfortunately, this would start kubelet, which will register the Node with the cluster. Since the AWS cloud provider does not remove instances in stopped state, the control plane marks the Node NotReady, but keep it around in case it comes back up.

It is not a catastrophe to have a large amount of NotReady Nodes in the cluster, but any sane monitoring would not be too happy, and detecting actual bad Nodes would be harder.

Making nodeup aware of the warm pool

The only thing I had to do to support warm pools gracefully was to make nodeup conscious of its ASG lifecycle state.

Instances entering the warm pool have a slightly different lifecycle than instances that go directly into the ASG. nodeup needs to do is to:

  1. check if the current instance ASG lifecycle state has a warming: prefix.
  2. if it does not, install and start the kubelet service.

This way kubelet does not start and join the cluster on first boot, but since we enabled the service, systemd will start it on the second boot.

Comparing the difference

In this section I will take you through comparing the time it takes to scale out a Kubernetes Deployment with and without a warm pool enabled. The acid test is the interval between Cluster Autoscaler (CAS) reacts to the scale-out demand and all the Pods starting.

Configuration

In the kOps Cluster spec I ensured I had the following snippet:

spec:
  clusterAutoscaler:
    enabled: true
    balanceSimilarNodeGroups: true
Enter fullscreen mode Exit fullscreen mode

This enables the cluster autoscaler addon.

The number of new instances per ASG influences the scale-out time of the ASG itself. Adding 9 instances to one ASG is significantly slower than adding 3 instances to 3 ASGs. So, to ensure fair comparisons, we tell CAS to balance the ASGs.

On each of the InstanceGroups with the Node role, I set the following capacity:

spec:
  machineType: t3.medium
  maxSize: 10
  minSize: 1
Enter fullscreen mode Exit fullscreen mode

The cluster will launch and with the with the minimum capacity.

I then created a Deployment that has resource.requirement.cpu set:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: default
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 0
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:stable-alpine
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 1
Enter fullscreen mode Exit fullscreen mode

I used the nginx:stable-alpine image here as it is fairly small. I did not want image pull time to significantly impact the scale-out time.

Scaling out

To scale the Deployment, I executed the following:

kubectl scale deployment.v1.apps/nginx-deployment --replicas=10
Enter fullscreen mode Exit fullscreen mode

A t3.medium instance only has 2 cpus, some of which are reserved by other Pods already. So, increasing the replicas as above causes CAS to scale out one instance per replica.

Obtaining the test results

A good way of getting the details is to list the events for a Pod:

$ kubectl get events -o custom-columns=FirstSeen:.firstTimestamp,LastSeen:.lastTimestamp,Count:.count,From:.source.component,Type:.type,Reason:.reason,Message:.message --field-selector involvedObject.kind=Pod,involvedObject.name=nginx-deployment-123abc
Enter fullscreen mode Exit fullscreen mode

For all the configurations in this post, I ran the following command on the first and last Pod to enter the Running state.

For the first Pod, I got:

FirstSeen              LastSeen               Count   From                 Type      Reason             Message
2021-04-17T08:56:30Z   2021-04-17T08:56:30Z   1       cluster-autoscaler   Normal    TriggeredScaleUp   Pod triggered scale-up: [{Nodes-eu-central-1a 1->5 (max: 10)} {Nodes-eu-central-1b 1->4 (max: 10)}]
<snip>
2021-04-17T08:58:59Z   2021-04-17T08:58:59Z   1       `kubelet`              Normal    Started            Started container nginx
Enter fullscreen mode Exit fullscreen mode

For the last Pod I got:

FirstSeen              LastSeen               Count   From                 Type      Reason             Message
2021-04-17T08:56:30Z   2021-04-17T08:56:30Z   1       cluster-autoscaler   Normal    TriggeredScaleUp   Pod triggered scale-up: [{Nodes-eu-central-1a->5 (max: 10)} {Nodes-eu-central-1b 1->4 (max: 10)}]
<snip>
2021-04-17T08:59:40Z   2021-04-17T08:59:40Z   1       `kubelet`              Normal    Started            Started container nginx
Enter fullscreen mode Exit fullscreen mode

This means the first Pod started after 149s and the last Pod after 190s. These are the numbers I'll be comparing across all configurations. I also found it interesting to compare the difference between the first and last Pod start time.

Hunting for delays

This part may not be all that interesting. Here I just try to show which time spans can be interesting for improving the warming process, including explaining what may cause a 41 second difference between those two Pods.

ASG reaction time

If I look at the ASG activity, I see the following message on all instances the ASG launched:

At 2021-04-17T08:56:30Z a user request explicitly set group desired capacity changing the desired capacity from 1 to 5. At 2021-04-17T08:56:32Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 1 to 5.

However, if I look at the actual boot of the instances, I see these two lines for the first and last instance to boot respectively:

-- Logs begin at Sat 2021-04-17 08:56:51 UTC, end at Sat 2021-04-17 09:10:23 UTC. --
-- Logs begin at Sat 2021-04-17 08:57:13 UTC, end at Sat 2021-04-17 09:09:19 UTC. --
Enter fullscreen mode Exit fullscreen mode

So even if the ASG launched the instances at the same time, they do not actually boot at the same time.

It looks like we can generally assume 20-30 seconds response time on an ASG scale-out.

Nodeup run time

We can see that the nodeup spends a fairly consistent amount of time initialising the node. The times below show when nodeup finished on the two Nodes.

Apr 17 08:58:27 ip-172-20-42-5 `systemd`[1]: kops-configuration.service: Succeeded.
Apr 17 08:58:50 ip-172-20-90-140 `systemd`[1]: kops-configuration.service: Succeeded.
Enter fullscreen mode Exit fullscreen mode

The delay is almost the same. From boot until the instance has been configured takes about 95-100 seconds.

Kubelet becoming ready

The last part is the time it takes before node enters Ready state.

kubelet becomes ready once it has registered with the control plane and verified storage, CPU, memory, and networking is working properly.

Interestingly, this part further skewed the difference between our two Nodes:

  Ready                True    Sat, 17 Apr 2021 11:19:08 +0200   Sat, 17 Apr 2021 10:58:53 +0200   KubeletReady                 `kubelet` is posting ready status. AppArmor enabled
  Ready                True    Sat, 17 Apr 2021 11:14:56 +0200   Sat, 17 Apr 2021 10:59:25 +0200   KubeletReady                 `kubelet` is posting ready status. AppArmor enabled
Enter fullscreen mode Exit fullscreen mode

This last leg took 26s and 35s for those two instances.

Enter the warm pool

So how much does adding a warm pool improve the scale-out time?

Wrap up this test by scaling the deployment back to 0 so CAS can scale down the ASGs again.

kubectl scale deployment.v1.apps/nginx-deployment --replicas=0
Enter fullscreen mode Exit fullscreen mode

Adding a warm pool

This feature has not been released to a kOps beta at the time of the writing; the field names below may change.

Configuration

The following will make kOps create warm pools for our InstanceGroups.

warmPool: 
  minSize: 10
  maxSize: 10
Enter fullscreen mode Exit fullscreen mode

Specifically, this will configure a warm pool with 10 Nodes. I only do this to ensure I have a known amount of warm instances to make the the tests comparable. When using warm pools under normal operations, I would just use AWS defaults.

Apply the configuration and watch the warm pool instances appear:

$ kops get instances
ID                      NODE-NAME                                       STATUS       ROLES   STATE           INTERNAL-IP     INSTANCE-GROUP               MACHINE-TYPE
i-01ade8dad4c7ce0cd     ip-172-20-114-104.eu-central-1.compute.internal UpToDate     node                    172.20.114.104  Nodes-eu-central-1c          t3.medium
i-01f169e730f88e016     ip-172-20-118-255.eu-central-1.compute.internal UpToDate     master                  172.20.118.255  master-eu-central-1c.masters t3.medium
i-069e09e5873042cd7     ip-172-20-93-29.eu-central-1.compute.internal   UpToDate     master                  172.20.93.29    master-eu-central-1b.masters t3.medium
i-09b42c88fbd3399ab     ip-172-20-58-68.eu-central-1.compute.internal   UpToDate     node                    172.20.58.68    Nodes-eu-central-1a          t3.medium
i-0a85aed8869a30432     ip-172-20-59-147.eu-central-1.compute.internal  UpToDate     master                  172.20.59.147   master-eu-central-1a.masters t3.medium
i-0b37f6a258d9c7775     ip-172-20-75-150.eu-central-1.compute.internal  UpToDate     node                    172.20.75.150   Nodes-eu-central-1b          t3.medium
i-0c16b3c668615f259                                                     UpToDate     node    WarmPool        172.20.50.226   Nodes-eu-central-1a          t3.medium
i-0cdf1c334d452c9a6                                                     UpToDate     node    WarmPool        172.20.126.45   Nodes-eu-central-1c          t3.medium
i-0d00f4debb586d17f                                                     UpToDate     node    WarmPool        172.20.116.222  Nodes-eu-central-1c          t3.medium
i-0d04d04cb7f4be2d7                                                     UpToDate     node    WarmPool        172.20.58.177   Nodes-eu-central-1a          t3.medium
i-0d49db4113702292c                                                     UpToDate     node    WarmPool        172.20.59.66    Nodes-eu-central-1a          t3.medium
i-0e07b6cc361d7a909                                                     UpToDate     node    WarmPool        172.20.89.193   Nodes-eu-central-1b          t3.medium
i-0f3451adef13005e7                                                     UpToDate     node    WarmPool        172.20.114.9    Nodes-eu-central-1c          t3.medium
<snip>
Enter fullscreen mode Exit fullscreen mode

The kOps output only shows that the instance is in the warm pool, not if it has finished pre-initialisation. But if you go into the Instance Management part of the ASG in the AWS console I can see something like this:

alt text

As you can see, not all instances have entered warmed:stopped state yet. But after waiting a bit longer, they are all ready and I can scale out.

kubectl scale deployment.v1.apps/nginx-deployment --replicas=10
Enter fullscreen mode Exit fullscreen mode

Once all the Pods have entered the Running state, let's look if this has improved.

Again, find the first and last Pod that entered the Running state and list their events. The method is the same as last time, and it shows that the first Pod starts after 79s and the last one starts after 149s.

The first Pod starts 70 seconds faster than the first node without a warm pool. The last Pod starts 41 seconds faster than the last Pod without warm instances. That is a pretty decent improvement.

Without the warm pool, the difference between the first Pod and last Pod was 41s. This time it was a whopping 70s; half a minute more. We cannot seem to blame this on ASG response time or the time between the kubelet starting and becoming ready.

So, what is happening here?

Turns out the time an instance runs before it is shut down is completely arbitrary. Some instances stay running for seconds, others for minutes. On the Node the first Pod is running on, nodeup was allowed run to until completion, while on the Node of the last Pod, it was barely allowed to run at all. Luckily nodeup is fairly good at knowing the current state of the Node and can pick up from where it left of regardless of when it was interrupted.

Enter lifecycle hooks

But we do not want nodeup to be interrupted. Nor do we want the instance to stay running long after nodeup has finished.

What AWS did to solve this is to let warming instances run through the same lifecycle as other instances. So, if you have a lifecycle hook for EC2_INSTANCE_LAUNCHING it will trigger on warming Nodes too.

Amend the InstanceGroup spec as follows:

warmPool:
  useLifecycleHook: true
Enter fullscreen mode Exit fullscreen mode

This will make kOps provision a lifecycle hook that can be used by nodeup to signal that it has completed its configuration.

Do the scale down/scale up dance again and observe the first and last Pod creation time.

First Pod 76s; last Pod 98s. That's 22s difference between the first and last Pod. Down from 40s without warm pool and certainly down from the 70s for warm pool without the instance lifecycle hook.

The effect of warm pool alone

With warm pool and some fairly easy changes to kOps, we did the impossible. We created warm pool support for self-managed Kubernetes.

The result is significantly faster response times to Pod scale-out. From up to 190 seconds without warm pool, to up to 98 seconds with warm pool and lifecycle hooks. The best case went from 149 seconds to 76 seconds.

That is 50% faster. Not bad!

Exploiting warm pools further

So far we have focused just on running a regular nodeup run. But can we do better? Can we exploit warm pool to make Pod scale out even faster?

Most likely we cannot do much about what happens before nodeup runs. Nodeup already runs fairly fast; roughly 10 seconds. There could be some optimisations there, but we would not be able to shave off many seconds.

But what about those 26-35 seconds between nodeup completing and the Pods becoming ready?

All Nodes are running additional system containers such as kube-proxy a CNI, in this case Cilium. All of these need to be present on the machine before much else happens. And those images are not necessarily small.

In my case, the time it took was the following:

FirstSeen              LastSeen               Count   From      Type     Reason    Message
2021-04-18T06:10:34Z   2021-04-18T06:10:34Z   1       `kubelet`   Normal   Pulled    Successfully pulled image "k8s.gcr.io/kube-proxy:v1.20.1" in 19.120641887s
2021-04-18T06:10:55Z   2021-04-18T06:10:55Z   1       `kubelet`   Normal   Pulled    Successfully pulled image "docker.io/cilium/cilium:v1.9.4" in 8.270050914s
Enter fullscreen mode Exit fullscreen mode

So not only did it take a while for each of these images to be pulled, but they are pulled in sequence, adding up to 25+ seconds.

But during warming, nodeup already knows about these images. What if it could just pull them so they were already present? Let's try this and see if this changes anything.

After another round of the scale-in/scale-out dance we describing the node and see the image has indeed been pulled during warming.

2021-04-18T06:35:37Z   2021-04-18T06:35:37Z   1       `kubelet`   Normal   Pulled    Container image "k8s.gcr.io/kube-proxy:v1.20.1" already present on machine
2021-04-18T06:35:37Z   2021-04-18T06:35:37Z   1       `kubelet`   Normal   Started   Started container kube-proxy
2021-04-18T06:35:37Z   2021-04-18T06:35:37Z   1       `kubelet`             Normal   Pulled      Container image "docker.io/cilium/cilium:v1.9.4" already present on machine
2021-04-18T06:35:38Z   2021-04-18T06:35:38Z   1       `kubelet`             Normal   Started     Started container cilium-agent
Enter fullscreen mode Exit fullscreen mode

We also see that the containers have started at roughly the same time.

Comparing this configuration with previous configuration shows this did not bring down the starting time of the first Pod by many seconds. This time the first Pod started 70 seconds after CAS triggered. But the last Pod consistently trails only a few seconds behind, at about 79 seconds. This makes the improvement worthwhile.

Pulling in images during warming could also be used for any other containers that may run on a given Node. This is certainly valuable for any DaemonSet, but also for any other Pod that have a chance of being deployed to the Node. Worst-case is that you waste a bit of disk space should the Pods not be scheduled on the Node.

Wrap up

The feature is not even two weeks old, so I certainly have not had the time to explore all the ways this can be exploited. nodeup typically does not expect instances to reboot so there may be optimisations that can be done there as well. For example, also on second boot, kubelet is triggered by nodeup, which may not be necessary. If nodeup successfully created the kubelet service on the first run, there should be zero changes done to the system on the second run. The only reason there would be a change is if the cluster configuration changed, and those should only be applied through an instance rotation.

I hope you are as excited about this as I am. And if you wonder when this all will be available to you, the answer is "shortly". Some of the functionality has already been merged to Kubernetes/kops, and I obviously have the code for the rest more or less ready. I hope for all of this to be available in kOps 1.21 expected sometime in May.

Top comments (0)