The Ops Community ⚙️

Cover image for Request-based autoscaling in Kubernetes: scaling to zero
Daniele Polencic
Daniele Polencic

Posted on

Request-based autoscaling in Kubernetes: scaling to zero

TL;DR: In this article, you will learn how to monitor the HTTP requests to your apps in Kubernetes and how to define autoscaling rules to increase and decrease replicas for your workloads.

Reducing infrastructure costs boils down to turning apps off when you don't use them.

That's easy to do manually, but how to turn them on automatically when you need them?

You can do so with a scale-to-zero strategy.

Let me show you how to implement it in Kubernetes.

Imagine you are running a reasonably resource-intensive app on the dev cluster.

Ideally, it's off when people leave the office and on when they start the day.

You could use a CronJob to scale up/down, but what happens during the weekend?

And public holidays?

A development and production Kubernetes clusters

Instead of coming up with an evergrowing list of exceptions, you could scale up your workloads based on traffic:

  • When there's traffic → autoscale on usage.
  • When there is no traffic → turn off the app.

And that's the idea behind scaling to zero.

Scale to zero strategy

But what do you need to make it work?

The first step is collecting metrics and measuring the app traffic flow.

You could intercept the traffic by injecting a proxy in front.

The proxy could temporarily buffer requests when the app is scaled to 0.

When the app is ready, the proxy can forward all requests.

In Kubernetes, that's an extra proxy container in the Pod.

Intercepting all traffic that flows to the app with a proxy container

Once you have the metrics, you need two things:

  1. A metrics server to store and aggregate metrics (Kubernetes doesn't come with one by default).
  2. A way to export the metrics to the metrics server.

Kubernetes metrics server

And finally, the last piece of the puzzle is the autoscaler.

You need to configure the Horizontal Pod Autoscaler to consume the metrics from the metrics server and adjust the replicas count accordingly.

Collecting metrics, exposing them to the Kubernetes API and driving the Horizontal Pod Autoscaler

In Kubernetes, you can have exports, metrics servers and proxy bundled into one with KEDA.

KEDA is an event-driven autoscaler that implements the Custom Metrics API.

It's made of 3 components:

  1. A Scaler.
  2. A Metrics Adapter.
  3. A Controller.

KEDA architecture

Scalers are like adapters that can collect metrics from databases, message brokers, telemetry systems etc.

KEDA has a special scaler that creates an HTTP proxy that measures and buffers requests before they reach the app.

KEDA HTTP add-on

KEDA's metrics adapter is responsible for exposing the metrics collected by the HTTP scaler in a format that the Kubernetes metrics pipeline can consume.

Autoscaling with KEDA in Kubernetes

Finally, the KEDA controller glues all the components together:

  • Collects the metrics using the adapter and exposes them to the metrics API.
  • Registers and manages the KEDA-specific Custom Resource Definitions (CRDs).
  • Creates and manages the Horizontal Pod Autoscaler.

KEDA and the HTTPScaledObject

With KEDA, you don't create the Horizontal Pod Autoscaler (HPA).

Instead, you create an HTTPScaledObject, which is a wrapper around the HPA that includes:

  1. How to connect to the source of the metrics.
  2. How to route the traffic to the app.
  3. The settings to create the Horizontal Pod Autoscaler.

HTTPScaledObject YAML definition

You can see the setup in action here.

  • There is no active app.
  • 6 requests are issued.
  • Eventually, the app (octopus icon) appears and serves the six requests.

Scale to zero in action

Scaling to zero… isn't that serverless? Not really.

Once scaled to at least one replica, the app still behaves as a regular app (e.g. it can handle hundreds of connections on its own).

Instead, serverless frameworks might create more functions to handle concurrent traffic.

Also, serverless frameworks handle incoming requests. You only have to define the application logic.

With a scale to zero, you are still handling the traffic.

If you wish to see this in action, Salman Iqbal demos scaling to zero here

And finally, if you've enjoyed this short post, you might also like the Kubernetes workshops that we run at Learnk8s or this collection of past Twitter threads

Until next time!

Top comments (0)