It all started with the container wars of yore, where Apache Mesos, Docker Swarm, Kubernetes, and HashiCorp Nomad fought a bloody battle until a clear victor emerged. Kubernetes originated within Google, and significantly more than half of the respondents in a CNCF survey (of 2,302 organizations, sample bias notwithstanding) reported using Kubernetes in production. It holds ~78% market share within the container orchestration space.
⭐ With great power comes great responsibility - Uncle Ben
⭐ With great power comes great complexity - Anonymous
Kubernetes is extremely powerful and configurable, and that configurability leads to complexity. Here are the practical considerations to be aware of when dealing with Kubernetes as the deployment target.
So you have decided to use Kubernetes for your startup. This post details what you need to do to have a fully functioning setup. This post is opinionated.
There are two classes of tools:
- Managed Kubernetes providers - like Amazon EKS, Google GKE, Azure AKS, DigitalOcean DOKS, etc. These are managed services from the cloud providers. Please do not try to manage Kubernetes clusters yourself using `kubeadm` or similar unless there is a hard constraint like compliance or on-prem deployments.
- Kubernetes wrappers to make working with k8s easy - a smattering of startups moving k8s under the hood. Argonaut does this amongst other things.
Scoping the setup to just the major clouds: the configuration process from the UI takes ~5 minutes, and then another 10-20 minutes for the cluster itself to be provisioned. Command-line tools exist to specify the configuration using YAML files, and Terraform modules can help provision clusters using infrastructure-as-code (IaC) paradigms.
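For example, a minimal cluster definition with eksctl might look like this (a sketch; the name, region, and node sizing are illustrative assumptions), applied with `eksctl create cluster -f cluster.yaml`:

```yaml
# Illustrative eksctl cluster config; names and sizes are assumptions.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: startup-dev
  region: us-east-1
nodeGroups:
  - name: default
    instanceType: t3.medium
    desiredCapacity: 2
    minSize: 1
    maxSize: 3
```

Checking a file like this into the repo gives you the IaC benefits (review, rollback, reproducibility) without writing Terraform on day one.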
New Kubernetes (control plane) versions are released every ~15 weeks. The alpha → beta → stable release cycle for features/APIs is around a year long, and the deprecation of older APIs follows approximately the same cycle. Installing a recent version of a third-party tool (e.g., Grafana) can fail on older versions of Kubernetes because of outdated APIs. Keeping the Kubernetes control plane current is therefore important for this very functional reason, not counting security enhancements and additional features delivered over time.
Kubernetes version updates can break functioning applications that still rely on deprecated APIs. Upgrades need to be done with utmost care and can be an arduous process every few months.
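Before an upgrade, it is worth checking whether anything still calls soon-to-be-removed APIs. A rough sketch (assumes access to a live cluster; the manifests path is an illustrative assumption):

```
# The apiserver exposes a counter (since k8s v1.19) for requests
# that hit deprecated API versions:
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis

# Alternatively, scan your manifests offline with a tool like
# Pluto from FairwindsOps:
pluto detect-files -d ./manifests
```

If either turns anything up, migrate those API versions before bumping the control plane.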
This is a major consideration - if you don’t have someone with the relevant time and expertise for this activity, using Kubernetes directly is not recommended.
As a startup, you want application developers to go full throttle on building functionality and not worry about infrastructure. Using Kubernetes can bring a lot of velocity to the development workflows by enabling scale from day one. This is the technology that powers Internet-scale companies and provides immense flexibility around scaling, rollouts, and service discovery while abstracting out the physical infrastructure.
Here are the major concerns:
- Build: Workloads need to be containerized, which leads to long build times, especially if no caching is possible or enabled for the build. A local build might be just a hot reload, but it can take many minutes once the container build step is included. Please use podman, kaniko, or similar over docker for builds.
- Deploy: The deployment experience itself is super neat, but the complexity of defining the deployment - usually a one-time heavy lift followed by minor maintenance modifications - is pretty high and requires k8s-specific knowledge. Some tools help with automatic generation of k8s manifests for easy use, like skaffold, devspace, dokku, and more.
- Debug - difference between local and dev environments: Replicating bugs from remote environments to local is tough. Tools like devspace and telepresence help. The former enables a quick build loop between the local and the remote k8s environment and can be massively helpful in reducing development time. The latter is a lot more magic, directing remote traffic to your local laptop, but is more complex to work with.
- Debug - logging into remote: While devspace, telepresence, and skaffold are nice for remote dev, sometimes the easiest thing to do is to log in to the remote container for debugging. You definitely want to use a Kubernetes dashboard - k8s Lens, or the k9s CLI. Along with this, you can use `kubectl exec` to open up a shell in the container, or use ephemeral containers (introduced in k8s v1.23) to attach to a running pod. The latter is the preferred method in dev environments. Neither of these should be done in prod environments.
- Autoscaling: This is a huge benefit that k8s brings. Autoscaling can mean one of three things.
- Cluster - Nodes: This is the number of EC2 instances required to satisfy the application requirements. The cluster-autoscaler or karpenter projects are great for this.
- Application - Horizontal Scaling: Creating more instances of the application to keep up with the load. The built-in Horizontal Pod Autoscaler covers the basics; KEDA is the tool for event-driven scaling.
- Application - Vertical Scaling: Increasing resources available to each instance of the application based on load. This is not done much in practice, and there aren't any mature tools to help with it beyond the Kubernetes Vertical Pod Autoscaler (VPA).
- Logs: A logging solution is a must when dealing with k8s, simply because you should avoid directly accessing prod environments. Setting up a logging pipeline is not hard. Prometheus + Grafana + Loki is the simplest self-hosted solution, while Datadog and the like provide excellent out-of-the-box experiences, but at a (literal) cost. One thing to be wary of: logs from different instances of the same application are usually treated as separate streams, which can be a mixed bag during debugging. Self-hosting ELK will cause a few days of heartburn before you get it up and running reliably.
- APM + Tracing: Use your favorite. Please avoid tool fragmentation and use the same solution across the board.
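To make the Deploy point above concrete, here is a minimal sketch of the k8s-specific knowledge involved: a Deployment manifest for a hypothetical `api` service (the image, port, and resource numbers are illustrative assumptions):

```yaml
# Minimal Deployment sketch for a hypothetical service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: ghcr.io/example/api:1.0.0  # hypothetical image
          ports:
            - containerPort: 8080
          resources:
            requests: {cpu: 100m, memory: 128Mi}
            limits: {cpu: 500m, memory: 256Mi}
```

Even this "hello world" manifest already requires understanding selectors, labels, and resource requests versus limits - which is exactly the one-time heavy lift mentioned above.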
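For the "logging into remote" point above, the two approaches look roughly like this (a sketch; `my-pod` and the target container name `app` are hypothetical):

```
# Open a shell in an existing container (requires a shell in the image):
kubectl exec -it my-pod -- sh

# Attach a throwaway ephemeral debug container to a running pod,
# useful for distroless images with no shell (k8s v1.23+):
kubectl debug -it my-pod --image=busybox --target=app
```

The `--target` flag shares the process namespace with the named container, so you can inspect its processes from the debug container.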
My personal favorite to get started is self-hosting Prometheus + Grafana + Loki. For advanced use cases, I like datadog but it is expensive.
A catch with hosting your own observability stack is that it can lead to fragmentation if you're not careful. Monitoring one cluster with a Grafana stack is fine; a multi-cluster setup gets complicated. Infra monitoring is a whole other ball game, with permissions and access-control requirements that can be hard to wrangle for startups.
A huge advantage of running applications on k8s with respect to monitoring is that the whole fleet can be monitored in one shot, with little to no changes required in the applications. The collection agents usually run as DaemonSets, collecting container logs from each of the nodes and piping them to the log server with the appropriate labels. Works like a charm.
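The DaemonSet pattern described above looks roughly like this (a sketch; the promtail image and namespace are illustrative, and a real deployment also needs a scrape config and service account):

```yaml
# Sketch: one log-collector pod per node, tailing that node's
# container logs from the host filesystem.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector
  namespace: logging
spec:
  selector:
    matchLabels: {app: log-collector}
  template:
    metadata:
      labels: {app: log-collector}
    spec:
      containers:
        - name: collector
          image: grafana/promtail:latest  # illustrative choice of agent
          volumeMounts:
            - {name: varlog, mountPath: /var/log, readOnly: true}
      volumes:
        - {name: varlog, hostPath: {path: /var/log}}
```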
The big one. $72/month per cluster, plus whatever compute is used.
Kubernetes adoption for startups almost certainly increases the cloud bill because there is a $72/month cost associated with just running each cluster on EKS, GKE, or AKS (for uptime SLA).
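As rough arithmetic (the $0.10/hour control-plane rate is the published EKS pricing at the time of writing; the cluster count is a hypothetical assumption):

```shell
# Back-of-the-envelope control-plane cost for managed Kubernetes.
# $0.10/hour/cluster at ~720 hours/month = $72/month; compute is extra.
hourly_cents=10
hours_per_month=720
clusters=3  # hypothetical: dev, staging, and prod clusters
monthly_per_cluster=$(( hourly_cents * hours_per_month / 100 ))
echo "Per cluster: \$${monthly_per_cluster}/month"
echo "Total for ${clusters} clusters: \$$(( monthly_per_cluster * clusters ))/month"
```

Note this excludes node compute, load balancers, and egress, which usually dominate the bill.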
However, some tools can help reduce costs by providing visibility into pod utilization and adjusting resources appropriately - kubecost, for example.
For very early companies, check out Lambda container images as a starting point, especially if there are only a handful of services and low utilization.
Here are some things you will eventually need to do over the next few years.
- Set up multiple clusters across regions and connect them to operate as a single entity as far as apps are concerned. This needs a service mesh like Linkerd.
- Set up policies around what resource requirements an app can request per environment. OPA with Gatekeeper, or Kyverno, can help. Set up access control for who can create or modify apps.
- Set up a VPN solution.
- Accrete ~15 tools to help with k8s management over time.
- Have a dedicated infra team.
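As an example of the resource-policy point above, a Kyverno policy requiring requests and limits looks roughly like this (a sketch modeled on Kyverno's sample `require-requests-limits` policy; the message and action are assumptions):

```yaml
# Reject pods whose containers omit resource requests/limits.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: validate-resources
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "CPU and memory requests and memory limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    memory: "?*"
                    cpu: "?*"
                  limits:
                    memory: "?*"
```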
To be clear, these are NOT negatives. The alternative would be to build immature tooling in-house to replicate all this functionality at an even greater investment of time and effort.
- Use Kubernetes as a startup if you have cloud credits and more than 3-5 services, and a couple of environments.
- The overhead with using Kubernetes is real. Use a product that can help with overall management.
- Use Lambda container images if you have 1-2 services and low utilization.
- Use Argonaut - All of the above are either a part of Argonaut already or will soon be. We support Lambdas and Kubernetes runtimes and we can help you make the right choice for your startup. Sometimes, that might be to not use Argonaut and we will tell you that. Check out our docs here.