Over the years, I have shared several blog posts about Kubernetes (What are Containers and Kubernetes, Modern Cloud deployment and usage, Introduction to Container Operating Systems, and more).
Kubernetes has become the de facto standard for running container-based workloads (both on-premises and in the public cloud), but most organizations tend to fail at what is referred to as Day 2 Kubernetes operations.
In this blog post, I will review what "Day 2 Kubernetes" means and how to prepare your workloads for the challenges of Day 2 operations.
In the software lifecycle, or in the context of this post, the Kubernetes lifecycle, there are several distinct stages:
In the first stage, commonly called Day 0, we focus on designing our solution (application and underlying infrastructure), understanding business needs, budget, required skills, and more.
For the context of this post, let us assume we have decided to build a cloud-native application, made of containers, deployed on top of Kubernetes.
In the second stage, Day 1, we focus on deploying our application using the Kubernetes orchestrator and setting up the configurations (number of replicas, public ports, auto-scale settings, and more).
Most organizations taking their first steps deploying applications on Kubernetes are stuck at this stage.
They may have multiple environments (such as Dev, Test, UAT) and perhaps even production workloads, but they are still on Day 1.
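To make the Day 1 configuration items above concrete (number of replicas, ports, auto-scale settings), here is a minimal sketch that builds a Deployment and a HorizontalPodAutoscaler manifest as plain Python dictionaries and prints them as JSON. The application name `web`, the image `registry.example.com/web:1.0.0`, and all the numbers are hypothetical values chosen for illustration only.

```python
import json


def build_deployment(name, image, replicas, port):
    """Return a Deployment manifest (apps/v1) as a plain dict."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


def build_autoscaler(name, min_replicas, max_replicas, cpu_percent):
    """Return a HorizontalPodAutoscaler manifest (autoscaling/v2) for the Deployment."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": name,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [
                {
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        "target": {
                            "type": "Utilization",
                            "averageUtilization": cpu_percent,
                        },
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical application settings, for illustration only.
    deployment = build_deployment("web", "registry.example.com/web:1.0.0", 3, 8080)
    autoscaler = build_autoscaler("web", 3, 10, 70)
    for manifest in (deployment, autoscaler):
        print(json.dumps(manifest, indent=2))
```

Generating manifests from code like this is only one option; many teams keep them as YAML files or use templating tools instead.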
Mature organizations have reached the third stage, Day 2.
This is about ongoing maintenance, observability, and continuous improvement of security aspects of production workloads.
In this blog post, I will dive into "Day 2 Kubernetes".
Below are the most common Kubernetes challenges:
Managing Kubernetes at a large scale requires insights into the Kubernetes cluster(s).
It is not enough to monitor the Kubernetes cluster by collecting performance logs, errors, or configuration changes for objects such as Nodes, Pods, and containers.
We need to truly understand the internals of the Kubernetes cluster (from logs, metrics, etc.) and be able to diagnose its behavior: not just performance issues, but also debugging problems, detecting anomalies, and (hopefully) anticipating problems before they affect customers.
Prefer to use cloud-native monitoring and observability tools to monitor Kubernetes clusters.
Without proper observability, we will not be able to do root cause analysis and understand problems with our Kubernetes cluster or with our application deployed on top of Kubernetes.
Common tools for observability:
- Prometheus - An open-source systems monitoring and alerting toolkit for monitoring large cloud-native deployments.
- Grafana - An open-source query, visualization, and alerting tool (resource usage, built-in and customized metrics, alerts, dashboards, log correlation, etc.)
- OpenTelemetry - A collection of open-source tools for collecting and exporting telemetry data (metrics, logs, and traces) for analyzing software performance and behavior.
- Amazon Managed Grafana
- Amazon Managed Service for Prometheus
- AWS Distro for OpenTelemetry
- Azure Monitor managed service for Prometheus (still in preview as of April 2023)
- Azure Managed Grafana
- OpenTelemetry with Azure Monitor
- Google Cloud Managed Service for Prometheus
- Google Cloud Logging plugin for Grafana
- OpenTelemetry Collector (Part of Google Cloud operations suite)
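To illustrate what "detecting anomalies" can mean in practice, here is a minimal sketch that flags metric samples deviating sharply from a trailing baseline. It is a simplified stand-in for the alerting rules and anomaly detection that the tools above provide, not a description of how any of them work internally. The function name, the window size, the threshold, and the sample latency values are all assumptions made for this example.

```python
from statistics import mean, pstdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return the indexes of samples that deviate from the mean of the
    preceding `window` samples by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = pstdev(baseline)
        if sigma == 0:
            # Flat baseline: treat any change at all as an anomaly.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical request latencies in milliseconds, one sample per minute.
    latency_ms = [100, 102, 98, 101, 99, 100, 450, 101]
    print(find_anomalies(latency_ms))
```

With the hypothetical series above, only the spike at index 6 is flagged. In a real cluster the samples would come from your metrics backend rather than a hard-coded list.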
On the one hand, it is easy to deploy a Kubernetes cluster in private mode, meaning, the API server or the Pods are on an internal subnet and not directly exposed to customers.
On the other hand, many challenges in the security domain need to be solved:
- Secrets Management – A central and secure vault for generating, storing, retrieving, rotating, and eventually revoking secrets (instead of hard-coded static credentials inside our code or configuration files).
- Access control mechanisms – Ability to control what persona (either human or service account) has access to which resources inside the Kubernetes cluster and to take what actions, using RBAC (Role-based access control) mechanisms.
- Software vulnerabilities – Any vulnerabilities related to code, from programming languages (such as Java, PHP, .NET, Node.js, etc.) and open-source libraries with known vulnerabilities, to vulnerabilities inside Infrastructure-as-Code (such as Terraform modules).
- Hardening – Ability to deploy a Kubernetes cluster at scale, using secured configuration, such as CIS Benchmarks.
- Networking – Ability to set isolation between different Kubernetes clusters or even between different development teams using the same cluster, not to mention multi-tenancy, where the same Kubernetes platform serves different customers.
- Securing the Software Supply Chain in the Cloud
- OPA (Open Policy Agent) Gatekeeper
- Kyverno - Kubernetes Native Policy Management
- Foundational Cloud Security with CIS Benchmarks
- Amazon EKS Best Practices Guide for Security
- Azure security baseline for Azure Kubernetes Service (AKS)
- GKE Security Overview
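As a small illustration of the access control item above, here is a sketch of a least-privilege RBAC Role expressed as a Python dictionary, together with a simple check that flags rules using the `*` wildcard. The role names, the `team-a` namespace, and the `find_overly_broad_rules` helper are hypothetical; in practice, policy engines such as OPA Gatekeeper or Kyverno are the usual way to enforce rules like this.

```python
def find_overly_broad_rules(role):
    """Return the indexes of rules that use the "*" wildcard in
    apiGroups, resources, or verbs."""
    broad = []
    for index, rule in enumerate(role.get("rules", [])):
        for field in ("apiGroups", "resources", "verbs"):
            if "*" in rule.get(field, []):
                broad.append(index)
                break
    return broad


# A narrowly scoped Role: read-only access to Pods in a single namespace.
read_only_role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"name": "pod-reader", "namespace": "team-a"},
    "rules": [
        {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]}
    ],
}

# A Role that grants everything in the namespace, which the check should flag.
admin_like_role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"name": "do-anything", "namespace": "team-a"},
    "rules": [{"apiGroups": ["*"], "resources": ["*"], "verbs": ["*"]}],
}

if __name__ == "__main__":
    for role in (read_only_role, admin_like_role):
        flagged = find_overly_broad_rules(role)
        print(role["metadata"]["name"], "-> wildcard rules at indexes:", flagged)
```

A check like this could run in a CI pipeline before manifests reach the cluster, so overly broad permissions are caught during review rather than in production.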
Mature organizations have already embraced DevOps methodologies for pushing code through a CI/CD pipeline.
The entire process needs to be done automatically and without giving developers direct access to production environments (for this purpose you build break-glass mechanisms for the SRE teams).
The switch to applications wrapped inside containers allowed developers to develop locally or in the cloud and push new versions of their code to various environments (such as Dev, Test, and Prod).
Integrating the CI/CD pipeline with containers allows organizations to continuously develop new software versions, but it requires expanding developers' knowledge through training.
The use of GitOps and tools such as Argo CD allowed a continuous delivery process for Kubernetes environments.
To give developers the best experience, you need to integrate the CI/CD process into the development environment, so that the development team works the same way as it did when developing legacy on-premises applications. This can also speed up developer onboarding.
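The core idea behind GitOps is that the desired state lives in Git and a controller continuously reconciles the cluster toward it. The sketch below shows that comparison step in its simplest form. It is not how Argo CD is implemented; the `plan_sync` function and the resource names are hypothetical and exist only to illustrate the concept.

```python
def plan_sync(desired, live):
    """Compare the desired state (as declared in Git) with the live state
    (as running in the cluster) and return the actions needed to converge."""
    actions = []
    for name in sorted(desired):
        if name not in live:
            actions.append(("create", name))
        elif desired[name] != live[name]:
            actions.append(("update", name))
    for name in sorted(live):
        if name not in desired:
            # Present in the cluster but no longer declared in Git.
            actions.append(("delete", name))
    return actions


if __name__ == "__main__":
    # Hypothetical states, keyed by resource name.
    desired_state = {
        "web": {"image": "web:1.1", "replicas": 3},
        "worker": {"image": "worker:2.0", "replicas": 2},
    }
    live_state = {
        "web": {"image": "web:1.0", "replicas": 3},
        "legacy": {"image": "legacy:0.9", "replicas": 1},
    }
    for action, name in plan_sync(desired_state, live_state):
        print(action, name)
```

Because the plan is derived only from the declared state, a change to production is a reviewed commit rather than a manual command, which fits the goal of keeping developers away from direct production access.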
- GitOps 101: What is it all about?
- Argo CD - Declarative GitOps CD for Kubernetes
- Continuous Deployment and GitOps delivery with Amazon EKS Blueprints and ArgoCD
- Getting started with GitOps, Argo, and Azure Kubernetes Service
- Building a Fleet of GKE clusters with ArgoCD
Any Kubernetes cluster requires persistent storage, whether organizations choose to begin with an on-premises Kubernetes cluster and migrate to the public cloud, or provision a Kubernetes cluster using a managed service in the cloud.
Kubernetes supports multiple types of persistent storage, from object storage (such as Azure Blob storage or Google Cloud Storage) and block storage (such as Amazon EBS, Azure Disk, or Google Persistent Disk) to file sharing storage (such as Amazon EFS, Azure Files, or Google Cloud Filestore).
The fact that each cloud provider has its own implementation of persistent storage adds to the complexity of storage management, not to mention a scenario where an organization is provisioning Kubernetes clusters over several cloud providers.
Managing Kubernetes clusters successfully over a long period requires storage expertise, in particular knowing which storage type to use for each scenario.
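One place where the choice of storage type shows up directly is the access mode of a PersistentVolumeClaim. As a rough rule of thumb, block volumes are typically attached to a single node (`ReadWriteOnce`), while file shares can usually be mounted by Pods on many nodes (`ReadWriteMany`). The sketch below encodes that assumption; the storage class names `block-ssd` and `shared-files` are hypothetical and depend entirely on how your cluster is configured.

```python
def build_pvc(name, size_gi, shared_across_pods, storage_class):
    """Return a PersistentVolumeClaim manifest as a plain dict.

    shared_across_pods=True requests ReadWriteMany (file-share style storage);
    otherwise ReadWriteOnce (block-volume style storage) is requested.
    """
    access_mode = "ReadWriteMany" if shared_across_pods else "ReadWriteOnce"
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": [access_mode],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": f"{size_gi}Gi"}},
        },
    }


if __name__ == "__main__":
    # Hypothetical claims: a database volume and a shared upload directory.
    database_claim = build_pvc("db-data", 100, False, "block-ssd")
    uploads_claim = build_pvc("uploads", 50, True, "shared-files")
    for claim in (database_claim, uploads_claim):
        print(claim["metadata"]["name"], claim["spec"]["accessModes"])
```

Which access modes are actually supported depends on the storage driver behind each storage class, so check your provider's documentation before relying on this mapping.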
High availability is a common requirement for any production workload.
The fact that we need to maintain multiple Kubernetes clusters (for example, one cluster per environment such as Dev, Test, and Prod), sometimes on top of multiple cloud providers, makes things challenging.
We need to design in advance where to provision our cluster(s), thinking about constraints such as multiple availability zones, and sometimes thinking about how to provision multiple Kubernetes clusters in different regions, while keeping HA requirements, configurations, secrets management, and more.
Designing and maintaining HA in Kubernetes clusters requires a deep understanding of Kubernetes internals, combined with knowledge about specific cloud providers' Kubernetes management plane.
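One simple HA check is whether the worker nodes of a cluster are actually spread across several availability zones. Kubernetes nodes carry the well-known label `topology.kubernetes.io/zone`, and the sketch below counts nodes per zone from that label. The simplified node dictionaries, the zone names, and the minimum of three zones are assumptions for this example; in a real cluster you would read the nodes from the Kubernetes API.

```python
from collections import Counter

ZONE_LABEL = "topology.kubernetes.io/zone"


def nodes_per_zone(nodes):
    """Count nodes per availability zone; nodes without the label count as 'unknown'."""
    return Counter(node["labels"].get(ZONE_LABEL, "unknown") for node in nodes)


def meets_zone_spread(nodes, min_zones=3):
    """Return True if the nodes cover at least `min_zones` known zones."""
    zones = nodes_per_zone(nodes)
    zones.pop("unknown", None)
    return len(zones) >= min_zones


if __name__ == "__main__":
    # Hypothetical node inventory with simplified structure.
    cluster_nodes = [
        {"name": "node-1", "labels": {ZONE_LABEL: "zone-a"}},
        {"name": "node-2", "labels": {ZONE_LABEL: "zone-a"}},
        {"name": "node-3", "labels": {ZONE_LABEL: "zone-b"}},
    ]
    print(dict(nodes_per_zone(cluster_nodes)))
    print("spread OK:", meets_zone_spread(cluster_nodes))
```

Node placement is only part of the picture: the control plane, the storage, and the way Pods are scheduled across zones also need to be designed for failure.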
- Designing Production Workloads in the Cloud
- Amazon EKS Best Practices Guide for Reliability
- AKS - High availability Kubernetes cluster pattern
- GKE best practices: Designing and building highly available clusters
Cost is an important factor in managing environments in the cloud.
It can be very challenging to design and maintain multiple Kubernetes clusters while trying to optimize costs.
To monitor cost, we need to deploy cost management tools (either the basic services provided by the cloud provider or dedicated third-party cost management tools).
For each Kubernetes cluster, we need to decide on node instance size (amount of CPU/Memory), and over time, we need to review the node utilization and try to right-size the instance type.
For non-production clusters (such as Dev or Test), we need to understand from the cloud vendor documentation what our options are for scaling the cluster down to the minimum when it is not in use, and for spinning it back up when required.
Each cloud provider has its own pricing options for provisioning Kubernetes clusters. For example, we might choose reserved instances or savings plans for production clusters that run 24/7, while for temporary Dev or Test environments we might choose Spot instances to save cost.
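To see why scaling down non-production clusters matters, here is a back-of-the-envelope sketch comparing a Dev cluster that runs around the clock with one that runs only during working hours. The node count, the hourly price, and the 12 hours a day, 5 days a week schedule are made-up numbers for illustration and are not real prices from any cloud provider.

```python
HOURS_PER_WEEK = 24 * 7


def weekly_cost(node_count, hourly_price, hours_running=HOURS_PER_WEEK):
    """Estimate the weekly compute cost of a node pool."""
    return node_count * hours_running * hourly_price


def savings_ratio(baseline_cost, optimized_cost):
    """Return the fraction of the baseline cost that is saved."""
    return 1 - optimized_cost / baseline_cost


if __name__ == "__main__":
    # Hypothetical Dev cluster: 3 nodes at an assumed price of 0.10 per node-hour.
    always_on = weekly_cost(3, 0.10)
    # Same cluster, running only 12 hours a day, 5 days a week.
    working_hours = weekly_cost(3, 0.10, hours_running=12 * 5)
    print(f"always on:     {always_on:.2f} per week")
    print(f"working hours: {working_hours:.2f} per week")
    print(f"saved:         {savings_ratio(always_on, working_hours):.0%}")
```

The same helper can be used to compare pricing models by swapping in the hourly price of an on-demand, reserved, or Spot instance, keeping in mind that Spot capacity can be reclaimed by the provider.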
- Cost optimization for Kubernetes on AWS
- Azure Kubernetes Service (AKS) – Cost Optimization Techniques
- Best practices for running cost-optimized Kubernetes applications on GKE
- 5 steps to bringing Kubernetes costs in line
- 4 Strategies for Kubernetes Cost Reduction
Running Kubernetes clusters requires a lot of knowledge, from design, provisioning, and maintenance (usually done by DevOps or experienced cloud engineers) to the deployment of new applications (usually done by development teams).
It is crucial to invest in employee training, in all aspects of Kubernetes.
Constant updates using vendor documentation, online courses, blog posts, meetups, and technical conferences will enable teams to gain the knowledge required to keep up with Kubernetes updates and changes.
- Kubernetes Blog
- AWS Containers Blog
- Azure Kubernetes Service (AKS) issue and feature tracking
- Google Cloud Blog - Containers & Kubernetes
Kubernetes solves part of the problems related to container orchestration.
As an open-source solution, it can integrate with other complementary open-source solutions (for monitoring, security and governance, cost management, and more).
Every organization might wish to use a different set of tools to achieve each task relating to the ongoing maintenance of the Kubernetes cluster or for application deployment.
Selecting the right tools can be challenging as well, due to various business or technological requirements.
It is recommended to evaluate and select Kubernetes-native tools to achieve the previously mentioned tasks or resolve the mentioned challenges.
In this blog post, I have reviewed the most common Day 2 Kubernetes challenges.
I cannot stress enough the importance of employee training in deploying and maintaining Kubernetes clusters.
It is highly recommended to evaluate and look for a centralized management platform for deploying, monitoring (using cloud-native tools), and securing the entire fleet of Kubernetes clusters in the organization.
Another important recommendation is to invest in automation – from policy enforcement to application deployment and upgrade, as part of the CI/CD pipeline.
I recommend you continue learning and expanding your knowledge in the ever-changing world of Kubernetes.
Eyal Estrin is a cloud and information security architect, the owner of the blog Security & Cloud 24/7 and the author of the book Cloud Security Handbook, with more than 20 years in the IT industry.
Eyal has been an AWS Community Builder since 2020.
You can connect with him on Twitter and LinkedIn.