<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>The Ops Community ⚙️: Gulcan Topcu</title>
    <description>The latest articles on The Ops Community ⚙️ by Gulcan Topcu (@gulcantopcu).</description>
    <link>https://community.ops.io/gulcantopcu</link>
    <image>
      <url>https://community.ops.io/images/fKzA16G8h6S_GLv3cFbtfacGFQ7iAbdvZ-SWlE8F41g/rs:fill:90:90/g:sm/mb:500000/ar:1/aHR0cHM6Ly9jb21t/dW5pdHkub3BzLmlv/L3JlbW90ZWltYWdl/cy91cGxvYWRzL3Vz/ZXIvcHJvZmlsZV9p/bWFnZS8zMzEwLzk0/MTUwMzE5LTNhMmUt/NDA2NS04NjI4LTM5/ZmUwZmIyZjMxYi5w/bmc</url>
      <title>The Ops Community ⚙️: Gulcan Topcu</title>
      <link>https://community.ops.io/gulcantopcu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://community.ops.io/feed/gulcantopcu"/>
    <language>en</language>
    <item>
      <title>Hacking Alibaba Cloud's Kubernetes Cluster</title>
      <dc:creator>Gulcan Topcu</dc:creator>
      <pubDate>Tue, 02 Jul 2024 06:34:03 +0000</pubDate>
      <link>https://community.ops.io/gulcantopcu/hacking-alibaba-clouds-kubernetes-cluster-3g11</link>
      <guid>https://community.ops.io/gulcantopcu/hacking-alibaba-clouds-kubernetes-cluster-3g11</guid>
      <description>&lt;p&gt;Hacking Alibaba Cloud's Kubernetes Cluster with Hillai Ben-Sasson &amp;amp;Ronen Shustin, Security Researchers at Wiz and Bart Farrell, KubeFM Host&lt;/p&gt;

&lt;p&gt;Securing Kubernetes clusters is one of the toughest challenges in cloud security, but for Ronen Shustin and Hillai Ben-Sasson at Wiz, it's just another day at work. These top-tier researchers are fearless in diving into the deep end. Their latest exploit? Cracking Alibaba Cloud's Kubernetes clusters through clever PostgreSQL vulnerabilities.&lt;/p&gt;

&lt;p&gt;Join Bart Farrell as he dives into how their innovative approach identifies vulnerabilities and enhances the overall security of cloud ecosystems.&lt;/p&gt;

&lt;p&gt;You can watch (or listen to) this interview &lt;a href="https://kube.fm/hacking-alibaba-ronen-hillai"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: What are three emerging Kubernetes or other tools that you're keeping an eye on?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; Ronen and I know Kubernetes well, but only from working with it directly. We're hackers who transitioned into Kubernetes hacking, not Kubernetes experts who started hacking. So, we've had to familiarize ourselves with many Kubernetes tools as we go. Most of the tools we know are those we've encountered and exploited during our engagements. Therefore, we might not be the best source for the latest Kubernetes tools, but we are excited about ongoing Kubernetes research.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; Are there any specific tools or infrastructure that you particularly like?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ronen:&lt;/strong&gt; Instead of specific tools, we're more interested in infrastructure elements like service meshes. From an attacker's perspective, engaging with these is quite fascinating. At the moment, there aren't any standout tools we'd single out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; For those unfamiliar, can you tell us more about your roles and what you do at&lt;a href="https://www.wiz.io/"&gt; Wiz&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; Ronen and I work at Wiz, a cloud security company, as part of the vulnerability research team. We focus on researching primary cloud services and providers like Azure, GCP, AWS, and more. We utilize their open&lt;a href="https://en.wikipedia.org/wiki/Bug_bounty_program"&gt; bug bounty programs&lt;/a&gt; to find and report vulnerabilities. By sharing our findings, we aim to enhance the security of the cloud community, not just for our clients but for everyone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; Is hacking cloud environments your primary focus, or is this a specialized area within security research?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; It's unique. We didn't start with cloud environments. We began as general security researchers, focusing on hacking techniques. Over time, we transitioned into specializing in cloud security. Our research aims to discover innovative ways attackers might exploit cloud systems, ultimately leading to more secure cloud environments for everyone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; How has your hacking experience influenced your approach to Kubernetes security? Did you discover any exciting findings during this research?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; Many cloud providers rely on Kubernetes and container technology to manage their services efficiently. Traditionally, setting up individual virtual or physical machines for each customer wouldn't scale for most companies. Containers offer a more efficient way to manage large infrastructures. Focusing on cloud environments, we found Kubernetes to be the go-to tool for&lt;a href="https://www.alibabacloud.com/"&gt; Alibaba Cloud&lt;/a&gt; and companies like IBM. Our journey started with cloud security research and ultimately led us to specialize in Kubernetes security within that domain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ronen:&lt;/strong&gt; Our initial focus was on container security. We researched container escapes and other vulnerabilities that might impact containers. This research naturally led us to Kubernetes, as many infrastructures we encountered used it. We had to learn Kubernetes and develop specific techniques to achieve our goals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; If you could go back in time and share one career tip with your younger self, what would it be?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; Always follow your curiosity. Research is all about pursuing leads and hunches. We were curious about cloud security, even though we didn't start in that field. It became popular, and we wanted to explore this new area. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; What resources do you use to stay updated on Kubernetes?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ronen:&lt;/strong&gt; I rely on technical documents the most. I also follow blogs from cloud providers, mainly the&lt;a href="https://www.cncf.io/blog/"&gt; CNCF blog&lt;/a&gt;, because they have valuable information. I follow the Kubernetes community on Twitter to learn about new features and technologies; they are highly active there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; Additionally, I recommend Reddit. Many communities focused on security, Kubernetes, and cloud computing offer great content. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; We came across an article about how you hacked Alibaba Cloud's Kubernetes cluster and&lt;a href="https://www.youtube.com/watch?v=d81qnGKv4EE"&gt; a talk you gave at KubeCon&lt;/a&gt;. What motivated you to do this research, and did your company support you?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; Our company supports security research. At Wiz, we focus on cloud security research, often utilizing&lt;a href="https://en.wikipedia.org/wiki/Offensive_Security"&gt; offensive security&lt;/a&gt; methodologies. We act like attackers to find vulnerabilities and then report them to the vendors. By identifying vulnerabilities, we can report them to the cloud providers and prevent actual attacks. Alibaba Cloud is just one example of this engagement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ronen:&lt;/strong&gt; Our research often leads us to discover new hacking techniques. We share these discoveries with everyone so they can protect themselves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; One of our previous guests talked about Kubernetes secrets management and&lt;a href="https://owasp.org/www-community/Threat_Modeling"&gt; threat modelling&lt;/a&gt;. How do you approach exploiting vulnerabilities from a hacker's perspective?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ronen:&lt;/strong&gt; Our best security insights come from working with different applications, frameworks, and cloud systems. When we engage with one, our primary goal is to find critical security mistakes in its setup. To do this, we must fully understand how the system works and where attackers might discover weaknesses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; There's an interesting difference between traditional and cloud security research. In traditional research, the goal is often to achieve "Remote Code Execution" (&lt;a href="https://en.wikipedia.org/wiki/Remote_code_execution"&gt;RCE&lt;/a&gt;) on a specific application, which means taking control of a machine and running unauthorized code. However, in the cloud, things are different. Since you often have access to a virtual machine yourself, RCE becomes less attractive.&lt;/p&gt;

&lt;p&gt;The real challenge in cloud security lies in breaching the barriers between different customers. Unlike traditional environments, the cloud is a shared space with hundreds of thousands of users. Our focus is to demonstrate the possibility of attackers moving between these customers, even without data access. This highlights a unique cloud security risk - the potential for attackers to "jump" from one user to another and compromise their information. This type of research, proving a breach of trust without actually stealing data, is a crucial aspect of cloud security and something rarely seen in traditional security research.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; When starting this research, why did you choose Alibaba Cloud? &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ronen:&lt;/strong&gt; Our initial study focused on&lt;a href="https://www.postgresql.org/"&gt; PostgreSQL&lt;/a&gt;. Since many cloud providers offer managed PostgreSQL instances, we were interested in how they handle the infrastructure. We discovered vulnerabilities that allowed us to execute code on these instances. We tested several providers, including Alibaba, and presented our findings at&lt;a href="https://www.blackhat.com/us-23/briefings/schedule#bingbang-hacking-bingcom-and-much-more-with-azure-active-directory-33206"&gt; the Black Hat talk&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai&lt;/strong&gt;: We began with PostgreSQL and expanded to Alibaba and other cloud providers. Our&lt;a href="https://www.wiz.io/blog/the-cloud-has-an-isolation-problem-postgresql-vulnerabilities"&gt; blog post&lt;/a&gt; provides more details about PostgreSQL and our Black Hat talk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; Why did you choose to focus on PostgreSQL for your research?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ronen:&lt;/strong&gt; PostgreSQL is a robust database with many features, including the ability to execute code within the database. While this capability can benefit certain users, it poses a potential security risk in cloud environments.&lt;/p&gt;

&lt;p&gt;Cloud providers typically modify PostgreSQL to prevent users from executing code on their managed instances. However, our research identified vulnerabilities in these modifications, not in the core PostgreSQL code itself. We were able to exploit these vulnerabilities to bypass the restrictions and still execute code on the managed databases.&lt;/p&gt;
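&lt;p&gt;For context, one well-documented mechanism in stock PostgreSQL that lets privileged roles run code on the host (not necessarily the exact vector exploited in this research) is COPY ... TO PROGRAM:&lt;/p&gt;

```sql
-- Available to superusers and to members of pg_execute_server_program.
-- Managed cloud offerings try to block paths like this; the research
-- described above targeted flaws in those provider-added restrictions.
COPY (SELECT 1) TO PROGRAM 'id > /tmp/proof-of-execution';
```

&lt;p&gt;This is exactly the class of feature providers strip or guard on managed instances.&lt;/p&gt;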

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; How does PostgreSQL relate to Kubernetes in this context? Did you find a way to access a Kubernetes cluster by exploiting the PostgreSQL vulnerabilities?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; Cloud providers often use containers and orchestration tools like Kubernetes to manage large-scale services, including PostgreSQL. This approach allows them to offer these services to many customers efficiently. While exploiting the PostgreSQL vulnerabilities, we discovered that we were actually in a Kubernetes environment. The user interface typically abstracts away the underlying infrastructure from the user, but our research methods revealed it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ronen:&lt;/strong&gt; We've seen various infrastructures, but Alibaba and IBM used Kubernetes for their managed services. Other providers might use different implementations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; Security experts often talk about avoiding vulnerabilities caused by misconfigurations, which can be human errors. What were the biggest misconfigurations you found that created security risks?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai&lt;/strong&gt;: The biggest misconfiguration we found is treating containers as the only security barrier. Containers can be one layer within a larger security system, but they shouldn't be the only layer you rely on. Containers alone aren't strong enough to fully isolate each company's data from the others, because any security flaw in the core Linux system (the kernel) can bypass container security. We were able to exploit such misconfigurations during our research.&lt;/p&gt;

&lt;p&gt;Another problem is poorly managed secrets within the Kubernetes environment. These secrets could read information across the system and write and change it, which meant we could overwrite software packages used by many cloud services and customer accounts within Alibaba. Essentially, these powerful secrets allowed someone to access different environments, services, and customer data—all with a single key. That's a significant security risk we wouldn't recommend taking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ronen:&lt;/strong&gt; The specific secret we found was the&lt;a href="https://kubernetes.io/docs/concepts/containers/images#specifying-imagepullsecrets-on-a-pod"&gt; image pull secret&lt;/a&gt;. In Kubernetes, when you want to download images from a private registry, you need this secret to authenticate to the registry. If you misconfigure it, you might accidentally include a key with push permissions instead of pull permissions. This key should only allow downloading images, not uploading them. If an attacker gains access to a key with push permissions (like what we achieved in Alibaba), it could have devastating consequences for your entire environment.&lt;/p&gt;
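&lt;p&gt;To make the stakes concrete, here is a minimal sketch (registry, username, and token are made-up placeholders) of what an image pull secret actually contains - base64-encoded registry credentials - and what an attacker recovers by decoding it:&lt;/p&gt;

```python
import base64
import json

# A Kubernetes image pull secret stores registry credentials as a
# base64-encoded Docker config JSON (the ".dockerconfigjson" key).
# Registry, username, and token below are made-up placeholders.
docker_config = {
    "auths": {
        "registry.example.com": {
            "username": "ci-bot",
            # If this token was issued with push (write) scope instead of
            # pull (read-only) scope, anyone who can read the secret can
            # overwrite images that other workloads pull.
            "auth": base64.b64encode(b"ci-bot:push-scoped-token").decode(),
        }
    }
}

# This is the value Kubernetes stores in the secret's data field.
secret_data = base64.b64encode(json.dumps(docker_config).encode()).decode()

# What an attacker recovers by decoding the secret:
leaked = json.loads(base64.b64decode(secret_data))
user, token = base64.b64decode(
    leaked["auths"]["registry.example.com"]["auth"]
).decode().split(":", 1)
print(user, token)
```

&lt;p&gt;If the embedded token carries push scope, those two decoded strings are all an attacker needs to start overwriting shared images.&lt;/p&gt;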

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: To those without a strong background in security, it may seem that security experts click a button, scan your system, and find vulnerabilities. However, security research, like many other fields, is a blend of art and science. Can you elaborate on this further?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; Security research requires a lot of creativity. When you hear about a new attack vector, it boils down to creative thinking - coming up with something no one else has considered. In this research, we started by looking for patterns we already knew were risky, like overly permissive settings and shared volumes. We had to think outside the box. Returning to the Alibaba Cloud control panel, we began experimenting. This exploration led us to a breakthrough when we discovered a button enabling SSL encryption for the PostgreSQL instance. Clicking it triggered new activity in the container, which we followed to escape the container.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; To help our audience understand, could you explain&lt;a href="https://en.wikipedia.org/wiki/Secure_copy_protocol"&gt; SCP&lt;/a&gt;, its role in the attack, and how you exploited it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; SCP stands for Secure Copy. It's a standard tool on Linux systems that transfers files between machines using secure SSH connections. In our case, the SSL encryption feature we triggered used a new Alibaba management container. This container ran the SCP command on our container to move the SSL certificate.&lt;/p&gt;

&lt;p&gt;SCP reads its configuration from a directory we control within our container by default. We placed a malicious SSH configuration file there. When the SCP command loaded this configuration, it ran a command we placed within the file. This trick let us escape our limited container and jump to the Alibaba Management Container because it unknowingly executed our command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ronen:&lt;/strong&gt; A crucial factor in this exploit was the shared volume. This volume acted like a shared home directory for our container and the management container since the same user existed in both containers. We could exploit this shared space because SCP reads its configuration from the user's home directory by default. By replacing the default configuration with ours containing a malicious command, we tricked the management container into running it when it used SCP.&lt;/p&gt;
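&lt;p&gt;As an illustrative sketch of the trick described above (the script path is a placeholder), the planted client configuration could look like this - ssh, which scp runs under the hood, honors per-host options such as ProxyCommand from ~/.ssh/config:&lt;/p&gt;

```
# ~/.ssh/config placed in the shared home directory (illustrative).
# When the management container runs scp, ssh reads this file and
# executes the ProxyCommand with the management container's privileges.
Host *
    ProxyCommand /bin/sh /tmp/payload.sh %h %p
```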

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; What does successfully creating a&lt;a href="https://kubernetes.io/docs/concepts/policy/pod-security-policy#privileged"&gt; privileged container&lt;/a&gt; using the&lt;a href="https://docs.docker.com/engine/api"&gt; Docker API&lt;/a&gt; tell us about cloud security in general?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ronen:&lt;/strong&gt; Many cloud environments rely on Docker to manage their containers. You can create a new container through an HTTP request if you gain access to the Docker API socket. This container could be privileged, meaning it shares resources like namespaces and possibly even volumes with the underlying host machine, the Kubernetes node. Spawning a privileged container grants you access to almost everything the node has access to.&lt;/p&gt;
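&lt;p&gt;As a rough sketch (the image name is a placeholder), the request body an attacker with access to the Docker socket could POST to the Engine API's /containers/create endpoint might look like this - the HostConfig fields are what collapse the container/host boundary:&lt;/p&gt;

```python
import json

# Sketch of a Docker Engine API "create container" request body.
# POSTed to the Docker socket (e.g. /var/run/docker.sock, endpoint
# /containers/create), it yields a container that shares the host's
# namespaces and filesystem -- effectively the Kubernetes node itself.
payload = {
    "Image": "alpine:latest",  # placeholder image
    "Cmd": ["sh"],
    "HostConfig": {
        "Privileged": True,     # disables most isolation mechanisms
        "PidMode": "host",      # share the host PID namespace
        "NetworkMode": "host",  # share the host network namespace
        "Binds": ["/:/host"],   # mount the node's root filesystem
    },
}

body = json.dumps(payload)
print(body)
```

&lt;p&gt;With the node's root filesystem bind-mounted and namespaces shared, the "container" is effectively a shell on the node.&lt;/p&gt;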

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; You transition from being a guest in the container to gaining complete control of the host machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Gaining access to the node would only give you control of part of the Kubernetes cluster, wouldn't it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ronen:&lt;/strong&gt; With code execution on the node, we could use&lt;a href="https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet"&gt; Kubelet&lt;/a&gt; credentials to explore further, looking for commands, code, secrets, and other information. In our case, Alibaba had misconfigured its Kubelet credentials: they were far too powerful. We could list all pods, see all the code in the cluster, potentially containing customer data, and even retrieve all the secrets using the "kubectl get secret" command. This misconfiguration was the key that unlocked broader access for us.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; Did you achieve the entire exploit on a single node within the cluster?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ronen:&lt;/strong&gt; Yes, we were on a single node. Using the compromised Kubelet credentials, we could see all the other nodes and resources in the cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; While the specific node we compromised was isolated and didn't contain data from other customers, the service account associated with Kubelet had excessive permissions. Even though the node itself was secure, this service account allowed us to access sensitive information across the entire cluster, including pods, nodes, and secrets belonging to other customers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; What was the next step after taking over Alibaba's managed PostgreSQL offering? Did you contact Alibaba to report your findings?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; Once we discovered the ability to access data belonging to other customers, our research stopped immediately. We wouldn't risk even accidentally accessing someone else's data. At that point, we documented everything we found and sent a detailed report to Alibaba Cloud, and they responded quickly and professionally. They kept us updated on the fixes they deployed throughout the research process. We immediately report any critical issues to prevent others from exploiting them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; Can you tell us about any specific fixes they implemented based on your findings?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ronen:&lt;/strong&gt; The first issue was a misconfiguration that falsely indicated increased resource consumption. We exploited it to execute unauthorized code on the operating system. We collaborated with Alibaba Cloud to fix this problem. They also resolved the SCP vulnerability problem that allowed unauthorized access to their management container. Finally, they restricted the Kubelet permissions to a narrower scope, granting only specific permissions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; Following our research, Alibaba took several steps to address the vulnerabilities we discovered. They limited image pull secret permissions to read-only access, preventing unauthorized uploads. Additionally, they implemented a secure container technology similar to Google's&lt;a href="https://gvisor.dev"&gt; gVisor&lt;/a&gt; project. This technology hardens containers and makes them more difficult to escape from, adding another layer of security.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; Throughout this process, what key lessons did you learn?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; There are two main lessons learned. First, containers shouldn't be relied on as the sole security barrier. While they can be a layer of security, they can be bypassed in various ways. Additional precautions are crucial to ensure proper isolation between customers. We recommend building a layered defense so that a single vulnerability doesn't allow unauthorized access to a competitor company's data.&lt;/p&gt;

&lt;p&gt;Second, strong credentials require careful management. As Ronen mentioned, Alibaba originally had a powerful secret that could read and write across the cluster. This secret also had push access to the central Docker image registry. Following our report, they limited the scope of these credentials. It's essential to be very cautious with such powerful secrets. Ideally, you should scope secrets to specific actions and minimize them whenever possible. A powerful secret can allow attackers to move across different environments, including production, development, testing, and even development workstations.&lt;/p&gt;

&lt;p&gt;Another lesson learned relates to the container itself. The SCP vulnerability we exploited highlights the risk of shared namespaces between containers. In the Alibaba incident, the shared namespace and home directory allowed us to exploit the SCP vulnerability. Always be very careful when sharing namespaces between trusted and untrusted containers. The lesson learned is to minimize what you share and never grant unnecessary permissions. Attackers may exploit even seemingly minor misconfigurations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; Can you recommend any specific tools that people might need to be aware of if they want to discuss implementing some of these mitigation tactics with their managers?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; There's one framework I highly recommend:&lt;a href="https://github.com/wiz-sec/peach"&gt; Peach&lt;/a&gt;. It's an open-source project developed by our research team with contributions from fantastic people at many companies.&lt;/p&gt;

&lt;p&gt;Peach is a framework that outlines how to build secure and isolated environments, whether in the cloud or not. Like a white paper, it's a valuable resource that guides you on properly isolating tenants or customers in a multi-tenant environment. It covers common mistakes to avoid, what to look out for, and how to implement the necessary precautions.&lt;/p&gt;

&lt;p&gt;If you manage a multi-tenant environment or need to isolate resources within your environment, Peach is well worth exploring. It's completely open-source and available on&lt;a href="https://github.com/wiz-sec/peach"&gt; GitHub&lt;/a&gt;. We also welcome contributions from anyone with additional tips or tricks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ronen:&lt;/strong&gt; I also recommend using secret scanning tools. These tools are essential in our research; we use them to identify potential secrets-related vulnerabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; Do you have any recommendations for securing multi-tenant Kubernetes clusters?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ronen:&lt;/strong&gt; Securing multi-tenant Kubernetes clusters involves a few key areas. First, prioritize network security. By default, Kubernetes doesn't restrict pod-to-pod communication, so strong network isolation is essential.&lt;/p&gt;

&lt;p&gt;Second, separating namespaces between customers is a good practice when dealing with multi-tenancy.&lt;/p&gt;

&lt;p&gt;Additionally, consider implementing container security technologies like gVisor or&lt;a href="https://katacontainers.io/"&gt; Kata Containers&lt;/a&gt;. Don't solely rely on Docker's security features to prevent container escapes.&lt;/p&gt;
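&lt;p&gt;As a minimal sketch of the network-isolation advice above (the namespace name is a placeholder), a default-deny NetworkPolicy in each tenant's namespace blocks all traffic until it is explicitly re-allowed:&lt;/p&gt;

```yaml
# Deny all ingress and egress for every pod in a tenant's namespace;
# each allowed flow must then be opened with an explicit policy.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: tenant-a   # placeholder tenant namespace
spec:
  podSelector: {}       # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```

&lt;p&gt;Note that a CNI plugin that enforces NetworkPolicy (for example Cilium or Calico) is required for this to take effect.&lt;/p&gt;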

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; What advice would you give for hardening containers to make them more secure?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ronen:&lt;/strong&gt; Our case study with Alibaba revealed they were sharing Linux namespaces between containers, such as their management container and our container. Sharing Linux namespaces can be dangerous. When designing a system that shares namespaces or resources between management and regular user containers, always carefully assess the risks involved. Container technologies like gVisor and&lt;a href="https://katacontainers.io/"&gt; Kata Containers&lt;/a&gt; can mitigate the risk of attackers exploiting Linux kernel vulnerabilities in your environment to achieve kernel-level code execution and jump to the Kubernetes node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; What advice would you give to Kubernetes engineers needing more security experience?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; Security is crucial. Companies of all sizes, from startups to large corporations, are constantly targeted by malicious actors, not just ethical hackers like us. Anyone managing a service on the internet must understand that they are a potential target for cyberattacks. These attacks range from data breaches to ransomware attacks that shut down your entire operation. Even small projects need to pay attention to security.&lt;/p&gt;

&lt;p&gt;The good news is that many tools can help you achieve security without being a security expert. Tools like gVisor are relatively easy to implement because you don't need to write them from scratch. By using security hardening tools, you gain significant protection benefits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ronen:&lt;/strong&gt; Besides the tools, many online resources are available to learn about security. These resources can help you understand security risks and how to mitigate them. Kubernetes itself has built-in security features, including default security policies. Be security-conscious and take steps to secure your environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; You discover a vulnerability and report it to the vendor. What prevents you from exploiting the vulnerability for malicious purposes instead? Wouldn't Alibaba eventually find the problem on its own?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ronen:&lt;/strong&gt; We started seeing signs that Alibaba was taking steps to address the issue while we were still in the research phase. They were transparent with us about their efforts. Cloud providers all have security teams that constantly monitor their environments. They likely knew we were there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; Cloud providers are doing a great job with security. We're ethical hackers; our goal is to improve security for the cloud community. Penetration testing, or offensive research, is a tool to achieve that goal. We want to fix the vulnerabilities, and it's rewarding to hear that our reports lead to security updates that benefit many customers. We do this to make cloud products more secure and help users learn how to secure their deployments.&lt;/p&gt;

&lt;p&gt;We publish blogs and give talks so that security professionals and developers can learn from our research and identify potential problems in their environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; What's next on the agenda for you both?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; We're always working on new research projects.&lt;a href="https://www.wiz.io/authors/sagi"&gt; Sagi&lt;/a&gt; from our team recently published a blog about a vulnerability in&lt;a href="https://www.wiz.io/blog/wiz-and-hugging-face-address-risks-to-ai-infrastructure"&gt; Hugging Face&lt;/a&gt;, an AI provider. We have several ongoing projects under disclosure, meaning we can only reveal them once we fix the vulnerabilities.&lt;/p&gt;

&lt;p&gt;Follow our blog; it's the first place we announce new findings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ronen:&lt;/strong&gt; Our research will benefit the Kubernetes security community as well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart:&lt;/strong&gt; How can people contact you if they have questions?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hillai:&lt;/strong&gt; We're both on Twitter. My handle is&lt;a href="https://x.com/hillai"&gt; @hillai&lt;/a&gt;, and Ronen's is&lt;a href="https://x.com/RonenSHH"&gt; @RonenSHH&lt;/a&gt;. You can also email us at &lt;a href="mailto:research@wiz.io"&gt;research@wiz.io&lt;/a&gt;, but Twitter is the best way. Make sure to spell the names correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrap up&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you enjoyed this interview and want more Kubernetes stories and opinions, visit&lt;a href="https://kube.fm/"&gt; KubeFM&lt;/a&gt; and subscribe to the podcast.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you want to keep up-to-date with Kubernetes, subscribe to&lt;a href="https://learnk8s.io/learn-kubernetes-weekly"&gt; Learn Kubernetes Weekly&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;If you're going to become an expert in Kubernetes, look at courses on&lt;a href="https://learnk8s.io/training"&gt; Learnk8s&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;If you want to keep in touch, follow me on&lt;a href="https://www.linkedin.com/in/gulcantopcu/"&gt; Linkedin&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>security</category>
      <category>devops</category>
      <category>cloudops</category>
    </item>
    <item>
      <title>eBPF, sidecars, and the future of the service mesh</title>
      <dc:creator>Gulcan Topcu</dc:creator>
      <pubDate>Fri, 07 Jun 2024 07:56:41 +0000</pubDate>
      <link>https://community.ops.io/gulcantopcu/ebpf-sidecars-and-the-future-of-the-service-mesh-3hb3</link>
      <guid>https://community.ops.io/gulcantopcu/ebpf-sidecars-and-the-future-of-the-service-mesh-3hb3</guid>
      <description>&lt;p&gt;Kubernetes and service meshes may seem complex, but not for William Morgan, an engineer-turned-CEO who excels at simplifying the intricacies. In this enlightening podcast, he shares his journey from AI to the cloud-native world with Bart Farrell. &lt;/p&gt;

&lt;p&gt;Discover William's cost-saving strategies for service meshes, gain insights into the ongoing debate between sidecars, Ambient Mesh, and Cilium Cluster Mesh, his surprising connection to Twitter's early days and unique perspective on balancing tech expertise with the humility of being a piano student.&lt;/p&gt;

&lt;p&gt;You can watch (or listen) to this interview &lt;a href="https://kube.fm/service-mesh-william"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Imagine you've just set up a fresh Kubernetes cluster. What's your go-to trio for the first tools to install?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;William&lt;/strong&gt;: My first pick would be &lt;a href="https://linkerd.io/"&gt;Linkerd&lt;/a&gt;. It's a must-have for any Kubernetes cluster. I then lean towards tools that complement Linkerd, like &lt;a href="https://argo-cd.readthedocs.io/en/stable/"&gt;Argo&lt;/a&gt; and &lt;a href="https://cert-manager.io/"&gt;cert-manager&lt;/a&gt;. You're off to a solid start with these three.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Cert Manager and Argo are popular choices, especially in the GitOps domain. What about &lt;a href="https://fluxcd.io/"&gt;Flux&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;William&lt;/strong&gt;: Flux would work just fine. I don't have a strong preference between the two. Flux and Argo are great options, especially for tasks like progressive delivery. When paired with Linkerd, they provide a robust safety net for rolling out new code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: As the CEO, who are you accountable to? Could you elaborate on your role and responsibilities?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;William&lt;/strong&gt;: Being a CEO is an exciting shift from my previous role as an engineer. I work for myself, and I must say, I’m a demanding boss. As a CEO, I focus on the big picture and align everyone toward a common goal. These are the two skills I’ve had to develop rapidly since transitioning from an engineer, where my primary concern was writing and maintaining code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: From a technical perspective, how did you transition into the cloud-native space? What were you doing before it became mainstream?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;William&lt;/strong&gt;: My early career was primarily focused on AI, &lt;a href="https://en.wikipedia.org/wiki/Natural_language_processing"&gt;NLP&lt;/a&gt;, and machine learning long before they became trendy. I thought I’d enter academia but realized I enjoyed coding more than research.&lt;/p&gt;

&lt;p&gt;I worked at several Bay Area startups, mainly in NLP and machine learning roles. I was part of a company called PowerSet, which was building a natural language processing engine and was acquired by Microsoft. I then joined Twitter in its early days, around 2010, when it had about 200 employees. I started on the AI side but transitioned to infrastructure because I found it more satisfying and challenging. At Twitter, we were doing what I’d now describe as cloud-native, even though the terminology differed. We didn’t have Kubernetes or Docker, but we had &lt;a href="https://mesos.apache.org/"&gt;Mesos&lt;/a&gt;, the JVM for isolation, and cgroups for a basic form of containerization. We transitioned from a monolithic Ruby on Rails service to a massive microservices deployment. When I left Twitter, we tried to apply those same ideas to the emerging world of Kubernetes and Docker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: How do you keep up with the rapid changes in the Kubernetes and cloud-native ecosystems, especially transitioning from infrastructure and AI/NLP?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;William&lt;/strong&gt;: My current role primarily shapes my strategy. I learn a lot from the engineers and users of &lt;a href="https://www.reddit.com/r/linkerd/new/"&gt;Linkerd&lt;/a&gt;, who are at the forefront of these technologies. I also keep myself updated by reading discussions on subreddits like &lt;a href="https://www.reddit.com/r/kubernetes/"&gt;r/kubernetes&lt;/a&gt; and &lt;a href="https://www.reddit.com/r/linkerd/new/"&gt;r/linkerd&lt;/a&gt;. Occasionally, I contribute to or follow discussions on &lt;a href="https://news.ycombinator.com/"&gt;Hacker News&lt;/a&gt;. Overall, my primary source of knowledge comes from the experts I work with daily, giving me valuable insights into the latest developments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: If you could return to your time at Twitter or even before that, what one tip would you give yourself?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;William&lt;/strong&gt;: I'd tell myself to prioritize impact. As an engineer, I was obsessed with building and exploring new technologies, which was rewarding. However, I later understood the value of stepping back to see where I could make a real difference in the company. Transitioning my focus to high-impact areas, such as infrastructure at Twitter, was a turning point. Despite my passion for NLP, I realized that infrastructure was where I could truly shine. Always look for opportunities where you can make the most significant impact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Let’s focus on "&lt;a href="https://www.techtarget.com/searchitoperations/news/365535362/Sidecarless-eBPF-service-mesh-sparks-debate"&gt;Sidecarless eBPF Service Mesh Sparks Debate&lt;/a&gt;," which follows up on your previous article “&lt;a href="https://buoyant.io/blog/ebpf-sidecars-and-the-future-of-the-service-mesh"&gt;eBPF, sidecars, and the future of the service mesh&lt;/a&gt;.” You're one of the creators of Linkerd. For those unfamiliar, what exactly is a service mesh? Why would someone need it, and what value does it add? &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;William&lt;/strong&gt;: There are two ways to describe service mesh: what it does and how it works. Service mesh is an additional layer for Kubernetes that enhances key areas Kubernetes doesn't fully address. &lt;/p&gt;

&lt;p&gt;The first area is security. It ensures all connections in your cluster are encrypted, authorized, and authenticated. You can set policies based on services, gRPC methods, or HTTP routes, like allowing Service A to talk to /foo but not /bar. &lt;/p&gt;
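&lt;p&gt;As a rough illustration of that "Service A may call /foo but not /bar" idea, here is a sketch using Linkerd-style policy resources. The apiVersions, kinds, and all names here are illustrative and vary by mesh and release; treat it as a shape, not a definitive manifest:&lt;/p&gt;

```yaml
# Illustrative sketch only: apiVersions, fields, and all names are
# placeholders that vary by service mesh and version.
apiVersion: policy.linkerd.io/v1beta3
kind: HTTPRoute
metadata:
  name: foo-route              # hypothetical route covering only /foo
  namespace: demo
spec:
  parentRefs:
    - name: service-b          # the Server resource guarding service B's port
      kind: Server
      group: policy.linkerd.io
  rules:
    - matches:
        - path:
            value: /foo        # /bar is not matched, so it stays denied
---
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: allow-service-a-foo
  namespace: demo
spec:
  targetRef:
    group: policy.linkerd.io
    kind: HTTPRoute
    name: foo-route            # authorize only this route
  requiredAuthenticationRefs:
    - name: service-a          # mTLS identity of the allowed client
      kind: MeshTLSAuthentication
      group: policy.linkerd.io
```

&lt;p&gt;The key point is that authorization attaches to a specific route and a specific mTLS identity, which is what makes the per-method and per-path policies described above possible without touching application code.&lt;/p&gt;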

&lt;p&gt;The second area is reliability. It enables graceful failovers, transparent traffic shifting between clusters, and progressive delivery. For example, deploying new code and gradually increasing traffic to it to avoid immediate production traffic. It also includes mechanisms like load balancing, circuit breaking, retries, and timeouts.&lt;/p&gt;
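&lt;p&gt;The gradual traffic shift described above is often expressed as a weighted split between a stable and a canary backend. A minimal sketch, assuming an SMI-style TrafficSplit (which Linkerd supports and tools like Flagger can adjust automatically); service names are hypothetical:&lt;/p&gt;

```yaml
# Illustrative sketch: a canary rollout via a weighted traffic split.
# All service names are placeholders.
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: web-canary
  namespace: demo
spec:
  service: web              # the apex service clients call
  backends:
    - service: web-stable
      weight: 900           # most traffic stays on the current version
    - service: web-canary
      weight: 100           # a slice goes to the new code; ramp up as metrics hold
```

&lt;p&gt;A progressive delivery controller would then raise the canary weight step by step, rolling back automatically if success rates or latencies degrade.&lt;/p&gt;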

&lt;p&gt;The last area is observability. It provides uniform metrics for all workloads across all services, such as success rates, latency distribution, and traffic volume. Importantly, it does this without requiring changes to your application code. &lt;/p&gt;

&lt;p&gt;The most prevalent method today involves using many proxies. This approach has become feasible thanks to technological advancements like Kubernetes and containers, which simplify the deployment and management of many proxies as a unified fleet. A decade ago, deploying 10,000 proxies would have been absurd, but it is feasible and practical today. The specifics of deploying these proxies, their locations, programming languages, and practices are subject to debate. However, at a high level, service meshes work by running these layer seven proxies that understand HTTP, HTTP2, and gRPC traffic and enable various functionalities without requiring changes to your application code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Can you briefly explain &lt;a href="https://blog.envoyproxy.io/service-mesh-data-plane-vs-control-plane-2774e720f7fc"&gt;how the data and control planes work in service meshes&lt;/a&gt;, especially compared to the older sidecar model with an extra container?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;William&lt;/strong&gt;: A service mesh architecture consists of two main components: a control plane and a data plane. The control plane allows you to manage and configure the data plane, which directs network traffic within the service mesh. In Kubernetes, the control plane operates as a collection of standard Kubernetes services, typically running within a dedicated namespace or across the entire cluster.&lt;/p&gt;

&lt;p&gt;The data plane is the operational core of a service mesh, where proxies manage network traffic. The sidecar model, employed by service meshes like Linkerd, deploys a dedicated proxy alongside each application pod. Therefore, a service mesh with 20 pods would have 20 corresponding proxies. The overall efficiency and scalability of the service mesh rely heavily on the size and performance of these individual proxies.&lt;/p&gt;

&lt;p&gt;In the sidecar model, service A and service B communication flows through service A's and service B's proxy. Service A sends its message to its sidecar proxy, and then the service A proxy forwards it to service B's sidecar proxy. Finally, service B's proxy delivers the message to service B itself. This indirect communication path adds extra hops, leading to a slight increase in latency. You must carefully consider the potential performance impacts to ensure that service mesh benefits outweigh the trade-offs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: We've been discussing the benefits of service meshes, but running an extra container for each pod sounds expensive. Does cost become a significant issue?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;William&lt;/strong&gt;: Service meshes have a compute cost, just like adding any component to a system. You pay for CPU and memory, but memory tends to be the more significant concern, as it can force you to scale up instances or nodes.&lt;/p&gt;

&lt;p&gt;However, Linkerd has minimized this issue with a "micro proxy" written in Rust. Rust's strict memory management allows fast, lightweight proxies and avoids memory vulnerabilities like buffer overflows, which are common in C and C++. Studies from both &lt;a href="https://security.googleblog.com/2024/03/secure-by-design-googles-perspective-on.html"&gt;Google&lt;/a&gt; and Microsoft have shown that roughly 70% of security bugs in C and C++ code are due to memory management errors.&lt;/p&gt;

&lt;p&gt;Our choice of Rust as the programming language in 2018 was a calculated risk. Rust offers the best of both worlds: the speed and control of languages like C/C++ and the safety and ease of use of languages with runtime environments like Go. Rust and its network library ecosystem were still relatively young at that time. We invested significantly in underlying libraries like &lt;a href="https://tokio.rs/"&gt;Tokio&lt;/a&gt;, &lt;a href="https://github.com/tower-rs/tower"&gt;Tower&lt;/a&gt;, and H2 to build the necessary infrastructure.&lt;/p&gt;

&lt;p&gt;The critical role of the data plane in handling sensitive application data drove this decision; we had to ensure its reliability and security. Rust enables us to build small, fast, and secure proxies that scale with traffic while typically using minimal memory, which translates directly to the user experience. Instead of facing long response times (like 5-second tail latencies), users experience faster interactions (closer to 30 milliseconds). A service mesh can optimize these tail latencies, improving user experience and customer retention. Choosing Rust has proven instrumental in achieving these goals.&lt;/p&gt;

&lt;p&gt;While cost is a factor, the actual cost often stems from operational complexity. Do you need dedicated engineers to maintain complex proxies, or does the system primarily work independently? That human cost usually dwarfs the computational one.&lt;/p&gt;

&lt;p&gt;Our design choices have made managing Linkerd’s costs relatively straightforward. However, for other service meshes, costs can escalate if the proxies are large and resource-intensive. Even so, the more significant cost is often not the resources but the operational overhead and complexity. This complexity can demand considerable time and expertise, increasing the overall cost. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: You raise a crucial point about the human aspect. While we address technical challenges, the time spent resolving errors detracts from other tasks. The community has developed products and projects to tackle these concerns and costs. One such example is &lt;a href="https://istio.io/"&gt;Istio&lt;/a&gt; with Ambient Mesh. Another approach is sidecarless service meshes like Cilium Cluster Mesh. Can you explain what Ambient Mesh is and how it enhances the classic sidecar model of service meshes?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;William&lt;/strong&gt;: We've delved deep into both of these options in Linkerd. While there might come a time when adopting these projects makes sense for us, we're not there yet. &lt;/p&gt;

&lt;p&gt;Every decision involves trade-offs regarding distributed systems, especially in production environments within companies where the platform is a tool to support applications. At Linkerd, our priority is constantly reducing the operational workload.&lt;/p&gt;

&lt;p&gt;Ambient Mesh and eBPF aren't primarily reactions to complexity but responses to the practical annoyances of sidecars. Their key selling point is eliminating the need for sidecars. However, the real question is: What's the cost of this shift? That's where the analysis becomes crucial.&lt;/p&gt;

&lt;p&gt;In Ambient Mesh, rather than having sidecar containers, you utilize connective components, such as tunnels, within the namespace. These tunnels communicate with proxies located elsewhere in the cluster. So essentially, you have multiple proxies running outside of the pod, and the pods use these tunnels to communicate with the proxies, which then handle specific tasks.&lt;/p&gt;

&lt;p&gt;This setup is indeed intriguing. As mentioned earlier, running sidecars can be challenging due to specific implications. One such implication is the cost factor, which we discussed earlier. In Linkerd’s case, this is a minor concern. However, a more significant implication is the need to restart the pod to upgrade the proxy to the latest version, given the immutability of pods in Kubernetes.&lt;/p&gt;

&lt;p&gt;This situation necessitates managing two separate updates: one to keep the applications up-to-date and another to upgrade the service mesh. Therefore, while the setup has advantages, it also requires careful management to ensure smooth operation and optimal performance.&lt;/p&gt;

&lt;p&gt;We operate the proxy as the first container for various reasons, which can lead to friction points, such as when using &lt;code&gt;kubectl logs&lt;/code&gt;. Typically, when you request logs, you're interested in your application's logs, not the proxy's. This friction, combined with a desire for networking to operate seamlessly in the background, drives the development of solutions like Ambient and eBPF, which aim to eliminate the need for explicit sidecars.&lt;/p&gt;

&lt;p&gt;Both Ambient and eBPF solutions, which are closely related, are reactions to this sentiment of not wanting to deal with sidecars directly. The aim is to make sidecars disappear. Take &lt;a href="https://istio.io/"&gt;Istio&lt;/a&gt; and most service meshes built on &lt;a href="https://www.envoyproxy.io/"&gt;Envoy&lt;/a&gt;, for instance. Envoy is complex and memory-intensive and requires constant attention and tuning based on traffic specifics.&lt;/p&gt;

&lt;p&gt;Much of the noise about sidecar challenges is a cloud-native marketing trend, like writing a blog post proclaiming the &lt;a href="https://thenewstack.io/ambient-mesh-no-sidecar-required/"&gt;death of sidecars&lt;/a&gt;, rather than something specific to Linkerd. These claims can be an inaccurate reflection of engineering reality.&lt;/p&gt;

&lt;p&gt;In Ambient, eliminating sidecars by running the proxy elsewhere and using tunnel components allows for separate proxy maintenance without needing to reboot applications for upgrades. However, in a Kubernetes environment, the idea is that pods should be rebootable anytime. Kubernetes can reschedule pods as needed, which aligns with the principles of building applications as distributed systems. Yet, there are legacy applications or specific scenarios where rebooting could be more convenient, making the sidecar approach less appealing. &lt;/p&gt;

&lt;p&gt;Historically, running cron jobs with sidecar proxies in Kubernetes posed a significant challenge. Kubernetes lacked a built-in mechanism to signal the sidecar proxy when the main job was complete, necessitating manual intervention to prevent the proxy from running indefinitely. This manual process went against the core principle of service mesh, which aims to decouple services from their proxies for easier management and scalability.&lt;/p&gt;

&lt;p&gt;Thankfully, one significant development is the &lt;a href="https://kubernetes.io/blog/2023/08/25/native-sidecar-containers/"&gt;Sidecar Container Kubernetes Enhancement Proposal&lt;/a&gt;. With this enhancement, you can designate your proxy as a sidecar container, leading to several benefits, like jobs terminating the proxy once finished and eliminating unnecessary resource consumption.&lt;/p&gt;
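&lt;p&gt;Concretely, the native sidecar mechanism (stable as of recent Kubernetes releases, originally KEP-753) declares the proxy as an init container with &lt;code&gt;restartPolicy: Always&lt;/code&gt;, so it runs alongside the main containers and is terminated once a Job's main container exits. A minimal sketch; the image names are hypothetical:&lt;/p&gt;

```yaml
# Sketch of a Job using a native sidecar proxy (Kubernetes 1.28+).
# Both image references are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: report-job
spec:
  template:
    spec:
      initContainers:
        - name: proxy
          image: example.com/mesh-proxy:latest   # hypothetical proxy image
          restartPolicy: Always                  # marks this init container as a sidecar
      containers:
        - name: report
          image: example.com/report:latest       # hypothetical job workload
      restartPolicy: Never
```

&lt;p&gt;Because the kubelet now understands that the proxy is auxiliary, the Job completes when the &lt;code&gt;report&lt;/code&gt; container finishes, without the manual shutdown workarounds described above.&lt;/p&gt;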

&lt;p&gt;For Linkerd, adopting Ambient mesh architecture introduces more complexity than benefits. The additional components, like the tunnel and separate proxies, add unnecessary layers to the system. Unlike Istio, which has encountered issues due to its architecture, Linkerd's existing design hasn't faced similar challenges. Therefore, the trade-offs associated with Ambient aren't justified for Linkerd.&lt;/p&gt;

&lt;p&gt;In contrast, the sidecar model offers distinct advantages. It creates clear operational and security boundaries at the pod level. Each pod becomes a self-contained unit, making independent decisions regarding security and operations, aligning with Kubernetes principles, and simplifying management in a cloud-native environment.&lt;/p&gt;

&lt;p&gt;This sidecar approach is crucial for implementing &lt;a href="https://www.cloudflare.com/learning/security/glossary/what-is-zero-trust/#:~:text=Zero%20Trust%20security%20is%20an,outside%20of%20the%20network%20perimeter."&gt;zero-trust&lt;/a&gt; security. The critical principle of zero trust is to enforce security policies at the most granular level possible. Traditional approaches relying on a perimeter firewall and implicitly trusting internal components are no longer sufficient. Instead, each security decision must be made independently at every system layer. This granular enforcement is achieved by deploying a sidecar proxy within each application pod, acting as a security boundary and enabling fine-grained control over network traffic, authentication, and authorization.&lt;/p&gt;

&lt;p&gt;In Linkerd, every request undergoes a rigorous security check within the pod. This check includes verifying the validity of the TLS encryption, confirming the client's identity through cryptographic algorithms, and ensuring the request comes from a trusted source. Additionally, Linkerd checks whether the request can access the specific resource or method it's trying to reach. This multi-layered scrutiny happens directly inside the pod, providing the highest possible level of security within the Kubernetes framework. Maintaining this tight security model is crucial, as any deviation, like separating the proxy and TLS certificate, weakens the model and introduces potential vulnerabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: The next point I'd like to discuss has garnered significant attention in recent years through &lt;a href="https://cilium.io/use-cases/service-mesh/"&gt;Cilium Service Mesh&lt;/a&gt; and various domains. What is eBPF?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;William&lt;/strong&gt;: eBPF is a kernel technology that enables the execution of specific code within the kernel, offering significant advantages. Firstly, operations within the kernel are high-speed, eliminating the overhead of context switching between kernel and user space. Secondly, the kernel has unrestricted access to all system resources, requiring robust security measures to ensure eBPF programs are safe. This powerful technology empowers developers to create highly efficient and secure solutions for various system tasks, particularly networking, security, and observability.&lt;/p&gt;

&lt;p&gt;Traditionally, user-space programs lacked direct access to kernel resources, relying on &lt;a href="https://phoenixnap.com/kb/system-call#:~:text=A%20system%20call%20is%20an,functionalities%20from%20the%20OS's%20kernel."&gt;system calls&lt;/a&gt; to communicate with the kernel. While providing security, this syscall boundary introduced cost overhead, especially with frequent requests like network packet processing. &lt;/p&gt;

&lt;p&gt;eBPF revolutionized this by enabling user-defined code to run within the kernel with stringent safety measures. The number of instructions an eBPF program can execute is limited, and infinite loops are prohibited to prevent resource monopolization. The bytecode verifier meticulously ensures every possible execution path can be explored to avoid unexpected behavior or malicious activity. The bytecode is also verified for &lt;a href="https://opensource.stackexchange.com/questions/6549/does-program-that-uses-ebpf-module-needs-to-be-distributed-under-gpl"&gt;GPL compliance&lt;/a&gt; by checking for specific strings in its initial bytes.&lt;/p&gt;

&lt;p&gt;These security measures make eBPF a powerful but restrictive mechanism, enabling previously unattainable capabilities. Understanding what eBPF can and cannot do is crucial, despite marketing claims that might blur these lines. While many promote eBPF as a groundbreaking solution that could eliminate the need for sidecars, the reality is more nuanced. It's crucial to understand its limitations and not be swayed by marketing hype.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: There appears to be some confusion regarding the extent of limitations associated with eBPF. If eBPF has limitations, does that imply that these limitations constrain all service meshes using eBPF?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;William&lt;/strong&gt;: The idea of an "eBPF-based service mesh" is sometimes misleading. In reality, the Envoy proxy still handles the heavy lifting, even in these eBPF-powered meshes. eBPF has limitations, especially in the network space, and can't fully replace the functionality of a traditional proxy.&lt;/p&gt;

&lt;p&gt;While eBPF has many applications, including security and performance monitoring, its most interesting potential lies in instrumenting applications. The kernel can directly measure CPU usage, function calls, and other performance metrics by residing in the kernel.&lt;/p&gt;

&lt;p&gt;However, when it comes to networking, eBPF faces significant challenges. Maintaining large amounts of state, essential for many network operations, is difficult, bordering on impossible. This challenge highlights the limitations of eBPF in entirely replacing traditional networking components like proxies.&lt;/p&gt;

&lt;p&gt;The role of eBPF in networking, particularly within service meshes, is often overstated. While it excels in certain areas, like efficient TCP packet processing and simple metrics collection, complex tasks like &lt;a href="https://blog.px.dev/ebpf-http2-tracing/"&gt;HTTP/2 parsing&lt;/a&gt;, TLS handshakes, or layer 7 routing are challenging, if not impossible, to implement purely with eBPF.&lt;/p&gt;

&lt;p&gt;Some projects attempt complex eBPF implementations for these tasks but often involve convoluted workarounds that sacrifice performance and practicality. In practice, eBPF is typically used for layer 4 (transport layer) tasks, while user-space proxies like Envoy handle more complex layer 7 (application layer) operations.&lt;/p&gt;

&lt;p&gt;Service meshes like Cilium, despite their claims of being sidecar-less, often rely on daemonset proxies to handle these complex tasks. While eliminating sidecars, this approach introduces its own set of problems. Security is compromised as TLS certificates are mixed in the proxy's memory, and operational challenges arise when the daemonset goes down, affecting seemingly random pods scheduled on that machine.&lt;/p&gt;

&lt;p&gt;Linkerd, having experienced similar issues with its &lt;a href="https://github.com/linkerd/linkerd"&gt;first version&lt;/a&gt; (Linkerd 1.x) running as a daemonset, opted for the sidecar model in subsequent versions. Sidecars provide clear operational and security boundaries, making management and troubleshooting easier.&lt;/p&gt;

&lt;p&gt;Looking ahead, eBPF can still be a valuable tool for service meshes. Linkerd, for instance, could significantly speed up raw TCP proxying by offloading tasks to the kernel. However, for complex layer seven operations, a user-space proxy remains essential.&lt;/p&gt;

&lt;p&gt;The decision to use eBPF and the choice between sidecars and daemonsets are distinct considerations, each with advantages and drawbacks. While eBPF offers powerful capabilities, it doesn't inherently dictate a specific proxy architecture. Choosing the most suitable approach requires careful evaluation of the system's requirements and trade-offs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Can you share your predictions about conflict or uncertainty concerning service meshes and sidecars for the next few years? Is there a possibility of resolving this? Should we anticipate the emergence of new groups? What are your expectations for the near and distant future?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;William&lt;/strong&gt;: While innovation in this field is valuable, relying on marketing over technical analysis holds little appeal, especially for those prioritizing tangible customer benefits.&lt;/p&gt;

&lt;p&gt;Regarding the future of service meshes, their value proposition is now well-established. The initial hype has given way to a practical understanding of their necessity, with users selecting and implementing solutions without extensive deliberation. This maturity is a positive development, shifting the focus from explaining the need for a service mesh to optimizing its usage.&lt;/p&gt;

&lt;p&gt;Functionally, service meshes converge on core features like mTLS, load balancing, and circuit breaking. However, a significant area of development, and our primary focus, is mesh expansion: integrating non-Kubernetes components into the mesh. We have a &lt;a href="https://linkerd.io/2024/02/21/announcing-linkerd-2.15/"&gt;big announcement&lt;/a&gt; regarding this in mid-February.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: That sounds intriguing. Please give us a sneak peek into what this announcement is about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;William&lt;/strong&gt;: It's Linkerd 2.15! This release is a significant step forward: it introduces the ability to run the data plane outside Kubernetes, enabling seamless TLS communication for both VM and pod workloads.&lt;/p&gt;

&lt;p&gt;The industry mirrors this direction, as evidenced by developments like the Gateway API, which converges to handle both ingress and service mesh configuration within Kubernetes. This unified approach allows consistent configuration primitives for traffic entering, transiting, and exiting the cluster.&lt;/p&gt;

&lt;p&gt;The industry will likely focus on refining details like eBPF integration or the advantages of Ambient Mesh while the fundamental value proposition of service meshes remains consistent. I'm particularly excited about how these advancements can be applied across entire organizations, starting with securing and optimizing Kubernetes environments and extending these benefits to the broader infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Shifting away from the professional side, we heard you have an interesting &lt;a href="https://twitter.com/wm/status/1584940854384685056"&gt;tattoo&lt;/a&gt;. Is it of Linkerd, or what is it about?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;William&lt;/strong&gt;: It’s just a temporary one. We handed them out at KubeCon last year as part of our swag. While everyone else gave out stickers, we thought we'd do something more extraordinary. So, we made temporary tattoos of Linky the Lobster, our Linkerd mascot.&lt;/p&gt;

&lt;p&gt;When Linkerd graduated within the CNCF, reaching the top tier of project maturity, we needed a mascot. Most mascots are cute and cuddly, like the &lt;a href="https://go.dev/blog/gopher"&gt;Go Gopher&lt;/a&gt;. We wanted something different, so we chose a blue lobster—an unusual and rare creature reflecting Linkerd's unique position in the CNCF universe.&lt;/p&gt;

&lt;p&gt;The tattoo featured Linky the Lobster crushing some sailboats, which is part of our logo. It was a fun little easter egg. If you were at KubeCon, you might have seen them. That event was in Amsterdam.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: What's next for you? Are there any side projects or new ventures you're excited about?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;William&lt;/strong&gt;: I'm devoting all my energy to Linkerd and &lt;a href="https://buoyant.io/"&gt;Buoyant&lt;/a&gt;. That takes up most of my focus. Outside of work, I'm a dad. My kids are learning the piano, so I decided to start learning, too. It's humbling to see how fast they pick it up compared to me. As an adult learner, it's a slow process. It's interesting to be in a role where I'm the student, taking lessons from a teacher who's probably a third my age and incredibly talented. It’s an excellent reminder to stay humble, especially since much of my day involves being the authority on something. It’s a nice change of pace and a bit of a reality check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: That's a good balance. It's important to remind people that doing something you could be better at is okay. As a kid, you're used to it—no expectations, no judgment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;William&lt;/strong&gt;: Exactly. However, it can be a struggle as an adult, especially as a CEO. I've taught Linkerd to hundreds of people without any panic, but playing a piano recital in front of 20 people is terrifying. It's the complete opposite.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: If people want to contact you, what's the best way?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;William&lt;/strong&gt;: You can email me at &lt;a href="mailto:william@buoyant.io"&gt;william@buoyant.io&lt;/a&gt;, find me on Linkerd Slack at slack.linkerd.io, or DM me at @WM on Twitter. I'd love to hear about your challenges and how I can help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrap up&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you enjoyed this interview and want to hear more Kubernetes stories and opinions, visit &lt;a href="https://kube.fm/"&gt;KubeFM&lt;/a&gt; and subscribe to the podcast.&lt;/li&gt;
&lt;li&gt;If you want to keep up-to-date with Kubernetes, subscribe to &lt;a href="https://learnk8s.io/learn-kubernetes-weekly"&gt;Learn Kubernetes Weekly&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;If you want to become an expert in Kubernetes, look at courses on &lt;a href="https://learnk8s.io/training"&gt;Learnk8s&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Finally, if you want to keep in touch, follow me on &lt;a href="https://www.linkedin.com/in/gulcantopcu/"&gt;Linkedin&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>servicemesh</category>
      <category>ebpf</category>
    </item>
    <item>
      <title>Clusters Are Cattle Until You Deploy Ingress</title>
      <dc:creator>Gulcan Topcu</dc:creator>
      <pubDate>Thu, 30 May 2024 13:42:36 +0000</pubDate>
      <link>https://community.ops.io/gulcantopcu/clusters-are-cattle-until-you-deploy-ingress-241g</link>
      <guid>https://community.ops.io/gulcantopcu/clusters-are-cattle-until-you-deploy-ingress-241g</guid>
      <description>&lt;p&gt;Managing repeatable infrastructure is the bedrock of efficient Kubernetes operations. While the ideal is to have easily replaceable clusters, reality often dictates a more nuanced approach. Dan Garfield, Co-founder of Codefresh, briefly captures this with the analogy: "A Kubernetes cluster is treated as disposable until you deploy ingress, and then it becomes a pet." &lt;/p&gt;

&lt;p&gt;Dan Garfield joined Bart Farrell to discuss how he manages Kubernetes clusters and how they transform from "cattle" into "pets," weaving in fascinating anecdotes about fairy tales, crypto, and snowboarding.&lt;/p&gt;

&lt;p&gt;You can watch (or listen) to this interview &lt;a href="https://kube.fm/ingress-gitops-dan"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: What are your top three must-have tools starting with a fresh Kubernetes cluster?&lt;br&gt;
&lt;strong&gt;Dan&lt;/strong&gt;: &lt;a href="https://argo-cd.readthedocs.io/en/stable/"&gt;Argo CD&lt;/a&gt; is the first tool I install. For AWS, I'd add &lt;a href="https://karpenter.sh/"&gt;Karpenter&lt;/a&gt; to manage costs. I'd also use &lt;a href="https://longhorn.io/"&gt;Longhorn&lt;/a&gt; for on-prem storage, though I'd need ingress for it. Depending on the situation, I install Argo CD first and then one of the other two.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Many of our recent podcast guests have highlighted Argo or &lt;a href="https://fluxcd.io/"&gt;Flux&lt;/a&gt;, emphasizing their significance in the &lt;a href="https://www.gitops.tech/"&gt;GitOps&lt;/a&gt; domain. Why do you think these tools are considered indispensable?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: The entire deployment workflow for Kubernetes revolves around Argo CD. When I set up a cluster, some might default to using &lt;code&gt;kubectl apply&lt;/code&gt;, or if they're using &lt;a href="https://www.terraform.io/"&gt;Terraform&lt;/a&gt;, they might opt for the &lt;a href="https://registry.terraform.io/providers/hashicorp/helm/latest/docs"&gt;Helm provider&lt;/a&gt; to install various Helm charts. However, with Argo CD, I have precise control over deployment processes. &lt;/p&gt;

&lt;p&gt;Typically, the bootstrap pattern involves using Terraform to set up the cluster and Helm provider to install Argo CD and predefined repositories. From there, Argo CD takes care of the rest.&lt;/p&gt;
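&lt;p&gt;As a rough sketch, that bootstrap pattern might look like the following Terraform snippet, using the Helm provider and the community argo-helm chart (the release name and values here are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Terraform installs Argo CD once via the Helm provider;
# after this, Argo CD reconciles everything else from Git.
resource "helm_release" "argocd" {
  name             = "argocd"
  repository       = "https://argoproj.github.io/argo-helm"
  chart            = "argo-cd"
  namespace        = "argocd"
  create_namespace = true
}
&lt;/code&gt;&lt;/pre&gt;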

&lt;p&gt;For those who can't see it, I have my Kubernetes cluster running Argo CD displayed on the screen behind me. I utilize &lt;a href="https://argocd-autopilot.readthedocs.io/en/stable/"&gt;Argo CD autopilot&lt;/a&gt;, which streamlines repository setup. Last year, when my system was compromised, Argo CD autopilot swiftly restored everything. It's incredibly convenient. Moreover, when debugging, the ability to quickly toggle sync, reset applications, and access logs through the UI is invaluable. Argo CD is, without a doubt, my go-to tool for Kubernetes. Admittedly, I'm biased as an Argo maintainer, but it's hard to argue with its effectiveness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Our numerous podcast discussions with seasoned professionals show that GitOps has been a recurring theme in about 90% of our conversations. Almost every guest we've interviewed has emphasized its importance, often mentioning it as their primary tool alongside other essentials like &lt;a href="https://cert-manager.io/"&gt;cert manager&lt;/a&gt;, &lt;a href="https://kyverno.io/"&gt;Kyverno&lt;/a&gt;, or &lt;a href="https://www.openpolicyagent.org/"&gt;OPA&lt;/a&gt;, depending on their preferences. &lt;/p&gt;

&lt;p&gt;Could you introduce yourself to those unfamiliar with you? Tell us your background, work, and where you're currently employed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: I'm Dan Garfield, the co-founder and chief open-source officer at Codefresh. As Argo maintainers, we're deeply involved in shaping the GitOps landscape. I've played a key role in creating the GitOps standard, establishing the GitOps working group, and spearheading the &lt;a href="https://opengitops.dev/"&gt;OpenGitOps&lt;/a&gt; project.&lt;/p&gt;

&lt;p&gt;Our journey began seven years ago when we launched &lt;a href="https://codefresh.io/"&gt;Codefresh&lt;/a&gt; to enhance software delivery in the cloud-native ecosystem, primarily focusing on Kubernetes. Alongside my responsibilities at Codefresh, I actively contribute to &lt;a href="https://github.com/kubernetes/sig-security"&gt;SIG Security&lt;/a&gt; within the Kubernetes community and oversee community-driven events like &lt;a href="https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/co-located-events/argocon/"&gt;ArgoCon&lt;/a&gt;. Outside of work, I reside in Salt Lake City, where I indulge in my passion for snowboarding. Oh, and I'm a proud father of four, eagerly awaiting the arrival of our fifth child.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: It’s a fantastic journey. We'll have to catch up during &lt;a href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/"&gt;KubeCon in Salt Lake City&lt;/a&gt; later this year. Before delving into your entrepreneurial venture, could you share how you entered Cloud Native?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: My journey into the tech world began early on as a programmer. However, I found myself gravitating more towards the business side, where I discovered my knack for marketing. My pivotal experience was leading enterprise marketing at &lt;a href="https://www.atlassian.com/"&gt;Atlassian&lt;/a&gt; during the release of &lt;a href="https://www.atlassian.com/enterprise/data-center"&gt;Data Center&lt;/a&gt;, Atlassian's clustered tool version. Initially, it didn't garner much attention internally, but it soon became a game-changer, driving significant revenue for the company. Witnessing this transformation, including Atlassian's public offering, was exhilarating, although my direct contribution was modest as I spent less than two years there.&lt;/p&gt;

&lt;p&gt;I noticed a significant shift toward containerization, which sparked my interest in taking on a new challenge. Conversations with friends who were starting container-focused ventures captivated me. Then, &lt;a href="https://www.linkedin.com/in/razielt/"&gt;Raziel&lt;/a&gt;, the founder of Codefresh, reached out, sharing his vision for container-driven software development. His perspective resonated deeply, prompting me to join the venture.&lt;/p&gt;

&lt;p&gt;Codefresh initially prioritized building robust CI tools, recognizing that effective CD hinges on solid CI practices, which needed improvement in many organizations at the time (and possibly still do). As we expanded, we delved into CD and explored ways to leverage Kubernetes insights.&lt;/p&gt;

&lt;p&gt;Kubernetes had yet to emerge as the dominant force when we launched this journey. We evaluated competitors like &lt;a href="https://www.rancher.com/"&gt;Rancher&lt;/a&gt;, &lt;a href="https://www.redhat.com/en/technologies/cloud-computing/openshift"&gt;OpenShift&lt;/a&gt;, &lt;a href="https://kube.fm/ingress-gitops-dan#:~:text=.%20And%20maybe-,Mesosphere,-is%20going%20to"&gt;Mesosphere&lt;/a&gt;, and &lt;a href="https://docs.docker.com/engine/swarm/"&gt;Docker Swarm&lt;/a&gt;. However, after thorough analysis, Kubernetes emerged as the frontrunner, prompting us to bet on its potential.&lt;/p&gt;

&lt;p&gt;Our decision proved visionary as other platforms gradually transitioned towards Kubernetes. Amazon's launch of &lt;a href="https://aws.amazon.com/eks/"&gt;EKS&lt;/a&gt; validated our foresight. This strategic alignment with Kubernetes paved the way for our deep dive into GitOps and Argo CD, driving the project's growth within the &lt;a href="https://www.cncf.io/"&gt;CNCF&lt;/a&gt; and its eventual graduation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: It's impressive how much you've accomplished in such a short timeframe, especially while balancing family life. With the industry evolving rapidly, how do you keep up with the cloud-native scene as a maintainer and a co-founder?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: Indeed, staying updated involves reading blogs, scrolling through Twitter, and tuning into podcasts. However, I've found that my most insightful learnings come from direct conversations with individuals. For instance, I've assisted the community with Argo implementations, not as a sales pitch but to help gather insights genuinely. Interacting with Codefresh users and engaging with the broader community provides invaluable perspectives on adoption challenges and user needs.&lt;/p&gt;

&lt;p&gt;Oddly enough, sometimes, the best way to learn is by putting forth incorrect opinions or questions. Recently, while wrestling with AI project complexities, I pondered aloud whether all Docker images with AI models would inevitably be bulky due to &lt;a href="https://pytorch.org/"&gt;PyTorch&lt;/a&gt; dependencies. To my surprise, this sparked many helpful responses, offering insights into optimizing image sizes. Being willing to be wrong opens up avenues for rapid learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: That vulnerability can indeed produce rich learning experiences. It's a valuable practice. Shifting gears slightly, if you could offer one piece of career advice to your younger self, what would it be?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: Firstly, embrace a mindset of rapid learning and humility. Be more open to being wrong and detach ego from ideas. While standing firm on important matters is essential, recognize that failure and adaptation are part of the journey. Like a stone rolling down a mountain, each collision smooths out the sharp edges, leading to growth.&lt;/p&gt;

&lt;p&gt;Secondly, prioritize hiring decisions. The people you bring into your business shape its trajectory more than any other factor. A wrong hire can have far-reaching consequences beyond their salary. Despite some missteps, I've been fortunate to work with exceptional individuals who contribute immensely to our success. When considering a job opportunity, I always emphasize the people's quality, the mission's significance, and fair compensation. Prioritizing in this order ensures fulfillment and satisfaction in your career journey.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: That's insightful advice, especially about hiring. Surrounding yourself with talented individuals can make all the difference in navigating business challenges. Now, shifting gears to your recent tweet about Kubernetes and Ingress, who was the intended audience for that &lt;a href="https://twitter.com/todaywasawesome/status/1701625561536454879"&gt;tweet&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: Honestly, it was more of a reflection for myself, perhaps shouted into the void. I was weighing the significance of deploying Ingress within Kubernetes. In engineering, there's a saying that "the problem is always DNS," which suggests that your cluster becomes more tangible once you configure DNS settings. Similarly, setting up Ingress signifies a shift in how you perceive and manage your cluster. Without Ingress, it might be considered disposable, like a development environment. However, once Ingress is in place, your cluster hosts services that require more attention and care.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: For those unfamiliar with the "&lt;a href="https://www.hava.io/blog/cattle-vs-pets-devops-explained"&gt;cattle versus pets&lt;/a&gt;" analogy in Kubernetes, could you elaborate on its relevance, particularly in the context of Ingress?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: While potentially controversial, the "cattle versus pets" analogy illustrates a fundamental concept in managing infrastructure. In this analogy, cattle represent interchangeable and disposable resources, much like livestock in a ranching operation. Conversely, pets are unique, loved entities requiring personalized care.&lt;/p&gt;

&lt;p&gt;In Kubernetes, deploying resources as "cattle" means treating them as replaceable, identical units. However, Ingress introduces a shift towards a "pet" model, where individual services become distinct and valuable entities. Just as you wouldn't name every cow on a farm, you typically wouldn't concern yourself with the specific details of each interchangeable resource. But once you start deploying services accessible via Ingress, each service becomes unique and worthy of individual attention, akin to caring for a pet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: It seems the "cattle versus pets" analogy is stirring some controversy among vegans, which is understandable given its context. How does this analogy relate to Kubernetes and Ingress?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: In software, the analogy helps distinguish between disposable, interchangeable components (cattle) and unique, loved entities (pets). For instance, in my Kubernetes cluster, the individual nodes are like cattle—replaceable and without specific significance. If one node malfunctions, I can easily swap it out without concern.&lt;/p&gt;

&lt;p&gt;However, once I deploy Ingress and start hosting services, the cluster takes on a different role. While the individual nodes remain disposable, the cluster becomes more akin to a pet. I care about its state, its configuration, and its uptime. Suddenly, I'm monitoring metrics and ensuring its well-being, similar to caring for a pet's health.&lt;/p&gt;

&lt;p&gt;So, the analogy underscores the shift in perception and care that occurs when transitioning from managing generic infrastructure to hosting meaningful services accessible via Ingress.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: That's a fascinating perspective. How do Kubernetes and Ingress relate to all of this?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: The ingress in Kubernetes is a central resource for managing incoming traffic to the cluster and routing it to different services. However, unlike other resources in Kubernetes, such as those managed by Argo CD, the ingress is often shared among multiple applications. Each application may have its own deployment rules, allowing for granular control over updates and configurations. For example, one application might only update when manually triggered, while another automatically updates when changes are detected.&lt;/p&gt;

&lt;p&gt;The challenge arises because updating Ingress impacts multiple applications simultaneously. Through this centralized routing mechanism, you're essentially juggling the needs of various applications. This complexity underscores the importance of managing the cluster effectively, as each change to Ingress affects the entire ecosystem of applications.&lt;/p&gt;
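&lt;p&gt;A minimal example of why Ingress is shared: a single resource can route to several applications, so one change touches all of them at once (hostnames and service names below are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: shared-ingress
spec:
  rules:
  - host: example.com
    http:
      paths:
      - path: /shop        # owned by one team
        pathType: Prefix
        backend:
          service:
            name: shop
            port:
              number: 80
      - path: /blog        # owned by another team
        pathType: Prefix
        backend:
          service:
            name: blog
            port:
              number: 80
&lt;/code&gt;&lt;/pre&gt;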

&lt;p&gt;The Argo CD community is discussing introducing delegated server-side field permissions. This feature would allow one application to modify components of another, easing the burden of managing shared resources like Ingress. However, it's still under debate, and alternative solutions may emerge. Other tools, like &lt;a href="https://projectcontour.io/"&gt;Contour&lt;/a&gt;, offer a different approach by treating each route as a separate custom resource, allowing applications to manage their routing independently.&lt;/p&gt;
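&lt;p&gt;For reference, Contour's model splits routing across HTTPProxy resources: a root proxy owns the hostname and delegates path prefixes to child proxies that each team manages independently (names and namespaces here are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Root proxy: owns the virtual host and delegates /blog.
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: root
  namespace: ingress
spec:
  virtualhost:
    fqdn: example.com
  includes:
  - name: blog
    namespace: blog-team
    conditions:
    - prefix: /blog
---
# Child proxy: the blog team changes this without touching the root.
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: blog
  namespace: blog-team
spec:
  routes:
  - services:
    - name: blog
      port: 80
&lt;/code&gt;&lt;/pre&gt;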

&lt;p&gt;Ultimately, deploying the ingress marks a shift in the cluster's dynamics, requiring considerations such as DNS settings and centralized routing configurations. As a result, the cluster becomes more specialized and less disposable as its configuration becomes bespoke to accommodate the routing needs of various applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Any recommendations for those who aim to keep their infrastructure reproducible while needing Ingress?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: One approach is abstraction and leveraging wildcards. While you can technically deploy an Ingress without anything external pointing to it, I prefer the concept of self-updating components. Tools like &lt;a href="https://www.crossplane.io/"&gt;Crossplane&lt;/a&gt; or &lt;a href="https://cloud.google.com/config-connector/docs/overview"&gt;Google Cloud's Config Connector&lt;/a&gt; allow you to represent non-Kubernetes resources as Kubernetes objects. Incorporating such tools into your cluster bootstrap process ensures the dynamic creation of necessary components.&lt;/p&gt;
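&lt;p&gt;To illustrate the idea, with Crossplane a cloud resource becomes just another Kubernetes object that the cluster bootstrap can create (this sketch assumes the Upbound AWS S3 provider is installed; the exact API group and fields depend on the provider version):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A managed S3 bucket, declared like any other Kubernetes resource.
apiVersion: s3.aws.upbound.io/v1beta1
kind: Bucket
metadata:
  name: app-assets
spec:
  forProvider:
    region: us-east-1
&lt;/code&gt;&lt;/pre&gt;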

&lt;p&gt;However, there's a caveat. Even with reproducible clusters, external components like DNS settings may not be reproducible. Updating name servers remains a manual task. It's a tricky aspect of operations with no perfect solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: How do GitOps and Argo CD fit into solving this challenge?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: GitOps and Argo CD play a crucial role in managing complex infrastructure, especially with sensitive data. The key lies in representing all infrastructure resources, including secrets and certificates, as Kubernetes objects. This approach enables Argo CD to track and reconcile them, ensuring that the desired state defined in Git reflects accurately in your cluster.&lt;/p&gt;

&lt;p&gt;Tools like Crossplane, &lt;a href="https://www.vcluster.com/"&gt;vCluster&lt;/a&gt; (for managing multiple clusters), or &lt;a href="https://cluster-api.sigs.k8s.io/"&gt;Cluster API&lt;/a&gt; (for provisioning additional clusters) can extend this approach to handle various infrastructure resources beyond Kubernetes. Essentially, Git serves as the single source of truth for your entire infrastructure, with Argo CD functioning as the engine to enforce that truth.&lt;/p&gt;
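&lt;p&gt;The unit Argo CD reconciles is the Application resource; a minimal example pointing at a Git repository might look like this (the repository URL and paths are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-config.git
    targetRevision: main
    path: apps/my-app
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:       # Argo CD keeps the cluster matching Git
      prune: true
      selfHeal: true
&lt;/code&gt;&lt;/pre&gt;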

&lt;p&gt;A common issue with Terraform is that its state can get corrupted easily because it must constantly monitor changes. Crossplane often uses Terraform under the hood. The problem is not with Terraform's primitives but with the data store and its maintenance. Crossplane ensures the data store remains uncorrupted, accurately reflecting the current state. If changes occur, they appear as out of sync in Argo CD.&lt;/p&gt;

&lt;p&gt;You can then define policies for reconciliation and updates, guiding the controller on the next steps. This approach is crucial for managing infrastructure effectively. Using etcd as your data store is an excellent pattern and likely the future of infrastructure management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: What happens when the challenges of managing Kubernetes infrastructure extend beyond handling ingress traffic to managing sensitive information like state, secrets, and certificates? This added complexity could lead to a "pet" cluster scenario. Do you think backup and recovery tools like &lt;a href="https://velero.io/"&gt;Velero&lt;/a&gt; would be easier to use without these additional challenges?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: I'm not familiar with Velero. Can you tell me about it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Velero is a tool focused on backing up and restoring Kubernetes resources. Since you mentioned Argo CD and custom resources earlier, I'm curious about your approach to backing up persistent volumes. How did you manage disaster recovery in your home lab when everything went haywire? &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: I've used Longhorn for volume restoration, and clear protocols were in place. I'm currently exploring Velero, which looks like a promising tool for data migration. &lt;/p&gt;

&lt;p&gt;Managing data involves complexities like caring for a pet, requiring careful handling and migration. Many people struggle to manage stateful workloads in Kubernetes. Fortunately, most of my stateful workloads in Kubernetes can rebuild their databases if data is lost. Therefore, data loss is manageable for me. Most of the elements I work with are replicable. Any items needing persistence between sessions are stored in Git or a versioned, immutable secret repository.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: It's worth noting, especially considering what happened with your home lab. Should small startups prioritize treating their clusters like cattle, or is ClickOps sufficient?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: It depends on the use cases. vCluster, a project I'm fond of, is particularly well-suited for creating disposable development clusters, providing developers with isolated sandboxes for testing and experimentation. It allows deploying a virtualized cluster on an existing Kubernetes setup, which saves significantly on ingress costs, especially on platforms like AWS, where you can consolidate ingress into one.&lt;/p&gt;

&lt;p&gt;Another example is using Argo CD's application sets to create full-stack environments for each pull request in a Git repository. These environments, which include a virtual cluster, are unique to each pull request but remain completely disposable and easily recreated, much like cattle.&lt;/p&gt;
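&lt;p&gt;A sketch of that pattern using the ApplicationSet pull request generator, which stamps out one application per open pull request (the GitHub organization, repository, and paths are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: pr-previews
  namespace: argocd
spec:
  generators:
  - pullRequest:
      github:
        owner: example-org
        repo: example-app
      requeueAfterSeconds: 300
  template:
    metadata:
      name: 'preview-{{number}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/example-app.git
        targetRevision: '{{head_sha}}'
        path: deploy
      destination:
        server: https://kubernetes.default.svc
        namespace: 'preview-{{number}}'
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When a pull request closes, its generated application goes away with it, which is what keeps these environments disposable.&lt;/p&gt;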

&lt;p&gt;However, managing ingress for disposable clusters can be challenging. When deployed into vClusters, ingress needs custom configuration, which requires separate tracking and maintenance. Despite this, it's still beneficial to prioritize treating infrastructure as disposable. For example, while my on-site Kubernetes cluster is a "pet" that requires careful maintenance, its nodes are considered "cattle" that can be replaced or reconfigured without disrupting overall operations. This abstraction is a core principle of Kubernetes and allows for greater flexibility and resilience.&lt;/p&gt;

&lt;p&gt;By abstracting clusters away from custom configurations and focusing on reproducibility, you can treat them more like cattle, even if they have some pet-like qualities due to ingress deployment and DNS configurations. This commoditization of clusters simplifies management and enables greater scalability. The more you abstract and standardize your infrastructure, the smoother your operations will become. And to be clear, this analogy has nothing to do with dietary choices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: If you could rewind time and change anything, what scenario would you create to avoid writing that tweet?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: We've been discussing a feature in Argo CD that allows for delegated field permissions to happen server-side. It addresses a problem inherent in Kubernetes architecture, particularly regarding ingress. The current setup doesn't allow for external delegation of its components, even though many users operate it that way. If I could make changes, I might have split ingress into an additional resource, including routes as a separate definition that users could manage independently.&lt;/p&gt;

&lt;p&gt;Exploring other scenarios where delegated field permissions would be helpful is crucial. Ingress is the most obvious example, highlighting an area for potential improvement. Creating separate routes and resources could solve this issue without altering Argo CD. This approach, similar to Contour's, could be a promising solution. Contour's separate resource strategy demonstrates learning from Ingress and making improvements. We should consider adopting tools like Contour or other service mesh ingress providers, as several compelling options are available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: If you had to build a cluster from scratch today, how would you address these issues whenever possible?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: Sometimes you just have to accept the challenge and not try to work around it. Setting up ingress and configuring DNS for a single cluster might not be a big deal, but it's worth considering a re-architecture if you're doing it on a large scale, like 250,000 times. For instance, with Codefresh, many users opt for our hybrid setup. They deploy our GitOps agent, based on Argo CD, on their cluster, which then connects to our control plane. &lt;/p&gt;

&lt;p&gt;One of the perks we offer is a hosted ingress. Instead of setting up ingresses for each of their 5000 Argo CD instances, users can leverage our hosted ingress, saving money and configuration headaches. Consider alternatives like a tunneling system instead of custom ingress setups, depending on your use case. A hosted ingress can be a game-changer for large-scale distributed setups like multiple Argo CD instances, saving costs and simplifying configurations. Ultimately, re-architecting is always an option tailored to what works best for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: We're nearing the end of the podcast and want to touch on a closing question, which we are looking at from a few different angles. How do you deal with the anxiety of adopting a new tool or practice, only to find out later that it might be wrong?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: I've seen this dynamic play out. Sometimes, organizations invest heavily in a tool, only to realize a few years later that there are better fits. Take the example of a company adopting Argo Workflows for CI/CD and deployment, only to discover that Argo CD would have been a better fit for most of their use cases. However, the effort spent on these transitions is rarely wasted. In their case, the journey through Argo Workflows paved the way for a smoother transition to Argo CD. Sometimes, taking a wrong direction first is what gets you to the correct destination faster.&lt;/p&gt;

&lt;p&gt;You can't always foresee the ideal solution from where you are, and experimenting with different tools is part of the learning process. It's essential not to dwell on mistakes but to learn from them and move forward. After all, even if a tool ultimately proves to be the wrong choice, it often still brings value. The key is recognizing when a change is needed and adapting accordingly. Mistakes only become fatal if we fail to acknowledge and learn from them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: We stumbled upon your blog, &lt;a href="https://todaywasawesome.com/"&gt;Today Was Awesome&lt;/a&gt;, which hasn't seen an update in a while. You wrote a &lt;a href="https://todaywasawesome.com/why-a-bitcoin-crash-could-be-great-for-bitcoin/"&gt;post&lt;/a&gt; about Bitcoin, priced at around $450 in 2015. Are you a crypto millionaire now?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: Not quite! Crypto is a fascinating topic, often sparking wild debates. While there's no shortage of scams in the crypto world, there's also genuine innovation happening. I dabbled in Bitcoin early on and even mined a bit to understand its potential use cases better. One notable experience was mentoring at &lt;a href="https://hackthenorth.com/"&gt;Hack the North&lt;/a&gt;, a massive hackathon where numerous projects leveraged Ethereum. I strategically sold my Bitcoin for Ethereum, which turned out well. However, I'm still waiting on those Lambos—I'm not quite at millionaire status yet!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Your blog covers many topics, including one post titled "&lt;a href="https://todaywasawesome.com/what-are-we-really-supposed-to-learn-from-fairy-tales/"&gt;What are we really supposed to learn from fairy tales&lt;/a&gt;." How did you decide on such diverse content?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: I can't recall the exact inspiration, but my wife and I often joke about how outdated the moral lessons in fairy tales feel. Their relevance in today's world seemed like an interesting angle to explore.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: What's next for you? More fairy tales, moon-bound Lamborghinis, or snowboarding adventures? Also, let's discuss your recent tweet about making your own bacon. How did that start?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: Ah, yes, making bacon! It's surprisingly simple. First, you get pork belly and cure it in the fridge for seven to ten days. Then, you smoke it for a couple of hours.&lt;/p&gt;

&lt;p&gt;My primary motivation was to avoid the nitrates found in store-bought bacon, which have been linked to health issues. Homemade bacon tastes better, is of higher quality, and is cheaper. My freezer now overflows with homemade bacon, which makes for a unique and well-received gift. People love the taste; overall, it's been a rewarding and delicious effort!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Regardless of dietary choices, considering where your food comes from and being involved in the process—whether by growing your food or making it yourself and turning it into a gift for others—creates a different, enriching experience. What's next for you?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: This year, my focus is on environment management and promotion. In the Kubernetes world, we often think about applications, clusters, and instances of Argo CD to manage everything. We're working on a paradigm shift: we think about products instead of applications. In our context, a product is an application in every environment in which it exists. Hence, if you deploy a development application, move it to staging, and finally to production, you're deploying the same application with variations three times. That's what we call a product. We're shifting from thinking about where an application lives to considering its entire life cycle. Instead of focusing on clusters, we think about environments because an environment might have many clusters.&lt;/p&gt;

&lt;p&gt;For instance, retail companies like Starbucks, Chick-fil-A, and Pizza Hut often have Kubernetes clusters on-site. Deploying to US West might mean deploying to 1,300 different clusters and 1,300 different Argo CD instances. We abstract all that complexity by grouping them into the environments bucket. We focus on helping people scale up and build their workflow using environments and establishing these relationships. The feedback has been incredible; people are amazed by what we’re demonstrating.&lt;/p&gt;

&lt;p&gt;We're showcasing this at ArgoCon next month in Paris. After that, I plan to do some snowboarding and then make it back in time for the birth of my fifth child. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bart&lt;/strong&gt;: That's a big plan. 2024 is packed for you. If people want to contact you, what's the best way to do it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dan&lt;/strong&gt;: Twitter is probably the best. You can find me at @todaywasawesome. If you visit my blog and leave comments, I won't see them, as it's more of an archive now. I keep it around because I worked on it ten years ago and occasionally reference something I wrote. &lt;/p&gt;

&lt;p&gt;You can also reach out on LinkedIn, GitHub, or Slack. I respond slower on Slack, but I do get to it eventually.&lt;/p&gt;

&lt;h2&gt;&lt;strong&gt;Wrap up&lt;/strong&gt;&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;If you enjoyed this interview and want to hear more Kubernetes stories and opinions, visit &lt;a href="https://kube.fm"&gt;KubeFM&lt;/a&gt; and subscribe to the podcast.&lt;/li&gt;
&lt;li&gt;If you want to keep up-to-date with Kubernetes, subscribe to &lt;a href="https://learnk8s.io/learn-kubernetes-weekly"&gt;Learn Kubernetes Weekly&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;If you want to become an expert in Kubernetes, look at courses on &lt;a href="https://learnk8s.io/training"&gt;Learnk8s&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Finally, if you want to keep in touch, follow me on &lt;a href="https://www.linkedin.com/in/gulcantopcu/"&gt;Linkedin&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>gitops</category>
      <category>argo</category>
    </item>
    <item>
      <title>Upgrading Hundreds of Kubernetes Clusters</title>
      <dc:creator>Gulcan Topcu</dc:creator>
      <pubDate>Wed, 03 Apr 2024 07:15:53 +0000</pubDate>
      <link>https://community.ops.io/gulcantopcu/upgrading-hundreds-of-kubernetes-clusters-117j</link>
      <guid>https://community.ops.io/gulcantopcu/upgrading-hundreds-of-kubernetes-clusters-117j</guid>
      <description>&lt;p&gt;Automating the upgrade process for hundreds of Kubernetes clusters is a formidable task, but it's one that Pierre Mavro, the co-founder and CTO at Qovery, is well-equipped to handle. With his extensive experience and a dedicated team of engineers, they have successfully automated the upgrade process for both public and private clouds.&lt;/p&gt;

&lt;p&gt;Bart Farrell sat down with Pierre to understand how he did it without breaking the bank.&lt;/p&gt;

&lt;p&gt;You can watch (or listen) to this interview &lt;a href="https://kube.fm/upgrading-100s-clusters-pierre"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: If you installed three tools on a new Kubernetes cluster, which tools would they be and why?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: The first tool I recommend is &lt;a href="https://k9scli.io/"&gt;K9s&lt;/a&gt;. It's not just a time-saver but a productivity booster. With its intuitive interface, you can speed up all the usual kubectl commands, access logs, edit resources and configurations, and more. It's like having a personal assistant for your cluster management tasks.&lt;/p&gt;

&lt;p&gt;The second one is a combination of tools: &lt;a href="https://github.com/kubernetes-sigs/external-dns"&gt;External DNS&lt;/a&gt;, &lt;a href="https://cert-manager.io/"&gt;cert-manager&lt;/a&gt;, and &lt;a href="https://github.com/kubernetes/ingress-nginx"&gt;NGINX ingress&lt;/a&gt;. Using these as a stack, you can quickly deploy an application and expose it through a DNS name with a TLS certificate, all via simple annotations. When I first discovered External DNS, I was amazed at its quality.&lt;/p&gt;
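&lt;p&gt;As a sketch of that annotation-driven workflow (the hostname, issuer, and service names below are hypothetical, not from the interview), a single Ingress manifest can wire all three tools together:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    # cert-manager issues a TLS certificate for the host below
    cert-manager.io/cluster-issuer: letsencrypt-prod
    # External DNS creates the DNS record pointing at the ingress
    external-dns.alpha.kubernetes.io/hostname: app.example.com
spec:
  ingressClassName: nginx
  tls:
    - hosts: [app.example.com]
      secretName: my-app-tls
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
&lt;/code&gt;&lt;/pre&gt;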

&lt;p&gt;The last one is mostly an observability stack with &lt;a href="https://github.com/prometheus-operator/kube-prometheus"&gt;Prometheus&lt;/a&gt;, &lt;a href="https://github.com/kubernetes-sigs/metrics-server"&gt;Metrics Server&lt;/a&gt;, and &lt;a href="https://github.com/kubernetes-sigs/prometheus-adapter"&gt;Prometheus Adapter&lt;/a&gt; to get excellent insight into what is happening on the cluster. You can reuse the same stack for autoscaling by repurposing the data collected for monitoring.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Tell us more about your background and how you progressed through your career.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: My journey in the tech industry has been diverse and enriching. I've had the privilege of working for renowned companies like Red Hat and Criteo, where I honed my skills in cloud deployment. Today, as the co-founder and CTO of Qovery, I bring a wealth of experience in distributed systems, particularly for NoSQL databases, and a deep understanding of Kubernetes, which I began exploring in 2016 with version 1.2.&lt;/p&gt;

&lt;p&gt;To provide some context to Qovery's services, we offer a self-service developer platform that allows code deployment on Kubernetes without requiring expertise in infrastructure. We keep our platform cloud-agnostic and place Kubernetes at the core to ensure our deployments are portable across different cloud providers.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: How was your journey into Kubernetes and the cloud-native world, given the changes since 2016?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: Learning Kubernetes was quite a journey. The landscape was far less developed back then, with most Kubernetes components still in alpha. In 2016, I was also juggling my job at Criteo and my own company.&lt;/p&gt;

&lt;p&gt;When it came to deployment, I had several options, and I chose the hard way: deploying Kubernetes on bare metal nodes using &lt;a href="https://github.com/kubernetes-sigs/kubespray"&gt;KubeSpray&lt;/a&gt;. Troubleshooting bare metal Kubernetes deployments honed my skills in pinpointing issues. This hands-on experience provided a deep understanding of how each component, like the &lt;a href="https://kubernetes.io/docs/concepts/overview/components/#control-plane-components:~:text=a%20Kubernetes%20cluster-,Control%20Plane%20Components,-The%20control%20plane%27s"&gt;Control Plane&lt;/a&gt;, &lt;a href="https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=kubelet-,kubelet,-Synopsis"&gt;kubelet&lt;/a&gt;, &lt;a href="https://kubernetes.io/docs/setup/production-environment/container-runtimes/#:~:text=Container%20Runtimes-,Container%20Runtimes,-Note%3A%20Dockershim"&gt;Container Runtime&lt;/a&gt;, and &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/#:~:text=Kubernetes%20Scheduler-,Kubernetes%20Scheduler,-In%20Kubernetes%2C"&gt;scheduler&lt;/a&gt;, interacts to orchestrate containers. &lt;/p&gt;

&lt;p&gt;Another resource that I found pretty helpful was "&lt;a href="https://github.com/kelseyhightower/kubernetes-the-hard-way"&gt;Kubernetes the Hard Way&lt;/a&gt;" by Kelsey Hightower despite its complexity.&lt;/p&gt;

&lt;p&gt;Lastly, I got help from the official &lt;a href="https://kubernetes.io/docs/home/"&gt;Kubernetes docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Looking back, is there anything you would do differently or advice you would give to your past self?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: Not really. Looking back, KubeSpray was the best option at the time, and there were no significant changes I would make to the decision.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: You've worked on various projects involving bare metal and private clouds. Can you share more about your Kubernetes experience, such as the scale of clusters and nodes?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: At Criteo, I led a NoSQL team supporting several million requests per second on a massive 4,500-node bare-metal cluster. Managing this infrastructure - particularly node failures and data consistency across stateful databases like Cassandra, Couchbase, and Elasticsearch - was a constant challenge.&lt;/p&gt;

&lt;p&gt;While at Criteo, I also had a personal project where I built a smaller 10-node bare-metal cluster.&lt;br&gt;
This experience with bare metal management solidified my belief in the benefits of Kubernetes, which I later implemented at Criteo.&lt;/p&gt;

&lt;p&gt;When we adopted Kubernetes at Criteo, we encountered initial hurdles. In 2018, Kubernetes operators were still new, and there was internal competition from &lt;a href="https://mesos.apache.org/"&gt;Mesos&lt;/a&gt;. We addressed these challenges by validating Kubernetes performance for our specific needs and building custom Chef recipes, &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#:~:text=StatefulSets-,StatefulSets,-StatefulSet%20is%20the"&gt;StatefulSet&lt;/a&gt; hooks, and startup scripts.&lt;/p&gt;

&lt;p&gt;Migrating to Kubernetes took eight months of dedicated effort. It was a complex process, but it was worth it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: As you’ve mentioned, Kubernetes had competitors in 2018 and continues to do so today. Despite the tooling's immaturity, you led a team to adopt Kubernetes for stateful workloads, which was unconventional. How did you guide your team through this transition?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: We had large instances - each with 50 to 100 CPUs and between 256 and 500 gigabytes of RAM.&lt;/p&gt;

&lt;p&gt;We had multiple Cassandra clusters on a single Kubernetes cluster, and each Kubernetes node was dedicated to a single Cassandra node. We chose this bare metal setup to optimize disk access with SSD or NVMe.&lt;/p&gt;

&lt;p&gt;Running these stateful &lt;a href="https://kubernetes.io/docs/concepts/workloads/#:~:text=Workloads-,Workloads,-A%20workload%20is"&gt;workloads&lt;/a&gt; wasn't just a matter of starting them up. We had to handle them carefully because stateful workloads like Elasticsearch and Cassandra must keep their data safe even if the machine they're running on fails.&lt;/p&gt;

&lt;p&gt;Kubernetes helped us detect issues with these apps using features like &lt;a href="https://kubernetes.io/docs/tasks/run-application/configure-pdb/"&gt;Pod Disruption Budgets (PDBs)&lt;/a&gt; that limit how often pods can be disrupted, StatefulSets that provide consistent ordering of execution and stable storage, and automated &lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#:~:text=and%20Startup%20Probes-,Configure%20Liveness%2C%20Readiness%20and%20Startup%20Probes,-This%20page%20shows"&gt;probes&lt;/a&gt; that trigger actions and alerts when something goes wrong.&lt;/p&gt;
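&lt;p&gt;For instance, a Pod Disruption Budget for a Cassandra ring might look like the minimal sketch below (the label and budget are illustrative, not Criteo's actual configuration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cassandra-pdb
spec:
  # Allow at most one Cassandra pod to be voluntarily disrupted at a time,
  # so the ring keeps quorum during node drains and upgrades.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: cassandra
&lt;/code&gt;&lt;/pre&gt;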

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Your experiences helped me better understand your blog post, The Cost of Upgrading Hundreds of Kubernetes Clusters. After managing large infrastructures, you founded Qovery. What drove you to take this step as an engineer?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: Kubernetes has become a standard, but managing it can be a headache for developers. Cloud providers offer a basic Kubernetes setup, but it often lacks the features developers need to get started and deploy applications quickly. Managing the cluster and nodes and keeping them up-to-date is time-consuming, and developers must spend even more time adding extra tools and configurations on top of the basic setup and then keeping everything updated.&lt;/p&gt;

&lt;p&gt;To tackle these challenges, I founded Qovery. &lt;/p&gt;

&lt;p&gt;Qovery provides two critical solutions. First, it offers a unified, user-friendly stack across cloud providers, simplifying Kubernetes deployment and management complexity. Second, it enables developers to deploy code without hassle.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Managing clusters can have various interpretations. The term can be broad. How do you define cluster management at Qovery in the context of upgrading and recovery?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: Yes, that's right. At Qovery, we understand the complexity of managing Kubernetes for customers. That's why we automate and simplify the entire process.&lt;/p&gt;

&lt;p&gt;We automatically notify you about upcoming &lt;a href="https://kubernetes.io/releases/"&gt;Kubernetes updates&lt;/a&gt; and handle the upgrade process on schedule, eliminating the need for manual intervention.&lt;/p&gt;

&lt;p&gt;We deploy and manage various essential charts for your environment, including tools for logging, metrics collection, and certificate management. You don't need to worry about these intricacies.&lt;/p&gt;

&lt;p&gt;We deploy all the necessary infrastructure elements to create a fully functional Kubernetes environment for production within 30 minutes. We provide a complete solution that's ready to go.&lt;/p&gt;

&lt;p&gt;We build your container images, push them to a registry, and deploy them based on your preferences. We also handle the lifecycle of the applications deployed.&lt;/p&gt;

&lt;p&gt;We use &lt;a href="https://github.com/kubernetes/autoscaler"&gt;Cluster Autoscaler&lt;/a&gt; to automatically adjust the number of nodes (cluster size) based on your actual usage to ensure efficiency. Additionally, we deploy &lt;a href="https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler"&gt;Vertical&lt;/a&gt; and &lt;a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/"&gt;Horizontal Pod Autoscalers&lt;/a&gt; to automatically scale your applications' resources as their needs change.&lt;/p&gt;
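&lt;p&gt;As a minimal sketch of the horizontal side of that setup (the deployment name and thresholds are hypothetical), an HPA that scales on CPU utilization looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          # Add replicas when average CPU usage exceeds 70%
          averageUtilization: 70
&lt;/code&gt;&lt;/pre&gt;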

&lt;p&gt;By taking care of these complexities, Qovery frees your developers to focus solely on what matters most: building incredible applications.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: How large is your team of engineers?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: We have ten engineers working on the project.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: How do you manage hundreds of clusters with such a small team?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: We run various tests on each code change, including unit tests for individual components and end-to-end tests that simulate real-world usage. These tests cover configurations and deployment scenarios to catch potential issues early on.&lt;/p&gt;

&lt;p&gt;Before deploying a new cluster for a customer, we put it through its paces on our internal systems for weeks. Then, we deploy it to a separate non-production environment where we closely monitor its performance and address any problems before it reaches your applications. &lt;/p&gt;

&lt;p&gt;We closely monitor Kubernetes and cloud providers' updates by following official &lt;a href="https://github.com/kubernetes/kubernetes/tree/master/CHANGELOG"&gt;changelogs&lt;/a&gt; and using RSS feeds, allowing us to anticipate potential issues and adapt our infrastructure proactively.&lt;/p&gt;

&lt;p&gt;We also leverage tools like &lt;a href="https://github.com/doitintl/kube-no-trouble"&gt;Kubent&lt;/a&gt;, &lt;a href="https://github.com/derailed/popeye"&gt;Popeye&lt;/a&gt;, &lt;a href="https://github.com/wayfair-incubator/kdave"&gt;kdave&lt;/a&gt;, and &lt;a href="https://github.com/FairwindsOps/pluto"&gt;Pluto&lt;/a&gt; to help us manage &lt;a href="https://kubernetes.io/docs/reference/using-api/deprecation-guide/#:~:text=API%20Migration%20Guide-,Deprecated%20API%20Migration%20Guide,-As%20the%20Kubernetes"&gt;API deprecations&lt;/a&gt; (when Kubernetes deprecates features in updates) and ensure the overall health of our infrastructure.&lt;/p&gt;

&lt;p&gt;Our multi-layered approach has proven successful. We haven't encountered any significant problems when deploying clusters to production environments.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Managing new releases in the Kubernetes ecosystem can be daunting, especially with the extensive changelog. How do you navigate this complexity and spot potential difficulties when a new release is on the horizon?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: Reading the official update changelogs from Kubernetes and the cloud providers is our first step, but it isn't enough on its own. Understanding these detailed technical documents can be challenging, especially for newer team members who don’t have prior on-premise Kubernetes experience.&lt;/p&gt;

&lt;p&gt;Cloud providers typically offer well-defined upgrade processes and document significant changes like removed functionalities, changes in API behavior, or security updates in their changelogs. However, many elements are interconnected in a Kubernetes cluster, especially when you deploy multiple charts for components like logging, observability, and ingress. Even with automated tools, we still need extensive testing and a manual process to ensure everything functions smoothly after an update.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: So, what is your upgrade plan for Helm charts?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: Upgrading &lt;a href="https://helm.sh/docs/topics/charts/#:~:text=Contribute%20to%20Docs-,Charts,-Helm%20uses%20a"&gt;Helm charts&lt;/a&gt; can be tricky because they bundle both the deployment and the software; for example, upgrading the &lt;a href="https://grafana.com/docs/loki/latest/"&gt;Loki&lt;/a&gt; chart also upgrades Loki itself. To better understand what's changing, you need to review two changelogs: one for the chart itself and another for the software it includes.&lt;/p&gt;

&lt;p&gt;We keep a close eye on all the charts we use by storing them in a central repository. This way, we have a clear history of every version we've used. We use a tool called &lt;a href="https://github.com/Qovery/helm-freeze"&gt;helm-freeze&lt;/a&gt; to lock down the specific version of each chart we want to use. We can also track changes between chart and software versions using the &lt;code&gt;git diff&lt;/code&gt; command.&lt;/p&gt;
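&lt;p&gt;The idea behind helm-freeze is to declare every chart and its pinned version in a single file kept in the central repository. A sketch of such a configuration follows (the chart names and versions are illustrative, and the exact field names may differ; see the helm-freeze README for the precise schema):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# helm-freeze.yaml - pin each chart to an exact version
repos:
  - name: jetstack
    url: https://charts.jetstack.io
charts:
  - name: cert-manager
    repo_name: jetstack
    version: v1.14.4
destinations:
  - name: default
    path: ./charts
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Syncing the pinned charts into the repository then makes version bumps visible as ordinary &lt;code&gt;git diff&lt;/code&gt; output, which is exactly the tracking workflow described above.&lt;/p&gt;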

&lt;p&gt;If needed, we can also adjust specific settings within the chart using values override.&lt;/p&gt;

&lt;p&gt;Like any other code change, we thoroughly test the upgraded charts with unit and functional tests to ensure everything works as expected.&lt;/p&gt;

&lt;p&gt;Once testing is complete, we route the updated charts to our test cluster for a final round of real-world testing. After a few days of monitoring, if everything looks good, we confidently release the updates to our customers.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: How do you handle unexpected situations? Do you have a specific strategy or write more automation in the Helm charts?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: We're excited to see more community Helm charts ship with &lt;a href="https://helm.sh/docs/topics/chart_tests/#:~:text=Contribute%20to%20Docs-,Chart%20Tests,-A%20chart%20contains"&gt;built-in tests&lt;/a&gt;! This practice will make it easier for everyone to trust and use these charts in the future.&lt;/p&gt;

&lt;p&gt;At Qovery, we enable specific Helm options by default, like '&lt;a href="https://helm.sh/docs/helm/helm_install/#:~:text=Options-,%2D%2Datomic,-if%20set%2C%20the"&gt;atomic&lt;/a&gt;' and '&lt;a href="https://helm.sh/docs/helm/helm_install/#:~:text=version%20is%20used-,%2D%2Dwait,-if%20set%2C%20will"&gt;wait&lt;/a&gt;,' which roll back a release automatically if an upgrade fails and wait for resources to become ready before reporting success. However, some issues only show up in the logs, so we run additional tests specifically designed to catch these hidden problems.&lt;/p&gt;

&lt;p&gt;Upgrading charts that deploy &lt;a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/#:~:text=Custom%20Resources-,Custom%20Resources,-Custom%20resources%20are"&gt;Custom Resource Definitions (CRDs)&lt;/a&gt; requires special attention. We've automated this process to upgrade the CRDs first (to the required version) and then upgrade the chart itself. Additionally, for critical upgrades like cert-manager (which manages certificates), we back up and restore resources before applying the upgrade to avoid losing any critical certificates.&lt;/p&gt;

&lt;p&gt;If you’re running an older version of a non-critical tool like a logging system, upgrading through each minor version one by one can be time-consuming. We have a better way! Our system allows you to skip to the desired newer version, bypassing all those intermediate updates.&lt;/p&gt;

&lt;p&gt;We've also built safeguards into our system to handle potential problems before they occur during cluster upgrades. For example, the system checks for issues like failed jobs, incorrect Pod Disruption Budgets configuration, or ongoing processes that might block the upgrade. If it detects any problems, our engine automatically attempts to fix or clean up the issue. It will also warn you if any manual intervention is needed.&lt;/p&gt;

&lt;p&gt;Our ultimate goal is to automate the upgrade process as much as possible.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Would you say CRDs are your favorite feature in Kubernetes, or do you have another one?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: CRDs are a powerful tool for customizing Kubernetes, offering a high degree of flexibility. However, the current support and tooling around them leave room for improvement. For example, enhancing Helm with better CRD management capabilities would significantly improve the user experience.&lt;/p&gt;

&lt;p&gt;Despite these limitations, the potential of CRDs for customizing Kubernetes is undeniable, making them a genuinely standout feature.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: With your vast Kubernetes experience since 2016, how does your current process scale beyond 100 clusters? What do you need for such scalability?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: While basic application metrics can provide a general sense of health, managing hundreds of clusters requires more in-depth testing. Here at Qovery, with our experience handling nearly 300 clusters, we've found that:&lt;/p&gt;

&lt;p&gt;Basic metrics alone aren't enough. We need comprehensive testing that leverages application-specific metrics to ensure everything functions as expected.&lt;/p&gt;

&lt;p&gt;Scaling requires more granular control over deployments, such as halting failures and providing detailed information to our users. For instance, quota issues from the cloud provider might necessitate user intervention.&lt;/p&gt;

&lt;p&gt;Drawing from my experience at Criteo, where robust tooling was essential for managing complex tasks, powerful tools are the key to effectively scaling beyond 100 clusters.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Looking ahead at Qovery's roadmap, what's next for your team?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: Qovery will add &lt;a href="https://cloud.google.com/?hl=en"&gt;Google Cloud Platform (GCP)&lt;/a&gt; by year-end, joining &lt;a href="https://aws.amazon.com/"&gt;AWS&lt;/a&gt; and &lt;a href="https://www.scaleway.com/en/"&gt;Scaleway&lt;/a&gt;! This expansion gives you more choices for your cloud needs.&lt;/p&gt;

&lt;p&gt;We're extracting reusable code sections, like those related to Helm integration, and transforming them into dedicated libraries. By making these functionalities available as open-source libraries, we empower the developer community to leverage them in their projects.&lt;/p&gt;

&lt;p&gt;We strongly believe in &lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt; as a powerful language for building production-grade software, especially for systems like ours that run alongside Kubernetes.&lt;/p&gt;

&lt;p&gt;We're also developing a service catalog feature that offers a user-friendly interface and streamlines complex deployments. This feature will allow users to focus on their applications, not the intricacies of the underlying technology.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: Do you have any plans to include &lt;a href="https://azure.microsoft.com/en-us"&gt;Azure&lt;/a&gt;?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: Yes, we do, but integrating a new cloud provider is challenging given our current team size. Even with a team of seniors, each cloud provider has its nuances; some are more mature or feature-rich than others.&lt;/p&gt;

&lt;p&gt;Today, our focus is on AWS and GCP, as those are what our customers request most. However, we're also working on a more modular approach that will allow Qovery to be deployed on any Kubernetes cluster, irrespective of the cloud provider, although this is still in progress.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: We're looking forward to hearing more about that. So, with your black belt in karate, how does that experience influence how you approach challenges, breaking them down into manageable steps?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: Karate has taught me the importance of discipline, focus, and breaking down complex tasks into manageable steps. Like in karate, where each move is deliberate and precise, I apply the same approach to challenges in my work, breaking them down into smaller, achievable goals. &lt;/p&gt;

&lt;p&gt;Karate has also instilled in me a sense of perseverance and resilience, which are invaluable when facing difficult situations.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: I'm a huge martial arts fan. How do you see martial arts' influence on managing stress in challenging situations?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: It varies from person to person. My experience in the banking industry has shown me that while some can handle stressful situations, others struggle. Martial arts can help manage stress somewhat, depending on the person.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: How has your 25-year journey in karate shaped your perspective?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: Karate has become a part of me, and I plan to continue as long as possible. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bart&lt;/strong&gt;: What's the best way to reach out to you?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pierre&lt;/strong&gt;: You can reach me on LinkedIn or via email. I'm always happy to help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrap up 🌄&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If you enjoyed this interview and want to listen to more Kubernetes stories and opinions, head to &lt;a href="https://kube.fm"&gt;KubeFM&lt;/a&gt; and subscribe to the podcast.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you want to keep up-to-date with Kubernetes, subscribe to &lt;a href="https://learnk8s.io/learn-kubernetes-weekly"&gt;Learn Kubernetes Weekly&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you want to become an expert in Kubernetes, look at courses on &lt;a href="https://learnk8s.io/training"&gt;Learnk8s&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;And finally, if you want to keep in touch, follow me on &lt;a href="https://www.linkedin.com/in/gulcantopcu/"&gt;Linkedin&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
