
Gulcan Topcu

Upgrading Hundreds of Kubernetes Clusters

Automating the upgrade process for hundreds of Kubernetes clusters is a formidable task, but Pierre Mavro, co-founder and CTO at Qovery, is well-equipped to handle it. Drawing on his extensive experience and a dedicated team of engineers, he has successfully automated the upgrade process for both public and private clouds.

Bart Farrell sat down with Pierre to understand how he did it without breaking the bank.

You can watch (or listen to) this interview here.

Bart: If you installed three tools on a new Kubernetes cluster, which tools would they be and why?

Pierre: The first tool I recommend is K9s. It's not just a time-saver but a productivity booster. With its intuitive interface, you can speed up all the usual kubectl commands, access logs, edit resources and configurations, and more. It's like having a personal assistant for your cluster management tasks.

The second is really a combination of tools: External DNS, cert-manager, and NGINX Ingress. Used together as a stack, they let you quickly deploy an application and expose it through DNS with a TLS certificate, all through a few simple annotations. When I first discovered External DNS, I was amazed at its quality.
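
As an illustration, a single Ingress manifest along these lines is enough for External DNS to create the DNS record and cert-manager to issue the certificate, assuming both are installed alongside ingress-nginx; the hostname, issuer, and service names below are placeholders:

```bash
# Hypothetical example: expose an app with DNS + TLS purely via annotations.
# Assumes external-dns, cert-manager (with a ClusterIssuer named
# "letsencrypt-prod"), and ingress-nginx are already installed.
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    # external-dns also picks up the host from the rules below;
    # the annotation just makes the desired record explicit.
    external-dns.alpha.kubernetes.io/hostname: app.example.com
    # cert-manager issues a certificate into the secret referenced under tls.
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
  tls:
    - hosts:
        - app.example.com
      secretName: my-app-tls
EOF
```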

The last one is mostly an observability stack: Prometheus, Metrics Server, and the Prometheus Adapter, which give you excellent insight into what is happening on the cluster. You can then reuse the same stack for autoscaling by repurposing all the data collected for monitoring.
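
As a sketch of that reuse: once the Prometheus Adapter exposes a Prometheus query through the custom metrics API, a HorizontalPodAutoscaler can scale on it. The metric name, deployment, and target values below are hypothetical and assume the adapter has been configured to serve them:

```bash
# Hypothetical example: scale on a custom metric served by prometheus-adapter.
# Assumes the adapter exposes "http_requests_per_second" for this
# deployment's pods via the custom.metrics.k8s.io API.
cat <<'EOF' | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
EOF
```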

Bart: Tell us more about your background and how you progressed through your career.

Pierre: My journey in the tech industry has been diverse and enriching. I've had the privilege of working for renowned companies like Red Hat and Criteo, where I honed my skills in cloud deployment. Today, as the co-founder and CTO of Qovery, I bring a wealth of experience in distributed systems, particularly for NoSQL databases, and a deep understanding of Kubernetes, which I began exploring in 2016 with version 1.2.

To provide some context to Qovery's services, we offer a self-service developer platform that allows code deployment on Kubernetes without requiring expertise in infrastructure. We keep our platform cloud-agnostic and place Kubernetes at the core to ensure our deployments are portable across different cloud providers.

Bart: How was your journey into Kubernetes and the cloud-native world, given the changes since 2016?

Pierre: Actually, learning Kubernetes was quite a journey. The landscape was far less developed back then, with most Kubernetes components still in alpha. In 2016, I was also juggling my job at Criteo and my own company.

When it came to deployment, I had several options, and I chose the hard way: deploying Kubernetes on bare-metal nodes using Kubespray. Troubleshooting bare-metal Kubernetes deployments honed my skills in pinpointing issues. This hands-on experience gave me a deep understanding of how each component, like the control plane, kubelet, container runtime, and scheduler, interacts to orchestrate containers.

Another resource I found pretty helpful was "Kubernetes the Hard Way" by Kelsey Hightower, despite its complexity.

Lastly, I got help from the official Kubernetes docs.

Bart: Looking back, is there anything you would do differently or advice you would give to your past self?

Pierre: Not really. Looking back, Kubespray was the best option at the time, and there's nothing significant I would change about that decision.

Bart: You've worked on various projects involving bare metal and private clouds. Can you share more about your Kubernetes experience, such as the scale of clusters and nodes?

Pierre: At Criteo, I led a NoSQL team supporting several million requests per second on a massive 4,500-node bare-metal cluster. Managing this infrastructure - particularly node failures and data consistency across stateful databases like Cassandra, Couchbase, and Elasticsearch - was a constant challenge.

While at Criteo, I also had a personal project where I built a smaller 10-node bare-metal cluster.
This experience with bare metal management solidified my belief in the benefits of Kubernetes, which I later implemented at Criteo.

When we adopted Kubernetes at Criteo, we encountered initial hurdles. In 2018, Kubernetes operators were still new, and there was internal competition from Mesos. We addressed these challenges by validating Kubernetes performance for our specific needs and building custom Chef recipes, StatefulSet hooks, and startup scripts.

Migrating to Kubernetes took eight months of dedicated effort. It was a complex process, but it was worth it.

Bart: As you’ve mentioned, Kubernetes had competitors in 2018 and continues to do so today. Despite the tooling's immaturity, you led a team to adopt Kubernetes for stateful workloads, which was unconventional. How did you guide your team through this transition?

Pierre: We had large instances, each with between 50 and 100 CPUs and anywhere from 256 GB up to 500 GB of RAM.

We had multiple Cassandra clusters on a single Kubernetes cluster, and each Kubernetes node was dedicated to a single Cassandra node. We chose this bare metal setup to optimize disk access with SSD or NVMe.

Running these stateful workloads wasn't just a matter of starting them up. We had to handle them carefully because stateful workloads like Elasticsearch and Cassandra must keep their data safe even if the machine they're running on fails.

Kubernetes helped us detect and handle issues with these apps using features like Pod Disruption Budgets (PDBs), which limit how many pods can be taken down at once during voluntary disruptions; StatefulSets, which provide ordered execution and stable storage; and automated probes, which trigger actions and alerts when something goes wrong.
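
For instance, a minimal PodDisruptionBudget like the sketch below (names are illustrative) ensures that voluntary disruptions such as node drains never take down more than one Cassandra pod at a time:

```bash
# Hypothetical example: allow at most one Cassandra pod to be unavailable
# at a time during voluntary disruptions (e.g. draining a node to upgrade it).
cat <<'EOF' | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cassandra-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: cassandra
EOF
```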

Bart: Your experiences helped me better understand your blog post, The Cost of Upgrading Hundreds of Kubernetes Clusters. After managing large infrastructures, you founded Qovery. What drove you to take this step as an engineer?

Pierre: Kubernetes has become a standard, but managing it can be a headache for developers. Cloud providers offer a basic Kubernetes setup, but it often lacks the features developers need to get started and deploy applications quickly. Managing the cluster and its nodes and keeping them up to date is time-consuming, and developers must spend a lot of additional effort layering extra tools and configurations on top of the basic setup, then keeping everything updated.

To tackle these challenges, I founded Qovery.

Qovery provides two critical solutions. First, it offers a unified, user-friendly stack across cloud providers, simplifying Kubernetes deployment and management complexity. Second, it enables developers to deploy code without hassle.

Bart: Managing clusters can have various interpretations. The term can be broad. How do you define cluster management at Qovery in the context of upgrading and recovery?

Pierre: Yes, that's right. At Qovery, we understand the complexity of managing Kubernetes for customers. That's why we automate and simplify the entire process.

We automatically notify you about upcoming Kubernetes updates and handle the upgrade process on schedule, eliminating the need for manual intervention.

We deploy and manage various essential charts for your environment, including tools for logging, metrics collection, and certificate management. You don't need to worry about these intricacies.

We deploy all the necessary infrastructure elements to create a fully functional Kubernetes environment for production within 30 minutes. We provide a complete solution that's ready to go.

We build your container images, push them to a registry, and deploy them based on your preferences. We also handle the lifecycle of the applications deployed.

We use the Cluster Autoscaler to automatically adjust the number of nodes (the cluster size) based on your actual usage to ensure efficiency. Additionally, we deploy Vertical and Horizontal Pod Autoscalers to automatically scale your applications' resources as their needs change.
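
On the vertical side, a minimal VerticalPodAutoscaler manifest looks roughly like the sketch below; it assumes the VPA components are installed in the cluster, and the target Deployment name is a placeholder:

```bash
# Hypothetical example: let the VPA adjust CPU/memory requests automatically.
# Assumes the Vertical Pod Autoscaler components are installed in the cluster.
cat <<'EOF' | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"
EOF
```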

By taking care of these complexities, Qovery frees your developers to focus solely on what matters most: building incredible applications.

Bart: How large is your team of engineers?

Pierre: We have ten engineers working on the project.

Bart: How do you manage hundreds of clusters with such a small team?

Pierre: We run various tests on each code change, including unit tests for individual components and end-to-end tests that simulate real-world usage. These tests cover configurations and deployment scenarios to catch potential issues early on.

Before deploying a new cluster for a customer, we put it through its paces on our internal systems for weeks. Then, we deploy it to a separate non-production environment where we closely monitor its performance and address any problems before it reaches your applications.

We closely monitor Kubernetes and cloud providers' updates by following official changelogs and using RSS feeds, allowing us to anticipate potential issues and adapt our infrastructure proactively.

We also leverage tools like kubent, Popeye, kdave, and Pluto to help us manage API deprecations (APIs that Kubernetes removes in newer releases) and ensure the overall health of our infrastructure.
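
As a rough sketch of how two of these scanners are typically run ahead of an upgrade (the target version and manifest directory are placeholders):

```bash
# Scan live cluster objects for APIs deprecated or removed in the target release.
kubent --target-version 1.28.0

# Scan installed Helm releases and local manifests for deprecated API versions.
pluto detect-helm -o wide
pluto detect-files -d ./manifests
```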

Our multi-layered approach has proven successful. We haven't encountered any significant problems when deploying clusters to production environments.

Bart: Managing new releases in the Kubernetes ecosystem can be daunting, especially with the extensive changelog. How do you navigate this complexity and spot potential difficulties when a new release is on the horizon?

Pierre: Reading the official update changelogs from Kubernetes and the cloud providers is our first step, but it isn't the whole story. Understanding these detailed technical documents can be challenging, especially for newer team members who don't have prior on-premises Kubernetes experience.

Cloud providers typically offer well-defined upgrade processes and document significant changes like removed functionalities, changes in API behavior, or security updates in their changelogs. However, many elements are interconnected in a Kubernetes cluster, especially when you deploy multiple charts for components like logging, observability, and ingress. Even with automated tools, we still need extensive testing and a manual process to ensure everything functions smoothly after an update.

Bart: So, what is your upgrade plan for Helm charts?

Pierre: Upgrading Helm charts can be tricky because they bundle both the deployment and the software; for example, upgrading the Loki chart also upgrades Loki itself. To better understand what's changing, you need to review two changelogs: one for the chart itself and another for the software it includes.

We keep a close eye on all the charts we use by storing them in a central repository. This way, we have a clear history of every version we've used. We use a tool called helm-freeze to lock down the specific version of each chart we want to use. We can also track changes between chart and software versions using the git diff command.
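
helm-freeze handles the pinning for us, but the underlying vendor-and-diff workflow can be sketched with plain Helm and git; the repository, chart, and versions below are placeholders:

```bash
# Vendor a pinned chart version into the repo (the step helm-freeze automates).
helm repo add grafana https://grafana.github.io/helm-charts
helm pull grafana/loki --version 5.8.0 --untar --untardir charts/

git add charts/loki
git commit -m "chore: bump Loki chart to 5.8.0"

# Review exactly what changed since the previously vendored version.
git diff HEAD~1 -- charts/loki
```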

If needed, we can also adjust specific settings within the chart using values overrides.

Like any other code change, we thoroughly test the upgraded charts with unit and functional tests to ensure everything works as expected.

Once testing is complete, we route the updated charts to our test cluster for a final round of real-world testing. After a few days of monitoring, if everything looks good, we confidently release the updates to our customers.

Bart: How do you handle unexpected situations? Do you have a specific strategy or write more automation in the Helm charts?

Pierre: We're excited to see more community Helm charts shipping with built-in tests! This practice will make it easier for everyone to trust and use these charts in the future.

At Qovery, we enable specific Helm options by default, like 'atomic' and 'wait', which catch upgrade failures during the process and roll the release back cleanly. However, there can still be issues that only show up in the logs, so we run additional tests specifically designed to catch these hidden problems.
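
In raw Helm terms, those defaults correspond to flags like these (the release, chart, namespace, and timeout are illustrative):

```bash
# --atomic rolls the release back automatically if the upgrade fails;
# --wait blocks until all resources report ready or the timeout expires.
helm upgrade loki grafana/loki \
  --namespace logging \
  --atomic \
  --wait \
  --timeout 10m
```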

Upgrading charts that deploy Custom Resource Definitions (CRDs) requires special attention. We've automated this process to upgrade the CRDs first (to the required version) and then upgrade the chart itself. Additionally, for critical upgrades like cert-manager (which manages certificates), we back up and restore resources before applying the upgrade to avoid losing any critical certificates.
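
A hedged sketch of that ordering for cert-manager: the version is a placeholder, the CRD manifest URL follows cert-manager's published release artifacts, and the backup step is a plain resource dump rather than Qovery's actual tooling:

```bash
# 1. Back up certificate resources so nothing is lost if the upgrade goes wrong.
kubectl get certificates,issuers,clusterissuers -A -o yaml > cert-manager-backup.yaml

# 2. Upgrade the CRDs to the target version first...
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.crds.yaml

# 3. ...then upgrade the chart itself to the matching version.
helm upgrade cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --version v1.13.0 \
  --atomic --wait
```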

If you're running an older version of a non-critical tool like a logging system, upgrading through each minor version one by one can be time-consuming. We have a better way: our system lets you skip straight to the desired newer version, bypassing all those intermediate updates.

We've also built safeguards into our system to handle potential problems before they occur during cluster upgrades. For example, the system checks for issues like failed jobs, incorrectly configured Pod Disruption Budgets, or ongoing processes that might block the upgrade. If it detects any problems, our engine automatically attempts to fix or clean up the issue, and it will warn you if any manual intervention is needed.
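
For example, pre-flight checks along these lines (illustrative jq queries, not Qovery's engine) can surface common upgrade blockers before a node drain begins:

```bash
# Jobs with failed pods often signal problems worth fixing before an upgrade.
kubectl get jobs -A -o json \
  | jq -r '.items[] | select((.status.failed // 0) > 0)
           | "\(.metadata.namespace)/\(.metadata.name)"'

# PDBs that currently allow zero disruptions will stall node drains.
kubectl get pdb -A -o json \
  | jq -r '.items[] | select(.status.disruptionsAllowed == 0)
           | "\(.metadata.namespace)/\(.metadata.name)"'
```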

Our ultimate goal is to automate the upgrade process as much as possible.

Bart: Would you say CRDs are your favorite feature in Kubernetes, or do you have another one?

Pierre: CRDs are a powerful tool for customizing Kubernetes, offering a high degree of flexibility. However, the current support and tooling around them leave room for improvement. For example, enhancing Helm with better CRD management capabilities would significantly improve the user experience.

Despite these limitations, the potential of CRDs for customizing Kubernetes is undeniable, making them a genuinely standout feature.

Bart: With your vast Kubernetes experience since 2016, how does your current process scale beyond 100 clusters? What do you need for such scalability?

Pierre: While basic application metrics can provide a general sense of health, managing hundreds of clusters requires more in-depth testing. Here at Qovery, with our experience handling nearly 300 clusters, we've found that:

  • Basic metrics alone are not enough. We need comprehensive testing that leverages application-specific metrics to ensure everything functions as expected.

  • Scaling requires more granular control over deployments, such as halting failed rollouts and surfacing detailed information to our users. For instance, quota issues on the cloud provider's side might necessitate user intervention.

  • Drawing from my experience at Criteo, where robust tooling was essential for managing complex tasks, powerful tools are the key to scaling effectively beyond 100 clusters.

Bart: Looking ahead at Qovery's roadmap, what's next for your team?

Pierre: Qovery will add Google Cloud Platform (GCP) by year-end, joining AWS and Scaleway! This expansion gives you more choices for your cloud needs.

We're extracting reusable code sections, like those related to Helm integration, and transforming them into dedicated libraries. By making these functionalities available as open-source libraries, we empower the developer community to leverage them in their projects.

We strongly believe in Rust as a powerful language for building production-grade software, especially for systems like ours that run alongside Kubernetes.

We're also developing a service catalog feature that offers a user-friendly interface and streamlines complex deployments. This feature will allow users to focus on their applications, not the intricacies of the underlying technology.

Bart: Do you have any plans to include Azure?

Pierre: Yes, we do, but integrating a new cloud provider with our current team size is challenging. While we are a team of seniors, each cloud provider has its nuances; some are more mature or offer more extensive resources than others.

Today, our focus is on AWS and GCP, since those are what our customers request most. However, we're also working on a more modular approach that will allow Qovery to be deployed on any Kubernetes cluster, irrespective of the cloud provider, although this is still in progress.

Bart: We're looking forward to hearing more about that. So, with your black belt in karate, how does that experience influence how you approach challenges, breaking them down into manageable steps?

Pierre: Karate has taught me the importance of discipline, focus, and breaking down complex tasks into manageable steps. Like in karate, where each move is deliberate and precise, I apply the same approach to challenges in my work, breaking them down into smaller, achievable goals.

Karate has also instilled in me a sense of perseverance and resilience, which are invaluable when facing difficult situations.

Bart: I'm a huge martial arts fan. How do you see martial arts' influence on managing stress in challenging situations?

Pierre: It varies from person to person. My experience in the banking industry has shown me that while some can handle stressful situations, others struggle. Martial arts can help manage stress somewhat, depending on the person.

Bart: How has your 25-year journey in karate shaped your perspective?

Pierre: Karate has become a part of me, and I plan to continue as long as possible.

Bart: What's the best way to reach out to you?

Pierre: You can reach me on LinkedIn or via email. I'm always happy to help.

Wrap up 🌄

  • If you enjoyed this interview and want to listen to more Kubernetes stories and opinions, head to KubeFM and subscribe to the podcast.

  • If you want to keep up-to-date with Kubernetes, subscribe to Learn Kubernetes Weekly.

  • If you want to become an expert in Kubernetes, look at courses on Learnk8s.

  • And finally, if you want to keep in touch, follow me on LinkedIn.
