
David Krohn

Posted on • Originally published at globaldatanet.com

Enterprise-scaled Self-Healing StackSets

With more than 5 million articles from over 7,000 brands, OTTO is one of the leading German online shopping platforms. In the future, it will open up to even more brands and partners as part of its transformation. OTTO is part of the internationally active Otto Group, with headquarters in Hamburg, and employs 6,100 people throughout Germany. In the 2020/21 financial year, OTTO generated revenues of 4.5 billion euros.

At OTTO, we faced several challenges in operating AWS CloudFormation StackSets at scale. We must govern several hundred AWS accounts for our product teams while balancing the need for agility and control.

At this scale, operations take a lot of time because several operational tasks keep recurring: AWS accounts leave the AWS Organization or teams nuke their AWS account; StackSet instances drift, because not all resources required for compliance can be protected (SCP limitations); existing AWS accounts join the AWS Organization and all mandatory StackSets need to be deployed to them; and manual steps should be kept to a minimum. Furthermore, the service itself offers no feature that gives an overview of drifted instances or of the general health and compliance of your StackSets.

The cloud competence center at OTTO IT, also known as the Governance at Scale (GAS) team, developed a self-healing solution for StackSets that is integrated into the OTTO tooling ecosystem with Confluence and Microsoft Teams.

OTTO worked with globaldatanet to set up its Landing Zone. globaldatanet is an award-winning AWS Advanced Consulting Partner and longtime Cloud Solution Provider for OTTO, supporting the team in cloud security and GAS. Its focus on building cloud-native, serverless solutions has helped more than 100 companies within 5 years to develop and innovate products and services in the cloud.

In this post, we’ll demonstrate how to implement fully automated, enterprise-scale self-healing for StackSets using AWS Step Functions, and how to create a dashboard that gives an overview of your StackSet health and compliance and reduces operational time.

The solution workflow includes the following steps:

  1. The tagging concept for StackSets
  2. Automatically create StackSets configuration in SSM Parameter Store
  3. Implementing StepFunction for StackSet Self-Healing

Let’s see how this works.

Prerequisites

The following prerequisites are necessary for following along with the contents of this post:

Solution overview

The following architecture shows the whole solution of the Self Healing StackSets.


Architecture of fully-automated Self Healing Solution with integration to Confluence.

Tagging concept for StackSets

The solution requires a JSON file in the AWS Systems Manager Parameter Store. The easiest way to create it is automatically, based on the StackSet configurations and the tags assigned to them; the section "Automatically create StackSets configuration Parameter Store" goes into more detail. Below, we describe the tags we introduced for our StackSets and what each one is for.

⚠️ AWS tags do not allow commas in values, so we use ":" as the divider for lists.

| Key | Value | Result | Example |
| --- | --- | --- | --- |
| antidependson | StackSet name | Marks StackSets that collide with each other. | MYSTACKSET |
| dependson | [List of StackSet names] | List of StackSets that need to be rolled out before this StackSet is deployed (e.g. enable Config before activating Config rules). NOTE: Please limit this to a single dependson StackSet for now and form "chains" for multi-dependencies. | MY-STACKSET1:MYSTACKSET2 |
| mandatory | true or false | The StackSet instances must be present in all AWS accounts. | true |
| selfhealing | true or false | The StackSet can be healed via delete and redeploy (exceptions are, e.g., IDP roles). Parameter overrides are cached. | true |
| region | [Regions] | List of Regions in which the StackSet instances are to be deployed. | eu-west-1:eu-central-1:us-east-1 |
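
To illustrate how these tags could be turned into configuration, here is a minimal Python sketch. It assumes the tags arrive as a list of Key/Value pairs (the shape CloudFormation uses for StackSet tags); the function name `parse_stackset_tags` and the layout of the resulting dict are hypothetical and not taken from the original solution.

```python
def parse_stackset_tags(tags):
    """Turn a StackSet's tag list into a typed configuration dict (hypothetical layout)."""
    raw = {tag["Key"].lower(): tag["Value"] for tag in tags}

    def as_list(value):
        # ":" is the list divider, because commas are not allowed in tag values
        return [item for item in value.split(":") if item]

    def as_bool(value):
        return value.strip().lower() == "true"

    return {
        "dependson": as_list(raw.get("dependson", "")),
        "antidependson": as_list(raw.get("antidependson", "")),
        "mandatory": as_bool(raw.get("mandatory", "false")),
        "selfhealing": as_bool(raw.get("selfhealing", "false")),
        "regions": as_list(raw.get("region", "")),
    }


example = parse_stackset_tags([
    {"Key": "dependson", "Value": "MY-STACKSET1"},
    {"Key": "mandatory", "Value": "true"},
    {"Key": "region", "Value": "eu-west-1:eu-central-1:us-east-1"},
])
print(example["regions"])
```

Treating a missing tag as "false" or as an empty list is a design choice of this sketch; a stricter variant could reject StackSets that lack a tag.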

Automatically create StackSets configuration Parameter Store

The automated generation of the StackSet configuration as JSON in the Parameter Store serves several purposes:

  1. It removes the chore of configuring a JSON document manually
  2. It ensures the account vending machine knows what to deploy and in which order
  3. It tells the self-healing Step Function what setup to expect in the member accounts

The Lambda responsible for this task is invoked via an Events rule every time a StackSet operation finishes with status "succeeded".

This works because the tags on a StackSet are part of the StackSet itself rather than additional items describing it, so a change to the tags always results in a StackSet update operation.

From a computer science perspective the Lambda is quite interesting: the primary problem was to build an unweighted tree from the "dependson" and "antidependson" tags and then flatten it into an ordered one-dimensional list, which is in essence a dependency-ordering (topological sort) problem.
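
The ordering step can be sketched with Python's standard `graphlib` module. This is not the Lambda's actual code: the function name `deployment_order` and the input shape (StackSet name mapped to its "dependson" list) are assumptions, and "antidependson" is left out of the sketch.

```python
from graphlib import CycleError, TopologicalSorter


def deployment_order(dependencies):
    """Return StackSet names ordered so each one comes after its dependson entries."""
    sorter = TopologicalSorter()
    for name in sorted(dependencies):
        # add(node, *predecessors): the predecessors must be deployed first
        sorter.add(name, *sorted(dependencies[name]))
    try:
        return list(sorter.static_order())
    except CycleError as error:
        # args[1] holds the list of nodes that form the cycle
        raise ValueError(f"circular dependson chain: {error.args[1]}") from error


order = deployment_order({
    "config-rules": ["enable-config"],  # hypothetical StackSet names
    "enable-config": [],
    "idp-roles": [],
})
print(order)
```

The only guarantee is that a StackSet appears after everything it depends on; the relative order of unrelated StackSets is not significant.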

Implementing StepFunction for StackSet Self-Healing

AWS Step Functions is a cloud service that enables you to coordinate the components of distributed applications and microservices using visual workflows. It allows you to build and automate the execution of complex processes and tasks across multiple AWS services, using a visual interface to define and execute your workflows. Since the self-healing solution needs a complex workflow, we decided to use Step Functions for this use case. The following sections explain the self-healing workflow.

StepFunction Workflow

[Image: Step Function workflow]

Functionality

ƛ Serverless Functions

  • StackSetInitCleanupLambda: Searches for StackSet instances that belong to AWS accounts which are no longer part of the AWS Organization or which are suspended, and deletes these instances from all associated StackSets.
  • MandatoryStackSetDeploymentLambda: Searches for missing instances of StackSets tagged with mandatory = true and deploys them.
  • StackSetDriftDetectionLambda: Triggers drift detection on all StackSets.
  • TriggerDriftStatusLambda: Checks whether drift detection has completed on all StackSets.
  • SearchStackSetInstanceToHealLambda: Searches for drifted StackSet instances in StackSets tagged with selfhealing = true.
  • StackSetCleanupLambda: Removes unhealthy StackSet instances and redeploys them. Parameter overrides are cached so that the newly deployed instance has the same settings as before.
  • StatusPrepareHTMLLambda: Prepares the HTML dashboard for Confluence and a JSON log file of the current StackSet health state.
  • TeamsNotificationLambda: Sends a Teams notification with a summary to the GAS team after each execution.
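
The selection logic behind the first two functions comes down to set comparisons. The sketch below uses plain dicts and tuples so that it runs without AWS access; the data shapes and function names are hypothetical, and the real Lambdas would first gather this data from the Organizations and CloudFormation APIs.

```python
def stale_instances(instances, active_accounts):
    """Instances whose account has left the organization or is suspended."""
    return [item for item in instances if item["account"] not in active_accounts]


def missing_mandatory_instances(config, instances, active_accounts):
    """(stack_set, account, region) triples that should exist but do not."""
    existing = {(i["stack_set"], i["account"], i["region"]) for i in instances}
    missing = []
    for stack_set, settings in sorted(config.items()):
        if not settings["mandatory"]:
            continue  # only StackSets tagged mandatory = true are enforced
        for account in sorted(active_accounts):
            for region in settings["regions"]:
                if (stack_set, account, region) not in existing:
                    missing.append((stack_set, account, region))
    return missing


# Hypothetical sample data
active = {"111111111111", "222222222222"}
deployed = [
    {"stack_set": "baseline", "account": "111111111111", "region": "eu-central-1"},
    {"stack_set": "baseline", "account": "999999999999", "region": "eu-central-1"},
]
settings = {"baseline": {"mandatory": True, "regions": ["eu-central-1"]}}

print(stale_instances(deployed, active))
print(missing_mandatory_instances(settings, deployed, active))
```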

?!Decisions

  • InitCleanup Complete: Checks whether all unnecessary instances have been removed. If not, the Step Function triggers the StackSetInitCleanupLambda function again.
  • MandatoryStackSetDeployment Complete: Checks whether all mandatory instances have been deployed. If not, the Step Function triggers the MandatoryStackSetDeploymentLambda function again.
  • StackSetDriftDetection Complete: Waits until drift detection has finished on all StackSets.
  • Healing Complete: Checks whether all unhealthy instances have been healed. If not, the Step Function invokes StackSetCleanupLambda again.
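
Each of these decisions follows the same pattern: run a task, check a flag, and loop back until the flag is set. Below is a minimal sketch of that pattern, built as an Amazon States Language definition in Python. It assumes the Lambda returns a boolean field named `complete`; the field name, the wait time, the helper `retry_until_complete`, and the ARN are placeholders, not details from the original state machine.

```python
import json


def retry_until_complete(task_name, lambda_arn, next_state, wait_seconds=30):
    """Build the Task -> Choice -> Wait loop for one step as ASL state definitions."""
    choice_name = f"{task_name} Complete"
    wait_name = f"Wait Before {task_name}"
    return {
        task_name: {"Type": "Task", "Resource": lambda_arn, "Next": choice_name},
        choice_name: {
            "Type": "Choice",
            # Assumes the Lambda output contains {"complete": true or false}
            "Choices": [
                {"Variable": "$.complete", "BooleanEquals": True, "Next": next_state}
            ],
            "Default": wait_name,
        },
        wait_name: {"Type": "Wait", "Seconds": wait_seconds, "Next": task_name},
    }


states = retry_until_complete(
    "InitCleanup",
    # Placeholder ARN for illustration only
    "arn:aws:lambda:eu-central-1:111122223333:function:StackSetInitCleanupLambda",
    "Done",
)
states["Done"] = {"Type": "Succeed"}
definition = {"StartAt": "InitCleanup", "States": states}
print(json.dumps(definition, indent=2))
```

The Wait state between retries is an addition of this sketch; it keeps the loop from calling the Lambda in rapid succession while CloudFormation is still working.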

Limitations

While developing the solution we faced several limitations. Here are our findings and how we solved them.

  • 🚨 StackSets instance operations: The maximum number of stack instances, across all stack sets, that you can run operations on at the same time in each Region is limited to 10,000 per administrator account.

    ✅ We implemented a counter that tracks the StackSets operations currently in progress. In addition, we catch the exception from CloudFormation and wait a few seconds before trying the operation again.

  • 🚨 Parameter override caching: When you remove a drifted StackSet instance that has parameter overrides, you lose the instance's individual parameters.

    ✅ Before deleting the drifted StackSet instance, we cache its parameter overrides and, after successful deletion, redeploy the instance with the cached overrides.

  • 🚨 AWS Step Functions payload size: AWS Step Functions supports payload sizes of up to 256 KB. Our solution needs to pass larger payloads between states, especially the log and the parameter overrides per StackSet.

    ✅ We store our state in an S3 bucket and pass it between states that way. At the end of the execution we delete the state from S3 so that a stale state cannot influence the next Step Function execution.
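
The three workarounds can be combined in one small sketch: cache the overrides in S3, then redeploy with a retry when CloudFormation is busy. It assumes boto3-style clients are passed in as `cfn` and `s3`; the function names, the `is_busy` callback, and the backoff values are hypothetical. Deleting the instance and waiting for that operation to finish would happen between the two calls and is omitted here.

```python
import json
import time


def call_with_retry(operation, is_busy, attempts=5, delay=2.0, sleep=time.sleep):
    """Run operation(); if it raises an error accepted by is_busy(), wait and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:
            if attempt == attempts or not is_busy(error):
                raise
            sleep(delay * attempt)  # simple linear backoff


def cache_overrides(cfn, s3, bucket, key, stack_set, account, region):
    """Store an instance's parameter overrides in S3 before it is deleted."""
    instance = cfn.describe_stack_instance(
        StackSetName=stack_set,
        StackInstanceAccount=account,
        StackInstanceRegion=region,
    )["StackInstance"]
    overrides = [
        {"ParameterKey": p["ParameterKey"], "ParameterValue": p["ParameterValue"]}
        for p in instance.get("ParameterOverrides", [])
        if "ParameterValue" in p
    ]
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(overrides).encode("utf-8"))
    return overrides


def redeploy_with_cached_overrides(cfn, s3, bucket, key, stack_set, account, region, is_busy):
    """Recreate the instance with the cached overrides, then drop the cached state."""
    overrides = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
    result = call_with_retry(
        lambda: cfn.create_stack_instances(
            StackSetName=stack_set,
            Accounts=[account],
            Regions=[region],
            ParameterOverrides=overrides,
        ),
        is_busy,
    )
    s3.delete_object(Bucket=bucket, Key=key)  # avoid stale state in the next run
    return result
```

With boto3, `is_busy` could for instance check for the CloudFormation client's `OperationInProgressException`; which errors the original solution retries on is not stated in this post.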

Documentation

After each execution of the StackSet Health StepFunction, we aim to notify our GAS team about the actions taken during the previous run. Therefore, we have implemented a Teams notification that includes a status update, a link to the generated dashboard, and a link to the log file.
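
As an illustration of what such a notification body might look like, the sketch below builds a card in the MessageCard format accepted by Teams incoming webhooks. The post does not say how the notification is actually delivered, so the delivery mechanism, the function name `build_teams_card`, the titles, and the example URLs are all assumptions.

```python
import json


def build_teams_card(summary, dashboard_url, log_url):
    """Build a MessageCard payload with a summary and two link buttons."""

    def link(name, url):
        return {"@type": "OpenUri", "name": name, "targets": [{"os": "default", "uri": url}]}

    return {
        "@type": "MessageCard",
        "@context": "http://schema.org/extensions",
        "summary": "StackSet self-healing run",
        "title": "StackSet self-healing run",
        "text": summary,
        "potentialAction": [
            link("Open dashboard", dashboard_url),
            link("Open log file", log_url),
        ],
    }


card = build_teams_card(
    "3 instances healed, 0 failed",        # hypothetical summary text
    "https://example.com/dashboard.html",  # placeholder URLs
    "https://example.com/run.log.json",
)
print(json.dumps(card, indent=2))
```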

The following screenshot illustrates an example of a Teams notification. It provides a summary report and directs you to the dashboard and log file for further details.

[Image: example Teams notification]

Dashboard

Our StackSet Health Dashboard is a simple HTML file that is generated by a Lambda function, saved in S3, and distributed through CloudFront. You can integrate this dashboard into Confluence or any other internal wiki. The dashboard is secured via a CloudFront Function; additionally, you can add a firewall to restrict access to a specific CIDR range or geographic region and prevent access from third parties. The screenshot below provides an example of the overall StackSet health status information for an entire AWS Organization.

[Image: StackSet Health Dashboard]

Conclusion

In this post, we demonstrated a solution to automatically heal AWS CloudFormation StackSets at scale. By implementing this solution, we reduced manual effort for StackSet cleanup operations by 4 hours per week, improved the overall reliability of our StackSets, increased compliance across the organisation, and gained a daily updated overview of all StackSet instances through the dashboard. In summary, the self-healing CloudFormation StackSets solution combines automation, monitoring, and self-recovery capabilities to deliver a robust and resilient system for StackSets.
