When deploying a new application in the public cloud, we need to ask the business owner about the resiliency (or SLA) requirements: how long can the business survive while our application is down and not serving customers?
Answers to that question range from 24/7 availability (100% uptime, which is not realistic) to a specific uptime target such as 99.9%.
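To make an uptime percentage concrete for the business owner, it helps to translate it into a yearly downtime budget. The snippet below is a minimal sketch of that arithmetic; the function name and the 365-day year are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (assumption)


def allowed_downtime_hours(uptime_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an uptime target."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


# Print the budget for a few common targets.
for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime -> {hours:.2f} hours of downtime per year")
```

An uptime target of 99.9% leaves roughly 8.76 hours of downtime per year, which is usually a more useful number to discuss than the percentage itself.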
The domain of resiliency has two main concepts:
- RTO (Recovery Time Objective) – the maximum acceptable time to recover a system after a disruption
- RPO (Recovery Point Objective) – the maximum acceptable amount of data loss, measured in time
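The two objectives above can be checked against a real or simulated incident. The following is a small sketch, assuming a hypothetical incident timeline and hypothetical targets (a 1-hour RTO and a 15-minute RPO); none of these values come from AWS.

```python
from datetime import datetime, timedelta


def evaluate_incident(last_backup, outage_start, service_restored, rto, rpo):
    """Compare one incident against the RTO and RPO targets."""
    recovery_time = service_restored - outage_start  # how long the system was down
    data_loss_window = outage_start - last_backup    # data written since the last recovery point
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "rto_met": recovery_time <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Hypothetical incident: backup at 09:30, outage at 10:00, service restored at 10:45.
result = evaluate_incident(
    last_backup=datetime(2024, 1, 1, 9, 30),
    outage_start=datetime(2024, 1, 1, 10, 0),
    service_restored=datetime(2024, 1, 1, 10, 45),
    rto=timedelta(hours=1),
    rpo=timedelta(minutes=15),
)
print("RTO met:", result["rto_met"])
print("RPO met:", result["rpo_met"])
```

In this example the 45-minute recovery is within the RTO, but the 30 minutes of data written since the last backup exceed the RPO, so backups would need to run more often.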
Achieving high resiliency, or meeting business SLA requirements, has technical and cost consequences.
Naturally, we want to provision resources in a highly available configuration (such as a farm of front-end web servers behind a load balancer), in a cluster (such as a cluster of database instances), deployed across multiple availability zones or perhaps multiple regions, and avoid any single point of failure.
We need to plan an architecture that will support our business resiliency requirements.
In theory, an architect can review a proposed architecture and spot potential availability failures, but a manual review does not scale to large and complex architectures.
In 2021, AWS announced the general availability of the AWS Resilience Hub.
In this blog post, I will review the purpose of this service and how we can use it regularly, as part of our CI/CD process.
To work with AWS Resilience Hub, follow the steps below:
AWS Resilience Hub allows you to assess an application by scanning the following resources:
- AWS Resource Groups
- AWS AppRegistry applications
- AWS CloudFormation stacks
- Terraform state files
- Amazon EKS cluster configuration
AWS Resilience Hub supports the following built-in tiers:
- Foundational IT core services
- Mission critical
Choose the target policy according to the application business requirements of RTO and RPO.
Select one of the predefined suggested policies:
- Non-critical application
- Important Application
- Critical Application
- Global Critical Application
- Mission Critical Application
- Global Mission Critical Application
- Foundational Core Service
AWS Resilience Hub allows you to evaluate the resiliency of an application against the following types of disruption:
- Customer Application RTO and RPO
- AWS Infrastructure RTO and RPO
- Cloud Infrastructure Availability Zone (AZ) disruption
- AWS Region disruption
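A resiliency policy pairs an RTO and an RPO with each disruption type. The sketch below builds a request body of the shape I believe the `create_resiliency_policy` API expects, with the four disruption types keyed as `Software`, `Hardware`, `AZ`, and `Region`, and targets given in seconds. Treat the key names as an assumption to verify against the API reference. The policy name, tier, and target values are hypothetical, and requiring all four disruption types is a simplification made by this sketch.

```python
# Assumed API keys for: application, infrastructure, AZ, and Region disruptions.
DISRUPTION_TYPES = ("Software", "Hardware", "AZ", "Region")


def build_policy_request(name, tier, targets_minutes):
    """Build a policy request from {disruption: (rto_minutes, rpo_minutes)}."""
    missing = [d for d in DISRUPTION_TYPES if d not in targets_minutes]
    if missing:
        raise ValueError(f"missing targets for: {', '.join(missing)}")
    policy = {}
    for disruption in DISRUPTION_TYPES:
        rto_minutes, rpo_minutes = targets_minutes[disruption]
        policy[disruption] = {
            "rtoInSecs": rto_minutes * 60,
            "rpoInSecs": rpo_minutes * 60,
        }
    return {"policyName": name, "tier": tier, "policy": policy}


# Hypothetical targets for a critical application.
request = build_policy_request(
    name="web-shop-critical",
    tier="Critical",
    targets_minutes={
        "Software": (30, 5),
        "Hardware": (30, 5),
        "AZ": (60, 5),
        "Region": (240, 60),
    },
)
print(request["policy"]["AZ"])
```

With boto3, a dictionary like this could be passed as `client.create_resiliency_policy(**request)`, assuming the parameter names above match the current API.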
AWS Resilience Hub allows you to either run manual one-time assessments or schedule a daily assessment.
To get the most value from AWS Resilience Hub, you can integrate it into a CI/CD pipeline as an additional step that runs after you provision infrastructure as code (using CloudFormation templates or Terraform modules).
A common example of integration with a CI/CD pipeline:
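As a rough sketch of such a pipeline step, the function below starts an assessment, polls until it finishes, and reports whether the resiliency policy was met. It assumes the boto3 `resiliencehub` client operations `start_app_assessment` and `describe_app_assessment`, and the response fields `assessmentStatus` and `complianceStatus`. The `StubClient` class, the ARNs, and the assessment name are hypothetical and exist only so the example runs without AWS credentials.

```python
import time


def run_assessment_gate(client, app_arn, assessment_name,
                        poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Start an assessment, wait for it, and return True if the policy is met."""
    started = client.start_app_assessment(
        appArn=app_arn,
        appVersion="release",  # assess the published application version
        assessmentName=assessment_name,
    )
    assessment_arn = started["assessment"]["assessmentArn"]

    for _ in range(max_polls):
        assessment = client.describe_app_assessment(
            assessmentArn=assessment_arn)["assessment"]
        status = assessment["assessmentStatus"]
        if status == "Success":
            return assessment.get("complianceStatus") == "PolicyMet"
        if status == "Failed":
            raise RuntimeError(f"Assessment {assessment_arn} failed")
        sleep(poll_seconds)  # still Pending or InProgress
    raise TimeoutError(f"Assessment {assessment_arn} did not finish in time")


class StubClient:
    """Hypothetical stand-in for boto3.client("resiliencehub"), for offline runs."""

    def __init__(self, statuses, compliance):
        self._statuses = iter(statuses)
        self._compliance = compliance

    def start_app_assessment(self, **kwargs):
        return {"assessment": {"assessmentArn": "arn:aws:resiliencehub:stub"}}

    def describe_app_assessment(self, assessmentArn):
        return {"assessment": {"assessmentStatus": next(self._statuses),
                               "complianceStatus": self._compliance}}


# Dry run: the stub reports a finished assessment that breaches the policy.
stub = StubClient(["Pending", "InProgress", "Success"], "PolicyBreached")
passed = run_assessment_gate(stub, "arn:aws:resiliencehub:app/stub", "ci-run-1",
                             sleep=lambda seconds: None)
print("resiliency gate passed" if passed else "resiliency gate failed")
```

In a real pipeline you would pass `boto3.client("resiliencehub")` instead of the stub and exit with a non-zero status when the gate fails, so that a change which breaches the RTO/RPO policy blocks the deployment.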
In a mature environment, you can go one step further and integrate AWS Resilience Hub with the built-in chaos engineering service, AWS Fault Injection Simulator, to conduct controlled experiments on your application and evaluate its resiliency.
Once an assessment has completed, it is time to review the results and make sure your application meets the business resiliency requirements (in terms of RTO/RPO).
The results will be written in a report, with recommendations for improvements to your application resiliency, such as adding another node to an RDS cluster, deploying another EC2 instance in another availability zone, enabling S3 bucket versioning, etc.
To make things easy to understand and improve over time, you can build dashboards using Amazon QuickSight and send alerts using CloudWatch, as explained in the blog post:
For continuous and automated improvement, you can integrate AWS Resilience Hub with AWS Systems Manager to efficiently recover your application in the event of outages, as explained in the blog post:
In this blog post, we learned about the purpose of AWS Resilience Hub, the steps for using it, and perhaps most importantly, how to automate the assessment as part of a CI/CD pipeline for continuous improvement.
I encourage anyone who builds applications on top of AWS to learn about this service, which provides insight into whether an application's resiliency meets business requirements.
- Validating and Improving the RTO and RPO Using AWS Resilience Hub
- Establishing RPO and RTO Targets for Cloud Applications
- How to use the AWS Resilience Hub score
- Prepare & Protect Your Applications from Disruption with AWS Resilience Hub
Eyal Estrin is a cloud and information security architect, the owner of the blog Security & Cloud 24/7 and the author of the book Cloud Security Handbook, with more than 20 years in the IT industry.
Eyal has been an AWS Community Builder since 2020.
You can connect with him on Twitter and LinkedIn.