When building an application for serving customers, one of the questions raised is how do I know if my application is resilient and will survive a failure?
In this blog post, we will review what it means to build resilient applications in the cloud, and we will review some of the common best practices for achieving resilient applications.
AWS provides us with the following definition for the term resiliency:
"The ability of a workload to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions, such as misconfigurations or transient network issues."
Resiliency is part of the Reliability pillar for cloud providers such as AWS, Azure, GCP, and Oracle Cloud.
AWS takes it one step further, and shows how resiliency is part of the shared responsibility model:
- The cloud provider is responsible for the resilience of the cloud (i.e., hardware, software, computing, storage, networking, and anything related to their data centers)
- The customer is responsible for the resilience in the cloud (i.e., selecting the services to use, building resilient architectures, backup strategies, data replication, and more).
This blog post assumes that you are building modern applications in the public cloud.
We have all heard of RTO (Recovery time objective).
Resilient workload (a combination of application, data, and the infrastructure that supports it), should not only recover automatically, but it must recover within a pre-defined RTO, agreed by the business owner.
Below are common best practices for building resilient applications:
The public cloud allows you to easily deploy infrastructure over multiple availability zones.
Examples of implementing high availability in the cloud:
- Deploying multiple VMs behind an auto-scaling group and a front-end load-balancer
- Spreading container load over multiple Kubernetes worker nodes, deploying in multiple AZs
- Deploying a cluster of database instances in multiple AZs
- Deploying global (or multi-regional) database services (such as Amazon Aurora Global Database, Azure Cosmos DB, Google Cloud Spanner, and Oracle Global Data Services (GDS)
- Configure DNS routing rules to send customers' traffic to more than a single region
- Deploy global load-balancer (such as AWS Global Accelerator, Azure Cross-region Load Balancer, or Google Global external Application Load Balancer) to spread customers' traffic across regions
Autoscaling is one of the biggest advantages of the public cloud.
Assuming we built a stateless application, we can add or remove additional compute nodes using autoscaling capability, and adjust it to the actual load on our application.
In a cloud-native infrastructure, we will use a managed load-balancer service, to receive traffic from customers, and send an API call to an autoscaling group, to add or remove additional compute nodes.
Microservice architecture is meant to break a complex application into smaller parts, each responsible for certain functionality of the application.
By implementing microservice architecture, we are decreasing the impact of failed components on the rest of the application.
In case of high load on a specific component, it is possible to add more compute resources to the specific component, and in case we discover a bug in one of the microservices, we can roll back to a previous functioning version of the specific microservice, with minimal impact on the rest of the application.
Event-driven architecture allows us to decouple our application components.
Resiliency can be achieved using event-driven architecture, by the fact that even if one component fails, the rest of the application continues to function.
Components are loosely coupled by using events that trigger actions.
Event-driven architectures are usually (but not always) based on services managed by cloud providers, who are responsible for the scale and maintenance of the managed infrastructure.
Event-driven architectures are based on models such as pub/sub model (services such as Amazon SQS, Azure Web PubSub, Google Cloud Pub/Sub, and OCI Queue service) or based on event delivery (services such as Amazon EventBridge, Azure Event Grid, Google Eventarc, and OCI Events service).
If your application exposes APIs, use API Gateways (services such as Amazon API Gateway, Azure API Management, Google Apigee, or OCI API Gateway) to allow incoming traffic to your backend APIs, perform throttling to protect the APIs from spikes in traffic, and perform authorization on incoming requests from customers.
Immutable infrastructure (such as VMs or containers) are meant to run application components, without storing session information inside the compute nodes.
In case of a failed component, it is easy to replace the failed component with a new one, with minimal disruption to the entire application, allowing to achieve fast recovery.
Find the most suitable data store for your workload.
A microservice architecture allows you to select different data stores (from object storage to backend databases) for each microservice, decreasing the risk of complete failure due to availability issues in one of the backend data stores.
Once you select a data store, replicate it across multiple AZs, and if the business requires it, replicate it across multiple regions, to allow better availability, closer to the customers.
By monitoring all workload components, and sending logs from both infrastructure and application components to a central logging system, it is possible to identify anomalies, anticipate failures before they impact customers, and act.
Examples of actions can be sending a command to restart a VM, deploying a new container instead of a failed one, and more.
It is important to keep track of measurements – for example, what is considered normal response time to a customer request, to be able to detect anomalies.
The base assumption is that everything will eventually fail.
Implementing chaos engineering, allows us to conduct controlled experiments, inject faults into our workloads, testing what will happen in case of failure.
This allows us to better understand if our workload will survive a failure.
Examples can be adding load on disk volumes, injecting timeout when an application tier connects to a backend database, and more.
Examples of services for implementing chaos engineering are AWS Fault Injection Simulator, Azure Chaos Studio, and Gremlin.
In an ideal world, your workload will be designed for self-healing, meaning, it will automatically detect a failure and recover from it, for example, replace failed components, restart services, or switch to another AZ or even another region.
In practice, you need to prepare a failover plan, keep it up to date, and make sure your team is trained to act in case of major failure.
A disaster recovery plan without proper and regular testing is worth nothing – your team must practice repeatedly, and adjust the plan, and hopefully, they will be able to execute the plan during an emergency with minimal impact on customers.
Failure can happen in various ways, and when we design our workload, we need to limit the blast radius on our workload.
Below are some common failure scenarios, and possible solutions:
- Failure in a specific component of the application – By designing a microservice architecture, we can limit the impact of a failed component to a specific area of our application (depending on the criticality of the component, as part of the entire application)
- Failure or a single AZ – By deploying infrastructure over multiple AZs, we can decrease the chance of application failure and impact on our customers
- Failure of an entire region – Although this scenario is rare, cloud regions also fail, and by designing a multi-region architecture, we can decrease the impact on our customers
- DDoS attack – By implementing DDoS protection mechanisms, we can decrease the risk of impacting our application with a DDoS attack
Whatever solution we design for our workloads, we need to understand that there is a cost and there might be tradeoffs for the solution we design.
A multi-region architecture will allow the most highly available resilient solution for your workloads; however, multi-region adds high cost for cross-region egress traffic, most services are limited to a single region, and your staff needs to know to support such a complex architecture.
Another limitation of multi-region architecture is data residency – if your business or regulator demands that customers' data be stored in a specific region, a multi-region architecture is not an option.
When designing a highly resilient architecture, we must take into consideration service quotas or service limits.
Sometimes we are bound to a service quota on a specific AZ or region, an issue that we may need to resolve with the cloud provider's support team.
Sometimes we need to understand there is a service limit in a specific region, such as a specific service that is not available in a specific region, or there is a shortage of hardware in a specific region.
Horizontal autoscale (the ability to add or remove compute nodes) is one of the fundamental capabilities of the cloud, however, it has its limitations.
Provisioning a new compute node (from a VM, container instance, or even database instance) may take a couple of minutes to spin up (which may impact customer experience) or to spin down (which may impact service cost).
Also, to support horizontal scaling, you need to make sure the compute nodes are stateless, and that the application supports such capability.
One of the limitations of database failover is their ability to switch between the primary node and one of the secondary nodes, either in case of failure or in case of scheduled maintenance.
We need to take into consideration the data replication, making sure transactions were saved and moved from the primary to the read replica node.
In this blog post, we have covered many aspects of building resilient applications in the cloud.
When designing new applications, we need to understand the business expectations (in terms of application availability and customer impact).
We also need to understand the various architectural design considerations, and their tradeoffs, to be able to match the technology to the business requirements.
As I always recommend – do not stay on the theoretical side of the equation, begin designing and building modern and highly resilient applications to serve your customers – There is no replacement for actual hands-on experience.
- Understand resiliency patterns and trade-offs to architect efficiently in the cloud
- Building resilience to your business requirements with Azure
- Success through culture: why embracing failure encourages better software delivery
- Building Resilient Solutions in OCI
Eyal Estrin is a cloud and information security architect, the owner of the blog Security & Cloud 24/7 and the author of the book Cloud Security Handbook, with more than 20 years in the IT industry.
Eyal is an AWS Community Builder since 2020.
You can connect with him on Twitter
Opinions are his own and not the views of his employer.