Eyal Estrin

Originally published at eyal-estrin.Medium

Trade-Offs When Designing Workloads in AWS

When designing architectures for modern applications in the cloud, we have multiple ways to achieve similar goals.

Whichever alternative we choose (a specific service, an architecture, or even a pricing option), the choice forces us to understand the trade-offs behind our decisions.

To decide on the best option, we need to evaluate things such as the cost to implement and operate, the solution's resiliency, the learning curve, and perhaps even vendor lock-in (i.e., the ability to migrate a service between different cloud providers).

In this blog post, we will review some of the most common trade-offs organizations make when using the public cloud, specifically AWS services.

Buy vs. Build

One of the most common debates in many organizations is whether to buy a service or build it in-house.

For organizations with mature development teams, building their own solutions (from a single-purpose service to a full-blown application) might be the better alternative, provided the build does not cost too much time and effort.

For many other organizations, buying a service (such as a SaaS application or a managed service) is the better alternative: the provider is responsible for the scale and maintenance of the service itself, allowing the organization to remain a consumer of the service and invest its time and effort in its business goals.

On-Premises vs. the Public Cloud

The cloud allows us to build modern, highly scalable, and elastic workloads, using the most up-to-date hardware, with a pay-as-you-go pricing model.

As much as the cloud offers efficient ways to run applications and switch from running servers to consuming services, using it requires a learning curve for many organizations.

If an organization still runs legacy applications, sometimes with dedicated hardware or license requirements, or if regulatory or data residency requirements dictate keeping data in a specific country, on-premises might be the better alternative.

Multi-AZ vs. Multi-Region

Depending on the application's business requirements, a workload can be deployed in a resilient architecture, spread across multiple AZs or multiple regions.

By design, most services in AWS are bound to a specific region and can be deployed in a multi-AZ architecture (for example, Amazon S3, Amazon RDS, Amazon EKS, etc.).
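
As a minimal illustration (not a full production setup), the boto3 sketch below provisions an Amazon RDS instance with Multi-AZ enabled, so AWS keeps a synchronous standby copy in a second AZ; all identifiers and credentials are placeholders.

  # Sketch: enabling Multi-AZ on an Amazon RDS instance (boto3).
  # All names and credentials below are placeholders.
  import boto3

  rds = boto3.client("rds", region_name="us-east-1")

  rds.create_db_instance(
      DBInstanceIdentifier="orders-db",
      Engine="postgres",
      DBInstanceClass="db.t3.micro",
      AllocatedStorage=20,
      MasterUsername="dbadmin",
      MasterUserPassword="REPLACE_ME",  # placeholder; prefer AWS Secrets Manager in practice
      MultiAZ=True,  # AWS maintains a synchronous standby in another AZ
  )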

When designing an architecture, we may want to consider a multi-region architecture, but we need to understand the following:

  • Most services are limited to a specific region and cannot be replicated outside that region.
  • Some services can be replicated to other regions (such as Amazon S3 cross-region replication, Amazon RDS cross-region read replicas, etc.), but the remote replicas are read-only, and in case of failure, you will need to design a manual switch between the primary and secondary replicas (see the sketch after this list).
  • A multi-region architecture increases the overall workload's cost and, naturally, the complexity of designing and maintaining such an architecture.
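
To make the second bullet concrete, here is a minimal boto3 sketch of a cross-region disaster-recovery flow for Amazon RDS: a read replica is created in a secondary region and, during a regional failure, manually promoted to a writable instance. The identifiers, account ID, and regions are placeholders, and a real design would also cover DNS, monitoring, and application failover.

  # Sketch: cross-region read replica and manual promotion for Amazon RDS (boto3).
  # Identifiers, account ID, and regions are placeholders.
  import boto3

  dr_rds = boto3.client("rds", region_name="eu-west-1")  # secondary (DR) region

  # Create a read-only replica of the primary instance running in us-east-1.
  dr_rds.create_db_instance_read_replica(
      DBInstanceIdentifier="orders-db-replica",
      SourceDBInstanceIdentifier="arn:aws:rds:us-east-1:111122223333:db:orders-db",
      SourceRegion="us-east-1",
  )

  # During a regional failure, promote the replica to a standalone, writable instance.
  # This is the manual primary/secondary switch mentioned above.
  dr_rds.promote_read_replica(DBInstanceIdentifier="orders-db-replica")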

Amazon EC2 vs. Containers

For most legacy or "lift & shift" applications, choosing an EC2 instance is the easiest way to run applications in the cloud – customers have full control over the contents of the EC2 instance, with the same experience they are used to from on-premises.

Although developing and wrapping applications inside containers requires a learning curve, containers offer better horizontal scaling, better use of the underlying hardware, and easier upgrades when using immutable infrastructure (where no session information is stored inside the container image): an upgrade is simply replacing one container image with a newer version, as sketched below.
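
As a rough sketch of that upgrade flow (using Amazon ECS, covered in the next section, as the container environment), the boto3 snippet below registers a new task-definition revision pointing at a newer image tag and rolls the service to it; the cluster, service, family, and image names are placeholders.

  # Sketch: an immutable-infrastructure upgrade on Amazon ECS (boto3).
  # Cluster, service, family, and image names are placeholders.
  import boto3

  ecs = boto3.client("ecs")

  # Register a new task-definition revision that points at the newer image tag.
  revision = ecs.register_task_definition(
      family="web-app",
      containerDefinitions=[{
          "name": "web",
          "image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/web:2.0.1",
          "memory": 512,
      }],
  )

  # Roll the service: ECS replaces the old containers with ones running the new image.
  ecs.update_service(
      cluster="production",
      service="web-app",
      taskDefinition=revision["taskDefinition"]["taskDefinitionArn"],
  )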

Amazon ECS vs. Amazon EKS

Both are managed orchestrators for running containers, and both are fully supported by AWS.

Amazon ECS can be a better alternative for organizations looking to run workloads with predictable scaling patterns, and it is easier to learn and maintain than Amazon EKS.

Amazon EKS offers a full-blown managed Kubernetes service for organizations that wish to deploy their applications on top of Kubernetes. As with any Kubernetes deployment, it takes time for teams to learn how to deploy and maintain Kubernetes clusters, due to the large number of configuration options.

Containers vs. AWS Lambda

Both alternatives offer organizations the ability to run production applications in a microservice architecture.

Containers allow development teams to develop their applications anywhere (from the developer's IDE to an entire development environment running in the cloud), push container images to a container registry, and run them on any container environment (agnostic to the cloud provider's ecosystem).

Containers also allow developers shell access into running containers, mostly for troubleshooting purposes at a small scale.

AWS Lambda runs in a fully managed environment, where AWS takes care of scaling and maintaining the underlying infrastructure, while developers focus on developing their Lambda functions.

Although AWS allows customers to wrap their code inside containers and run them in a Lambda serverless environment, Lambda is considered a vendor lock-in, since it cannot run outside the AWS ecosystem (i.e., other cloud providers).

AWS Lambda does not allow customers access to the underlying infrastructure and is limited to a maximum of 15 minutes per invocation, meaning long-running tasks are not suitable for Lambda.
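
To make those constraints concrete, here is a minimal boto3 sketch that deploys a container image as a Lambda function; the function name, role ARN, and image URI are placeholders. Note the Timeout parameter: 900 seconds (15 minutes) is the hard ceiling, which is why long-running jobs belong on containers or EC2 instead.

  # Sketch: deploying a container image as an AWS Lambda function (boto3).
  # The function name, role ARN, and image URI are placeholders.
  import boto3

  lambda_client = boto3.client("lambda")

  lambda_client.create_function(
      FunctionName="order-processor",
      Role="arn:aws:iam::111122223333:role/lambda-exec-role",
      PackageType="Image",
      Code={"ImageUri": "111122223333.dkr.ecr.us-east-1.amazonaws.com/order-processor:1.0.0"},
      MemorySize=512,
      Timeout=900,  # hard upper limit: 900 seconds (15 minutes) per invocation
  )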

On-demand vs. Spot vs. Savings Plans

AWS offers various pricing alternatives for running compute services (EC2 instances, ECS or EKS, Lambda, Amazon RDS, and more).

Each alternative is slightly better for different use cases:

  • On-demand – Useful for unpredictable workloads such as development environments (which may run for an entire month, or for a couple of hours)
  • Spot – Useful for workloads that can survive sudden interruption, such as loosely coupled HPC workloads or stateless applications (see the sketch after this list)
  • Savings Plans – Useful for workloads that are expected to run for a long period (1 or 3 years), with the ability to change instance types according to needs
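
As a sketch of the Spot option, the boto3 snippet below launches a single Spot-priced EC2 instance for an interruption-tolerant job; the AMI ID and instance type are placeholders, and the workload must be able to survive a two-minute interruption notice.

  # Sketch: launching a Spot-priced EC2 instance for an interruption-tolerant job (boto3).
  # The AMI ID and instance type are placeholders.
  import boto3

  ec2 = boto3.client("ec2")

  ec2.run_instances(
      ImageId="ami-0123456789abcdef0",
      InstanceType="c5.large",
      MinCount=1,
      MaxCount=1,
      InstanceMarketOptions={
          "MarketType": "spot",
          "SpotOptions": {
              "SpotInstanceType": "one-time",
              "InstanceInterruptionBehavior": "terminate",  # the job must tolerate termination
          },
      },
  )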

Amazon S3 lifecycle policies vs. Amazon S3 Intelligent-Tiering

When designing persistent storage solutions for workloads, AWS offers various storage tiers for storing objects inside Amazon S3 – from the standard tier to an archive tier.

Amazon S3 offers customers efficient ways to store objects:

  • Amazon S3 lifecycle policies – Allow customers to set up rules that move objects from a real-time tier to an archive tier based on the object's age (days since creation). This is a useful one-way mechanism, but it requires customers to define the rules up front, so it fits expected and predictable data access patterns.
  • Amazon S3 Intelligent-Tiering – Monitors each object's access patterns and automatically moves objects between tiers (from real-time access tiers to archive tiers and back). Useful for unpredictable data access patterns. (See the sketch after this list.)
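
The boto3 sketch below shows both approaches side by side: a lifecycle rule that archives objects under a prefix 90 days after creation, and an upload that stores an object directly in the Intelligent-Tiering storage class so that S3 moves it between tiers automatically. The bucket name, prefix, and key are placeholders.

  # Sketch: S3 lifecycle rule vs. Intelligent-Tiering storage class (boto3).
  # The bucket name, prefix, and key are placeholders.
  import boto3

  s3 = boto3.client("s3")

  # Lifecycle rule: archive objects under logs/ 90 days after creation (age-based, one-way).
  s3.put_bucket_lifecycle_configuration(
      Bucket="example-logs-bucket",
      LifecycleConfiguration={
          "Rules": [{
              "ID": "archive-old-logs",
              "Status": "Enabled",
              "Filter": {"Prefix": "logs/"},
              "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
          }]
      },
  )

  # Intelligent-Tiering: store the object in a class that moves it between tiers automatically.
  s3.put_object(
      Bucket="example-logs-bucket",
      Key="logs/app.log",
      Body=b"...",
      StorageClass="INTELLIGENT_TIERING",
  )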

NAT Gateway vs. NAT Instances

When a service in a private subnet requires access to resources outside its subnet (for example the public Internet), we need to configure one of the NAT alternatives:

  • NAT Gateway – A fully managed NAT service, supporting automatic scaling, high availability, and high performance, but at a higher cost (compared to a NAT instance).
  • NAT Instance – An EC2 instance, based on a generic AMI, that provides NAT capabilities. It requires customer maintenance (such as patching and building in resiliency manually) and comes with constraints such as manual instance family and size selection and limited network bandwidth, at the cost of an EC2 instance (cheaper than a NAT Gateway). If an organization knows how to automate the deployment and maintenance of NAT instances, it can use this alternative to save costs; otherwise, a NAT Gateway is the much more resilient alternative (a minimal provisioning sketch follows this list).
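
For the managed option, a minimal boto3 provisioning sketch looks roughly like the following: allocate an Elastic IP, create the NAT Gateway in a public subnet, and point the private subnet's default route at it. The subnet and route table IDs are placeholders.

  # Sketch: provisioning a managed NAT Gateway for a private subnet (boto3).
  # The subnet and route table IDs are placeholders.
  import boto3

  ec2 = boto3.client("ec2")

  # A NAT Gateway needs an Elastic IP and lives in a public subnet.
  eip = ec2.allocate_address(Domain="vpc")
  nat = ec2.create_nat_gateway(
      SubnetId="subnet-0aaa1111bbb222ccc",  # public subnet
      AllocationId=eip["AllocationId"],
  )

  # Wait until the NAT Gateway is available before routing traffic through it.
  ec2.get_waiter("nat_gateway_available").wait(
      NatGatewayIds=[nat["NatGateway"]["NatGatewayId"]]
  )

  # Send the private subnet's Internet-bound traffic through the NAT Gateway.
  ec2.create_route(
      RouteTableId="rtb-0ddd3333eee444fff",  # private subnet's route table
      DestinationCidrBlock="0.0.0.0/0",
      NatGatewayId=nat["NatGateway"]["NatGatewayId"],
  )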

Summary

Every architectural design decision has its trade-offs.

In many cases, you will have more than a single solution for the same challenge, and you need to weigh the costs and benefits of each alternative, as we showed in this blog post.

We need to understand the implications and consequences of our decisions to be able to prioritize our options.

Reference

AWS re:Invent 2023 - Advanced integration patterns & trade-offs for loosely coupled systems

About the Author

Eyal Estrin is a cloud and information security architect, the owner of the blog Security & Cloud 24/7 and the author of the book Cloud Security Handbook, with more than 20 years in the IT industry.

Eyal has been an AWS Community Builder since 2020.

You can connect with him on Twitter.

Opinions are his own and not the views of his employer.
