Designing Production Workloads in the Cloud

#aws #azure #gcp #cloudops

Whether we serve internal customers or external customers over the public Internet, we all manage production workloads at some stage in the application lifecycle.

In this blog post, I will review various aspects of, and recommendations for, managing production workloads in the public cloud (although some of them may be relevant on-premises as well).

Tip #1 – Think big, plan for large scale

Production workloads are meant to serve many customers simultaneously.

Don't think only about the first 1,000 customers who will use your application; plan for millions of concurrent connections from day one.

Take advantage of cloud elasticity when you plan your application deployment, and use auto-scaling capabilities to build a farm of virtual machines or containers that automatically scales in or out according to application load.
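As a rough illustration of how an auto-scaling decision can be made, the sketch below uses a proportional rule: the instance count is scaled by the ratio of the observed metric to its target, then clamped to the group's limits. The function name, parameters, and numbers are assumptions for illustration; managed auto-scaling services apply their own policies, cooldowns, and safeguards.

```python
import math


def desired_capacity(current_instances: int, current_metric: float,
                     target_metric: float, min_size: int, max_size: int) -> int:
    """Return the instance count that would bring the metric back to its target.

    Hypothetical helper: scales the current count by (observed / target),
    rounds up, and clamps the result to the group's min/max size.
    """
    if current_instances <= 0 or target_metric <= 0:
        raise ValueError("current_instances and target_metric must be positive")
    raw = math.ceil(current_instances * current_metric / target_metric)
    return max(min_size, min(max_size, raw))
```

For example, four instances running at 90% average CPU against a 60% target would scale out to six, while a much lower load would scale in no further than the configured minimum.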

Using an event-driven architecture gives you a better way to handle bottlenecks on specific components of your application (such as high load on front-end web servers, API gateways, or the backend data store).
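The core idea behind event-driven designs can be sketched in-process: producers put events on a queue, and a pool of workers consumes them independently, so a slow component can be relieved by adding workers rather than by changing the producer. This is a minimal sketch using Python's standard library; in a real workload the queue would be a managed messaging service, and the function and event names here are hypothetical.

```python
import queue
import threading


def run_pipeline(events, worker_count=3):
    """Process events through a queue with a configurable number of workers."""
    work = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work for this worker
                break
            processed = f"processed:{item}"
            with lock:  # protect the shared results list
                results.append(processed)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for event in events:  # the producer only knows about the queue
        work.put(event)
    for _ in threads:  # one sentinel per worker
        work.put(None)
    for t in threads:
        t.join()
    return sorted(results)
```

Because the producer and the workers only share the queue, `worker_count` can change without touching the code that emits events.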

Tip #2 – Everything breaks, plan for high availability

No business can accept downtime of a production application.

Always plan for the high availability of all components in your architecture.

The cloud makes it easy to design highly-available architectures.

Cloud infrastructure is built from separate geographic regions, and each region has multiple availability zones (which usually means several distinct data centers).

When designing for high availability, deploy services across multiple availability zones, to mitigate the risk of a single AZ going down (together with your production application).
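To make the multi-AZ reasoning concrete, here is a small sketch that spreads instances evenly across zones and checks whether enough capacity remains if any single zone is lost. The zone names and helper functions are made up for illustration; cloud auto-scaling groups typically handle the balancing for you.

```python
from itertools import cycle


def spread_across_zones(instance_count, zones):
    """Assign instances to zones round-robin and return the count per zone."""
    if not zones:
        raise ValueError("at least one zone is required")
    assignment = {zone: 0 for zone in zones}
    for _, zone in zip(range(instance_count), cycle(zones)):
        assignment[zone] += 1
    return assignment


def survives_zone_loss(assignment, min_healthy):
    """True if losing any single zone still leaves min_healthy instances."""
    total = sum(assignment.values())
    return all(total - count >= min_healthy for count in assignment.values())
```

With five instances over three zones, losing the busiest zone still leaves three instances running, whereas a single-zone deployment leaves none.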

Use auto-scaling services such as AWS Auto Scaling, Azure Autoscale, or autoscaling for Google Cloud managed instance groups.

Tip #3 – Automate everything

The days of manually deploying servers and then manually configuring each one are long gone.

Embrace the CI/CD process, and build steps to test and provision your workloads, from the infrastructure layer to the application and configuration layer.

Take advantage of Infrastructure-as-Code to deploy your workloads.

Whether you use a single cloud vendor and invest in learning its specific IaC language (such as AWS CloudFormation, Azure Resource Manager, or Google Cloud Deployment Manager), or prefer to learn and use a cloud-agnostic IaC language such as Terraform, always think about automation.

Automation will allow you to deploy an entire workload in a matter of minutes, for DR purposes or for provisioning new versions of your application.
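The declarative model behind Infrastructure-as-Code can be illustrated with a toy "plan" step: compare the desired state described in code with what actually exists, and derive the changes to apply. This is a deliberately simplified sketch with hypothetical resource names, not how any particular IaC tool is implemented.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Compute which resources to create, update, or delete.

    Both arguments map a resource name to its configuration.
    """
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "update": sorted(
            name for name in desired.keys() & actual.keys()
            if desired[name] != actual[name]
        ),
        "delete": sorted(actual.keys() - desired.keys()),
    }
```

Running the same plan against an empty environment yields only "create" actions, which is why a full workload can be rebuilt quickly for DR from the same definitions.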

Tip #4 – Limit access to production environments

Traditional organizations are still making the mistake of allowing developers access to production, "to fix problems in production".

As a best practice, human access to production workloads must be prohibited.

To provision new services or change existing services in production, we should use a CI/CD process run by a service account in non-interactive mode, following the principle of least privilege.

For troubleshooting or emergencies, we should create a break-glass process that allows a dedicated group of DevOps engineers or Site Reliability Engineers (SREs) to access production environments.

All access attempts must be audited and kept in an audit system (such as a SIEM), with read permissions for the SOC team.

Always use secure methods to log in to operating systems or containers (such as AWS Systems Manager Session Manager, Azure Bastion, or Google Identity-Aware Proxy).

Enforce the use of multi-factor authentication (MFA) for all human access to production environments.
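The break-glass idea from this tip can be sketched as a small access check: only members of a dedicated group who passed MFA and gave a reason receive a time-limited grant, and every attempt is recorded whether or not it succeeds. The group members, field names, and one-hour lifetime are assumptions for illustration; a real implementation would rely on the cloud provider's identity service and ship the audit records to a SIEM.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

BREAK_GLASS_GROUP = {"sre-alice", "sre-bob"}  # hypothetical group members
audit_log = []  # stand-in for an external audit system


@dataclass(frozen=True)
class Grant:
    user: str
    reason: str
    expires_at: datetime


def request_break_glass(user, reason, mfa_verified, now, ttl=timedelta(hours=1)):
    """Return a time-limited Grant, or None if the request is denied."""
    allowed = user in BREAK_GLASS_GROUP and mfa_verified and bool(reason.strip())
    # Record every attempt, including denied ones.
    audit_log.append({
        "time": now.isoformat(),
        "user": user,
        "reason": reason,
        "granted": allowed,
    })
    return Grant(user, reason, now + ttl) if allowed else None


def is_active(grant, now):
    """A grant is only valid until its expiry time."""
    return now < grant.expires_at
```

Keeping the grant short-lived means emergency access expires on its own instead of depending on someone remembering to revoke it.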

Tip #5 – Secrets Management

Static credentials of any kind (secrets, passwords, certificates, API keys, SSH keys) become more prone to compromise the longer they are in use.

As a best practice, we must avoid storing or hard-coding static credentials in our code, scripts, or configuration files.

All static credentials must be generated, stored, retrieved, rotated, and revoked automatically using a secrets management service.

Access to the secrets management service requires proper authentication and authorization, and must be audited, with the logs sent to a central logging system.

Use Secrets Management services such as AWS Secrets Manager, Azure Key Vault, or Google Secret Manager.
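One way to enforce the "no hard-coded credentials" rule is a simple check in the CI/CD pipeline that scans files for strings that look like secrets. The sketch below is illustrative only: the patterns are a small, non-exhaustive sample (the `AKIA`/`ASIA` prefixes are those used by AWS access key IDs), and the function name is hypothetical. Dedicated secret-scanning tools cover far more cases.

```python
import re

# Illustrative patterns only; real scanners use many more rules.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
    "password_assignment": re.compile(
        r"\bpassword\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE
    ),
}


def find_hardcoded_secrets(text):
    """Return (line_number, pattern_name) for every suspicious line."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

A pipeline step could fail the build whenever this function returns any findings, pushing developers toward fetching credentials from the secrets management service at runtime.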

Tip #6 – Auto-remediation of vulnerabilities

Vulnerabilities can arise for various reasons – from misconfigurations to packages with well-known vulnerabilities to malware.

We need to take advantage of cloud services and configure automation to handle the following:

Vulnerability management – Run vulnerability scanners on a regular basis to automatically detect misconfigurations or deviations from configuration standards (services such as Amazon Inspector, Microsoft Defender, or Google Security Command Center).
Patch management – Create automated processes to check for missing OS patches and use CI/CD processes to push security patches (services such as AWS Systems Manager Patch Manager, Azure Automation Update Management, or Google OS patch management).
Software composition analysis (SCA) – Run SCA tools as part of the CI/CD process to automatically detect open-source libraries/packages with well-known vulnerabilities (services such as Amazon Inspector for ECR, Microsoft Defender for Containers, or Google Container Analysis).
Malware – If your workload contains virtual machines, deploy anti-malware software at the operating system level, to detect and automatically block malware.
Secure code analysis – Run SAST / DAST tools as part of the CI/CD process, to detect vulnerabilities in your code (if you cannot auto-remediate, at least break the build process).
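The "at least break the build" rule from the list above can be sketched as a severity gate: the pipeline collects scanner findings and fails when any of them meets or exceeds a chosen threshold. The finding format and severity names are assumptions for illustration; actual scanners each have their own report formats.

```python
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def should_fail_build(findings, threshold="high"):
    """Return True if any finding is at or above the threshold severity.

    Each finding is a dict with at least a "severity" key (hypothetical format).
    """
    limit = SEVERITY_RANK[threshold]
    return any(SEVERITY_RANK[f["severity"]] >= limit for f in findings)
```

A CI step could call this on the parsed scanner report and exit with a non-zero status when it returns True, so that vulnerable builds never reach production.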

Tip #7 – Monitoring and observability

Everything will eventually fail.

Log everything – from system health, performance logs, and application logs to user experience logs.

Monitor the entire service's activity (the operating system, network, application, and every other part of your workload).

Use automated services to detect outages or service degradation and alert you in advance, before your customers complain.

Use services such as Amazon CloudWatch, Azure Monitor, or Google Cloud Logging.
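To illustrate how automated detection of service degradation can work, the sketch below raises an alarm when the error rate exceeds a threshold in at least a given number of the most recent measurement periods, which helps avoid alerting on a single noisy datapoint. The class name, window size, and thresholds are assumptions for illustration; managed monitoring services provide their own alarm configuration.

```python
from collections import deque


class ErrorRateAlarm:
    """Alarm when enough of the recent periods breach the error-rate threshold."""

    def __init__(self, window=5, threshold=0.05, breaches_to_alarm=3):
        self.threshold = threshold
        self.breaches_to_alarm = breaches_to_alarm
        # Keep only the most recent `window` breach flags.
        self.breaches = deque(maxlen=window)

    def observe(self, requests, errors):
        """Record one period and return True if the alarm should fire."""
        rate = errors / requests if requests else 0.0
        self.breaches.append(rate > self.threshold)
        return sum(self.breaches) >= self.breaches_to_alarm
```

Feeding it one datapoint per minute, a sustained 10% error rate would trigger the alarm on the third breaching period rather than waiting for a full outage.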

Tip #8 – Minimize deviations between Dev, Test, and Production environments

Many organizations still hold the false belief that lower environments (Dev, Test, QA, UAT) can differ from production, and that "we will make all necessary changes before moving to production".

If you build your environments differently, you will never be able to reliably test changes or new versions of your applications/workloads.

Use the same hardware (instance type, amount of memory, CPU, and storage type) when provisioning compute services.

Provision resources across multiple AZs, the same way you provision production workloads.

Use the same Infrastructure-as-Code to provision all environments, with minor changes such as tags indicating dev/test/prod, different CIDRs, and different endpoints (such as object storage, databases, API gateways, etc.).
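One way to keep environments aligned is to render all of them from a single base definition and allow only a short list of per-environment overrides. The sketch below shows the idea in plain Python; the keys, CIDR ranges, and instance type are hypothetical examples, and in practice the same pattern would be expressed with variables or parameters in your IaC tool.

```python
# Shared definition: identical for every environment.
BASE = {"instance_type": "m5.large", "zones": 3, "min_instances": 3}

# Only these keys may differ between environments.
ALLOWED_OVERRIDE_KEYS = {"cidr", "tags", "endpoints"}

OVERRIDES = {
    "dev": {"cidr": "10.10.0.0/16", "tags": {"env": "dev"}},
    "test": {"cidr": "10.20.0.0/16", "tags": {"env": "test"}},
    "prod": {"cidr": "10.30.0.0/16", "tags": {"env": "prod"}},
}


def render(environment):
    """Merge the shared base with the environment's permitted overrides."""
    overrides = OVERRIDES[environment]
    illegal = set(overrides) - ALLOWED_OVERRIDE_KEYS
    if illegal:
        raise ValueError(f"overrides not allowed for: {sorted(illegal)}")
    return {**BASE, **overrides}
```

Rejecting any override outside the allowed list makes drift between Dev, Test, and Production an explicit decision rather than an accident.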

Some managed services (such as API gateways, WAF, DDoS protection, and more) have different pricing tiers (from free to standard to premium), offering different capabilities or features. Conduct a cost-benefit analysis and consider the risk of using different pricing tiers for Dev/Test vs. Production environments.

Summary

Designing production workloads involves many aspects.

We must remember that production applications are the face we show our customers, and as such, we want to offer highly available and secure production applications.

This blog post contains only part of the knowledge required to design, deploy, and operate production workloads.

I highly recommend taking the time to read vendor documentation, specifically the well-architected framework documents – they contain information gathered by architects, using experience gathered over years from many customers around the world.

About the Author

Eyal Estrin is a cloud and information security architect, the owner of the blog Security & Cloud 24/7 and the author of the book Cloud Security Handbook, with more than 20 years in the IT industry.

Eyal has been an AWS Community Builder since 2020.

You can connect with him on Twitter and LinkedIn.

The Ops Community ⚙️