In part 1 of this series, I introduced some of the most common Day 2 serverless operations, focusing on Function as a Service.
In this part, I focus on the serverless application integration services commonly used to build event-driven architectures: message queue services, event routing services, and workflow orchestration services.
Message queue services
Message queues enable asynchronous communication between different components in an event-driven architecture (EDA). This means that producers (systems or services generating events) can send messages to the queue and continue their operations without waiting for consumers (systems or services processing events) to respond or be available.
Security and Access Control
Security should always come first, as it protects your data, controls access, and supports compliance from the outset. This includes encrypting data, limiting permissions, and enforcing least-privilege policies (a minimal example for Amazon SQS follows the list below).
- When using Amazon SQS, manage permissions using AWS IAM policies to restrict access to queues and follow the principle of least privilege, as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-basic-examples-of-iam-policies.html#security_iam_id-based-policy-examples
- When using Amazon SQS, enable server-side encryption (SSE) for sensitive data at rest, as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-server-side-encryption.html
- When using Amazon SNS, manage topic policies and IAM roles to control who can publish or subscribe, as explained here: https://docs.aws.amazon.com/sns/latest/dg/security_iam_id-based-policy-examples.html
- When using Amazon SNS, enable server-side encryption (SSE) for sensitive data at rest, as explained here: https://docs.aws.amazon.com/sns/latest/dg/sns-server-side-encryption.html
- When using Azure Service Bus, use managed identities and configure roles, following the principle of least privilege, as explained here: https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-managed-service-identity
- When using Azure Service Bus, enable encryption at rest using customer-managed keys, as explained here: https://learn.microsoft.com/en-us/azure/service-bus-messaging/configure-customer-managed-key
- When using Google Cloud Pub/Sub, tighten and review IAM policies to ensure only authorized users and services can publish or subscribe to topics, as explained here: https://cloud.google.com/pubsub/docs/access-control
- When using Google Cloud Pub/Sub, configure encryption at rest using customer-managed encryption keys, as explained here: https://cloud.google.com/pubsub/docs/encryption
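To make this concrete, here is a minimal boto3 sketch (Python) that applies a restrictive queue policy and enables encryption at rest with a customer-managed KMS key on an SQS queue. The queue URL, account ID, producer role, and key alias are hypothetical placeholders; adapt them to your environment.

```python
import json

import boto3

sqs = boto3.client("sqs")

# Hypothetical queue and principal, for illustration only.
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue"
queue_arn = "arn:aws:sqs:us-east-1:123456789012:orders-queue"
producer_role_arn = "arn:aws:iam::123456789012:role/order-producer"

# Least-privilege queue policy: only the producer role may send messages.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": producer_role_arn},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
        }
    ],
}

sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={
        "Policy": json.dumps(policy),
        # Encrypt messages at rest with a customer-managed KMS key (hypothetical alias).
        "KmsMasterKeyId": "alias/orders-queue-key",
    },
)
```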
Monitoring and Observability
Once security is in place, implement comprehensive monitoring and observability to gain visibility into system health, performance, and failures. This enables proactive detection of and response to issues (an example CloudWatch alarm follows the list below).
- When using Amazon SQS, monitor queue metrics such as message count, age of oldest message, and queue length using Amazon CloudWatch, as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/monitoring-using-cloudwatch.html https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-available-cloudwatch-metrics.html
- When using Amazon SQS, set up CloudWatch alarms for thresholds (e.g., high message backlog or processing latency), as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/set-cloudwatch-alarms-for-metrics.html
- When using Amazon SNS, use CloudWatch to track message delivery status, failure rates, and subscription metrics, as explained here: https://docs.aws.amazon.com/sns/latest/dg/sns-monitoring-using-cloudwatch.html
- When using Azure Service Bus, use Azure Monitor to track metrics such as queue length, message count, dead-letter messages, and throughput. Set up alerts for abnormal conditions (e.g., message backlog, high latency), as explained here: https://learn.microsoft.com/en-us/azure/service-bus-messaging/monitor-service-bus
- When using Azure Service Bus, monitor and manage message sessions for ordered processing when required, as explained here: https://learn.microsoft.com/en-us/azure/service-bus-messaging/message-sequencing
- When using Google Cloud Pub/Sub, monitor message throughput, error rates, and latency, and set up alerts for operational anomalies, as explained here: https://cloud.google.com/pubsub/docs/monitoring
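As one example of turning these metrics into alerts, the following boto3 sketch creates a CloudWatch alarm on the age of the oldest message in an SQS queue, which usually signals a stalled or under-scaled consumer. The queue name, alarm name, and SNS topic for notifications are hypothetical.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the oldest message has been waiting longer than 10 minutes
# for two consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="orders-queue-backlog",  # hypothetical alarm name
    Namespace="AWS/SQS",
    MetricName="ApproximateAgeOfOldestMessage",
    Dimensions=[{"Name": "QueueName", "Value": "orders-queue"}],  # hypothetical queue
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=2,
    Threshold=600,  # seconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # hypothetical topic
)
```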
Error Handling
With monitoring established, set up robust error handling mechanisms, including alerts, retries, and dead-letter queues, to ensure reliability and rapid remediation of failures; a dead-letter queue sketch follows the list below.
- When using Amazon SQS, configure Dead Letter Queues (DLQs) to capture messages that fail processing repeatedly for later analysis and remediation, as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html
- When using Amazon SNS, integrate with DLQs (using SQS as a DLQ) for messages that cannot be delivered to endpoints, as explained here: https://docs.aws.amazon.com/sns/latest/dg/sns-dead-letter-queues.html
- When using Azure Service Bus, regularly review and process messages in dead-letter queues to ensure failed messages are not ignored, as explained here: https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-dead-letter-queues https://learn.microsoft.com/en-us/azure/service-bus-messaging/enable-dead-letter
- When using Google Cloud Pub/Sub, monitor for undelivered or unacknowledged messages and set up dead-letter topics if needed, as explained here: https://cloud.google.com/pubsub/docs/handling-failures
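For Amazon SQS specifically, a dead-letter queue is attached through the queue's redrive policy. The sketch below (hypothetical queue URL and DLQ ARN) moves a message to the DLQ after five failed processing attempts.

```python
import json

import boto3

sqs = boto3.client("sqs")

queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue"  # hypothetical
dlq_arn = "arn:aws:sqs:us-east-1:123456789012:orders-queue-dlq"              # hypothetical

# After 5 failed receives, SQS moves the message to the dead-letter queue.
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)
```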
Scaling and Performance
After ensuring security, visibility, and error resilience, focus on scaling and performance. Monitor throughput, latency, and resource utilization, and configure auto-scaling to match demand efficiently (see the consumer sketch after this list).
- When using Amazon SQS, adjust queue settings or consumer concurrency as traffic patterns change, as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/best-practices-message-processing.html
- When using Amazon SNS, monitor usage for unexpected spikes, as explained here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Best_Practice_Recommended_Alarms_AWS_Services.html#SNS
- When using Azure Service Bus, adjust throughput units, use partitioned queues/topics, and implement batching or parallel processing to handle varying loads, as explained here: https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-performance-improvements
- When using Google Cloud Pub/Sub, adjust quotas and scaling policies as message volumes change to avoid service interruptions, as explained here: https://cloud.google.com/pubsub/quotas
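As an illustration of tuning an SQS consumer for throughput, the sketch below enables long polling on the queue and receives messages in batches; in practice you would run several such consumers in parallel (threads, processes, or Lambda concurrency) as traffic grows. The queue URL and the processing step are placeholders.

```python
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue"  # hypothetical

# Long polling reduces empty receives and API cost.
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={"ReceiveMessageWaitTimeSeconds": "20"},
)

# Receive up to 10 messages per call and delete each one after successful processing.
response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,
)
for message in response.get("Messages", []):
    # ...process the message body here...
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```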
Maintenance
Finally, establish ongoing maintenance routines such as regular reviews, updates, cost optimization, and compliance audits to sustain operational excellence and adapt to evolving needs. A clean-up sketch follows the list below.
- When using Amazon SQS, purge queues as needed and archive messages if required for compliance, as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-using-purge-queue.html
- When using Amazon SNS, review and clean up unused topics and subscriptions, as explained here: https://docs.aws.amazon.com/sns/latest/dg/sns-delete-subscription-topic.html
- When using Azure Service Bus, delete unused messages, as explained here: https://learn.microsoft.com/en-us/azure/service-bus-messaging/batch-delete
- When using Google Cloud Pub/Sub, purge or replay unneeded messages using the seek feature, as explained here: https://cloud.google.com/pubsub/docs/replay-overview
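A small boto3 sketch of such routine clean-up on AWS: purging a queue whose backlog is no longer needed and unsubscribing stale SNS subscriptions. The queue URL is hypothetical, and should_remove is a placeholder for whatever clean-up criteria you apply.

```python
import boto3

sqs = boto3.client("sqs")
sns = boto3.client("sns")

# Purge a queue whose backlog is no longer needed (irreversible: messages are deleted).
sqs.purge_queue(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue"  # hypothetical
)

# Walk all SNS subscriptions and remove the ones you no longer use.
paginator = sns.get_paginator("list_subscriptions")
for page in paginator.paginate():
    for sub in page["Subscriptions"]:
        if sub["SubscriptionArn"] == "PendingConfirmation":
            continue  # unconfirmed subscriptions cannot be removed with unsubscribe()
        if should_remove(sub):  # hypothetical helper encoding your clean-up criteria
            sns.unsubscribe(SubscriptionArn=sub["SubscriptionArn"])
```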
Event routing services
Event routing services act as the central hub in event-driven architectures, receiving events from producers and distributing them to the appropriate consumers. This decouples producers from consumers, allowing each to operate, scale, and fail independently without direct awareness of each other.
Monitoring and Observability
Serverless event routing services require robust monitoring and observability to track event flows, detect anomalies, and ensure system health. This is typically achieved through metrics, logs, and dashboards that provide real-time visibility into event processing and failures (an example metric query follows the list below).
- When using Amazon EventBridge, set up CloudWatch metrics and logs to monitor event throughput, failures, latency, and rule matches. Use CloudWatch Alarms to alert on anomalies or failures in event delivery, as explained here: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-monitoring.html
- When using Azure Event Grid, use Azure Monitor and Event Grid metrics to track event delivery, failures, and latency, as explained here: https://learn.microsoft.com/en-us/azure/event-grid/monitor-namespaces
- When using Azure Event Grid, set up alerts for undelivered events or high failure rates, as explained here: https://learn.microsoft.com/en-us/azure/event-grid/set-alerts
- When using Google Eventarc, monitor for event delivery status, trigger activity, and errors, as explained here: https://cloud.google.com/eventarc/standard/docs/monitor
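For EventBridge, one way to review delivery health is to query the FailedInvocations metric for a specific rule, as in the boto3 sketch below; the rule name is a hypothetical placeholder, and the same metric can also back a CloudWatch alarm.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Pull the last 24 hours of failed target invocations for one EventBridge rule.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Events",
    MetricName="FailedInvocations",
    Dimensions=[{"Name": "RuleName", "Value": "orders-rule"}],  # hypothetical rule
    StartTime=datetime.now(timezone.utc) - timedelta(hours=24),
    EndTime=datetime.now(timezone.utc),
    Period=3600,
    Statistics=["Sum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```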
Error Handling and Dead-Letter Management
Effective error handling uses mechanisms like retries and circuit breakers to manage transient failures, while dead-letter queues (DLQs) capture undelivered or failed events for later analysis and remediation, preventing data loss and supporting troubleshooting. A retry/DLQ sketch follows the list below.
- When using Amazon EventBridge, configure dead-letter queues (DLQ) for failed event deliveries. Set retry policies and monitor DLQ for undelivered events to ensure no data loss, as explained here: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-rule-dlq.html
- When using Azure Event Grid, configure retry policies and use dead-lettering for events that cannot be delivered after multiple attempts, as explained here: https://learn.microsoft.com/en-us/azure/event-grid/manage-event-delivery
- When using Google Eventarc, use Pub/Sub dead letter topics for failed event deliveries, as explained here: https://cloud.google.com/eventarc/docs/retry-events
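On EventBridge, retries and dead-lettering are configured per rule target. The following boto3 sketch (hypothetical rule, Lambda target, and DLQ ARNs) sets a bounded retry policy and routes undeliverable events to an SQS dead-letter queue.

```python
import boto3

events = boto3.client("events")

events.put_targets(
    Rule="orders-rule",  # hypothetical rule name
    Targets=[
        {
            "Id": "order-processor",
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:order-processor",  # hypothetical
            # Bound how long and how often EventBridge retries delivery.
            "RetryPolicy": {
                "MaximumRetryAttempts": 8,
                "MaximumEventAgeInSeconds": 3600,
            },
            # Events that still cannot be delivered land in this SQS queue.
            "DeadLetterConfig": {
                "Arn": "arn:aws:sqs:us-east-1:123456789012:orders-rule-dlq"  # hypothetical
            },
        }
    ],
)
```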
Security and Access Management
Security and access management involve configuring fine-grained permissions to control which users and services can publish, consume, or manage events, ensuring that only authorized entities interact with event routing resources and that sensitive data remains protected (an example event bus permission follows the list below).
- When using Amazon EventBridge, review and update IAM policies for event buses, rules, and targets. Use resource-based policies to restrict who can publish or subscribe to events, as explained here: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-manage-iam-access.html https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-use-resource-based.html
- When using Azure Event Grid, assign a managed identity to an Event Grid topic and configure roles, following the principle of least privilege, as explained here: https://learn.microsoft.com/en-us/azure/event-grid/enable-identity-custom-topics-domains https://learn.microsoft.com/en-us/azure/event-grid/add-identity-roles
- When using Google Eventarc, manage IAM permissions for triggers, event sources, and destinations, following the principle of least privilege, as explained here: https://cloud.google.com/eventarc/standard/docs/access-control
- When using Google Eventarc, encrypt sensitive data at rest using customer-managed encryption keys, as explained here: https://cloud.google.com/eventarc/docs/use-cmek
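As a small example of restricting publishers on EventBridge, the sketch below grants a single (hypothetical) AWS account permission to put events onto a custom event bus; anything not granted through this resource-based policy or through IAM is denied.

```python
import boto3

events = boto3.client("events")

# Allow only one producer account to publish to the custom event bus.
events.put_permission(
    EventBusName="orders-bus",            # hypothetical bus name
    Action="events:PutEvents",
    Principal="111122223333",             # hypothetical producer account ID
    StatementId="allow-orders-producer-account",
)
```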
Scaling and Performance
Serverless platforms automatically scale event routing services in response to workload changes, spinning up additional resources during spikes and scaling down during lulls, while performance optimization involves tuning event patterns, batching, and concurrency settings to minimize latency and maximize throughput.
- When using Amazon EventBridge, monitor event throughput and adjust quotas or request service limit increases as needed. Optimize event patterns and rules for efficiency, as explained here: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-quota.html
- When using Azure Event Grid, monitor for throttling or delivery issues, as explained here: https://learn.microsoft.com/en-us/azure/event-grid/monitor-push-reference
- When using Google Eventarc, monitor quotas and usage (e.g., triggers per location), as explained here: https://cloud.google.com/eventarc/docs/quotas
Workflow orchestration services
Workflow services are designed to coordinate and manage complex sequences of tasks or business processes that involve multiple steps and services. They act as orchestrators, ensuring each step in a process is executed in the correct order, handling transitions, and managing dependencies between steps.
Monitoring and Observability
Set up and review monitoring dashboards, logs, and alerts to ensure workflows are running correctly and to quickly detect anomalies or failures; an example query for failed executions follows the list below.
- When using AWS Step Functions, monitor executions, check logs, and set up CloudWatch metrics and alarms to ensure workflows run as expected, as explained here: https://docs.aws.amazon.com/step-functions/latest/dg/monitoring-logging.html
- When using Azure Logic Apps, use Azure Monitor and built-in diagnostics to track workflow runs and troubleshoot failures, as explained here: https://learn.microsoft.com/en-us/azure/logic-apps/monitor-logic-apps-overview
- When using Google Workflows, use Cloud Logging and Monitoring to observe workflow executions and set up alerts for failures or anomalies, as explained here: https://cloud.google.com/workflows/docs/monitor
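For AWS Step Functions, a simple starting point is to list recent failed executions for a state machine, as in the boto3 sketch below (the state machine ARN is hypothetical); the same data can feed a dashboard or an alert.

```python
import boto3

sfn = boto3.client("stepfunctions")
state_machine_arn = (
    "arn:aws:states:us-east-1:123456789012:stateMachine:order-workflow"  # hypothetical
)

# List failed executions so they can be investigated or re-driven.
paginator = sfn.get_paginator("list_executions")
for page in paginator.paginate(stateMachineArn=state_machine_arn, statusFilter="FAILED"):
    for execution in page["executions"]:
        print(execution["name"], execution["startDate"], execution["status"])
```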
Error Handling and Retry
Investigate failed workflow executions, enhance error handling logic (such as retries and catch blocks), and resubmit failed runs where appropriate. This is crucial for maintaining workflow reliability and minimizing manual intervention (a retry/catch sketch follows the list below).
- When using AWS Step Functions, review failed executions, configure retry/catch logic, and update workflows to handle errors gracefully, as explained here: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html
- When using Azure Logic Apps, handle failed runs, configure error actions, and resubmit failed instances as needed, as explained here: https://learn.microsoft.com/en-us/azure/logic-apps/error-exception-handling
- When using Google Workflows, inspect failed executions, define retry policies, and update error handling logic in workflow definitions, as explained here: https://cloud.google.com/workflows/docs/reference/syntax/catching-errors
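To illustrate retry and catch logic in Step Functions, here is a minimal Amazon States Language definition (expressed as a Python dict and deployed with boto3) in which a task retries transient errors with exponential backoff and routes anything else to a failure-handling state. The state machine, Lambda function, and state names are hypothetical.

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "StartAt": "ProcessOrder",
    "States": {
        "ProcessOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-order",  # hypothetical
            # Retry transient errors with exponential backoff.
            "Retry": [
                {
                    "ErrorEquals": ["Lambda.ServiceException", "States.Timeout"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            # Route any remaining error to a dedicated failure state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
            "End": True,
        },
        "HandleFailure": {"Type": "Fail", "Cause": "Order processing failed"},
    },
}

sfn.update_state_machine(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:order-workflow",  # hypothetical
    definition=json.dumps(definition),
)
```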
Security and Access Management
Workflow orchestration services require continuous enforcement of granular access controls and the principle of least privilege, ensuring that each function and workflow has only the permissions necessary for its specific tasks.
- When using AWS Step Functions, use AWS Identity and Access Management (IAM) for fine-grained control over who can access and manage workflows, as explained here: https://docs.aws.amazon.com/step-functions/latest/dg/auth-and-access-control-sfn.html
- When using Azure Logic Apps, use Azure Role-Based Access Control (RBAC) and managed identities for secure access to resources and connectors, as explained here: https://learn.microsoft.com/en-us/azure/logic-apps/authenticate-with-managed-identity
- When using Google Workflows, use Google Cloud IAM for permissions and access management, which allows you to define who can execute, view, or manage workflows, as explained here: https://cloud.google.com/workflows/docs/use-iam-for-access
Versioning and Updates
Workflow orchestration services use versioning to track and manage different iterations of workflows, allowing multiple versions to coexist and enabling users to select, test, or revert to specific versions as needed. A versioning sketch follows the list below.
- When using AWS Step Functions, update state machines, manage versions, and test changes before deploying to production, as explained here: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-state-machine-version.html
- When using Azure Logic Apps, manage deployment slots and use versioning for rollback if needed, as explained here: https://learn.microsoft.com/en-us/azure/logic-apps/manage-logic-apps-with-azure-portal
- When using Google Workflows, update workflows, test changes in staging, and deploy updates with minimal disruption, as explained here: https://cloud.google.com/workflows/docs/best-practice
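For Step Functions, versions and aliases make this workflow explicit: publish the current definition as an immutable version, then point a production alias at it so rollback is just re-pointing the alias. A minimal boto3 sketch with a hypothetical state machine ARN and alias name:

```python
import boto3

sfn = boto3.client("stepfunctions")
state_machine_arn = (
    "arn:aws:states:us-east-1:123456789012:stateMachine:order-workflow"  # hypothetical
)

# Publish the current definition as an immutable version.
version = sfn.publish_state_machine_version(
    stateMachineArn=state_machine_arn,
    description="Add retry logic to ProcessOrder",
)

# Point a "prod" alias at that version; callers start executions through the alias,
# so rolling back means re-pointing the alias to the previous version.
sfn.create_state_machine_alias(
    name="prod",
    routingConfiguration=[
        {"stateMachineVersionArn": version["stateMachineVersionArn"], "weight": 100}
    ],
)
```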
Cost Optimization
Regularly review usage and billing data, optimize workflow design (e.g., reduce unnecessary steps or external calls), and adjust resource allocation to control operational costs.
- When using AWS Step Functions, analyze usage and optimize workflow design to reduce execution and resource costs, as explained here: https://docs.aws.amazon.com/step-functions/latest/dg/sfn-best-practices.html#cost-opt-exp-workflows
- When using Azure Logic Apps, monitor consumption, review billing, and optimize triggers/actions to control costs, as explained here: https://learn.microsoft.com/en-us/azure/logic-apps/plan-manage-costs
- When using Google Workflows, analyze workflow usage, optimize steps, and monitor billing to reduce costs, as explained here: https://cloud.google.com/workflows/docs/best-practice#optimize-usage
Summary
In this blog post, I presented the most common Day 2 serverless operations when using application integration services (message queues, event routing, and workflow orchestration) to build modern applications.
I looked at aspects such as observability, error handling, security, and performance.
Building event-driven architectures requires time to grasp which services best support this approach. However, gaining a foundational understanding of the key areas above is essential for effective Day 2 serverless operations.
About the author
Eyal Estrin is a cloud and information security architect, an AWS Community Builder, and the author of the books Cloud Security Handbook and Security for Cloud Native Applications, with more than 25 years in the IT industry.
You can connect with him on social media (https://linktr.ee/eyalestrin).
Opinions are his own and not the views of his employer.