In part 1 of this series, I introduced some of the most common Day 2 serverless operations, focusing on Function as a Service.
In this part, I focus on the serverless application integration services commonly used to build event-driven architectures: message queue services, event routing services, and workflow orchestration services.
Message queue services
Message queues enable asynchronous communication between different components in an event-driven architecture (EDA). This means that producers (systems or services generating events) can send messages to the queue and continue their operations without waiting for consumers (systems or services processing events) to respond or be available.
Security and Access Control
Security should always come first, as it protects your data, controls access, and supports compliance from the outset. This includes encrypting data, limiting permissions, and enforcing least-privilege policies (a minimal example for Amazon SQS follows the list below).
- When using Amazon SQS, manage permissions using AWS IAM policies to restrict access to queues and follow the principle of least privilege, as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-basic-examples-of-iam-policies.html#security_iam_id-based-policy-examples
- When using Amazon SQS, enable server-side encryption (SSE) for sensitive data at rest, as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-server-side-encryption.html
- When using Amazon SNS, manage topic policies and IAM roles to control who can publish or subscribe, as explained here: https://docs.aws.amazon.com/sns/latest/dg/security_iam_id-based-policy-examples.html
- When using Amazon SNS, enable server-side encryption (SSE) for sensitive data at rest, as explained here: https://docs.aws.amazon.com/sns/latest/dg/sns-server-side-encryption.html
- When using Azure Service Bus, use managed identities and configure roles, following the principle of least privilege, as explained here: https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-managed-service-identity
- When using Azure Service Bus, enable encryption at rest using customer-managed keys, as explained here: https://learn.microsoft.com/en-us/azure/service-bus-messaging/configure-customer-managed-key
- When using Google Cloud Pub/Sub, tighten and review IAM policies to ensure only authorized users and services can publish or subscribe to topics, as explained here: https://cloud.google.com/pubsub/docs/access-control
- When using Google Cloud Pub/Sub, configure encryption at rest using customer-managed encryption keys, as explained here: https://cloud.google.com/pubsub/docs/encryption
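To make this concrete, here is a minimal boto3 sketch (Python) that applies a restrictive queue policy and enables encryption at rest with a customer-managed KMS key on an SQS queue. The queue URL, account ID, producer role, and key alias are hypothetical placeholders; adapt them to your environment.

```python
import json

import boto3

sqs = boto3.client("sqs")

# Hypothetical queue and principal, for illustration only.
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue"
queue_arn = "arn:aws:sqs:us-east-1:123456789012:orders-queue"
producer_role_arn = "arn:aws:iam::123456789012:role/order-producer"

# Least-privilege queue policy: only the producer role may send messages.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": producer_role_arn},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
        }
    ],
}

sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={
        "Policy": json.dumps(policy),
        # Encrypt messages at rest with a customer-managed KMS key (hypothetical alias).
        "KmsMasterKeyId": "alias/orders-queue-key",
    },
)
```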
Monitoring and Observability
Once security is in place, implement comprehensive monitoring and observability to gain visibility into system health, performance, and failures. This enables proactive detection of and response to issues (an example CloudWatch alarm follows the list below).
- When using Amazon SQS, monitor queue metrics such as message count, age of oldest message, and queue length using Amazon CloudWatch, as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/monitoring-using-cloudwatch.html https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-available-cloudwatch-metrics.html
- When using Amazon SQS, set up CloudWatch alarms for thresholds (e.g., high message backlog or processing latency), as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/set-cloudwatch-alarms-for-metrics.html
- When using Amazon SNS, use CloudWatch to track message delivery status, failure rates, and subscription metrics, as explained here: https://docs.aws.amazon.com/sns/latest/dg/sns-monitoring-using-cloudwatch.html
- When using Azure Service Bus, use Azure Monitor to track metrics such as queue length, message count, dead-letter messages, and throughput. Set up alerts for abnormal conditions (e.g., message backlog, high latency), as explained here: https://learn.microsoft.com/en-us/azure/service-bus-messaging/monitor-service-bus
- When using Azure Service Bus, monitor and manage message sessions for ordered processing when required, as explained here: https://learn.microsoft.com/en-us/azure/service-bus-messaging/message-sequencing
- When using Google Cloud Pub/Sub, monitor message throughput, error rates, and latency, and set up alerts for operational anomalies, as explained here: https://cloud.google.com/pubsub/docs/monitoring
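As one example of turning these metrics into alerts, the following boto3 sketch creates a CloudWatch alarm on the age of the oldest message in an SQS queue, which usually signals a stalled or under-scaled consumer. The queue name, alarm name, and SNS topic for notifications are hypothetical.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the oldest message has been waiting longer than 10 minutes
# for two consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="orders-queue-backlog",  # hypothetical alarm name
    Namespace="AWS/SQS",
    MetricName="ApproximateAgeOfOldestMessage",
    Dimensions=[{"Name": "QueueName", "Value": "orders-queue"}],  # hypothetical queue
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=2,
    Threshold=600,  # seconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # hypothetical topic
)
```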
Error Handling
With monitoring established, set up robust error handling mechanisms, including alerts, retries, and dead-letter queues, to ensure reliability and rapid remediation of failures; a dead-letter queue sketch follows the list below.
- When using Amazon SQS, configure Dead Letter Queues (DLQs) to capture messages that fail processing repeatedly for later analysis and remediation, as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html
- When using Amazon SNS, integrate with DLQs (using SQS as a DLQ) for messages that cannot be delivered to endpoints, as explained here: https://docs.aws.amazon.com/sns/latest/dg/sns-dead-letter-queues.html
- When using Azure Service Bus, regularly review and process messages in dead-letter queues to ensure failed messages are not ignored, as explained here: https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-dead-letter-queues https://learn.microsoft.com/en-us/azure/service-bus-messaging/enable-dead-letter
- When using Google Cloud Pub/Sub, monitor for undelivered or unacknowledged messages and set up dead-letter topics if needed, as explained here: https://cloud.google.com/pubsub/docs/handling-failures
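For Amazon SQS specifically, a dead-letter queue is attached through the queue's redrive policy. The sketch below (hypothetical queue URL and DLQ ARN) moves a message to the DLQ after five failed processing attempts.

```python
import json

import boto3

sqs = boto3.client("sqs")

queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue"  # hypothetical
dlq_arn = "arn:aws:sqs:us-east-1:123456789012:orders-queue-dlq"              # hypothetical

# After 5 failed receives, SQS moves the message to the dead-letter queue.
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)
```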
Scaling and Performance
After ensuring security, visibility, and error resilience, focus on scaling and performance. Monitor throughput, latency, and resource utilization, and configure auto-scaling to match demand efficiently (see the consumer sketch after this list).
- When using Amazon SQS, adjust queue settings or consumer concurrency as traffic patterns change, as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/best-practices-message-processing.html
- When using Amazon SNS, monitor usage for unexpected spikes, as explained here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Best_Practice_Recommended_Alarms_AWS_Services.html#SNS
- When using Azure Service Bus, adjust throughput units, use partitioned queues/topics, and implement batching or parallel processing to handle varying loads, as explained here: https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-performance-improvements
- When using Google Cloud Pub/Sub, adjust quotas and scaling policies as message volumes change to avoid service interruptions, as explained here: https://cloud.google.com/pubsub/quotas
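As an illustration of tuning an SQS consumer for throughput, the sketch below enables long polling on the queue and receives messages in batches; in practice you would run several such consumers in parallel (threads, processes, or Lambda concurrency) as traffic grows. The queue URL and the processing step are placeholders.

```python
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue"  # hypothetical

# Long polling reduces empty receives and API cost.
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={"ReceiveMessageWaitTimeSeconds": "20"},
)

# Receive up to 10 messages per call and delete each one after successful processing.
response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,
)
for message in response.get("Messages", []):
    # ...process the message body here...
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```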
Maintenance
Finally, establish ongoing maintenance routines such as regular reviews, updates, cost optimization, and compliance audits to sustain operational excellence and adapt to evolving needs. A clean-up sketch follows the list below.
- When using Amazon SQS, purge queues as needed and archive messages if required for compliance, as explained here: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-using-purge-queue.html
- When using Amazon SNS, review and clean up unused topics and subscriptions, as explained here: https://docs.aws.amazon.com/sns/latest/dg/sns-delete-subscription-topic.html
- When using Azure Service Bus, delete unused messages, as explained here: https://learn.microsoft.com/en-us/azure/service-bus-messaging/batch-delete
- When using Google Cloud Pub/Sub, purge or replay unneeded messages using the seek feature, as explained here: https://cloud.google.com/pubsub/docs/replay-overview
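A small boto3 sketch of such routine clean-up on AWS: purging a queue whose backlog is no longer needed and unsubscribing stale SNS subscriptions. The queue URL is hypothetical, and should_remove is a placeholder for whatever clean-up criteria you apply.

```python
import boto3

sqs = boto3.client("sqs")
sns = boto3.client("sns")

# Purge a queue whose backlog is no longer needed (irreversible: messages are deleted).
sqs.purge_queue(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue"  # hypothetical
)

# Walk all SNS subscriptions and remove the ones you no longer use.
paginator = sns.get_paginator("list_subscriptions")
for page in paginator.paginate():
    for sub in page["Subscriptions"]:
        if sub["SubscriptionArn"] == "PendingConfirmation":
            continue  # unconfirmed subscriptions cannot be removed with unsubscribe()
        if should_remove(sub):  # hypothetical helper encoding your clean-up criteria
            sns.unsubscribe(SubscriptionArn=sub["SubscriptionArn"])
```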
Event routing services
Event routing services act as the central hub in event-driven architectures, receiving events from producers and distributing them to the appropriate consumers. This decouples producers from consumers, allowing each to operate, scale, and fail independently without direct awareness of each other.
Monitoring and Observability
Serverless event routing services require robust monitoring and observability to track event flows, detect anomalies, and ensure system health. This is typically achieved through metrics, logs, and dashboards that provide real-time visibility into event processing and failures (an example metric query follows the list below).
- When using Amazon EventBridge, set up CloudWatch metrics and logs to monitor event throughput, failures, latency, and rule matches. Use CloudWatch Alarms to alert on anomalies or failures in event delivery, as explained here: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-monitoring.html
- When using Azure Event Grid, use Azure Monitor and Event Grid metrics to track event delivery, failures, and latency, as explained here: https://learn.microsoft.com/en-us/azure/event-grid/monitor-namespaces
- When using Azure Event Grid, set up alerts for undelivered events or high failure rates, as explained here: https://learn.microsoft.com/en-us/azure/event-grid/set-alerts
- When using Google Eventarc, monitor for event delivery status, trigger activity, and errors, as explained here: https://cloud.google.com/eventarc/standard/docs/monitor
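For EventBridge, one way to review delivery health is to query the FailedInvocations metric for a specific rule, as in the boto3 sketch below; the rule name is a hypothetical placeholder, and the same metric can also back a CloudWatch alarm.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Pull the last 24 hours of failed target invocations for one EventBridge rule.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Events",
    MetricName="FailedInvocations",
    Dimensions=[{"Name": "RuleName", "Value": "orders-rule"}],  # hypothetical rule
    StartTime=datetime.now(timezone.utc) - timedelta(hours=24),
    EndTime=datetime.now(timezone.utc),
    Period=3600,
    Statistics=["Sum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```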
Error Handling and Dead-Letter Management
Effective error handling uses mechanisms like retries and circuit breakers to manage transient failures, while dead-letter queues (DLQs) capture undelivered or failed events for later analysis and remediation, preventing data loss and supporting troubleshooting. A retry/DLQ sketch follows the list below.
- When using Amazon EventBridge, configure dead-letter queues (DLQ) for failed event deliveries. Set retry policies and monitor DLQ for undelivered events to ensure no data loss, as explained here: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-rule-dlq.html
- When using Azure Event Grid, configure retry policies and use dead-lettering for events that cannot be delivered after multiple attempts, as explained here: https://learn.microsoft.com/en-us/azure/event-grid/manage-event-delivery
- When using Google Eventarc, use Pub/Sub dead letter topics for failed event deliveries, as explained here: https://cloud.google.com/eventarc/docs/retry-events
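On EventBridge, retries and dead-lettering are configured per rule target. The following boto3 sketch (hypothetical rule, Lambda target, and DLQ ARNs) sets a bounded retry policy and routes undeliverable events to an SQS dead-letter queue.

```python
import boto3

events = boto3.client("events")

events.put_targets(
    Rule="orders-rule",  # hypothetical rule name
    Targets=[
        {
            "Id": "order-processor",
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:order-processor",  # hypothetical
            # Bound how long and how often EventBridge retries delivery.
            "RetryPolicy": {
                "MaximumRetryAttempts": 8,
                "MaximumEventAgeInSeconds": 3600,
            },
            # Events that still cannot be delivered land in this SQS queue.
            "DeadLetterConfig": {
                "Arn": "arn:aws:sqs:us-east-1:123456789012:orders-rule-dlq"  # hypothetical
            },
        }
    ],
)
```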
Security and Access Management
Security and access management involve configuring fine-grained permissions to control which users and services can publish, consume, or manage events, ensuring that only authorized entities interact with event routing resources and that sensitive data remains protected (an example event bus permission follows the list below).
- When using Amazon EventBridge, review and update IAM policies for event buses, rules, and targets. Use resource-based policies to restrict who can publish or subscribe to events, as explained here: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-manage-iam-access.html https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-use-resource-based.html
- When using Azure Event Grid, assign a managed identity to an Event Grid topic and configure roles, following the principle of least privilege, as explained here: https://learn.microsoft.com/en-us/azure/event-grid/enable-identity-custom-topics-domains https://learn.microsoft.com/en-us/azure/event-grid/add-identity-roles
- When using Google Eventarc, manage IAM permissions for triggers, event sources, and destinations, following the principle of least privilege, as explained here: https://cloud.google.com/eventarc/standard/docs/access-control
- When using Google Eventarc, encrypt sensitive data at rest using customer-managed encryption keys, as explained here: https://cloud.google.com/eventarc/docs/use-cmek
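As a small example of restricting publishers on EventBridge, the sketch below grants a single (hypothetical) AWS account permission to put events onto a custom event bus; anything not granted through this resource-based policy or through IAM is denied.

```python
import boto3

events = boto3.client("events")

# Allow only one producer account to publish to the custom event bus.
events.put_permission(
    EventBusName="orders-bus",            # hypothetical bus name
    Action="events:PutEvents",
    Principal="111122223333",             # hypothetical producer account ID
    StatementId="allow-orders-producer-account",
)
```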
Scaling and Performance
Serverless platforms automatically scale event routing services in response to workload changes, spinning up additional resources during spikes and scaling down during lulls, while performance optimization involves tuning event patterns, batching, and concurrency settings to minimize latency and maximize throughput.
- When using Amazon EventBridge, monitor event throughput and adjust quotas or request service limit increases as needed. Optimize event patterns and rules for efficiency, as explained here: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-quota.html
- When using Azure Event Grid, monitor for throttling or delivery issues, as explained here: https://learn.microsoft.com/en-us/azure/event-grid/monitor-push-reference
- When using Google Eventarc, monitor quotas and usage (e.g., triggers per location), as explained here: https://cloud.google.com/eventarc/docs/quotas
Workflow orchestration services
Workflow services are designed to coordinate and manage complex sequences of tasks or business processes that involve multiple steps and services. They act as orchestrators, ensuring each step in a process is executed in the correct order, handling transitions, and managing dependencies between steps.
Monitoring and Observability
Set up and review monitoring dashboards, logs, and alerts to ensure workflows are running correctly and to quickly detect anomalies or failures; an example query for failed executions follows the list below.
- When using AWS Step Functions, monitor executions, check logs, and set up CloudWatch metrics and alarms to ensure workflows run as expected, as explained here: https://docs.aws.amazon.com/step-functions/latest/dg/monitoring-logging.html
- When using Azure Logic Apps, use Azure Monitor and built-in diagnostics to track workflow runs and troubleshoot failures, as explained here: https://learn.microsoft.com/en-us/azure/logic-apps/monitor-logic-apps-overview
- When using Google Workflows, use Cloud Logging and Monitoring to observe workflow executions and set up alerts for failures or anomalies, as explained here: https://cloud.google.com/workflows/docs/monitor
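For AWS Step Functions, a simple starting point is to list recent failed executions for a state machine, as in the boto3 sketch below (the state machine ARN is hypothetical); the same data can feed a dashboard or an alert.

```python
import boto3

sfn = boto3.client("stepfunctions")
state_machine_arn = (
    "arn:aws:states:us-east-1:123456789012:stateMachine:order-workflow"  # hypothetical
)

# List failed executions so they can be investigated or re-driven.
paginator = sfn.get_paginator("list_executions")
for page in paginator.paginate(stateMachineArn=state_machine_arn, statusFilter="FAILED"):
    for execution in page["executions"]:
        print(execution["name"], execution["startDate"], execution["status"])
```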
Error Handling and Retry
Investigate failed workflow executions, enhance error handling logic (such as retries and catch blocks), and resubmit failed runs where appropriate. This is crucial for maintaining workflow reliability and minimizing manual intervention (a retry/catch sketch follows the list below).
- When using AWS Step Functions, review failed executions, configure retry/catch logic, and update workflows to handle errors gracefully, as explained here: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html
- When using Azure Logic Apps, handle failed runs, configure error actions, and resubmit failed instances as needed, as explained here: https://learn.microsoft.com/en-us/azure/logic-apps/error-exception-handling
- When using Google Workflows, inspect failed executions, define retry policies, and update error handling logic in workflow definitions, as explained here: https://cloud.google.com/workflows/docs/reference/syntax/catching-errors
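To illustrate retry and catch logic in Step Functions, here is a minimal Amazon States Language definition (expressed as a Python dict and deployed with boto3) in which a task retries transient errors with exponential backoff and routes anything else to a failure-handling state. The state machine, Lambda function, and state names are hypothetical.

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "StartAt": "ProcessOrder",
    "States": {
        "ProcessOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-order",  # hypothetical
            # Retry transient errors with exponential backoff.
            "Retry": [
                {
                    "ErrorEquals": ["Lambda.ServiceException", "States.Timeout"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            # Route any remaining error to a dedicated failure state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
            "End": True,
        },
        "HandleFailure": {"Type": "Fail", "Cause": "Order processing failed"},
    },
}

sfn.update_state_machine(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:order-workflow",  # hypothetical
    definition=json.dumps(definition),
)
```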
Security and Access Management
Workflow orchestration services require continuous enforcement of granular access controls and the principle of least privilege, ensuring that each function and workflow has only the permissions necessary for its specific tasks.
- When using AWS Step Functions, use AWS Identity and Access Management (IAM) for fine-grained control over who can access and manage workflows, as explained here: https://docs.aws.amazon.com/step-functions/latest/dg/auth-and-access-control-sfn.html
- When using Azure Logic Apps, use Azure Role-Based Access Control (RBAC) and managed identities for secure access to resources and connectors, as explained here: https://learn.microsoft.com/en-us/azure/logic-apps/authenticate-with-managed-identity
- When using Google Workflows, use Google Cloud IAM for permissions and access management, which allows you to define who can execute, view, or manage workflows, as explained here: https://cloud.google.com/workflows/docs/use-iam-for-access
Versioning and Updates
Workflow orchestration services use versioning to track and manage different iterations of workflows, allowing multiple versions to coexist and enabling users to select, test, or revert to specific versions as needed. A versioning sketch follows the list below.
- When using AWS Step Functions, update state machines, manage versions, and test changes before deploying to production, as explained here: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-state-machine-version.html
- When using Azure Logic Apps, manage deployment slots and use versioning for rollback if needed, as explained here: https://learn.microsoft.com/en-us/azure/logic-apps/manage-logic-apps-with-azure-portal
- When using Google Workflows, update workflows, test changes in staging, and deploy updates with minimal disruption, as explained here: https://cloud.google.com/workflows/docs/best-practice
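For Step Functions, versions and aliases make this workflow explicit: publish the current definition as an immutable version, then point a production alias at it so rollback is just re-pointing the alias. A minimal boto3 sketch with a hypothetical state machine ARN and alias name:

```python
import boto3

sfn = boto3.client("stepfunctions")
state_machine_arn = (
    "arn:aws:states:us-east-1:123456789012:stateMachine:order-workflow"  # hypothetical
)

# Publish the current definition as an immutable version.
version = sfn.publish_state_machine_version(
    stateMachineArn=state_machine_arn,
    description="Add retry logic to ProcessOrder",
)

# Point a "prod" alias at that version; callers start executions through the alias,
# so rolling back means re-pointing the alias to the previous version.
sfn.create_state_machine_alias(
    name="prod",
    routingConfiguration=[
        {"stateMachineVersionArn": version["stateMachineVersionArn"], "weight": 100}
    ],
)
```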
Cost Optimization
Regularly review usage and billing data, optimize workflow design (e.g., reduce unnecessary steps or external calls), and adjust resource allocation to control operational costs.
- When using AWS Step Functions, analyze usage and optimize workflow design to reduce execution and resource costs, as explained here: https://docs.aws.amazon.com/step-functions/latest/dg/sfn-best-practices.html#cost-opt-exp-workflows
- When using Azure Logic Apps, monitor consumption, review billing, and optimize triggers/actions to control costs, as explained here: https://learn.microsoft.com/en-us/azure/logic-apps/plan-manage-costs
- When using Google Workflows, analyze workflow usage, optimize steps, and monitor billing to reduce costs, as explained here: https://cloud.google.com/workflows/docs/best-practice#optimize-usage
Summary
In this blog post, I presented the most common Day 2 serverless operations when using application integration services (message queues, event routing, and workflow orchestration) to build modern applications.
I looked at aspects such as observability, error handling, security, and performance.
Building event-driven architectures requires time to grasp which services best support this approach. However, gaining a foundational understanding of the key areas above is essential for effective Day 2 serverless operations.
About the author
Eyal Estrin is a cloud and information security architect, an AWS Community Builder, and the author of the books Cloud Security Handbook and Security for Cloud Native Applications, with more than 25 years in the IT industry.
You can connect with him on social media (https://linktr.ee/eyalestrin).
Opinions are his own and not the views of his employer.