In April 2023, I published a blog post called "Introduction to Day 2 Kubernetes", discussing the challenges of managing Kubernetes workloads in mature environments, once applications were already running in production.
In the software lifecycle there are usually three distinct stages:
- Day 0 – Planning and design
- Day 1 – Configuration and deployment
- Day 2 – Operations
Serverless services are a cloud-native application development and delivery model where developers can build and run code without having to provision, configure, or manage server infrastructure themselves. Many cloud-native services are considered serverless – from compute (such as Function as a Service) and storage (such as object storage) to databases, and more.
In this series of blog posts, I will review the common Day 2 serverless operations.
Part 1 will focus on common operations for Function as a Service (FaaS), and part 2 will focus on application integration services.
Configuration and Revision Management
At this stage, you manage the configuration and the version of the functions you deploy, so you will be able to revert to a previous version in case of problems with the deployment or with your application.
- When using AWS Lambda, use versions to manage the deployment of your Lambda functions, and use an alias as a pointer to the version you would like clients to invoke (see the sketch after this list), as explained here: https://docs.aws.amazon.com/lambda/latest/dg/configuration-versions.html
- When using Azure Functions, you can manage various aspects of the functions configuration such as hosting plan types, memory quota, scale, environment variables, network settings, etc., as explained here: https://learn.microsoft.com/en-us/azure/azure-functions/functions-how-to-use-azure-function-app-settings
- When using Google Cloud Run Functions, you can configure settings such as memory, concurrency, environment variables, network settings, etc., as explained here: https://cloud.google.com/run/docs/deploy-functions
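To illustrate the Lambda versions/aliases flow, here is a minimal sketch using boto3, assuming a function named my-function and an existing alias named live (both names are hypothetical); since clients invoke the alias ARN, a rollback is simply repointing the alias:

```python
import boto3

lambda_client = boto3.client("lambda")

# Publish an immutable version from the function's current code and configuration
new_version = lambda_client.publish_version(
    FunctionName="my-function",            # hypothetical function name
    Description="release candidate",
)["Version"]

# Point the "live" alias (assumed to already exist) at the new version
lambda_client.update_alias(
    FunctionName="my-function",
    Name="live",
    FunctionVersion=new_version,
)

# Rolling back means repointing the alias at the previous version, for example:
# lambda_client.update_alias(FunctionName="my-function", Name="live", FunctionVersion="3")
```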
Runtime engine updates
The base assumption at this stage is that the function has already been configured and had its initial deployment, but as time goes by, there will be newer versions of the function's runtime engine.
Although the recommendation is to use the latest stable version of the runtime engine, changing between major versions may require code adjustments and rigorous testing.
- When using AWS Lambda, the default runtime update mode is set to "Auto", which means AWS applies the latest runtime version whenever customers create or update a function, and later automatically updates all existing functions that haven't yet been moved to the latest runtime version (a sketch of controlling this behavior follows the list). For container-based Lambda functions, customers need to manually rebuild the base container image using the latest runtime version and redeploy the Lambda function, as explained here: https://docs.aws.amazon.com/lambda/latest/dg/runtimes-update.html
- When using Azure Functions, the FUNCTIONS_EXTENSION_VERSION setting pins the runtime to a major version, and minor updates within that major version are applied automatically. Upgrading to a new major runtime version of Azure Functions requires manual work, including testing before deploying, as explained here: https://learn.microsoft.com/en-us/azure/azure-functions/set-runtime-version https://learn.microsoft.com/en-us/troubleshoot/azure/azure-functions/config-mgmt/functions-configuring-updateversion
- When using Google Cloud Run Functions, minor updates are applied automatically; however, upgrading to a new major version of the runtime engine will require redeploying the functions, as explained here: https://cloud.google.com/functions/docs/runtime-support
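For the AWS bullet above, here is a minimal boto3 sketch of controlling when Lambda applies runtime updates and of moving to a newer runtime; the function name and target runtime are hypothetical examples:

```python
import boto3

lambda_client = boto3.client("lambda")

# Choose when Lambda applies runtime updates: "Auto" (default), "FunctionUpdate",
# or "Manual" (the latter also requires a specific RuntimeVersionArn)
lambda_client.put_runtime_management_config(
    FunctionName="my-function",        # hypothetical function name
    UpdateRuntimeOn="FunctionUpdate",  # pick up new runtime versions only on deployments
)

# Moving to a newer major runtime is an explicit change that should be tested first
lambda_client.update_function_configuration(
    FunctionName="my-function",
    Runtime="python3.12",
)
```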
Security, Networking, and Access Control
At this stage, you configure network and security settings to protect your functions, before exposing them to clients.
This includes reviewing network access control lists, the deployment location (inside or outside your cloud virtual network, depending on the resources the function needs to reach), and identity and access management (scoped to the cloud resources the function needs to access, such as storage, databases, etc.).
- When using AWS Lambda, in case the function needs access to private AWS resources, deploy the function inside your VPC, as explained here: https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc.html
- To grant a Lambda function access to other AWS resources, configure the Lambda function with an IAM role for its execution role, following the principle of least privilege, as explained here: https://docs.aws.amazon.com/lambda/latest/dg/lambda-intro-execution-role.html
- In case the Lambda function needs access to resources using static credentials (such as API keys), configure the Lambda function to pull the secrets from AWS Secrets Manager (see the sketch after this list), as explained here: https://docs.aws.amazon.com/lambda/latest/dg/with-secrets-manager.html
- When using Azure Functions, in case the function needs access to private Azure resources, use virtual network integration, and enforce access to the function using Network Security Groups, as explained here: https://learn.microsoft.com/en-us/azure/azure-functions/functions-networking-options
- To grant an Azure Function access to other Azure resources, configure managed identity, following the principle of least privilege, as explained here: https://learn.microsoft.com/en-us/azure/app-service/overview-managed-identity
- In case the Azure Function needs access to resources using static credentials (such as secrets), use Azure Key Vault references, as explained here: https://learn.microsoft.com/en-us/azure/app-service/app-service-key-vault-references
- When using Google Cloud Run Functions, in case the function needs access to private GCP resources, use Serverless VPC Access, as explained here: https://cloud.google.com/functions/1stgendocs/networking/connecting-vpc
- To grant a Cloud Run Function access to other GCP resources, configure a function identity, and grant the identity minimal permissions, following the principle of least privilege, as explained here: https://cloud.google.com/functions/docs/securing/function-identity
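As an example of the Secrets Manager pattern mentioned above, here is a minimal Python handler sketch using boto3; the secret name and JSON key are hypothetical, and in production you would typically cache the secret (or use the AWS Parameters and Secrets Lambda extension) instead of fetching it on every invocation:

```python
import json
import boto3

# Created outside the handler so the client is reused across warm invocations
secrets_client = boto3.client("secretsmanager")

def lambda_handler(event, context):
    # The function's execution role only needs secretsmanager:GetSecretValue on this secret
    secret = secrets_client.get_secret_value(SecretId="prod/my-app/api-key")  # hypothetical name
    api_key = json.loads(secret["SecretString"])["api_key"]                   # hypothetical key
    # ... call the downstream service using api_key ...
    return {"statusCode": 200}
```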
Audit and Compliance
At this stage, you need to make sure your functions automatically send their audit logs to a central system. Combined with threat intelligence services that regularly review the audit logs, this lets you get alerted on security-related events (such as anomalous behavior).
- When using AWS Lambda, configure a trail to send CloudTrail events to a central S3 bucket in a central AWS account (see the sketch after this list), as explained here: https://docs.aws.amazon.com/lambda/latest/dg/logging-using-cloudtrail.html
- To detect security threats in Lambda functions, configure Lambda function protection in Amazon GuardDuty, as explained here: https://docs.aws.amazon.com/guardduty/latest/ug/lambda-protection.html
- When using Azure Functions, to be able to collect audit logs into Azure Monitor, configure diagnostic settings (in a central Azure subscription), and select “Audit” and “AuditEvent”, as explained here: https://learn.microsoft.com/en-us/azure/azure-monitor/platform/create-diagnostic-settings
- In case the Azure Function is deployed inside an App Service plan, use Defender for App Service (part of Microsoft Defender for Cloud), to identify security threats, as explained here: https://learn.microsoft.com/en-us/azure/defender-for-cloud/defender-for-app-service-introduction
- When using Google Cloud Run Functions, configure a central log bucket and route all function audit logs to it (in a central GCP project), as explained here: https://cloud.google.com/logging/docs/audit
- To detect security threats in Google Cloud Run Functions, use Google SecOps, as explained here: https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-audit-logs
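For the AWS part, here is a minimal boto3 sketch of creating a multi-region trail that writes to a central S3 bucket and captures Lambda data events; the trail and bucket names are hypothetical, and the bucket policy must already allow CloudTrail to write to it:

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# A single multi-region trail delivering audit events to a central S3 bucket
cloudtrail.create_trail(
    Name="org-central-trail",             # hypothetical trail name
    S3BucketName="central-audit-logs",    # hypothetical bucket in the central account
    IsMultiRegionTrail=True,
)

# Management events are captured by default; add data events to record Lambda invocations
cloudtrail.put_event_selectors(
    TrailName="org-central-trail",
    EventSelectors=[{
        "ReadWriteType": "All",
        "IncludeManagementEvents": True,
        "DataResources": [{"Type": "AWS::Lambda::Function", "Values": ["arn:aws:lambda"]}],
    }],
)

cloudtrail.start_logging(Name="org-central-trail")
```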
Monitoring, Logging, Observability and Alerting
Continuously track application health, performance, and security events using tools for real-time insights. This includes setting up dashboards and alerts to detect anomalies and issues before they impact users.
- When using AWS Lambda, send all function logs to CloudWatch Logs (see the structured-logging sketch after this list), as explained here: https://docs.aws.amazon.com/lambda/latest/dg/monitoring-cloudwatchlogs.html
- To gain visibility into Lambda performance, use CloudWatch Lambda Insights, as explained here: https://docs.aws.amazon.com/lambda/latest/dg/monitoring-insights.html
- When using Azure Functions, to be able to collect all logs and metrics into Azure Monitor, configure diagnostic settings (in a central Azure subscription), as explained here: https://learn.microsoft.com/en-us/azure/azure-monitor/platform/create-diagnostic-settings https://learn.microsoft.com/en-us/azure/azure-functions/functions-monitoring
- To gain visibility into Azure Functions, use Application Insights (part of Azure Monitor), as explained here: https://learn.microsoft.com/en-us/azure/azure-functions/configure-monitoring
- When using Google Cloud Run Functions, route all function logs through Google Cloud Logging to a central log bucket (in a central GCP project), as explained here: https://cloud.google.com/logging/docs/central-log-storage
- To gain insights into Google Cloud Run Functions, use Google Cloud Observability, as explained here: https://cloud.google.com/monitoring/docs/monitoring-overview
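Since CloudWatch Logs (and Logs Insights) work best with structured log lines, here is a minimal sketch of a Python Lambda handler emitting JSON logs; the event fields are hypothetical:

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    # One JSON object per log line; easy to filter and aggregate in CloudWatch Logs Insights
    logger.info(json.dumps({
        "event": "request_processed",                            # hypothetical field
        "request_id": context.aws_request_id,
        "remaining_ms": context.get_remaining_time_in_millis(),
    }))
    return {"statusCode": 200}
```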
Error Reporting, Troubleshooting, Diagnostics and Debugging
Any running function will generate errors at some point, or you might need to troubleshoot or debug issues with running (or failed) functions. For this purpose, you need to collect errors and diagnostic logs from your functions and store them in a central service.
Implement error-handling strategies within your code (e.g., retries with exponential backoff) to minimize user impact during failures.
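A minimal, provider-agnostic sketch of the retry-with-exponential-backoff pattern in Python (the helper name and parameters are illustrative, not from any specific SDK; inside a function, keep total retry time well below the function timeout):

```python
import random
import time

def call_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=8.0):
    """Retry a transient-failure-prone call with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))  # full jitter spreads out retries

# Usage (hypothetical downstream call):
# call_with_backoff(lambda: downstream_client.get_item(Key={"id": "123"}))
```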
- When using AWS Lambda, use CloudWatch metrics to build graphs and dashboards, and to send alerts in response to changes in Lambda function activity, such as performance or error rates (see the alarm sketch after this list), as explained here: https://docs.aws.amazon.com/lambda/latest/dg/monitoring-metrics.html
- To troubleshoot issues with Lambda functions, refer to the documentation here: https://docs.aws.amazon.com/lambda/latest/dg/lambda-troubleshooting.html
- To display errors related to Azure Functions, refer to the documentation here: https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-error-pages
- To troubleshoot issues with Azure Functions, refer to the documentation here: https://learn.microsoft.com/en-us/azure/azure-functions/functions-diagnostics https://learn.microsoft.com/en-us/troubleshoot/azure/azure-functions/deployment/functions-deploying-runtime-issues-post-deployment
- To display errors related to Google Cloud Run Functions, use Error Reporting, as explained here: https://cloud.google.com/run/docs/error-reporting
- To troubleshoot issues with Google Cloud Run Functions, refer to the documentation here: https://cloud.google.com/run/docs/troubleshooting
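To illustrate the alerting side on AWS, here is a minimal boto3 sketch that creates a CloudWatch alarm on the built-in AWS/Lambda Errors metric; the alarm name, function name, and SNS topic ARN are hypothetical:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert when the function reports one or more errors in a 5-minute window
cloudwatch.put_metric_alarm(
    AlarmName="my-function-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],  # hypothetical name
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # hypothetical topic
)
```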
Scaling, Resource Management, Performance Tuning and Optimization
Analyze function performance metrics (duration/memory usage) to identify bottlenecks and adjust concurrency settings or provisioned capacity as needed for optimal resource utilization.
- When using AWS Lambda, use AWS Lambda Power Tuning to find the optimal memory configuration for your workload, as explained here: https://github.com/alexcasalboni/aws-lambda-power-tuning
- Be aware of Lambda quotas—runtime resource limits are often affected by factors like payload size, file descriptors, and /tmp storage space, which are frequently overlooked. For more information see: https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html
- In case you need to maintain consistent performance for a Lambda function, consider configuring reserved concurrency (see the concurrency sketch after this list), as explained here: https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
- To reduce cold starts of Lambda functions, consider configuring provisioned concurrency, as explained here: https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-environment.html#cold-starts-pc
- When using Azure Functions, consider changing the hosting plan to gain better performance or isolation, as explained here: https://learn.microsoft.com/en-us/azure/azure-functions/functions-scale
- To improve your Azure Functions performance, follow the guide below: https://learn.microsoft.com/en-us/azure/azure-functions/performance-reliability
- To reduce cold starts of Azure Functions, consider using the Premium plan, as explained here: https://learn.microsoft.com/en-us/azure/azure-functions/functions-premium-plan#eliminate-cold-starts
- When using Google Cloud Run Functions, use Recommender to gain recommendations for configuring Google Cloud Run Functions, as explained here: https://cloud.google.com/run/docs/recommender
- To reduce cold starts of Google Cloud Run Functions, consider setting a minimum number of instances, as explained here: https://cloud.google.com/run/docs/configuring/min-instances
- Be aware of Google Cloud Run Functions limits - runtime resource limits are often affected by factors like maximum deployment size, memory size, number of running functions, etc. For more information see: https://cloud.google.com/functions/quotas
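For the Lambda concurrency settings mentioned above, here is a minimal boto3 sketch; the function name, alias, and numbers are hypothetical and should come from your own load testing:

```python
import boto3

lambda_client = boto3.client("lambda")

# Reserved concurrency guarantees capacity for this function (and caps its maximum)
lambda_client.put_function_concurrency(
    FunctionName="my-function",              # hypothetical function name
    ReservedConcurrentExecutions=50,
)

# Provisioned concurrency keeps warm execution environments for an alias or published
# version (not $LATEST), reducing cold starts on latency-sensitive paths
lambda_client.put_provisioned_concurrency_config(
    FunctionName="my-function",
    Qualifier="live",                        # hypothetical alias
    ProvisionedConcurrentExecutions=5,
)
```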
Summary
In this blog post, I presented the most common Day 2 serverless operations when using Functions as a Service to build modern applications.
Transitioning from traditional to serverless development can be challenging, but I encourage readers to keep practicing and gaining hands-on experience. Moving beyond the initial deployment to focus on ongoing operations and maintenance is crucial, and I hope the topics covered here will prove valuable for managing serverless environments in daily work.
In the second part of this series, we will deep dive into serverless application integration services, so stay tuned.
About the author
Eyal Estrin is a cloud and information security architect, an AWS Community Builder, and the author of the books Cloud Security Handbook and Security for Cloud Native Applications, with more than 25 years in the IT industry.
You can connect with him on social media (https://linktr.ee/eyalestrin).
Opinions are his own and not the views of his employer.