In April 2023, I published a blog post called "Introduction to Day 2 Kubernetes", discussing the challenges of managing Kubernetes workloads in mature environments, once applications were already running in production.
In the software lifecycle there are usually three distinct stages:
- Day 0 – Planning and design
- Day 1 – Configuration and deployment
- Day 2 – Operations
Serverless services are a cloud-native application development and delivery model where developers can build and run code without having to provision, configure, or manage server infrastructure themselves. Many cloud-native services are considered serverless – from compute (such as Function as a Service) and storage (such as object storage) to databases, and more.
In this series of blog posts, I will review the common Day 2 serverless operations.
Part 1 will focus on common operations for Function as a Service (FaaS), and part 2 will focus on application integration services.
Configuration and Revision Management
At this stage, you manage the configuration and the version of the functions you deploy, so you will be able to revert to a previous version in case of problems with the deployment or with your application.
- When using AWS Lambda, use versions to manage the deployment of your Lambda functions, and use an alias as a pointer to the version you would like clients to invoke (see the sketch after this list), as explained here: https://docs.aws.amazon.com/lambda/latest/dg/configuration-versions.html
- When using Azure Functions, you can manage various aspects of the functions configuration such as hosting plan types, memory quota, scale, environment variables, network settings, etc., as explained here: https://learn.microsoft.com/en-us/azure/azure-functions/functions-how-to-use-azure-function-app-settings
- When using Google Cloud Run Functions, you can configure settings such as memory, concurrency, environment variables, network settings, etc., as explained here: https://cloud.google.com/run/docs/deploy-functions
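To illustrate the Lambda versions/aliases flow, here is a minimal sketch using boto3, assuming a function named my-function and an existing alias named live (both names are hypothetical); since clients invoke the alias ARN, a rollback is simply repointing the alias:

```python
import boto3

lambda_client = boto3.client("lambda")

# Publish an immutable version from the function's current code and configuration
new_version = lambda_client.publish_version(
    FunctionName="my-function",            # hypothetical function name
    Description="release candidate",
)["Version"]

# Point the "live" alias (assumed to already exist) at the new version
lambda_client.update_alias(
    FunctionName="my-function",
    Name="live",
    FunctionVersion=new_version,
)

# Rolling back means repointing the alias at the previous version, for example:
# lambda_client.update_alias(FunctionName="my-function", Name="live", FunctionVersion="3")
```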
Runtime engine updates
The base assumption at this stage is that the function has already been configured and had its initial deployment, but as time goes by, there will be newer versions of the function's runtime engine.
Although the recommendation is to use the latest stable version of the runtime engine, changing between major versions may require code adjustments and rigorous testing.
- When using AWS Lambda, the default runtime update mode is set to "Auto", which means AWS applies the latest runtime version whenever customers create or update a function, and later automatically updates all existing functions that haven't yet been moved to the latest runtime version (a sketch of controlling this behavior follows the list). For container-based Lambda functions, customers need to manually rebuild the base container image using the latest runtime version and redeploy the Lambda function, as explained here: https://docs.aws.amazon.com/lambda/latest/dg/runtimes-update.html
- When using Azure Functions, the FUNCTIONS_EXTENSION_VERSION setting pins the runtime to a major version, and minor updates within that major version are applied automatically. Upgrading to a new major runtime version of Azure Functions requires manual work, including testing before deploying, as explained here: https://learn.microsoft.com/en-us/azure/azure-functions/set-runtime-version https://learn.microsoft.com/en-us/troubleshoot/azure/azure-functions/config-mgmt/functions-configuring-updateversion
- When using Google Cloud Run Functions, minor updates are applied automatically; however, upgrading to a new major version of the runtime engine will require redeploying the functions, as explained here: https://cloud.google.com/functions/docs/runtime-support
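For the AWS bullet above, here is a minimal boto3 sketch of controlling when Lambda applies runtime updates and of moving to a newer runtime; the function name and target runtime are hypothetical examples:

```python
import boto3

lambda_client = boto3.client("lambda")

# Choose when Lambda applies runtime updates: "Auto" (default), "FunctionUpdate",
# or "Manual" (the latter also requires a specific RuntimeVersionArn)
lambda_client.put_runtime_management_config(
    FunctionName="my-function",        # hypothetical function name
    UpdateRuntimeOn="FunctionUpdate",  # pick up new runtime versions only on deployments
)

# Moving to a newer major runtime is an explicit change that should be tested first
lambda_client.update_function_configuration(
    FunctionName="my-function",
    Runtime="python3.12",
)
```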
Security, Networking, and Access Control
At this stage, you configure network and security settings to protect your functions, before exposing them to clients.
This includes reviewing network access control lists, the deployment location (inside or outside your cloud virtual network, depending on the resources the function needs to reach), and identity and access management (scoped to the cloud resources the function needs to access, such as storage, databases, etc.).
- When using AWS Lambda, in case the function needs access to private AWS resources, deploy the function inside your VPC, as explained here: https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc.html
- To grant a Lambda function access to other AWS resources, configure the Lambda function with an IAM role for its execution role, following the principle of least privilege, as explained here: https://docs.aws.amazon.com/lambda/latest/dg/lambda-intro-execution-role.html
- In case the Lambda function needs access to resources using static credentials (such as API keys), configure the Lambda function to pull the secrets from AWS Secrets Manager (see the sketch after this list), as explained here: https://docs.aws.amazon.com/lambda/latest/dg/with-secrets-manager.html
- When using Azure Functions, in case the function needs access to private Azure resources, use virtual network integration, and enforce access to the function using Network Security Groups, as explained here: https://learn.microsoft.com/en-us/azure/azure-functions/functions-networking-options
- To grant an Azure Function access to other Azure resources, configure managed identity, following the principle of least privilege, as explained here: https://learn.microsoft.com/en-us/azure/app-service/overview-managed-identity
- In case the Azure Function needs access to resources using static credentials (such as secrets), use Azure Key Vault references, as explained here: https://learn.microsoft.com/en-us/azure/app-service/app-service-key-vault-references
- When using Google Cloud Run Functions, in case the function needs access to private GCP resources, use Serverless VPC Access, as explained here: https://cloud.google.com/functions/1stgendocs/networking/connecting-vpc
- To grant a Cloud Run Function access to other GCP resources, configure a function identity, and grant the identity minimal permissions, following the principle of least privilege, as explained here: https://cloud.google.com/functions/docs/securing/function-identity
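As an example of the Secrets Manager pattern mentioned above, here is a minimal Python handler sketch using boto3; the secret name and JSON key are hypothetical, and in production you would typically cache the secret (or use the AWS Parameters and Secrets Lambda extension) instead of fetching it on every invocation:

```python
import json
import boto3

# Created outside the handler so the client is reused across warm invocations
secrets_client = boto3.client("secretsmanager")

def lambda_handler(event, context):
    # The function's execution role only needs secretsmanager:GetSecretValue on this secret
    secret = secrets_client.get_secret_value(SecretId="prod/my-app/api-key")  # hypothetical name
    api_key = json.loads(secret["SecretString"])["api_key"]                   # hypothetical key
    # ... call the downstream service using api_key ...
    return {"statusCode": 200}
```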
Audit and Compliance
At this stage, you need to make sure your functions automatically send their audit logs to a central system. Combined with threat intelligence services that regularly review the audit logs, this lets you get alerted on security-related events (such as anomalous behavior).
- When using AWS Lambda, configure a trail to send CloudTrail events to a central S3 bucket in a central AWS account (see the sketch after this list), as explained here: https://docs.aws.amazon.com/lambda/latest/dg/logging-using-cloudtrail.html
- To detect security threats in Lambda functions, configure Lambda function protection in Amazon GuardDuty, as explained here: https://docs.aws.amazon.com/guardduty/latest/ug/lambda-protection.html
- When using Azure Functions, to be able to collect audit logs into Azure Monitor, configure diagnostic settings (in a central Azure subscription), and select “Audit” and “AuditEvent”, as explained here: https://learn.microsoft.com/en-us/azure/azure-monitor/platform/create-diagnostic-settings
- In case the Azure Function is deployed inside an App Service plan, use Defender for App Service (part of Microsoft Defender for Cloud), to identify security threats, as explained here: https://learn.microsoft.com/en-us/azure/defender-for-cloud/defender-for-app-service-introduction
- When using Google Cloud Run Functions, configure a central log bucket and route all function audit logs to it (in a central GCP project), as explained here: https://cloud.google.com/logging/docs/audit
- To detect security threats in Google Cloud Run Functions, use Google SecOps, as explained here: https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-audit-logs
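For the AWS part, here is a minimal boto3 sketch of creating a multi-region trail that writes to a central S3 bucket and captures Lambda data events; the trail and bucket names are hypothetical, and the bucket policy must already allow CloudTrail to write to it:

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# A single multi-region trail delivering audit events to a central S3 bucket
cloudtrail.create_trail(
    Name="org-central-trail",             # hypothetical trail name
    S3BucketName="central-audit-logs",    # hypothetical bucket in the central account
    IsMultiRegionTrail=True,
)

# Management events are captured by default; add data events to record Lambda invocations
cloudtrail.put_event_selectors(
    TrailName="org-central-trail",
    EventSelectors=[{
        "ReadWriteType": "All",
        "IncludeManagementEvents": True,
        "DataResources": [{"Type": "AWS::Lambda::Function", "Values": ["arn:aws:lambda"]}],
    }],
)

cloudtrail.start_logging(Name="org-central-trail")
```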
Monitoring, Logging, Observability and Alerting
Continuously track application health, performance, and security events using tools for real-time insights. This includes setting up dashboards and alerts to detect anomalies and issues before they impact users.
- When using AWS Lambda, send all function logs to CloudWatch Logs (see the structured-logging sketch after this list), as explained here: https://docs.aws.amazon.com/lambda/latest/dg/monitoring-cloudwatchlogs.html
- To gain visibility into Lambda performance, use CloudWatch Lambda Insights, as explained here: https://docs.aws.amazon.com/lambda/latest/dg/monitoring-insights.html
- When using Azure Functions, to be able to collect all logs and metrics into Azure Monitor, configure diagnostic settings (in a central Azure subscription), as explained here: https://learn.microsoft.com/en-us/azure/azure-monitor/platform/create-diagnostic-settings https://learn.microsoft.com/en-us/azure/azure-functions/functions-monitoring
- To gain visibility into Azure Functions, use Application Insights (part of Azure Monitor), as explained here: https://learn.microsoft.com/en-us/azure/azure-functions/configure-monitoring
- When using Google Cloud Run Functions, route all function logs through Google Cloud Logging to a central log bucket (in a central GCP project), as explained here: https://cloud.google.com/logging/docs/central-log-storage
- To gain insights into Google Cloud Run Functions, use Google Cloud Observability, as explained here: https://cloud.google.com/monitoring/docs/monitoring-overview
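Since CloudWatch Logs (and Logs Insights) work best with structured log lines, here is a minimal sketch of a Python Lambda handler emitting JSON logs; the event fields are hypothetical:

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    # One JSON object per log line; easy to filter and aggregate in CloudWatch Logs Insights
    logger.info(json.dumps({
        "event": "request_processed",                            # hypothetical field
        "request_id": context.aws_request_id,
        "remaining_ms": context.get_remaining_time_in_millis(),
    }))
    return {"statusCode": 200}
```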
Error Reporting, Troubleshooting, Diagnostics and Debugging
Any running function will generate errors at some point, or you might need to troubleshoot or debug issues with running (or failed) functions. For this purpose, you need to collect errors and diagnostic logs from your functions and store them in a central service.
Implement error-handling strategies within your code (e.g., retries with exponential backoff) to minimize user impact during failures.
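A minimal, provider-agnostic sketch of the retry-with-exponential-backoff pattern in Python (the helper name and parameters are illustrative, not from any specific SDK; inside a function, keep total retry time well below the function timeout):

```python
import random
import time

def call_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=8.0):
    """Retry a transient-failure-prone call with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))  # full jitter spreads out retries

# Usage (hypothetical downstream call):
# call_with_backoff(lambda: downstream_client.get_item(Key={"id": "123"}))
```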
- When using AWS Lambda, use CloudWatch metrics to build graphs and dashboards, and to send alerts in response to changes in Lambda function activity, such as performance or error rates (see the alarm sketch after this list), as explained here: https://docs.aws.amazon.com/lambda/latest/dg/monitoring-metrics.html
- To troubleshoot issues with Lambda functions, refer to the documentation here: https://docs.aws.amazon.com/lambda/latest/dg/lambda-troubleshooting.html
- To display errors related to Azure Functions, refer to the documentation here: https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-error-pages
- To troubleshoot issues with Azure Functions, refer to the documentation here: https://learn.microsoft.com/en-us/azure/azure-functions/functions-diagnostics https://learn.microsoft.com/en-us/troubleshoot/azure/azure-functions/deployment/functions-deploying-runtime-issues-post-deployment
- To display errors related to Google Cloud Run Functions, use Error Reporting, as explained here: https://cloud.google.com/run/docs/error-reporting
- To troubleshoot issues with Google Cloud Run Functions, refer to the documentation here: https://cloud.google.com/run/docs/troubleshooting
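To illustrate the alerting side on AWS, here is a minimal boto3 sketch that creates a CloudWatch alarm on the built-in AWS/Lambda Errors metric; the alarm name, function name, and SNS topic ARN are hypothetical:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert when the function reports one or more errors in a 5-minute window
cloudwatch.put_metric_alarm(
    AlarmName="my-function-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],  # hypothetical name
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # hypothetical topic
)
```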
Scaling, Resource Management, Performance Tuning and Optimization
Analyze function performance metrics (duration/memory usage) to identify bottlenecks and adjust concurrency settings or provisioned capacity as needed for optimal resource utilization.
- When using AWS Lambda, use AWS Lambda Power Tuning to find the optimal memory configuration for your workload, as explained here: https://github.com/alexcasalboni/aws-lambda-power-tuning
- Be aware of Lambda quotas—runtime resource limits are often affected by factors like payload size, file descriptors, and /tmp storage space, which are frequently overlooked. For more information see: https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html
- In case you need to maintain consistent performance for a Lambda function, consider configuring reserved concurrency (see the concurrency sketch after this list), as explained here: https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
- To reduce cold starts of Lambda functions, consider configuring provisioned concurrency, as explained here: https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-environment.html#cold-starts-pc
- When using Azure Functions, consider changing the hosting plan to gain better performance or isolation, as explained here: https://learn.microsoft.com/en-us/azure/azure-functions/functions-scale
- To improve your Azure Functions performance, follow the guide below: https://learn.microsoft.com/en-us/azure/azure-functions/performance-reliability
- To reduce cold starts of Azure Functions, consider using the Premium plan, as explained here: https://learn.microsoft.com/en-us/azure/azure-functions/functions-premium-plan#eliminate-cold-starts
- When using Google Cloud Run Functions, use Recommender to gain recommendations for configuring Google Cloud Run Functions, as explained here: https://cloud.google.com/run/docs/recommender
- To reduce cold starts of Google Cloud Run Functions, consider setting a minimum number of instances, as explained here: https://cloud.google.com/run/docs/configuring/min-instances
- Be aware of Google Cloud Run Functions limits - runtime resource limits are often affected by factors like maximum deployment size, memory size, number of running functions, etc. For more information see: https://cloud.google.com/functions/quotas
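For the Lambda concurrency settings mentioned above, here is a minimal boto3 sketch; the function name, alias, and numbers are hypothetical and should come from your own load testing:

```python
import boto3

lambda_client = boto3.client("lambda")

# Reserved concurrency guarantees capacity for this function (and caps its maximum)
lambda_client.put_function_concurrency(
    FunctionName="my-function",              # hypothetical function name
    ReservedConcurrentExecutions=50,
)

# Provisioned concurrency keeps warm execution environments for an alias or published
# version (not $LATEST), reducing cold starts on latency-sensitive paths
lambda_client.put_provisioned_concurrency_config(
    FunctionName="my-function",
    Qualifier="live",                        # hypothetical alias
    ProvisionedConcurrentExecutions=5,
)
```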
Summary
In this blog post, I presented the most common Day 2 serverless operations when using Functions as a Service to build modern applications.
Transitioning from traditional to serverless development can be challenging, but I encourage readers to keep practicing and gaining hands-on experience. Moving beyond the initial deployment to focus on ongoing operations and maintenance is crucial, and I hope the topics covered here will prove valuable for managing serverless environments in daily work.
In the second part of this series, we will deep dive into serverless application integration services, so stay tuned.
About the author
Eyal Estrin is a cloud and information security architect, an AWS Community Builder, and the author of the books Cloud Security Handbook and Security for Cloud Native Applications, with more than 25 years in the IT industry.
You can connect with him on social media (https://linktr.ee/eyalestrin).
Opinions are his own and not the views of his employer.