This post was originally published to blinkops.com by Ops Community members @haviv and @johnson_brad.
Serverless, microservices, and containerized architectures unlock flexibility for cloud engineering teams. By adopting a serverless approach, cloud engineers have the freedom to deploy different tools, each chosen to solve a specific use case in the most effective way.
But using a myriad of different cloud tools creates a logistical nightmare for the DevOps and SRE teams managing all those different applications. In practice, this looks like DevOps and SREs logging in and out of different cloud platforms, searching through platform documentation, and troubleshooting API connections just to create an integration between two different services. Then, rinse and repeat for every additional integration project that comes down the pipeline.
This post will seek to highlight some of the challenges and pitfalls of building integrations between cloud tools.
Where did all these API integration projects come from?
In the most basic terms, modern DevOps and SRE teams need to write integrations to perform their everyday work duties. Whether it’s adding a new cloud tool to the development toolbox, improving observability into cloud architecture, or enabling a new application feature, cloud engineers are constantly wiring different cloud tools and APIs together.
DevOps workflows also involve multiple cloud tools. In the course of a single day, a DevOps practitioner or SRE might expect to log in to:
- Monitoring tools (DataDog, New Relic, etc.)
- Cloud and infrastructure tools (AWS, GCP, K8s, etc.)
- Communication tools (Slack, Discord, etc.)
- Project management tools (JIRA, Asana, etc.)
- On-call tools (PagerDuty, etc.)
As DevOps engineers and SREs become increasingly embedded in product and project development teams, individual cloud engineers are often left as the sole experts for numerous platforms. Their daily experience involves context-switching between cloud platforms. Factor in the intricacies of different platforms’ APIs and terminology, and integration projects become a recipe for complexity and inefficiency.
Lots of daily DevOps tasks require custom integrations work
There are many daily DevOps and SecOps tasks that require custom integration work. Whether you’re trying to get log data from a cloud platform into an observability tool or automate an on-call response workflow, eventually you’ll need to connect tools together. Many platforms offer native integrations, but for more complex use cases, cloud-native DevOps engineers and developers will most likely find themselves building a custom integration using scripts and APIs.
Let’s explore some examples where a cloud engineer might need to build a custom integration:
Suppose you want to expose a webhook that receives an alert from your monitoring tool and automatically kickstarts an on-call process. Once an alert is received, you want to perform the following steps in order:
- Create a new Slack channel
- Open an on-call bridge (Zoom)
- Invite relevant team members to the Slack channel
- Enrich the alert with relevant data from different systems (run kubectl commands, fetch logs, fetch status page data, fetch latest GitHub PRs, etc.)
It’s unlikely there’s a native integration capable of supporting such an involved workflow. Instead, to enable this workflow, a custom integration is going to be required.
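As a rough illustration of what the "glue" for such a workflow could look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Incident` type, the `run_on_call_workflow` function, and the four step functions are hypothetical stand-ins, and a real integration would replace their bodies with calls to the Slack and Zoom APIs, kubectl, and so on.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Incident:
    """State accumulated while the on-call workflow runs."""
    alert: Dict[str, str]
    channel: str = ""
    bridge_url: str = ""
    invited: List[str] = field(default_factory=list)
    context: Dict[str, str] = field(default_factory=dict)


# Each step receives the incident and records its result on it.
Step = Callable[[Incident], None]


def run_on_call_workflow(alert: Dict[str, str], steps: List[Step]) -> Incident:
    """Run the steps in order. An exception in any step stops the run,
    so later steps never act on a half-built incident."""
    incident = Incident(alert=alert)
    for step in steps:
        step(incident)
    return incident


# Hypothetical stand-ins for the real Slack, Zoom, and enrichment calls.
def create_channel(incident: Incident) -> None:
    incident.channel = f"incident-{incident.alert['id']}"


def open_bridge(incident: Incident) -> None:
    incident.bridge_url = f"https://example.invalid/bridge/{incident.alert['id']}"


def invite_team(incident: Incident) -> None:
    incident.invited = ["on-call-primary", "on-call-secondary"]


def enrich_alert(incident: Incident) -> None:
    incident.context["service"] = incident.alert.get("service", "unknown")


if __name__ == "__main__":
    result = run_on_call_workflow(
        {"id": "42", "service": "checkout"},
        [create_channel, open_bridge, invite_team, enrich_alert],
    )
    print(result)
```

Even in this toy form, the ordering matters: the invite step needs the channel created by the first step, which is why the steps run sequentially and share one incident object.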
Scheduling Day-2 Operations tasks
There are some processes that DevOps engineers and SREs must run on a recurring schedule, for example daily or weekly, and that require custom integrations in order to perform the required task.
For example, consider these workflows:
Checking that all employees have appropriate Auth and SSO tools installed
Suppose that at the end of every day, you needed to validate that all employees registered in Okta have an instance of JumpCloud installed, then send a report by email with a list of any offenders without the appropriate tools installed. If you want to coordinate across all these tools, you’ll need a custom integration.
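The core of that check is a set comparison. The sketch below assumes you have already exported a list of user emails from each system; how you fetch those lists (API calls, pagination, credentials) is deliberately left out, and the function names `find_unenrolled` and `build_report` are hypothetical.

```python
from typing import Iterable, List


def find_unenrolled(okta_emails: Iterable[str],
                    jumpcloud_emails: Iterable[str]) -> List[str]:
    """Return Okta users with no matching JumpCloud record, sorted.

    Emails are normalized (trimmed, lowercased) so that formatting
    differences between the two systems do not cause false positives.
    """
    enrolled = {email.strip().lower() for email in jumpcloud_emails}
    registered = {email.strip().lower() for email in okta_emails}
    return sorted(registered - enrolled)


def build_report(missing: List[str]) -> str:
    """Build the plain-text body of the end-of-day email."""
    if not missing:
        return "All Okta users have JumpCloud installed."
    lines = [f"{len(missing)} user(s) without JumpCloud:"]
    lines += [f"- {email}" for email in missing]
    return "\n".join(lines)


if __name__ == "__main__":
    missing = find_unenrolled(
        ["A@x.com", "b@x.com ", "c@x.com"],  # made-up Okta export
        ["a@x.com"],                         # made-up JumpCloud export
    )
    print(build_report(missing))
```

The comparison itself is the easy part. Most of the real effort goes into authenticating against both systems, scheduling the job, and sending the email reliably.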
Finding Kubernetes orphaned resources
It’s hard enough finding orphaned resources, but how do you notify the appropriate teams who can remediate any issues? You’ll need to build a custom integration to find orphaned Kubernetes resources and send a notification to the relevant team for approval before deletion.
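To give a feel for the "find" half of that problem, here is a minimal sketch that looks for ConfigMaps no pod refers to. It assumes the inputs are already-parsed items from `kubectl get pods -o json` and `kubectl get configmaps -o json` for a single namespace, and that ownership is recorded in a `team` label, which is a hypothetical convention. It only inspects volumes, `envFrom`, and `env` references on regular containers, so init containers and projected volumes would need extra handling.

```python
from typing import Any, Dict, List, Set


def referenced_configmaps(pods: List[Dict[str, Any]]) -> Set[str]:
    """Collect ConfigMap names referenced by pod volumes, envFrom, or env."""
    names: Set[str] = set()
    for pod in pods:
        spec = pod.get("spec", {})
        for volume in spec.get("volumes", []):
            if "configMap" in volume:
                names.add(volume["configMap"]["name"])
        for container in spec.get("containers", []):
            for source in container.get("envFrom", []):
                if "configMapRef" in source:
                    names.add(source["configMapRef"]["name"])
            for env in container.get("env", []):
                ref = env.get("valueFrom", {}).get("configMapKeyRef")
                if ref:
                    names.add(ref["name"])
    return names


def orphans_by_team(configmaps: List[Dict[str, Any]],
                    pods: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Group unreferenced ConfigMaps by their (hypothetical) `team` label,
    so each team can be asked to approve deletion of its own resources."""
    used = referenced_configmaps(pods)
    result: Dict[str, List[str]] = {}
    for configmap in configmaps:
        meta = configmap["metadata"]
        if meta["name"] in used:
            continue
        team = meta.get("labels", {}).get("team", "unowned")
        result.setdefault(team, []).append(meta["name"])
    return result
```

The notification and approval half (messaging the right team, waiting for a response, then deleting) is where the custom integration work really begins.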
Event-based tasks (IFTTT scenarios)
Sometimes an event is the trigger for a DevOps process. This may require the creation of a custom integration in order to:
- Define a condition, like when a certain cloud resource is created (a new AWS EC2 instance is created, a new S3 bucket is created, etc.)
- Perform checks against certain resources (for example, the existence of tags, TTL, etc.)
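A minimal sketch of the "check" half might look like the following. The event shape (a dict with `resource_id` and `tags` keys), the `REQUIRED_TAGS` policy, and both function names are assumptions made for illustration; a real handler would first translate the cloud provider's event format into something like this.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical policy: every new resource needs an owner and a TTL tag.
REQUIRED_TAGS = ("owner", "ttl")


def missing_tags(tags: Dict[str, str],
                 required: Iterable[str] = REQUIRED_TAGS) -> List[str]:
    """Return the required tag names that are absent or blank."""
    present = {key.lower() for key, value in tags.items() if value.strip()}
    return [name for name in required if name.lower() not in present]


def handle_resource_created(event: Dict[str, Any]) -> str:
    """React to a 'resource created' event by validating its tags."""
    missing = missing_tags(event.get("tags", {}))
    if not missing:
        return "ok"
    return f"{event['resource_id']} is missing tags: {', '.join(missing)}"


if __name__ == "__main__":
    print(handle_resource_created(
        {"resource_id": "i-123", "tags": {"owner": "team-a"}}
    ))
```

In practice the returned message would feed the next step of the workflow, such as a Slack notification to the resource owner.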
Self-service or on-demand tasks
In order to enable coworkers to run a workflow on their own, you may need to create a custom integration and expose it as a self-service workflow. Self-service workflows can enable other developers to run purpose-built tasks such as:
- Scaling a Kubernetes HPA up or down
- Running whitelisted kubectl commands against your production services
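For the kubectl case, the heart of a self-service workflow is a validation step that decides which commands are allowed through. The sketch below is one possible approach, not a complete security control: the `ALLOWED_VERBS` and `FORBIDDEN_FLAGS` lists are illustrative and not exhaustive, and `validate_kubectl` is a hypothetical helper. It is deliberately strict and requires the verb to come directly after `kubectl`.

```python
import shlex
from typing import List

# Hypothetical read-only allowlist (whitelist) of kubectl verbs.
ALLOWED_VERBS = {"get", "describe", "logs", "top"}

# Flags that could point kubectl at another cluster or identity.
# Illustrative only; a production list would need careful review.
FORBIDDEN_FLAGS = {"--kubeconfig", "--context", "--cluster", "--server",
                   "-s", "--token", "--as", "--as-group"}


def validate_kubectl(command: str) -> List[str]:
    """Parse a requested command and return its argv if it is allowed.

    Raises ValueError for anything outside the allowlist.
    """
    argv = shlex.split(command)
    if len(argv) < 2 or argv[0] != "kubectl":
        raise ValueError("expected a command of the form 'kubectl <verb> ...'")
    if argv[1] not in ALLOWED_VERBS:
        raise ValueError(f"verb {argv[1]!r} is not allowed")
    for arg in argv[2:]:
        # Handle both '--flag value' and '--flag=value' spellings.
        if arg.split("=", 1)[0] in FORBIDDEN_FLAGS:
            raise ValueError(f"flag {arg!r} is not allowed")
    return argv


if __name__ == "__main__":
    print(validate_kubectl("kubectl get pods -n prod"))
```

The returned list is meant to be passed to something like `subprocess.run(argv)` without `shell=True`, so that characters such as `;` or `|` in the request are treated as plain arguments rather than interpreted by a shell.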
Building a new integration is deceptively complex
So what about building application integrations is so difficult? Aren’t APIs designed for the explicit purpose of connecting to another application? The reality is that integrations are rarely so straightforward, and require a fair amount of upfront consideration to implement successfully.
Here are some of the common pitfalls and gotchas that developers sometimes forget to think about when building an integration:
- Connections/Authentication: How will you establish credentials and verify identity between different tools? There are so many ways to authenticate: OAuth, Basic Auth, API keys, Bearer tokens. And this isn’t even mentioning a process for refreshing tokens and storing credentials and access information.
- Retries: When there are failures, how are you going to detect them? Once an error is detected, for which errors will you retry sending the request? For example, if you get a 400 (bad request), 401 (unauthorized), or 403 (forbidden) error, there is no point in retrying, since the request is malformed or there are authentication-related issues that a retry will not fix, but if you get a 429 error (too many requests) or 503 error (service unavailable), it makes sense to retry.
- Rate limits: How do you plan to recover from rate limit errors? Will your integration simply fail?
- Testing: How do you keep your integration functioning at all times? How are you going to test your integration’s performance and security over time?
- Error handling: When an error occurs with your integration, what steps will you take to troubleshoot it? How will you know what is wrong?
- Performance: How will you monitor integration performance? How will you know if the integration is behaving as expected or incorrectly?
- Platform: What platform will run the integration code (Lambda, microservices, host it on prem)?
- CI/CD: How will you deploy your integration code? Does your CI/CD platform support using this cloud tool?
- Observability: How will you make sure the above solutions are always up and running? How do you get visibility into the overall status of executions?
And this is hardly a comprehensive list. There are countless other variables to consider depending on your particular integration use case.
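To make the retry and rate-limit questions from the list above more concrete, here is a minimal sketch of one common approach: retry only on status codes that signal a temporary condition, honor a `Retry-After` header when the server sends one, and otherwise back off exponentially with jitter. The names (`should_retry`, `backoff_delay`, `call_with_retries`), the set of retryable codes, and the `(status, headers)` response shape are all assumptions for illustration. The sketch only understands the seconds form of `Retry-After` (not the HTTP-date form) and looks the header up with that exact capitalization.

```python
import random
import time
from typing import Callable, Mapping, Optional, Tuple

# Assumption: these statuses indicate a temporary condition worth retrying.
RETRYABLE_STATUSES = {429, 502, 503, 504}

Response = Tuple[int, Mapping[str, str]]  # (status code, headers)


def should_retry(status: int) -> bool:
    """Client errors such as 400, 401, or 403 will not fix themselves."""
    return status in RETRYABLE_STATUSES


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (starting at 1)."""
    if retry_after is not None and retry_after.strip().isdigit():
        # The server told us how long to wait; respect it up to the cap.
        return min(float(retry_after), cap)
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return random.uniform(0, ceiling)  # jitter spreads out retrying clients


def call_with_retries(send: Callable[[], Response], max_attempts: int = 4,
                      sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call `send` up to `max_attempts` times and return the last response."""
    status, headers = send()
    for attempt in range(1, max_attempts):
        if not should_retry(status):
            break
        sleep(backoff_delay(attempt, headers.get("Retry-After")))
        status, headers = send()
    return status, headers
```

Here `send` is whatever function performs the actual HTTP request, and `sleep` is injectable so the logic can be tested without waiting. Note that this covers only two of the bullets above. Credentials, observability, and deployment each need similar care.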
Advantages of using a no-code automation platform
No-code automation platforms like Blink benefit from pre-existing integrations with cloud and SaaS tools, offloading the burden of repetitive integration efforts from your DevOps teams, so you can focus on creating the workflow you desire. By adopting a no-code automation platform for your cloud operations, you can eliminate the development effort required for creating custom cloud integrations, thereby:
- Reducing time to ship
- Improving the quality of your workflows (more reliable, resilient, auditable)
- Decreasing maintenance effort
- Solving all the integration complexities listed above
Want to start building DevOps and SecOps workflows faster today? Sign up for Blink early access to take advantage of hundreds of automations purpose-built for solving common cloud operations challenges.