Lob’s core API has been fully migrated to HashiCorp's Nomad, Lob’s next-generation service platform. This is a major milestone for the Nomad Project, the Platform Team, and Lob Engineering. This migration is the culmination of a year of R&D, months of practice migrating other Lob services, and weeks of work on this particular service. Given its complexity and customer impact, it’s absolutely worth celebrating.
TLDR: Lob’s core API has migrated from Convox to Nomad. Our API is running better than it was before, and with Nomad we have the tools to tackle platform issues we previously could not address.
The way we run code for our API and dozens of supporting tasks has grown organically based on the needs and constraints of the business. Most apps at Lob ran on Convox, a straightforward service management tool we license, but a handful of workloads have needs that Convox does not meet! Historically, these have run on AWS Lambda for security or scaling reasons, on Heroku for development ease, or on ECS for various customization needs.
The point is that this artisanal, organic, Non-GMO architecture was and continues to be a huge burden on Lob Engineering. The Platform team needs to know half a dozen tools very well, and the rest of engineering must settle for a narrow feature set or spend weeks ramping up on a different technology. This is a drag on all new products, features, and bug fixes, resulting in lower engineering velocity and a lot of hair-pulling.
It was clear back in 2021 that Lob needed to consolidate how we run code and that none of our current tools were up to the task; it wasn’t a matter of if Lob would upgrade to something new, but when. The Platform team kicked off a research project to find Lob’s next service platform. Forever ago (back in 2019) we investigated migrating to Kubernetes, a popular but notoriously difficult-to-manage tool for this sort of thing, but that project fizzled out for many reasons, forcing us to consider something else. We chose Nomad, which offers a comparable feature set to Kubernetes in a much more streamlined package. Nomad is developed by HashiCorp, a leader in the DevOps space, and is used by companies like Pandora, Cloudflare, Internet Archive, and Roblox.
Nomad is able to meet all of Lob’s business needs today and into the future, allowing us to consolidate where we run most of our code onto one platform. We are able to specialize in this tool, create custom workflows to meet Lob’s changing needs, and tune the underlying architecture to outperform our old solution(s). Where before we had to stretch ourselves thin, or live without some features, Nomad is a one-stop shop.
Our core API running on Nomad has performance parity with our old container orchestrator, and it’s even better in some ways. Here’s how:
- Lob releases happen via GitHub Actions, giving us better visibility into when the last release was, what got deployed, if it failed, and who kicked it off.
- Ad-hoc tasks like Database Migrations and Scripts are also managed in GitHub Actions, moving even more processes off of your laptop and onto the cloud.
- Nomad offers a much richer UI, enabling Lobsters to gain better visibility into how their app is working or debug their app when it isn’t.
- Nomad’s autoscaling is much more customizable, allowing us to scale our API more responsively to meet spikes in customer requests.
- Tag-driven releases allow us to audit when code was deployed and promoted to customers.
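To give a feel for what "more customizable" autoscaling can mean, here is a minimal sketch of target-value scaling, where the instance count is adjusted in proportion to how far a metric sits from its target. This is an illustration only, not Lob's actual policy or Nomad's API; the function name, the metric, and the numbers are all hypothetical.

```python
import math


def desired_count(current: int, metric: float, target: float,
                  min_count: int, max_count: int) -> int:
    """Return a new instance count scaled by metric/target, clamped to bounds.

    Hypothetical helper: `metric` could be something like requests per
    second per instance, and `target` the value we want it to hover around.
    """
    if target <= 0:
        raise ValueError("target must be positive")
    base = max(current, 1)  # avoid being stuck at zero instances
    raw = math.ceil(base * metric / target)
    return max(min_count, min(max_count, raw))


# A spike to 150 against a target of 100 grows 4 instances to 6.
print(desired_count(4, 150.0, 100.0, min_count=2, max_count=10))
```

Clamping to a minimum and maximum keeps a noisy metric from scaling the service to zero or running up the bill.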
The real reason to migrate to Nomad is the laundry list of shiny new features that unlock major potential down the road…
- Hosting containers on AWS Spot Instances should save us money on our AWS bill.
- Middleware for handling authentication and other business needs at the edge.
- Canary deployments, finding regressions before rolling them out to all customers.
- Federated services, hosting our apps in AWS regions closer to our customers.
- Autoscaling based on custom metrics like workload prediction.
Most of our API’s migration to Nomad happened in August, but the key feature of autoscaling was not working as expected. This turned out to be a bug in Nomad which James Douglas tracked down. The issue was recently fixed and autoscaling works as expected, completing the migration!
This migration, like all of Lob’s Nomad migrations, was designed to be zero downtime. We slowly migrate traffic to Nomad and roll back if we notice any regressions in order to minimize customer impact; in most cases, customers cannot tell that this type of change took place. Autoscaling was technically optional, since we could just run Lob at max, but that would be a waste of resources and wouldn't really bring us to parity with the old Convox deployment.
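The "migrate slowly, roll back on regression" idea can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the tooling Lob actually used: `set_weight` and `healthy` are hypothetical callbacks standing in for whatever adjusts the traffic split and checks for regressions.

```python
from typing import Callable, List, Sequence


def shift_traffic(steps: Sequence[int],
                  set_weight: Callable[[int], None],
                  healthy: Callable[[], bool]) -> bool:
    """Move traffic to the new platform one step at a time.

    `steps` are percentages of traffic sent to the new platform.
    If a health check fails after any step, all traffic is sent back
    to the old platform (weight 0) and the function returns False.
    """
    for pct in steps:
        set_weight(pct)
        if not healthy():
            set_weight(0)  # roll back: everything returns to the old platform
            return False
    return True


# Usage with stand-in callbacks that just record what happened.
applied: List[int] = []
ok = shift_traffic([5, 25, 50, 100], applied.append, lambda: True)
print(ok, applied)
```

In practice the health check would wait and watch error rates or latency for a while before moving on to the next step; the sketch leaves that out for brevity.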
50+ services have now been migrated successfully!
This is the type of project that cannot happen in a vacuum. The Platform team is incredibly thankful for all of the Lobsters past and present who have helped us out. You have lent us your time, expertise, patience, and trust in a project that really needed it. While it’s easy to move on to the next big project, moving our core API to Nomad is an accomplishment worth taking a moment to celebrate.
Special thanks to Senior Platform Engineer Elijah Voigt.