To reach our goal of becoming the world's largest sender of direct mail, we need systems that are robust and scalable; one place where we've made major upgrades to support our next phase of growth is our webhooks.
Our webhooks product is a labor of love, built from the ground up with our own blood, sweat, and tears. But when just maintaining it was bringing our engineers to tears, it was time to think about an overhaul.
Webhooks are API push notifications, or HTTP requests, that get triggered by specific events and send data to a specified URL. At Lob, we use webhooks to send real-time notifications to our customers based on events within Lob’s system. A common example for our customers is to use webhooks to get notified when a mailpiece is delivered for their tracking and analytics. (Here is a complete list of events triggered by Lob’s webhooks, and if you want to try your hand at integrating, here is a tutorial using Node.js.)
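To make the receiving side concrete, here is a minimal sketch of an endpoint that accepts a webhook POST and records what happened. It uses only Python's standard library. The payload field names (`event_type`, `resource_id`) and the `letter.delivered` value are illustrative assumptions, not Lob's documented schema, so check the event list linked above before relying on them.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_event(payload: dict) -> str:
    """Turn a webhook payload into a one-line summary for tracking."""
    # Hypothetical field names; the real payload schema may differ.
    event_type = payload.get("event_type", "unknown")
    resource_id = payload.get("resource_id", "unknown")
    return f"{event_type}: {resource_id}"


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the sender declared.
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            payload = None
        if not isinstance(payload, dict):
            # Reject bodies that are not a JSON object.
            self.send_response(400)
            self.end_headers()
            return
        print(summarize_event(payload))
        # A 2xx response tells the sender the event was received.
        self.send_response(204)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Listen for webhook deliveries on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the listener. A production endpoint would also verify the request signature and respond quickly, deferring slow work to a background job.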
Lob currently sends about 50 million events to our customers each month—needless to say, we need this functionality to be reliable and scalable.
Webhooks V1 was built completely internally, and while it’s been performant, it was not an ideal solution long-term for a number of reasons.
- An overly-complex design made it difficult for developers to understand and work with.
- Not scalable: V1 used Amazon Simple Workflow Service (SWF), which has a hard limit on concurrent executions.
- Expensive: The AWS service cost was increasing and impacting our engineering budget.
- Missing features: V1 lacked key features that our customers have requested, such as rate limiting, OAuth, and easy replay of events (and because of the complex design, these features would be very difficult to add).
Overall, Webhooks V1 was showing its age and giving a poor experience to both Lob engineers (hard to work with, difficult to debug) and customers (fewer features, slower incident resolution).
The team was faced with the classic dilemma of build vs. buy. We’d already tried the build option, and our first attempt resulted in accidental complexity. We thought an out-of-the-box solution could potentially prevent this, so we wanted to explore the buy option. Enter a third-party web service: Svix.
Svix is a YC-backed webhooks-as-a-service platform. As Svix notes, “webhooks are harder than they seem” (true statement), so their goal is to reduce engineering time, resources, and ongoing maintenance.
For our webhooks infrastructure, in both versions, events flow from a few sources, like the USPS, and all are routed through Lob’s webhooks REST API. In V1, what followed was a lengthy and complex process—involving multiple queues, workers, and different services—before getting to the customer. In V2, we replaced this complexity with Svix: we now have just one queue, one autoscaling worker, and Svix.
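The V2 shape (one queue, one worker, one delivery service) can be sketched in a few lines. This is an illustration under assumptions, not Lob's implementation: the article does not name the queue technology, so an in-process `queue.Queue` stands in for it, and the hypothetical `send` callable stands in for the call that hands an event to Svix.

```python
import queue
from typing import Callable

Event = dict
Sender = Callable[[Event], None]


def run_worker(events: "queue.Queue", send: Sender) -> int:
    """Drain the queue, forwarding each event; stop at the None sentinel."""
    forwarded = 0
    while True:
        event = events.get()
        if event is None:
            # Sentinel value: no more events to process.
            break
        # In a real system this would be the hand-off to the delivery service.
        send(event)
        forwarded += 1
    return forwarded


def demo() -> list:
    """Push two sample events through the worker and return what was sent."""
    events: "queue.Queue" = queue.Queue()
    events.put({"type": "letter.delivered"})
    events.put({"type": "postcard.delivered"})
    events.put(None)
    sent: list = []
    run_worker(events, sent.append)
    return sent
```

Because the worker does nothing but forward, scaling it comes down to running more copies against the same queue, which is what makes a single autoscaling worker workable.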
Version 2 includes changes to the Lob dashboard. Requests go from the dashboard to Lob’s API, then to a new webhooks service we built to send the requests onto Svix.
As Svix is a growing startup, there was an initial concern about their ability to handle our scale. As a part of due diligence, we ran a series of load and functional tests with them before deciding to move forward.
“We had a robust service that automatically scales with usage, but we weren’t ready for how quickly Lob’s traffic can scale from normal levels to very high requests-per-second. We had to redesign parts of our system and make our autoscaling much more aggressive to be able to handle these sudden spikes,” said Tom Hacohen, Svix founder and CEO.
This is just one example of how, as we proceeded with implementation (and still now, after launch), it’s been a collaboration. Svix has been able to evolve the product to meet our specific needs. They added servers in our AWS region, rather than have us use their EU-hosted ones, which vastly improved our message processing throughput. They also implemented persistent HTTP connections, which further improved throughput by preventing unnecessary duplicate authentication requests and TLS handshakes.
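To see why persistent connections help, consider a sender that opens one connection and reuses it for every request, rather than reconnecting each time. The sketch below uses Python's standard `http.client`; the `post_events` function and its parameters are hypothetical names for illustration, and it uses plain HTTP so it can run against a local test server. With `http.client.HTTPSConnection`, the same reuse pattern is what avoids a fresh TLS handshake per request.

```python
import http.client
import json


def post_events(host: str, port: int, path: str, events: list) -> list:
    """POST each event over one reused HTTP connection; return status codes."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    statuses = []
    try:
        for event in events:
            body = json.dumps(event)
            conn.request(
                "POST",
                path,
                body=body,
                headers={"Content-Type": "application/json"},
            )
            response = conn.getresponse()
            # Drain the body so the connection can carry the next request.
            response.read()
            statuses.append(response.status)
    finally:
        conn.close()
    return statuses
```

Reuse only works if the server keeps the connection open (HTTP/1.1 keep-alive); if the server closes it, `http.client` reconnects on the next request and the saving is lost.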
- Simplifying our webhooks infrastructure gave us greater scalability: we eliminated dependencies on individual components and their constraints (like the SWF concurrency limit).
- We also made our engineers’ lives much easier. No more endless manual interventions or late nights trying to debug. Instead of having to deal with seven different repos to make a webhooks change, it's now very simple.
- Our costs became more predictable. With V1 our costs scaled as a function of event count. With V2 we have a flatter cost structure and decreasing cost per event as we scale.
- The new dashboard is feature-rich: customers can now easily self-serve and replay webhook attempts themselves; they can also specify rate limits when creating or updating an endpoint.
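For readers curious what a per-endpoint rate limit involves, one common technique is a token bucket. The article does not say how Svix enforces limits, so the sketch below is a generic illustration and not their implementation; the `TokenBucket` class and its parameters are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `rate` deliveries per second, with bursts up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = capacity  # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        """Return True if a delivery may proceed now, consuming one token."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A sender would keep one bucket per endpoint and hold or re-queue any event for which `allow()` returns `False`, so a slow customer endpoint is never sent more than it asked for.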
Webhooks V2 is now a scalable product that is easier to work with, offers more robust functionality, and is budget-friendly.
We were excited to get this project into use by our customers. We rolled out to one of our large enterprise customers, and within the first week and a half, they sent over 10 million webhooks. We've added a few more features to our service, and have asked Svix to do the same; V2 will be GA for all customers in the next few weeks.
Adding a third party to the mix was definitely the right decision for us in this case. In addition to happy customers, our engineers have their evenings back (just in time for the new season of Stranger Things).
This article was adapted from a presentation by engineers Martin Han and Zac Leids.