Lob found itself in a similar situation with hapi, an open-source Node.js framework used to build powerful and scalable web applications. We were running v16 when v17 was announced. The release notes describe v17 as a new framework because it fundamentally changes how business logic interfaces with the framework. The main change, and the motivation for the release, was replacing callbacks with a fully async/await interface. Though few would dispute the advantages of this shift, the result was dozens upon dozens of breaking changes. At Lob, it meant hundreds, and our list of dependencies was long. The upgrade sat on the back burner, and as v17 turned to v18, then v20, we realized it was time to git-er-done.
Let’s look at ways to minimize the “suck” when tackling a long overdue upgrade.
Delaying a framework upgrade can mean falling several versions behind. You might be tempted to jump straight to the latest version, but consider how that might play out. Most of the community migrated from the version you are on to the next one, so upgrade material will likely focus on moving from version C to D, not from C to G. Every developer's best friend, Stack Overflow, probably contains questions (and answers) about issues arising from a C-to-D migration. Tread carefully here.
At Lob, we set out to upgrade hapi from v16 to v17 and found the task enormous. It spanned 13 repos, several third-party libraries, and over 100 plugins. A team of four engineers worked on the project, with other departments contributing. For a sense of scale, a typical upgrade, like the subsequent hapi v17-to-v18 move, required just one engineer. Be sure to resource your team appropriately.
Almost every request handler in our environment was going to break. The changes were mostly syntactic, but once they were made, all tests had to be updated accordingly; we had several hundred.
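To make the scale of the breakage concrete, here is a minimal sketch of the v16-to-v17 handler change. The route and helper names are hypothetical, not Lob's actual code; the shape of the change is what matters.

```javascript
// hapi v16: respond by calling the `reply` callback passed to the handler.
const v16Handler = (request, reply) => {
  lookupPostcard(request.params.id, (err, postcard) => {
    if (err) return reply(err);
    reply(postcard);
  });
};

// hapi v17: handlers are async functions that simply return the response
// value (the second argument is the response toolkit, `h`).
const v17Handler = async (request, h) => {
  return lookupPostcardAsync(request.params.id);
};

// Hypothetical data-access helpers, callback-style and promise-style.
function lookupPostcard(id, cb) {
  cb(null, { id, status: 'rendered' });
}
function lookupPostcardAsync(id) {
  return Promise.resolve({ id, status: 'rendered' });
}
```

Small as the change looks per handler, multiplying it across every route, plugin, and test is where the hundreds of breaking changes came from.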
All plugins from hapi’s ecosystem also required an upgrade to work with v17. We had a number of custom plugins we’d written that needed our attention, along with third-party plugins we either had to upgrade or replace.
Our update process was as follows:
- Make a decision on the third-party plugins
- Update our internal plugins
- Update all the route handlers and tests
We did this for every single endpoint (e.g., postcards, then letters, and so on) one by one. For each repo, the work was split into three parts:
- One for updating the code
- One for the admittedly more difficult task of updating the build tooling
- One to enable GitHub Actions to test PRs
In retrospect, Software Engineering Manager Sowmitra Nalla said he would have written a find-and-replace script; with that approach, we could have upgraded a repo in about two days. At the time, however, we thought that with several engineers on the upgrade we could churn through it faster than we could build a tool. The goal was also to improve Lob's API performance, not upgrade the entire engineering organization's stack.
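A script like the one Nalla described might have started as something like the sketch below. The handler shapes are hypothetical, and real handlers vary too much for regexes alone; a production codemod would lean on an AST tool such as jscodeshift.

```javascript
// Rough sketch of a find-and-replace migration script (illustrative only).
function migrateHandlerSource(src) {
  return src
    // (request, reply) => ...  becomes  async (request, h) => ...
    .replace(/\(request,\s*reply\)\s*=>/g, 'async (request, h) =>')
    // reply(value);  becomes  return value;
    .replace(/\breply\(([\s\S]*)\)\s*;/g, 'return $1;');
}
```

Even a crude script like this, run across 13 repos and then hand-reviewed, is the kind of automation that could have turned weeks of mechanical edits into days.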
Rather than pause all deployments to our API for several weeks while we upgraded, we decided to spin up a v17 stack side by side with hapi v16, an approach we dubbed "double-rainbow" and represented in Slack, by our team of exhausted engineers, with a rainbow emoji.
“We did a type of canary deployment but with 'feature flags' at the route level. Normal feature flags are at the app level; our toggles were at the load balancer level. Depending on which REST paths we wanted to route, we would drive traffic appropriately,” said Nalla.
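In other words, the toggle lived in the load balancer rather than in application code. As an illustration only (the cluster names, paths, and weight are hypothetical, and the real decision happened in load balancer config), the routing logic amounted to something like:

```javascript
// Illustrative sketch of route-level traffic splitting between the two stacks.
const migratedPaths = new Set(['/v1/postcards', '/v1/letters']);
let v17Weight = 0.05; // start by sending 5% of traffic to the new stack

// Pick a backend cluster for a request; `rand` is injectable for testing.
function chooseBackend(path, rand = Math.random()) {
  if (migratedPaths.has(path) && rand < v17Weight) {
    return 'hapi-v17-cluster';
  }
  return 'hapi-v16-cluster';
}
```

Keeping the toggle at the path level meant each endpoint could be migrated and dialed up independently, instead of cutting the whole API over at once.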
We started with 5% of traffic going to the new stack and used a dashboard to compare errors, CPU, and other metrics. As soon as we saw an error, we would divert traffic back to the current stack and investigate. Because we were diverting only a small percentage of traffic to mitigate risk, we saw a very small number of errors, which did not seem like a red flag; we assumed there would be some errors here and there. We learned that was not quite right. Instead of just looking at the number of errors, we needed to look at the percentage of errors: if the error percentage is higher in one cluster than in the other, something else is going on. We did not forget that when we upgraded to hapi v18 and v20.
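That lesson, compare error rates rather than raw counts, is easy to encode. The numbers and threshold below are made up for illustration:

```javascript
// A canary taking 5% of traffic will always show fewer errors in absolute
// terms; only the rate is comparable across clusters of different sizes.
function errorRate(errors, requests) {
  return requests === 0 ? 0 : errors / requests;
}

// Flag the canary only if its error rate meaningfully exceeds the baseline's.
function canaryLooksUnhealthy(canary, baseline, tolerance = 0.001) {
  return errorRate(canary.errors, canary.requests) >
         errorRate(baseline.errors, baseline.requests) + tolerance;
}
```

For example, 10 errors out of 5,000 canary requests (0.2%) is fewer absolute errors than 60 out of 95,000 baseline requests (about 0.06%), but a much higher rate, which is exactly the signal to divert traffic and investigate.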
We had a major incident early on that resulted in all traffic being diverted back to v16. As it turned out, one of the internal libraries being upgraded had two versions, and changes we'd made on an earlier version were never merged back in. Working from the main branch, which was running the "latest" version of that library, led to the incident.
Even in the best-executed project, unforeseen errors can happen. Fortunately, the rollout strategy limited the interruption while we debugged, and we then resumed the flow of traffic to v17. We did end up combing through all the other plugins to ensure this was a one-off mistake, an arduous but necessary task.
We saw an incredible 100% improvement in API throughput (requests per second). At first, we saw some scary dips in our graphs, but realized they were a side effect of testing how many connections each container held to the database. Those tests also taught us that better connection handling on the database side would increase throughput further.
While admittedly pretty painful, the upgrade was absolutely worth it. The positive impact to performance on Lob’s API is the most obvious benefit, but on the whole it made our teams more efficient moving forward.
Hapi Version 18 included minor improvements for performance and standards compliance. This was followed by Version 20, another small release. Less significant changes certainly meant quicker subsequent upgrades for us, but we also applied the processes we put in place along with lessons learned from the initial upgrade.
The project was a powerful reminder to take the time upfront for better estimation. (Check out Why Developers Suck at Software Estimation and How to Fix It.) Are there patterns or duplicative work? If so, would automation or a tool help? We followed a uniform process for updating each plugin; this consistency made the work as efficient as possible under the circumstances. Our "double-rainbow" deployment allowed for a smoother cutover and the opportunity to debug without customer impact (and we learned to prioritize the percentage of errors over the number of errors).
We will definitely employ these methods to make similar upgrades less sucky—and hope you can too.