You say that the current solution is better: it has made maintenance easier and helps you fix bugs quickly. But if you have spent far more than you have gained, or will gain in the coming years, was it even worth it? How do you justify your decision? And, more importantly, what are the metrics for measuring the success of your decision?
This was a really fun summary of your process and outcomes, @ujjavala - thanks for sharing! I do have a bunch of questions:
Did you need to project these gains ahead of time, and were those the same metrics that you ended up using after implementation? Were there any surprises at all, or did everything work as expected and optimize as expected?
Thanks, @ellativity. Glad that you liked it.
To answer your question, the entire redesign and refactoring was done to get a better grip on what we had developed. As mentioned in the blog, with the earlier design it was becoming difficult to fix bugs, and when we had production bugs we sometimes spent nights fixing them. It got to the point where we just couldn't go on with the code. On top of that, with the huge infrastructure, we were losing a lot of money since most of it was not even being used. The entire redesign took around 2-3 months (most of that time went into getting approvals), and IMO it helped us a lot.
To measure it, we used RED metrics (rate, errors, duration) and, of course, the cost savings we saw.
As for surprises, I reckon we didn't have many. Yes, we had bugs in the beginning, but we could fix them easily since we now had a smaller codebase and better logs.
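For anyone following along who hasn't used RED metrics: here is a minimal sketch of how rate, errors, and duration could be computed over one time window. This is only an illustration, not the team's actual setup. The `Request` record, the `red_metrics` helper, and the choice of a nearest-rank 95th percentile for duration are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one handled request (not from the original post)."""
    duration_ms: float  # how long the request took
    ok: bool            # False if the request failed


def red_metrics(requests, window_seconds):
    """Return (requests per second, error ratio, p95 duration in ms) for one window."""
    if not requests:
        return 0.0, 0.0, 0.0

    # Rate: how many requests were served per second in this window.
    rate = len(requests) / window_seconds

    # Errors: the fraction of requests that failed.
    error_ratio = sum(1 for r in requests if not r.ok) / len(requests)

    # Duration: nearest-rank 95th percentile of request latency.
    durations = sorted(r.duration_ms for r in requests)
    rank = max(1, math.ceil(0.95 * len(durations)))
    p95 = durations[rank - 1]

    return rate, error_ratio, p95


if __name__ == "__main__":
    sample = [Request(10, True), Request(20, True), Request(30, False), Request(40, True)]
    print(red_metrics(sample, window_seconds=2))
```

In practice these numbers usually come from a monitoring stack rather than hand-rolled code, but comparing the same three values before and after a redesign is the basic idea.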
Thanks for responding and explaining in more detail. So, although projecting gains was important, what was more important was that the existing codebase was unmanageable, full stop? I think I understand!
Ain't that the truth 90% of the time, tho!
Exactly! Since ours was a platform team, any downtime affected around 10-15 other teams (micro-services). So basically we have to stay alive so that the others stay alive, and that is only possible when our MTTR (mean time to recovery) is minimal.
Whatever it takes!