Learnings from a Disaster Recovery
So, for over a year, we were managing the aftermath of an application crash, and it forced us to build a good number of processes and rebuild a decent number of applications. There are a few things we realized should have been done a decade ago, but which no one at the time thought were worth doing. Let me list a few of them.
Every process you build should have a bulk upload facility. When the application that feeds you data breaks and you fall back to the standard entry screen in your own application as an alternative, and you then realize you need to make 500 data entries per day, you will really feel the pain.
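To make that concrete, here is a minimal bulk upload sketch in Python. The endpoint, file name and field layout are assumptions for illustration, not the actual application's API; the point is that 500 daily entries become one file load instead of 500 screen interactions.

```python
import csv
import requests  # assumes the application exposes an HTTP entry API

# Hypothetical endpoint; replace with whatever entry API your application actually has.
ENTRY_API = "https://app.example.internal/api/v1/entries"

def bulk_upload(csv_path: str) -> None:
    """Read rows from a CSV export and push them through the entry API in one run."""
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            resp = requests.post(ENTRY_API, json=row, timeout=30)
            resp.raise_for_status()  # fail fast so a bad row is noticed immediately

if __name__ == "__main__":
    bulk_upload("manual_entries.csv")
```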
Modularization of code: One thing we realized is that every business supra-process should be isolated from a code perspective. For convenience we reuse code components, and I have no problem with that, but at least ensure that the parameterization is unique to each process. This becomes a real problem during batch scheduling in a major application where a thousand jobs run. Yes, such applications exist. Two good practices: give each process its own batch user ID, and carve the whole application into isolated batches which can be controlled independently.
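A minimal sketch of what that isolation can look like, using hypothetical process and job names: every process gets its own batch user ID and its own group of jobs that can be enabled or disabled independently of the rest of the application.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BatchGroup:
    """One isolated batch per business process, schedulable and stoppable on its own."""
    process: str
    batch_user: str            # dedicated batch user ID, never shared across processes
    jobs: tuple[str, ...]
    enabled: bool = True

# Hypothetical processes and job names, for illustration only.
BATCH_GROUPS = [
    BatchGroup("order-to-cash",  "BATCH_O2C", ("O2C_EXTRACT", "O2C_POST", "O2C_RECONCILE")),
    BatchGroup("procure-to-pay", "BATCH_P2P", ("P2P_EXTRACT", "P2P_POST")),
    BatchGroup("master-data",    "BATCH_MDM", ("MDM_SYNC",), enabled=False),  # paused independently
]

def runnable_jobs() -> list[tuple[str, str]]:
    """Return (batch_user, job) pairs only for the groups that are currently enabled."""
    return [(g.batch_user, job) for g in BATCH_GROUPS if g.enabled for job in g.jobs]
```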
A very robust access management and change deployment process: In a recovery, the pipelines are broken and you are reinventing the process. These two are your top priorities. People will not realize their impact on day one, but as the days pass, you will realize how critical they are.
Resource planning is critical: You can have ten different workstreams, and I have no problem with that, but realistically speaking you have no more than two or three architects. And then there are always some applications (especially the middleware layer, whether a proper middleware system or an ERP working as glorified middleware) for which you will simply not have sufficient hands.
Have a central plan: Along with the critical resources required for each workstream, there are a few other things you need to consider -
Ensure there is enough time between making an application or an interface available and the next major change
Ensure there are no clashing priorities - keep deployments and sanity testing as separate as possible, especially if they involve multiple applications
Ensure that stakeholders are informed in advance, with sufficient clarity about what is expected of them - don't surprise them
Have ambitious but realistic targets: Consider this scenario. At the last minute, you realize there is a process which was not considered: a data purge from a table. The first problem is that you are the one who has to do the work, but no one made you aware of it. The second problem is that you decide to go ahead anyway, only to find the table has 500 million records and you need to purge 200 million of them. From the moment the decision reaches you until the cut-off, you are given two hours. The person who made the decision didn't bother to inform you and, on top of that, doesn't know the process is so time consuming that you are not even in a position to give a timeline. Don't create such surprises for your teams.
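To show why such a purge cannot simply be squeezed into two hours, here is a hedged sketch of the usual approach: deleting in small batches so that locks and transaction logs stay manageable. The table, column and driver are assumptions (sqlite3 stands in for the real database), not a description of the actual system; on 200 million rows even this well-behaved loop takes many hours.

```python
import sqlite3  # stand-in for whatever DB-API driver the real system uses
import time

BATCH_SIZE = 10_000  # small batches keep lock duration and log growth under control

def purge_closed_records(conn: sqlite3.Connection, cutoff_date: str) -> int:
    """Delete eligible rows in batches; returns the total number purged."""
    total = 0
    while True:
        cur = conn.execute(
            # Hypothetical table and column names, for illustration only.
            "DELETE FROM transactions WHERE rowid IN ("
            "  SELECT rowid FROM transactions"
            "  WHERE status = 'CLOSED' AND created_at < ? LIMIT ?)",
            (cutoff_date, BATCH_SIZE),
        )
        conn.commit()
        if cur.rowcount == 0:
            break
        total += cur.rowcount
        time.sleep(0.1)  # give concurrent workloads room to breathe between batches
    return total
```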
Are there any application horizontal workstreams?: As a general rule, we focus only on business vertical workstreams, but at a later stage we realize there are behemoths (like an ERP system) which are whole workstreams in themselves. 40% of the application is not part of any established workstream, archival and data cleanup are nobody's child, and the health of the application as a whole has no clear owner. You may have to think of a robust workstream identification exercise which also covers isolated applications that have their own story to tell. Some areas which exist like that - Central Master Data, ERP applications (SAP etc.), networking and so on.