While software development is immune from almost all physical laws, the inexorable increase in entropy hits us hard.
David Thomas & Andrew Hunt — Pragmatic Programmer 20th-anniversary edition
Introduction
In this blog post, we will share a methodology for paying back tech debt. DISQO has used this methodology with great success. With the gradual rewrite method, we removed a huge amount of unnecessarily complex code and replaced it with simpler implementations. At the same time, we created a robust test set that helps developers in future code changes. As a result of the rewrite, we reduced lines of code (LOC) by at least 20%, improved performance and developed new features with basically zero production incidents.
Before going into detail about what we mean by a “gradual rewrite”, let’s take a brief look at what tech debt is and how it can be paid back.
What is tech debt
There is no single reason why entropy increases in a software project, nor a single part of the development cycle or stack from which the increase originates.
Tech debt types and reasons
| Tech debt types | Tech debt reasons |
| --- | --- |
| Requirements | Uncertainty of use cases in the beginning |
| Architecture | Business evolution |
| Design | Time pressure: deadlines with penalties |
| Code | Priority of features over product |
| Test | Split of budget into project budget and maintenance budget |
| Build | Lack of specification/emphasis on critical architectural requirements |
| Documentation | Reuse of legacy, third-party, or open-source |
| Infrastructure | Parallel development |
| Versioning | Effects uncertainty |
| Defect | Non-completed refactoring |
| | Technology evolution |
| | Human factor |
Whatever the cause, all developers know how annoying it is to work with a badly indebted codebase. Every change becomes frustrating, because a change in one place can have a ripple effect on seemingly unconnected parts of the code. You can never be sure why the current code is written the way it is or what side effects a simple change will bring. This can also have a significant impact on developers’ morale and productivity.
Paying back technical debt
The most commonly recommended technique for reducing technical debt is to constantly refactor the old code during the development of new code. This technique works well for purely code-type technical debt, but when the technical debt is in design or architecture, refactoring may become hard to do safely, since the work affects multiple parts of the code base. Sometimes, the code has aged so badly that current developers are afraid to touch it for fear of breaking revenue-generating functionality. Thus, it is not always possible to refactor during new feature development; in those cases the refactoring must be done as separate work.
In those cases, it is very tempting to suggest a complete rewrite of the system. Unfortunately, rewriting software can be considered the “single worst strategic mistake that any software company can make” for very good reasons. It is a lot harder to read code than it is to write it. When the old code has been running in production for a very long time, it has already been extensively tested and debugged, and multiple defects have been fixed. In the worst case, a rewrite leads to the maintenance of two parallel systems and more than doubles the required maintenance effort. Completely rewriting a software system is a task that should not be considered unless all stakeholders are aware of the risks involved. It can be argued that if it took N years to develop the original code, the full rewrite will also take at least N years.
The only way to speed up a rewrite is to simplify the requirements. However, in actuality, this is a development of a new system, not a rewrite of the old code.
Since there are no quick wins to be had, the problem becomes: how to pay back a system’s debt so developers are not afraid of changing the code, while at the same time avoiding disruptions to other maintenance work and new feature development.
Gradual rewrite
This section explains a methodology that we have used at DISQO for gradually rewriting a software system.
A gradual rewrite can be divided into the following steps:
1. Delete all code that is not being used.
2. Identify disparate subsystems of the codebase that can be rewritten in relative isolation.
3. Choose one of these subsystems to rewrite.
4. Design and publish a set of metrics that can be used to make sure that the subsystem is working properly.
5. Create a fast and reliable set of automated tests for the subsystem.
6. Rewrite the subsystem.
7. Test the rewritten system by releasing it to a small part of the user base.
8. Track the metrics designed in step 4 for the new code.
9. When the metrics show that the new code is working properly, gradually increase the number of users running the new code until all users have updated.
10. Return to step 3 until all necessary subsystems have been rewritten.
A gradual rewrite is most effective on certain types of tech debt, such as code and architecture debt and, in some cases, design and requirements debt.
Rewriting a codebase is not a simple task and there are no quick wins to be found. The law of a rewrite taking at least as long as the original development applies to gradual rewriting as well. However, a gradual rewrite allows the new and old code to co-exist in the same system, doesn’t halt new feature development, and forces developers to build the new code with healthy habits such as good test sets and metrics.
The following sections describe the steps presented above in more detail.
Delete code
In some cases, the codebase may have evolved so much that it contains code that is not being used anymore. All unused or commented-out code must be deleted so that it doesn’t add noise to the codebase. Most compilers and static analysis tools can find code paths that are never executed. It is harder to find code paths that are only executed in configurations that are no longer relevant. Finding those requires an understanding of the codebase, the current production configuration, and which configuration values are sensible in the production system.
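As a rough illustration of what such a static check does, the sketch below uses Python's standard `ast` module to list functions that are defined in a piece of source code but never referenced in it. The sample source and its function names (`upload`, `legacy_upload`) are hypothetical, and the check is deliberately naive: it looks at a single file and cannot see dynamic dispatch, reflection, or configuration-dependent paths, which is exactly the part that still needs human judgment.

```python
import ast


def unused_functions(source: str) -> list[str]:
    """Return names of functions defined in `source` that are never referenced in it."""
    tree = ast.parse(source)
    defined = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    referenced = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):          # plain references such as upload(...)
            referenced.add(node.id)
        elif isinstance(node, ast.Attribute):   # attribute references such as obj.upload(...)
            referenced.add(node.attr)
    return sorted(defined - referenced)


# Hypothetical module in which legacy_upload is no longer called by anything.
SAMPLE = '''
def upload(batch):
    return len(batch)

def legacy_upload(batch):
    return upload(batch)

def main():
    print(upload([1, 2, 3]))

main()
'''

print(unused_functions(SAMPLE))  # prints ['legacy_upload']
```

A result from a check like this is a list of candidates for deletion, not proof that the code is dead.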
Identify subsystems
First, it is necessary to understand what is meant by a subsystem in this context. An individual class or a file is not considered to be a subsystem. A subsystem of a code base is a set of classes, packages or modules that fulfill a certain business requirement. A few practical examples could be:
- REST endpoint
- UI view or UI component
- Data processing pipeline for one data type
- Client-side data storage and upload
A good understanding of the code base and business requirements is needed to identify the subsystems. Some static analysis tools may help, but in most cases, it requires manual review and analysis of the codebase, architecture, design, and requirements. Good collaboration with all stakeholders is essential for understanding what counts as a subsystem and which subsystems make sense to rewrite in isolation.
Choose subsystem
There is no straightforward rule that tells you which subsystem should be rewritten first. The choice could fall on the simplest subsystem, the most business-critical one, the worst-performing one, and so on. It depends entirely on the goal of the rewrite.
Identify metrics
For a successful rewrite, it is essential to get an understanding of how the code is performing in the production environment. Without visibility into the running software, it is impossible to know if it is working properly or not. Deciding what metrics to collect and how to interpret them should not be an afterthought, but should be built into each subsystem. Depending on the requirements of the subsystem, the metrics may include things such as:
- Run times of critical sections of the system
- Throughput
- Error rate
- Number of events consumed/produced
Since most software systems perform some kind of data transformation, it is also important to measure whether the transformations still work as they should after the rewrite. These metrics could be:
- Minimum, maximum, mean, median, and deviation of attribute values
- Attribute fill rates
It must be possible to separate the metrics for the old and rewritten code to understand the differences in runtime behavior.
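A minimal sketch of such data metrics, kept separate for the old and the rewritten code, could look like the following. The class name `AttributeMetrics`, the `"old"`/`"new"` labels, and the sample attributes are assumptions made for illustration; a real system would more likely send these values to its existing monitoring stack than keep them in memory.

```python
import statistics
from collections import defaultdict


class AttributeMetrics:
    """Collects attribute values separately for each implementation ("old" or "new")."""

    def __init__(self):
        self._values = defaultdict(list)  # (impl, attribute) -> non-missing values
        self._records = defaultdict(int)  # impl -> number of records seen

    def record(self, impl, record):
        """Register one output record produced by the given implementation."""
        self._records[impl] += 1
        for attribute, value in record.items():
            if value is not None:
                self._values[(impl, attribute)].append(value)

    def fill_rate(self, impl, attribute):
        """Fraction of records in which the attribute was present and not None."""
        total = self._records.get(impl, 0)
        filled = len(self._values.get((impl, attribute), []))
        return filled / total if total else 0.0

    def summary(self, impl, attribute):
        """Distribution statistics for a numeric attribute, or None if no values were seen."""
        values = self._values.get((impl, attribute), [])
        if not values:
            return None
        return {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "stdev": statistics.pstdev(values),  # population standard deviation
        }


# Hypothetical output records tagged with the implementation that produced them.
metrics = AttributeMetrics()
metrics.record("old", {"duration_ms": 10, "country": "US"})
metrics.record("old", {"duration_ms": 30, "country": None})
metrics.record("new", {"duration_ms": 20, "country": "US"})

print(metrics.summary("old", "duration_ms"))
print(metrics.fill_rate("old", "country"), metrics.fill_rate("new", "country"))
```

Tagging every measurement with the implementation that produced it is what makes a side-by-side comparison possible once both code paths run in production.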
Create and run automated tests
For a rewrite to succeed, it is essential to have an extensive test set that is reliable and executes quickly. Nothing is more detrimental to productivity than being afraid to change code because there is no way to verify that the change does not break existing behavior.
It does not matter whether the tests are unit, integration, functional, regression, or other tests. The only criteria are that the tests execute fast, have large behavioral coverage, and are stable.
The automated tests must be run often during the rewrite. The tests need to give developers the confidence to work with the code without fear of breaking critical behavior.
However, it is not easy to come up with such an extensive test set. In some simple cases, it could be possible to create such a test set manually, but in most cases, it is very hard to remember to cover all possible edge cases.
One good method for creating an extensive test set is to persist a portion of production inputs, outputs, and states. This data can then be replayed over and over during the development process.
For example, in a web service backend, this data could be a REST endpoint request, response, and relevant database state before and after calling the endpoint. The test case would do the following:
- Set the database to the before state.
- Call the endpoint with the persisted request.
- Assert the response and the database after state.
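The three steps above can be sketched as a small replay harness. Everything here is hypothetical: the "database" is a plain dictionary, `add_points` stands in for the rewritten endpoint, and `RECORDED_CASES` stands in for samples persisted from production. The shape of the test is the point, not the details.

```python
import copy


def replay_case(handler, case):
    """Replay one recorded case and return a list of mismatches (empty means pass)."""
    db = copy.deepcopy(case["db_before"])    # set the database to the before state
    response = handler(case["request"], db)  # call the endpoint with the persisted request
    mismatches = []
    if response != case["response"]:         # assert the response
        mismatches.append(f"response: expected {case['response']!r}, got {response!r}")
    if db != case["db_after"]:               # assert the database after state
        mismatches.append(f"database: expected {case['db_after']!r}, got {db!r}")
    return mismatches


# Hypothetical rewritten endpoint: adds points to a user's balance.
def add_points(request, db):
    user = db["users"].get(request["user_id"])
    if user is None:
        return {"status": 404}
    user["points"] += request["points"]
    return {"status": 200, "points": user["points"]}


# Hypothetical recorded cases; in practice these would be loaded from persisted samples.
RECORDED_CASES = [
    {
        "request": {"user_id": "u1", "points": 5},
        "db_before": {"users": {"u1": {"points": 10}}},
        "response": {"status": 200, "points": 15},
        "db_after": {"users": {"u1": {"points": 15}}},
    },
    {
        "request": {"user_id": "missing", "points": 5},
        "db_before": {"users": {}},
        "response": {"status": 404},
        "db_after": {"users": {}},
    },
]

for case in RECORDED_CASES:
    assert replay_case(add_points, case) == []
```

Because each case carries its own before state, the cases are independent of each other and can be run in any order, which helps keep the test set fast and stable.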
Rewrite
Time to get your hands dirty. Obviously, there is no point in rewriting unless there is a clear vision of how the code will improve. This needs to be clear for all stakeholders. Unfortunately, there is no standard way to measure readability or understandability of a code base, so it always comes down to some kind of subjective analysis. However, it is possible to try to find some objective measures such as the number of mutable states, cyclomatic and cognitive complexity, etc. to compare the complexity of implementations.
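As one example of such an objective measure, the sketch below estimates cyclomatic complexity by counting decision points with Python's standard `ast` module. It is a rough approximation written for illustration, not the exact number any particular analysis tool would report, and the two versions of `label` are hypothetical.

```python
import ast

_DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def approx_complexity(source: str) -> int:
    """Rough cyclomatic complexity: 1 plus the number of decision points in the source."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            complexity += len(node.values) - 1  # each extra and/or operand adds a path
        elif isinstance(node, ast.comprehension):
            complexity += 1 + len(node.ifs)     # the loop itself plus each filter
    return complexity


# Hypothetical original implementation: a chain of branches.
OLD = '''
def label(code):
    if code == 1:
        return "start"
    elif code == 2:
        return "stop"
    elif code == 3:
        return "pause"
    else:
        return "unknown"
'''

# Hypothetical rewritten implementation: a table lookup.
NEW = '''
LABELS = {1: "start", 2: "stop", 3: "pause"}

def label(code):
    return LABELS.get(code, "unknown")
'''

print(approx_complexity(OLD), approx_complexity(NEW))  # prints 4 1
```

A lower number does not prove that the new code is easier to understand, but it gives stakeholders something concrete to compare alongside the subjective review.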
Release to user subset and monitor metrics
The most important thing to remember when doing a release is never to release a change to all users at once. The rewritten code must first go to a small subset of users, for example, by controlling which users get the rewritten version or by putting the rewritten code path behind a feature flag.
During the release, the metrics must be constantly monitored for any possible regressions, and in case of regression, the release must be halted and possibly rolled back. In the best case, the problematic inputs can be identified, persisted, and added to the test set, thus simplifying the bug-fix development.
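A minimal sketch of both ideas, assuming a percentage-based feature flag and a simple error-rate comparison, is shown below. The function names, the salt, and the tolerance are illustrative assumptions, not the mechanism DISQO uses.

```python
import hashlib


def in_rollout(user_id: str, percent: int, salt: str = "sdk-rewrite") -> bool:
    """Deterministically place a user in one of 100 buckets and enable the first `percent`."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def should_halt(old_errors, old_total, new_errors, new_total, tolerance=0.01):
    """Halt when the new code's error rate exceeds the old rate by more than `tolerance`."""
    if new_total == 0:
        return False  # no traffic on the new code path yet
    old_rate = old_errors / old_total if old_total else 0.0
    new_rate = new_errors / new_total
    return new_rate > old_rate + tolerance


# The same user always gets the same answer, and raising the percentage only adds users.
use_new_code = in_rollout("user-42", percent=5)
print("new code path" if use_new_code else "old code path")

print(should_halt(old_errors=10, old_total=1000, new_errors=1, new_total=100))  # prints False
print(should_halt(old_errors=10, old_total=1000, new_errors=5, new_total=100))  # prints True
```

Hashing the user ID instead of choosing users at random keeps each user on one code path for the whole rollout, so a user who is in the 5% group stays in the group when the percentage grows.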
Release to all users
When the metrics show that the rewritten code is executing as intended, the code can be released to all users. If no regressions are identified during the full release, the rewrite can be deemed complete.
The cycle starts again with a new subsystem until all necessary subsystems have been rewritten.
Conclusion
In this article, we presented a method for rewriting a large codebase with a high degree of confidence. The gradual rewrite method does not make the rewriting process faster, but it allows the old and rewritten code to live side by side while the rewrite is in progress. At the same time, testing and measuring verify that the rewritten code behaves as intended.
As we mentioned in the beginning, DISQO is successfully using this method. We are currently in the process of rewriting our behavioral data collection SDK for Android. In the second part of this blog series, we will go into more detail about how we are rewriting the Android SDK with the gradual rewrite methodology.