Thursday, May 20, 2010

Code quality, refactoring and the risk of change

When you are working on a live, business-critical production system, deciding what work needs to be done, and how to do it, means weighing several factors:
  1. business value: the demand and urgency for new business features and changes requested by customers or needed to get new customers onboard.

  2. compliance: staying on the right side of regulatory and security requirements.

  3. operational risk and safety: the risks of injecting bugs into the live system as a side-effect of your change, the likelihood and impact of errors on the stability or usability of the system.

  4. cost: immediate development costs, longer-term maintenance and support costs, operational costs and other downstream costs, and opportunity costs of choosing one work item over another or one design approach over another.

  5. technical: investments needed to upgrade the technology stack, managing the complexity and quality of the design and code.
These factors also come into play in refactoring: deciding how much code to cleanup when making a change or fix. The decision of what, and how much to refactor, isn’t simple. It isn’t about a developer’s idea of what is beautiful and their need for personal satisfaction. It isn’t about being forced to compromise between doing the job the right way vs. putting in a hack. It’s much more difficult than that. It’s about balancing technical and operational risks, technical debt, cost factors, and trading off short-term advantages for longer-term costs and risks.

Let’s look at the longer term considerations first. A live system is going to see a lot of changes. There will be regulatory changes, fixes and new features, upgrades to the technology platform. There will also be false starts and back-tracking as you iterate, changes in direction, and short-sighted design and implementation decisions made with insufficient time or information. Sometimes you will need to put in a quick fix, or cut-and-paste a solution. You will need to code in exceptions, and exceptions to the exceptions, especially if you are working on an enterprise system integrated with tens or hundreds of other systems. People will leave the team and new people will join, and everyone’s understanding of the domain and the design and the technology will change over time. People will learn newer and better ways of solving problems and how to use the power of the language and their other tools; they will learn more about how the business works; or they might forget or misunderstand the intentions of the design and wander off course.

These factors, the accumulation of decisions made over time, will affect the quality, the complexity, and the clarity of the system’s design and code. This is system entropy, as described by Fred Brooks back in The Mythical Man-Month:
“All repairs tend to destroy the structure, to increase the entropy and disorder of the system. Less and less effort is spent on fixing the original design flaws: more and more is spent on fixing flaws introduced by earlier fixes… Sooner or later the fixing ceases to gain any ground. Each forward step is matched by a backward one.”
So, the system will become more difficult and expensive to operate and maintain, and you will end up with more bugs and security vulnerabilities – and these bugs and security holes will be harder to find and fix. At the same time you will have a harder time keeping together a good team because nobody wants to wade knee deep in garbage if they don’t have to. At some point you will be forced to throw away everything that you have learned and all the money that you have spent, and build a new system. And start the cycle all over again.

The solution to this, of course, is to be proactive: to maintain the integrity of the design by continuously refactoring and improving the code as you learn; filling in short-cuts, eliminating duplication, cleaning up dead-ends, simplifying as much as you can. In doing this, you need to balance the technical risks and costs of change in the short term against the longer-term costs and risks of letting the system slowly go to hell.

In the short term, we need to understand and overcome the risk of making changes. Michael Feathers, in his excellent book Working Effectively with Legacy Code talks about the fear and risk of change that some teams face:
“Most of the teams that I’ve worked with have tried to manage risk in a very conservative way. They minimize the number of changes they make to the code base. Sometimes this is a team policy: ‘if it’s not broke, don’t fix it’…. ‘What? Create another method for that? No, I’ll just put the lines of code right here in the method, where I can see them and the rest of the code. It involves less editing, and it’s safer.’

It’s tempting to think we can minimize software problems by avoiding them, but, unfortunately, it always catches up with us. When we avoid creating new classes and methods, the existing ones grow larger and harder to understand. When you make changes in any large system, you can expect to take a little time to get familiar with the area you are working with. The difference between good systems and bad ones is that, in the good ones, you feel pretty calm after you’ve done that learning, and you are confident in the change you are about to make. In poorly structured code, the move from figuring things out to making changes feels like jumping off a cliff to avoid a tiger. You hesitate and hesitate.

Avoiding change has other bad consequences. When people don’t make changes often they get rusty at it…The last consequence of avoiding change is fear. Unfortunately, many teams live with incredible fear of change and it gets worse every day. Often they aren’t aware of how much fear they have until they learn better techniques and the fear starts to fade away.”
It’s clear that avoiding changes won’t work. We need to get and keep control over the situation, we need to make careful and disciplined changes. And we need to protect ourselves from making mistakes.

Back to Mr. Feathers:
“Most of the fear involved in making changes to large code bases is fear of introducing subtle bugs; fear of changing things inadvertently”.
The answer is to ensure that you have a good testing safety net in place (from Michael Feathers one more time):
“With tests, you can make things better with impunity… With tests, you can make things better. Without them, you just don’t know whether things are getting better or worse.”
You need enough tests to ensure that you understand what the code does, and you need to target tests that will detect changes in behavior in the area that you want to change.

Put in a good set of tests. Refactor. Review and verify your refactoring work. Then make your changes and review and verify again. Don’t change implementation and behavior at the same time.
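One way to picture that discipline is a characterization test: pin down what the code does today, boundaries included, and only then restructure under its protection. This is a sketch, with the pricing function and its tiers invented for illustration, using Python’s standard unittest module:

```python
import unittest

def legacy_price(qty: int) -> float:
    """Imagined legacy pricing code whose structure we want to clean up."""
    if qty > 100:
        return qty * 9.0
    if qty > 10:
        return qty * 9.5
    return qty * 10.0

class PriceCharacterization(unittest.TestCase):
    """Record what the code actually does BEFORE refactoring.
    These assertions pin current behavior, not what we wish it did."""

    def test_tier_boundaries(self):
        self.assertEqual(legacy_price(10), 100.0)   # 10 is not discounted
        self.assertEqual(legacy_price(11), 104.5)   # first discounted qty
        self.assertEqual(legacy_price(100), 950.0)  # 100 is still mid-tier
        self.assertEqual(legacy_price(101), 909.0)  # deepest discount

# Run with: python -m unittest <module>. Once these are green, restructure
# legacy_price (extract a rate lookup, rename, simplify) and re-run them:
# behavior stays pinned while the implementation changes underneath.
```

The point is the sequence, not the code: tests first, then the refactoring, then the behavior change, each step verified separately.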

But there are still risks – you are making changes, and there are limits to how much protection you can get from developer testing, even if you have a high level of coverage in your automated tests and checks. Ryan Lowe reinforces this in “Be Mindful of Code Entropy”:
“The reason you don’t want to refactor established code: this ugly working code has stood the test of time. It’s been beaten down by your users, it’s been tested to hell in the UI by manual testers and QA people determined to break it. A lot of work has gone into stabilizing that mess of code you hate.

As much as it might pain you to admit it, if you refactor it you’ll throw away all of the manual testing effort you put into it except the unit tests. The unit tests… can be buggy and aren’t nearly as comprehensive as dozens/hundreds/thousands of real man hours bashing against the application.”
As with any change, code reviews will help find mistakes, and so will static analysis. We also ask our developers to make sure that the QA team understands the scope of their refactoring work, so that QA can include additional functional and system regression testing.

Beyond the engineering discipline, there is the decision of how much to refactor. As a developer, how do you decide how much is right, how much is necessary? There is a big difference between minor, in-phase restructuring of code, which is just plain good coding, and fundamental re-design work, what Martin Fowler and Kent Beck call “Big Refactorings”, which clearly needs to be broken out and done as separate pieces of work. The answer lies somewhere between these points.

I recently returned from a trip to the Buddhist kingdom of Bhutan, where I was reminded of the value of finding balance in what we do, the value of following the Middle Way. It seems to me that the answer is to do “just enough”. To refactor only as much as you need to make the problem clear, to understand better how the code works, to simplify the change or fix… and no more.

By doing this, we still abide by Bob Martin’s Boy Scout Rule and leave the code cleaner than when we checked it out. We help protect the future value of the software. At the same time we minimize the risk of change by being careful and disciplined and patient and humble. Just enough.
