No offense, but despite your best intentions, you might not be handling risk properly. In this day and age, everything is software-dependent; even if you do not consider yourself a “software-firm” per-se, even if you are just running a small development team that develops in-house software, your business still depends on said software to run smoothly, and any outages cost money. The bigger the problem, the greater the cost. If you, like many other modern software-based organizations, try to reduce risk by taking every precaution to avoid the occurrence of failures, then I am talking to you. If you are (still) following the waterfall methodology (why would you do that???), then I am definitely talking to you.
In this blog post I will explain what is fundamentally wrong with the waterfall way of addressing risk, why you should resist the temptation to avoid failure, and what you should be doing instead, in order to truly reduce risk that is inherent to delivering software.
What is Wrong with Waterfall?
When following waterfall-based methodologies, software projects get developed in phases – first you gather all of the requirements (system, software), then you analyze the requirements and come up with a program design to satisfy the requirements. Once designed, you implement, or develop it. Once the development is done, you (hopefully) test the system thoroughly, and finally, you hand it over to operations, to deploy it and maintain it.
So, what is so fundamentally wrong with that, you might ask. This is a very simple and straightforward process. The problem is, of course, that, as even Winston Royce, the author of the now infamous paper about from 1970 titled "Managing the Development of Large Software Systems” said, this can only work for the most simple and straight forward projects.
It boils down to this: Software development is complicated. In waterfall, we proceed from one phase to the next, when the former completes successfully. Unfortunately, success is by no means guaranteed, or even likely. Worse, we tend to detect most problems only after we completed the development phase, during the testing phase, deployment phase, or worst of all, only after we have already released the flaws into production. What really makes this difficult, is that some of the problems that we uncover may have been introduced prior to development (the design, analysis or even requirement gathering), and as everyone knows, the cost of fixing a problem grows exponentially over time.
So what do we do? How do we mitigate the risk that we might introduce a costly flaw into the system? Intuitively, we attempt to get everything right the first time. We try to think of everything that the system or software might require, create comprehensive design documentation that proves that we thought really hard about the problem, and create lengthy, highly regulated processes and checks that prove that we crossed every ‘T’ and dotted every ‘I’ (and a few lowercase J’s for extra measure).
In other words, we attempt to reduce risk by reducing the likelihood of a problem/incident.
And here our intuition fails us.
Risk Management in Modern Software Projects
Reducing Likelihood of Problems is the Wrong Approach
There are many different ways that a project may fail. Too many to count them all. Missing a requirement, getting a requirement wrong, designing the wrong architecture, designing a system too rigid to change, developing the wrong capabilities, developing a capability incorrectly, deploying incorrectly, not designing for the right scale, insecure code, etc. The list goes on…
So we set up policies, we come up with plans, we have audits, we enforce waiting periods, we have sign-offs, and because releasing new software is so complicated and scary, we do so rarely, often no more that four times per year.
But here’s the problem – we aren’t eliminating risk. We are – at best – reducing the likelihood of something getting through our safety gates. This means that things will get through eventually, because given enough time, anything that can happen, eventually will.
And when failure does happen, the flaw in the system expresses itself in its full glory. In other words, there might be a 1% chance of a bug reaching production, but when it does, it’ll be there 100%!
All of our audits, sign-offs, and controls fail to stop us from making mistakes. At best they catch some of the mistakes that we’ve made when it is expensive to fix them; often the mistakes get caught too late – discovered flaws become too hard or too expensive to fix. These defects get shipped anyway, hopefully fixed in a service update. Worse, and all-too-often, these audits, controls and sign-offs do nothing to help identify problems, and are instead in place in order to identify whom to blame for the failures – a useless endeavor, in my opinion.
Worst of all, our lengthy processes delay the feedback that analysts, architects, and developers need, making it impossible to learn from mistakes! A bug found 6 months after it was introduced, will do nothing to teach the responsible party how to avoid making the same mistake again. Cause and effect becomes all but lost at this point.
Finally, due to the infrequency of releases, we are not used to dealing with deployment-related issues, and therefore we are surprised and scared every single time we have them.
Manage Risk by Reducing the Impact of a Problem!
What if rather than attempting to minimize the chance that something goes wrong, we instead try to reduce how badly the problems affect us? Ask yourself this, given the choice between having a 1% chance of suffering a heart-attack, or a 100% chance to suffer something that is 1% the strength of a heart-attack, perhaps a flutter, or skipping a beat - which would you pick? I’d definitely go with the latter. In software development, not only is the likelihood of a production-failure more than 1% likely to occur, it does so every quarter or however frequently you release changes.
Agile Risk Management
Agile project managers, whether following Scrum, Kanban, or any other methodology or framework are designed around the following notions:
Software is complicated
Complicated things risk failure
Complexity is directly proportional to risk and to impact of failure
Complexity increases with the size of the workload
Therefore, we design processes that reduce complexity, and thus – impact
We should follow methodologies that allow us to reduce the size of our workload. In Kanban, we focus on single-item flow. In Scrum, we iterate through our entire release process in one month or less. High-performing teams deploy to production small increments of functionality even more frequently, often multiple times per day!
In the next post, I will cover the steps that an organization can take, to reduce the impact of the risks involved with developing software.