In the previous post we discussed risk management, how many of us manage it today by attempting to control the likelihood of failure, and why we should instead focus on reducing the impact of failures, as a way to manage the risks involved with software development.
Below are 5 techniques that software development teams can use to reduce the impact of failures in production:
1. Reduce Batch Sizes
Smaller workloads take less time to complete – in each phase, and altogether. Smaller workloads include less functionality, thus fewer potential flaws, each likely to have a smaller impact on the system as a whole. Small batches are easier to deploy, thus easier to revert – or rather it is easier to design a successful rollback plan, meaning that the recovery time will be shorter.
Less functionality in a deployment unit results in a smaller area of impact. Faster deployments with faster rollback options result in a reduced time to recover, thus reducing the impact on the system.
Reducing the batch size therefore reduces the impact of failure on the system twice over: less of the system is affected, and it is affected for less time.
2. Deploy as Early and Often as Possible
Developers depend on feedback in order to know whether or not they created the right thing, as well as whether or not they created the thing right. Building the system every time changes are made is a good start, and running automated unit tests is even better, but some defects may only be detected in production or a production-like environment!
By shortening the development cycle, reducing the amount of time that a developer must wait before finding out if his or her changes were successful, the developer will:
- have an easier time correcting any mistakes
- have an easier time identifying cause and effect between changes made and defects in production
- have an easier time learning from these mistakes and findings
A shorter feedback cycle therefore results in faster fixes, which reduces the impact of failures, while the increased learning has the happy side-effect of reducing the likelihood of failure.
3. Shift-Left and Automate Audits and Controls
Risk aversion, or fear of failure, is the primary reason that we have so many audits, checks, controls and sign-offs for every single change that we introduce in the system. As mentioned in the previous post, these measures are expensive (time-wise) and ineffective at reducing risk; they are introduced too late in the development lifecycle to serve as an efficient method of reducing risk. Worse, most of these controls delay the deployment of changes into production by days, even weeks, thus lengthening the feedback cycle and effectively increasing both the impact and the likelihood of failure!
See the irony here? The very measures taken to mitigate risk actually make it worse!
By replacing after-the-fact audits and governance boards with automated checks that analyze the code and test it in a production-like environment, and by discussing operational concerns earlier in the planning process, we can move these controls earlier in the lifecycle, even as early as while the developer is coding! Automation also helps reduce the cost and the length of the feedback cycle.
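As a rough illustration of a check that can run while the developer is still coding, here is a minimal sketch of one automated code-analysis rule. The rule itself (flagging `except:` clauses that silently catch everything) and the function name `find_bare_excepts` are assumptions chosen for the example; a real team would encode whatever its own audits currently look for.

```python
import ast
from typing import List


def find_bare_excepts(source: str) -> List[int]:
    """Return the line numbers of `except:` clauses that catch everything.

    Illustrative rule only: a bare `except:` can hide the very failures
    we want to learn about, so a reviewer would normally flag it by hand.
    """
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        # An ExceptHandler with no exception type is a bare `except:`.
        if isinstance(node, ast.ExceptHandler) and node.type is None
    )
```

Wired into an editor plugin, a pre-commit hook or the build, a check like this reports the problem within seconds of it being written, instead of days later at a sign-off meeting.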
With earlier and faster warnings, fewer quality issues will reach production. Those that do get through these measures are likely to be the smaller and less significant ones. This means that both impact and likelihood will be reduced.
4. Decouple Sub-Systems
Huge monoliths might be easy to develop, and often offer the greatest performance. Unfortunately, they come with inherent disadvantages:
- It is difficult or impossible to deliver changes to only part of a monolith; monolith deployments are usually an all-or-nothing endeavor.
- Monoliths are tightly coupled, and are often designed in such a way that if one part of a workflow fails, the entire workflow crashes.
By decoupling the architecture, separating steps into individual components that communicate with each other asynchronously, using message queues or event-based communication systems, we can:
- Deploy each component separately, ensuring that defects introduced into one component are isolated from the others, which limits the impact of a failure to that one component.
- Flag defective messages for service teams to handle rather than fail the whole workflow, notify the user that completion is delayed, and eventually complete each flow when services are restored, thus reducing the impact of failure even further.
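The second point can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: the names `drain`, `replay` and `dead_letters` are hypothetical, and a real system would use a message broker rather than a Python `deque`.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


def drain(inbox: Deque[Dict], handle: Callable[[Dict], None],
          dead_letters: List[Dict]) -> int:
    """Process every queued message; park failures instead of aborting.

    Returns the number of messages that were handled successfully.
    """
    completed = 0
    while inbox:
        message = inbox.popleft()
        try:
            handle(message)
            completed += 1
        except Exception as error:
            # One bad message (or one unavailable service) must not stop
            # the rest of the queue: flag it for the service team instead.
            dead_letters.append({"message": message, "error": str(error)})
    return completed


def replay(dead_letters: List[Dict], inbox: Deque[Dict]) -> None:
    """Put parked messages back on the inbox once service is restored."""
    while dead_letters:
        inbox.append(dead_letters.pop(0)["message"])
```

While a downstream service is unavailable, `drain` leaves the affected messages in `dead_letters` and keeps going; once the service is back, `replay` followed by another `drain` completes the delayed flows.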
As a bonus, decoupled systems are much easier to scale as demand for services increases. A whole class of problems can be completely avoided by architecting loosely coupled systems.
As for performance concerns, make sure that you are developing for good enough performance, rather than best. Remember that Good Enough is by definition – good enough.
5. Continuously Improve the Definition of Done
Whether your development and operations team(s) use Scrum, Kanban or any other agile methodology or framework to drive the product, the key to successful risk management is to uphold and improve the quality level you demand for anything that you develop and deploy to production.
Following any or all of the aforementioned techniques will greatly reduce the risk to your production pipeline, but will never eliminate it entirely.
The most important technique is to make sure that the same issue does not cause a failure twice. Any failure that does get through whatever quality measures you already have in place must be analyzed, and you must figure out how to make sure that this class of problems never goes uncaught again.
By rigorously applying this technique, you will be able to continuously improve your quality controls, ensuring that failures consistently grow smaller until they are no more than a nuisance.
Conclusion
Nobody and nothing is perfect. However, some are closer to perfection than others. If any or all of these ideas are new to you, I would highly recommend that you start with your definition of done: look at the most harmful failures that you have recently had, and identify the measure that would be most valuable for you to introduce into your production line!
And then move on to the next one.