All computer systems have technical risks. It’s impractical to design risk completely out of a system. Since we don’t have unlimited time or resources, it’s important that we be methodical about how we prioritize investments to mitigate that risk.
As engineers we go to great lengths to measure many aspects of our systems (latency, error rates, health checks, and so on), but too often we rely on intuition to make decisions about risk. Systematically surfacing and understanding your technical risks is essential to ensuring you prioritize mitigating your highest risks first.
What is the risk matrix?
I first heard of the risk matrix from the book Architecting for Scale by Lee Atchison. The idea is pretty simple. Brainstorm and document all your technical risks, categorizing each risk on two important dimensions: likelihood and severity. The likelihood is the chance of the risk occurring. The severity is the impact when the risk occurs.
Once you’ve put the list together, you can sort it from most to least likely and severe. This exercise works well for a variety of contexts: a project, your team’s services, a department, or even an entire company.
The risk matrix is a living document. It’s meant to be revisited periodically. How often you revisit it depends on your situation and the purpose you are creating it for.
What constitutes a risk?
There are different types of technical risks. Here are a few:
Availability risk
What endangers the availability of your system? For example, in a microservices environment, could an outage in an upstream system that you depend on take down your service?
Some mitigations to an availability risk might include chaos engineering, capacity planning, or re-architecture.
Knowledge silo/bus-factor risk
When tribal or specialized knowledge is concentrated in a small number of individuals, you have a risk. What happens if something goes wrong in the middle of the night with a system and your only expert is sick or on vacation? Do you have sufficient knowledge across your team to deal with it? Is the person on-call likely to be able to triage the problem effectively?
Some mitigations for knowledge risk might include hiring, training, improving documentation, pair programming, or other knowledge sharing opportunities.
Release predictability risk
Are there threats to your ability to predictably release software? Some examples of risks in this category include: ongoing operational toil, a large suite of flaky tests, or unsustainable interruptions from outside your team.
Development velocity risk
Are there areas where significant improvements can be made to your development velocity? Some examples might include: investing in platform capabilities, re-factoring legacy code, better tooling, or better process.
Besides the benefits of actually surfacing your risks, categorizing and sorting your risks by likelihood and severity helps you avoid some common thinking errors when it comes to risk.
How do we get priorities wrong?
Human beings are susceptible to many thinking errors. Here are a few related to assessing risk:
Recency bias
There is a tendency to look in the rear-view mirror at the most recent problems rather than the highest-risk problems. For example, you might prioritize a fix for an issue that paged you a few times in the last month over a more serious problem that hasn’t happened yet.
Neglect of probability bias
People have a tendency to over-estimate or ignore the actual probability of events. Not all failure cases are created equal. You want to be addressing the most probable failure cases first.
A special case of the neglect of probability bias is a pre-occupation with so-called black swan events. When designing a new service we become obsessed with scenarios like: “What happens if an entire AWS region goes down?”
One of the nodes in your cluster is much more likely to go down than an entire AWS region. Have you designed and tested for that case appropriately? What about a broker going down in your messaging system? And so on.
It’s not that the other case isn’t a risk or isn’t important, it’s just much less likely. And you might just be able to live with that risk depending on your risk tolerance and investment appetite. Different types of organizations have vastly different risk tolerances. Not every company needs to aim for 99.99% availability. It just may not be that important or practical for your business. Everything is a trade-off.
Working with the risk matrix
Once you’ve created your list, for each risk ask yourself: Would I be OK with not mitigating this risk for the next 6 months? 12 months? 18 months?
It’s important to recognize that you don’t necessarily have to eliminate a risk, just mitigate it. For example, if you rated a risk to be of medium severity, it might be well worth your time to invest in reducing its impact when it occurs. This might take the severity from medium to low. The same can be said for likelihood.
A risk matrix template
The book includes a handy risk matrix template. You can easily copy it into a spreadsheet and sort by a basic risk score, the sum of the two dimensions, to produce a crude ranking.
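If you prefer code to a spreadsheet, the same crude ranking can be sketched in a few lines. This is a minimal illustration, not the book’s template: the 1–3 numeric scale for the high/medium/low buckets and the example risks are assumptions for the sake of the demo.

```python
from dataclasses import dataclass

# Map the coarse high/medium/low buckets to numbers so we can sort.
# The 1-3 scale is an illustrative assumption, not from the book.
LEVELS = {"low": 1, "medium": 2, "high": 3}

@dataclass
class Risk:
    description: str
    likelihood: str  # "low" | "medium" | "high"
    severity: str    # "low" | "medium" | "high"

    @property
    def score(self) -> int:
        # Crude risk score: the sum of the two dimensions.
        return LEVELS[self.likelihood] + LEVELS[self.severity]

# Hypothetical risks for illustration only.
risks = [
    Risk("Single on-call expert for the billing service", "medium", "high"),
    Risk("Flaky integration test suite delays releases", "high", "medium"),
    Risk("Entire AWS region outage", "low", "high"),
]

# Rank from highest to lowest risk score.
for risk in sorted(risks, key=lambda r: r.score, reverse=True):
    print(f"{risk.score}  {risk.likelihood}/{risk.severity}  {risk.description}")
```

Because the score only distinguishes a handful of values, ties are common; as the post notes below, treat this as a qualitative ordering rather than a precise measurement.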
Qualitative vs quantitative
It’s important to recognize that the risk matrix is not a be-all and end-all for evaluating risk. The risk matrix’s value is primarily qualitative; it helps you surface and categorize risks in a systematic way you might not do otherwise.
But don’t let the risk categorizations or scores fool you into a sense of precision or accuracy. Ranking based on a coarse bucketing such as high/medium/low is obviously not very robust. If you want to prioritize in a robust quantitative way, you should look to use an approach with more rigor.
Next time you plan technical investments, consider creating a risk matrix to surface and understand your risks first. It might even help you get alignment with other stakeholders on the need for an important technical investment you’ve been finding it hard to get buy-in for.
It’s impractical to build systems without risk. Understand your risks so that you can mitigate the most important ones, and make the deliberate choice to live with the rest. You’ll sleep better at night for it.
Nice post, but I noticed one easily-made mistake. Dependencies are on upstream systems, not downstream, unless water flows differently in your world?
Thanks. I updated the post with that correction.