The Power of Upstream Thinking
When we are presented with a problem, it’s natural to focus on solving it immediately. If your sink is overflowing, you call a plumber to fix it. If you break your arm, you go to the emergency room and get a cast. If your critical piece of software goes down, you work through the night to bring it back up.
It’s not as if it’s bad to solve these problems, but they all reflect what I call downstream solutions. All the problems above are likely the result of some underlying cause, which we will ignore if we only focus on the immediate problem. Perhaps your sink got clogged because you keep dumping coffee grounds into it. Perhaps you broke your arm because you were skateboarding without protective gear. Perhaps your critical software went down because it was architected in a fragile way.
Upstream thinking is the idea of looking past downstream problems to identify their underlying causes (you could think of this as related to Toyota’s “five whys” approach), and working to fix those instead. Let’s take a look at some common symptoms of upstream problems:
Frequent breakages. If you are constantly putting out fires, there is almost always some underlying problem at play. Look for commonalities across the issues and you will likely find a few recurring factors. For example, perhaps untested changes were shipped to production. In that case you need more robust testing and CI. Or perhaps your service was overloaded by a traffic spike. In that case you need better throttling or scaling. The exact solution will of course depend on the situation, but the key is that robust systems shouldn’t need constant tending. Take some time away from oncall duties and spend it on thinking of ways to improve the system.
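One way to look for commonalities is to tag each incident with its contributing causes and count which tags recur. The sketch below assumes a hypothetical incident record shaped like a dict with a "causes" list; the field names, the tags, and the common_causes helper are illustrative, not taken from any particular incident tracker.

```python
from collections import Counter


def common_causes(incidents, top=3):
    """Return the most frequent cause tags across incidents.

    incidents: iterable of dicts, each with a "causes" list of string tags
    (a hypothetical record shape chosen for this example).
    """
    counts = Counter(tag for incident in incidents for tag in incident["causes"])
    return counts.most_common(top)


# Made-up sample data, purely for illustration.
incidents = [
    {"id": 1, "causes": ["untested change"]},
    {"id": 2, "causes": ["traffic spike"]},
    {"id": 3, "causes": ["untested change", "missing rollback"]},
    {"id": 4, "causes": ["untested change"]},
    {"id": 5, "causes": ["traffic spike"]},
]

for cause, count in common_causes(incidents, top=2):
    print(f"{cause}: {count} incidents")
```

The causes at the top of the list are the ones most worth fixing upstream, since each fix there removes a whole class of future fires.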
Too big a backlog. A problem on every team I’ve worked on has been an oversized backlog. Every team seems to have hundreds of items they’d theoretically like to do, but clearly won’t have enough time to get to. Typically this is addressed by some kind of “fix week” or other event to reduce the backlog. But inevitably this only handles a small percentage of the items, and the backlog keeps growing. Instead we should take an upstream approach. Why are there too many backlog items? Is it because people are spending time on the wrong things (this is unlikely)? Or is it because too many low priority things are added that will never realistically get fixed (probably this)? In the latter case there needs to be more vigilance about what is allowed to be added. The backlog should only hold an amount of work that can realistically get done.
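As a rough illustration of that kind of vigilance, here is a minimal sketch of a backlog with a hard cap. The BoundedBacklog class, its capacity, and the numeric priorities are all assumptions made for this example. A new item is admitted only if there is room or if it outranks the lowest priority item already present, in which case that item is dropped.

```python
import heapq


class BoundedBacklog:
    """A backlog that never holds more than `capacity` items (hypothetical helper)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._heap = []  # min-heap of (priority, title); lowest priority on top

    def add(self, priority, title):
        """Try to admit an item.

        Returns None if the item was admitted with room to spare, otherwise
        the (priority, title) pair that was rejected or evicted.
        """
        item = (priority, title)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
            return None
        if item <= self._heap[0]:
            return item  # not important enough to displace anything
        # Admit the new item and evict the current lowest priority one.
        return heapq.heappushpop(self._heap, item)

    def items(self):
        """Items from highest to lowest priority."""
        return sorted(self._heap, reverse=True)
```

The point of returning the dropped item is to make the tradeoff explicit: adding something new means consciously deciding what will not get done.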
Too many alerts. Another ubiquitous problem is that teams have too many alerts. You get 10+ messages every day that some value is not where it is expected to be. As a result everyone quickly learns to ignore them. And so when there is a real issue, the alert that detected it gets lost in the noise. Ignoring alerts just ignores the downstream symptom. The upstream problem is that alerts are not actionable a high enough percentage of the time. Merely informational alerts or those with many false positives are worse than nothing, because they drown out high-signal alerts. Try having fewer but higher-signal alerts, and keep a very high bar for what can qualify as an alert, culling any that stop meeting it.
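To make that bar concrete, you could track how often each alert turned out to be actionable and flag the ones that fall short. The sketch below assumes you record each firing as an (alert name, was actionable) pair; the 80% threshold, the minimum sample size, and the alerts_to_cull helper are illustrative choices, not a standard.

```python
from collections import defaultdict


def alerts_to_cull(firings, min_actionable_rate=0.8, min_firings=5):
    """Return (alert_name, actionable_rate) pairs that fall below the bar.

    firings: iterable of (alert_name, was_actionable) pairs.
    Alerts with fewer than `min_firings` firings are skipped, since there
    is too little data to judge them.
    """
    total = defaultdict(int)
    actionable = defaultdict(int)
    for name, was_actionable in firings:
        total[name] += 1
        if was_actionable:
            actionable[name] += 1

    cull = []
    for name in sorted(total):
        if total[name] < min_firings:
            continue
        rate = actionable[name] / total[name]
        if rate < min_actionable_rate:
            cull.append((name, rate))
    return cull
```

Running a check like this on a regular schedule turns "keep a high bar" from a good intention into a habit.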
Missed deadlines. Projects often have deadlines, and all too often these get pushed back or missed completely. As a deadline nears, it may be tempting to add more engineers to a project to get it back on schedule. Of course many of us know that “adding manpower to a late software project makes it later” (Brooks’s Law). But since there seems to be no alternative, we do it anyway. Instead we should consider the upstream actions that led to the missed deadline. This might be that the initial time estimate was bad (the “planning fallacy”). It could be that too much time was spent on operational work (see the above three items). It could be that not enough time was spent on designing the system, so that it wasn’t easy to build. Or it could be that too much time was spent planning, not leaving enough time to actually build it. Or perhaps the right people weren’t involved, or weren’t involved early enough. These are all things that should be considered at the start of a project, but unfortunately they are often only examined at the end, when it’s too late.
In all these cases it’s tempting to treat the symptoms directly, and that’s what often occurs. Still, in the long run it’s more effective to “move upstream” and prevent the problem from recurring. Whenever you encounter a problem, it helps to ask “what was the underlying cause?” and “how can I prevent this from happening again?” These questions can help you direct your efforts upstream.