Sunday, June 17, 2018

How Cops Taught Me to Manage Software Problems

[Image: police car. Photo by Matt Popovich on Unsplash]
Application production support teams are called on to respond to incidents. They investigate incidents by collecting evidence from the servers and devices involved. They identify problem areas where multiple incidents occur and plan how to apply resources to reduce future incidents in those areas. What other profession does this sound like? Police have performed these functions for hundreds of years, without the benefit of consolidated Splunk logs or the ability to analyze the source code of suspects.

If you have performance problems and production incidents and you are unsure how to manage the underlying problems that cause them, consider borrowing a management technique used by many police and other civil government organizations called CompStat. CompStat is commonly said to be short for “Compare Statistics” and is based on four key components:
  1. Timely and accurate information or intelligence
  2. Rapid deployment of resources
  3. Effective tactics
  4. Relentless follow-up

I translate these key components for production application support as follows:

1. Timely and accurate information
To be proactive and respond before a small problem becomes a big problem, you must have a way to detect problems. You need to see trends in performance and monitor failure rates. To properly prioritize your response, you need to know which functions are causing the most failures or lag in your system. A good application monitoring program can give you the quantitative data you need to quickly identify the problem areas that cause the most incidents and performance issues. Combining that quantitative data with the impact on your business or mission can give you a good idea of what to address first.
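As a rough illustration of combining failure counts with business impact, here is a minimal Python sketch. The function names in the data, the impact weights, and the scoring rule (failure count multiplied by weight) are all assumptions I made up for the example; your monitoring tool and your own sense of business impact would supply the real inputs.

```python
from collections import Counter


def rank_problem_areas(failures, impact_weights, default_weight=1.0):
    """Rank application functions by failure count times business impact.

    failures: iterable of function names, one entry per observed failure
              (hypothetical data, e.g. exported from a monitoring tool).
    impact_weights: dict mapping function name to a business-impact weight
                    (hypothetical values chosen by the team).
    Returns a list of (function_name, score) sorted from highest to lowest.
    """
    counts = Counter(failures)
    scores = {
        name: count * impact_weights.get(name, default_weight)
        for name, count in counts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical example: search fails most often, but checkout failures
# matter more to the business, so checkout rises to the top.
observed_failures = ["checkout", "search", "search", "search", "login"]
weights = {"checkout": 5.0, "search": 1.0, "login": 2.0}

for name, score in rank_problem_areas(observed_failures, weights):
    print(name, score)
```

A simple multiplication like this is only a starting point for a conversation about priorities. It is not a substitute for judgment about which failures hurt customers the most.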

2. Rapid response
When you notice small performance problems, do they typically just magically disappear? Probably not. Small performance problems tend to become big performance problems as data and application usage grow. To prevent little problems from becoming big problems, respond quickly. To achieve system stability for your customers, you cannot stop at resolving the incident. You have to determine the underlying problems that lead to incidents and attack those problems with a sense of urgency.

3. Effective tactics, techniques and procedures
Are you able to employ resources to quickly isolate and resolve incidents? After resolving an incident, consider documenting how it was isolated and investing in tools and methods to more quickly detect, isolate, and resolve the same class of problems in the future.

4. Relentless follow-up
Follow-up is the most easily forgotten component, but it is critical to reducing problems in a system. If you deploy a change to resolve a problem and don’t validate that it actually resolved the problem, you will not know whether your efforts were effective. Without that check, you can move on to lower-priority problems while a higher-priority problem remains unresolved and may recur at an inopportune time. Follow-up can also reveal that multiple causes led to an incident, not just the one you fixed. After investing time and money in fixing a problem, reporting the decline in incidents or the increase in performance helps quantify the value of the fix and builds support for allocating resources to production problems.

After I deploy a change, I’ve gotten in the habit of not marking a user story done until I’ve validated that my fix has had the intended impact in production. I add a task to the story for post-deployment verification and set a calendar invite for myself to perform the task a week or two later, or however much time is needed to collect enough evidence to reasonably conclude the issue was fixed. Rushing to mark user stories done does not help quality or system stability. If you want a stable system, you have to follow up to ensure your changes are having the intended impact and not creating new problems.
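One simple way to do that post-deployment check is to compare the incident rate in a window before the deployment with the rate in an equal window after it. The Python sketch below assumes you can export incident dates from your ticketing or monitoring system. The function names and the 14-day default window are hypothetical choices for illustration, not a prescribed method.

```python
from datetime import date, timedelta


def incidents_per_day(incident_dates, start, end):
    """Average incidents per day in the half-open window [start, end)."""
    days = (end - start).days
    if days <= 0:
        raise ValueError("window must span at least one day")
    count = sum(1 for d in incident_dates if start <= d < end)
    return count / days


def compare_before_after(incident_dates, deploy_date, window_days=14):
    """Return (rate_before, rate_after) around a deployment date.

    window_days is an assumed default; pick however much time you need
    to collect enough evidence for the kind of problem you fixed.
    """
    window = timedelta(days=window_days)
    before = incidents_per_day(incident_dates, deploy_date - window, deploy_date)
    after = incidents_per_day(incident_dates, deploy_date, deploy_date + window)
    return before, after


# Hypothetical data: seven incidents before the fix, one after it.
deploy = date(2018, 6, 1)
dates = [date(2018, 5, 20) + timedelta(days=i) for i in range(7)]
dates.append(date(2018, 6, 5))

rate_before, rate_after = compare_before_after(dates, deploy)
print(f"before: {rate_before:.2f}/day, after: {rate_after:.2f}/day")
```

A lower rate after the change is evidence, not proof. Usage patterns or other deployments in the same window can also move the numbers, so treat the comparison as one input to the verification task.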
