Sunday, June 17, 2018

How Cops Taught Me to Manage Software Problems

Police car
Photo by Matt Popovich on Unsplash
Application production support teams are called on to respond to incidents. They investigate incidents by collecting evidence from servers and devices involved in the incident. They identify problem areas where multiple incidents occur and plan how to apply resources to reduce future incidents in those areas. What other profession does this sound like? Police have performed these functions for hundreds of years of human history without the benefit of being able to review consolidated Splunk logs or analyze the source code of suspects.

If you have performance problems and production incidents and you are unsure how to manage the underlying problems that cause them, consider borrowing a management technique used by many police and other civil government organizations called CompStat. CompStat is short for “Compare Statistics” and is based on four key components:
  1. Timely and accurate information or intelligence
  2. Rapid deployment of resources
  3. Effective tactics
  4. Relentless follow-up

For production application support, I translate these key components as follows:

1. Timely and accurate information
To be proactive and respond before a small problem becomes a big problem, you must have a way to detect problems. You need to see trends in performance and monitor failure rates. To properly prioritize response to problems, you need to know what functions are causing the most failures or lag in your system. A good application monitoring program can provide you the quantitative data you need to quickly identify problem areas that cause the most incidents and performance issues. Combining that quantitative data with the impact on your business or mission can give you a good idea of what to address first.
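As a rough illustration of combining quantitative failure data with business impact, here is a minimal sketch. The function names, failure events, and impact weights are all made up for the example; in practice the events would come from whatever monitoring tool you use, and the weights from your own judgment of business or mission impact.

```python
from collections import Counter

# Hypothetical failure events exported from a monitoring tool:
# one entry per failure, named by the application function that failed.
failure_events = (
    ["checkout"] * 3 + ["search"] * 2 + ["login"] * 1 + ["report_export"] * 4
)

# Hypothetical business-impact weights (higher means a failure hurts more).
impact_weights = {"checkout": 5, "login": 4, "search": 3, "report_export": 2}


def rank_problem_areas(events, weights, default_weight=1):
    """Return (function, failures, score) tuples, highest score first.

    The score is simply failure count multiplied by the impact weight.
    """
    counts = Counter(events)
    scored = [
        (name, failures, failures * weights.get(name, default_weight))
        for name, failures in counts.items()
    ]
    return sorted(scored, key=lambda row: row[2], reverse=True)


ranking = rank_problem_areas(failure_events, impact_weights)
for name, failures, score in ranking:
    print(f"{name}: {failures} failures, priority score {score}")
```

With these made-up numbers, report_export fails most often, but checkout ranks first because each checkout failure is weighted as more costly. Multiplying count by weight is only one possible scoring choice; the point is to rank by impact rather than by raw counts alone.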

2. Rapid response
When you notice small performance problems, do they typically just magically disappear? Probably not. Small performance problems tend to become big performance problems as data and application usage grow. To prevent little problems from becoming big problems, respond quickly. To achieve system stability for your customers, you cannot stop at resolving the incident. You have to determine the underlying problems that lead to incidents and attack those problems with a sense of urgency.

3. Effective tactics, techniques and procedures
Are you able to employ resources to quickly isolate and resolve incidents? After resolving an incident, consider documenting how it was isolated and investing in tools and methods to more quickly detect, isolate, and resolve the same class of problems in the future.

4. Relentless follow-up
Follow-up is the most easily forgotten component but is critical to successfully reducing problems in a system. If you deploy a change to resolve a problem and don’t validate that it actually resolved the problem, you will not know if your efforts are effective. Skipping that check can lead you to move on to lower priority problems while higher priority problems remain unresolved and may recur at inopportune times. Follow-up can reveal that multiple causes led to an incident, not just the one you fixed. After investing time and money in fixing a problem, reporting the decline in incidents or the increase in performance helps quantify the value of the fix and builds support for allocating resources to resolve production problems.

I’ve gotten in the habit of not marking a user story done until I’ve validated that my fix has had the intended impact in production. I add a task to the story for post-deployment verification and set a calendar invite for myself to perform the task a week or two later, or however much time is needed to collect enough evidence to reasonably conclude the issue was fixed. Rushing to mark user stories done does not help quality or system stability. If you want a stable system, you have to follow up to ensure your changes are having the intended impact and not creating new problems.
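A post-deployment verification task can be as simple as comparing the incident rate before and after the change. The sketch below assumes you can export the dates of incidents for the affected function; the dates, the deployment day, and the two-week window are hypothetical values chosen for illustration.

```python
from datetime import date, timedelta


def incidents_per_day(incident_dates, start, end):
    """Average incidents per day in the half-open window [start, end)."""
    days = (end - start).days
    if days <= 0:
        raise ValueError("end must be after start")
    count = sum(1 for d in incident_dates if start <= d < end)
    return count / days


def verify_fix(incident_dates, deployed_on, window_days=14):
    """Compare incident rates in equal windows before and after a deployment."""
    window = timedelta(days=window_days)
    before = incidents_per_day(incident_dates, deployed_on - window, deployed_on)
    after = incidents_per_day(incident_dates, deployed_on, deployed_on + window)
    return {"before": before, "after": after, "improved": after < before}


# Hypothetical data: seven incidents in the two weeks before the fix, one after.
deployed_on = date(2018, 6, 1)
incident_dates = [
    date(2018, 5, 18), date(2018, 5, 20), date(2018, 5, 22), date(2018, 5, 25),
    date(2018, 5, 27), date(2018, 5, 29), date(2018, 5, 31),
    date(2018, 6, 9),
]

result = verify_fix(incident_dates, deployed_on)
print(f"before: {result['before']:.2f}/day, after: {result['after']:.2f}/day, "
      f"improved: {result['improved']}")
```

A simple rate comparison like this does not prove the fix caused the improvement, since usage may also have changed, but it gives you a number to report and a prompt to dig deeper when incidents have not declined.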

Saturday, June 16, 2018

Managing Change the DOTMLPF Way


Most of us understand that if we run out one day and buy a cello, we won’t be able to play music like Yo-Yo Ma that same day. To be able to play something that resembles music, you would need training, practice, and a safe place to store the cello. However, sometimes we don’t think about what’s necessary to achieve a new capability beyond just buying something like a new piece of software. To help understand the different types of changes that need to be made for successful technology injection, the US military uses the acronym DOTMLPF-P (pronounced Dot-M-L-P-F-P), which stands for Doctrine, Organization, Training, Materiel, Leadership and Education, Personnel, Facilities, and Policy.

Let’s say I want to change a team from deploying once a quarter to deploying once a week. Could I just install Jenkins and declare victory? That approach is unlikely to succeed because you will need to plan for personnel to configure Jenkins. You may need to train people on configuring and using Jenkins. You may need an organizational change to form a DevOps team to design and build out the initial deployment pipelines. To get the resources to do this and people focused on it, you will need to provide leadership. You may also need additional equipment like a new server to host Jenkins. You might rethink your change management policy if it can’t accommodate frequent releases. You may need to change the doctrine you follow, the way you develop and test software, by emphasizing smaller, quicker releases, changing your branching structure, using feature flags, and implementing automated tests. For this one change, you had to think about not just obtaining and installing the software but also the doctrine, organization, training, materiel (software and hardware), leadership, personnel, and policy: seven of the eight components of DOTMLPF-P.

Below is the military definition of each element of DOTMLPF-P along with my own translation for a software developer or other IT professional.

Doctrine
Military Definition: “the way we fight (e.g., emphasizing maneuver warfare, combined air-ground campaigns).”
Developer Definition: The high-level approach, based on a set of general principles, that we use to deliver software. Think of agile (as guided by the agile manifesto), RUP, and waterfall as different doctrines. Software delivery has predominantly shifted to small multi-disciplinary teams producing small and quick releases to production.

Organization
Military Definition: “how we organize to fight (e.g., divisions, air wings, Marine-Air Ground Task Forces)”
Developer Definition: How we divide into teams to deliver working software (feature teams, project teams, separate functional dev and test teams).

Training
Military Definition: “how we prepare to fight tactically (basic training to advanced individual training, unit training, joint exercises, etc).”
Developer Definition: How we learn to deliver software. Training in programming languages, new technologies, agile practices, security, and other software and system engineering competencies. This may include formal training classes like certification bootcamps, peer training, college courses, etc.

Materiel
Military Definition: “all the ‘stuff’ necessary to equip our forces.” When considering a new purchase, the definition is restricted to equipment “that DOES NOT require a new development effort (weapons, spares, test sets, etc that are ‘off the shelf’ both commercially and within the government)” to focus on considering existing solutions to potentially fill a capability gap.
Developer Definition: Your tools. This may include servers, workstations, Visual Studio, Eclipse, Emacs, etc.

Leadership and education
Military Definition: “how we prepare our leaders to lead the fight (squad leader to 4-star general/admiral -  professional development)”
Developer Definition: Leadership is necessary to bring people together to focus on solving a problem. If you are using 10-year-old development tools and techniques, or everyone thinks your processes don’t make sense but nobody does anything about it, it’s probably because your team lacks leadership. Leadership doesn’t have to come from formal managers. Showing initiative and drive by leading even small changes, like upgrading Visual Studio across the whole development team, goes a long way toward distinguishing you as someone who is engaged, cares about the team, and can drive change, which also helps you move up in the ranks.

Personnel
Military Definition: “availability of qualified people for peacetime, wartime, and various contingency operations”
Developer Definition: The people you have on your team(s).

Facilities
Military Definition: “real property, installations, and industrial facilities (e.g., government owned ammunition production facilities)”
Developer Definition: The physical space where your team works and the physical space where your servers are hosted. If you are hosting servers in your own facility, you have to think about heating and cooling, fire suppression systems, access controls, and other physical security issues. If you are forming a new team or rearranging teams, you may also think about how your office space is laid out to allow for both collaboration and quiet time.

Policy
Military Definition: “DoD, interagency, or international policy that impacts the other seven non-materiel elements”
Developer Definition: The rules you must follow as part of your software delivery process. For example, all changes must be approved by the Configuration Change Board prior to deploying to production.