Saturday, December 15, 2018

What I Learned About Programming from Forensic Science

I recently listened to the audiobook of Forensics: What Bugs, Burns, Prints, DNA, and More Tell Us About Crime by Val McDermid. The book tells the history of forensic science through captivating stories of its birth, brash adolescence, and evolution into a mature scientific discipline.

EVIDENCE & FAILURE
One of the early stories tells of James Marsh, who in 1832 was called as an expert witness in a murder trial to demonstrate that a sample of coffee had been contaminated with arsenic. Marsh found that arsenic was present. However, by the time he presented to the jury, his sample had degraded, giving rise to reasonable doubt, and the accused walked free. McDermid writes:
“James Marsh was a proper scientist. He regarded this failure as a spur towards success. His response to the embarrassment of his court appearance was to devise a better test.”
This attitude towards failure as an opportunity to learn is also ingrained in many software development and operations frameworks. The story is also a reminder of the ephemeral nature of evidence. When an incident occurs, evidence that may help understand the root cause may not last long in memory and eventually may be lost. If root cause analysis is not treated with urgency after an incident is resolved, the evidence necessary to determine the root cause may vanish.

ASSUMPTIONS vs TRUTH
Another story tells the tale of Bernard Spilsbury, who came to fame in 1910 for his expert testimony in the trial of Hawley Harvey Crippen, accused of murder. McDermid describes Spilsbury as having “a liberal sprinkling of charisma” and as a “handsome, convincing orator”. During the trial “the judge referred to Spilsbury as ‘the greatest living pathologist.’” Spilsbury claimed a skin sample found in Crippen’s home had a scar just like a scar the female murder victim was known to have. However, the defense pointed out that hair follicles in the sample indicated it could not have been scar tissue and thus did not point to the victim. Later DNA analysis cast doubt on whether the sample belonged to a female or was closely related to the victim.

This adolescent period, in which charisma trumped hard science, is a reminder of the responsibility borne by those of us who are looked to as experts. If we have unique expertise or knowledge in a group making a decision, we cannot rely on an adversarial system to help us arrive at the best decision. Instead, we must dedicate ourselves to the truth, and state and challenge our own assumptions.

EVIDENCE COLLECTION KIT
Spilsbury is also known for establishing the “murder bag”, a collection of gloves, tweezers, evidence bags and other equipment to use in homicide investigations. When you approach an investigation of an incident, there is likely a standard set of tools and techniques you should use for investigation. You check event logs, application logs, CPU utilization, memory utilization, and database logs. Checking all these logs and performance counters manually can take a long time. Consider in your work what prepared kits or tools you could use to collect, process, and analyze logs and performance counters for abnormalities to save time in isolating an incident.
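As one small illustration of a prepared kit, here is a minimal Python sketch that summarizes application log lines by severity and keeps the error lines as evidence. The severity markers, the log format, and the sample lines are all assumptions made up for this example; adjust them to match whatever your systems actually emit.

```python
from collections import Counter

# Hypothetical severity markers; ERROR is listed first so it wins
# if a line happens to contain more than one marker.
SEVERITIES = ("ERROR", "WARN", "INFO")


def summarize_log(lines):
    """Count lines per severity and keep the ERROR lines as evidence."""
    counts = Counter()
    errors = []
    for line in lines:
        for severity in SEVERITIES:
            if severity in line:
                counts[severity] += 1
                if severity == "ERROR":
                    errors.append(line.strip())
                break  # count each line once
    return {"counts": dict(counts), "errors": errors}


# Made-up sample lines standing in for a real application log.
sample = [
    "2018-12-15 10:00:01 INFO request served in 120 ms",
    "2018-12-15 10:00:02 ERROR database timeout after 30 s",
    "2018-12-15 10:00:03 WARN memory utilization at 91%",
    "2018-12-15 10:00:04 ERROR database timeout after 30 s",
]

summary = summarize_log(sample)
print(summary["counts"])
for line in summary["errors"]:
    print(line)
```

A real kit would likely read from files or a log aggregator and write the summary somewhere durable, so the evidence outlives the incident.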

RECOMMENDATION
Forensics: What Bugs, Burns, Prints, DNA, and More Tell Us About Crime by Val McDermid is an engaging, captivating read. Learning how criminal investigators refined their tools and techniques over the history of forensic science can help spawn ideas about how to improve investigation and response to IT incidents.

Sunday, July 29, 2018

How the Military Decision Brief Made Me a Better Software Architect

The decision brief template provides an outline that promotes thinking of different ways to solve a problem and considering the trade-offs of different solutions.
 
As a junior software engineer, I tried to figure out how more senior engineers came up with design ideas I never considered. After learning about and using the course of action decision briefing template, I came to recognize that they were thinking of different evaluation criteria for a design and then proposing solutions based on those criteria. They were considering the impact to storage, memory, CPU, scalability, supportability, maintainability, security, effort to implement, and other factors I wasn't considering. Considering these different evaluation criteria would often prompt them to come up with alternative design options.
 
If you are able to define the evaluation criteria for your design, you will better understand the rubric you’ll be graded against before you present it for peer review.


Once you know how your design will be evaluated, you can start asking yourself questions like, “What design would be best for scalability?” or “What design would be the quickest to implement?” Thinking about how to optimize each evaluation criterion one at a time can help you generate different ideas that may not have been obvious. It is easier to come up with ideas if you have a specific problem or constraint you are trying to solve for rather than trying to solve for many considerations all at once. If you over-constrain your thinking, you tend to lose out on those creative big ideas.
 
When faced with a new or complex engineering problem, I find it useful to think about this briefing template and focus in on the evaluation criteria I will use to judge different software architecture or design options. Below is an outline as well as a link to a sample of the version of the decision briefing template I've used. Nowadays I use a simplified template in a document rather than slide format that focuses on each key decision, but I find the slide format useful for learning and understanding the format and way of thinking.
How to Make Cereal for Breakfast Slide
Sample Course of Action Brief
  • Purpose
  • Problem
  • Recommendation
  • Prior Coordination
  • Background Information
  • Facts Bearing on the Problem
  • Assumptions
  • Courses of Action (COA)
  • Screening Criteria
  • Screened COAs
  • Surviving COAs
  • Evaluation Criteria
  • Analysis of Each COA
  • Comparison of COAs
  • Recommendation
  • Decision
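
The "Evaluation Criteria" and "Comparison of COAs" steps in the outline can be approximated with a simple weighted decision matrix. The Python sketch below is only one way to do it; the criteria weights, the two courses of action, and their 1-5 scores are hypothetical values chosen for illustration.

```python
# Hypothetical weights: how much each evaluation criterion matters.
WEIGHTS = {"scalability": 3, "effort to implement": 2, "maintainability": 2}

# Hypothetical 1-5 scores per course of action (higher is better).
COAS = {
    "COA 1: single database": {
        "scalability": 2, "effort to implement": 5, "maintainability": 4,
    },
    "COA 2: sharded database": {
        "scalability": 5, "effort to implement": 2, "maintainability": 3,
    },
}


def weighted_score(scores, weights):
    """Sum of weight * score across all evaluation criteria."""
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


def rank_coas(coas, weights):
    """Return (name, score) pairs, best score first."""
    scored = [(name, weighted_score(s, weights)) for name, s in coas.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_coas(COAS, WEIGHTS)
for name, score in ranking:
    print(f"{score:>3}  {name}")
```

The numbers matter less than the exercise: writing down the criteria and weights forces you to state how the design will be judged before you argue for a favorite.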

Saturday, July 14, 2018

How Foursquare Can Save You from Business Disasters

The simple four square layout can be a powerful decision-making tool. It can help you analyze a problem in two different dimensions to home in on what is most important. The hardest thing about risk management isn’t spending time dreaming up what can go wrong. The hardest part of risk management is determining what risks need to be addressed and in what priority.

Risk = Impact x Likelihood. The higher the impact and the higher the likelihood, the more risk you have. Quantifying impact and likelihood can be difficult and expensive if you try to do so with excessive precision. A matrix like the one below can help you roughly rank your risks relative to each other in terms of likelihood and impact. Once you have ranked your risks relative to each other, you can focus in on the High Impact / High Likelihood risks. Items in the Low Impact / Low Likelihood quadrant do not deserve much attention. You may need to focus some attention on the High Impact / Low Likelihood events and monitor the Low Impact / High Likelihood events.

Road Trip Risks
The matrix above plots potential adverse events for a road trip against the two axes of risk. While we could spend many hours applying actuarial science to calculating probabilities and impacts of each of the consequences in the matrix, we can simply use the relative estimated positions in the chart to quickly focus on which potential consequences merit planning out mitigations. In the High/High quadrant we have “Traffic jam on I-95”, which we may mitigate by leaving early. You could also generate ideas for mitigating the risk of being stopped for speeding, such as avoiding the urge to engage “crazy mode” in your vehicle. Having your car stolen during a road trip would be a very high impact but low likelihood event. Since it’s low likelihood, incurring the expense of buying a second backup car that someone else follows you around with would likely cost much more than the mathematical expectation of losing the car. But since it’s high impact, you might choose to transfer the risk to an insurance company and have a plan to use a rental car if needed.
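
The quadrant placement and the Risk = Impact x Likelihood ranking can be sketched in a few lines of Python. The 1-5 scores and the threshold below are rough guesses invented for this example, not measured values, and "Run out of snacks" is a hypothetical entry added to fill out the list.

```python
def quadrant(impact, likelihood, threshold=3):
    """Place a risk in one of the four squares using a rough 1-5 scale."""
    impact_label = "High" if impact >= threshold else "Low"
    likelihood_label = "High" if likelihood >= threshold else "Low"
    return f"{impact_label} Impact / {likelihood_label} Likelihood"


# (impact, likelihood) on an assumed 1-5 scale; values are illustrative guesses.
RISKS = {
    "Traffic jam on I-95": (4, 5),
    "Car stolen": (5, 1),
    "Run out of snacks": (1, 4),  # hypothetical entry
}

# Rank by Risk = Impact x Likelihood, highest first.
ranked = sorted(RISKS.items(), key=lambda item: item[1][0] * item[1][1], reverse=True)

for name, (impact, likelihood) in ranked:
    print(f"{impact * likelihood:>2}  {quadrant(impact, likelihood)}  {name}")
```

Note that a simple product ranks "Car stolen" low even though its impact is severe, which is why the High Impact / Low Likelihood quadrant still deserves a look rather than relying on the score alone.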

You could also invert this concept to map out various opportunities and rank each on their probability of success and positive impact. 

This exercise can be done individually or in a group. In a group setting, I recommend having everyone write down events on sticky notes. Mark off the grid with painter's tape on the wall. Have each person place their notes in the matrix. After all the notes are placed, go around to each person to ask if they think any changes should be made and continue round robin rounds until the matrix is stable.

Sunday, June 17, 2018

How Cops Taught Me to Manage Software Problems

Police car
Photo by Matt Popovich on Unsplash
Application production support teams are called on to respond to incidents. They investigate incidents by collecting evidence from servers and devices involved in the incident. They identify problem areas where multiple incidents occur and plan how to apply resources to reduce future incidents in those areas. What other profession does this sound like? Police have performed these functions for hundreds of years of human history without the benefit of being able to review consolidated Splunk logs or analyze the source code of suspects.

If you have performance problems and production incidents and you are unsure how to manage the underlying problems that cause them, consider borrowing a management technique used by many police and other civil government organizations called CompStat. CompStat is short for “Compare Statistics” and is based on four key components:
  1. Timely and accurate information or intelligence
  2. Rapid deployment of resources
  3. Effective tactics
  4. Relentless follow-up

I translate these key components for production application support as the following: 

1. Timely and accurate information
To be proactive and respond before a small problem becomes a big problem, you must have a way to detect problems. You need to see trends in performance and monitor failure rates. To properly prioritize response to problems, you need to know what functions are causing the most failures or lag in your system. A good application monitoring program can provide you the quantitative data you need to quickly identify problem areas that cause the most incidents and performance issues. Combining that quantitative data with the impact on your business or mission can give you a good idea of what to address first.
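
As a rough sketch of combining quantitative failure data with business impact, the Python below counts failures per function and weights each count by an impact score. The function names, failure records, and impact weights are all hypothetical; in practice they would come from your monitoring tool and your own judgment of what matters to the business.

```python
from collections import Counter

# Hypothetical failure records, one entry per failed request.
failures = ["checkout", "search", "checkout", "login", "checkout", "search"]

# Hypothetical business impact weights per function (higher is worse).
impact = {"checkout": 5, "search": 2, "login": 3}


def prioritize(failures, impact):
    """Rank functions by failure count multiplied by business impact."""
    counts = Counter(failures)
    # Functions without an assigned weight default to an impact of 1.
    scored = {fn: n * impact.get(fn, 1) for fn, n in counts.items()}
    return sorted(scored.items(), key=lambda pair: pair[1], reverse=True)


priorities = prioritize(failures, impact)
for fn, score in priorities:
    print(f"{score:>3}  {fn}")
```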

2. Rapid response
When you notice small performance problems, do they typically just magically disappear? Probably not. Small performance problems tend to become big performance problems as data and application usage grow. To prevent little problems from becoming big problems, respond quickly. To achieve system stability for your customers, you cannot stop at resolving the incident. You have to determine the underlying problems that lead to incidents and attack those problems with a sense of urgency.

3. Effective tactics, techniques and procedures
Are you able to employ resources to quickly isolate and resolve incidents? After resolving an incident, consider documenting how it was isolated and investing in tools and methods to more quickly detect, isolate, and resolve the same class of problems in the future.

4. Relentless follow-up
Follow-up is the most easily forgotten component but is critical to successfully reducing problems in a system. If you deploy a change to resolve a problem and don’t validate that it actually resolved the problem, you will not know if your efforts are effective. Not following up on the effectiveness of a change can lead to moving on to solving lower priority problems when you have not actually resolved higher priority problems, which might recur at inopportune times. Follow-up can reveal that there were multiple causes that led to an incident, not just the one you fixed. After investing time and money into fixing a problem, reporting the decline in incidents or increase in performance after the changes can help quantify the value provided by fixing problems and increase support for allocating resources to resolve production problems. After I deploy a change, I’ve gotten in the habit of not marking a user story done until after I’ve validated my fix has had the intended impact in production. I add a task to the story for post-deployment verification and set a calendar invite for myself to perform the task a week or two later -- however much time is needed to collect enough evidence to reasonably conclude the issue was fixed. Rushing to mark user stories done does not help quality or system stability. If you want a stable system, you have to follow up to ensure your changes are having the intended impact and not creating new problems.
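
A post-deployment verification task can be as simple as comparing the incident rate for equal windows before and after the change. The Python sketch below assumes you can export incident dates from your ticketing or monitoring system; the deployment date, the two-week windows, and the incident dates themselves are invented for illustration.

```python
from datetime import date


def incidents_per_day(incident_dates, start, end):
    """Average incidents per day in the half-open window [start, end)."""
    days = (end - start).days
    count = sum(1 for d in incident_dates if start <= d < end)
    return count / days


# Hypothetical deployment date and incident history.
deploy = date(2018, 6, 1)
incidents = [
    date(2018, 5, 20), date(2018, 5, 22), date(2018, 5, 25),
    date(2018, 5, 30), date(2018, 6, 5),
]

# Compare two weeks before the deployment with two weeks after it.
before = incidents_per_day(incidents, date(2018, 5, 18), deploy)
after = incidents_per_day(incidents, deploy, date(2018, 6, 15))

print(f"before: {before:.2f}/day, after: {after:.2f}/day")
print("improved" if after < before else "not improved, keep investigating")
```

With counts this small the comparison is only suggestive, which is one reason to wait long enough to collect a reasonable amount of evidence before closing the story.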

Saturday, June 16, 2018

Managing Change the DOTMLPF Way


Most of us understand that if we run out one day and buy a cello, we won’t be able to play music like Yo-Yo Ma that same day. To be able to play something that resembles music you would need training, practice, and a safe place to store the cello. However, sometimes we don’t think about what’s necessary to achieve a new capability beyond just buying something like a new piece of software. To help understand the different types of changes that need to be made for successful technology injection, the US military uses the acronym DOTMLPF-P (pronounced Dot-M-L-P-F-P), which stands for Doctrine, Organization, Training, Materiel, Leadership and education, Personnel, Facilities, and Policy.

Let’s say I want to change a team from doing deployments once a quarter to deploying once a week. Could I just install Jenkins and declare victory? That approach is unlikely to be successful because you will need to plan for personnel to configure Jenkins. You may need to train people on configuring and using Jenkins. You may need an organizational change to form a DevOps team to design and build out the initial deployment pipelines. To get the resources to do this and people focused on it, you will need to provide leadership. You may also need additional equipment like a new server to host Jenkins. You might rethink your change management policy if it can’t accommodate frequent releases. You may need to change the doctrine you follow, the way you develop and test software, by emphasizing smaller, quicker releases, changing your branching structure, using feature flags, and implementing automated tests. For this one change, you had to think about not just obtaining and installing the software but also the doctrine, organization, training, materiel (software & hardware), leadership, personnel, and policy -- seven of the eight components of DOTMLPF-P.
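
One lightweight way to use the acronym is as a checklist that flags which elements a plan has not yet considered. The Python sketch below encodes the Jenkins example; the plan entries paraphrase the paragraph above, and the checklist structure itself is just an assumption about how you might track this.

```python
ELEMENTS = [
    "Doctrine", "Organization", "Training", "Materiel",
    "Leadership and education", "Personnel", "Facilities", "Policy",
]

# The weekly-deployment plan, keyed by DOTMLPF-P element.
plan = {
    "Doctrine": "Smaller releases, feature flags, automated tests",
    "Organization": "Form a DevOps team to build the pipelines",
    "Training": "Train people to configure and use Jenkins",
    "Materiel": "Jenkins plus a new server to host it",
    "Leadership and education": "Secure resources and keep people focused",
    "Personnel": "Assign people to configure Jenkins",
    "Policy": "Revise change management for frequent releases",
}


def unaddressed(plan):
    """Return the DOTMLPF-P elements the plan says nothing about."""
    return [element for element in ELEMENTS if not plan.get(element)]


print("Not yet considered:", unaddressed(plan))
```

An empty or missing entry is not necessarily a problem; it is a prompt to decide deliberately whether that element needs attention.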

Below is the military definition of each element of DOTMLPF-P along with my own translation for a software developer or other IT professional.

Doctrine
Military Definition: “the way we fight (e.g., emphasizing maneuver warfare, combined air-ground campaigns).”
Developer Definition: The high-level approach, based on a set of general principles, that we use to deliver software. Think of agile as guided by the agile manifesto, RUP, or waterfall as different doctrines. Software delivery has predominantly shifted to small multi-disciplinary teams producing small and quick releases to production.

Organization
Military Definition: “how we organize to fight (e.g., divisions, air wings, Marine-Air Ground Task Forces)”
Developer Definition: How we divide into teams to deliver working software (feature teams, project teams, separate functional dev and test teams).

Training
Military Definition: “how we prepare to fight tactically (basic training to advanced individual training, unit training, joint exercises, etc).”
Developer Definition: How we learn to deliver software. Training in programming languages, new technologies, agile practices, security and other software and system engineering competencies. This may include formal training classes like certification bootcamps, peer training, college courses, etc.

Materiel
Military Definition: “all the ‘stuff’ necessary to equip our forces.” When considering a new purchase, the definition is restricted to equipment “that DOES NOT require a new development effort (weapons, spares, test sets, etc that are ‘off the shelf’ both commercially and within the government)” to focus on considering existing solutions to potentially fill a capability gap.
Developer Definition: Your tools. This may include servers, workstations, Visual Studio, Eclipse, Emacs, etc.

Leadership and education
Military Definition: “how we prepare our leaders to lead the fight (squad leader to 4-star general/admiral -  professional development)”
Developer Definition: Leadership is necessary to bring people together to focus on solving a problem. If you are using 10-year-old development tools and techniques, or everyone thinks your processes don’t make sense but nobody does anything about it, it’s probably because your team lacks leadership. Leadership doesn’t have to come from formal managers. Indeed, if you want to move up in the ranks, showing initiative and drive by leading even small changes, like upgrading Visual Studio across the whole development team, goes a long way to distinguish you as someone who is engaged, cares about the team, and can drive change.

Personnel
Military Definition: “availability of qualified people for peacetime, wartime, and various contingency operations”
Developer Definition: The people you have on your team(s).

Facilities
Military Definition: “real property, installations, and industrial facilities (e.g., government owned ammunition production facilities)”
Developer Definition: The physical space where your team works and the physical space where your servers are hosted. If you are hosting servers in your own facility, you have to think about heating and cooling, fire suppression systems, access controls and other physical security issues. If you are forming a new team or rearranging teams you may also think about how your office space is laid out to allow for collaboration and for quiet time.

Policy
Military Definition: “DoD, interagency, or international policy that impacts the other seven non-materiel elements”
Developer Definition: The rules you must follow as part of your software delivery process. For example, all changes must be approved by the Configuration Change Board prior to deploying to production.