Incident Investigation and Root Cause Analysis
The material presented on this page is extracted from the ebook Incident Investigation and Root Cause Analysis.
Yet, in spite of their undoubted value, these predictive techniques do have the following limitations:
Actual incidents, by contrast, provide hard information as to how things can go wrong, thus helping to cut through wishful thinking, prejudice, ignorance and misunderstandings. The root cause analysis that follows an incident investigation will help identify weaknesses and limitations in a facility's management system, thereby reducing the chance of recurrence of similar incidents.
Another reason for emphasizing the importance of incident investigation is that process safety management (PSM) systems - of which Incident Investigation and Analysis constitutes one element - have been in place in many cases for more than fifteen years. Many of these facilities have made good progress in meeting regulatory requirements. However, the fact that such systems can "survive an audit" and are working well on paper does not mean that they are as effective at actually improving safety as they might be. Incident investigations help identify how the elements of PSM really are functioning, and can provide management with insights as to how the PSM program can be improved.
Incident Investigation and Analysis Methods
Publications in the field of incident investigation and analysis often promote a particular methodology with the implicit claim that their approach is better than the methods promulgated by other organizations. Such publications are often commercial in their approach, thus tending to create a concern in the mind of the reader as to the objectivity of the materials that are presented.
This site does not advocate or promote any particular methodology. Indeed, it is suggested here that an effective incident investigation and analysis requires much more than the mere application of a particular investigation technique. Equally important - maybe more so - is the ability on the part of the investigators to inculcate an atmosphere of trust and confidence with everyone with whom they work - not only those involved in the incident itself, but also the managers who will be charged with taking appropriate corrective actions. Each analytical technique has its strengths and weaknesses - an effective investigation will use a judicious mix of approaches as circumstances dictate.
Therefore, rather than stress the use of just one particular analytical method, this ebook suggests that a successful investigation should be conducted using the six strategies and techniques listed below and shown in Figure 1.
Trust and Candor
The most important feature of a successful investigation is the establishment of trust between the investigators - who are not interrogators - and the persons involved in the incident itself. In one instance a technician whose actions had contributed to an injury event approached his boss twenty-four hours after the interviews had been concluded; he voluntarily reported that a valve that should have been open at the time of the incident was actually closed. Without that information it is unlikely that the investigating team would ever have fully understood what happened. The technician was not the only person who was candid. His boss, who had twenty-five years of experience with the equipment involved, took the initiative to work out the complex sequence of events that led to the incident, even though the upshot was to make both his company and himself look more accountable for the event. The integrity and candor displayed by these two persons showed that the investigation process had gone well (and neither of them got into trouble).
It is also important to establish trust with the managers of the facility where the incident occurred. This can be done by ensuring that the investigation team keeps management fully informed at each stage in the investigation process. Thus the project becomes "our investigation", not "their investigation".
Listen to the Facts
Many incident investigators are intelligent, highly experienced and not lacking in self-confidence. Although these attributes are important, they can get in the way of simply listening to the facts. For example, during one investigation, the team members noted that a piece of equipment was damaged. By assuming that the damage occurred immediately prior to the event, they developed a plausible explanation as to what happened. Unfortunately for the credibility of the team members who had jumped to this (incorrect) conclusion, a manager who arrived at the site a few hours after the incident noted that the equipment had not been damaged at that time; therefore the damage must have occurred when the affected equipment item was being removed for inspection and repair. This inconvenient fact overturned the investigators' elegant and satisfying analysis.
An investigator must always be thorough - particularly when he or she thinks that the investigation is complete, and no more fact finding work is needed. For example, on one investigation the equipment involved in the event was moved from its location in the field to the vendor's yard. The lead investigator felt that there was really no point in going to the site of the incident because there would be nothing to be learned. Nevertheless, he did visit the site as a matter of duty. When he did so he uncovered some new information that led to a basic reassessment as to how serious the event could have been.
Cause and Effect
Virtually all investigations include the development of a timeline. By ordering the events sequentially it becomes possible to determine their causes, and then the causes of those causes.
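As a minimal sketch of this idea, the Python fragment below orders a handful of events by time and then walks back from one event through its recorded causes. The Event fields, the caused_by link and the sample events are illustrative assumptions made up for this example; they are not part of any particular investigation method.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Event:
    """One entry on the timeline (all fields are illustrative)."""
    key: str
    when: datetime
    description: str
    caused_by: Optional[str] = None  # key of the event judged to be its cause


def build_timeline(events):
    """Return the events ordered from earliest to latest."""
    return sorted(events, key=lambda e: e.when)


def cause_chain(events, key):
    """Walk back from one event through its causes, nearest cause first."""
    by_key = {e.key: e for e in events}
    chain = []
    seen = {key}  # guards against accidental circular links
    current = by_key[key].caused_by
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = by_key[current].caused_by
    return chain


# Hypothetical events, listed in the order they were reported.
events = [
    Event("injury", datetime(2020, 1, 1, 10, 15), "Technician injured", "overpressure"),
    Event("valve_closed", datetime(2020, 1, 1, 9, 30), "Valve left closed"),
    Event("overpressure", datetime(2020, 1, 1, 10, 10), "Line overpressured", "valve_closed"),
]

timeline = build_timeline(events)
print([e.key for e in timeline])
print(cause_chain(events, "injury"))
```

The last link in the chain is only the earliest cause recorded so far; whether it qualifies as a root cause is a separate judgment.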
One engineering company had as its motto, "There's no substitute for knowing what you're doing". In many investigations it is found that a real expert is needed in order to establish the technical details as to what happened. As already noted in the example provided above, a senior manager who had twenty-five years of experience with the equipment involved took it upon himself to determine what happened. Without his insight, knowledge and experience the investigation team would have taken much longer to determine what happened - indeed they might never have done so.
Root Cause Analysis
Once the facts have been established and the event is understood, a root cause analysis can be carried out in order to apply lessons learned to a broader set of circumstances. There are four types of root cause analysis. They are:
Project Management
One of the difficulties associated with many investigations is that they tend to suffer from "scope creep". They grow and grow and grow with root causes being piled upon root causes, without any clear idea as to when the end point has been reached. As one manager once sarcastically observed, "The team seems to be trying to solve world poverty". It must be understood by all the team members that an investigation is a project, just like any other project. As such it needs a budget, a schedule, a clear scope of work and a contingency plan for when things go awry.
A particularly common project management difficulty is that an investigation proceeds well until the root causes of the incident have been established, at which time the investigators are reassigned to their "real work".
Communications
An effective incident investigation and analysis program generally contains two major components: technical and human. The technical side of the investigation is what most publications in this area focus on, particularly with regard to root cause analysis. However, what does not always receive the same degree of attention is the human aspect of incident investigation work. An effective investigator understands how people think and behave. Consequently he or she must be able to communicate with a wide range of people, particularly those listed below.
Words and phrases such as incident, accident and near miss tend to be used quite loosely in general conversation. They also tend to have different connotations in English, American and Canadian usage. However, in the context of formal incident investigation and analysis such words need to be tightly defined. The definitions used for these terms on this page are provided below.
An incident is an event that has either caused harm or loss, or that has the potential to cause harm or loss, and that could have been prevented or reduced in severity through use of the company's management systems or by improvements to those systems.
The key to the above definition is that an incident must be preventable through use of the facility's normal management systems - thus excluding bizarre external events such as an airplane crashing into the facility. However, many external events, such as earthquakes or very severe weather, can be anticipated and should therefore be considered in the design and operation of the facility and in the development of the emergency response program.
Some incidents are outside the control of the facility managers; such incidents require attention at a higher level. For example, most large corporations have a procurement policy that is used throughout the whole company. If an incident investigation at one site shows that problems with procurement were a contributing factor then the corrective action will probably have to be addressed at the corporate level.
The definition of the word incident covers not just safety and environmental harm but also economic loss. Most of the literature to do with incident investigation and analysis focuses on safety-related events. But there is no reason why the techniques developed to investigate and understand such events could not also be used to address lost production, reduced efficiency and unexpected equipment failure.
The word accident is not used in these publications because the word implies surprise and lack of controllability. There is nothing anyone can do about accidents. The whole point of an incident investigation and analysis program is that all aspects of an operation are under the control of management. Only unpredictable external events, such as the airplane crash alluded to above, are true accidents.
Near Miss / Hit
The term near miss - which may better be called near hit - describes an incident that did not result in an actual loss but that had the potential to do so. For example, if an object is dropped from a crane but no one is hurt then the incident is a near miss. Near misses, particularly those that could have had high consequences, should be investigated thoroughly because they are a strong indicator of system failures. They are a free lesson learned. In terms of fault tree analysis a near miss is an event in which one or more of the inputs to an AND Gate was negative.
The following are examples of near misses:
A potential incident creates the possibility of an event, but nothing actually happens. The key difference between a near miss and a potential incident is that, with a near miss, an event did take place but the consequences were minor. With a potential incident nothing happened at all. For example, if a worker drops a wrench from an upper deck and it hits the floor three stories below but no one is hurt then a near miss has taken place. If the same worker holds the same wrench such that, were he to drop it, it would fall to a lower deck, then he has created a potential incident.
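The distinction between these terms can be illustrated with a small Python sketch built on the AND gate idea mentioned under near misses. It assumes that a hazard has already been identified and that loss occurs only when the initiating event and every other condition are present together. The function names, inputs and labels are hypothetical, chosen only to mirror the wrench example.

```python
def and_gate(*inputs):
    """Fault tree AND gate: the output occurs only if every input occurs."""
    return all(inputs)


def classify(event_occurred, other_conditions):
    """Label a situation using the terms defined on this page.

    event_occurred   -- did the initiating event take place (wrench dropped)?
    other_conditions -- remaining AND gate inputs (e.g. person in the drop zone)

    Assumes a hazard exists; a situation with no hazard at all is out of scope.
    """
    if and_gate(event_occurred, *other_conditions):
        return "incident with loss"
    if event_occurred:
        # The event happened, but at least one other input was negative.
        return "near miss"
    # The hazard was created, but nothing actually happened.
    return "potential incident"


# Wrench dropped, nobody below: the event occurred but one input was negative.
print(classify(event_occurred=True, other_conditions=[False]))
# Wrench held over the edge but not dropped.
print(classify(event_occurred=False, other_conditions=[True]))
```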
High Potential Incident
A High Potential Incident (HPI) is a potential incident which, were it to have occurred, would have led to major loss. For example, if a toxic gas leaks from a flange into the atmosphere but no one is present then an HPI has occurred. No one was present, but the potential for a fatality existed.
The end result of an incident analysis is the development of one or more root causes. A careful study of root causes suggests ways in which management can implement systemic improvements throughout the organization.
A root cause of an incident is the most basic cause that can reasonably be identified and that management can change.
The process of investigating and analyzing incidents that is used throughout this page can be divided into the six steps shown in Figure 3.
Step 1 - Initial Investigation
The initial investigation, which can start as soon as the emergency response is completed, is carried out by what is sometimes referred to as the Go Team. Speed is of the essence at this stage of the investigation. The Go Team provides management with quick feedback as to what happened and what immediate corrective actions may need to be taken at other sites or facilities that share the same technology. The team collects information as soon as possible, particularly information that may change quickly, such as that to do with process conditions. The team should not disturb evidence, except as needed to ensure the continued safety of the facility. One of the team's most important tasks is to interview participants and witnesses as soon as possible. People's memories quickly fade, and they adjust their memories based on what they think should have happened or on what other people tell them. It is vital that these people be asked to tell their story as soon as possible.
Step 2 - Evaluation and Team Formation
Following the initial investigation, management will evaluate the seriousness of the incident and assess the potential it provides for lessons learned. Management must also decide as to how detailed the investigation should be. This means that a method for evaluating the seriousness of events - particularly near misses and potential incidents - has to be selected.
At the top of Figure 4 are the sponsor and the incident owner. The sponsor is a senior executive, usually with line authority over the persons involved in the incident. He or she will authorize the terms of reference for the investigation and fund the work. The incident owner will typically be the line manager over the facility where the incident occurs. The owner may not spend a lot of time working with the team, but he or she provides direction, ensures that the terms of reference are being followed and is the recipient of the final report. Figure 4 also shows that the owner has delegated the task of managing the information to do with the incident to the Process Safety Management (PSM) coordinator, who uses an incident register to document the progress of the investigation and to manage the subsequent follow-up.
A major incident investigation can have as many as three investigative teams, one for each of the steps shown in Figure 4 (Go Team, Investigation Team and Analysis Team). The composition of the teams will depend on a variety of issues such as the seriousness of the event, the likelihood of litigation, and the technical aspects of the incident. It is likely that the three teams will have many members in common, but it is useful to make the distinctions shown in Figure 4 so that the team members have a clear idea as to their role at each stage in the process. (For example, the analysis team may have a member whose only role is to help the team understand the incident investigation methodology that has been selected and to run the applicable software.)
As an incident investigation proceeds the teams will be required to brief management as to their findings on a regular basis. The frequency and level of detail of the briefings will naturally depend on the severity of the event. Figure 4 outlines a representative reporting procedure. The Go Team issues its first report, which summarizes the major issues to do with the incident. The formal investigation team issues one or more interim reports as it progresses with its work. Lastly, the analysis team delivers the final report, containing the root cause analysis, the findings and suggested action items.
Step 3 - Information Gathering
Referring once more to Figure 3, after the team has been assembled and the terms of reference generated, the first step in the formal investigation process itself is to collect information about what happened. The information will generally come from interviews, documents, instrument records and field observations. At this stage of the investigation it is especially important not to jump to conclusions but to let the facts speak for themselves. The focus must be on gathering data - mostly from interviews, site inspections and the examination of log books and instrument records.
Step 4 - Timeline Development
Once the information-gathering step is more or less complete the investigation team can develop a timeline that outlines the sequence of events. As the timeline is developed it will become clear that certain information items are either missing or not detailed enough, so the team will go back to Step 3 - Information Gathering.
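One way to spot missing information is sketched below in Python. It rests on an assumption that is not part of the method described here: that a long unexplained interval between adjacent timeline entries is worth a second look. The threshold and the sample entries are made up for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(entries, max_gap):
    """Return (earlier, later) pairs of adjacent entries more than max_gap apart.

    entries -- list of (timestamp, description) tuples, in any order
    max_gap -- a timedelta; the threshold is a judgment call, not a standard
    """
    ordered = sorted(entries, key=lambda entry: entry[0])
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later[0] - earlier[0] > max_gap:
            gaps.append((earlier, later))
    return gaps


# Hypothetical timeline entries gathered from logs and interviews.
entries = [
    (datetime(2020, 1, 1, 11, 30), "Release observed"),
    (datetime(2020, 1, 1, 10, 0), "Pump started"),
    (datetime(2020, 1, 1, 10, 5), "High-level alarm logged"),
]

for earlier, later in find_gaps(entries, max_gap=timedelta(minutes=30)):
    print(f"Nothing recorded between '{earlier[1]}' and '{later[1]}'")
```

Each flagged interval is a prompt to return to Step 3, for example by re-interviewing a witness or pulling additional instrument records for that period.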
Step 5 - Root Cause Analysis
Individual incidents are usually indicative of a broader range of management or system problems. Simply correcting the actions and events that led to the particular incident that is being investigated represents an opportunity missed - what is needed is a process for identifying underlying or root causes so that a broader range of future incidents can be avoided. This is done through the process of root cause analysis.
Step 6 - Report and Recommendations
The final step in an investigation as shown in Figure 4 is to issue a report to all affected parties and then to take the appropriate corrective actions to prevent recurrence of similar events. Typically the report will summarize the event itself, the root causes that were identified and recommendations as to how the findings may be addressed. Follow-up of the recommendations will not usually be the responsibility of the investigation team; that activity falls to the facility management, particularly the sponsor.
Copyright © Sutton Technical Books 2007-2012. All rights reserved