
Investigation and Learning Methods

How organizations examine incidents — and what determines whether they learn anything from them

Learning Objectives

By the end of this module you will be able to:

  • Compare root cause analysis (RCA), CAST, and FRAM across their underlying assumptions about causation, their outputs, and their appropriate use cases.
  • Conduct a basic FRAM analysis for a software incident, identifying function variability and resonance.
  • Explain why just culture is a prerequisite for a learning organization, and describe what a just culture approach looks like in practice.
  • Identify the organizational conditions — psychological safety, leader behavior, reporting systems — that enable or block incident learning.
  • Distinguish investigation that produces organizational learning from investigation that produces only individual accountability.

Core Concepts

The Three Investigation Paradigms

Software incidents get investigated. What varies enormously is the theoretical framework behind that investigation — the assumptions investigators bring about what causes events, what counts as an explanation, and what a good outcome looks like. Three frameworks dominate the modern landscape: Root Cause Analysis, CAST, and FRAM. Each rests on different foundations and produces different kinds of insight.

Root Cause Analysis

Root cause analysis is the dominant approach in Safety-I thinking. Its underlying assumption is that incidents have identifiable root causes — defects in workplace design, equipment, procedures, staffing, or management — that, once removed, prevent recurrence. RCA is inherently reactive: it requires an incident to have already happened. Its value is in surfacing systemic contributors rather than stopping at surface-level symptoms, but its causation model is fundamentally linear. You trace back from the event through a chain of contributing factors to a set of root causes, and then fix those causes.

RCA is well-suited to incidents where the causal chain is relatively straightforward: a misconfigured firewall rule that allowed unauthorized access, a missing input validation that caused data corruption. The method struggles when the system is complex, when multiple interacting factors produced the outcome, or when the same surface conditions usually produce acceptable outcomes.
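To make the linear causation model concrete, here is a minimal sketch of an RCA-style backward trace in Python. The incident and contributing factors are hypothetical, loosely based on the firewall example above; real RCA practice uses much richer templates than this.

```python
from dataclasses import dataclass, field

@dataclass
class Factor:
    """A node in a linear causal chain: an event or condition and what led to it."""
    description: str
    caused_by: list["Factor"] = field(default_factory=list)

def root_causes(event: Factor) -> list[str]:
    """Walk backward from the incident; leaves of the trace become the 'root causes'."""
    if not event.caused_by:
        return [event.description]
    roots: list[str] = []
    for parent in event.caused_by:
        roots.extend(root_causes(parent))
    return roots

# Hypothetical trace for the firewall example in the text.
incident = Factor(
    "unauthorized access",
    caused_by=[
        Factor(
            "misconfigured firewall rule",
            caused_by=[Factor("no peer review for firewall changes")],
        )
    ],
)
print(root_causes(incident))  # ['no peer review for firewall changes']
```

Note that the trace bottoms out wherever caused_by is empty: the "root" is simply where the investigator stopped tracing, which is exactly the limitation this module examines under Common Misconceptions.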

The reactive constraint

Because RCA requires a failure to have occurred, it produces no learning from the vast majority of operations that succeed. Organizations relying exclusively on RCA are systematically blind to what keeps things working.

CAST: Causal Analysis Using Systems Theory

CAST is a post-incident investigation methodology built on STAMP (Systems-Theoretic Accident Model and Processes). Rather than tracing linear causal chains, CAST examines the hierarchical control structure of the sociotechnical system and asks: where did control fail to prevent the accident? This means looking across system levels — technical components, human operators, management processes, organizational policies — and identifying the inadequate controls that allowed safety constraints to be violated.

Unlike traditional approaches such as Fault Tree Analysis, CAST does not assume that a sequence of component failures caused the incident. It instead looks at how inadequate control across multiple levels of the hierarchy contributed — making it particularly well-suited to complex systems where software, organizational factors, and human decision-making are deeply intertwined.

The output of a CAST investigation is a picture of the control structure failures: what signals were not detected, what corrective actions were not taken, what organizational pressures degraded the controls, and what changes would strengthen the system's ability to enforce safety constraints in the future.
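As a rough illustration of that output, the sketch below models a control structure as a flat list of controls tagged by hierarchy level and groups the inadequate ones, the skeleton of a CAST failure map. The levels, constraints, and incident are hypothetical, and CAST itself involves far richer analysis than this.

```python
from dataclasses import dataclass

@dataclass
class Control:
    """A safety constraint at one level of the control hierarchy."""
    level: str       # e.g. "technical", "team", "organizational"
    constraint: str  # what the control is supposed to enforce
    adequate: bool   # did it actually enforce the constraint?

def control_failures(controls: list[Control]) -> dict[str, list[str]]:
    """Group inadequate controls by hierarchy level, the core of a CAST failure map."""
    failures: dict[str, list[str]] = {}
    for c in controls:
        if not c.adequate:
            failures.setdefault(c.level, []).append(c.constraint)
    return failures

# Hypothetical control structure for a deployment incident.
controls = [
    Control("technical", "block deployments during shift change", adequate=False),
    Control("technical", "cap alert notification backlog", adequate=False),
    Control("team", "announce in-flight deployments at handoff", adequate=False),
    Control("organizational", "risk-assess deprioritized alerting issues", adequate=False),
    Control("technical", "require passing canary before full rollout", adequate=True),
]
print(sorted(control_failures(controls)))  # ['organizational', 'team', 'technical']
```

The point of the grouping is the multi-level view: the same report shows technical, team, and organizational control gaps side by side rather than a single chain.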

FRAM: Functional Resonance Analysis Method

FRAM takes a different starting position entirely. Where RCA and CAST both seek to explain why something went wrong, FRAM is designed to understand how the system functions — both when things go right and when they go wrong. FRAM is more compatible with Safety-II and resilience engineering perspectives than with traditional Safety-I methods.

The core insight behind FRAM is that in complex sociotechnical systems, outcomes emerge from the resonance of multiple functions interacting with each other, each of which has natural variability in how it performs. Functions are described across six aspects: input, output, preconditions, resources, time, and control. When variability in one function propagates to and amplifies variability in others, the resonance can produce unexpected outcomes — both bad and good.

This framing has important consequences. FRAM does not look for a defective component or a failed control. It looks for how normal, everyday variability in how work is actually performed can combine in ways that produce surprising results. FRAM has been used in healthcare to identify gaps between work-as-imagined and work-as-done — which is precisely the kind of insight that linear causal methods miss.

FRAM does not ask "what broke?" It asks "how do the functions of this system interact, and how did their variability combine to produce this outcome?"

What Counts as a "Cause"

Each method encodes a different theory of causation. Understanding this matters because the theory of causation shapes every other decision — what questions get asked, who gets interviewed, what counts as a satisfactory explanation, and what corrective actions get recommended.

Assumption             | RCA                                | CAST                                        | FRAM
Causation model        | Linear chain of root causes        | Control structure failures across hierarchy | Resonance of function variability
Directionality         | Backward trace from outcome        | Multi-level control analysis                | Non-linear, emergent
Unit of analysis       | Defective components or conditions | Control actions and constraints             | Functions and their interactions
Learning scope         | Why this failure occurred          | Why these controls failed                   | How the system actually works
Applicable to success? | No                                 | Limited                                     | Yes

Just Culture: The Organizational Precondition

Investigation methods matter. But they operate within organizational conditions that determine whether the findings from any investigation produce genuine learning or get suppressed, distorted, or ignored.

The most important of these conditions is just culture. Sidney Dekker's restorative just culture framework reframes accountability fundamentally: rather than seeking to assign blame and punishment, accountability in a just culture means seeking an account of what happened — a rich, contextual explanation that serves organizational learning. Restorative justice principles guide the process: parties discuss harms, needs, and obligations together. This is a deliberate departure from retributive models that assign blame to individuals.

This distinction matters because the kind of accountability applied to an incident determines what people reveal in investigation. A just culture differentiates between human error, at-risk behavior, and reckless behavior — and applies different responses to each. Honest mistakes are not punished. Deliberately unsafe acts remain subject to accountability. The fairness of this distinction is what makes it possible for people to participate honestly in investigations.

The Reporting Chain

Just culture is not merely a philosophical stance. It has a concrete organizational function: just culture unlocks a reporting culture, and reporting culture enables a learning culture.

When people fear punishment for honest mistakes, they conceal errors, underreport near-misses, and participate in post-incident investigations only to the degree needed to protect themselves. The information that would allow the organization to detect unsafe conditions and patterns simply does not surface. Just culture removes the fear of unjust punishment, making it safe to report. That reporting creates the information system that a learning culture requires.

Just culture is not blame-free culture

A common misreading is that just culture means no accountability. It does not. Just culture is a fair and contextual accountability system — individuals are held appropriately responsible, with accountability assessed in the context of contributing system factors. The goal is balance, not absolution.

Psychological Safety and the Investigation Room

Just culture operates at the organizational policy level. Psychological safety operates at the team level, and it determines what actually happens in the investigation room — whether people ask questions, admit uncertainty, surface problems they observed before the incident, and engage with the post-mortem honestly.

Teams with high psychological safety can detect, discuss, and learn from errors more effectively because acknowledging mistakes carries no interpersonal risk. Without psychological safety, teams develop defensive routines: self-censorship, concealment of errors, reframing of problems to deflect blame. These routines create a vicious cycle — silence prevents correction, problems persist, and the environment becomes less safe to speak up in.

Psychological safety also enables the questioning and help-seeking that good investigation depends on. When investigators can ask naive questions, admit they do not understand a system behavior, and seek clarification from operators without the interaction becoming interpersonally charged, they gather better information and construct better accounts of what happened.

Learning from Success, Not Just Failure

A final organizational condition that shapes the scope of learning: whether the organization looks only at failures, or at everyday work as a source of insight.

Safety-II approaches emphasize learning from successful operations and everyday adaptations, not just from incidents. The reason is that things go right primarily because workers make sensible, situationally appropriate adaptations — not simply because they follow procedures as written. Success depends on situational awareness, flexibility, and the ability to adjust to varying conditions. These capacities cannot be fully substituted with rules and procedures, and understanding them requires studying everyday work, not just incident reports.

Organizational learning, in a resilience engineering sense, is collective, multilevel, and multidimensional. It extends beyond incident investigation to include learning from normal operations and successful adaptations — enabling organizations to continuously refine their ability to anticipate, monitor, and respond to both expected and unexpected challenges.

Step-by-Step Procedure

Conducting a Basic FRAM Analysis

FRAM analysis proceeds through a structured sequence. The steps below are calibrated to a software incident scenario, where the goal is to understand how function variability contributed to an unexpected outcome.

Step 1: Define the analysis scope

Establish the scope of the analysis: the system, the timeframe, and the specific outcome you are trying to understand. For a software incident, this might be: "the payment processing system during the 23-minute degradation window on the evening of March 14."

Step 2: Identify the functions

List the functions involved in the system's operation during the relevant period. A function is anything that contributes to the system's behavior — a monitoring alert, a deployment pipeline step, an on-call rotation handoff, a configuration management tool run, a load balancer decision. Think broadly: include human actions and organizational processes, not just software operations.

Step 3: Characterize each function across six aspects

For each function, describe:

  • Input: What triggers or feeds into this function?
  • Output: What does this function produce?
  • Preconditions: What must be true for this function to execute?
  • Resources: What does this function consume or depend on?
  • Time: When does this function execute, and what temporal constraints does it operate under?
  • Control: What governs or regulates this function?

Not every aspect will be relevant for every function. Work through what is observable and knowable.
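One way to keep this characterization consistent across functions is a small record type with one field per aspect. The field names mirror the list above; the example function and its values are illustrative, not part of the FRAM specification.

```python
from dataclasses import dataclass, field

@dataclass
class FramFunction:
    """One FRAM function, characterized across the six aspects.
    Aspects that are not relevant or knowable are simply left empty."""
    name: str
    input: list[str] = field(default_factory=list)
    output: list[str] = field(default_factory=list)
    preconditions: list[str] = field(default_factory=list)
    resources: list[str] = field(default_factory=list)
    time: list[str] = field(default_factory=list)
    control: list[str] = field(default_factory=list)
    variability: list[str] = field(default_factory=list)  # filled in during step 4

# Illustrative function from a software incident scenario.
route_alert = FramFunction(
    name="route alert notification",
    input=["monitoring alert fired"],
    output=["notification delivered to on-call"],
    resources=["notification queue capacity"],
    time=["expected delivery within seconds"],
    variability=["delivery delayed by backlog"],
)
print(route_alert.output)  # ['notification delivered to on-call']
```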

Step 4: Identify variability in each function

For each function, ask: how does this function's performance vary? Was the monitoring alert slower than usual? Did the handoff happen under time pressure? Was the configuration tool run by someone less familiar with this system? Variability is normal — FRAM assumes all functions have natural variation. Document it without judgment.

Step 5: Map functional couplings

Identify which functions are connected — where the output of one function is the input, precondition, resource, or control of another. Draw these connections. The resulting map shows the coupling structure of the system during the incident.

Step 6: Trace resonance

Examine the coupling map and ask: where did variability in one function propagate to others? Did the slow monitoring alert change the timing of the on-call response, which changed the information available to the engineer making a rollback decision, which changed the corrective action taken? Resonance occurs when variability propagates and compounds across couplings. This is where the unexpected outcome emerged.
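Steps 5 and 6 can be sketched as a small graph walk: couplings become directed edges, and resonance candidates are the functions reachable from those that varied. The function names are illustrative, and a real FRAM model would also record which of the six aspects each coupling feeds.

```python
from collections import deque

# Directed couplings: the output of each key feeds an aspect of each value.
couplings = {
    "generate monitoring alert": ["route alert notification"],
    "route alert notification": ["assess incoming alert"],
    "receive handoff": ["assess incoming alert"],
    "assess incoming alert": ["declare incident"],
    "declare incident": ["execute rollback"],
}

# Functions whose performance varied during the incident window (step 4).
variable = {"route alert notification", "receive handoff"}

def resonance_paths(couplings: dict, variable: set) -> set:
    """Breadth-first trace of where variability could propagate downstream."""
    affected = set(variable)
    queue = deque(variable)
    while queue:
        fn = queue.popleft()
        for downstream in couplings.get(fn, []):
            if downstream not in affected:
                affected.add(downstream)
                queue.append(downstream)
    return affected

reached = resonance_paths(couplings, variable)
print(sorted(reached - variable))
# ['assess incoming alert', 'declare incident', 'execute rollback']
```

The walk only says where variability *could* travel; judging whether it actually compounded at each coupling remains the analyst's job, per the decision points below.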

Decision points in FRAM

FRAM analysis requires judgment at several steps. When identifying functions, be wary of both over-granularity (every micro-action is a function) and under-granularity (treating an entire team's work as a single function). The right level of abstraction is one where functional variability is observable and meaningful. When tracing resonance, prioritize couplings that were actually active during the incident window, not those that are merely theoretically possible.

Step 7: Draw conclusions about variability management

FRAM analysis does not produce a list of root causes to fix. It produces a picture of functional resonance. The conclusions are about how to manage variability: which functions need tighter constraints to reduce variability, which need more flexibility to absorb variability from connected functions, and where monitoring or buffering between functions would prevent resonance from amplifying.

Compare & Contrast

RCA, CAST, and FRAM Side by Side

Fig 1: Investigation method comparison across key dimensions

Dimension              | RCA                                | CAST                                             | FRAM
Theory of causation    | Linear chain of root causes        | Control failures across hierarchy                | Resonance of function variability
Primary question       | What broke and why?                | Which controls failed, and why?                  | How did variability resonate?
Theoretical base       | Safety-I                           | STAMP / Safety-I+                                | Safety-II / Resilience engineering
Output                 | Root causes + corrective actions   | Control structure failure map                    | Functional resonance map + variability management
Best suited for        | Straightforward, discrete failures | Complex sociotechnical incidents; software + org | Complex adaptive systems; normal work
Applicable to success? | No                                 | Limited — requires incident                      | Yes — designed for success analysis too
Learning scope         | This failure                       | This system's controls                           | How the system works

When to Reach for Each Method

Use RCA when the incident has a clear, discrete failure mode: a specific component malfunctioned, a specific step was missing from a procedure, a specific configuration was wrong. RCA is fast, well-understood by most teams, and produces concrete, actionable recommendations. Its limitation is that it naturalizes linear causation and may produce misleading oversimplifications for complex incidents.

Use CAST when the incident involved failures across multiple system levels — not just a technical fault, but interactions between technical issues, human decisions, and organizational factors. CAST is particularly valuable for incidents where the question "whose fault was this?" is already being asked and needs to be replaced with a more systemic analysis. It is more demanding than RCA — requiring familiarity with STAMP — but produces a richer, less blame-prone account of what happened.

Use FRAM when you want to understand the system rather than just the incident — or when the incident seems to have emerged from conditions that were, individually, quite normal. FRAM is the right tool when your intuition says "everything seemed fine and then it wasn't" or when previous RCA-driven corrective actions have not prevented recurrence. FRAM can also be applied proactively, before incidents, to understand how variability in the system could combine in undesirable ways.

Worked Example

A Deployment Incident: Three Investigations

The incident: A software team deployed a routine configuration change to production on a Friday afternoon. Within 11 minutes, error rates for a payment integration began rising. The on-call engineer was mid-handoff at shift change. The deployment's monitoring alert fired but was caught in a notification backlog caused by a separate alerting system that had been generating high-volume false positives for two weeks. By the time the incident was declared, 23 minutes had elapsed and 4,200 payment requests had failed.


Investigation 1: Root Cause Analysis

The RCA works backward from the outcome. Contributing factors identified:

  • Deployment occurred during a risky time window (Friday afternoon, shift change).
  • Monitoring alert was delayed by notification backlog.
  • Alert fatigue from the false-positive alerting issue reduced response urgency.
  • No explicit "deploy freeze" policy for shift change windows.

Root causes: absence of deployment timing policy; unresolved alerting system issue creating alert fatigue.

Corrective actions: institute a deployment freeze window policy; prioritize resolving the false-positive alerting issue.

The RCA produces clear, actionable fixes. But it treats the incident as the product of two missing controls. It does not explain why the alerting backlog had persisted for two weeks without being prioritized, what the shift change handoff actually looked like, or whether similar conditions regularly occur without incidents.


Investigation 2: CAST

The CAST analysis maps the control structure. It asks: at each level of the system hierarchy, what safety constraints existed, and how did control over those constraints fail?

At the technical level: the deployment pipeline had no constraint preventing deployment during shift change windows. The monitoring system had no constraint preventing alert stacking or fatigue-inducing backlog growth.

At the team level: the on-call rotation had no explicit protocol for mid-handoff deployment awareness. The engineer receiving handoff had no awareness of the pending deployment.

At the organizational level: the unresolved alerting issue had been deprioritized for two weeks without a formal risk assessment. No process required alerting health to be validated before deployments.

CAST reveals that multiple control gaps existed simultaneously and independently. None was sufficient to cause the incident alone. The incident emerged from their combination — which is why it did not happen on the many previous Fridays that shared some of these conditions.


Investigation 3: FRAM

The FRAM analysis maps the functions active during the incident window.

Key functions include: run deployment pipeline, generate monitoring alert, route alert notification, receive handoff, assess incoming alert, declare incident, execute rollback.

For each function, variability is documented:

  • Route alert notification: output variability — alert delivery delayed by backlog; timing was longer than usual.
  • Receive handoff: precondition variability — the incoming engineer had partial context, as the outgoing engineer was managing an unrelated conversation simultaneously.
  • Assess incoming alert: resource variability — the engineer had lower prior exposure to this service's error pattern.

The functional couplings show: the variability in route alert notification (delayed output) coupled with variability in assess incoming alert (resource deficit) to produce a situation where the alert, when it did arrive, was triaged lower than warranted. The resonance of these two variabilities — neither catastrophic alone — produced the 23-minute gap before incident declaration.

The FRAM conclusions are about variability management: tighten the coupling between deployment execution and the on-call handoff (so the incoming engineer is explicitly aware of in-flight deployments); buffer the alert routing function during high-backlog conditions (escalation path); investigate the alerting false-positive issue not just as a prioritization failure but as a systemic source of variability in alert assessment reliability.
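The resonance claim in this analysis, that two individually tolerable variabilities compounded across couplings into the 23-minute gap, can be stated mechanically: look for directly coupled functions that both varied. The names follow the narrative above; this is an illustration of the reasoning, not a FRAM tool.

```python
# Couplings active during the incident window (output of first feeds the second).
couplings = [
    ("route alert notification", "assess incoming alert"),
    ("receive handoff", "assess incoming alert"),
    ("assess incoming alert", "declare incident"),
]

# Functions with documented variability, and its kind.
variability = {
    "route alert notification": "output delayed by backlog",
    "receive handoff": "partial context at handoff",
    "assess incoming alert": "lower prior exposure to this error pattern",
}

def coupled_variability(couplings: list, variability: dict) -> list:
    """Pairs of directly coupled functions that both varied: candidate resonance points."""
    return [
        (upstream, downstream)
        for (upstream, downstream) in couplings
        if upstream in variability and downstream in variability
    ]

for upstream, downstream in coupled_variability(couplings, variability):
    print(f"{upstream} -> {downstream}")
# route alert notification -> assess incoming alert
# receive handoff -> assess incoming alert
```

Both candidate pairs converge on "assess incoming alert", which is exactly where the narrative locates the triage delay; the coupling into "declare incident" is not flagged because that function itself performed normally once triggered.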

Common Misconceptions

"Blameless post-mortems mean no accountability."

Blameless does not mean consequence-free. Just culture distinguishes human error, at-risk behavior, and reckless behavior — and applies different responses to each. Removing blame for honest mistakes is not the same as removing accountability for reckless behavior. The intent of blameless investigation is to remove the distortion that fear of punishment introduces into the information gathered. If people fear the investigation, they shape their account to minimize personal exposure, and the organization loses access to the actual conditions that produced the incident.

"Root cause analysis finds the real cause."

The phrase "root cause" implies that causation has a single, stable bottom. Complex sociotechnical incidents do not have single root causes — they have contributing conditions that interacted. STAMP-based analysis shows that what traditional causal chain models call "root causes" are actually arbitrary stopping points in a backward trace — the investigation stops when it finds something it can act on, not because it has found the actual origin of the outcome.

"Investigation is only worth doing after major incidents."

Safety-II approaches emphasize learning from successful operations and everyday adaptations. The near-miss that was quietly resolved, the workaround that became normal practice, the adaptation that one team made to a tool that another team still uses as designed — these are rich sources of information about how the system actually functions. Limiting investigation to major incidents means learning from the small fraction of operations where something visibly went wrong.

"If people followed procedures, the incident wouldn't have happened."

Things go right primarily because workers make sensible, situationally appropriate adaptations — not because they follow procedures exactly as written. This cuts both ways: the adaptations that usually succeed are also the adaptations that can sometimes combine in ways that produce incidents. Treating procedure deviation as a root cause simply substitutes individual blame for systemic understanding.

"Psychological safety means everyone feels comfortable."

Psychological safety is specifically about the absence of interpersonal risk for speaking up, asking questions, reporting errors, and proposing ideas. It is not about comfort or about avoiding challenge. Psychologically safe teams can engage in vigorous disagreement — the difference is that disagreement does not threaten team membership. The point for incident investigation is that people can report what they actually observed without fear that doing so will be used against them.

Key Takeaways

  1. Investigation method encodes a theory of causation. RCA assumes linear chains of root causes. CAST examines hierarchical control failures. FRAM traces resonance among variability in system functions. The method chosen determines what questions get asked, what counts as an explanation, and what corrective actions become visible.
  2. Just culture is the organizational precondition for learning. It differentiates human error from reckless behavior, applies proportionate responses to each, and removes the fear that causes people to conceal errors and shape their accounts during investigation. Just culture is not blame-free culture — it is a fair and contextual accountability system that unlocks reporting and, through reporting, learning.
  3. Psychological safety operates at the team level and determines what happens in the investigation room. Without it, defensive routines — self-censorship, concealment, reframing — systematically degrade the quality of information available to investigators. With it, team members can ask questions, admit uncertainty, and surface what they actually observed.
  4. Learning from success is not a nice-to-have. Complex systems succeed primarily because workers adapt to varying conditions in real time. Understanding those adaptations — through FRAM analysis and attention to everyday work — reveals system properties that incident-only investigation cannot access.
  5. The same incident can produce organizational learning or individual accountability, but rarely both at maximum. Investigation processes that are primarily oriented toward determining who was responsible tend to produce defensive participation, distorted accounts, and narrow corrective actions. Investigation processes that seek to understand how the system produced the outcome tend to produce systemic insights and durable improvements.
