
Safety I: The Traditional View

How the dominant paradigm defines safety as the absence of accidents — and why that matters

Learning Objectives

By the end of this module you will be able to:

  • Define Safety I and articulate its core assumption about the nature of failures.
  • Explain how root cause analysis works and what assumptions it rests on.
  • Describe defence-in-depth and barriers as the primary preventive mechanism in Safety I.
  • Identify the limitations of measuring safety performance through failure and incident counts.
  • Distinguish reactive from proactive safety management approaches.

Core Concepts

What Safety I Is

Safety I is the dominant paradigm in safety management. Its foundational premise is straightforward: safety is the absence of accidents and incidents. The goal is to ensure that "as few things as possible go wrong." To achieve that goal, organizations focus on identifying the causes of failures, establishing protective mechanisms, and learning from adverse events.

Under this view, the normal state of a system is one of safety. Failures are abnormal deviations — things that should not happen and that require explanation. When they do happen, the task is to find out why and stop it from happening again.

Safety I treats safety as a property measured by what is absent: accidents, incidents, adverse events. The fewer there are, the safer the system is.

This definition has three practical consequences that shape everything else about how Safety I operates:

  1. Safety is measured by counting failures. Because safety is defined as the absence of bad outcomes, performance is assessed by incident rates — the number of accidents, near-misses, or adverse events over a period. Reducing this count toward zero is the goal.

  2. Learning happens after the fact. Since something must go wrong before there is anything to investigate, Safety I is fundamentally reactive. Investigations happen post-incident; fixes address the causes identified; the system improves incrementally through repeated cycles of failure and correction.

  3. Failures are treated as traceable to identifiable causes. The assumption is that accidents have explanations — component failures, human errors, procedural gaps — and that identifying those explanations produces actionable corrective measures.
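The first consequence, counting failures, has a standard arithmetic in occupational safety. As a minimal sketch, the function below computes an OSHA-style recordable incident rate, normalized to 200,000 exposure hours (roughly 100 full-time workers for one year); the function name and figures are illustrative, not from this module.

```python
def incident_rate(recordable_incidents: int, hours_worked: float) -> float:
    """OSHA-style recordable incident rate: incidents per 200,000 exposure hours.

    200,000 hours approximates 100 full-time employees working one year.
    """
    if hours_worked <= 0:
        raise ValueError("hours_worked must be positive")
    return recordable_incidents * 200_000 / hours_worked

# A site with 3 recordable incidents over 500,000 worked hours:
print(incident_rate(3, 500_000))  # 1.2
```

Note what the metric cannot say: a rate of zero is indistinguishable between "hazards removed" and "hazards not yet triggered", which is the paradox discussed under Measuring Safety Performance below.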

Root Cause Analysis

Root cause analysis (RCA) is the central investigative method of Safety I. OSHA describes it as a reactive procedure that identifies why accidents and incidents occurred by systematically finding the root causes embedded in workplace design, equipment, organizational factors, procedures, staffing, and management.

The word "root" carries theoretical weight. RCA assumes that surface symptoms — the operator error, the equipment fault, the procedural violation — are not the real problem. The real problem is deeper: the condition or decision that allowed those surface failures to exist or to compound. Find the root, remove it, and the symptom will not recur.

RCA techniques vary (5 Whys, fishbone diagrams, fault trees), but they share the same underlying logic:

  • The accident is the starting point.
  • Causality runs backwards through a chain of contributing factors.
  • Somewhere in that chain is a "root" cause that, had it been absent, would have broken the chain.
  • The fix targets that root.
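The backward-chaining logic above can be sketched as a data structure: a causal chain ordered from the accident back through contributing factors, with the candidate "root" as the deepest traceable cause. The chain contents below borrow from the worked example later in this module; the types and function are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Cause:
    description: str
    evidence: str

def trace_root(chain: list) -> Cause:
    """Walk a causal chain ordered from the accident backwards.

    Under the Safety I assumption, the deepest traceable cause is the root:
    the condition that, had it been absent, would have broken the chain.
    """
    if not chain:
        raise ValueError("no causal chain reconstructed")
    return chain[-1]  # deepest cause = candidate root

chain = [
    Cause("service outage", "error logs"),
    Cause("query hit missing column", "stack trace"),
    Cause("migration dropped the column", "migration script"),
    Cause("no schema-consistency check in CI", "pipeline config"),
]
print(trace_root(chain).description)  # "no schema-consistency check in CI"
```

The sketch also exposes the model's key assumption: that the chain exists and terminates at a removable condition. Where that assumption fails is taken up under Boundary Conditions.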

Barriers and Defence-in-Depth

Rather than waiting for failures to occur, Safety I also tries to prevent them in the first place using barriers and layered defences. Barriers are physical or non-physical means intended to prevent, control, or mitigate undesired events. They come in two forms:

  • Hard defences — physical barriers such as guards, interlocks, and containment systems.
  • Soft defences — regulations, procedures, training, and certification requirements.

Defence-in-depth extends this idea by layering multiple barriers so that if one fails, the next catches what slipped through. No single barrier is assumed to be perfectly reliable. What matters is that the layers together are sufficient to stop hazardous events from reaching their consequences.

The Swiss cheese model, developed by James Reason, is the canonical illustration of this idea. Each layer of defence is represented as a slice of Swiss cheese. The holes are the weaknesses or gaps in each individual barrier. Normally the holes do not align. An accident happens when a sequence of events finds a path through aligned holes across every slice simultaneously.

Fig 1. The Swiss cheese model: aligned holes across multiple defensive layers allow a hazard to pass every barrier and reach the accident.
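The model's arithmetic can be made concrete. If each barrier independently lets the hazard through with some small probability (its "holes"), the chance of the holes aligning is the product of those probabilities. The sketch below computes this analytically and checks it with a quick Monte Carlo run; the per-barrier probabilities are made up for illustration.

```python
import random

def p_aligned(hole_probs: list) -> float:
    """Analytic probability that a hazard passes every barrier,
    assuming the barriers fail independently."""
    p = 1.0
    for q in hole_probs:
        p *= q
    return p

def simulate(hole_probs: list, trials: int = 200_000, seed: int = 1) -> float:
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    breaches = sum(
        all(rng.random() < q for q in hole_probs) for _ in range(trials)
    )
    return breaches / trials

barriers = [0.1, 0.2, 0.05]   # per-barrier probability of a hole
print(p_aligned(barriers))     # ~0.001
print(simulate(barriers))      # close to 0.001
```

The product formula holds only when the layers are genuinely independent; the misconception "more barriers always means more safety", discussed later in this module, is largely about what happens when they are not.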

Safety Culture as Institutional Infrastructure

Safety I is not only about technical systems. James Reason identified that the human and organizational side of safety requires its own infrastructure — what he called safety culture. He described it as "the engine that drives the system towards the goal of sustaining the maximum resistance towards its operational hazards." For Reason, a mature safety culture has five dimensions:

  • Informed culture: knowledge of hazards and how they are being managed.
  • Reporting culture: people feel safe reporting concerns without fear of blame.
  • Learning culture: the organization extracts and acts on lessons from safety information.
  • Just culture: accountability is balanced with fairness; errors are distinguished from reckless acts.
  • Flexible culture: adaptive capacity to respond to novel safety challenges.

This framing is complementary to barriers and RCA: the technical infrastructure catches failures in hardware and process, while safety culture sustains the human and organizational conditions that keep barriers functional and investigations honest.

The IAEA formalized a closely related definition in 1991, specifying that safety culture is "that assembly of characteristics, attitudes and behaviours in individuals, organizations and institutions which establishes that, as an overriding priority, protection and safety issues receive the attention warranted by their significance." The phrase "overriding priority" is deliberate: safety cannot be traded off against schedule or commercial pressure.

Measuring Safety Performance

Within Safety I, the standard way to assess whether a system is safe is to measure the rate of adverse events. This creates a structural paradox: the baseline condition is one of safety — most operations succeed without incident — so to get a measurement signal, something bad must actually happen. The safer the system, the less data you have to work with.

The low-incident trap

When a system has very few incidents, Safety I metrics become statistically unreliable. A long stretch without accidents can mean the system is genuinely safe — or that hazards are accumulating silently. The metric cannot distinguish between the two.

Step-by-Step Procedure

How a Safety I Investigation Works

The following is the canonical Safety I response to an adverse event. Understanding this sequence also reveals the assumptions embedded in it.

1. Detect and contain the incident. Once an accident or near-miss occurs, the immediate priority is stopping harm from spreading — securing the scene, treating injuries, stabilizing systems. This step is not investigative; it is operational.

2. Preserve evidence. Before the scene is disturbed, document what happened: logs, photographs, equipment states, witness accounts. The quality of subsequent analysis depends on the completeness of this record.

3. Reconstruct the event timeline. Establish a factual sequence: what happened, when, in what order. This is descriptive, not yet explanatory. The goal is a shared factual baseline before interpretation begins.

4. Identify contributing factors. Working backwards from the outcome, catalogue every condition, action, or decision that appears to have contributed. At this stage the net is cast wide — human factors, equipment states, environmental conditions, procedural status.

5. Trace to root causes. Apply a structured method (5 Whys, fault tree, fishbone) to distinguish proximate causes from deeper enabling conditions. The OSHA guidance on RCA emphasizes looking beyond the immediate human action to the organizational and design factors that shaped the context in which that action occurred.

Decision point — Is the root traceable? If no root cause can be identified that, if removed, would have plausibly prevented the incident, the investigation has reached a boundary of the model. This is a signal to consider whether the incident belongs to a class of failures better addressed by systemic methods rather than causal chains.

6. Develop corrective measures. For each identified root cause, propose a specific fix: redesign the component, revise the procedure, retrain the operator, add a barrier. Fixes should target the root, not the symptom.

7. Implement and verify. Corrective measures are put in place and their effectiveness is tracked. The improvement cycle closes when subsequent monitoring shows the specific failure mode no longer recurs.

8. Update the barrier model. Lessons learned feed back into the defence-in-depth architecture. A gap that allowed an incident to propagate becomes the site of a new or strengthened barrier.

Worked Example

A Deployment Pipeline Failure

Scenario. A production deploy at a software company introduces a database migration that drops a column still referenced by live query code. The service fails for 47 minutes. The incident is logged.

Step 1 – Contain. The team rolls back the deployment and restores service. Database schema is reverted.

Step 2 – Preserve evidence. Deployment logs, migration script, CI/CD run records, and the on-call engineer's timeline are assembled before anyone begins analysis.

Step 3 – Reconstruct the timeline. The schema migration ran successfully in the staging environment. CI passed. The deploy was triggered manually outside the standard release window. Service errors appeared three minutes post-deploy.

Step 4 – Contributing factors.

  • Schema migration dropped a column.
  • Application code still referenced the dropped column.
  • No integration test covered this query path.
  • Staging database had a different schema version from production.
  • Deploy was initiated without a second reviewer.

Step 5 – Root cause tracing (5 Whys).

  1. Why did the service fail? A query referenced a column that no longer existed.
  2. Why did the column no longer exist? The migration script dropped it.
  3. Why was it considered safe to drop? The migration author believed the column was unused.
  4. Why was that belief unchecked? There was no automated check for column references across the codebase before migration.
  5. Why was there no such check? The CI pipeline was designed for functional correctness, not for schema-code consistency.

Root cause: The CI pipeline had a gap — it did not verify that schema changes were consistent with all active query references.

Step 6 – Corrective measures.

  • Add a pre-deploy check that scans all query files for references to columns being dropped or renamed.
  • Synchronize staging and production schema versions as part of the CI process.
  • Require a second reviewer on all schema-altering migrations.
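The first corrective measure can be sketched as a simple static check. This hypothetical scanner takes the set of columns a migration drops and searches application query text for remaining references; the function name, file layout, and pattern-matching approach are assumptions for illustration, not the company's actual tooling (a production tool would parse the queries rather than grep them).

```python
import re

def find_dropped_column_refs(dropped_columns: set,
                             query_sources: dict) -> dict:
    """Map each file to the dropped columns it still references.

    query_sources maps a filename to its SQL/query text. A word-bounded
    match on the bare column name is treated as a reference.
    """
    hits = {}
    for filename, text in query_sources.items():
        found = [col for col in sorted(dropped_columns)
                 if re.search(rf"\b{re.escape(col)}\b", text)]
        if found:
            hits[filename] = found
    return hits

sources = {
    "reports.sql": "SELECT user_id, legacy_score FROM accounts",
    "billing.sql": "SELECT user_id, balance FROM accounts",
}
print(find_dropped_column_refs({"legacy_score"}, sources))
# {'reports.sql': ['legacy_score']}
```

Run as a pre-deploy gate, a non-empty result blocks the migration: a soft defence in exactly the sense of Step 8.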

Step 7 – Verify. The new pre-deploy check catches a similar undetected reference in the next migration attempt two weeks later. The measure is confirmed effective.

Step 8 – Update the barrier model. The new automated check becomes a standard layer in the release pipeline — a soft defence standing between schema changes and production.

What this example shows

The Safety I process worked well here. The failure was discrete, the causal chain was traceable, and the fix was specific and verifiable. These are the conditions under which Safety I is at its strongest.

Common Misconceptions

"Safety I means ignoring human factors." Not accurate. Root cause analysis explicitly looks upstream of human actions to organizational and design factors. Reason's Swiss cheese model and his five-dimensions framework both place human and cultural factors at the centre of safety management. What Safety I does assume is that human behaviour is a deviation from a correct baseline, which is a different and contestable claim.

"Reducing incident counts means the system is getting safer." This conflates the metric with the property it is meant to measure. Because safety is the baseline condition, a stretch without incidents may simply reflect that no hazard has been triggered yet — not that hazards have been removed. Organisations can have very low incident rates while sitting on substantial accumulated risk.

"Root cause analysis finds the cause." RCA finds a cause — one that is consistent with the evidence and useful for generating a corrective action. Complex incidents typically have many contributing causes, and which one gets labelled "root" is shaped by the investigators, the tools used, and the organizational pressures on the investigation. The root cause is an analytical artefact, not a physical object discovered in the wreckage.

"More barriers always means more safety." Barriers interact. Adding new procedures can create compliance burdens that degrade the reliability of existing ones. Physical barriers can introduce new failure modes. Defence-in-depth is strongest when its layers are independent — and maintaining genuine independence requires active management, not just addition.

"Safety I and Safety II are opposites — you choose one." The originators of Safety II do not advocate abandoning Safety I. The argument is that Safety I is necessary but not sufficient, particularly for complex adaptive systems. Incidents should still be investigated; barriers are still valuable. What needs to change is the assumption that Safety I alone constitutes a complete safety management approach.

Boundary Conditions

Safety I works best when:

  • Failures are rare and discrete. When a system produces well-defined adverse events with clear boundaries, RCA can reconstruct the causal chain reliably. A pipeline failure with a specific error code is far more tractable than a slow organizational drift toward risk.

  • Causality is linear and traceable. Safety I's investigative methods assume that causes precede effects in a chain that can be followed backwards. When failures emerge from the interaction of many components — none individually faulty — causal tracing becomes ambiguous or misleading.

  • The fix space is constrained. Corrective measures work best when a specific barrier can be added or a specific gap closed. When the problem is diffuse — embedded in culture, organizational incentives, or architectural trade-offs — point fixes at identified root causes produce diminishing returns.

  • The system is relatively stable. STAMP's analysis of how systems migrate toward risk over time highlights a limitation of Safety I: safety constraints defined at design time may be eroded by operational changes, staffing turnover, or commercial pressures in ways that are not captured by incident counts.

Where Safety I starts to break down:

When the same performance variability that produces success also produces failure, the Safety I framing struggles. If workers must routinely improvise and adapt to make operations work, then the "deviation from correct procedure" framing misidentifies the mechanism. Investigating these failures with RCA tends to find human error at the proximate cause, then organizational factors at the root — but the corrective measures (better procedures, more training) often fail to address the underlying gap between how work is imagined and how it is actually done.

The 'as imagined' trap

Safety I investigations are based on reconstructed accounts of what happened. Those accounts are inevitably shaped by what investigators expect to find — i.e., the "work as imagined" model. In complex, adaptive systems, the gap between work-as-imagined and work-as-done is where Safety I's explanatory power most reliably fails.

Traditional safety analysis methods — fault trees, HAZOP, FMECA — share Safety I's underlying assumption that accidents are produced by chains of component failures. STAMP and related systems-theoretic approaches were developed precisely because this assumption fails for software-intensive and sociotechnical systems where the failure is not a component breaking but a constraint being violated due to inadequate control across system levels. That distinction is the subject of later modules.

Key Takeaways

  1. Safety I defines safety as the absence of accidents. This makes it inherently reactive: learning happens after failures occur, not before.
  2. Root cause analysis is the canonical investigative tool. It assumes failures are traceable to identifiable causes and that removing those causes prevents recurrence. Both assumptions hold in many contexts and fail in others.
  3. Defence-in-depth layers multiple barriers. Physical and procedural barriers are arranged so that no single point of failure reaches a harmful outcome. The Swiss cheese model captures the mechanism.
  4. Measuring safety by counting failures creates a paradox. The safer the system, the less signal the metric provides. A clean safety record is compatible with high latent risk.
  5. Safety I is necessary but limited. It works well for discrete, traceable failures in relatively stable systems. It struggles when failures emerge from the normal variability of complex, adaptive work.
