Capstone: Integrating Safety Approaches
From frameworks to judgment — building a coherent safety practice across culture, process, and organizational design
Learning Objectives
By the end of this module you will be able to:
- Integrate Safety I, Safety II, HRO, STAMP, and resilience engineering frameworks into a coherent view of safety as a system property rather than a collection of independent theories.
- Select the most appropriate investigation method for a given incident based on system characteristics and what kind of learning you need.
- Evaluate an organization's safety posture across multiple dimensions and identify priority improvement areas.
- Apply cross-industry lessons from aviation CRM and healthcare safety programs to software engineering contexts.
- Design a safety improvement roadmap that addresses culture, processes, and organizational design in a realistic engineering organization.
Core Concepts
Safety Is a System Property, Not a Component Feature
Every framework you have encountered across this learning plan converges on a single foundational insight: safety is not a static property that components either have or lack. It is an emergent, dynamic capability that must be actively sustained through organizational processes, human expertise, and deliberate system design.
This has a practical implication that runs against engineering instinct. You cannot audit your way to safety by inspecting individual components. Safety emerges — or fails to emerge — from the interplay between people, tools, procedures, and organizational structures. Resilience cannot be measured by examining individual components in isolation; it arises from sensing, anticipation, learning, and adaptation working together.
Organizations cannot assume safety as a default state. Safety requires ongoing investment and active management.
The Two Safety Lenses
The most enduring conceptual divide in safety management is between Safety I and Safety II. Understanding both — and knowing when to apply each — is the core skill this capstone develops.
Safety I treats safety as the absence of accidents and incidents. Its goal is to ensure that "as few things as possible go wrong." It works by identifying causes of failure, establishing protective barriers and defenses, and learning from adverse events through root cause analysis. It measures safety by counting failures.
Safety II treats safety as the ability of a system to succeed under varying conditions. Its goal is to ensure that "as many things as possible go right." Rather than focusing on what goes wrong, it emphasizes understanding and amplifying the adaptive capacity that allows systems to function effectively despite unexpected challenges.
Neither lens is sufficient on its own. Recent synthesis research specifically advocates for integrating Safety I and Safety II approaches, combining Safety I's hazard controls with Safety II's methods for learning from successful performance. Integration enables both reactive hazard control and proactive resilience development.
Some Safety II proponents argue that Safety I's reactive, failure-focused approach fundamentally contradicts Safety II principles and cannot be genuinely integrated. This is an honest disagreement in the field. The practical position taken here — integration over purity — is supported by the balance of synthesis research but carries this caveat.
Theoretical Pluralism in Safety Research
The Safety I / Safety II divide is one instance of a broader pattern: safety science has produced several major theoretical frameworks that were initially framed as competing. Contemporary research proposes that Normal Accidents Theory (NAT) and High Reliability Organizations (HRO) theory should be viewed as complementary rather than contradictory. Multiple scholars argue for an integrated systems approach that also incorporates Resilience Engineering and STAMP.
The key insight is that these frameworks address different aspects of system safety without mutual exclusivity. NAT explains why certain tightly-coupled, complex systems will eventually produce failures regardless of mitigation efforts. HRO theory examines the organizational practices that allow some systems to operate reliably despite operating in high-hazard environments. Resilience Engineering asks how systems maintain functioning under variability. STAMP models safety through control structures and constraint violations rather than event chains. Each illuminates something the others do not.
Not everyone agrees on how far integration should go. Some scholars warn against fusing HRO and Resilience Engineering into a "theory of everything," pointing to their different epistemological foundations — HRO is rooted in social science; Resilience Engineering in engineering systems thinking. Full integration may be problematic. Selective, purposeful use is more defensible than forced unification.
Safety Programs Must Be Actively Built
A consequence of viewing safety as an emergent, dynamic property is that safety requires ongoing investment and active management. Vigilant monitoring, rapid response, learning from near-misses and successes, and proactive anticipation of emerging threats are all continuous obligations — not one-time certifications.
Empirical research across construction, healthcare, and chemical processing has demonstrated that resilience engineering approaches, when integrated into organizational systems, produce measurable improvements in safety performance. Resilient safety culture positively impacts safety outcomes, and individual and organizational resilience positively influence safety climate. This finding spans multiple industries, suggesting the pattern is generalizable.
Compare & Contrast
Investigation Methods: When to Use What
The method you reach for after an incident shapes what you learn from it — and what you miss.
| Method | Best suited for | Core assumption | What it surfaces | Limitations |
|---|---|---|---|---|
| Root Cause Analysis (RCA) | Single-point failures, linear causal chains | Events have identifiable root causes | Proximate and distal causes in a chain | Stops at a "root" that is often arbitrary; assumes linearity; can oversimplify complex systems |
| CAST (STAMP-based) | Complex system accidents where control structure failures are likely | Accidents result from inadequate control constraints, not just event chains | Control failures, missing constraints, coordination breakdowns across levels | Requires investment in understanding the full control hierarchy; less familiar to most practitioners |
| FRAM (Functional Resonance) | Incidents in complex sociotechnical systems where variability is normal | Accidents emerge from resonance among normal performance variability | How everyday variability combines in unexpected ways | Output is a model, not a causal story; requires facilitated analysis; results are harder to present to stakeholders |
Selecting a method is not only a technical question — it is also a political and organizational one. RCA produces a narrative that fits most organizational reporting expectations. CAST and FRAM produce insights that may be harder to act on within organizations that reward blame attribution over system redesign.
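The table's selection logic can be condensed into a rough decision sketch. This is an illustrative heuristic, not a validated procedure: the three boolean inputs are judgment calls an investigator must make, and the mapping simply encodes the table above.

```python
from enum import Enum, auto

class Method(Enum):
    RCA = auto()
    CAST = auto()
    FRAM = auto()

def choose_method(linear_causal_chain: bool,
                  control_structure_suspected: bool,
                  variability_is_normal: bool) -> Method:
    """Rough heuristic mirroring the comparison table.

    The inputs are investigator judgments; this sketch only encodes
    the table's mapping, not how to make those judgments.
    """
    if linear_causal_chain and not control_structure_suspected:
        return Method.RCA   # single-point failure, familiar causal narrative
    if control_structure_suspected:
        return Method.CAST  # look for missing or inadequate control constraints
    if variability_is_normal:
        return Method.FRAM  # model resonance among normal performance variability
    return Method.CAST      # when unsure, default to a systemic lens

# Example: a complex outage where a missing rate limit (a control
# constraint) is suspected points toward CAST.
print(choose_method(False, True, True).name)
```

Note that the heuristic defaults to a systemic method rather than RCA: as the surrounding text argues, RCA's familiar narrative is the politically easy choice, so the bias should run the other way.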
Sociotechnical Theory: What It Gets Right and Where It Falls Short
Sociotechnical theory represented a fundamental paradigm shift away from technological determinism — the prevailing view that technology drives organizational form and outcomes — toward a perspective recognizing that organizations are adaptive systems where technology and social structure mutually influence each other. This was a genuine breakthrough.
But the theory has known weaknesses. Kelly's 1978 reappraisal identified fundamental flaws in its core claims. The concept of "joint optimization" — simultaneously optimizing the technical and social systems — has little connection with actual sociotechnical practice. In foundational STS interventions, the technical system was not substantively altered. Autonomy granted to work groups remained limited and subordinate to economic objectives. The role of pay incentives in producing reported outcomes was seriously underestimated.
Contemporary STS research acknowledges these limitations while defending the framework's overall value for understanding human-technology interactions. Productivity, quality, and satisfaction improvements are documented despite Kelly's criticisms, suggesting partial validation. The lesson for practitioners is to use STS thinking to guide design questions, not to treat it as a validated optimization formula.
Annotated Case Study
Aviation CRM: The Template for Cross-Industry Transfer
Crew Resource Management (CRM) is the most thoroughly documented example of a safety program designed from cross-industry evidence and deployed at scale. Tracing how it emerged, what it achieved, and how it transferred to healthcare reveals the pattern that other industries — including software — must navigate.
Origins. CRM training traces to a 1979 NASA workshop that identified communication failures as root causes of aviation accidents. Systematic analysis of air transport accidents revealed that interpersonal communication breakdowns — not technical failures — were the primary culprit in many incidents. This is a Safety II insight avant la lettre: the failure mode was not a component breaking down, but adaptive coordination collapsing under pressure.
Why this matters: The NASA finding reframed aviation safety from a hardware and procedure problem into a human factors problem. It opened the door to team-based interventions targeting psychological safety and blame-free error analysis. Without this reframing, CRM would have had no foundation.
Design. CRM training focuses on non-technical skills: leadership, communication, interpersonal relationships, situational awareness, and error management. The goal is to reduce accidents through improved team dynamics rather than through additional technical defenses.
Why this matters: This is a deliberate choice to intervene at the social system level rather than the technical system level. The intervention is not "add more checklists" — it is "change how the crew reasons together under pressure."
Transfer to healthcare. Healthcare systems adopted "Crisis Resource Management" from aviation to address human factors contributing to medical errors. The training formats focus on the same non-technical skill clusters. Psychological safety in healthcare settings is associated with improved patient safety outcomes, enhanced quality improvement engagement, and increased clinician well-being. Healthcare teams with higher psychological safety are more likely to speak up about errors and engage in quality improvement initiatives.
Why this matters: The transfer worked — partially. Empirical evidence on CRM effectiveness varies across studies and implementation contexts. The training format translates; outcome improvements are real but inconsistent. This is the honest picture of cross-industry transfer: the framework travels, the implementation variability remains.
What transfers to software. The underlying pattern from aviation is:
- Analyze incidents to identify the actual failure mode (communication, coordination, hierarchy, or technical).
- Design the intervention at the system level where the failure actually occurs.
- Build psychological safety as the enabling condition for learning from failure.
- Accept that outcomes will vary with implementation context.
Software engineering teams face analogous dynamics: on-call engineers who do not escalate because they fear appearing incompetent; postmortems that silently attribute blame despite "blameless" labels; architectural decisions made by hierarchy rather than expertise. CRM's core contribution — systematic, repeatable training to address these dynamics — is directly applicable.
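The four-step transfer pattern above can be sketched as a small postmortem triage helper that tags contributing factors by the system level where the failure occurred, so the intervention is designed at that level (steps one and two). The failure-mode labels come from the bullet list above; the dataclass and function names are hypothetical, not a real postmortem schema.

```python
from collections import Counter
from dataclasses import dataclass

# Labels drawn from the transfer pattern above; a real internal
# postmortem taxonomy will differ.
FAILURE_MODES = {"communication", "coordination", "hierarchy", "technical"}

@dataclass
class ContributingFactor:
    description: str
    mode: str  # one of FAILURE_MODES

def intervention_targets(factors: list[ContributingFactor]) -> Counter:
    """Count contributing factors by failure mode, so interventions
    land at the system level where failures actually occur."""
    for f in factors:
        if f.mode not in FAILURE_MODES:
            raise ValueError(f"unknown failure mode: {f.mode}")
    return Counter(f.mode for f in factors)

# Illustrative factors echoing the software dynamics described above.
factors = [
    ContributingFactor("on-call did not escalate for fear of looking "
                       "incompetent", "communication"),
    ContributingFactor("two teams shipped conflicting config changes",
                       "coordination"),
    ContributingFactor("retry storm amplified the initial fault",
                       "technical"),
    ContributingFactor("senior engineer's hunch overrode the runbook",
                       "hierarchy"),
]
print(intervention_targets(factors))
```

If most factors cluster under communication or hierarchy, the CRM lesson is that the durable fix is team-level training and psychological safety work, not another technical defense.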
Thought Experiment
Designing a Safety Improvement Roadmap
You are brought in as a consultant to a mid-sized engineering organization. It has 120 engineers organized into 12 product teams, each owning services in a shared production infrastructure. The organization has the following characteristics:
- Incident postmortems exist but are rarely completed. When they are completed, findings rarely lead to action.
- Two major outages in the past year. Both were attributed to "human error" in the postmortem summaries. No systemic changes followed.
- Engineers report feeling hesitant to raise concerns about risky deployments during sprint reviews.
- The on-call rotation has high turnover. Engineers rotate off voluntarily whenever possible.
- The platform team has built a comprehensive internal observability stack, but fewer than 30% of product teams use it effectively.
- Leadership talks about "engineering excellence" and "reliability" but does not track leading indicators of safety posture.
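Before designing the roadmap, it helps to make the scenario's facts measurable. The sketch below turns them into leading safety-posture indicators, the kind leadership is not yet tracking. All field names are assumptions for illustration, and the specific counts (beyond the 12 teams and sub-30% adoption stated above) are invented to make the example runnable.

```python
from dataclasses import dataclass

@dataclass
class SafetyPosture:
    postmortems_opened: int
    postmortems_completed: int
    action_items_created: int
    action_items_closed: int
    teams_total: int
    teams_using_observability: int

    def leading_indicators(self) -> dict[str, float]:
        """Leading indicators track learning capacity; outage counts
        are a lagging, Safety I measure."""
        return {
            "postmortem_completion_rate":
                self.postmortems_completed / max(self.postmortems_opened, 1),
            "action_item_closure_rate":
                self.action_items_closed / max(self.action_items_created, 1),
            "observability_adoption":
                self.teams_using_observability / self.teams_total,
        }

# Scenario figures: 12 teams, 3 using observability effectively
# (under 30%); postmortem counts are illustrative guesses.
posture = SafetyPosture(
    postmortems_opened=20, postmortems_completed=6,
    action_items_created=15, action_items_closed=3,
    teams_total=12, teams_using_observability=3,
)
for name, value in posture.leading_indicators().items():
    print(f"{name}: {value:.0%}")
```

Tracking ratios like these over time, rather than outage counts alone, is one concrete way to operationalize the "active management" obligation the resilience literature describes.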
Your task. Design a safety improvement roadmap for this organization. Your roadmap must address culture, processes, and organizational design. Use the frameworks from across this learning plan as your analytical lens.
Guiding questions to push your thinking:
- Both major outages were attributed to "human error." Using STAMP or CAST logic, what questions would you ask to determine whether this attribution was accurate or whether it was masking control structure failures? What would you look for?
- The completed postmortems are not driving change. Is this primarily a Safety I failure (bad investigation methods), a Safety II failure (not learning from normal performance variability), an organizational design failure, or a psychological safety failure? How would you diagnose which?
- The observability stack is built but underused. Using Wardley Mapping's resource allocation framework, how would you characterize the maturity of observability as a capability in this organization, and what does that imply about where to invest next?
- You want to introduce a version of CRM-style team communication training, adapted from aviation. What would you need to establish first — as a prerequisite — before training has any chance of producing durable change? What does the aviation evidence suggest about sequencing?
- The resilience engineering literature argues that safety requires ongoing investment and active management, not one-time fixes. How would you design a continuous safety improvement mechanism into this organization's operating model, so that improvements do not decay after the initial intervention?
There is no single correct answer. A strong response will explicitly name which frameworks it is drawing on and why, acknowledge where evidence is mixed or uncertain, and sequence recommendations in a way that respects organizational change dynamics — starting where momentum is available, not where the theory says the problem is largest.
Key Takeaways
- Safety is an emergent, dynamic property. It cannot be certified into existence through audits or checklists. It must be actively maintained through ongoing organizational investment in monitoring, learning, anticipation, and adaptation.
- Safety I and Safety II are complements, not competitors. Hazard control and resilience development address different aspects of system safety. Organizations that abandon Safety I controls in favor of Safety II thinking, or vice versa, lose the capabilities that each provides.
- Frameworks are lenses, not algorithms. NAT, HRO, STAMP, Resilience Engineering, and sociotechnical theory each illuminate different aspects of system safety. The appropriate framework depends on what kind of system you have and what kind of question you need to answer. Epistemic humility about each framework's limits is not optional — it is part of applying them correctly.
- Cross-industry transfer works at the principle level, not the practice level. Aviation CRM produced repeatable improvements in communication and team performance. The principle — design team-based interventions to address communication failures — transfers. The specific training format requires adaptation to context, and outcome evidence remains variable.
- Psychological safety is a prerequisite, not an outcome. In both aviation CRM and healthcare quality improvement, psychological safety is what makes it possible for team members to surface problems, speak up about errors, and engage with improvement processes. Without it, other safety investments produce incomplete returns.
Further Exploration
Foundational texts
- Erik Hollnagel: Safety-I and Safety-II, the past and future of safety management — The primary source for the Safety I / Safety II conceptual distinction.
- From Safety-I to Safety-II: A White Paper — Hollnagel's accessible summary of the paradigm shift, written for the NHS.
- The Evolution of Crew Resource Management Training (FAA) — Primary source history of CRM from the FAA's perspective.
Research syntheses
- Normal Accident Theory Versus High Reliability Theory: A Resolution and Call for an Open Systems View of Accidents — The core argument for theoretical complementarity rather than competition.
- Safety and reliability in aviation: A systematic scoping review of NAT, HRT, and resilience engineering — Comprehensive review of how these frameworks have been applied and integrated in aviation.
- Building a safer future: Analysis of studies on Safety I and Safety II — Construction industry synthesis advocating integration.
- Integrating Safety-I and Safety-II: Learning from failure and success — Examines the tension and the integration argument.
Empirical evidence on outcomes
- Essential elements and outcomes of psychological safety in healthcare: A systematic review — What happens to patient outcomes when psychological safety is present.
- Crew Resource Management Systematic Review (PMC) — Evidence base for CRM effectiveness in healthcare.
- Resilience engineering concept in enterprises with and without OSH management systems — Empirical comparison of resilience engineering outcomes.
Critiques worth reading
- A Reappraisal of Sociotechnical Systems Theory — Kelly, 1978 — The foundational critique of STS theory that practitioners rarely encounter but should.
- HRO and RE: A pragmatic perspective — An honest examination of the extent to which HRO and Resilience Engineering can, and cannot, be integrated.