Safety II and Resilience Engineering
From preventing failure to enabling success
Learning Objectives
By the end of this module you will be able to:
- Define Safety II and contrast it with Safety I across at least three dimensions.
- Explain why learning only from failures misses the majority of information about how systems actually function.
- Apply the four resilience capacities — anticipate, monitor, respond, learn — to evaluate the resilience of a production engineering system.
- Describe graceful degradation and explain its relationship to resilience engineering.
- Explain how performance variability functions as a resource rather than a risk in Safety II thinking.
Core Concepts
The Two Safety Paradigms
Most safety practice, and most engineering culture, was built on a single foundational idea: safety means not having accidents. If something bad happens, investigate why and close the gap. If nothing bad happens, you are safe. This is Safety I.
Safety I treats safety as the absence of accidents and incidents. Its goal is to ensure that "as few things as possible go wrong." The methods that follow from this definition are familiar: root cause analysis after incidents, protective barriers and defenses, and failure counts as the primary safety metric.
Safety II, a framework formalized by Erik Hollnagel — Professor at the University of Southern Denmark, with decades of applied work spanning nuclear power, aviation, software engineering, and healthcare — inverts this lens. Safety II treats safety as the ability of a system to succeed under varying conditions. Its goal is to ensure that "as many things as possible go right."
Safety I asks: why did this fail? Safety II asks: why does this succeed so consistently?
This is not merely a reframing. The two paradigms lead to different investigations, different investments, and different models of what humans and organizations contribute to system performance.
Performance Variability as a Resource
In the Safety I model, human variability is a liability. Deviations from procedure are risk factors. The corrective impulse is to standardize, constrain, and audit compliance.
Safety II asserts something counterintuitive: things go right primarily because workers make sensible, situationally appropriate adjustments to procedures and plans, not simply because they follow prescribed procedures. Success depends on situational awareness, flexibility, and the ability to adapt to varying conditions. Rules alone cannot substitute for these capacities, and overly rigid procedural constraints often restrict them.
Safety II assumes that everyday performance variability provides the adaptations needed to respond to varying conditions, and hence is the reason why things go right. This reframes variability from a noise source to a signal — and from a risk to be suppressed to a resource to be understood and amplified.
Safety II explicitly investigates the gap between how work is designed (work-as-imagined) and how work actually gets done (work-as-done). This gap is not a compliance failure — it is where adaptation lives.
The Theoretical Foundation
Resilience engineering provides the theoretical foundation for Safety II. It represents a paradigm shift from traditional safety approaches that prioritize preventing component failures through design redundancy and fault elimination.
Rather than assuming systems can be designed to prevent all failures, resilience engineering acknowledges that complex sociotechnical systems will inevitably encounter unexpected situations and focuses instead on building the capacity to detect, understand, and adapt to disturbances. Crucially, resilience engineering recognizes that organizational variability is unavoidable and beneficial, and should be managed rather than dampened.
This theoretical lineage also connects to the "New View" of human error, which emerged from cognitive systems engineering research in the 1980s and 1990s. Initiated by Rasmussen, Hollnagel, and Woods, that body of work challenged the usefulness of treating human error as a root cause of failure.
Learning from Success
A core distinguishing practice in Safety II is learning from successful operations and everyday adaptations that enable systems to succeed, rather than exclusively learning from failures.
Safety II focuses on the overwhelming majority of outcomes, the ones that go well, and on understanding why they occur. When a system handles tens of thousands of requests per day without incident, each successful transaction is data about which adaptive mechanisms are in place, what margins are being maintained, and what human expertise is quietly compensating for gaps in automated processes.
Learning in Safety II broadens to include everyday work, successes, and adaptations, not just incident analysis. This involves studying the adaptations workers make to handle unexpected challenges and varying conditions, which helps detect gaps between work-as-imagined and work-as-done before those gaps produce incidents.
Organizations that only learn from failures are operating with a systematically incomplete dataset. For every incident, there are many more near-misses, close calls, and quiet adaptations that never make it into any incident report. Safety II treats those as equally informative.
The Four Capacities of Resilient Systems
Safety II and resilience engineering identify four core adaptive capacities that characterize resilient systems. These capacities are interrelated and inform each other.
Anticipate — the ability to forecast threats that may occur in the future and prepare proactive responses. This involves scenario planning, horizon scanning, and recognizing emerging patterns that signal potential threats. Anticipation is what distinguishes resilience engineering from purely reactive approaches.
Monitor — the ability to track relevant conditions and changes, maintaining situational awareness about where the system currently stands relative to its boundaries and thresholds. Monitoring enables early detection of drift before it becomes crisis.
Respond — the capacity to adjust operations in real-time when irregular events or unexpected variations occur. Response capability includes both planned responses to anticipated problems and the adaptive flexibility to create novel responses to situations that deviate from expectations. Effective response depends on frontline workers having authority and expertise to make rapid adjustments.
Learn — the ability to acquire knowledge from experience, including from everyday work and successes, not only from failures. A key marker of highly resilient systems is the ability to borrow or transfer adaptive capacity from other units or domains.
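To make the monitor capacity concrete, one option is to track how much margin remains before a known boundary instead of alerting only once the boundary is crossed. The following is a minimal sketch under stated assumptions: the function names, the 500 ms latency limit, the window size, and the sample readings are all hypothetical and chosen only for illustration.

```python
from statistics import mean

def margin(value: float, limit: float) -> float:
    """Fraction of headroom left before `value` reaches `limit` (1.0 = all, 0.0 = none)."""
    return max(0.0, (limit - value) / limit)

def is_drifting(samples: list[float], limit: float,
                window: int = 3, floor: float = 0.2) -> bool:
    """Flag drift when the average margin over the last `window` samples falls
    below `floor`, even if no single sample has crossed the limit."""
    recent = samples[-window:]
    return mean(margin(s, limit) for s in recent) < floor

# Hypothetical p99 latency readings (ms) against an assumed 500 ms boundary.
latencies = [210.0, 250.0, 320.0, 390.0, 410.0, 430.0]
drifting = is_drifting(latencies, limit=500.0)  # no reading breaches 500 ms, yet margin is thin
```

The design choice here is to report position relative to the boundary, which is what lets a team notice drift while there is still room to respond.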
System Boundaries and Graceful Degradation
Resilience is not infinite. Resilient systems possess buffering capacity, flexibility, margin, and tolerance — where tolerance refers to system behavior near a boundary characterized by gradual degradation rather than sudden failure. Brittleness is the opposite: sudden system collapse when events push the system beyond its limits.
Resilient systems exhibit graceful degradation — the ability to continue functioning at reduced capacity when critical processes or subsystems are compromised, rather than experiencing sudden catastrophic failure. Graceful degradation is an intentional design property that allows systems to tolerate failures by reducing functionality or performance while maintaining essential operations.
This is where resilience engineering connects directly to distributed systems design. Circuit breakers, load shedding, read-only fallback modes, partial availability under dependency failure — these are engineering implementations of the same principle.
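As an illustration of the circuit-breaker pattern mentioned above, the sketch below wraps a dependency call so that repeated failures switch the caller to a reduced fallback instead of failing outright. It is a simplified teaching example, not a production implementation: the class, its parameters, the `flaky` dependency, and the "cached defaults" fallback are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; serves a fallback while open."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        # While open, skip the failing dependency until the reset window has passed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()  # degrade instead of propagating the error
        self.failures = 0
        return result

def flaky() -> str:
    # Hypothetical dependency that is currently down.
    raise ConnectionError("recommendations service unavailable")

now = [0.0]  # fake clock for the example
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])
results = [breaker.call(flaky, lambda: "cached defaults") for _ in range(3)]
```

In this sketch the caller always receives an answer. The first two calls fail and fall back, and the third never touches the dependency because the breaker is open. The system loses freshness rather than availability, which is the reduced-capacity behavior described above.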
Compare & Contrast
Safety I vs. Safety II
| Dimension | Safety I | Safety II |
|---|---|---|
| Definition of safety | Absence of accidents and incidents | Ability to succeed under varying conditions |
| Primary goal | Ensure as few things as possible go wrong | Ensure as many things as possible go right |
| What is studied | Failures, incidents, near-misses | Successes, adaptations, everyday performance |
| View of human variability | A risk and source of error to be constrained | A resource and the reason things go right |
| Learning source | Root cause analysis of adverse events | Analysis of successful operations and adaptations |
| Measurement of safety | Counting failures (incident rates, MTBF) | Understanding adaptive capacity and margins |
| Human role | Potential failure point; compliance is the goal | Source of adaptability, expertise, and situational judgment |
| When things go wrong | Assign cause, close the gap, prevent recurrence | Understand what changed, what adaptive capacity was missing |
Safety I and Safety II are not adversarial. Investigating failures is still valuable — Safety II does not argue otherwise. The point is that an organization that only investigates failures is looking at a small, unrepresentative slice of its own operational reality. Both lenses are needed.
Resilience Engineering vs. Traditional Safety Engineering
| Dimension | Traditional Safety Engineering | Resilience Engineering |
|---|---|---|
| Core assumption | Failures can be designed out | Unexpected situations are inevitable |
| Primary strategy | Redundancy, fault elimination, barriers | Adaptive capacity, detection, flexible response |
| Attitude to variability | Suppress it; enforce standardization | Manage and guide it; variability enables adaptation |
| Epistemic foundation | Component-focused, rooted in mechanical reliability | Systems-theoretic, rooted in cognitive/organizational science |
| What "safe" looks like | Low incident count, high procedure compliance | System capable of detecting, responding to, and learning from disturbance |
Worked Example
Evaluating an On-Call Incident Response Process Through a Safety II Lens
Consider a team that runs a production API. They have a post-mortem process triggered by incidents (P1/P2 pages) and a runbook library. By Safety I standards, a low incident count over six months is a signal of good safety.
A Safety II analysis asks a different set of questions.
Step 1 — Study what goes right, not just what went wrong.
Review the last 90 days of on-call logs, not just the escalated incidents. What patterns appear? When did engineers resolve issues before they triggered alerts? When did someone recognize an anomaly from a dashboard and act before a customer was affected? These are adaptive successes — and the mechanisms behind them are exactly what Safety II wants to understand.
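A review like this can start with a very simple filter over on-call log entries. The sketch below assumes a hypothetical log schema (`summary`, `paged`, `customer_impact`) and made-up entries; a real team's log format will differ.

```python
from dataclasses import dataclass

@dataclass
class OnCallEntry:
    summary: str
    paged: bool             # did an alert page the on-call engineer?
    customer_impact: bool   # did customers notice?

def good_saves(entries: list[OnCallEntry]) -> list[OnCallEntry]:
    """Entries where someone acted before an alert fired and before customers were affected."""
    return [e for e in entries if not e.paged and not e.customer_impact]

# Hypothetical entries for illustration only.
log = [
    OnCallEntry("Noticed queue depth climbing on dashboard; scaled workers", False, False),
    OnCallEntry("P1: API 5xx spike after deploy", True, True),
    OnCallEntry("Rotated expiring cert flagged in weekly review", False, False),
]
saves = good_saves(log)
```

The filter only surfaces candidates. The Safety II work is the follow-up conversation about what each engineer noticed and why they were able to act on it.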
Step 2 — Assess the four capacities.
Apply the resilience framework:
- Anticipate: Does the team have a practice of reviewing upcoming releases, traffic forecasts, or infrastructure changes for risk? Or does anticipation happen informally through individual engineer experience?
- Monitor: Are dashboards and alerts calibrated to actual system behavior, or are they set to defaults that have accumulated noise over time? Can engineers quickly form a picture of system state?
- Respond: When an incident occurs, do engineers have the authority, tooling, and documented options to act quickly — including degrading gracefully? Or do novel situations require escalation chains that introduce delay?
- Learn: Does the team capture near-misses and "good saves" in addition to incidents? Is learning distributed across the team, or siloed in the engineers who happened to be on call?
Step 3 — Examine the gap between work-as-imagined and work-as-done.
Pull a recent runbook and compare it to how the team actually operates under pressure. Are there steps that require judgment not captured in the documentation? Is there tribal knowledge that lives only in the heads of specific engineers? These gaps are not compliance failures — they are evidence of adaptation. The goal is to understand them, not punish them.
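One lightweight way to surface this gap is to diff the documented steps against the steps an engineer actually took. The sketch below uses Python's standard `difflib`; the runbook steps and the as-done steps are invented for illustration.

```python
import difflib

# Work-as-imagined: the documented runbook (hypothetical).
runbook = [
    "Check error-rate dashboard",
    "Restart api-gateway",
    "Confirm recovery",
]
# Work-as-done: what the engineer actually did (hypothetical).
as_done = [
    "Check error-rate dashboard",
    "Check recent deploys in chat",   # judgment step not in the runbook
    "Drain traffic from bad node",    # replaces the blanket restart
    "Confirm recovery",
]

def gap(imagined: list[str], done: list[str]) -> tuple[list[str], list[str]]:
    """Return (steps only in the runbook, steps only in practice)."""
    diff = list(difflib.ndiff(imagined, done))
    skipped = [d[2:] for d in diff if d.startswith("- ")]
    added = [d[2:] for d in diff if d.startswith("+ ")]
    return skipped, added

skipped, added = gap(runbook, as_done)
```

The output is a prompt for discussion, not a compliance report: each added or skipped step is a place to ask what the engineer knew that the runbook does not capture.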
Step 4 — Identify brittleness.
Are there scenarios where the system degrades suddenly rather than gradually? A service that goes from fully operational to completely unavailable without intermediate states is brittle. Can the team identify which dependencies, if degraded, would produce graceful fallback versus hard failure? Resilience engineering would push the team to actively map these boundaries — not to add more redundancy, but to make the boundaries visible and manageable.
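Mapping these boundaries can begin as a plain table of dependencies and what happens when each one degrades. The sketch below is one possible shape for such a map; the dependency names, fallback descriptions, and classification are hypothetical.

```python
from enum import Enum

class OnFailure(Enum):
    GRACEFUL = "degrades to a reduced mode"
    HARD = "takes the service down"

# Hypothetical dependency map for the production API in this example:
# name -> (behavior when the dependency degrades, fallback if any).
DEPENDENCIES = {
    "recommendations": (OnFailure.GRACEFUL, "serve cached defaults"),
    "search-index": (OnFailure.GRACEFUL, "read-only results from last snapshot"),
    "auth-service": (OnFailure.HARD, None),
    "primary-db": (OnFailure.HARD, None),
}

def brittle_points(deps: dict) -> list[str]:
    """Dependencies whose degradation produces hard failure, sorted by name."""
    return sorted(name for name, (mode, _) in deps.items() if mode is OnFailure.HARD)

brittle = brittle_points(DEPENDENCIES)
```

Keeping the map explicit makes the brittle boundaries visible, so the team can decide which ones deserve a designed fallback and which ones are accepted risks.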
Step 5 — Close the learning loop.
Safety management should allocate resources to look at events that go right and try to learn from them. Consider running a monthly "success review" alongside the post-mortem process — a structured discussion of a well-handled operational event that examines why it went right and what conditions enabled it.
Analogy Bridge
The Jazz Band and the Sheet Music
Imagine two ways of evaluating a jazz performance.
The first approach reviews only the mistakes: the missed notes, the timing slips, the moments where a player lost the chord changes. The set is considered safe if the mistake count is low. To improve, you fix each mistake and add more rehearsal constraints.
The second approach asks: how did this band navigate an unexpected tempo change at the start of the third song without falling apart? How did the bassist sense that the pianist was taking an improvised detour and adjust in real time? What collective understanding allowed five people with only a lead sheet to produce something coherent under varying conditions?
The sheet music is work-as-imagined. The performance is work-as-done. The gap between them is not a failure — it is where musicianship lives.
A production system is the same. The runbook is the sheet music. The engineer at 2am navigating a dependency failure they have never seen before is the jazz musician. Safety II prizes situational awareness, flexibility, and a state of readiness to adapt and achieve the best possible outcome for the conditions at hand. You cannot put that in a runbook. But you can study it, amplify it, and build systems and teams that cultivate it.
Key Takeaways
- Safety I and Safety II define safety differently. Safety I is the absence of bad outcomes; Safety II is the presence of adaptive capacity that produces good outcomes consistently under varying conditions. Both lenses are useful; relying only on Safety I leaves most of the system's behavior unexamined.
- Performance variability is not noise — it is signal. The adaptations workers make to handle real conditions are the mechanism by which systems succeed. Safety II treats this variability as a resource to understand and amplify, not a risk to suppress.
- The four resilience capacities — anticipate, monitor, respond, learn — provide a practical framework for evaluating and improving organizational and system resilience. They are interdependent: weak anticipation undermines response; weak learning prevents anticipation from improving.
- Graceful degradation is the engineering expression of resilience. Brittle systems collapse when pushed past their limits; resilient systems degrade gradually, maintaining essential function while reducing capacity. This is a design property that must be built intentionally.
- Learning from success requires a deliberate practice shift. Incident retrospectives are necessary but not sufficient. Safety II generates its most distinctive insights by understanding why things go right: studying everyday operations, near-misses handled well, and "good saves."
Further Exploration
Foundational Frameworks
- From Safety-I to Safety-II: A White Paper — Erik Hollnagel's foundational white paper that formalized the Safety I/Safety II distinction
- Safety-I and Safety-II: The Past and Future of Safety Management — Full-length treatment of the framework for those who want theoretical depth
Resilience Engineering
- Safety II and Resilience Engineering in a Nutshell — Concise synthesis of how Safety II operationalizes resilience engineering principles
- The four cornerstones of resilience engineering — Foundational paper on the anticipate/monitor/respond/learn model
- Building a Safety Program Using Principles of Resilience Engineering — Practical guide to implementing resilience engineering principles
Practical Application
- Resilient Healthcare and the Safety-I and Safety-II Frameworks — Practical framing of these concepts in high-stakes operational environments
- Safety II: A Proactive Approach — Practitioner-oriented summary of Safety II principles and organizational change
- From deficits to possibilities — Short essay on the shift from deficit-focused to capacity-focused safety practice