STAMP and Systems-Theoretic Analysis
Safety as a control problem — and what that means for how you design, analyze, and investigate
Learning Objectives
By the end of this module you will be able to:
- Explain STAMP's central claim that accidents are failures of control, not failures of components.
- Define safety constraints and trace how their violation leads to accidents in the STAMP framework.
- Identify the four types of inadequate control and map them onto software engineering scenarios.
- Apply STPA to identify unsafe control actions in a simplified system design.
- Distinguish CAST from root cause analysis and explain when each approach is appropriate.
Core Concepts
The paradigm shift: from failure prevention to control
Traditional safety engineering asks: what component failed? STAMP asks a different question: what control failed to enforce a safety constraint?
Nancy Leveson's STAMP model reconceptualizes safety as an emergent property of a system — not a property of any individual component. A system is safe when its control structures successfully enforce the constraints that keep the system within safe operating boundaries. When those control structures break down, accidents become possible regardless of whether any individual component has technically "failed."
This is a genuine paradigm shift. A software component can execute correctly, return the right value, and still cause a hazardous state — because it issued a control action at the wrong time, in the wrong context, or with an incorrect model of the system it was controlling. STAMP accounts for this. Traditional failure analysis largely does not.
Software rarely causes accidents by crashing or returning wrong values. It causes accidents by issuing plausible-looking control commands that happen to be unsafe given the system's actual state. STAMP is designed to surface exactly this class of hazard.
Safety constraints
The most fundamental concept in STAMP is the safety constraint — a relationship between system variables that must be maintained to prevent a hazardous state. Constraints define what the system must not do (or must always do) to remain safe.
Accidents in the STAMP model occur when these constraints are violated. The cause is not a broken part — it is inadequate control at one or more levels of the system that fails to enforce the constraint. This means the analyst's job is not to find the failed component but to ask: which constraint was violated, and why did the control structure fail to enforce it?
Hierarchical control structures
STAMP represents systems as hierarchical control structures, where each level imposes constraints on the level below it. Each level contains:
- Control actions — commands sent downward to enforce constraints on the controlled process.
- Feedback mechanisms — information flowing upward about the current state of the controlled process.
- Process models — the controller's internal representation of what the controlled process is doing.
Safety violations occur when any level becomes ineffective at enforcing its constraints on the level below. This means failures are not just operational events — they can originate at management, regulatory, design, or organizational levels that are far removed from the proximate technical event.
Control loops and process models
Each controller in the hierarchy maintains a process model — an internal representation of the state of the process it is controlling. Control actions are issued based on this model. Feedback updates the model.
Safety failures arise when:
- Control actions are inadequate (the controller does something wrong).
- Feedback is inadequate (the controller doesn't learn that something has gone wrong).
- The process model is inaccurate (the controller has a wrong mental model and issues actions based on false beliefs about system state).
This third failure mode is particularly important for software. An automated controller can have a perfectly correct algorithm but operate on stale sensor data, state corrupted by a race condition, or an out-of-sync replica — and its control actions will be wrong because its internal model of the world is wrong.
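The control loop above can be sketched in code. This is a minimal toy illustration, not from any real system: a controller whose logic is entirely correct, but whose control actions become unsafe the moment feedback stops updating its process model. The tank scenario, class name, and thresholds are all assumptions chosen for illustration.

```python
import time

class TankController:
    """Toy controller: keeps a tank's fill level below a safe limit.

    The controller never 'fails' in the component sense -- its
    algorithm is correct -- but if feedback stops arriving, its
    process model drifts from the real state and its control
    actions become unsafe.
    """
    SAFE_LIMIT = 80.0        # safety constraint: level must stay below this
    FEEDBACK_MAX_AGE = 5.0   # seconds before the model is considered stale

    def __init__(self) -> None:
        self.believed_level = 0.0                 # process model
        self.last_feedback_at = time.monotonic()  # model freshness

    def on_feedback(self, measured_level: float) -> None:
        """Feedback updates the process model."""
        self.believed_level = measured_level
        self.last_feedback_at = time.monotonic()

    def decide_action(self) -> str:
        """Issue a control action based on the process model, not reality."""
        # Guarding against a stale model is itself a safety constraint:
        # without this check, a dropped sensor feed silently keeps
        # producing 'open_valve' against an unknown real state.
        if time.monotonic() - self.last_feedback_at > self.FEEDBACK_MAX_AGE:
            return "halt"  # fail safe when the model may be wrong
        if self.believed_level >= self.SAFE_LIMIT:
            return "close_valve"
        return "open_valve"
```

Note the design choice: the stale-model guard encodes the STAMP insight that an inaccurate process model is a hazard in its own right, independent of any component failure.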
The four types of inadequate control
STAMP identifies four primary factors that cause inadequate control at each level of the hierarchy:
| # | Factor | Description |
|---|---|---|
| 1 | Missing or incomplete constraints | The system was designed without the necessary safety constraints in place. No controller exists for a hazardous state because the hazard was never anticipated. |
| 2 | Inadequate control commands | A controller exists and knows about the constraint, but issues a command that fails to enforce it — wrong timing, wrong value, wrong target. |
| 3 | Commands incorrectly executed | The control command was correct, but the lower-level process failed to execute it as intended — miscommunication, incompatibility, or actuator failure. |
| 4 | Inadequate feedback or information | The controller cannot tell whether its commands are being enforced because the feedback loop is broken, delayed, or misleading. |
These four categories apply at every level of the hierarchy — organizational, managerial, design, and operational. An accident is typically the result of multiple contributing factors distributed across several levels.
Step-by-Step Procedure: Applying STPA
STPA (System-Theoretic Process Analysis) is the proactive hazard analysis method built on STAMP. You apply it during design or development to identify unsafe control actions before an accident occurs. Here are the four steps:
Step 1: Define the purpose of the analysis
Specify the losses you are trying to prevent (e.g., harm to users, data corruption, system unavailability) and the system-level hazards — system states or conditions that could lead to those losses under certain environmental conditions.
Decision point: Be concrete. Vague hazards produce vague results. A hazard like "the system is in an unsafe state" is not useful. A hazard like "the deployment pipeline pushes a broken build to production with no rollback triggered" is.
Step 2: Model the control structure
Draw the hierarchical control structure for the system under analysis. For each controller (human or automated), identify:
- What control actions it issues.
- What feedback it receives.
- What process model it maintains.
Decision point: Include non-technical controllers — product managers, SRE on-call, automated monitoring systems. STAMP's power comes from spanning the whole sociotechnical system.
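One lightweight way to record a control structure is as plain data. The sketch below (a hypothetical structure, loosely based on the deployment pipeline worked example later in this module; all names are illustrative) captures each controller's control actions, feedback, and process model, and shows how a simple check can flag a structural gap:

```python
from dataclasses import dataclass

@dataclass
class Controller:
    """One node in the hierarchical control structure."""
    name: str
    control_actions: list[str]  # commands issued downward
    feedback: list[str]         # information received from below
    process_model: list[str]    # state variables the controller tracks

# Hypothetical control structure for a deployment pipeline.
structure = [
    Controller(
        name="on-call engineer",
        control_actions=["halt deployment"],
        feedback=["pages/alerts", "dashboard metrics"],
        process_model=["is a deploy in flight?", "is prod healthy?"],
    ),
    Controller(
        name="CI/CD system",
        control_actions=["deploy", "rollback"],
        feedback=["test results", "deployment health checks"],
        process_model=["current artifact version", "health-check status"],
    ),
]

def missing_feedback(c: Controller) -> bool:
    # A controller that issues actions but receives no feedback cannot
    # verify that its constraints are being enforced (factor 4).
    return bool(c.control_actions) and not c.feedback
```

Writing the structure down this explicitly makes the gaps visible: any controller flagged by `missing_feedback` is issuing commands into the void.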
Step 3: Identify unsafe control actions (UCAs)
For each control action, evaluate it against four STPA UCA types:
| UCA type | Description |
|---|---|
| Not provided | The control action was needed but not issued. |
| Provided unsafely | The control action was issued in a context or system state where it creates a hazard. |
| Wrong timing / order | The action was correct in content but issued too early, too late, or out of sequence. |
| Stopped too soon / applied too long | The action started correctly but ended at the wrong time. |
Document each unsafe control action with a reference to the hazard it could contribute to.
Decision point: If you are unsure whether a UCA is realistic, ask: under what system state would this action cause a hazard? That question leads directly to Step 4.
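Step 3 is mechanical enough to scaffold in code. The sketch below (an illustrative helper, not part of any standard STPA tooling) expands each control action into the four UCA questions an analyst must answer:

```python
UCA_TYPES = [
    "not provided",
    "provided unsafely",
    "wrong timing / order",
    "stopped too soon / applied too long",
]

def uca_worksheet(control_actions: list[str]) -> list[tuple[str, str, str]]:
    """Expand each control action into the four UCA questions.

    Returns (action, uca_type, guiding_question) rows; the analyst
    fills in whether each combination is plausible and which hazard
    it maps to.
    """
    rows = []
    for action in control_actions:
        for uca in UCA_TYPES:
            question = (
                f"Under what system state would '{action}' "
                f"({uca}) lead to a hazard?"
            )
            rows.append((action, uca, question))
    return rows
```

The point of the worksheet shape is completeness: every control action gets all four questions, so no UCA type is skipped by accident.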
Step 4: Identify causal scenarios
For each UCA, identify the scenarios — combinations of system state, context, and control-loop failures — that could lead to the UCA occurring. Use the four inadequate control factors from the STAMP model to structure this search:
- What constraints are missing from the design?
- What incorrect control commands could be issued?
- What correct commands could be incorrectly executed?
- What feedback failures could prevent detection?
The output is a set of causal scenarios linked to hazards and losses, which can then drive requirements, testing, or design changes.
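The output described above — causal scenarios linked to hazards and losses — can be recorded as structured data so it can drive requirements directly. This is an illustrative record format, not a prescribed STPA artifact; the example entry paraphrases a scenario from the worked example later in this module:

```python
from dataclasses import dataclass

@dataclass
class CausalScenario:
    """One row of STPA Step 4 output, with full traceability."""
    uca: str       # e.g. "rollback: not provided"
    factor: str    # one of the four inadequate control factors
    scenario: str  # concrete combination of state and context
    hazard: str    # the system-level hazard this contributes to

# Illustrative entry; wording is hypothetical.
example = CausalScenario(
    uca="rollback: not provided",
    factor="inadequate feedback or information",
    scenario="health check reports only aggregate error rates, "
             "masking a spike on one endpoint",
    hazard="broken build stays in production with no rollback",
)

def to_requirement(s: CausalScenario) -> str:
    """Each scenario can be turned into a candidate design requirement."""
    return f"The system shall prevent: {s.scenario} (mitigates: {s.hazard})"
```

Keeping the UCA, factor, hazard, and scenario in one record preserves the traceability chain from loss to design change.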
Worked Example: STPA on a Deployment Pipeline
Consider a simplified continuous deployment pipeline with the following actors:
- CI/CD system — automated controller that merges code, runs tests, and triggers deployments.
- On-call engineer — human controller who monitors alerts and can halt deployments.
- Production environment — the controlled process.
Step 1 — Losses and hazards
- Loss: Production outage causing user-facing errors.
- Hazard: A broken build is deployed to production without triggering an automatic rollback or human escalation.
Step 2 — Control structure
The CI/CD system issues "deploy" and "rollback" control actions to the production environment. The on-call engineer issues "halt deployment" commands to the CI/CD system. Feedback comes from test results, deployment health checks, and alerting.
Step 3 — Unsafe control actions
| Control action | UCA type | Description |
|---|---|---|
| Deploy | Provided unsafely | Deployment is triggered despite test failures because the test gate was bypassed by a configuration flag. |
| Rollback | Not provided | Rollback is not triggered after a health-check failure because the health-check threshold was set too loosely. |
| Deploy | Wrong timing | Deployment is issued during a period when a database migration is still in progress, causing incompatibility. |
| Halt deployment | Not provided | The on-call engineer does not receive an alert because the alerting rule was misconfigured after a tooling migration. |
Step 4 — Causal scenarios (excerpt)
For the UCA "Rollback not provided after health-check failure":
- Missing constraint: No requirement exists that health-check thresholds must be reviewed when error-rate baselines change.
- Inadequate feedback: The health-check reports aggregate error rates, masking a spike on one endpoint.
- Incorrect command execution: The rollback command is issued but targets the wrong deployment artifact because the deployment manifest was not atomically updated.
Notice that in this example, several UCAs trace back not to broken components but to controllers operating on incorrect process models — stale thresholds, wrong deployment artifacts, misconfigured alerts. This is the failure mode STAMP is most valuable for surfacing.
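The "rollback not provided" scenario can be made concrete in a few lines. This is a minimal sketch under an assumed health-check design (function names, threshold, and endpoint names are all hypothetical): averaging error rates across endpoints masks a severe spike on one endpoint, so the rollback control action is never issued — a component-level test of either function would pass, yet the control loop is unsafe.

```python
def should_rollback_aggregate(rates: dict[str, float],
                              threshold: float = 0.05) -> bool:
    """Flawed feedback from the causal scenario: averaging error rates
    across endpoints masks a severe spike on a single endpoint, so the
    rollback action is never provided (a UCA)."""
    mean_rate = sum(rates.values()) / len(rates)
    return mean_rate > threshold

def should_rollback_per_endpoint(rates: dict[str, float],
                                 threshold: float = 0.05) -> bool:
    """Per-endpoint feedback enforces the constraint where the hazard
    actually manifests: any one endpoint above threshold triggers
    rollback."""
    return any(r > threshold for r in rates.values())
```

The fix is not a repaired component but a redesigned feedback channel — exactly the kind of recommendation an STPA/CAST analysis produces.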
Compare & Contrast: STPA vs. CAST, and STAMP vs. Root Cause Analysis
STPA vs. CAST
| | STPA | CAST |
|---|---|---|
| Purpose | Proactive hazard identification during design | Retrospective analysis after an incident |
| Starting point | System design artifacts, control structure model | An accident or near-miss that has occurred |
| Output | Unsafe control actions and causal scenarios | Explanation of how the control structure failed to prevent the accident |
| Primary use | Requirements, testing, design review | Post-incident review, process improvement |
CAST (Causal Analysis using Systems Theory) applies the same STAMP lens retrospectively. You reconstruct the control structure as it existed at the time of the incident, trace which constraints were violated, and identify which of the four inadequate control factors contributed at each level. Like STPA, CAST explicitly examines both technical and organizational/human factors — not just the proximate technical event.
STAMP vs. traditional root cause analysis
Root cause analysis looks for the broken part. STAMP looks for the broken control loop — and finds that control loops break at every level of the organization, not just at the operational endpoint.
Traditional root cause analysis typically:
- Traces a linear chain of events backward to a "root cause."
- Identifies a proximate failure (a component, a human error, a procedure violation).
- Recommends fixing or replacing the failed element.
STAMP-based analysis:
- Treats the accident as the result of nonlinear interactions across the control hierarchy.
- Identifies constraint violations at multiple levels simultaneously.
- Examines organizational, managerial, and design-level contributors, not just operational ones.
- Produces recommendations distributed across the control hierarchy, not a single "fix."
For software engineers accustomed to post-mortems, CAST is the STAMP equivalent of a post-mortem — except it systematically asks about every level of the system, not just the technical sequence of events.
When does STAMP fit in the broader safety theory landscape?
Contemporary safety research positions STAMP as complementary to other frameworks — Normal Accidents Theory, High Reliability Organizations, and Resilience Engineering — rather than as a replacement. STAMP excels at structured, engineering-facing analysis of control failures. It is most powerful when you need to generate concrete design requirements or conduct rigorous post-incident investigation. It is less suited to understanding organizational culture or the social dynamics that shape safety behavior, which HRO theory addresses more directly.
Active Exercise
Apply a partial STPA to a system you work with or know well. You do not need to complete every step — the goal is to practice the control-structure lens.
Task:
1. Choose a feature or subsystem. Write one sentence each for: (a) a loss you care about preventing, and (b) a system-level hazard — a condition that could lead to that loss.
2. Identify two controllers in the system (they can be automated systems, humans, or both). For each, list: one control action they issue, and what feedback they receive about whether that action was effective.
3. Choose one control action and evaluate it against the four UCA types. For each type, write a one-sentence description of how that failure could occur.
4. Pick the UCA you find most plausible and write down one causal scenario using each of the four inadequate control factors. Which factors actually apply? Which do not? Why?
Debrief question: Did the exercise surface any hazard scenarios you had not previously considered? If so, what was missing from your existing mental model of the system?
Key Takeaways
- STAMP reframes safety as a control problem. Accidents are not caused by broken components — they result from control structures failing to enforce safety constraints. This shift enables analysis of failures that traditional models miss entirely.
- Safety constraints are the unit of analysis. Every accident in STAMP is a constraint violation. Identifying the relevant constraints for a system is the first and most important step in any STAMP-based analysis.
- Control failures happen at every level of the hierarchy. Organizational, managerial, design, and operational failures all contribute. An analysis that stops at the technical proximate event is incomplete.
- STPA is proactive; CAST is retrospective. Both apply the same STAMP framework. Use STPA during design to find unsafe control actions before they cause harm. Use CAST after an incident to understand how the entire control structure — not just the immediate technical failure — contributed.
- Software hazards are control hazards. Software causes accidents through unsafe interactions with the control structure, not primarily through component failures. STPA and CAST are well-suited to this class of hazard in a way that traditional fault-tree or FMEA analysis is not.
Further Exploration
Foundational Texts
- A New Accident Model for Engineering Safer Systems — Leveson — The original STAMP paper. Dense but precise. Read the first three sections to get the core model.
- STPA Handbook — Leveson & Thomas (2018) — The definitive practical guide to running an STPA. Freely available.
- CAST Handbook — Leveson — Step-by-step guidance for post-incident analysis using STAMP.
Application to Software Systems
- A Hazard Analysis Method for Software-Controlled Systems Based on STPA — IEEE — Demonstrates how STPA applies to software-specific hazard categories.
- Cyber-Security Incident Analysis by CAST — IEEE — Applies CAST to a cybersecurity incident, showing how the method generalizes beyond physical safety.
- An Integrated Approach to Safety and Security Based on Systems Theory — ACM — Leveson's argument for using STAMP as a unified lens for safety and security.
Situating STAMP in the Broader Field
- Engineering a Safer World: Systems Thinking Applied to Safety — Leveson
- Normal Accident Theory Versus High Reliability Theory: A Resolution and Call for an Open Systems View of Accidents — Useful for understanding how STAMP fits alongside other safety frameworks.