Human Factors and Frontline Work
Why the gap between procedure and practice is a feature, not a flaw
Learning Objectives
By the end of this module you will be able to:
- Define the work-as-imagined / work-as-done gap and explain why it is structurally inevitable in all complex operations.
- Explain why human variability and adaptation are resources for resilience rather than sources of error.
- Describe what the new view of human error implies for how organizations should respond to mistakes.
- Identify methods for surfacing frontline worker knowledge in engineering organizations.
- Apply variance control principles to evaluate whether work design enables or constrains effective performance.
Core Concepts
Work-as-Imagined vs Work-as-Done
Every engineering organization produces two versions of how work gets done. The first is written down: runbooks, incident response procedures, architecture decision records, on-call playbooks, deployment checklists. The second is what operators actually do when things get complicated at 2am.
These two versions have formal names. Work-as-Imagined (WAI) refers to idealized formal task environments — how procedures are documented and what designers assume will happen. Work-as-Done (WAD) describes what actually occurs in practice as workers continually adapt to unpredictable conditions and variability. The gap between them is not an organizational failure. It is theoretically inevitable in complex sociotechnical systems because procedures cannot anticipate all combinations of contextual conditions, competing demands, and resource limitations.
Safe performance requires understanding the actual daily operations and interactions that achieve desired outcomes, rather than assuming prescribed procedures are followed as written.
One study examined 524 individual procedural steps and found that 66% were performed as prescribed while 34% showed discrepancies between work-as-done and work-as-imagined. That 34% is not a compliance problem. It represents operators adapting to real-world conditions that the procedure authors did not — and could not — anticipate. In many cases, these gaps are known, accepted, and even encouraged at the supervisory and local management levels.
The key insight from Safety-II research is that work-as-imagined represents idealized formal task environments that disregard how performance must be adjusted to match constantly changing real-world conditions. Understanding the WAI/WAD gap is therefore not about identifying deviants — it is about understanding what the system actually requires to function.
Human Variability as a Resource
Traditional safety thinking treats human variability as the enemy. The goal, in that model, is to make humans as machine-like as possible: follow the procedure exactly, every time, without deviation. This is the Safety-I view.
Safety-II and resilience engineering treat human variability and adaptive performance as a resource for resilience rather than a liability to be constrained. The reasoning is straightforward: complex systems encounter novel combinations of conditions that no procedure anticipated. When that happens, the only thing that stands between the system and failure is the adaptive capacity of the humans operating it.
The resilience engineering perspective emphasizes that human performance variability has both positive and negative effects, and that safety is increased by amplifying the positive effects of variability while also adding controls to mitigate negative effects. This is a fundamentally different design posture: instead of suppressing all deviation, you invest in understanding which deviations are adaptive and which are dangerous.
The New View of Human Error
The "new view" of human error emerged from cognitive systems engineering research in the 1980s and 1990s, notably in the work of Rasmussen, Hollnagel, and Woods. This body of work questions how failure should be understood and whether the concept of "human error" is analytically useful at all.
Safety Differently, Safety-II, and resilience engineering share a theoretical foundation that challenges traditional safety concepts by questioning the utility of blame-based models of human error. A key claim in this body of work is that successful and failed work performance have the same causes — operators making decisions with incomplete information under time pressure, working around system limitations, balancing competing goals. When things go well, we call that skill. When things go wrong, we call it error. The new view treats this asymmetry as analytically incoherent.
The practical implication: when something goes wrong, the question is not "who made the mistake?" but "what conditions made this action seem reasonable to the person who took it?" This shifts investigation from fault-finding to understanding the system context that shapes operator decisions.
Frontline Knowledge and the Limits of Procedure
Human expertise is fundamental to resilience engineering. Organizations build resilience by cultivating and maintaining diversity in expertise and fostering knowledge sharing. Frontline workers — the engineers who actually run the systems — hold practical knowledge about how the system actually behaves that is not written down anywhere. This is not an oversight; much of it cannot be written down. It is tacit, contextual, and learned through operational experience.
Safety Differently advocates engaging frontline workers as knowledgeable professionals in safety improvement, devolving decision-making authority to them, and treating them as the source of insight and wisdom about safety rather than as sources of risk. This is a significant organizational shift. It means moving from top-down compliance models (follow the procedure, escalate deviations) to collaborative models where frontline workers have genuine authority over safety decisions related to their own work.
Safety Differently advocates decluttering procedures, decentralizing authority, and devolving responsibility to frontline workers. The argument is that overly prescriptive procedures do not make systems safer — they push the adaptation underground, where it becomes invisible and unmanageable.
Variance Control: Where to Handle Problems
The principle of variance control from sociotechnical systems design provides a structural complement to the new view. It states that variances — deviations from expected performance or conditions — should be controlled as close as possible to their point of origin rather than being exported across organizational boundaries. Each level of the organization should be capable of coping with variances that arise at its level.
The alternative — a system where every deviation must be escalated up the hierarchy — concentrates knowledge and decision-making authority at the top, away from the people who best understand what is happening. This design creates slower response times, loss of contextual fidelity, and frontline workers who are technically responsible for outcomes but not empowered to address the conditions that produce them.
Autonomous work groups that allocate their own tasks and make day-to-day operational decisions are the organizational form that variance control implies. The whole task principle reinforces this: work should be designed so that a single, small, face-to-face group experiences the entire cycle of operations — enabling the team to see the impact of their work and coordinate across the full work cycle.
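The structural claim here can be made concrete with a toy sketch: variances are offered to the closest organizational level first, and only cross a boundary when that level genuinely cannot handle them. The level names and variance labels below are invented for illustration, not drawn from any real organization.

```python
# Hypothetical sketch of variance control close to the origin.
# Each level tries to handle a variance locally; only unresolvable
# variances cross an organizational boundary upward.

def handle_variance(variance: str, levels: list) -> str:
    """levels: ordered (name, can_handle) pairs, closest to the work first."""
    for name, can_handle in levels:
        if can_handle(variance):
            return f"handled at {name}"
    # No level could absorb it: a signal that the work design needs changing.
    return "unresolved: redesign needed"

# Invented example levels and competencies.
levels = [
    ("frontline team", lambda v: v in {"flaky deploy", "cache drift"}),
    ("platform group", lambda v: v == "quota exhaustion"),
]
```

In this sketch, escalation is the exception rather than the default path, which is the design posture the variance control principle argues for.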
Analogy Bridge
If you have been on-call for a production service, you have lived the WAI/WAD gap.
The runbook says: "If the error rate exceeds threshold X, restart the service." What the runbook does not say: what to do when the error rate is right at the threshold and restarts are making things worse, when a downstream dependency is degraded but not dead, when traffic patterns are anomalous for a reason you cannot yet explain, or when the monitoring is itself behaving unexpectedly.
In those moments, you do not follow the runbook. You adapt. You draw on knowledge you accumulated from previous incidents, patterns you noticed but never wrote down, a mental model of the system that no diagram fully captures. That adaptation is not a failure mode. It is the mechanism by which the system survives.
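The contrast between the written rule and the operator's judgment can be sketched in code. Everything here is hypothetical: the signal names, the thresholds, and the action labels are invented to illustrate the shape of the gap, not taken from any real runbook.

```python
# Hypothetical sketch: a rigid runbook rule vs. an adaptive decision.
# All names, signals, and thresholds are invented for illustration.

def runbook_as_imagined(error_rate: float, threshold: float) -> str:
    """The written rule: restart whenever the threshold is exceeded."""
    return "restart" if error_rate > threshold else "no_action"

def runbook_as_done(error_rate: float, threshold: float,
                    recent_restarts: int, downstream_degraded: bool) -> str:
    """What an experienced operator actually weighs at 2am."""
    if error_rate <= threshold:
        return "no_action"
    if recent_restarts >= 2:
        # Restarts are not helping; repeating them only adds churn.
        return "investigate"
    if downstream_degraded:
        # The errors originate downstream; restarting here won't fix them.
        return "escalate_to_dependency_owner"
    return "restart"
```

The second function is still a simplification: real adaptive judgment weighs far more context than any signature can capture, which is exactly the point.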
The gap between the runbook (WAI) and what you actually do (WAD) is not evidence that the runbook needs more rules. It is evidence that the system is complex enough that some knowledge can only live in the heads of the people who operate it — and that those people need to be trusted and empowered to use it.
Frontline engineers are the organization's most sensitive instrument for detecting the gap between how systems are designed and how they actually behave. Every workaround, every "it only works if you do it this way," every implicit knowledge transfer during a handoff — these are signals about where WAI and WAD have diverged. Surfacing these signals is the point of post-incident review.
Worked Example
Incident: a stale cache invalidation procedure
A team maintains a microservice with a multi-layer caching strategy. The documented runbook specifies a three-step cache invalidation sequence when deploying configuration changes. The procedure was written during the initial design of the system.
Over eight months, the team quietly stopped following step two of the sequence in most cases, because they had learned through experience that it was only necessary when a specific upstream service was also being redeployed simultaneously. Executing it unnecessarily added four minutes to every deployment and occasionally caused transient errors in an unrelated system.
A new engineer joins, follows the runbook exactly, and the deployment succeeds, though it takes longer and produces the transient errors. They file a ticket asking why the procedure causes errors.
This incident reveals several WAI/WAD dynamics:
The gap was adaptive, not negligent. The experienced engineers developed a more accurate mental model of when step two was actually required. Their deviation from the runbook was correct for their operational context.
The gap was invisible to the organization. No one had updated the runbook, and the knowledge about when to skip step two was entirely tacit. It transferred through informal apprenticeship but not through any formal channel.
The gap became a failure surface. When a new engineer arrived without the tacit context, the procedure produced the exact transient errors that the team's adaptation had been avoiding.
The correct response is not stricter compliance. Telling the team to always follow step two would bring practice back in line with the runbook at the cost of operational efficiency, and it would still not surface the underlying knowledge: when is step two actually necessary?
The correct response is a knowledge excavation. A post-incident review designed to surface WAD would ask the experienced engineers to make explicit what they had been doing and why. That knowledge should then be used to rewrite the runbook to reflect actual operational conditions — including decision logic for when to apply step two.
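One way the rewritten runbook could encode that excavated decision logic is as an explicit conditional rather than a fixed sequence. This is a sketch under stated assumptions: the step names and the `upstream_redeployed` flag are invented for illustration, not part of the incident as described.

```python
# Hypothetical sketch of the rewritten invalidation sequence.
# Step names and the upstream_redeployed flag are invented for illustration.

def cache_invalidation_steps(upstream_redeployed: bool) -> list[str]:
    """Return the invalidation steps for this deployment.

    Step two is only required when the upstream service is being
    redeployed at the same time; running it otherwise adds ~4 minutes
    and can cause transient errors in an unrelated system.
    """
    steps = ["step_1_flush_local_cache"]
    if upstream_redeployed:
        steps.append("step_2_invalidate_upstream_entries")
    steps.append("step_3_warm_edge_cache")
    return steps
```

The point of the sketch is that the tacit condition ("only when the upstream service redeploys") now lives in the procedure itself, where a new engineer can see it.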
Well-intended shortcuts and deficient workplace practices are routinely not detected during audits, creating an increasing gap between work-as-imagined and work-as-actually-done. If you do not have a practice of regularly comparing WAD to WAI — through observational methods, post-incident reviews, or pre-mortems — your runbooks are probably drifting.
Common Misconceptions
"The gap between procedures and practice means the procedures need to be more detailed."
This is the most common managerial response, and it tends to make things worse. Analysis shows that field expert involvement in procedure development is critical to limiting the gap, as procedures written without operational input often lack practicality. The answer is not more procedures — it is better-informed procedures, and crucially, the organizational acknowledgment that procedures will always be incomplete. Safety relies on the adaptive capacity of operators to bridge that incompleteness.
"If a human made a mistake, we need to train them or remove them."
This is the Safety-I blame reflex. The new view argues that focusing investigation on individual fault assignment systematically misses the systemic conditions that made the error likely. The same conditions that allowed one person to make a particular error will allow someone else to make it later. Replacing or retraining the individual without addressing the system conditions does not improve safety — it merely reassigns the risk.
"Variability in how people do their work is a problem to solve."
While Safety-I seeks to minimize human variability through procedures and controls, Safety-II recognizes that humans bring adaptive capacity to complex sociotechnical systems. Eliminating variability entirely is neither achievable nor desirable. The system depends on human judgment to navigate conditions that procedures do not cover. The goal is to understand variability — distinguish adaptive adjustments from dangerous deviations — not to eliminate it.
"Frontline workers are the source of risk; management is the source of safety."
Safety Differently explicitly inverts this assumption: frontline workers are the source of insight and wisdom about safety, not the source of risk. Management creates the conditions for safe work — resource availability, workload constraints, system design — but the knowledge of whether those conditions are adequate resides at the frontline. Organizations that treat frontline workers primarily as a compliance surface lose access to the most operationally relevant safety information they have.
Active Exercise
Mapping the gap in your own team
This exercise surfaces WAD in your engineering organization using structured inquiry. It takes approximately 45–60 minutes and works best in a team of 3–6 engineers who operate the same system.
Step 1: Select a procedure (10 minutes)
Choose a runbook, deployment procedure, or incident response checklist that your team uses regularly. It should be one that people actually follow — not a theoretical document.
Step 2: Document work-as-imagined (5 minutes)
Have one person read through the procedure step by step and summarize what it prescribes. Do not skip or interpret — just describe what is written.
Step 3: Document work-as-done (15 minutes)
Ask the most experienced operators: "Walk me through what you actually do when you execute this procedure." Capture deviations from the documented steps, including:
- Steps that are skipped under certain conditions
- Steps where the actual action differs from the written instruction
- Informal checks or validations not in the document
- Decision points where experience changes the path
Step 4: Classify each gap (15 minutes)
For each deviation identified, determine:
- Is this a consistent adaptation (everyone deviates in the same way) or individual variation?
- Does the adaptation reflect knowledge that was never written down?
- Could the original procedure produce harm if followed exactly in some conditions?
- Could the adaptation produce harm if applied inappropriately (e.g., by a new engineer)?
Step 5: Decide on organizational action (10 minutes)
For each significant gap, decide:
- Update the procedure to reflect actual practice
- Document decision logic that explains when to deviate and why
- Flag knowledge that requires formal training or mentorship (tacit knowledge that cannot easily be encoded)
As you classify gaps, ask whether each variance is being handled as close to its origin as possible. If your procedure requires escalating a decision that experienced frontline engineers routinely handle independently, that is a design problem — not a compliance success. The variance control principle suggests that empowering operators to handle local variances at their level is structurally safer than requiring escalation.
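The output of Steps 3 through 5 is easier to act on if each gap is captured as a structured record rather than workshop notes. A minimal sketch follows; the field names and action labels are a suggestion, not a standard.

```python
# A minimal record for each WAI/WAD gap found in the exercise.
# Field names and action labels are a suggestion, not a standard.
from dataclasses import dataclass

@dataclass
class GapRecord:
    procedure: str          # which runbook or checklist
    step: str               # the prescribed step
    deviation: str          # what operators actually do
    consistent: bool        # does everyone deviate the same way?
    tacit_knowledge: bool   # is the reason for the deviation unwritten?
    harm_if_followed: bool  # could the written step cause harm as-is?
    harm_if_deviated: bool  # could the deviation cause harm if misapplied?
    action: str  # "update_procedure" | "document_decision_logic" | "train"

# Example entry, echoing the worked example above.
gaps = [
    GapRecord("deploy-runbook", "step 2",
              "skipped unless upstream redeploys simultaneously",
              consistent=True, tacit_knowledge=True,
              harm_if_followed=True, harm_if_deviated=True,
              action="document_decision_logic"),
]
```

A record where both harm fields are true is the highest-priority kind of gap: the written procedure and the undocumented adaptation are each dangerous on their own.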
Key Takeaways
- The WAI/WAD gap is structurally inevitable. Procedures cannot anticipate all combinations of conditions, competing demands, and resource limitations in complex systems. The gap is not a compliance failure — it is a consequence of complexity.
- Human variability is a resource, not a liability. Operators adapt to conditions that procedures do not cover. This adaptive capacity is what keeps complex systems functioning. The goal is to understand and amplify it, not eliminate it.
- The new view replaces blame with curiosity. When something goes wrong, the analytically productive question is not who made the mistake but what conditions made this action seem reasonable. Fault assignment does not improve systems — understanding the conditions that shaped decisions does.
- Frontline workers are the primary source of operational knowledge. The knowledge required to safely operate a complex system is distributed throughout the organization, concentrated at the frontline. Surfacing and respecting that knowledge requires deliberate organizational practices — not just post-incident reviews, but ongoing engagement that treats operators as knowledgeable professionals.
- Work design should enable variance control at the source. Procedures and organizational structures that require everything to be escalated upward degrade response time and remove operational knowledge from the people best positioned to act on it. Teams with meaningful task ownership and local decision authority handle variances faster and more accurately.
Further Exploration
Foundational Research
- From Safety-I to Safety-II: A White Paper — Erik Hollnagel; foundational document distinguishing WAI from WAD
- Safety II and Resilience Engineering in a Nutshell — ScienceDirect condensed overview
- The 'new view' of human error — Recent examination of theoretical history and current state
Practitioner Perspectives
- The varieties of human work — Safety Differently practitioner-oriented piece
- Safety II professionals: How resilience engineering can transform safety practice — Implications for safety professionals and frontline worker engagement
Sociotechnical Systems Design
- The Principles of Sociotechnical Design — Albert Cherns, 1976; variance control and whole task principle
- Using ethnographic methodology in the study of complex sociotechnical systems — Research methods for surfacing work-as-done