Running a High-Reliability Team

Integrating knowledge transfer, maturity models, and sustainable on-call into a coherent operating model — and contributing it back to the organization.

Learning Objectives

By the end of this module you will be able to:

Build a runbook structure that reduces on-call cognitive load and supports effective onboarding.
Apply knowledge transfer practices that prevent operational knowledge from becoming a silo.
Use a maturity model to assess your team's current state and define the next improvement priority.
Design a sustainable on-call rotation that includes quarterly testing of runbooks and escalation paths.
Describe how to contribute to org-wide operational standards from the team level.
Synthesize SLOs, incident management, kaizen, and safety culture into a coherent team operating model.

Core Concepts

Knowledge as Infrastructure

Software teams often treat knowledge as a byproduct of work: something that accumulates naturally in people's heads and surfaces in Slack threads and PR comments. Research on distributed cognition suggests this framing is wrong.

Knowledge about a complex software system is fundamentally distributed across multiple substrates: team members, code artifacts, documentation, CI pipelines, communication channels, and organizational structures. This distribution is not a storage problem — it is the actual cognitive architecture of the team. The codebase itself functions as a cognitive artifact: architectural decisions encoded in past commits constrain how engineers reason about changes today.

The implication is direct: when a team lets this distributed knowledge erode — through turnover, undocumented decisions, or lack of deliberate transfer practices — it is not just losing tribal lore. It is degrading the cognitive infrastructure that makes collective work possible.

Knowledge silos take several forms: individual silos (one engineer carries something nobody else knows), departmental silos (teams fail to communicate effectively across boundaries), technological silos (information fragmented across incompatible tools), and cultural silos (norms that subtly discourage sharing). In on-call and incident response contexts, all four forms compound. The engineer who knows why the payment service behaves oddly under a specific load pattern is not always the one on call.

Tacit vs. Explicit Knowledge

The distinction that makes operational knowledge hard to preserve is the gap between tacit and explicit knowledge. Explicit knowledge is the kind you can write down: a runbook step, a deployment checklist, a link to the dashboard that shows the right signal. Tacit knowledge is everything else: the intuition built from having debugged the system fifty times, the mental model of why the architecture was built the way it was, the pattern-matching that happens when something "feels off."

Most software systems carry a substantial tacit knowledge debt. By one frequently quoted estimate, 74% of organizations lack formal methods to capture and retain technical knowledge, and the resulting annual losses are put at $31.5 billion across Fortune 500 companies. Turnover turns that debt into a crisis: knowledge walks out the door when engineers leave.

The practices that convert tacit to explicit knowledge are not passive. Pair programming, mentorship, and structured code reviews are active mechanisms — deliberate translation work. Runbooks fall into this category too: writing a good runbook for an incident type forces the engineer who knows the system to surface their tacit reasoning into explicit, transferable steps.

Knowledge Islands

Research on "knowledge islands" (areas of code whose knowledge is concentrated in one or two developers) shows that this concentration creates both bottlenecks during onboarding and single points of failure during incidents. A team that always escalates to the same three engineers at 3am has a knowledge island problem, not just an on-call load problem.
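One way to make knowledge islands visible is to measure how authorship concentrates per component. The sketch below is illustrative only and is not the method used in the cited research. It assumes commit counts per author have already been extracted from version-control history, and the function name, the sample data, and both thresholds (`max_owners`, `share_threshold`) are hypothetical choices.

```python
from typing import Dict, List, Tuple


def find_knowledge_islands(
    commits_by_component: Dict[str, Dict[str, int]],
    max_owners: int = 2,           # assumption: "one or two developers"
    share_threshold: float = 0.8,  # assumption: illustrative cutoff
) -> List[Tuple[str, List[str], float]]:
    """Flag components where the top `max_owners` authors hold at least
    `share_threshold` of all commits."""
    islands = []
    for component, counts in commits_by_component.items():
        total = sum(counts.values())
        if total == 0:
            continue  # no history, nothing to measure
        # Authors sorted by commit count, largest first (ties broken by name).
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        top = ranked[:max_owners]
        share = sum(n for _, n in top) / total
        if share >= share_threshold:
            islands.append((component, [a for a, _ in top], round(share, 2)))
    # Most concentrated components first.
    return sorted(islands, key=lambda row: -row[2])


# Hypothetical sample data: component -> {author: commit count}
history = {
    "payments": {"ana": 90, "raj": 6, "li": 4},
    "search": {"ana": 20, "raj": 25, "li": 30, "sam": 25},
}
islands = find_knowledge_islands(history)
print(islands)
```

Commit counts are only a proxy for who holds the knowledge. Escalation history or review participation could be fed through the same function to cross-check the result.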

Operational Maturity Models

Maturity models give teams a vocabulary for assessing where they are and choosing what to improve next. CMMI (Capability Maturity Model Integration) is the most formalized of these.

The fundamental operating principle of CMMI combines two things: process standardization (consistent, documented, repeatable practices applied across the organization) with empirical measurement (quantitative data about process performance that drives improvement decisions). These two elements reinforce each other — you cannot improve reliably what you do not measure, and measurement without standardization generates data that is hard to act on.

CMMI-DEV (as of v1.3) contains 22 process areas, each a cluster of related practices focused on a specific organizational domain such as project management, engineering practices, or support functions. Process areas define what the organization should achieve within a domain; they are not a recipe but a target state.

CMMI v3.0 introduced three new practice areas: Data Management, Data Quality, and Workforce Empowerment, with the two data practice areas grouped into a new "Managing Data" capability area. Workforce Empowerment is particularly relevant for engineering managers (EMs): it aligns the workforce to business objectives through training paths linked to strategy, proficiency criteria by role, peer coaching, and feedback connected to performance metrics. This is essentially what a good on-call investment program looks like when it is done deliberately.

CMMI v3.0 also explicitly integrates Agile and DevOps principles, shifting from purely process-centric framing to outcome-focused models. This means you can borrow CMMI's measurement techniques for monitoring process stability and pipeline efficiency while retaining autonomy in how improvements are implemented — provided intended outcomes are achieved. This is how CMMI becomes useful to a team running CI/CD rather than waterfall.

Separately, continuous improvement methodologies such as DMAIC (Define, Measure, Analyze, Improve, Control) and kaizen operate independently of organizational maturity level. You do not need to reach a certain maturity stage before running a kaizen. These are parallel tools: maturity models tell you your current capability profile; improvement cycles tell you how to move.

Organizational Safety Culture and HRO Principles

High-Reliability Organization (HRO) theory, developed by Weick, Sutcliffe, and Roberts, identifies five principles that distinguish organizations capable of operating without catastrophic failure:

Preoccupation with failure — actively hunting for weak signals before incidents materialize.
Reluctance to simplify interpretations — resisting the urge to explain anomalies away with the first plausible story.
Sensitivity to operations — maintaining accurate situational awareness of what is actually happening at the system level.
Commitment to resilience — being able to absorb shocks and recover, not just prevent them.
Deference to expertise — routing decision authority to the person with the most relevant knowledge, not the highest title.

These five principles are interdependent. Preoccupation with failure only works if the team resists simplified explanations; that in turn depends on accurate awareness of what is actually happening, and all three rest on the capacity to recover and on routing decisions to whoever has the relevant expertise. For an engineering team, these are not aspirational values; they are a design specification for how on-call, incident response, and postmortem processes should function.

SRE Engagement Models

When building or growing reliability practice on your team, the embedded SRE model — where SREs participate in the same roadmap and join on-call rotations alongside developers — is particularly effective for time-bounded reliability uplift and for teams new to SRE practices. Its specific strength is knowledge transfer: it develops reliability habits within the team and builds capability gradually before transitioning to a self-service or consulting model. Effectiveness depends on clear incentive alignment; without it, embedded SREs risk being treated as free development headcount.

Platform engineering complements but does not replace SRE. Platform teams reduce cognitive load on developers through shared tooling and self-service infrastructure. SRE enforces reliability standards. These are distinct functions: organizations that collapse them into one role end up with gaps that surface as delayed delivery and eventual restructuring.

Key Principles

1. Runbooks Are Cognitive Offload Devices

A runbook is not documentation for documentation's sake. Its primary function is to reduce the cognitive overload experienced by an on-call engineer at 2am who may not be the person who built the system. Effective runbooks surface the tacit reasoning of the expert into explicit, navigable steps. They should include decision points — not just a linear sequence — and should be written as if the reader is stressed, time-constrained, and possibly unfamiliar with the specific subsystem.
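To make the difference between a linear sequence and a runbook with decision points concrete, here is a minimal sketch that models a runbook as a small decision tree. It is one possible representation, not a prescribed format. The `Step` and `walk` names, the step ids, and the latency scenario are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Step:
    """One runbook node: an instruction plus, optionally, a question whose
    answer selects the next node. A node with no branches ends the path."""
    instruction: str
    question: Optional[str] = None
    branches: Dict[str, str] = field(default_factory=dict)  # answer -> step id


def walk(runbook: Dict[str, Step], start: str, answers: Dict[str, str]) -> List[str]:
    """Follow the runbook from `start`, using answers keyed by step id,
    and return the instructions visited in order."""
    path: List[str] = []
    seen = set()
    current = start
    while True:
        if current in seen:
            raise ValueError(f"runbook loops back to step {current!r}")
        seen.add(current)
        step = runbook[current]
        path.append(step.instruction)
        if not step.branches:
            return path  # terminal step reached
        current = step.branches[answers[current]]


# Hypothetical runbook for a high-latency alert.
runbook = {
    "check_dashboard": Step(
        "Open the latency dashboard for the affected service.",
        question="Is latency elevated on all hosts?",
        branches={"yes": "check_dependency", "no": "drain_host"},
    ),
    "check_dependency": Step(
        "Check the upstream dependency's status and escalate to its owner."
    ),
    "drain_host": Step("Drain the affected host and confirm latency recovers."),
}
path = walk(runbook, "check_dashboard", {"check_dashboard": "no"})
print(path)
```

Writing the question and its branches down explicitly is the point: it captures the expert's "if you see X, it is probably Y" reasoning that a flat checklist leaves out.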

2. Knowledge Transfer Requires Active Practices, Not Hope

New engineers in complex systems face significant cognitive overload when teams lack structured mechanisms for knowledge transfer. Documentation alone is insufficient; effective onboarding requires active guidance and situated learning within authentic project contexts. This means pairing new engineers with experienced ones on actual incidents — not synthetic training — and using onboarding as an opportunity to validate that runbooks and architectural documentation are actually usable by someone without prior context.

3. On-Call Load Is a Retention Variable

High alert volume is correlated with retention crises. Engineers exposed to excessive alert volume seek new opportunities, creating costly recruitment cycles that further degrade team knowledge. Alert hygiene is not an efficiency optimization — it is a worker-protection control. Reducing noise is a retention intervention.

On-call shifts are a sustained job demand that requires protected recovery time. Engineers who carry a full project load during and after on-call shifts accumulate exhaustion. Organizations that put this principle into practice lighten workload expectations in the week after weekend on-call duty and treat that as an operational necessity. Google SRE targets no more than 25% of SRE time spent on-call, with at least 50% reserved for engineering work.
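The two ratios above can be checked with simple arithmetic. The sketch below assumes a team already tracks hours by category over some period. The function name, field names, and sample figures are hypothetical; only the 25% and 50% defaults come from the text.

```python
def check_time_allocation(
    oncall_hours: float,
    engineering_hours: float,
    total_hours: float,
    oncall_cap: float = 0.25,         # from the text: at most 25% on-call
    engineering_floor: float = 0.50,  # from the text: at least 50% engineering
) -> dict:
    """Compare one engineer's (or team's) time split against the targets."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    oncall_share = oncall_hours / total_hours
    engineering_share = engineering_hours / total_hours
    return {
        "oncall_share": round(oncall_share, 2),
        "engineering_share": round(engineering_share, 2),
        "oncall_within_cap": oncall_share <= oncall_cap,
        "engineering_above_floor": engineering_share >= engineering_floor,
    }


# Hypothetical quarter: 480 working hours, 144 on-call, 216 on engineering work.
report = check_time_allocation(oncall_hours=144, engineering_hours=216, total_hours=480)
print(report)
```

In this made-up quarter both targets are missed (30% on-call, 45% engineering), which is the kind of signal that should prompt a look at alert volume or rotation size.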

4. Measurement Standardization Enables Cross-Team Comparison

Standardized, documented, consistently applied processes paired with quantitative measurement are what make improvement empirical rather than anecdotal. Without standardization, measurements are not comparable across teams or time periods. This is why org-wide standards for how incidents are categorized, how SLO burn is tracked, and how postmortem action items are recorded matter: they are the precondition for identifying which team practices are actually working.
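As an illustration of what a shared definition buys you, consider SLO burn. The text does not specify a formula, so the sketch below assumes the definition commonly used in SRE practice: the observed error ratio divided by the error budget (1 minus the SLO target). The function name and sample numbers are hypothetical.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Error-budget burn rate: observed error ratio divided by the budget
    (1 - slo_target). A value of 1.0 means the budget is being spent
    exactly on pace; above 1.0 means it will run out early."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    budget = 1 - slo_target
    return (failed / total) / budget


# Hypothetical window: 30 failed requests out of 10,000 against a 99.9% SLO.
rate = burn_rate(failed=30, total=10_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
```

If every team computes the number this way, a burn rate of 3x means the same thing on every dashboard, which is what makes cross-team comparison possible.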

5. Leadership Commitment Is a Prerequisite for Checklist and Standards Adoption

Studies of checklist adoption barriers, including work on the WHO Surgical Safety Checklist, show that incomplete adherence is the primary failure mode. Staff fail to complete the checklist not because the checklist is bad, but because leaders fail to publicly embrace the principle it represents and to adapt it to local routines. This is directly applicable to runbook adoption, incident review compliance, and postmortem action item completion. The EM's visible commitment to the practice is load-bearing.

6. Contribute Standards Upward, Not Just Comply Downward

The value of org-wide operational standards comes from network effects: teams learn from each other through shared formats, shared metrics, and shared postmortem findings. An EM who only consumes org standards misses the second half of the deal. Internal style guides, playbooks, and shared terminology systems achieve better consistency when maintained as living documents validated by actual usage — not handed down once and left to rot. Your team's hard-won runbook patterns, alert tuning heuristics, and escalation policy structures are inputs to the org's operational corpus.

Worked Example

Assessing and Improving a Team's Operational State

Scenario: An EM inherits a team of seven engineers. The team has on-call rotations, but two engineers have quietly disengaged from the rotation, citing "too many noisy alerts." There is a wiki with some runbooks, but they are outdated and the team has no consistent format. There are no formal SLOs. The team ships code frequently but incidents are addressed reactively.

Step 1: Locate the maturity baseline

Start with an assessment across the CMMI-relevant process areas that apply to operational work. For this team, the key domains are incident management, measurement, requirements (SLOs), and configuration management. The honest assessment puts the team near the bottom of the scale, roughly CMMI's Initial level: processes exist but are not standardized, measurement is absent, and documentation is inconsistent.

Do not treat this as a crisis. Continuous improvement methods like kaizen can be applied at any maturity level. The maturity assessment gives you a target, not a verdict.

Step 2: Identify the highest-leverage knowledge silo

Using the five HRO principles, map where the team has the highest concentration of tacit knowledge. Who gets escalated to? Which subsystems have the thinnest runbook coverage? These are the knowledge islands. The two engineers who disengaged from on-call are likely also knowledge islands, which compounds the risk: if they leave, the team loses on-call capacity and the knowledge they carry at the same time.

Step 3: Run a runbook audit

Take the five most frequently triggered incident types and evaluate each runbook against a simple rubric: Does it include a diagnosis decision tree, not just steps? Is it usable by someone who did not build the system? Has it been validated in an actual incident in the last six months?

Flag the ones that fail. Schedule focused sessions where the system expert walks through their actual reasoning during an incident, and write that tacit knowledge into the runbook as they go rather than letting it stay a lecture.
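The three-question rubric from this step can be applied mechanically once the answers are recorded. The sketch below is one possible encoding. The `RunbookAudit` record, the 183-day approximation of "six months", and the sample entries are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class RunbookAudit:
    incident_type: str
    has_decision_tree: bool
    usable_by_non_author: bool
    last_validated: Optional[date]  # None if never used in a real incident


def failing_checks(entry: RunbookAudit, today: date, max_age_days: int = 183) -> List[str]:
    """Return the rubric questions this runbook fails (empty list = passes)."""
    failures = []
    if not entry.has_decision_tree:
        failures.append("no diagnosis decision tree")
    if not entry.usable_by_non_author:
        failures.append("not usable by someone who did not build the system")
    stale = (
        entry.last_validated is None
        or today - entry.last_validated > timedelta(days=max_age_days)
    )
    if stale:
        failures.append("not validated in an incident in the last six months")
    return failures


# Hypothetical audit of two incident types.
today = date(2024, 6, 30)
audits = [
    RunbookAudit("queue backlog", True, True, date(2024, 3, 1)),
    RunbookAudit("disk full", False, True, None),
]
flagged = {}
for audit in audits:
    failures = failing_checks(audit, today)
    if failures:
        flagged[audit.incident_type] = failures
print(flagged)
```

The flagged list doubles as the agenda for the expert walkthrough sessions: each failure names what the session needs to produce.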

Step 4: Redesign the on-call rotation

Apply the recovery pattern: a minimum of eight engineers for sustainable single-site coverage (Google SRE guidance), a structured lighter workload in the week following weekend on-call, and escalation paths validated quarterly. With seven engineers, two of whom have disengaged, this team is below that minimum, so re-engaging those two and closing the remaining gap is part of the redesign. Build the rotation schedule so that the primary on-call engineer is never paired with a secondary who is still in recovery from the previous rotation.
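The scheduling constraint can be sketched as a simple round-robin. This is a minimal illustration under stated assumptions, not a production scheduler: shifts are one week long, each week has one primary and one secondary, anyone on call in either role is treated as "in recovery" the following week, and engineer names are unique. The function and parameter names are hypothetical.

```python
from typing import List, Tuple


def build_rotation(
    engineers: List[str],
    weeks: int,
    min_size: int = 8,  # single-site minimum cited in the text
) -> List[Tuple[str, str]]:
    """Return a (primary, secondary) pair per week, never scheduling anyone
    who was on call (in either role) the week before."""
    if len(engineers) < min_size:
        raise ValueError(
            f"{len(engineers)} engineers is below the minimum of {min_size}"
        )
    queue = list(engineers)
    recovering = set()
    schedule = []
    for _ in range(weeks):
        # Skip anyone still recovering from last week's shift.
        available = [e for e in queue if e not in recovering]
        primary, secondary = available[0], available[1]
        schedule.append((primary, secondary))
        # Move this week's pair to the back of the queue.
        queue = [e for e in queue if e not in (primary, secondary)]
        queue += [primary, secondary]
        recovering = {primary, secondary}
    return schedule


# Hypothetical eight-person team.
team = ["ana", "ben", "chi", "dev", "eli", "fay", "gus", "hal"]
schedule = build_rotation(team, weeks=5)
for week, (primary, secondary) in enumerate(schedule, start=1):
    print(f"week {week}: primary={primary}, secondary={secondary}")
```

With eight people and two roles per week, each engineer is on call one week in four and gets three weeks between shifts. A real schedule would also need to handle vacations, swaps, and fairness between primary and secondary duty.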

Step 5: Contribute upward

Once the runbook format has been validated internally, propose it as the default format for the organization. Share the escalation policy testing cadence with peer teams. Submit the most instructive postmortem from the last quarter to the org's shared learning channel with the reasoning made explicit, not just the timeline.

The team's operational knowledge is an organizational asset. Runbook patterns, escalation heuristics, and alert-tuning decisions that stay inside one team are value the rest of the organization never sees.

Active Exercise

Purpose: Apply maturity assessment and improvement prioritization to your own team's current state.

Part 1: Identify three knowledge islands

Write down three areas of your team's operational scope where, if a specific person left tonight, the team would be materially impaired in the next incident. For each one, assess: Is there a runbook? Has it been validated in the last six months? Could a new team member use it under pressure?

Part 2: Score your runbooks against the cognitive offload criterion

Take your most-used runbooks and ask: Does this runbook include decision points, or only linear steps? Would an engineer unfamiliar with this system be able to use it at 2am? Rate each as sufficient, needs work, or does not exist.

Part 3: Map your current state to a maturity dimension

Choose one of the following operational domains and assess where your team sits: incident categorization (standardized vs. ad hoc), alert hygiene (tuned and reviewed quarterly vs. accumulated), postmortem follow-through (tracked to completion vs. logged and forgotten), or on-call sustainability (structured recovery time vs. full load continuity).

Write one sentence stating where you are and one sentence stating the next specific thing you would need to change to move one step forward.

Part 4: Identify one contribution to org standards

What does your team know that other teams in your organization do not — either because you solved a hard problem or because you made a mistake worth sharing? Identify the format through which that knowledge could be transferred upward: a postmortem, a runbook template, an escalation policy proposal, a shared alert tuning heuristic.

Stretch Challenge

The five HRO principles (preoccupation with failure, reluctance to simplify, sensitivity to operations, commitment to resilience, deference to expertise) are described as interdependent.

Take a real incident from your team's recent history. Map it against all five principles: which ones were active, which were absent, and which would have changed the outcome if they had been stronger? Then describe one structural change to your team's operating model (not a training session) that would strengthen the weakest principle identified.

Key Takeaways

Knowledge is distributed cognitive infrastructure. It lives in code, documentation, tools, and people. Letting it erode through turnover or undocumented decisions degrades the team's collective reasoning capacity, not just its documentation.
Tacit knowledge requires active conversion. Pair programming, structured code review, and well-maintained runbooks are the mechanisms that translate what experts know into something transferable. This is deliberate work, not a byproduct.
Maturity models give you a baseline, not a verdict. CMMI process areas and DORA capability clusters tell you where you are. Kaizen and DMAIC tell you how to move. Neither replaces the other.
On-call load is a retention and health variable. Alert noise and recovery deficit are not efficiency problems — they are worker-protection problems. Structuring recovery time after on-call shifts is operational necessity, not generosity.
Your team's operational knowledge belongs to the organization. Runbook formats, escalation policy structures, and postmortem learnings that stay inside one team represent a cost to the organization. Contributing them upward completes the loop.

Further Exploration

Operational Models & SRE

Google SRE Book — Being On-Call — Google's operational model for on-call, including the 25% cap and engineering time ratios
Google SRE Workbook — Engagement Model — Embedded SRE, consulting SRE, and self-service models and when each applies

Maturity & Process Standards

CMMI Institute — Levels of Capability — Official CMMI level definitions for team-level assessments
ISACA — CMMI v3.0 Release Announcement — The three new practice areas in CMMI v3.0 including Workforce Empowerment

Safety Culture & Knowledge

Weick & Sutcliffe — Managing the Unexpected — Primary source for HRO theory and the five principles framework
Knowledge Islands: Visualizing Developers Knowledge Concentration — Empirical research on knowledge concentration patterns and onboarding bottlenecks
PMC — WHO Surgical Safety Checklist: Barriers to Universal Acceptance — Research on checklist adoption; findings transfer to runbook and postmortem follow-through