Incident Severity and On-Call Design

How to classify incidents, wire alerts to SLOs, and build a rotation your team can sustain

Learning Objectives

By the end of this module you will be able to:

Define five severity tiers for incidents and explain the response expectations for each.
Configure SLO-based burn-rate alerting as the trigger for on-call pages.
Design a primary/secondary on-call structure appropriate for a team's size.
Identify the failure modes of poorly designed escalation policies, including severity gaming.
Explain how alert fatigue forms and describe three practices to reduce it.
Articulate your responsibility as an EM for on-call sustainability as a retention and wellbeing issue.

Core Concepts

The Five-Tier Severity System

The software industry has broadly converged on a five-level severity classification — SEV1 through SEV5, where lower numbers indicate higher severity. This structure is now the de facto standard across major incident-management platforms (PagerDuty, Atlassian, incident.io, FireHydrant, Splunk) and in SRE literature. Some organizations use three or six tiers, but the five-level ordinal shape dominates.

What has not converged is the precise definition of each tier. "SEV1" at a critical infrastructure provider looks nothing like "SEV1" at a ten-person startup. The convergence is on the shape, not the thresholds. Your job is to define what each level means in your context, document it, and keep it stable enough that engineers don't have to improvise during an incident.

What to calibrate per tier

When writing your severity definitions, anchor each tier to three things: impact scope (who is affected and how), response time expectation, and response infrastructure activated (e.g., does this wake someone up?). Without all three, the tier is an empty label.

Tier Definitions and Response Expectations

Each tier carries a binding response contract. The binding is what makes classification meaningful — without pairing a tier to a specific response protocol, severity labels drive no organizational behavior.

SEV1 represents the highest severity. SEV1 incidents trigger immediate 24/7 paging, with response time mandated at 15 minutes or less. This is the only tier where the response infrastructure activates at any hour, no exceptions.

SEV2 activates the same response structure as SEV1 — incident commander, war room, executive communication, mandatory postmortem — but with a relaxed timeline: paging within 30 minutes rather than immediately, and business-hours war-room formation rather than 24/7 activation. SEV1 and SEV2 together define the "major incident" threshold.

SEV3 marks the transition from synchronous to asynchronous handling. SEV3 incidents skip immediate paging and resolve through ticketing systems without waking on-call engineers. No war room. No postmortem required. Engineers address the issue during working hours.

SEV4 and SEV5 follow the same asynchronous pattern with progressively lower urgency — cosmetic issues, minor degradations, and informational signals. These rarely require explicit response SLAs.
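One way to keep these response contracts unambiguous is to write them down as data rather than prose. The following is a minimal sketch, not a platform schema: the `SeverityTier` fields and the `is_major_incident` helper are hypothetical names, and the SEV3 through SEV5 entries assume no explicit response SLA, which your own definitions may tighten.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeverityTier:
    name: str
    pages_on_call: bool               # does this tier page a human?
    response_minutes: Optional[int]   # None = handled asynchronously via tickets
    around_the_clock: bool            # True only if response activates at any hour
    postmortem_required: bool


# Values mirror the tier descriptions above; SEV3-SEV5 details are assumptions.
TIERS = {
    "SEV1": SeverityTier("SEV1", pages_on_call=True, response_minutes=15,
                         around_the_clock=True, postmortem_required=True),
    "SEV2": SeverityTier("SEV2", pages_on_call=True, response_minutes=30,
                         around_the_clock=False, postmortem_required=True),
    "SEV3": SeverityTier("SEV3", pages_on_call=False, response_minutes=None,
                         around_the_clock=False, postmortem_required=False),
    "SEV4": SeverityTier("SEV4", pages_on_call=False, response_minutes=None,
                         around_the_clock=False, postmortem_required=False),
    "SEV5": SeverityTier("SEV5", pages_on_call=False, response_minutes=None,
                         around_the_clock=False, postmortem_required=False),
}


def is_major_incident(tier_name: str) -> bool:
    """SEV1 and SEV2 together form the major-incident threshold."""
    return tier_name in ("SEV1", "SEV2")
```

Keeping the contract in one structure makes gaps visible: any field you cannot fill in for a tier is a definition you have not yet written.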

Fig 1: Severity tiers and their response contracts

The Major Incident Threshold

SEV1 and SEV2 constitute "major incidents" that trigger organization-level response infrastructure. SEV3 and below remain within normal team capacity. This threshold is consequential: incidents above it consume executive attention and war-room coordination — scarce, expensive resources that you do not want invoked for a degraded feature affecting 3% of users.

As EM, you will be asked to make or ratify this call during active incidents. The question to ask is not "is this bad?" but "does this require cross-functional, executive-visible coordination right now?"

The Incident Commander Role

For major incidents, Google SRE and similar organizations use an Incident Command System (ICS) structure, with designated roles: Incident Commander (IC), Communications Lead, and Operations Lead. These roles activate for SEV1 and SEV2 only.

The IC role is meaningless without actual authority. An IC who has to seek approval before assembling a war room or making a remediation call is not functioning as an IC — they are a coordinator with no power. When defining your escalation policy, be explicit about what the IC is authorized to do without additional sign-off.

Burn-Rate Alerting as the Trigger for Pages

The connection between your SLOs (from module 03) and on-call pages runs through burn-rate alerting. Error budget burn rate provides a mathematical, deterministic basis for severity assignment that removes subjective judgment from incident triage.

The calculation: burn rate = observed error rate ÷ the error rate your SLO allows (1 minus the SLO target). A burn rate of 1.0 means you are on pace to exhaust your error budget exactly at the end of the SLO period (here, the month). Anything above 1.0 means you will exhaust it early if the rate holds. Rather than asking "how bad is this?", you ask: at this burn rate, what percentage of the monthly error budget is consumed in 1 hour? If the answer exceeds your SEV1 threshold, a page fires: automatically, consistently, without debate.
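The arithmetic can be sketched in a few lines. This is an illustration only: it assumes a 30-day SLO period and a 99.9% target, and the function names `burn_rate` and `budget_consumed` are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    allowed_error_ratio = 1.0 - slo_target
    return error_ratio / allowed_error_ratio


def budget_consumed(rate: float, window_hours: float,
                    period_hours: float = 30 * 24) -> float:
    """Fraction of the period's error budget used if `rate` holds for the window.

    The 30-day default period is an assumption; use your own SLO period.
    """
    return rate * window_hours / period_hours


# A 99.9% SLO allows 0.1% errors. Observing 1.44% errors burns 14.4x too fast.
rate = burn_rate(error_ratio=0.0144, slo_target=0.999)
print(round(rate, 1))                      # 14.4
print(round(budget_consumed(rate, 1), 3))  # 0.02, i.e. 2% of the budget in one hour
```

At a burn rate of 1.0, the same function returns 1/720 per hour, which is exactly the pace that spends the full budget over 30 days.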

This approach also eliminates a class of on-call fatigue: pages that fire because someone hard-coded an absolute threshold that became stale. Burn-rate thresholds stay relevant as long as your SLO is correctly defined.

The prerequisite: monitoring quality

Burn-rate-driven severity is entirely dependent on accurate, consistent observability. If your SLIs do not reflect actual user experience, or if monitoring has gaps, the severity output will be misleading. Fix the monitoring before relying on automated triage.

Impact Assessment as Structured Judgment

For incidents that fall into gray zones — or where automated triage hasn't been configured yet — impact assessment is the structured methodology for quantifying consequences. It connects observability data, SLO governance, dependency knowledge, and business context into a repeatable classification decision.

A useful shorthand for impact assessment: scope (how many users or systems?), depth (total outage or degraded?), rate (is it getting worse or stabilizing?), and business context (is this peak traffic? is a launch in progress?). Documenting this reasoning at incident open produces better postmortems and better reclassification after the fact.
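If you want that reasoning captured consistently at incident open, a small record type is enough. The sketch below is one possible shape, not a standard: `ImpactAssessment`, `Depth`, and `Trend` are hypothetical names, and the field values are free text where the module does not prescribe a scale.

```python
from dataclasses import dataclass
from enum import Enum


class Depth(Enum):
    DEGRADED = "degraded"
    TOTAL_OUTAGE = "total outage"


class Trend(Enum):
    WORSENING = "worsening"
    STABLE = "stable"
    RECOVERING = "recovering"


@dataclass
class ImpactAssessment:
    scope: str             # who or what is affected, with a number where possible
    depth: Depth           # total outage or degraded?
    trend: Trend           # is it getting worse or stabilizing?
    business_context: str  # peak traffic, launch in progress, and so on

    def summary(self) -> str:
        """One line suitable for pasting into the incident channel or ticket."""
        return (f"scope={self.scope}; depth={self.depth.value}; "
                f"rate={self.trend.value}; context={self.business_context}")
```

For example, `ImpactAssessment("3% of users", Depth.DEGRADED, Trend.STABLE, "launch in progress").summary()` gives the postmortem author the classification reasoning as it stood when the incident was opened.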

On-Call Structure: Primary and Secondary

Industry best practice is to schedule primary and secondary on-call responders simultaneously. The primary handles initial triage with a 5-minute response SLA. The secondary provides backup with a 15-minute response window and activates automatically if the primary does not acknowledge.

The secondary role is not just a fallback. For major incidents, the secondary provides domain expertise and a second set of eyes, reducing the isolation that makes high-severity response disproportionately stressful.

Escalation Policy Mechanics

Escalation policies are the operational bridge between tier classification and human notification. Without them, a severity tier is a label that does nothing.

For SEV1, a canonical escalation sequence looks like: page the primary on-call immediately → if no acknowledgment in 5 minutes, page the secondary → if no acknowledgment in 10 minutes, page the service owner or senior engineer. For SEV3, the policy might be: create a ticket, notify the next business day.
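The SEV1 sequence above can be expressed as an ordered list of steps. This is a sketch under one stated assumption: both the 5-minute and the 10-minute marks are measured from the initial page. `EscalationStep` and `paged_so_far` are hypothetical names, not any paging platform's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EscalationStep:
    target: str         # role to page
    after_minutes: int  # minutes since the initial page with no acknowledgment


# Assumption: all delays are measured from the initial page, not from the prior step.
SEV1_POLICY = [
    EscalationStep("primary on-call", 0),
    EscalationStep("secondary on-call", 5),
    EscalationStep("service owner", 10),
]


def paged_so_far(policy: List[EscalationStep], minutes_unacknowledged: int) -> List[str]:
    """Return every role that should have been paged by this point."""
    return [step.target for step in policy
            if step.after_minutes <= minutes_unacknowledged]


print(paged_so_far(SEV1_POLICY, 7))  # ['primary on-call', 'secondary on-call']
```

Writing the policy this way forces the question of what each delay is relative to, which is exactly the kind of ambiguity that surfaces at 2 AM otherwise.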

Escalation policies implement two distinct mechanisms:

Hierarchical escalation: passes the incident up the seniority chain (junior → senior → engineering lead).
Functional escalation: routes the incident to a domain expert regardless of seniority (application engineer → database specialist → platform team).

Most real-world policies combine both. The choice of which mechanism to use at each escalation step is a design decision, not a default.

Each escalation level should have explicit decision authority — what this person is authorized to do, without needing secondary approval. Ambiguous authority forces engineers to either delay response seeking approval, or act without clear backing — both outcomes increase stress and slow resolution.

Minimum Team Size for Sustainable On-Call

Sustainable on-call for a single-site, 24/7 service requires a minimum of eight engineers, assuming week-long shifts, simultaneous primary and secondary coverage, and each engineer on call (as primary or secondary) one week per month. For multi-site configurations, the minimum drops to approximately six per site, because the sites split the 24-hour day between them.

These are floors, not targets. A team of eight with high alert volume and complex services will not be sustainable at eight. The minimum is a structural prerequisite, not a guarantee.

If your team is smaller than eight and carrying 24/7 on-call, you have a structural problem that rotation scheduling cannot solve. The right conversations to have: sharing a rotation with an adjacent team, reducing scope of 24/7 coverage, or hiring.
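The eight-engineer floor follows from simple arithmetic, sketched below with hypothetical function names. It treats "once per month" as one week in every four, which is an approximation.

```python
def min_rotation_size(roles_per_shift: int, weeks_between_shifts: int) -> int:
    """Engineers needed so each person is on call one week in every N weeks."""
    return roles_per_shift * weeks_between_shifts


def on_call_fraction(team_size: int, roles_per_shift: int = 2) -> float:
    """Share of weeks each engineer spends on call with an even rotation."""
    return roles_per_shift / team_size


print(min_rotation_size(roles_per_shift=2, weeks_between_shifts=4))  # 8
print(on_call_fraction(8))  # 0.25
print(on_call_fraction(5))  # 0.4
```

A five-person team covering primary and secondary spends 40% of its weeks on call, which is the structural problem no scheduling tweak removes.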

Step-by-Step Procedure

Designing Your Team's Severity and On-Call System

Step 1 — Anchor each tier to response expectations. Write your five tiers. For each, specify: (a) impact criteria, (b) response time window, (c) response infrastructure activated (page, ticket, backlog), and (d) whether a postmortem is required. Do not leave any of these fields vague. "Significant impact" is not an impact criterion — "more than 5% of active users unable to complete checkout" is.

Step 2 — Review your SLOs and define burn-rate thresholds. For each SLI/SLO pair you defined in module 03, calculate what burn rate corresponds to each severity tier. A common starting point: burn rate > 14.4 over 1 hour → SEV1; burn rate > 6 over 6 hours → SEV2. Configure these as alerts in your monitoring stack, wired to your escalation policy. The thresholds are adjustable — start somewhere, then tune.
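To see where those starting thresholds come from, assume a 30-day (720-hour) SLO period: a burn rate of 14.4 sustained for 1 hour consumes 14.4 / 720 = 2% of the budget, and a burn rate of 6 sustained for 6 hours consumes 36 / 720 = 5%. The mapping from measured burn rates to a tier can then be sketched as follows. `THRESHOLDS` and `classify` are hypothetical names, and a real alerting rule would live in your monitoring stack rather than in application code.

```python
from typing import Mapping, Optional

# (tier, burn-rate threshold, window in hours), ordered most severe first.
# These are the starting points from Step 2, meant to be tuned.
THRESHOLDS = [
    ("SEV1", 14.4, 1),
    ("SEV2", 6.0, 6),
]


def classify(burn_rates_by_window: Mapping[int, float]) -> Optional[str]:
    """Return the most severe tier whose threshold is exceeded, else None.

    `burn_rates_by_window` maps a window length in hours to the burn rate
    measured over that window.
    """
    for tier, threshold, window_hours in THRESHOLDS:
        rate = burn_rates_by_window.get(window_hours)
        if rate is not None and rate > threshold:
            return tier
    return None


print(classify({1: 20.0, 6: 8.0}))  # SEV1
print(classify({1: 10.0, 6: 7.0}))  # SEV2
print(classify({1: 2.0, 6: 1.0}))   # None
```

Checking tiers in severity order means a fast, severe burn is never downgraded just because a slower window also crossed its threshold.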

Step 3 — Build the escalation policy. For each severity tier, write out the escalation sequence: who gets paged, in what order, after how many minutes of non-acknowledgment. Specify whether each step is hierarchical or functional. Publish this as a reference document your entire team can read before they are on-call.

At each level of the policy, note what the person at that level is authorized to decide without escalating further. An on-call engineer who does not know whether they can restart a service at 2 AM without manager approval will hesitate. Remove the hesitation.

Step 4 — Schedule the rotation. Verify you have enough people in the rotation (minimum 8 for single-site 24/7). Assign primary and secondary roles for each shift. Configure your alerting platform to auto-escalate to secondary after a defined acknowledgment window. Publish the schedule with enough lead time for people to plan around their on-call weeks.
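A weekly primary/secondary schedule can be generated with a simple round-robin. The sketch below is one possible design, with hypothetical names throughout: it offsets the secondary by half the roster so that, with eight engineers, each person's primary and secondary weeks fall four weeks apart instead of back to back.

```python
from datetime import date, timedelta
from typing import Dict, List


def build_schedule(engineers: List[str], start: date, weeks: int) -> List[Dict]:
    """Assign a primary and a secondary for each week-long shift."""
    n = len(engineers)
    if n < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    offset = n // 2  # between 1 and n-1, so the secondary never equals the primary
    schedule = []
    for week in range(weeks):
        schedule.append({
            "week_start": start + timedelta(weeks=week),
            "primary": engineers[week % n],
            "secondary": engineers[(week + offset) % n],
        })
    return schedule


# Placeholder roster; replace with real names from your team.
roster = ["A", "B", "C", "D", "E", "F", "G", "H"]
for shift in build_schedule(roster, date(2025, 1, 6), 4):
    print(shift["week_start"], shift["primary"], shift["secondary"])
```

Generating the schedule several weeks ahead is what gives people the lead time to plan around their on-call weeks; swaps can then be handled as explicit edits to the published output.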

Step 5 — Run a rotation review weekly. After each on-call rotation, hold a 30-minute handoff meeting. Cover: active incidents and their state, upcoming risky deployments or changes, and any alerts that fired but should not have (candidates for tuning or deletion). The handoff is also where you catch alert fatigue building before it reaches a breaking point.

Step 6 — Track workload trends. Monitor incident frequency, severity distribution, MTTR, and per-engineer on-call load over time. These trends are leading indicators of rotation sustainability. A sustained increase in alert volume or an unequal distribution of high-severity pages across the rotation is a signal to act — not to wait until someone burns out or leaves.
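Two of these trends, per-engineer page load and MTTR, can be computed from a plain export of incident records. The sketch below uses made-up sample data and hypothetical function names; the record shape is an assumption, not the export format of any particular tool.

```python
from collections import Counter
from statistics import mean

# Hypothetical incident records: (responder, severity, minutes_to_resolve)
incidents = [
    ("ana", "SEV1", 90),
    ("ana", "SEV2", 45),
    ("ben", "SEV3", 30),
    ("ana", "SEV2", 75),
]


def pages_per_engineer(records, paging_tiers=("SEV1", "SEV2")):
    """Count paging-tier incidents handled by each responder."""
    return Counter(responder for responder, severity, _ in records
                   if severity in paging_tiers)


def mttr_minutes(records):
    """Mean time to resolve across all records, in minutes."""
    return mean(minutes for _, _, minutes in records)


load = pages_per_engineer(incidents)
print(load.most_common())       # responders ordered by paging load
print(mttr_minutes(incidents))  # average resolution time for the sample
```

In this sample one responder carries every paging-tier incident, which is the kind of uneven distribution worth acting on before it shows up as attrition.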

Common Misconceptions

"Severity classification is a judgment call, so we should leave it to the responder." Leaving classification entirely to real-time judgment produces inconsistency and gaming. The responder's judgment should operate within defined criteria, not replace them. The criteria exist precisely so that a fatigued engineer at 2 AM does not have to reason from first principles about whether this is a SEV1.

"If we page less often, on-call will be less stressful." Alert volume is one factor, but not the only one. The anxiety of carrying a pager — even with no active incidents — disrupts sleep and creates persistent psychological stress. Rotation design (shift length, frequency, time of day) and clarity of escalation authority matter as much as alert volume. Reducing alert noise is necessary but not sufficient.

"The secondary is just a backup in case the primary is sick." The secondary role has a broader function: providing support and domain expertise during major incidents, preventing the isolation that makes high-severity response disproportionately taxing. Under-resourcing secondary coverage (or treating it as optional) degrades incident quality and concentrates stress on the primary.

"If a team keeps filing SEV2s for a database issue, they're over-escalating." Not necessarily. Teams strategically inflate severity classifications to bypass slow or unresponsive cross-team processes. When the only reliable path to get the database team's attention is a SEV2, the problem is a broken prioritization structure, not a classification error. Treat repeated severity gaming as a signal about broken inter-team SLAs, not a training problem.

"Once you set up burn-rate alerting, you're done." Burn-rate alerting is only as good as its monitoring foundation. If SLIs are misaligned with actual user experience, or if observability has gaps, the severity assignments will mislead. Alerting setup is not a one-time configuration event; it requires regular review as service behavior and team context change.

Boundary Conditions

Small teams below the minimum rotation size. The eight-engineer minimum for single-site 24/7 on-call is a structural floor. Teams below this cannot distribute the load without imposing unsustainable rotation frequency on individuals. Severity tiers and escalation policies do not fix an under-staffed rotation — they just clarify the problem. If you are in this position, the design decision is whether to share a rotation, reduce 24/7 coverage scope, or explicitly accept the risk.

Services with no meaningful SLOs. Burn-rate alerting requires calibrated SLOs to function. If your SLOs are aspirational placeholders rather than measured against real traffic, the burn-rate numbers will be noise. The SLO work from module 03 is a genuine prerequisite, not a checkbox.

Severity definitions that cross team boundaries. Your team's tier definitions exist within a larger organizational framework. If your company has a standard severity taxonomy (common in platform-oriented engineering orgs), your definitions must be compatible with it — particularly the major-incident threshold. When in doubt, anchor your SEV1/SEV2 definitions to whatever triggers the company-wide incident response process.

Follow-the-sun vs. single-site rotations. The minimum team size and rotation design assume a single-site, weekly-shift model. If your team is distributed across enough time zones to support follow-the-sun coverage, the structural calculus changes: per-site minimums drop to approximately six, and night-shift coverage disappears as a source of sleep disruption. However, follow-the-sun introduces handoff complexity and requires tighter documentation of incident state at shift boundaries.

High-frequency, low-severity alert environments. Teams that receive many SEV4/SEV5 alerts risk treating all alerts as low-signal noise, which then delays response when a genuine SEV1 fires. Teams receiving more than 40 alerts per shift experience 3x higher MTTR compared to teams receiving fewer than 10. Excessive alert volume is a signal about instrumentation and alerting rule quality, not a rotation capacity problem. Resolve it at the source.

Key Takeaways

The five-tier framework is the industry shape; your definitions fill it. The SEV1-SEV5 structure is converged. What varies by organization is what each tier means in your context — who is affected, what response fires, whether a postmortem is required. Write these down. Vague definitions produce inconsistent behavior during incidents.
Severity tiers only work when bound to response protocols. A tier without a binding escalation policy is an empty label. The escalation policy — who gets paged, in what order, after how long — is what makes classification actionable. Every tier needs an explicit escalation sequence.
Burn-rate alerting makes severity deterministic. Connecting your SLOs to on-call pages through burn-rate thresholds replaces ad-hoc human judgment with a mathematical, repeatable classification. It requires well-calibrated SLOs and clean monitoring, but once in place, it eliminates a major source of on-call inconsistency.
Eight engineers is the structural floor for single-site 24/7 on-call. Primary/secondary coverage, sustainable rotation frequency, and adequate shift spacing require a minimum headcount. Below this, rotation design changes cannot solve the structural deficit.
On-call sustainability is your responsibility as EM. Incident load trends, alert distribution, and rotation frequency are leading indicators of whether your team's on-call is sustainable. Tracking them is part of your operational management job, not a wellness initiative. Severity gaming, escalation ambiguity, and alert fatigue are each solvable problems — but only if you are watching for them.

Further Exploration

Severity Levels — PagerDuty Incident Response Documentation — Canonical reference for tier definitions and escalation mechanics from the PagerDuty SRE practice.
Alerting on SLOs — Google SRE Workbook — The original treatment of burn-rate alerting, with worked numeric examples for threshold configuration.
Being On-Call — Google SRE Book — Covers minimum team size, rotation design principles, and the human cost of on-call load.
Escalation Policies for Effective Incident Management — Atlassian — Practical guide to hierarchical and functional escalation, with examples.
On-Call Rotations — Datadog Engineering Blog — How Datadog structures primary/secondary rotations and weekly handoffs in practice.
Improving On-Call: A Manager's Guide — Atlassian — EM-focused framing of on-call health, covering alert tuning, rotation review, and team wellbeing.