SLO Target Setting

From default targets to enforceable commitments — setting SLOs that actually change behavior

Learning Objectives

By the end of this module you will be able to:

  • Classify your services by criticality tier and map appropriate SLO target ranges to each tier.
  • Use a company-wide default SLO as a safe starting point and explain when and how to refine it.
  • Define an error budget policy that specifies concrete consequences at different burn levels.
  • Explain how error budget feature freezes work and how to frame them for a product manager.
  • Distinguish real SLO enforcement from SLO theater.
  • Describe the conditions under which a legitimate SLO exception should be granted.

Core Concepts

Service Criticality Tiers

Not every service deserves the same SLO target. Before you pick a number, decide which criticality tier the service belongs to. A practical three-tier structure covers most teams:

Tier | Description | Typical SLO target range
Critical path | Directly in the user-facing transaction flow. An outage is a product outage. | 99.9% – 99.99%
Supporting | Enhances or accelerates the experience, but degraded mode is tolerable for a period. | 99.5% – 99.9%
Internal tooling | Developer productivity, internal portals, CI pipelines. Users are internal. | 99% – 99.5%

The right tier drives the right target range. A team that sets 99.99% on an internal tooling service is creating maintenance obligations and incident pressure it does not need. A team that sets 99% on a critical payment path is under-promising to users while over-tolerating failure.
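
The tier-to-range mapping can be expressed as a small lookup that flags targets outside the range for a tier. This is a minimal sketch: the ranges come from the table above, while the identifiers (`TIER_RANGES`, `check_target`) are illustrative names, not part of any standard tool.

```python
# Ranges mirror the tier table; names are hypothetical.
TIER_RANGES = {
    "critical_path": (99.9, 99.99),
    "supporting": (99.5, 99.9),
    "internal_tooling": (99.0, 99.5),
}


def check_target(tier: str, target_pct: float) -> str:
    """Return 'ok', 'too_loose', or 'too_tight' for a proposed SLO target."""
    low, high = TIER_RANGES[tier]
    if target_pct < low:
        return "too_loose"
    if target_pct > high:
        return "too_tight"
    return "ok"


print(check_target("internal_tooling", 99.99))  # too_tight
print(check_target("critical_path", 99.0))      # too_loose
print(check_target("supporting", 99.5))         # ok
```

The first two calls correspond to the two mismatches described above: four nines on internal tooling, and two nines on a critical path.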

SRE teams can scale their operational responsibility to many services when those services share a unified SLO structure and aligned business goals. The tier structure is what makes that alignment possible across services of very different risk profiles.

Company-Wide SLO Defaults

Most teams lack months of clean SLI data when they first set targets. Deriving a precise target from noisy or incomplete historical data typically produces either an unrealistically tight SLO (which exhausts the budget immediately) or an artificially loose one (which commits the team to nothing).

Company-wide SLO defaults solve this cold-start problem. A default is not an aspirational number; it is a policy anchor that reflects the organization's baseline reliability expectation for each service tier. Your first SLO should come from the default unless you have specific evidence to override it. The default also ensures every team starts from a consistent foundation, which matters when SLO data feeds into cross-team reliability reviews.

SRE maturity frameworks describe how organizations progress from basic monitoring and manual incident response toward structured SLOs and predictive practices. Defaults let you plant the flag at the starting point that fits your current maturity stage, without pretending you have more data than you do.

When to revisit the default

Revisit the default after three to six months of real production data. Move the target only when you have a specific reason: the default is consistently too tight for your service's actual baseline, or the default is too loose and your users are actually sensitive to failures the default permits.

Error Budget Policy

An error budget is the allowable unreliability implied by your SLO. A 99.9% availability SLO over a 28-day window gives you roughly 40 minutes of budget. The policy defines what happens as that budget burns.
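
The arithmetic behind that figure is short enough to check by hand: the budget is the unavailable fraction (100% minus the SLO) multiplied by the length of the window. A minimal sketch, where the function name `error_budget_minutes` is illustrative and the SLO is assumed to be time-based availability:

```python
def error_budget_minutes(slo_pct: float, window_days: int = 28) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60  # 28 days = 40,320 minutes
    return (100.0 - slo_pct) / 100.0 * window_minutes


print(round(error_budget_minutes(99.9), 2))   # 40.32
print(round(error_budget_minutes(99.99), 2))  # 4.03
print(round(error_budget_minutes(99.0), 2))   # 403.2
```

Each extra nine cuts the budget by a factor of ten, which is why the choice of tier matters so much.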

Error budgets function as a coordination mechanism between product development and SRE by converting SLOs into a shared metric that determines release velocity. As long as budget remains, releases continue. When budget is depleted, releases halt. The mechanism replaces political negotiation between "move fast" and "stay stable" with an objective, data-driven decision rule.

A practical three-level policy structure:

Budget remaining | Status | Response
≥ 25% | Green | Normal operations. Feature development and releases proceed.
< 25% | Yellow | Warning. Reliability work increases. Non-critical deployments reviewed.
0% (exhausted) | Red | Feature freeze. Only reliability work and P0/security fixes deploy.

The budget only works if teams actually slow down when it runs out. If you set SLOs but never act on budget exhaustion, the whole system is theater.
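
The three levels reduce to a simple decision rule on the fraction of budget remaining. The sketch below is one way to encode it. The function name is illustrative, and treating exactly 25% as green and an overspent (negative) budget as red are assumptions made here for completeness.

```python
def budget_status(remaining_fraction: float) -> str:
    """Map remaining error budget (1.0 = untouched, 0.0 = exhausted) to a level."""
    if remaining_fraction <= 0.0:
        return "red"     # feature freeze
    if remaining_fraction < 0.25:
        return "yellow"  # warning, reliability work increases
    return "green"       # normal operations


print(budget_status(0.60))   # green
print(budget_status(0.10))   # yellow
print(budget_status(0.0))    # red
print(budget_status(-0.15))  # red
```

Encoding the rule this way keeps the decision mechanical: the status is read from the number, not negotiated.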

Step-by-Step Procedure

Setting Your First SLO

Step 1 — Classify the service. Assign the service to a criticality tier (critical path, supporting, or internal tooling). If it sits at a boundary, ask: "If this service is degraded for two hours, does a user transaction fail visibly?" Yes → critical path.

Step 2 — Start with the company-wide default. Apply the default SLO target for your tier. Do not invent a number. Document the default you are using and which policy version it comes from.

Step 3 — Measure your current baseline. Run the SLI you defined in the previous module for four weeks without acting on it. Record actual availability (or latency, or error rate — whatever your SLI measures).

Step 4 — Compare baseline to default. If your baseline is comfortably above the default (say, 30+ bps of headroom), the default stands as the official target. If your baseline is below the default, you have two options: accept that you are already in breach and treat your first weeks as a reliability sprint, or escalate to leadership for a temporary adjusted target with a remediation plan attached.
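
The comparison in Step 4 can be sketched as a small function. One basis point (bps) is 0.01 percentage points. The function names and return labels are illustrative, the 30 bps threshold is the example figure from Step 4 rather than a rule, and the middle case (above the default but with thin headroom) is an assumption, since the step does not address it.

```python
def headroom_bps(baseline_pct: float, target_pct: float) -> float:
    """Headroom in basis points; 1 bps = 0.01 percentage points."""
    return round((baseline_pct - target_pct) * 100, 2)


def first_target_decision(baseline_pct: float, default_pct: float,
                          comfortable_bps: float = 30.0) -> str:
    gap = headroom_bps(baseline_pct, default_pct)
    if gap >= comfortable_bps:
        return "adopt_default"
    if gap >= 0:
        # Assumption: the default still stands, but with little margin.
        return "adopt_default_thin_margin"
    return "reliability_sprint_or_escalate"


print(headroom_bps(99.85, 99.5))           # 35.0
print(first_target_decision(99.85, 99.5))  # adopt_default
print(first_target_decision(99.87, 99.9))  # reliability_sprint_or_escalate
```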

Step 5 — Write the error budget policy. Define what the team will do at each burn level (see Core Concepts above). Get sign-off from your PM before publishing it. Alignment in advance prevents the feature-freeze conversation from becoming a crisis.

Step 6 — Schedule weekly budget reviews. Weekly error budget reviews let teams detect depletion and course-correct before the situation escalates. Monthly reviews add enough lag that the budget can be gone, and a hard freeze triggered, before anyone has had a chance to intervene. Put a recurring review on the team calendar, and make the dashboard the first thing you look at in each review.

Step 7 — Enforce the policy when it triggers. The first time the budget hits the yellow or red threshold, follow the policy exactly. Credibility comes from what you do the first time the policy triggers, not from how well you wrote the document.

Executing a Feature Freeze

When the error budget is exhausted:

  1. Notify your PM immediately, citing the budget number, not your opinion.
  2. Move all in-flight feature work to a holding state. Do not cancel, just pause.
  3. Pull reliability work up: known stability investments, tech debt that has been tagged as reliability-relevant, and any open postmortem action items.
  4. Review budget status weekly. When the service returns above the SLO threshold and some budget has recovered, resume feature development.
  5. After the freeze lifts, run a lightweight retrospective on what consumed the budget.

Frame the freeze around the data, not the rule

When informing your PM, lead with the metric: "We have consumed our 28-day error budget. Our SLO target is 99.9% and we have been at 99.6% this month." Then describe the policy consequence. This keeps the conversation factual rather than procedural and avoids the perception that you are applying a rule for its own sake.


Worked Example

Setup. You own a checkout confirmation service. It is tier 1 (critical path). The company default SLO for critical-path services is 99.9% availability over a 28-day window. You have been running SLIs for six weeks. Your measured availability is 99.87%.

Step 1 — Apply the default. You adopt 99.9% as your official target. Your six-week baseline of 99.87% shows you are 3 bps below target (99.90% − 99.87% = 0.03 percentage points). This means you are currently in breach of the SLO you just adopted.

Step 2 — Start with a frank PM conversation. You present the numbers. The error budget for 99.9% over 28 days is 40.3 minutes. At 99.87% availability you are burning budget at about 1.3 times the sustainable rate, which would exhaust a full budget in roughly 21 to 22 days. You propose: for the next four weeks, the team pauses two feature projects and focuses on the top three causes of availability loss identified in your last incident review.
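
The time-to-exhaustion figure follows from the burn rate, which is the measured failure fraction divided by the failure fraction the SLO allows. A minimal sketch, assuming the measured rate holds steady and the window starts with a full budget (the function names are illustrative):

```python
def burn_rate(measured_availability_pct: float, slo_pct: float) -> float:
    """How many times faster than 'exactly on target' the budget is burning."""
    return (100.0 - measured_availability_pct) / (100.0 - slo_pct)


def days_to_exhaustion(measured_availability_pct: float, slo_pct: float,
                       window_days: int = 28) -> float:
    """Days until a full budget is spent if the measured failure rate holds."""
    return window_days / burn_rate(measured_availability_pct, slo_pct)


print(round(burn_rate(99.87, 99.9), 2))           # 1.3
print(round(days_to_exhaustion(99.87, 99.9), 1))  # 21.5
```

A burn rate of exactly 1.0 spends the budget on the last day of the window. Anything above 1.0 means the budget runs out early.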

Step 3 — Publish the error budget policy. Your policy specifies: yellow at < 25% remaining (about 10 minutes left), red at 0%. The PM signs off. Everyone agrees before the policy is needed.

Step 4 — Four weeks later. After focused reliability work, your availability climbs to 99.93%. You are 3 bps above target. The budget has partially recovered. Feature work resumes.

Step 5 — Six weeks later. A dependency outage causes a 35-minute availability gap in a single week. On top of the background errors already accrued in the window, this exhausts the 40.3-minute budget, and the budget drops to 0%. You trigger the feature freeze. The outage was caused by a company-wide networking failure, which is outside your team's control.

Step 6 — Apply the legitimate exception. Error budget policies include explicit criteria for legitimate exceptions: teams may continue non-reliability work if the budget was consumed by a company-wide networking problem, a failure in a service maintained by another team, or traffic outside the SLO's defined scope. You document the exception, note the root cause, and resume limited feature work while tracking the recovery.
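
The exception criteria can be written down as an explicit allow-list, so that the decision does not depend on who is arguing at the time. This is a sketch under stated assumptions: the three exempt causes come from the criteria above, while the `Cause` enum, the remaining members, and `freeze_applies` are hypothetical names for illustration.

```python
from enum import Enum


class Cause(Enum):
    OWN_SERVICE = "own service change or defect"
    COMPANY_WIDE_NETWORK = "company-wide networking problem"
    OTHER_TEAM_SERVICE = "failure in a service maintained by another team"
    OUT_OF_SCOPE_TRAFFIC = "traffic outside the SLO's defined scope"
    BUSINESS_PRESSURE = "business pressure"


# Only causes outside the team's control qualify.
LEGITIMATE_EXCEPTIONS = {
    Cause.COMPANY_WIDE_NETWORK,
    Cause.OTHER_TEAM_SERVICE,
    Cause.OUT_OF_SCOPE_TRAFFIC,
}


def freeze_applies(budget_remaining_fraction: float, dominant_cause: Cause) -> bool:
    """True if the feature freeze should be enforced."""
    if budget_remaining_fraction > 0:
        return False  # budget not exhausted, no freeze
    return dominant_cause not in LEGITIMATE_EXCEPTIONS


print(freeze_applies(0.0, Cause.COMPANY_WIDE_NETWORK))  # False
print(freeze_applies(0.0, Cause.OWN_SERVICE))           # True
print(freeze_applies(0.0, Cause.BUSINESS_PRESSURE))     # True
```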

Step 7 — Report the exception. You document the exception in writing (a brief Slack post in your team channel and a note in the budget review doc) and flag it to the SRE platform team as input for improving dependency-caused budget exclusions.


Common Misconceptions

"Set the SLO tight to push the team."

Setting an aggressive target as a motivational device is a reliable way to exhaust your budget immediately and burn team credibility on a policy that was never realistic. SLOs are a precision instrument, not a stretch goal. The target should reflect the reliability your users actually need, not the reliability you wish you had. Start from the default; raise the target only when data supports it.

"The feature freeze is a punishment."

Feature freezes are not disciplinary. They are a neutral policy consequence triggered by a metric. Framing them as punishment creates defensiveness and resistance. Framing them as "reliability data is telling us to redirect effort" keeps the conversation productive. The error budget mechanism is designed to reduce political friction between innovation and stability by replacing opinion with an objective shared metric.

"We can override the freeze if there's enough business pressure."

If executives can easily override feature freezes under business pressure, the policy has zero value. The override mechanism must be made expensive and rare. One operational model (sometimes called the "silver bullet" model) grants product teams a limited number of exceptions per period, but each use triggers a mandatory postmortem to understand why the override was necessary and how to prevent the need for future overrides. This makes exceptions visible and structurally costly without eliminating them entirely.
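
One way to picture the "silver bullet" model is as a ledger with a fixed allowance, where every granted override adds a postmortem to the team's obligations. The sketch below is illustrative only: the class name, its fields, and the allowance of two per period are assumptions, not figures from any published policy.

```python
from dataclasses import dataclass, field


@dataclass
class OverrideLedger:
    """Tracks freeze overrides for one period (hypothetical helper)."""
    allowed_per_period: int
    used: list = field(default_factory=list)
    postmortems_owed: list = field(default_factory=list)

    def request_override(self, requester: str, reason: str) -> bool:
        if len(self.used) >= self.allowed_per_period:
            return False  # no overrides left; the freeze stands
        self.used.append((requester, reason))
        # Every granted override creates a mandatory postmortem.
        self.postmortems_owed.append(f"Why was an override needed: {reason}")
        return True


ledger = OverrideLedger(allowed_per_period=2)
print(ledger.request_override("vp-product", "contractual launch date"))  # True
print(ledger.request_override("vp-product", "partner demo"))             # True
print(ledger.request_override("vp-product", "marketing campaign"))       # False
print(len(ledger.postmortems_owed))                                      # 2
```

The cost is structural rather than punitive: overrides remain possible, but each one is counted and each one generates follow-up work.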

"Monthly budget reviews are enough."

Weekly budget reviews allow teams to course-correct before the situation becomes a postmortem. Monthly reviews are almost always too slow. By the time a monthly review surfaces a budget problem, the team is often already in hard freeze territory with no room to maneuver.

"A postmortem only triggers after a major incident."

A single incident that consumes more than 20% of the error budget in a four-week window automatically triggers a mandatory postmortem, regardless of how the incident was classified in the moment. An incident does not need to be labeled SEV-1 to warrant a postmortem. Budget consumption is the operative trigger, not severity assignment.
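
The trigger is a ratio check: minutes of budget consumed by the incident divided by the total budget for the window. A minimal sketch, assuming a time-based availability SLO; the function name and parameters are illustrative.

```python
def postmortem_required(incident_bad_minutes: float, slo_pct: float,
                        window_days: int = 28, threshold: float = 0.20) -> bool:
    """True if one incident consumed more than `threshold` of the window's budget."""
    budget_minutes = (100.0 - slo_pct) / 100.0 * window_days * 24 * 60
    return incident_bad_minutes / budget_minutes > threshold


# For a 99.9% SLO over 28 days, 20% of the budget is about 8 minutes.
print(postmortem_required(9, 99.9))  # True
print(postmortem_required(5, 99.9))  # False
```

Note that a nine-minute incident might never be labeled SEV-1, yet it crosses the threshold for a 99.9% service.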


Active Exercise

Exercise: Write your team's error budget policy.

This exercise produces a document your team and PM can sign off on.

  1. Pick one service you own. Assign it a criticality tier.
  2. Identify or look up the company-wide SLO default for that tier. If no default exists, propose one based on the tier ranges in the Core Concepts section and note that it is a proposal.
  3. Define the three-level policy (green / yellow / red) for that service. For each level, write:
    • The threshold (% budget remaining)
    • What work is allowed
    • Who is notified
    • Who has authority to call a legitimate exception
  4. List two conditions that would qualify as legitimate exceptions for your specific service. Be specific. "Dependency failure" is too vague. "Failure in the authentication service maintained by the platform team, where our SLO explicitly excludes auth availability" is the right level of specificity.
  5. Define the override process. Who can approve a business-driven override? What documentation is required? What follow-up is mandatory?
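
If you prefer to draft the policy as structured data before writing it up, a completeness check helps catch empty fields before the PM review. This is an optional sketch: the `PolicyLevel` fields mirror the four items in step 3, and all names and sample values are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyLevel:
    name: str
    threshold: str = ""            # % budget remaining
    work_allowed: str = ""
    notified: list = field(default_factory=list)
    exception_authority: str = ""


def missing_fields(level: PolicyLevel) -> list:
    """Names of fields that are still empty in the draft."""
    return [key for key, value in vars(level).items() if not value]


yellow = PolicyLevel(
    name="yellow",
    threshold="< 25% budget remaining",
    work_allowed="Reliability work prioritized; non-critical deployments reviewed",
)
print(missing_fields(yellow))  # ['notified', 'exception_authority']
```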

Share the draft with your PM before publishing. Their willingness to sign is itself a signal about whether your SLOs have organizational backing.

Key Takeaways

  1. Classify services by criticality tier first. The tier determines the appropriate SLO target range; picking a number without the tier step produces arbitrary targets.
  2. Use company-wide defaults as your starting point. They solve the cold-start problem and ensure cross-team consistency. Override them only with specific data.
  3. An error budget policy only has value if it is enforced. Define consequences at each burn level in advance, get PM sign-off, and follow the policy the first time it triggers.
  4. Feature freezes at budget exhaustion are the mechanism that converts an SLO from a dashboard metric into a real signal. Without enforcement, the system is theater.
  5. Legitimate exceptions exist and should be written into the policy. But they must be scoped to external causes outside the team's control. Business pressure is not a legitimate exception.
  6. Review error budget status weekly, not monthly. Weekly cadence gives you enough time to intervene before the policy forces your hand.
