SLIs, SLOs, and Error Budgets
Building the reliability measurement layer that makes everything else work
Learning Objectives
By the end of this module you will be able to:
- Distinguish SLI, SLO, and SLA with concrete examples from your own services.
- Identify the most common SLI types (availability, latency, error rate, throughput) for a given service.
- Explain error budgets as a policy tool that governs the speed-reliability tradeoff.
- Describe how rolling windows differ from calendar windows and why rolling is generally preferred.
- Outline the steps to establish a baseline SLO for a service with no prior targets.
- Explain why unrealistic SLOs erode trust and how to avoid them.
The Three-Layer Model: SLI → SLO → SLA
These three terms describe the same reliability story at different levels of formality and consequence.
An SLI measures what is happening. An SLO sets a target for what should happen. An SLA commits to what will happen — with consequences if it doesn't.
SLI (Service Level Indicator) is the quantitative measurement of service performance at a point in time. It is a raw ratio: good events divided by total events, computed continuously. Examples: the fraction of HTTP requests that return a 2xx response; the proportion of API calls completing in under 200ms.
SLO (Service Level Objective) is the internal target for an SLI, expressed as a percentage over a time window. A typical SLO reads: "99.5% of requests will complete successfully over a rolling 30-day window." The SLO is an operational commitment made between engineering and the business. It carries no legal weight, but it does carry operational weight — it is the instrument that drives prioritization decisions. SLOs are foundational to all SRE practices: without them, error budgets cannot function as levers, and incident management loses its anchor.
SLA (Service Level Agreement) is a contractual agreement with external customers that includes defined penalties for missing targets. Where SLOs are internal and adjustable, SLAs carry legal and financial consequences. SLAs are typically set looser than internal SLOs, so the internal target is breached first and the gap between the two acts as a safety margin.
SLO breach → engineering response. SLA breach → legal and commercial response. The SLO is your early-warning system designed to prevent the SLA from ever being breached.
SLI Types: What You Can Actually Measure
The four most commonly implemented SLI types are:
| SLI Type | What it measures | Typical expression |
|---|---|---|
| Availability | Fraction of time the service is usable | % of requests that succeed |
| Latency | How fast requests complete | % of requests completing under a threshold (e.g., 95% of requests under 300ms) |
| Error rate | Proportion of failed requests | % of requests returning errors |
| Throughput | Volume of work the service can handle | Requests processed per second |
Most SLIs are ratios with a numerator and a denominator. Defining the denominator clearly is critical: if you cannot agree on what counts as a "total request," your SLI is ambiguous and your SLO is unmeasurable. For availability, latency, and error rate, the denominator is total eligible requests, and the numerator is the count of events that meet the criterion being measured: "good" events for availability and latency, failed requests for error rate. Throughput is the exception: it is a rate, so the unit of time takes the place of the denominator.
SLIs must be quantifiable, user-focused, and collected consistently over time. Practical collection means instrumenting your service to emit the raw metrics — Prometheus histograms, structured logs, or tracing spans — and configuring aggregation queries that compute the ratio.
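As a minimal sketch of the ratio described above, the following Python computes an availability SLI from a list of request records. The `Request` record, its field names, and the `/healthz` health-check path are hypothetical and chosen only for illustration; in practice the counts would usually come from a metrics backend rather than an in-memory list.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; field names are illustrative."""
    path: str
    status: int
    duration_ms: float


def availability_sli(requests: Iterable[Request]) -> Optional[float]:
    """Return good / total, or None when there are no eligible requests.

    Denominator: all requests except health checks (assumed at /healthz).
    Numerator: eligible requests that returned a 2xx status.
    """
    total = 0
    good = 0
    for req in requests:
        if req.path == "/healthz":  # excluded from the denominator
            continue
        total += 1
        if 200 <= req.status < 300:
            good += 1
    return good / total if total else None


sample = [
    Request("/orders", 200, 120.0),
    Request("/orders", 500, 45.0),
    Request("/healthz", 200, 2.0),
    Request("/orders", 201, 180.0),
    Request("/orders", 204, 90.0),
]
print(availability_sli(sample))  # 3 good out of 4 eligible requests
```

Returning `None` for an empty denominator is a design choice in this sketch: it keeps "no traffic" distinct from "perfect availability" so the caller has to decide how to treat quiet periods.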
Critical User Journeys: Measuring What Users Actually Care About
A technical metric that does not map to user experience provides no operational signal worth acting on. SLIs should be defined around critical user journeys — sequences of interactions a user takes to accomplish a business outcome — rather than on isolated component metrics.
A critical user journey often spans multiple services or APIs. That means the SLI should measure the end-to-end experience, not just the piece of infrastructure your team owns directly. A practical starting point is to identify one to three SLIs per critical user journey.
Error Budgets: Reliability as a Consumable Resource
An error budget is calculated as 1 minus the SLO percentage. It represents the allowable amount of degraded service within the measurement window.
Concretely: a 99.9% availability SLO yields a 0.1% error budget, or approximately 43.2 minutes of allowable downtime in a 30-day month. A 99.95% SLO yields 21.6 minutes. A 99.99% SLO yields 4.3 minutes.
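The downtime figures above can be reproduced with a few lines of Python. This is a sketch that assumes a 30-day window and treats the budget as minutes of complete unavailability; the function name is illustrative.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime permitted by an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    error_budget_fraction = (100.0 - slo_percent) / 100.0
    return window_minutes * error_budget_fraction


for slo in (99.9, 99.95, 99.99):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} min per 30 days")
```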
The error budget's operational power is that it functions as a control mechanism. When the service is performing well above its SLO, the team has budget to deploy features, run experiments, and take calculated risks. When the budget is exhausted, feature work pauses and stability work takes priority. This converts a subjective argument ("we should slow down on releases") into an objective, data-driven policy.
Error budgets only work as decision tools when the underlying SLOs have been explicitly approved by relevant stakeholders. Without stakeholder buy-in on what the SLO represents and how the organization will respond to exhaustion, the error budget becomes a dashboard number rather than a policy instrument. Organizations that skip this alignment step report that error budgets fail to influence development prioritization.
Time Windows: Rolling vs. Calendar
SLOs are expressed over a time window. The two main options are:
Calendar windows reset at fixed intervals — typically the start of a calendar month. They are simple to communicate but introduce a distortion: an outage on the 2nd of the month has nearly a full month to recover; an outage on the 28th triggers an immediate freeze with almost no time to recover before the window closes.
Rolling windows compute compliance continuously over the last N days (most commonly 30 days). They provide more stable enforcement behavior and are preferred in SRE practice. The Google SRE Workbook states: "A rolling period for error budget measurement is less prone to varying reaction depending on the date of an outage."
Start with a 30-day rolling window. It is the most common choice, maps naturally to monthly operational cadences, and avoids the calendar distortion problem. Switch to 7-day windows only if your deployment frequency and incident response require tighter feedback loops.
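To make the rolling behavior concrete, here is a minimal sketch that recomputes compliance over the trailing N days each time a new day of counts arrives. The per-day `(good, total)` input shape is an assumption, as is the choice to report 1.0 when the window contains no traffic.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def rolling_sli(
    daily_counts: Iterable[Tuple[int, int]], window_days: int = 30
) -> Iterator[float]:
    """Yield the SLI over the trailing window after each day's counts arrive.

    daily_counts yields (good_events, total_events) per day, oldest first.
    """
    window: Deque[Tuple[int, int]] = deque(maxlen=window_days)
    for good, total in daily_counts:
        window.append((good, total))  # the oldest day drops off automatically
        good_sum = sum(g for g, _ in window)
        total_sum = sum(t for _, t in window)
        # Assumption: a window with no traffic is reported as fully compliant.
        yield good_sum / total_sum if total_sum else 1.0


# One bad day (90% success) followed by clean days, using a 3-day window
days = [(100, 100), (90, 100), (100, 100), (100, 100), (100, 100)]
for day, sli in enumerate(rolling_sli(days, window_days=3), start=1):
    print(f"day {day}: {sli:.4f}")
```

In this example the bad day affects exactly three consecutive readings and then ages out, whatever date it fell on. A calendar window would instead drop it at the next reset boundary.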
Step-by-Step Procedure: Establishing a Baseline SLO
This sequence applies to any service that has no prior reliability targets.
Step 1: Identify your critical user journey. Name the sequence of interactions a user takes to complete the most important business action your service supports. Define the start and end of the journey. Resist the temptation to measure everything — start with the one journey whose failure hurts users most.
Step 2: Choose one to two SLI types. For most HTTP services, start with availability (fraction of successful requests) and optionally latency (fraction of requests under a defined threshold). Define the numerator and denominator explicitly in writing. Example: numerator = HTTP requests with 2xx response codes; denominator = all HTTP requests excluding health checks.
Step 3: Instrument and collect data. Wire up your SLI metrics through your existing observability stack (metrics, logs, or tracing). SLIs must be collected consistently over time; one-off snapshots are not usable. Do not set an SLO target yet.
Step 4: Run a 30-day observation phase. Measure actual system performance for at least 30 days before setting a target. This baseline reveals real operational patterns: deployment-induced dips, traffic spikes, dependency failures. Once a target exists, a further monitoring-only phase of 2–4 weeks is recommended before any enforcement action; it prevents targets that immediately trigger false emergencies.
Step 5: Propose a target jointly. Review the baseline data together with the relevant stakeholders (product manager, SRE lead if present, dependent team leads). Joint SLO definition between product and engineering is necessary for error budget credibility. When one side defines it alone, the budget always feels unfair to the other. Set the initial SLO slightly below the measured baseline to create realistic headroom.
Step 6: Compute the error budget and define the policy. Calculate the error budget (1 − SLO). Document what the team will do when the budget is 50% consumed (investigate), 75% consumed (review release cadence), and exhausted (feature freeze until budget recovers). Without a written policy, the budget is inert.
Step 7: Enable SLO-based alerting. Wire alerts to fire when the service is burning its error budget — not when infrastructure metrics cross thresholds. SLO-based alerting reduces noise by ensuring pages only fire when user-visible degradation is occurring. This is covered in depth in the on-call module, but configuring it is the natural final step of SLO setup.
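The policy thresholds from Step 6 can be written down as code, which helps keep the written policy and the tooling in agreement. The sketch below assumes a request-based budget (allowed bad events = (1 − SLO) × total events seen in the window). The function names and the "no action" label for consumption under 50% are illustrative and not part of any standard.

```python
def budget_consumed(bad_events: int, total_events: int, slo: float) -> float:
    """Fraction of the error budget used so far in the window.

    slo is a fraction, e.g. 0.995. Budget in events = (1 - slo) * total_events.
    """
    allowed_bad = (1.0 - slo) * total_events
    if allowed_bad == 0:
        return 0.0 if bad_events == 0 else float("inf")
    return bad_events / allowed_bad


def policy_action(budget_consumed_fraction: float) -> str:
    """Map the share of error budget consumed to the documented response."""
    if budget_consumed_fraction >= 1.0:
        return "feature freeze until budget recovers"
    if budget_consumed_fraction >= 0.75:
        return "review release cadence"
    if budget_consumed_fraction >= 0.50:
        return "investigate"
    return "no action"  # assumed label for the below-50% case


consumed = budget_consumed(bad_events=3_000, total_events=1_000_000, slo=0.995)
print(f"{consumed:.0%} consumed -> {policy_action(consumed)}")
```

Because the budget is computed in floating point, values that land exactly on a threshold can fall on either side of it. A production version would likely compare integer event counts instead.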
Worked Example
Service: An internal checkout API handling e-commerce payment submissions.
Critical user journey: A logged-in user submits a payment and receives a confirmation — start at POST /checkout, end at the confirmation response.
SLI definition:
- Numerator: requests to POST /checkout that return HTTP 200 within 1000ms
- Denominator: all requests to POST /checkout
Baseline measurement (30 days): After instrumenting the endpoint and running for 30 days, the team observes an average success rate of 99.7%, with three incidents causing brief drops to 98.5%.
SLO proposal: The team and PM review the data together and agree on the following SLO: 99.5% of checkout requests succeed within 1000ms over a rolling 30-day window. This is set below the observed 99.7% baseline to allow operational headroom for normal variance without constant policy triggers.
Error budget:
- SLO: 99.5%
- Error budget: 0.5% = approximately 216 minutes per 30-day period
- Weekly budget: ~50 minutes
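The arithmetic behind these figures, reading the budget as minutes of complete unavailability over a 30-day window (an assumption; the SLI itself is request-based, so partial degradation would consume the budget more slowly):

```latex
\begin{align*}
\text{window} &= 30 \times 24 \times 60 = 43{,}200 \text{ min} \\
\text{error budget} &= (1 - 0.995) \times 43{,}200 = 216 \text{ min} \\
\text{weekly share} &= 216 \times \tfrac{7}{30} = 50.4 \text{ min}
\end{align*}
```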
Policy:
- 50% budget consumed in the window: engineering reviews recent deployments.
- 75% consumed: release cadence slows; no new external dependencies deployed.
- Budget exhausted: feature releases halt until the 30-day rolling window recovers.
Alerting: A burn-rate alert fires when the service is consuming budget at more than 2× the expected rate. This is an early signal of an ongoing incident, not a historical accounting.
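A burn rate is commonly computed as the observed error rate divided by the error rate the SLO allows. The sketch below applies that definition with the 2× threshold from this example. The function names and the choice of lookback period (the counts passed in might cover the last hour, for instance) are assumptions, not something this module specifies.

```python
def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget would be used up exactly at the end of
    the SLO window if the current rate continued.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_events == 0:
        return 0.0
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - slo
    return observed_error_rate / allowed_error_rate


def should_alert(bad_events: int, total_events: int,
                 slo: float = 0.995, threshold: float = 2.0) -> bool:
    """Fire when budget is being consumed faster than threshold x expected."""
    return burn_rate(bad_events, total_events, slo) > threshold


# 150 failures in 10,000 requests is a 1.5% error rate against a 0.5% allowance
print(should_alert(bad_events=150, total_events=10_000))
# 40 failures in 10,000 requests is a 0.4% error rate, inside the allowance
print(should_alert(bad_events=40, total_events=10_000))
```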
The checkout API depends on a payment processor (external, 99.9% SLA) and an inventory service (internal, current observed 99.95%). Combined ceiling before the checkout API's own reliability: ~99.85%. The 99.5% SLO is comfortably achievable given this constraint.
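The ceiling quoted above is the product of the dependency availabilities. The sketch below assumes that both dependencies are hard (every checkout request needs both) and that they fail independently; the function name is illustrative.

```python
import math
from typing import Sequence


def availability_ceiling(dependency_availabilities: Sequence[float]) -> float:
    """Upper bound on availability when every dependency must succeed.

    Assumes dependencies are hard (serial) and fail independently.
    """
    return math.prod(dependency_availabilities)


# Payment processor at 99.9%, inventory service at 99.95%
ceiling = availability_ceiling([0.999, 0.9995])
print(f"{ceiling:.2%}")
```

If the dependencies share failure causes, or if the checkout API can degrade gracefully when one is down, the real ceiling will differ from this product.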
Common Misconceptions
"Higher SLOs are always better." A 99.99% SLO allows only 4.3 minutes of downtime per month. For most internal services, that target is both unnecessary and punishing — every minor incident immediately exhausts the budget and triggers a feature freeze. Unrealistic SLO targets are a primary failure mode: teams breach them frequently, become demoralized, and stop trusting the mechanism. The right SLO is the one that reflects what users actually need and that the system can realistically deliver.
"Error budgets are a tool for SREs, not EMs." Error budgets translate reliability into a resource that product and engineering can negotiate over together. Without the EM's involvement in the policy definition, the budget becomes a technical artifact that no one acts on. The EM's role is to own the policy — what happens when the budget is exhausted — not just the measurement.
"We can set the SLO now and measure later." Setting an SLO target before you have 30 days of baseline data is likely to produce an aspirational number rather than an operational one. The target will either be breached immediately (if set too tight) or never provide signal (if set too loose). Measure first, commit second.
"Calendar windows are fine because they match business reporting." Calendar windows introduce an asymmetry: an outage early in the month has time to recover; an outage late in the month causes an immediate and disproportionate response. Rolling windows eliminate this distortion and are recommended unless external contractual alignment requires calendar-based reporting.
"An SLO breach means someone failed." An exhausted error budget is information, not blame. When budget exhaustion triggers blame, teams hide problems or manipulate metrics. The goal of the policy is to redirect priority, not to assign fault. Frame it that way in every conversation that references error budget status.
Key Takeaways
- SLI measures, SLO targets, SLA contracts. The SLO is your operational instrument; the SLA is your external commitment. Keep internal SLOs stricter than the SLA so that SLA breaches are structurally unlikely.
- Error budgets convert reliability into a shared resource. The budget is the gap between 100% and the SLO. When it is healthy, the team has permission to move fast. When it is exhausted, stability takes priority. This transforms a subjective debate into a data-driven policy.
- Measure before you commit. Thirty days of baseline data before setting an SLO target is the minimum. A 2–4 week monitoring-only phase before enforcement prevents targets that immediately trigger false emergencies.
- Joint definition is not optional. SLOs defined by one team alone always feel arbitrary to the other. Product and engineering must agree on targets, scope, and the policy that follows exhaustion.
- Rolling windows beat calendar windows. Rolling windows provide stable enforcement behavior regardless of where in the month an incident falls. Default to 30-day rolling unless external constraints require otherwise.
Further Exploration
Core References
- Google SRE Workbook: Implementing SLOs — The canonical implementation guide, including the monitoring-only phase recommendation and the full SLO lifecycle.
- Google SRE Workbook: Error Budget Policy — Covers joint definition, the unfair budget failure mode, and how to write an error budget policy document.
- Google SRE Workbook: Alerting on SLOs — The standard reference for burn-rate alerting and reducing page noise.
Critical User Journeys & Dependencies
- Google Cloud: Practical Guide to Setting SLOs — Critical user journey approach to SLI selection.
- ACM Queue: The Calculus of Service Availability — The mathematical treatment of dependency-constrained availability, including the Rule of the Extra 9.
- Michelin Engineering: SLI/SLO on Critical User Journeys — A practitioner walkthrough of mapping business processes to SLIs across multiple services.
Time Windows & Compliance
- Alex Ewerlöf: SLO Compliance Period — A clear explanation of rolling vs. calendar windows and the boundary distortion problem.