Winning Buy-In for Operational Excellence

How to turn SLOs, error budgets, and reliability work into shared business concerns — and build the coalitions that keep them funded.

Learning Objectives

By the end of this module you will be able to:

  • Frame SLOs and error budgets as alignment tools for conversations with PMs, not constraints imposed on them.
  • Explain reliability and on-call costs to non-engineering stakeholders using business language.
  • Identify which stakeholders need to be brought in before launching a new operational practice.
  • Describe when to conform to org-wide standards and when to advocate for a team-level exception.
  • Apply coalition-building principles to secure sustained investment in operational excellence.
  • Explain why joint SLO definition with product creates shared ownership of reliability targets.

Core Concepts

The reliability buy-in problem

Operational excellence work — SLOs, on-call rotations, incident response, toil reduction — does not sell itself. From the outside it looks like engineering overhead: time that is not shipping features, budget that is not growing the product. The people who control priorities and resources (PMs, skip-level managers, sales leadership) operate on a different set of incentives, and unless the framing of reliability work changes, the path of least resistance is to de-prioritize it.

Research on organizational change consistently identifies middle management buy-in as one of the primary determinants of whether initiatives succeed or fail. The same mechanism applies here: when the EM can act as a credible intermediary — translating reliability concerns into language that resonates with each stakeholder group — the chances of sustained investment improve substantially. When that translation does not happen, reliability work gets treated as optional.

SLOs as a shared contract, not a technical artefact

An SLO is a target for user-facing service behavior over a time window. The error budget is the amount of unreliability that target permits within the window: for an availability SLO, one minus the target. On their own, these are engineering metrics. In stakeholder conversations, they become something different: a shared contract about acceptable risk.
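To make the arithmetic concrete, here is a minimal Python sketch. It assumes an availability-style SLO measured in wall-clock minutes, where the error budget is (1 − target) × window. The function name and the 99.9% / 30-day figures are illustrative assumptions, not values from any particular service.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed minutes of unavailability for an availability SLO.

    slo_target is a fraction, e.g. 0.999 for "99.9%".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


# Illustrative only: a 99.9% target over a 30-day window.
budget = error_budget_minutes(0.999, 30)
print(f"{budget:.1f} minutes of allowed unavailability")  # 43.2
```

Stating the budget in minutes rather than as a percentage is often what makes it usable in a conversation with a PM: "about 43 minutes a month" is easier to reason about than "three nines".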

The error budget mechanism only works as a decision-making tool when the underlying SLO has been explicitly agreed to by relevant stakeholders. Organizations that skip stakeholder alignment in SLO definition report that error budgets fail to influence development prioritization — the metrics become reporting artifacts rather than operational levers. The reason is simple: if a PM never agreed to what "99.9% availability" means for their product, they have no reason to accept that a depleted error budget should pause feature work.

Joint definition is the fix. Google's SRE guidance makes the same point: SLOs and the error budget policy built on them need the agreement of product, development, and SRE, because a budget derived from a target that one side set alone will feel unfair or arbitrary to the other. Co-definition is not just about accuracy. It is about creating the shared ownership that makes the budget a legitimate decision-making lever rather than an engineering team's unilateral constraint.

Reliability framed as a business problem

Non-engineering stakeholders — product managers under feature delivery pressure, sales teams quoting SLAs to customers, finance deciding headcount — do not process on-call burden or MTTR as meaningful inputs. They process customer impact, revenue risk, and contract exposure.

The translation is not cosmetic. When you frame an on-call rotation as "N engineers carrying pager duty" you are speaking an engineering language. When you frame it as "our current incident response capacity means a P1 event hits customers for an average of X minutes before remediation begins, which maps to Y churn risk per quarter," you are speaking a business language. The underlying reality is the same; the stakeholder's ability to act on it is not.

Narrative structure in technical communication enables diverse stakeholders to extract relevant understanding without requiring shared technical expertise. A well-structured reliability narrative — one that connects technical events to customer and business consequences — allows a PM, a VP, and an account executive to read the same document and each extract the decision-relevant context they need.

Org-wide standards vs. team autonomy

Most engineering organizations operate with some version of a tension between org-wide operational standards and team-level flexibility. Shared tooling, shared observability platforms, shared on-call protocols, and shared SLA policies reduce coordination costs across teams. Standardized tooling fosters collaboration and knowledge sharing, enables engineers to understand and contribute to each other's systems, and simplifies management by reducing customization needs.

Successful toil automation and reliability improvements also require organization-wide agreement on shared tools and processes, not isolated efforts by individual teams. A team that automates a manual process using a tool no other team uses may eliminate its own toil but create a new coordination burden whenever that system touches others.

The question for an EM is not "standards vs. autonomy" in the abstract. It is: when does my team's situation genuinely differ from what the standard handles, and is that difference worth the coordination cost of an exception?

The bar for an exception should be high. The right argument is not "this standard is inconvenient for us" but "the standard as written creates a specific problem that the standard's designers did not anticipate, and here is a narrowly scoped variant that preserves the shared benefits." Advocating for exceptions requires the same stakeholder communication skills as any other reliability conversation — and the same credibility building over time.

Key Principles

1. Stakeholder buy-in precedes practice launch

Research on organizational change failures repeatedly names lack of involvement by the people affected as a leading cause. The often-quoted figure is that roughly 70% of change initiatives fail; whatever the exact number, a recurring explanation is that the people affected by the change were not part of designing or piloting it. The same dynamic applies when launching a new operational practice (SLOs, error budget policy, new on-call process): if the stakeholders whose priorities and resources are affected by the practice were not consulted before it launched, resistance is the predictable outcome, not an anomaly.

The practical implication: map your stakeholders before you write any process doc. For a new SLO with an associated error budget policy, the minimum set is the PM who owns the product area, the skip-level manager who owns the funding, and any peer engineering teams whose services interact with yours.

2. Participation is uncomfortable — and still necessary

Participatory approaches to change create a paradox: while participation reduces resistance and builds coalitions of supporters, it increases the change agent's feelings of vulnerability and loss of control. Change agents often misinterpret stakeholder input as resistance rather than legitimate voice.

Be alert to this pull in yourself. When a PM pushes back on an SLO target during a joint definition session, the instinct is to read it as obstruction. Often it is information: about what users actually care about, about where the product-reliability tradeoff sits from the business side, about what the PM has already committed to upward. That input makes the resulting SLO more credible, not less technically rigorous.

3. Convert fence-sitters through relationships, not only arguments

Affective cooptation (leveraging strong personal ties and emotional bonds to create commitment) converts ambivalent stakeholders into supporters more reliably than rational argument alone. This does not mean bypassing logic. It means that logical arguments land differently depending on the pre-existing relationship. The same case for an SLO policy will get a different hearing from a PM you have worked closely with than from one you have no relationship with.

Coalition-building for reliability work is therefore also relationship-building work — investing in regular, low-stakes communication with product, sales, and business peers before a reliability crisis makes a high-stakes ask necessary.

4. Leadership visibility is necessary but not sufficient for practice adoption

Successful deployment of shared practices — OKRs, SLOs, operational standards — requires executive commitment: champions who participate in the practice themselves, not just mandate it downward. But executive visibility without middle-layer support does not create operational change. The EM's role is in that middle layer — translating strategic intent into team practice, and translating team-level operational data back upward.

Credibility in upward communication builds through repeated interactions with consistent, high-quality communication over time. Communication quality matters more than frequency. An EM who provides crisp, accurate reliability updates, in business language and without hyperbole, builds organizational standing that peers who communicate only during crises lack, and with it the ability to advocate for investments.

5. Trust is built through sustained practice, not structure

Trust develops over time through successful shared experiences and deliberate actions that demonstrate consistent commitment. A new SLO process announced at a team meeting does not create trust. A quarter of joint SLO reviews with a PM counterpart, where the data consistently reflects reality and the conversation is honest about trade-offs, does.

This is important for expectation-setting. Coalition-building is not a one-time event before a process launch. It is a practice maintained over time — and its dividends are reaped when something goes wrong and you need a stakeholder to back a reliability investment under pressure.

Worked Example

Scenario. You are an EM for a payment processing service. The team has been running with informal reliability targets for two years. Incidents are escalating — two P1s in the last quarter, each causing customer-visible payment failures. You want to introduce formal SLOs and an error budget policy.

Step 1: Stakeholder map before any document. The affected parties are: the PM who owns the payment product roadmap, the VP Engineering who controls headcount (and is already asking about the incidents), the Sales team that quotes uptime SLAs in contracts, and two peer engineering teams whose services call into payment processing. Write down what each party's primary concern is. The PM cares about feature velocity and quarterly commitments. The VP cares about risk exposure and engineering reputation. Sales cares about what they can promise customers without getting into trouble. Peer teams care about not being blocked by your reliability decisions.

Step 2: Joint SLO definition with the PM. Do not arrive with a proposed SLO and ask for sign-off. Arrive with data: the current observed availability, the two P1s and their customer impact (in business terms — payment failures, churn risk, support volume), and a question: "What level of reliability does this product need to deliver on its business purpose?" Start from customer need, work backward to a number. The resulting target will be one the PM co-owns rather than one handed down from engineering.

Step 3: Error budget policy as a shared agreement. Draft the error budget policy together with the PM, not unilaterally. The policy should specify what happens when the budget is 50% consumed (a conversation), what happens when it is exhausted (feature work pauses to address reliability), and who is in the room for those conversations. Having both signatures on the document is not bureaucracy. It is what makes the policy actionable when the budget is actually depleted.
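The two thresholds in Step 3 can be written down unambiguously, which helps when the policy is later invoked under pressure. The sketch below is one hypothetical encoding in Python: the enum, the function name, and the choice to treat exactly 50% and exactly 100% as crossing the threshold are assumptions to settle with the PM, not part of any standard.

```python
from enum import Enum


class BudgetAction(Enum):
    NORMAL = "continue planned work"
    REVIEW = "hold the agreed PM/EM conversation"
    PAUSE_FEATURES = "pause feature work to address reliability"


def policy_action(consumed_fraction: float) -> BudgetAction:
    """Map error-budget consumption to the action the policy names.

    consumed_fraction is budget spent divided by total budget for the
    window; values above 1.0 mean the budget is overspent.
    """
    if consumed_fraction < 0:
        raise ValueError("consumed_fraction cannot be negative")
    if consumed_fraction >= 1.0:   # budget exhausted
        return BudgetAction.PAUSE_FEATURES
    if consumed_fraction >= 0.5:   # half the budget is gone
        return BudgetAction.REVIEW
    return BudgetAction.NORMAL


print(policy_action(0.62).value)  # hold the agreed PM/EM conversation
```

Writing the boundaries as code or as a table in the policy document removes one common source of dispute: whether "50% consumed" has actually been reached.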

Step 4: Business-language communication to the VP and Sales. Translate the SLO and error budget into the framing each audience needs. For the VP: "We now have an explicit reliability target and a mechanism for surfacing when the system is at risk. The two P1s last quarter consumed 40% of what would have been our annual error budget. Here is what that means for staffing and roadmap trade-offs." For Sales: "Our formal availability commitment is X%, measured over rolling 30 days. Here is how that maps to what you can quote in contracts."
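A figure like the 40% in the VP framing comes from simple arithmetic, sketched below. The scenario does not state the SLO target or the incident durations, so the 99.9% target and the 120- and 90-minute outages here are hypothetical, chosen only to show how such a number could be derived. The sketch also assumes each incident was a full outage for its entire duration.

```python
from typing import Sequence


def budget_consumed(slo_target: float, window_days: int,
                    incident_minutes: Sequence[float]) -> float:
    """Fraction of the error budget used by the listed incidents."""
    budget = (1.0 - slo_target) * window_days * 24 * 60
    return sum(incident_minutes) / budget


# Hypothetical inputs: 99.9% target, 365-day window, two P1 outages.
share = budget_consumed(0.999, 365, [120, 90])
print(f"{share:.0%} of the annual budget consumed")  # 40%
```

Partial outages would need weighting by the share of requests or customers affected, which is a measurement decision to agree on with stakeholders before quoting the number.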

Step 5: Standards check. Before finalizing, check whether the SLO definition, measurement approach, and error budget policy align with any existing org-wide reliability standards. If they do — great, you are contributing a compliant implementation. If they do not, document the specific divergence and why it exists, and route it to whoever owns the standards to either get an exception or align to the standard.

The joint definition shortcut

When time is short, a faster path to joint SLO definition is to bring the PM two or three candidate targets — one aspirational, one realistic given current system behavior, one conservative — with the customer-impact cost of each. This gives the PM something to react to rather than a blank page, while keeping the decision genuinely theirs.
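One way to prepare the candidate targets is to convert each into allowed downtime, so the PM reacts to minutes of customer impact rather than to percentages. The three targets below are placeholder assumptions; real candidates should come from observed system behavior and customer need.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60

# Placeholder candidates, not recommendations.
candidates = {
    "aspirational": 0.9999,
    "realistic": 0.999,
    "conservative": 0.995,
}


def allowed_downtime_minutes(target: float) -> float:
    """Minutes of full unavailability a target permits per 30 days."""
    return (1.0 - target) * MINUTES_PER_30_DAYS


for label, target in candidates.items():
    minutes = allowed_downtime_minutes(target)
    print(f"{label:<13} {target:.2%}  {minutes:6.1f} min per 30 days")
```

The downtime column is only half of the customer-impact cost. The other half, what each level costs to achieve in engineering time, still has to be estimated by the team.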

Active Exercise

This exercise applies the stakeholder mapping and framing skills from the module.

Setup. Choose a reliability practice you want to introduce or improve on your team — a new SLO, an on-call rotation change, a new incident review process, or a shift to a shared tooling standard.

Part 1: Stakeholder map (15 minutes)

List every person or group whose priorities, workload, or commitments are affected by this change. For each, write:

  • Their primary concern (what do they care about most?)
  • Their likely objection to this change
  • What would make this change attractive to them in their own terms

Part 2: Business-language translation (20 minutes)

Write a two-paragraph explanation of the reliability problem your change addresses. The constraint: you may not use any of the following words — availability, SLO, SLA, error budget, MTTR, toil, alert, on-call. Translate everything into customer impact, business risk, or cost terms.

If you cannot do it, that is useful information: it means the business case for the change is not yet clear in your own head.
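If you draft your two paragraphs in a text file, a short script can check the constraint for you. This is a minimal sketch: the function name is made up for illustration, and the matching rule (whole words, case-insensitive, with an optional plural "s") is an assumption about how strictly to read the word list.

```python
import re

BANNED_TERMS = [
    "availability", "SLO", "SLA", "error budget",
    "MTTR", "toil", "alert", "on-call",
]


def find_banned_terms(text: str, banned=BANNED_TERMS) -> list:
    """Return the banned terms that appear in text, in list order."""
    found = []
    for term in banned:
        # Whole-word match, optionally followed by a plural "s".
        pattern = r"\b" + re.escape(term) + r"s?\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


draft = "Our SLOs slipped and on-call load grew."
print(find_banned_terms(draft))  # ['SLO', 'on-call']
```

An empty result does not mean the draft is in business language; it only means the most obvious jargon is gone.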

Part 3: Participation design (10 minutes)

Identify the one or two stakeholders from Part 1 who should be co-designers of this change rather than recipients of it. Write one sentence describing what you would ask them to contribute and at what point in the process.

Reflection question. What is the risk that you misinterpret their input as resistance rather than information?

Key Takeaways

  1. SLOs without stakeholder alignment are reporting artifacts, not decision tools. Error budget policies only create operational leverage when the underlying SLO targets were explicitly agreed to by the parties who control feature velocity and resource allocation.
  2. Joint SLO definition is a coalition move. When product and SRE/engineering define SLOs separately, the budget tends to feel unfair or arbitrary to one party. Co-definition converts the SLO from an engineering constraint into a shared agreement about acceptable risk.
  3. Participation reduces resistance but requires tolerating loss of control. Including stakeholders in designing a new practice will surface input that changes the practice. This is the point — and it is also uncomfortable. Change agents who interpret stakeholder input as obstruction lose the coalition benefits of participation.
  4. Credibility builds through consistent, high-quality communication over time, not through crisis escalation. The EM who provides clear, accurate, business-language reliability updates routinely earns the organizational standing to advocate for investments when they are needed. Frequency matters less than quality and consistency.
  5. Org-wide standards are a coordination benefit, not a bureaucratic tax. Conforming to shared operational tooling and processes reduces toil at the org level, enables cross-team knowledge sharing, and makes reliability improvements portable beyond your team. The bar for an exception should be high, and any exception should be narrowly scoped.
