DORA Metrics and Continuous Delivery

How to read your team's delivery health as a system, and which engineering practices actually move the needle

Learning Objectives

By the end of this module you will be able to:

  • Define the four (now five) DORA key metrics and explain what each signals about delivery health.
  • Explain why deployment frequency and change failure rate tend to improve together, not trade off.
  • Describe the CI/CD pipeline stages and what each stage contributes to safety.
  • Explain why small-batch work reduces both risk and cycle time simultaneously.
  • Identify the failure modes of formal change approval boards on delivery performance.
  • Use a maturity framework to assess your team's current CD capability and identify one improvement step.

Core Concepts

The Four (and Now Five) DORA Metrics

The Accelerate framework defines software delivery performance through four outcome metrics. Crucially, they are outcome metrics — they tell you what your system is producing, not what inputs to tune.

Fig 1. The DORA metrics and what they measure

Throughput
  • Deployment Frequency: How often code reaches production. Proxy for batch size and flow.
  • Lead Time for Changes: Commit → production. Measures pipeline friction and batch size.

Stability
  • Change Failure Rate: % of deploys that cause a failure requiring remediation.
  • Time to Restore Service (MTTR): How fast you recover from a production incident.

These four metrics are used to validate whether capability improvements actually impact business outcomes, and they anchor the DORA assessment.
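
As a rough illustration, the four metrics can be derived from a log of production deploys. The sketch below is not DORA's official methodology: the Deployment record shape, its field names, and the choice of medians are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deploy (hypothetical record shape)."""
    committed_at: datetime                  # earliest commit included in the deploy
    deployed_at: datetime                   # when the change reached production
    failed: bool = False                    # did the deploy need remediation?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deploys: list[Deployment], window_days: int) -> dict:
    """Summarize the four original metrics over a reporting window."""
    if not deploys:
        raise ValueError("no deployments in the window")
    failures = [d for d in deploys if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Made-up sample data: four deploys in one week, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=2)),
    Deployment(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=6),
               failed=True, restored_at=t0 + timedelta(days=2, hours=7)),
    Deployment(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=8)),
]
summary = dora_summary(sample, window_days=7)
print(summary)
```

Medians are used here because a single long-running change or incident would otherwise dominate an average. Teams that prefer means or percentiles can swap the aggregation without changing the record shape.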

In 2025, DORA added a fifth metric: Rework Rate — the proportion of engineering activity spent on reactive work versus planned feature development. This addition explicitly acknowledges that the original four metrics were insufficient for diagnosing root causes of performance degradation. A team whose deployment frequency drops can now ask: is this a capacity problem, or are we drowning in reactive firefighting?

Rework Rate in practice

Rework Rate is the newest metric and the least standardized. The methodology for calculating it across teams is still being established. Use it directionally — a rising rework rate is a signal worth investigating — rather than as a precise benchmark.
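
One way to use the metric directionally is to track the share of reactive work per period and flag a sustained rise. The sketch below assumes each work item carries a "planned" or "reactive" label and that counting items is an acceptable proxy for engineering activity. Both are assumptions for illustration, not an established calculation method.

```python
def rework_rate(labels: list[str]) -> float:
    """Share of work items labeled 'reactive' (hypothetical labeling scheme)."""
    if not labels:
        raise ValueError("no work items in the period")
    return sum(1 for label in labels if label == "reactive") / len(labels)


def is_rising(rates: list[float], tolerance: float = 0.02) -> bool:
    """True when every period exceeds the previous one by more than `tolerance`."""
    return len(rates) >= 2 and all(b - a > tolerance for a, b in zip(rates, rates[1:]))


# Made-up data: three consecutive periods of ten work items each.
periods = [
    ["planned"] * 8 + ["reactive"] * 2,
    ["planned"] * 7 + ["reactive"] * 3,
    ["planned"] * 6 + ["reactive"] * 4,
]
rates = [rework_rate(p) for p in periods]
print(rates, "rising:", is_rising(rates))
```

The tolerance keeps small period-to-period noise from triggering an investigation. Its value here is arbitrary and would need tuning per team.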

Reading the Metrics as a System

The instinctive read is: deployment frequency = speed, change failure rate = safety. On that reading, speed and safety trade off, so you optimize one at the expense of the other.

The data refutes this. Elite performers deploy multiple times per day with a change failure rate of 7.5%–15%. Low performers deploy once per month to once every six months — and have a change failure rate of 45%–60%.

High-frequency deployment, paired with strong automation and peer review, does not increase risk. It reduces it — by catching problems earlier and enabling faster feedback loops.

This is not a coincidence. It reflects the mechanics of batch size. Teams that deploy rarely accumulate large, tangled changes. When something goes wrong in a large batch, the blast radius is large and the root cause is harder to isolate. Teams that deploy frequently work in small batches, make individual changes observable, and restore faster because the set of candidate causes is small.
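
A toy model makes the batch-size mechanics concrete. Assume, purely for illustration, that every individual change has the same small, independent probability of causing a failure. The chance that a deploy contains at least one bad change then grows quickly with batch size, and so does the number of candidate causes to search. The 2% per-change figure below is made up, and real changes are neither independent nor equally risky.

```python
def batch_failure_probability(p_change: float, batch_size: int) -> float:
    """P(at least one bad change in the batch), assuming independent changes."""
    return 1 - (1 - p_change) ** batch_size


# Hypothetical per-change failure probability of 2%.
for n in (1, 5, 20, 50):
    probability = batch_failure_probability(0.02, n)
    print(f"batch of {n:>2}: P(failure) = {probability:.3f}, candidate causes = {n}")
```

Under these assumptions the per-deploy failure rate climbs with every change added to the batch, which is the direction the benchmark data points in.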

What DORA Measures — and What It Doesn't

DORA measures technical delivery outcomes. It answers what your pipeline is producing. It does not answer why a team is struggling, or whether individuals are satisfied, or whether collaboration patterns are healthy.

DORA and the SPACE framework are explicitly complementary: DORA for the delivery signal, SPACE for the human and systemic layer (Satisfaction, Performance, Activity, Communication & Collaboration, Efficiency & flow). In practice, start with DORA to establish a delivery baseline, then layer SPACE dimensions when your team grows past 20–30 engineers and coordination challenges emerge.


Step-by-Step Procedure: The CI/CD Pipeline

Continuous Integration (CI) is the prerequisite and foundational layer. Both Continuous Delivery and Continuous Deployment build on CI. The reverse is not true: CI does not require CD.

A standard CI/CD pipeline follows this sequence, where each stage is a quality gate:

Fig 2. CI/CD pipeline stages: each stage is a quality gate; failure stops progression

Trigger → Build → Test → Security → Package → Deploy + Monitor/Rollback

Stage by stage:

  1. Trigger — A push, pull request, or merge starts the pipeline.
  2. Build — The application is compiled and assembled. This is also where the artifact is created. Best practice: artifacts are immutable. The same binary that passes tests in staging is the binary that ships to production. No rebuilds between environments.
  3. Test — Unit tests, then integration tests. Failures stop progression here, cheaply.
  4. Security scanning — Static analysis, dependency vulnerability checks.
  5. Package and publish — Artifact is tagged, versioned, and pushed to a registry. Artifact SHAs should be tied to runbooks to reduce ambiguity during incident response.
  6. Deploy to environments — Staging first, then production. Continuous Delivery stops here: production deploy requires an explicit manual approval or business decision. Continuous Deployment automates the final step.
  7. Monitor and rollback — Production metrics are observed post-deploy. Automated rollback can be triggered by error rate increases, latency spikes, or conversion rate drops, independent of manual incident response.
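
The gate behaviour described in the stages above can be sketched as an ordered list of checks in which the first failure stops everything after it. The stage functions here are stand-in lambdas, not calls into any real CI system.

```python
from typing import Callable

# A stage is a name plus a gate function that returns True on success.
Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> tuple[bool, list[str]]:
    """Run stages in order and stop at the first failing gate."""
    completed: list[str] = []
    for name, gate in stages:
        if not gate():
            return False, completed + [f"{name} (failed)"]
        completed.append(name)
    return True, completed


# Hypothetical run in which the security scan flags a vulnerable dependency.
stages: list[Stage] = [
    ("trigger", lambda: True),
    ("build", lambda: True),
    ("test", lambda: True),
    ("security", lambda: False),
    ("package", lambda: True),
    ("deploy", lambda: True),
]
ok, log = run_pipeline(stages)
print(ok, log)
```

Because the loop returns as soon as a gate fails, the package and deploy stages never run for a build that did not pass the earlier checks.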

Where this requires judgment:

  • The pipeline defines what can be automated. Which environments require manual approval is a business decision, not a technical one.
  • Security scanning tools generate noise. Teams need explicit triage policies, otherwise flagged builds pile up and engineers learn to ignore the signal.

Worked Example: Diagnosing a Stalled Team

A team ships a major feature once every three weeks. Recently two releases in a row caused production rollbacks. The manager's instinct: slow down, add more review gates.

Let's trace the DORA signal.

Step 1: Read the metrics.

Metric                | Observed value | Elite benchmark
Deployment frequency  | ~1x/3 weeks    | Multiple/day
Lead time for changes | ~12 days       | < 1 day
Change failure rate   | ~40%           | 7.5–15%
MTTR                  | ~4 hours       | < 1 hour

This is a low-performer profile. The change failure rate and MTTR are high, but so is the lead time — which points to large batches.

Step 2: Diagnose the batch size problem.

A 12-day lead time means changes accumulate before they ship. With a release every three weeks, each release contains up to three weeks of interleaved changes from multiple engineers. When something breaks, the blast radius is large and the root cause is non-obvious. Rollbacks restore the system but don't isolate what failed.

Step 3: Identify the vicious cycle risk.

The instinctive response to two rollbacks is to add a change advisory board sign-off. DORA research found that organizations using external approval processes were 2.6 times more likely to be low performers. Adding the CAB would increase lead time further, force even larger batches, and compound the blast radius — producing more rollbacks, not fewer.

Step 4: Identify the productive intervention.

Peer review during development, enforced at code check-in time and supported by automated tests, is more effective than formal approval boards. The team needs smaller batches (decompose features into independently mergeable units), faster CI feedback (test suite that runs in < 10 minutes), and feature flags to decouple deployment from activation.

Step 5: Pick one improvement.

Start with one practice, validate its usefulness, then layer. For this team: enforce a pull request size norm (max 200 lines of logic) and measure whether lead time shortens over the next four weeks. Don't restructure the whole pipeline at once.
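
A size norm like this can be checked automatically. The sketch below counts added and removed lines in a unified diff and compares the total with the limit. What counts as a "line of logic" is a team decision; here it is assumed to mean any non-blank changed line, which is a simplification.

```python
def changed_lines(diff_text: str) -> int:
    """Count non-blank added or removed lines in a unified diff.

    Simplification: any line starting with '+++' or '---' is treated as a
    file header, so a changed line whose content begins with '--' or '++'
    would be skipped.
    """
    count = 0
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file header, not a change
        if line.startswith(("+", "-")) and line[1:].strip():
            count += 1
    return count


def within_size_norm(diff_text: str, limit: int = 200) -> bool:
    """True when the diff stays at or under the agreed line limit."""
    return changed_lines(diff_text) <= limit


# Made-up diff: one line replaced, one blank line and one new line added.
sample_diff = "\n".join([
    "--- a/app.py",
    "+++ b/app.py",
    "@@ -1,3 +1,5 @@",
    " import os",
    "-x = 1",
    "+x = 2",
    "+",
    "+y = 3",
])
print(changed_lines(sample_diff), within_size_norm(sample_diff))
```

A check like this works best as a warning rather than a hard block at first, so the team can see how often the norm is exceeded before enforcing it.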


Compare & Contrast: Formal Change Boards vs. Peer Review + Automation

These serve the same stated purpose — catching bad changes before they reach production — but produce opposite outcomes.

Dimension | Formal CAB / Change Board | Peer Review + CI/CD
When does review happen? | Late, after work is done | Early, during development
Who reviews? | External committee | Team members with context
What does it inspect? | Documentation and intent | Actual code and automated test results
Effect on batch size | Forces batching (reduce approval overhead) | Enables small batches (cheap to submit)
Effect on lead time | Increases (scheduling lag) | Decreases (async, continuous)
Evidence of safety improvement? | No evidence CFR is lower with CABs | Higher deploy frequency + no CFR increase

The vicious cycle

When teams adopt heavier approval processes in response to production incidents, they increase lead time and batch size. Larger batches have a higher blast radius, causing more incidents, which motivates more approval formality. This cycle degrades all four DORA metrics simultaneously. DORA research found no evidence that formal external approval is associated with lower change failure rates.

The peer review model does not mean no oversight. It means oversight moves earlier, gets closer to the code, and is supported by automation that can't be gamed or skipped for scheduling convenience.


Boundary Conditions

When small batches are hard to enforce. Some work genuinely requires large coordinated changes — database schema migrations, protocol version upgrades, platform refactors. Feature flags let you deploy the code while deferring activation, which reduces the blast radius even when the change itself is large. The deploy and the release become separate events.
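
A minimal sketch of that separation: the new code path ships to production switched off, and activation becomes a flag change rather than a deploy. The flag name, the dictionary used as a flag store, and the checkout function are all hypothetical.

```python
def is_enabled(flag: str, flags: dict[str, bool]) -> bool:
    """Unknown flags default to off, so newly deployed code stays dark."""
    return flags.get(flag, False)


def checkout(cart_total: float, flags: dict[str, bool]) -> str:
    """Both code paths are deployed; the flag decides which one runs."""
    if is_enabled("new_checkout", flags):
        return f"new flow: {cart_total:.2f}"
    return f"legacy flow: {cart_total:.2f}"


# Deploy: the new path is in production but inactive.
print(checkout(10, {"new_checkout": False}))

# Release: flipping the flag activates it without another deploy.
print(checkout(10, {"new_checkout": True}))
```

Defaulting unknown flags to off is a deliberate choice in this sketch: a missing or mistyped flag leaves users on the existing behaviour instead of exposing unfinished work.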

When DORA metrics plateau before the real problem is solved. The four original metrics measure delivery throughput and stability. They don't tell you whether the right things are being built, whether engineers are burning out, or whether technical debt is compounding. The Rework Rate metric and the SPACE dimensions address this gap. A team that optimizes purely for the four original numbers while ignoring rework rate can look healthy while quietly accumulating unsustainable reactive load.

When your team is too small to instrument everything. For small teams (under 20 engineers), DORA alone may be sufficient. Adding SPACE, Rework Rate tracking, and comprehensive observability simultaneously creates measurement overhead that crowds out the work it is supposed to measure. Start with deployment frequency and lead time — they are the easiest to instrument and the most diagnostic.

When canary releases don't apply. Canary releases route 1–5% of traffic to a new version and require traffic splitting infrastructure. For internal tooling, batch jobs, or systems with no meaningful traffic distribution, the blast radius controls look different. Feature flags remain applicable; canaries do not.
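
Where traffic splitting does apply, a common approach is deterministic bucketing: hash a stable request attribute and send a fixed slice of the hash space to the new version. The sketch below assumes a per-user ID is available and uses made-up IDs. Production systems usually do this in a load balancer or service mesh rather than in application code.

```python
import hashlib


def routes_to_canary(user_id: str, canary_percent: float) -> bool:
    """Deterministically assign a user to the canary for a given percentage."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < canary_percent * 100


# Hypothetical user IDs; the observed share should land near the 5% target.
users = [f"user-{i}" for i in range(10_000)]
share = sum(routes_to_canary(u, 5.0) for u in users) / len(users)
print(f"canary share: {share:.1%}")
```

Hashing keeps each user on the same version across requests, which makes canary metrics easier to interpret than random per-request routing would.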

When automated rollback fails silently. Automated rollback requires robust observability across infrastructure, application, and business metrics. If your observability stack doesn't capture the failure mode (e.g., a data corruption that doesn't show up in error rates), automated rollback won't fire. The safety mechanism is only as good as the signal it monitors.
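
The dependency on signal quality shows up in even a simple rollback rule. The sketch below compares a post-deploy snapshot with a baseline and lists which thresholds were breached. The metric names and threshold values are assumptions for illustration, not recommended settings.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Metrics sampled over a fixed window (hypothetical shape)."""
    error_rate: float       # fraction of failed requests
    p95_latency_ms: float   # 95th percentile latency
    conversion_rate: float  # fraction of sessions that convert


def rollback_reasons(
    baseline: Snapshot,
    current: Snapshot,
    max_error_increase: float = 0.01,
    max_latency_ratio: float = 1.5,
    max_conversion_drop: float = 0.2,
) -> list[str]:
    """Return the breached thresholds; an empty list means no rollback."""
    reasons = []
    if current.error_rate - baseline.error_rate > max_error_increase:
        reasons.append("error rate")
    if current.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        reasons.append("latency")
    if current.conversion_rate < baseline.conversion_rate * (1 - max_conversion_drop):
        reasons.append("conversion")
    return reasons


baseline = Snapshot(error_rate=0.01, p95_latency_ms=200, conversion_rate=0.030)

# A deploy that raises errors is caught.
noisy = Snapshot(error_rate=0.05, p95_latency_ms=210, conversion_rate=0.029)
print(rollback_reasons(baseline, noisy))

# A deploy that silently corrupts data looks healthy on these three signals.
silent = Snapshot(error_rate=0.01, p95_latency_ms=200, conversion_rate=0.030)
print(rollback_reasons(baseline, silent))
```

The second case returns no reasons even though the deploy is harmful, because none of the monitored signals captures the failure mode.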

Key Takeaways

  1. The four DORA metrics are a system. Deployment frequency, lead time, change failure rate, and MTTR form a coherent signal. High performers score well on all four simultaneously. Low performers typically fail all four.
  2. Deployment frequency and change failure rate improve together, not at each other's expense. Elite teams deploy multiple times per day with a 7.5–15% change failure rate. Infrequent deployment correlates with 45–60% failure rates.
  3. The CI/CD pipeline is a sequence of quality gates. Artifacts are immutable and promoted unchanged through all stages. What passes testing is what ships to production.
  4. Formal change approval boards create a vicious cycle. They are associated with lower deployment frequency and no improvement in change failure rate. They increase batch sizes, which increases blast radius and causes more incidents.
  5. Peer review at check-in time, backed by automated tests, is the evidence-based alternative to CABs. Oversight moves earlier, gets closer to the code, and is supported by automation that can't be gamed or skipped for scheduling convenience.
  6. Small batches, feature flags, canary releases, and automated rollback are the runtime controls that replace pre-deployment approval gates. They make deployments observable and reversible.
  7. Maturity improvement is incremental. Pick the one constraint most limiting your metrics, intervene on it, measure the result, then iterate.
