Temporal Isolation and Flow Control
Circuit breakers, bulkheads, backpressure, and load shedding as a coherent system — not a menu of options
Learning Objectives
By the end of this module you will be able to:
- Explain the circuit breaker pattern as a feedback mechanism and describe its three states.
- Distinguish bulkhead isolation from circuit breaking and describe how they compose.
- Describe backpressure as a coordination signal and contrast it with load shedding.
- Apply retry budget thinking to prevent retry storms in the face of upstream degradation.
- Classify traffic as essential or non-essential and derive a load shedding strategy from that classification.
- Recognize the two-stable-states model and explain how flow control mechanisms tip a system toward recovery.
Core Concepts
The two-stable-states model
Before examining individual patterns, it helps to understand the system dynamics they are fighting against.
Research from Bronson et al. (HotOS 2021) characterizes overloaded distributed systems as bistable: they have exactly two stable states — working normally and stuck-broken. Once a triggering event pushes a system into the overloaded state, it does not automatically return to normal when the trigger disappears. The feedback loop sustains the broken state on its own.
The sustaining feedback loop — not the initial trigger — is the true root cause of the prolonged outage.
The loop runs like this: timeouts cause retries, retries amplify load, amplified load extends timeouts, which cause more retries. Capacity that would have been sufficient for steady-state traffic is consumed fighting the amplification, so the system stays stuck even after the original cause is gone. As described in Metastable Failures in the Wild, recovery requires explicitly disrupting this loop — it will not self-heal.
This is the design space all flow control patterns live in: how do you prevent the system from entering the broken attractor, and if it gets there, how do you force it back?
Metastable failures are characteristic of open systems: services exposed to an external load source that they cannot directly control. Closed systems, whose input rate is bounded by the system itself, do not sustain this kind of runaway amplification in the same way, because the offered load cannot grow past that bound. This is why public-facing services at scale face a qualitatively different resilience problem than internal batch processors.
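The retry-load-timeout loop can be illustrated with a toy discrete-time model. This is a sketch under loud assumptions, not a model taken from the cited papers: every failed request is retried on the next tick, and under overload goodput collapses in proportion to the overload factor because capacity is spent on requests that time out anyway. The function name and all parameters are hypothetical.

```python
def simulate(ticks, capacity, base_load, spike_load, spike_ticks, retry_cap=None):
    """Toy model of the retry feedback loop (illustrative assumptions only).

    Returns the goodput (successfully served requests) for each tick.
    retry_cap, if set, bounds how many retries are re-offered per tick,
    standing in for a crude retry budget.
    """
    retries = 0.0
    history = []
    for t in range(ticks):
        new = spike_load if t in spike_ticks else base_load
        offered = new + retries
        if offered <= capacity:
            goodput = offered
        else:
            # Assumed collapse: the more overloaded, the more work is wasted.
            goodput = capacity * (capacity / offered)
        failed = offered - goodput
        retries = failed if retry_cap is None else min(failed, retry_cap)
        history.append(goodput)
    return history


# A three-tick spike; steady-state load (80) is below capacity (100).
stuck = simulate(40, 100, 80, 300, range(5, 8))
healed = simulate(40, 100, 80, 300, range(5, 8), retry_cap=10)
```

In this model the uncapped run never returns to serving the base load after the spike ends, while the capped run does. The trigger is identical in both; only the sustaining loop differs.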
Circuit breakers: feedback-loop disruptors
The circuit breaker pattern monitors failure rates between services and, once failures exceed a configured threshold, opens the circuit — halting requests to the failing downstream service and returning errors immediately rather than waiting for timeouts.
The three-state machine is the core mechanism:
- Closed: requests flow normally; failure rate is tracked.
- Open: requests are blocked immediately (fail-fast); the downstream service sees zero additional load.
- Half-open: a probe request is allowed through; if it succeeds the circuit closes, if it fails it reopens.
From a cybernetics perspective, the circuit breaker acts as a governor: it applies negative feedback to the calling system. By stopping the flow of requests to a failing service, it prevents the retry-load-timeout cycle from sustaining itself. One empirical evaluation reported that circuit breakers reduced error rates by 58% in the microservice systems it tested; that figure reflects the study's setup and should not be read as a general guarantee.
The pattern is most powerful when combined with graceful degradation. When the circuit opens, there should be a fallback: a cached response, a default value, a degraded-mode response. Circuit breakers detect the failure; graceful degradation decides what to do next. AWS Well-Architected explicitly positions these as layered, not alternative, mechanisms.
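The three-state machine and its fallback can be sketched in a few lines. This is a minimal illustration, not a production implementation: it trips on consecutive failures rather than a true failure rate, it is not thread-safe, and the class name, parameters, and injectable `clock` are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Minimal breaker: closed -> open -> half-open -> closed (or back to open)."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, fallback):
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_timeout:
                return fallback()      # fail fast: downstream sees no load
            self.state = "half-open"   # timeout elapsed: let one probe through
        try:
            result = func()
        except Exception:
            self._on_failure()
            return fallback()          # graceful degradation decides the reply
        self.state = "closed"          # success (including a probe) closes it
        self.failures = 0
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

A caller would wrap each downstream call, for example `breaker.call(fetch_live_price, lambda: cached_price)`, where both callables are hypothetical. Separating detection (the breaker) from the response (the fallback) mirrors the layering described above.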
Bulkheads: spatial isolation under shared infrastructure
The bulkhead pattern takes a different angle. Where a circuit breaker is a temporal gate (it stops requests over time when failure accumulates), a bulkhead is a spatial partition: it assigns dedicated resources — thread pools, connection pools, memory — to each dependency, preventing one dependency's failure from exhausting shared resources and cascading to others.
When all services share a single thread pool, a slow or failing dependency can block every available thread. The next request — to a completely healthy service — finds no thread to run on. This is how a partial failure becomes a total outage.
Bulkheads limit the blast radius. A non-essential dependency can saturate its own pool and time out without touching the resources reserved for the critical path. This connects directly to the essential vs. non-essential classification: bulkheads are the mechanism by which you operationalize that classification at the infrastructure level.
The two patterns compose naturally:
- Bulkheads prevent failure from spreading across dependencies (spatial isolation).
- Circuit breakers prevent failure from propagating over time within a dependency (temporal isolation).
Applied together, a failing non-essential dependency saturates its dedicated pool (bulkhead), the failure rate trips the circuit (circuit breaker), and essential traffic continues unaffected on its own resources.
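A bulkhead can be approximated with one semaphore per dependency. The sketch below assumes a thread-based service and uses hypothetical names (`Bulkhead`, `BulkheadFull`); real deployments often use separate thread or connection pools instead, but the isolation idea is the same.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a dependency's compartment has no free slot."""


class Bulkhead:
    """Caps concurrent calls to one dependency with its own semaphore."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: a full compartment rejects immediately
        # instead of queueing and tying up the caller's thread.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(self.name)
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One compartment per dependency; sizes here are illustrative only.
recommendations = Bulkhead("recommendations", max_concurrent=5)
payments = Bulkhead("payments", max_concurrent=50)
```

Because each dependency has its own slots, exhausting `recommendations` raises `BulkheadFull` for recommendation calls only; calls through `payments` are unaffected.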
Backpressure: coordinating across service boundaries
Backpressure is a coordination signal, not a rejection. When a downstream service cannot keep up with incoming work, it propagates a "slow down" signal upstream rather than silently dropping requests or failing. This allows producers to buffer and pace their output to match what consumers can actually process.
Implementations include token bucket algorithms, leaky bucket rate limiting, and message broker queues (RabbitMQ, Kafka) — all of which smooth input spikes while maintaining throughput at the consumer's sustainable rate.
The Reactive Streams specification standardized demand-driven backpressure across asynchronous stream libraries (RxJava, Project Reactor, Akka Streams). The core mechanism is request(n) signaling: a subscriber declares how many elements it is ready to receive, and the publisher may not emit more than that until further demand is signaled, which prevents unbounded buffering and memory exhaustion. In RxJava 2 and later, Flowable is backpressure-aware and Observable is not: a Flowable source that outruns downstream demand without an overflow strategy signals MissingBackpressureException, whereas an Observable applies no flow control and can buffer without bound.
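The request(n) idea can be shown without any library. The classes below are hypothetical stand-ins written for this example; they imitate the shape of demand signaling but are not the Reactive Streams interfaces and skip most of the specification's rules (error signaling, cancellation, concurrency).

```python
class CollectingSubscriber:
    """Hypothetical consumer that records what it is sent."""

    def __init__(self):
        self.received = []
        self.done = False

    def on_next(self, item):
        self.received.append(item)

    def on_complete(self):
        self.done = True


class Subscription:
    """Hypothetical demand-driven source: emits only what was requested."""

    def __init__(self, items, subscriber):
        self._items = iter(items)
        self._subscriber = subscriber

    def request(self, n):
        # Deliver at most n elements; never push beyond declared demand.
        for _ in range(n):
            try:
                item = next(self._items)
            except StopIteration:
                self._subscriber.on_complete()
                return
            self._subscriber.on_next(item)


subscriber = CollectingSubscriber()
subscription = Subscription(range(5), subscriber)
subscription.request(2)  # the consumer asks for two elements and gets two
```

The producer holds the remaining three elements until the consumer calls `request` again, so the consumer's pace bounds how much is ever in flight.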
For backpressure to work across service boundaries, the path must speak a common language about limits. Concurrency limits per service layer and admission control prevent the system from entering a metastable overloaded state in the first place — they are preventive, not reactive.
Real systems combine three strategies simultaneously rather than picking one:
- Bounded buffers absorb small transient spikes.
- Intentional shedding drops requests with explicit acknowledgment when bounds are exceeded.
- Upstream propagation lets components receiving backpressure reduce their own input rate.
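The three strategies in the list above can be combined in one small sketch. `BoundedStage` and `produce` are hypothetical names; the example assumes a single process and uses the standard library's bounded queue as the buffer.

```python
import queue


class BoundedStage:
    """A processing stage with a bounded buffer and explicit shedding."""

    def __init__(self, max_buffer):
        self._buffer = queue.Queue(maxsize=max_buffer)
        self.shed_count = 0

    def offer(self, item):
        """Return True if buffered, False if shed (an explicit signal)."""
        try:
            self._buffer.put_nowait(item)
            return True
        except queue.Full:
            self.shed_count += 1
            return False

    def take(self):
        return self._buffer.get_nowait()


def produce(stage, items):
    """Upstream side: stop sending at the first rejection.

    Returns the items not yet accepted so the caller can pace itself
    and try again later instead of pushing blindly.
    """
    items = list(items)
    for index, item in enumerate(items):
        if not stage.offer(item):
            return items[index:]
    return []
```

The buffer absorbs small bursts, `offer` returning False is the acknowledged shed, and `produce` reacting to it is the upstream propagation.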
Load shedding: controlled degradation under overload
Load shedding is the flip side of backpressure. Where backpressure says "slow down, I'll take it eventually," load shedding says "I cannot take this at all — drop it." The AWS Builders' Library on load shedding frames it as a tool to prevent overload by refusing work that the system cannot complete in time.
The key mechanism that makes load shedding principled rather than random is deadline propagation. By attaching an end-to-end deadline (a TTL) to each request as it enters the system, every service in the call stack can answer the question: "if I process this request now, will the result arrive before the caller's deadline passes?" Processing a request whose deadline has already expired is wasted work. Shedding it frees capacity for requests that can still succeed.
Deadline-driven load shedding also prevents a specific anti-pattern: doing work for a request, or a retry of it, after the caller has already given up. Without deadline propagation, a retry that arrives after the caller has abandoned the original request adds load without producing value.
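A deadline check at each hop might look like the following. This is a sketch that assumes an absolute deadline on a clock all hops share, which only holds within one process; across hosts it is usually easier to pass the remaining time budget, since clocks differ. `handle`, `estimated_cost`, and `DeadlineExceeded` are hypothetical names.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when a request cannot finish before its deadline."""


def handle(request_deadline, estimated_cost, work, now=time.monotonic):
    """Shed the request if it cannot complete before its absolute deadline.

    request_deadline: absolute time (same clock as now()) set at entry.
    estimated_cost:   assumed seconds this hop needs to do its work.
    work:             callable that receives the deadline to pass downstream.
    """
    remaining = request_deadline - now()
    if remaining <= 0 or remaining < estimated_cost:
        raise DeadlineExceeded(
            f"{remaining:.3f}s left, need about {estimated_cost:.3f}s"
        )
    # Forward the same deadline so every downstream hop sees what is left.
    return work(request_deadline)
```

Rejecting before doing the work is the point: the check costs almost nothing, while processing an expired request costs a full unit of capacity.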
Essential vs. non-essential traffic
Load shedding requires knowing what to shed. This is not an operational decision — it is a design decision made before incidents happen.
Graceful degradation requires explicit upfront classification of features and dependencies as essential or non-essential:
- Essential: must complete even if non-essential dependencies fail. These get priority resources, protected thread pools, and are never shed.
- Non-essential: can be gated, disabled, or degraded. These are shed first under load, and their dependencies are candidates for conversion to soft dependencies.
The fundamental mechanism is converting hard dependencies into soft ones. A hard dependency is one whose failure causes an immediate outage; a soft dependency is one whose failure reduces non-essential functionality but preserves core service. This conversion is enforced through explicit fallback paths — cached responses, default values, feature flags that disable non-essential features under load.
Production outage analysis shows that inbound overload, resource exhaustion, and dependency failures account for the majority of service failures. For each failure mode, the essential/non-essential classification determines the response: shed low-priority traffic for overload, reduce feature set for resource exhaustion, return cached responses for dependency failures.
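Encoding the classification ahead of time can be as simple as a lookup table consulted at admission. The sketch below is one possible shape; the `Priority` names, the utilization metric, and the 0.80 threshold are assumptions chosen for illustration.

```python
from enum import IntEnum


class Priority(IntEnum):
    ESSENTIAL = 0
    NON_ESSENTIAL = 1


# Hypothetical policy: utilization above which each class is shed.
# None means the class is never shed, matching the rule for essential traffic.
SHED_ABOVE = {
    Priority.ESSENTIAL: None,
    Priority.NON_ESSENTIAL: 0.80,
}


def admit(priority, utilization):
    """Return True if a request of this priority should be admitted."""
    threshold = SHED_ABOVE[priority]
    return threshold is None or utilization < threshold
```

Because the table is written down before any incident, the shedding decision during overload is a lookup rather than a debate.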
Retry budgets and exponential backoff
Retries are not free. At scale, naive retry logic is one of the primary mechanisms by which a partial failure becomes a sustained metastable outage. Every retry is an additional request to an already-struggling service.
Exponential backoff addresses one dimension of this: instead of retrying immediately or at fixed intervals, wait times increase exponentially (1s, 2s, 4s, 8s…), giving the downstream service time to recover. But exponential backoff alone is not sufficient.
Jitter — adding randomness to backoff intervals — is essential to prevent synchronized retry storms. Without jitter, all clients that failed at the same moment will retry at the same moment (1s later, 2s later, 4s later), producing a synchronized thundering herd that hammers the service repeatedly. Full jitter, equal jitter, and decorrelated jitter are the main variants, each with different latency and storm-prevention trade-offs.
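The three jitter variants are commonly written as follows. This sketch follows the formulations popularized by AWS's writing on backoff and jitter; the function names, the default `base` and `cap` values, and the injectable `rng` are assumptions made for the example.

```python
import random


def full_jitter(attempt, base=1.0, cap=60.0, rng=random):
    """Uniform over [0, capped exponential delay]."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))


def equal_jitter(attempt, base=1.0, cap=60.0, rng=random):
    """Half the capped exponential delay is fixed, the other half is random."""
    ceiling = min(cap, base * 2 ** attempt)
    return ceiling / 2 + rng.uniform(0, ceiling / 2)


def decorrelated_jitter(previous_delay, base=1.0, cap=60.0, rng=random):
    """Next delay depends on the previous delay rather than the attempt count."""
    return min(cap, rng.uniform(base, previous_delay * 3))
```

Full jitter spreads clients most widely but can produce very short waits; equal jitter guarantees a minimum wait at the cost of less spread; decorrelated jitter needs only the previous delay as state.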
But backoff and jitter solve the timing problem, not the volume problem. This is where retry budgets come in. A retry budget is a hard limit on the total number of retries allowed within a window. Once the quota is exhausted, further retries are blocked. This directly addresses the amplification phase of metastable failures: retries are a bounded resource, not an unlimited one.
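A retry budget in its simplest form is a counter that resets each window. The sketch below uses a fixed window for brevity; token-bucket or ratio-based budgets are common alternatives. `RetryBudget` and its parameters are hypothetical, and the class is not thread-safe.

```python
import time


class RetryBudget:
    """Fixed-window budget: at most max_retries retries per window."""

    def __init__(self, max_retries, window_seconds, clock=time.monotonic):
        self.max_retries = max_retries
        self.window_seconds = window_seconds
        self._clock = clock
        self._window_start = clock()
        self._used = 0

    def try_acquire(self):
        """Return True if a retry may be sent, False if the budget is spent."""
        now = self._clock()
        if now - self._window_start >= self.window_seconds:
            self._window_start = now   # new window: reset the counter
            self._used = 0
        if self._used >= self.max_retries:
            return False
        self._used += 1
        return True
```

A client would call `try_acquire()` before every retry (not before first attempts) and give up, or fall back, when it returns False.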
Retry amplification compounds across service layers. If service A makes up to 3 attempts per request, and service B (which A calls) makes up to 3 attempts against its own dependency for each call it receives, that dependency sees up to 9 requests for each original request to A. In a deep dependency chain, this multiplier grows exponentially with depth. Retry budgets must be designed with the full call graph in mind, not just the immediate caller.
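The worst-case multiplier is the product of the attempt counts along the call path, where an attempt count is the first try plus any retries. A small helper (hypothetical name) makes the arithmetic explicit:

```python
import math


def worst_case_leaf_requests(attempts_per_layer):
    """Upper bound on requests reaching the deepest service.

    attempts_per_layer: max attempts (first try + retries) made by each
    calling layer, ordered from the outermost caller inward.
    """
    return math.prod(attempts_per_layer)


two_layers = worst_case_leaf_requests([3, 3])       # 3 * 3
three_layers = worst_case_leaf_requests([3, 3, 3])  # 3 * 3 * 3
```

Note the off-by-one risk when reading policies: "3 retries" usually means 4 attempts, which turns two layers into 16 requests rather than 9.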
Finally, safe retries depend on idempotency. Idempotency keys — caller-provided request identifiers — ensure that a retry for an operation that may have already completed returns the original result without re-executing side effects. Without idempotency, retries risk double-processing: charging a payment twice, creating a duplicate record, or sending a notification twice.
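An idempotency key turns a repeated request into a lookup. The sketch below keeps results in an in-memory dictionary, which is an assumption for illustration only; a real service would need a durable store and an atomic check-and-insert. `PaymentService` and its fields are hypothetical.

```python
class PaymentService:
    """Hypothetical service that deduplicates by caller-provided key."""

    def __init__(self):
        self._results = {}   # idempotency_key -> stored receipt
        self.charges = []    # side effects actually performed

    def charge(self, idempotency_key, account, amount):
        # A retry with a known key returns the original result unchanged.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((account, amount))   # the side effect
        receipt = {
            "account": account,
            "amount": amount,
            "charge_no": len(self.charges),
        }
        self._results[idempotency_key] = receipt
        return receipt
```

Calling `charge` twice with the same key performs one charge and returns the same receipt both times, so a retry after a lost response is safe.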
Key Principles
1. Design the essential/non-essential classification before the incident, not during it. The line between essential and non-essential traffic is a product and architecture decision. The middle of an outage is the worst time to debate whether the recommendation service is critical. Make the classification explicit, encode it into your load shedding policy, and reserve resources accordingly.
2. Flow control works as a system, not as individual components. Circuit breakers prevent feedback loops from sustaining themselves. Bulkheads contain blast radius. Backpressure coordinates across boundaries. Load shedding maintains throughput for essential traffic under overload. Retries with budgets and jitter avoid adding to the problem. These patterns are designed to compose: layering them provides defense in depth that no single pattern can.
3. The feedback loop, not the trigger, is the real adversary. Removing the original cause of an incident is necessary but not sufficient. As long as the retry-load-timeout feedback loop is active, the system remains in the broken stable state. Flow control mechanisms succeed by disrupting or preventing this loop.
4. Backpressure preserves work; load shedding discards it. Choose based on whether the work has a deadline. Backpressure is preferable when all requests will eventually be processed and latency is acceptable. Load shedding is correct when expired-deadline requests produce zero value and holding them consumes capacity needed for fresh requests.
5. Retries are a bounded resource; treat them that way. Unlimited retries under sustained overload are the mechanism by which a recoverable partial failure becomes a prolonged outage. Retry budgets are the hard boundary that prevents this amplification from running indefinitely.
Worked Example
Scenario: An e-commerce checkout service depends on three downstream services: a payments service (essential), a recommendations engine (non-essential), and a fraud-check service (essential). Traffic spikes 5x during a flash sale. The fraud-check service begins timing out under the load.
What happens without flow control:
- Checkout requests to fraud-check time out after the configured timeout (say, 5s).
- Checkout threads are held for 5 seconds per failed request.
- Clients retry immediately. Retries consume additional fraud-check capacity.
- Thread pool saturation spreads — checkout itself starts timing out from the perspective of callers.
- The load spike ends, but the retry storm sustains the overload. The system stays in the broken state.
What happens with layered flow control:
First, design decisions made before the incident:
- Fraud-check is classified as essential; it gets a dedicated thread pool (bulkhead) and the checkout service has an explicit fallback: flag the order for manual review if fraud-check is unavailable.
- Recommendations is classified as non-essential; it is in a separate pool and load shedding policy drops recommendation requests first.
- Each checkout request carries a deadline (e.g., 10s from entry) propagated into all downstream calls.
During the incident:
- Fraud-check begins timing out. The circuit breaker on the fraud-check client tracks the failure rate.
- After the threshold is crossed, the circuit opens. Fraud-check requests fail-fast; threads are not held. The fallback activates: orders are flagged for review rather than rejected.
- Fraud-check gets zero additional retry traffic. It begins recovering.
- Because the fallback returns a degraded but successful response, checkout callers have little reason to retry. Any retries that do occur go through exponential backoff with jitter, and the retry budget is not exhausted.
- The recommendations pool is already at capacity; load shedding drops those requests explicitly, preserving threads for checkout and payment flows.
- After the recovery timeout, the circuit moves to half-open. A probe request succeeds. The circuit closes. Normal processing resumes.
The system degraded gracefully — checkouts continued, some without fraud scoring — rather than collapsing entirely. Essential traffic was protected throughout. The feedback loop never got traction because retries to fraud-check were stopped at the circuit breaker.
Common Misconceptions
"Circuit breakers protect the caller." They protect both sides. The caller benefits from fail-fast (threads are not blocked waiting for a dead service), but the downstream service benefits equally — it receives zero retry traffic while the circuit is open, giving it room to recover. Framing the circuit breaker as only caller-side protection misses half of its function.
"Backpressure and load shedding are alternatives — pick one." They solve different problems and are designed to compose. Real systems employ all three buffering strategies simultaneously: bounded buffers absorb spikes, shedding handles exceeded bounds, upstream propagation allows rate reduction. The question is not which to pick but how to layer them.
"Adding retries improves reliability." Retries improve reliability for transient failures in a healthy system. Under sustained overload, retries without budgets and backoff are the mechanism of metastable failure amplification. The conditions that make retries safe (the service is recovering, the load will decrease) are precisely the conditions that fail to hold during a metastable failure. Retries require budgets, jitter, and idempotency to be safe.
"Removing the trigger ends the incident." As described in Bronson et al., metastable failures persist after the trigger is removed. The feedback loop is self-sustaining. Ending the incident requires actively disrupting the loop — typically by shedding load, throttling retries, or temporarily reducing concurrency — not just fixing the original cause.
"A circuit breaker per service is sufficient." In sharded or cell-based architectures, a per-service circuit breaker applies a binary decision across all shards. A failure in one shard can open the circuit for all shards, cutting traffic to healthy shards and defeating the isolation the architecture was designed to achieve. In these systems, per-shard circuit breakers or load shedding with deadline propagation are more appropriate.
Boundary Conditions
Circuit breakers misfire on partial failures in sharded systems. As described above, a per-service breaker applies an all-or-nothing decision. If 10% of shards are unhealthy, the breaker may stay closed (the aggregate error rate is below the threshold) or open entirely (cutting off the 90% of shards that are still healthy). Neither outcome is correct. Load shedding with deadline propagation or per-shard circuit breakers handle partial failures more precisely.
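The keying part of a per-shard breaker can be sketched as failure tracking indexed by shard. This is deliberately simplified: it counts consecutive failures and omits the open-timeout and half-open probing a full breaker needs. `PerShardFailureTracker` and the shard names are hypothetical.

```python
from collections import defaultdict


class PerShardFailureTracker:
    """Tracks consecutive failures per shard so a bad shard is isolated alone."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self._failures = defaultdict(int)

    def record(self, shard, ok):
        # Any success resets that shard's count; failures accumulate.
        self._failures[shard] = 0 if ok else self._failures[shard] + 1

    def allow(self, shard):
        """True while this shard is below its own failure threshold."""
        return self._failures[shard] < self.threshold
```

Because state is kept per shard, repeated failures on one shard stop traffic to that shard only; requests routed to other shards are still allowed.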
Backpressure requires end-to-end protocol support. Backpressure only works when every layer in the call path speaks the same language about load limits. A single layer that ignores the signal breaks the chain. HTTP/1.1 has no native backpressure mechanism; implementing it requires application-level signaling or protocol upgrades.
Retry budgets must account for call depth. A retry policy that looks reasonable at a single hop becomes dangerous amplification when multiplied along a call chain. A service that makes up to 3 attempts, called by a service that also makes up to 3 attempts, produces up to 9 requests at the leaf. Budget design requires reasoning about the full call graph.
Load shedding without classification is random damage. Shedding requests without a priority ordering is as likely to drop essential traffic as non-essential traffic. The essential/non-essential classification is the prerequisite that makes load shedding purposeful rather than arbitrary.
Idempotency is a correctness requirement for retries, not an optimization. Without idempotency, retries after network partitions or timeout ambiguity can duplicate side effects. This is especially dangerous for state-modifying operations (payments, order creation, inventory reservation). Retry paths and dead-letter queue replay both depend on idempotency to be safe.
Dead-letter queues require operational ownership. A DLQ isolates poison messages and enables recovery without blocking the main flow. But messages in a DLQ require human or automated intervention to process or discard. An unmonitored DLQ accumulates work that will never be completed — it defers the failure rather than handling it.
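The isolation behavior of a DLQ can be sketched with an in-memory queue. The function name, the `max_attempts` default, and the tuple layout are assumptions for illustration; a message broker would provide redelivery counts and the dead-letter destination itself.

```python
from collections import deque


def drain(messages, handler, max_attempts=3):
    """Process messages, moving repeated failures to a dead-letter list.

    Returns the dead letters as (message, error) pairs. Something still
    has to own that list: inspect it, replay it, or discard it.
    """
    main = deque((message, 0) for message in messages)
    dead_letters = []
    while main:
        message, attempts = main.popleft()
        try:
            handler(message)
        except Exception as exc:
            attempts += 1
            if attempts >= max_attempts:
                # Poison message: set it aside so the main flow continues.
                dead_letters.append((message, repr(exc)))
            else:
                main.append((message, attempts))   # requeue for another try
    return dead_letters
```

The main queue always empties, which is the benefit; the returned list is the deferred failure that the paragraph above warns must be monitored.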
Key Takeaways
- Overloaded systems settle into one of two stable states: working normally or stuck-broken. Flow control mechanisms exist to ensure the system tips back toward normal operation rather than getting stuck in the broken state.
- The circuit breaker's value is as a feedback-loop disruptor. By halting requests to a failing service, it prevents retries from sustaining the load-timeout-retry cycle. Its three states (closed / open / half-open) implement a probe-and-recover protocol, not just an on/off switch.
- Bulkheads isolate failure spatially; circuit breakers isolate failure temporally. They compose: bulkheads contain blast radius per dependency, circuit breakers stop feedback loops over time. Neither alone is sufficient.
- Load shedding requires an upfront essential/non-essential classification. This is a design decision, not an operational one. Without it, shedding is random; with it, shedding preserves the critical path while discarding work that cannot complete in time.
- Retries are a bounded resource. Retry budgets, exponential backoff, and jitter are not optional polish — they are the difference between a self-limiting retry strategy and an amplification mechanism that sustains metastable failures.
Further Exploration
Foundational Research
- Metastable Failures in Distributed Systems (Bronson et al., HotOS 2021) — The foundational paper on bistability and feedback loops in distributed systems.
- Analyzing Metastable Failures (Isaacs & Alvaro, HotOS 2025) — Follow-up analysis of real-world metastable failure patterns.
Retry and Backoff Strategies
- AWS Builders' Library: Timeouts, Retries and Backoff with Jitter — Practical, opinionated guidance on retry discipline from an operator at scale.
- AWS Builders' Library: Making Retries Safe with Idempotent APIs — Why idempotency is a correctness requirement, not an optimization, for retry-safe systems.
Load Shedding and Backpressure
- AWS Builders' Library: Using Load Shedding to Avoid Overload — Load shedding strategy and the reasoning behind deadline-driven approaches.
- Backpressure by Design: Concurrency Limits and Admission Control — Contemporary patterns for implementing backpressure at the infrastructure level.
- Queues Don't Fix Overload (Fred Hebert) — Why unbounded queues are not a backpressure strategy, and what to do instead.
Circuit Breakers and Sharded Systems
- Circuit Breaker Pattern (Microsoft Azure Architecture Center) — Reference implementation guidance with state machine detail.
- Will circuit breakers solve my problems? (Marc Brooker) — A critical and nuanced examination of where circuit breakers fail, especially in sharded systems.
Architecture and Graceful Degradation
- AWS Well-Architected REL05: Graceful Degradation — The essential/non-essential classification framework and the hard-to-soft dependency conversion.