Pattern

Event-driven / saga

Distributed transactions rarely work well across service boundaries. Two-phase commit holds locks across network round-trips, which kills throughput, so large-scale systems generally avoid it between independently owned services. The saga pattern replaces global ACID with a chain of local transactions, each paired with a compensating action that undoes its effect when a later step fails.

Coordinate multi-service workflows via compensating transactions instead of distributed locks — choreography for simple flows, orchestration for everything else.

Architecture diagram· Saga skeleton — Order → Payment → Inventory → Shipping

Each service owns one step. Compensating arrows show the rollback path when a later step fails.

You’re looking at this pattern when

Multi-service workflow (order → payment → inventory → shipping)
Independent deploy + failure isolation per service
Audit trail is a product requirement

Shows up in

E-commerce checkout (order / payment / inventory / shipping)
Ride-hailing request lifecycle (match / dispatch / trip / payment / rating)

What most people get wrong

Reaching for a saga when all writes are within one service and one DB. In that case, just use a local transaction.

When to reach for this

Reach for this when…

Multi-service workflow (order → payment → inventory → shipping)
Independent deploy + failure isolation per service
Audit trail is a product requirement
Workflow steps owned by different teams
Downstream consumers vary over time (add analytics, search indexer without touching upstream)

Not really this pattern when…

All writes are within one service and one DB — just use a local transaction
Workflow must be strongly consistent with no tolerance for temporary inconsistency
Team cannot yet operate eventually-consistent systems or run a workflow engine

Good vs bad answer

Interviewer probe

“Design a checkout flow across order, payment, inventory, and shipping services.”

Weak answer

"Order service calls payment synchronously, payment calls inventory, inventory calls shipping. If any call fails, the caller handles the error and tells the upstream to roll back. We use HTTP for everything."

Strong answer

"I would use a saga orchestrator — Temporal — to drive the checkout workflow. The workflow function calls four activities in sequence: reserve_inventory, capture_payment, create_shipping_label, confirm_order. Each activity is a gRPC call to the owning service. Temporal persists the workflow state after each activity completes.

If capture_payment permanently fails (card declined), the workflow invokes the compensation for reserve_inventory (release the reservation). The compensation is idempotent: it checks whether the reservation has already been released before acting.

Every service uses the outbox pattern: business write + outbox row in one DB transaction. A Debezium CDC connector publishes outbox events to Kafka. This eliminates dual writes.

For side-effects (send confirmation email, update analytics), I publish domain events to Kafka from the outbox. These consumers are choreographic — they subscribe independently and do not affect the saga's critical path.

Observability: the saga_id propagates as the OpenTelemetry trace_id across all services. A dashboard shows sagas by state, p99 completion time, and compensation rate. A timeout sweeper moves sagas stuck longer than 30 minutes to COMPENSATING. DLQ depth is monitored — any message there triggers a PagerDuty alert."

Why it wins: Names the orchestration engine, designs compensations with idempotency, uses outbox to eliminate dual writes, separates critical path (orchestration) from side-effects (choreography), and adds observability with specific metrics.

Cheat sheet

•Saga = N local transactions + N-1 compensating actions. No distributed locks.
•Orchestration for >=4 steps or complex compensations. Choreography for 2-3 step reactive flows.
•The hybrid model: orchestration for critical path, choreography for side-effect fanout.
•Outbox pattern: business write + outbox row in one DB tx. Publisher tails outbox. No dual writes.
•Every step and every compensation must be idempotent. At-least-once delivery is the norm.
•Non-compensable steps go last (send email, push notification cannot be undone).
•Compensation is a new forward action, not a rollback. Payment captured → issue refund.
•State machine with conditional updates prevents double-processing: WHERE state = expected.
•Timeout sweeper catches dead sagas: non-terminal state > SLA → COMPENSATING.
•DLQ catches poison messages. Monitor DLQ depth — should be zero in steady state.
•saga_id as OpenTelemetry trace_id. One query → complete timeline across all services.
•Compensations retry with backoff until success. Permanent failure → DLQ + human review.

Core concept

Why distributed transactions are impractical across services, and what sagas replace them with

In a monolith, a checkout is one database transaction: BEGIN, debit the account, decrement inventory, create the shipment record, COMMIT. If any step fails, the database rolls everything back. That model breaks the moment data lives in more than one database.

Two-phase commit (2PC) is the textbook answer: a coordinator asks every participant to vote PREPARE, then broadcasts COMMIT or ABORT. In theory it guarantees atomicity. In practice it is a throughput disaster. Every participant holds locks from PREPARE until COMMIT, across a network round-trip and across services you do not control. One slow participant blocks every other participant's writes. One crashed coordinator leaves every participant hanging with locks held indefinitely. Google Spanner makes cross-shard 2PC workable by running each participant as a Paxos-replicated group and ordering transactions with TrueTime (GPS and atomic clocks); you are not Google.

CAP context: In a partitioned network, 2PC's coordinator cannot reach a participant, so the protocol blocks. You cannot have both availability and strong consistency across service boundaries when the network is unreliable. Sagas choose availability and eventual consistency.

The saga: local transactions + compensations

A saga replaces one global ACID transaction with N local transactions, each executed within a single service's database. Each local transaction has a paired compensating action that semantically undoes its effect. If step K fails, the saga invokes compensations for steps K-1 down to 1, in reverse order.
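
The reverse-order rule above can be sketched in a few lines of plain Python. This is a minimal in-memory illustration, not a production orchestrator: the `Step` type, `run_saga`, and the toy checkout steps are hypothetical names introduced here, and there is no persistence or retry.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]       # the local transaction
    compensate: Callable[[], None]   # semantically undoes the action


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # Step K failed, so only steps K-1 down to 1 are compensated.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail_shipping() -> None:
    raise RuntimeError("carrier unavailable")


checkout = [
    Step("payment", lambda: log.append("capture"), lambda: log.append("refund")),
    Step("inventory", lambda: log.append("reserve"), lambda: log.append("release")),
    Step("shipping", fail_shipping, lambda: log.append("cancel_label")),
]
ok = run_saga(checkout)
```

Here the third step fails, so the runner releases the reservation first and refunds the payment second; the failed step's own compensation is never called.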

Architecture diagram· Saga skeleton — Order → Payment → Inventory → Shipping

Each service owns one step. Compensating arrows show the rollback path when a later step fails.

Choreography vs orchestration. These are the two ways to wire a saga:

Choreography (event-driven, decentralised): each service publishes a domain event when its step completes; the next service subscribes and reacts. No coordinator — the workflow is an emergent property of event subscriptions. Works for simple 2-3 step flows. Breaks down when you need to answer "what is the current state of order #12345?" because no single service owns the answer.

Orchestration (coordinator-driven, centralised): a dedicated saga orchestrator (Temporal, AWS Step Functions, Cadence, Conductor) drives each step, holds the workflow state, retries transient failures, and invokes compensations on permanent failure. The workflow is code you can read top-to-bottom. Non-negotiable for anything beyond 3 steps or when compensations are complex.

Compensating transactions — the saga's rollback

A compensation is not an "undo" — it is a new forward action that corrects a previous action's effect. Payment was captured → issue a refund. Inventory was reserved → release the reservation. Shipping label was created → cancel the label. Key properties:

1. Idempotent. The compensation may be invoked more than once (retries, replays). It must produce the same outcome regardless of how many times it runs.
2. Commutative with late arrivals. A compensation may execute before the original step's side effects have fully propagated, or even before a delayed forward request arrives. Record that the compensation ran so the late forward step is rejected rather than applied after the fact.
3. Never failing permanently. If a compensation cannot complete, it retries with backoff, and if retries are exhausted it lands in a dead-letter queue for human review. A compensation that fails and is then abandoned leaves the system in a permanently inconsistent state.

The outbox pattern — atomic write + publish

The most dangerous anti-pattern in event-driven systems is the dual write: the service writes to its database and publishes an event to the bus. If the process crashes between the two, the database has the state but the event is lost — or the event was sent but the database write failed. The system drifts silently.

The outbox pattern fixes this. The service writes the business row and an outbox row in the same database transaction. A separate process (CDC connector, poller) tails the outbox table and publishes events to the bus. If the publisher crashes, it resumes from the last published outbox sequence number. This gives at-least-once publication: no committed event is lost, but an event can be published twice if the publisher crashes after sending and before recording its progress. Consumers must therefore be idempotent.

Idempotency is not optional

At-least-once delivery is the norm in every production message bus (Kafka, SQS, Pub/Sub). Every consumer, every saga step, every compensation must be idempotent. Common strategies: deduplicate by event_id in a seen-events table, use database upserts keyed on business ID, or design operations that are naturally idempotent (setting a status to CANCELLED is idempotent; incrementing a counter is not).

Canonical examples

→E-commerce checkout (order / payment / inventory / shipping)
→Ride-hailing request lifecycle (match / dispatch / trip / payment / rating)
→Banking transfer with rollback on AML flag
→Travel booking across airline + hotel + car rental

Variants

Monolith transaction

Single process, single database, one ACID transaction wraps everything.

Architecture diagram· v1 — Monolith transaction

Single database, single process. BEGIN → order + payment + inventory + shipping → COMMIT.

The monolith transaction is not a saga at all — it is the starting point you are evolving away from. A single relational database handles all writes inside one BEGIN/COMMIT block. If any step fails, the database rolls back atomically. No compensations needed, no event bus, no distributed state.

Why it works. Relational databases have spent 40 years perfecting ACID transactions. Write-ahead logs, MVCC, lock managers — the entire stack exists to make BEGIN/COMMIT reliable. At moderate scale (single-digit thousands of writes per second), this is the right answer. The monolith gives you strong consistency, simple debugging (one stack trace, one log stream), and trivial rollback.

Why it stops working. The moment you need independent deployment per team, independent scaling per component, or different storage engines per domain (SQL for orders, NoSQL for recommendations), the monolith transaction breaks. You cannot wrap a Postgres INSERT and a DynamoDB PutItem in the same BEGIN/COMMIT. You also cannot hold database locks across a network call without killing throughput.

When to stay here. Until your team is large enough that deploy coupling is the bottleneck, or until your read/write patterns diverge enough that one database cannot serve all access patterns. Many startups should stay here longer than they think.

Pros

+Strong consistency by default
+Simple debugging and rollback
+No operational overhead

Cons

−Deploy coupling — one release for all code
−Single DB bottleneck
−Cannot span heterogeneous storage

Choose this variant when

Small team, single deployment unit
All data fits in one relational database
Throughput is well within one DB's capacity

Choreography saga

Services react to each other's domain events — no central coordinator.

Architecture diagram· v2 — Choreography

Services publish events to a shared bus. Each service reacts independently. No coordinator.

In a choreographed saga, each service listens for events from upstream services and publishes its own domain events when its step completes. The workflow emerges from the union of all subscriptions. There is no single process that "knows" the full workflow.

Architecture diagram· Choreography event flow

Order → bus → Payment listens → bus → Inventory listens → bus → Shipping listens.

Mechanics. Order service publishes OrderPlaced. Payment service subscribes, captures payment, publishes PaymentCaptured. Inventory service subscribes to PaymentCaptured, reserves stock, publishes StockReserved. Shipping service subscribes to StockReserved, creates the label, publishes ShipmentCreated. Each service owns its own database and its own event publication.

Compensations in choreography. If Inventory fails to reserve stock, it publishes StockReservationFailed. Payment service subscribes to that event and issues a refund. Order service subscribes and marks the order as failed. Each compensation is triggered by an event, not by a coordinator — so every service must know which failure events to listen for and what to do.
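
A toy version of this event flow, assuming a synchronous in-memory bus in place of a real broker. The `Bus` class and the `PaymentRefunded` event name are illustrative assumptions; the other event names come from the text above.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class Bus:
    """Toy synchronous event bus; a real system would use Kafka or similar."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict) -> None:
        self.history.append(event_type)
        for handler in self.handlers[event_type]:
            handler(payload)


bus = Bus()
stock = {"sku-1": 0}  # nothing in stock, so the reservation will fail
order_status: Dict[str, str] = {}


def on_order_placed(evt: Dict) -> None:      # Payment service: forward step
    bus.publish("PaymentCaptured", evt)


def on_payment_captured(evt: Dict) -> None:  # Inventory service: forward step
    if stock.get(evt["sku"], 0) > 0:
        stock[evt["sku"]] -= 1
        bus.publish("StockReserved", evt)
    else:
        bus.publish("StockReservationFailed", evt)


def refund_payment(evt: Dict) -> None:       # Payment service: compensation
    bus.publish("PaymentRefunded", evt)


def mark_order_failed(evt: Dict) -> None:    # Order service: compensation
    order_status[evt["order_id"]] = "FAILED"


bus.subscribe("OrderPlaced", on_order_placed)
bus.subscribe("PaymentCaptured", on_payment_captured)
bus.subscribe("StockReservationFailed", refund_payment)
bus.subscribe("StockReservationFailed", mark_order_failed)

bus.publish("OrderPlaced", {"order_id": "o-1", "sku": "sku-1"})
```

Note that no function in this sketch describes the whole workflow. The rollback path exists only because two services happen to subscribe to the same failure event.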

The debugging problem. "What is the state of order #12345?" To answer, you must query Order, Payment, Inventory, and Shipping services and piece together the timeline. In practice, teams build a saga-status aggregator that materialises a view from all events — which is halfway to building an orchestrator anyway. This is why choreography works for simple flows and collapses at 4+ steps.

When choreography fits. Simple reactive workflows: on user signup, send welcome email and create analytics profile. Two consumers, no compensation needed, stable event schemas. The moment you need compensations or the workflow has branches, switch to orchestration.

Pros

+No single point of failure for the coordinator
+Easy to add new consumers (no upstream changes)
+Decoupled deploys

Cons

−Workflow is implicit — exists only as emergent behavior
−Debugging requires correlating events across N services
−Compensating cascades are subtle and error-prone

Choose this variant when

2-3 step linear workflows
No complex compensations
Team cannot operate a workflow engine yet

Orchestrator saga

A dedicated coordinator drives each step and handles rollback.

Architecture diagram· v3 — Orchestrator

Central coordinator calls each service in sequence and handles rollback on failure.

An orchestrated saga uses a central coordinator — Temporal, AWS Step Functions, Cadence, Netflix Conductor — to drive the workflow. The coordinator calls each service's activity endpoint in sequence, waits for the result, decides the next step, and invokes compensations in reverse order on permanent failure.

Architecture diagram· Orchestrator call sequence with rollback

Coordinator calls each step. On step 3 failure, it walks backwards invoking compensations.

The workflow is code. In Temporal, a saga is a function: call reserve_inventory(), then capture_payment(), then create_shipping_label(). If capture_payment throws a non-retryable error, the workflow code catches it and calls release_inventory() as the compensation. You can read the workflow top-to-bottom, unit-test it with mocked activities, and deploy it independently from the services.

State is durable. The orchestrator persists the saga state to its own database after every step. If the orchestrator process crashes, a new instance picks up from the last persisted state. This is fundamentally different from choreography, where a crash means the workflow is partially complete with no single owner to resume it.

Retry semantics. Temporal retries activities according to a retry policy (with backoff) and stops on errors marked non-retryable, at which point the workflow runs its compensations. Step Functions expresses the same idea with Retry and Catch fields in its state definitions. In either model, the orchestrator handles retries so individual services do not need their own retry loops; they fail fast and let the coordinator decide.
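
The split between transient and permanent errors can be illustrated without any workflow engine. The sketch below is plain Python, not the Temporal SDK or Step Functions; `TransientError`, `PermanentError`, `call_with_retry`, and `checkout_workflow` are hypothetical names, and the retry budget and delays are arbitrary.

```python
import time
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Worth retrying, for example a timeout."""


class PermanentError(Exception):
    """Retrying cannot help, for example a declined card."""


def call_with_retry(activity: Callable[[], T], max_attempts: int = 3,
                    base_delay: float = 0.01) -> T:
    for attempt in range(1, max_attempts + 1):
        try:
            return activity()
        except TransientError:
            if attempt == max_attempts:
                raise PermanentError("retries exhausted")
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise AssertionError("unreachable")


def checkout_workflow(reserve_inventory: Callable[[], None],
                      capture_payment: Callable[[], None],
                      release_inventory: Callable[[], None]) -> str:
    call_with_retry(reserve_inventory)
    try:
        call_with_retry(capture_payment)
    except PermanentError:
        call_with_retry(release_inventory)  # compensation for the reservation
        return "FAILED"
    return "COMPLETED"


calls: List[str] = []
attempts: Dict[str, int] = {"reserve": 0}


def flaky_reserve() -> None:
    attempts["reserve"] += 1
    if attempts["reserve"] < 2:
        raise TransientError("timeout")  # first call fails, second succeeds
    calls.append("reserve_inventory")


def declined() -> None:
    raise PermanentError("card declined")


result = checkout_workflow(flaky_reserve, declined,
                           lambda: calls.append("release_inventory"))
```

The services themselves only raise; the decision to retry or compensate lives in one place.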

Cost and coupling. The orchestrator is a piece of infrastructure you must operate: database, workers, monitoring. Each service couples to the orchestrator's activity interface (gRPC, HTTP, or SDK-specific). For high-request-volume workflows (>10K/s), orchestrator throughput and database write amplification become sizing concerns.

Pros

+Workflow is readable, testable code
+Built-in retry, timeout, compensation
+Saga state is queryable
+Time-travel debugging in Temporal/Cadence

Cons

−Operational overhead of the workflow engine
−Services couple to the engine's activity interface
−Write amplification at very high request rates

Choose this variant when

4+ step workflows
Complex compensations with branching logic
Need to query saga state by business ID

Orchestrator + state machine + outbox

Production-hardened orchestration with explicit state machine, outbox pattern, and DLQ.

Architecture diagram· v4 — Orchestrator + state machine + outbox + DLQ

Production-grade: state machine tracks saga, outbox guarantees delivery, DLQ catches poison pills.

The production-grade orchestrator adds three concerns on top of the basic coordinator pattern: an explicit state machine, the outbox pattern for every event publication, and a dead-letter queue for poison messages.

Explicit state machine. Each saga instance has a state: STARTED → PAYMENT_CAPTURED → INVENTORY_RESERVED → SHIPPING_CREATED → COMPLETED (happy path) or → COMPENSATING → COMPENSATED → FAILED (rollback path). The orchestrator persists this state in a database row keyed by saga_id. Every transition is a conditional UPDATE (WHERE state = expected_state) — so concurrent duplicate deliveries cannot advance the saga past a step it has already completed.

Outbox for every event. When the orchestrator advances the state machine, it writes the new state AND an outbox event in the same database transaction. A CDC connector (Debezium) or a poller reads the outbox and publishes to the event bus. This eliminates the dual-write hazard. If the orchestrator crashes between the DB write and the event publish, the outbox retains the event and the publisher picks it up on restart.

Dead-letter queue. Messages that cannot be processed after N retries land in a DLQ. A human or automated remediation process inspects them. Without a DLQ, a poison message blocks the entire partition. The DLQ also serves as an audit trail of infrastructure failures.
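
As a rough sketch of the retry-then-park behaviour, assuming an in-memory queue and a retry budget of three (both are assumptions for illustration; real brokers and consumer libraries provide their own DLQ mechanisms):

```python
from collections import deque
from typing import Callable, Deque, Dict, List

MAX_ATTEMPTS = 3  # assumed retry budget (the "N" in the text)


def drain(queue: Deque[Dict], handle: Callable[[Dict], None],
          dlq: List[Dict]) -> None:
    """Process messages in order; park a message in the DLQ after MAX_ATTEMPTS."""
    while queue:
        msg = queue[0]
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                handle(msg)
                break
            except Exception as exc:
                if attempt == MAX_ATTEMPTS:
                    dlq.append({"message": msg, "error": str(exc),
                                "attempts": attempt})
        queue.popleft()  # move on either way so later messages are not blocked


processed: List[int] = []


def handle(msg: Dict) -> None:
    if msg.get("poison"):
        raise ValueError("cannot parse payload")
    processed.append(msg["id"])


queue: Deque[Dict] = deque([{"id": 1}, {"id": 2, "poison": True}, {"id": 3}])
dead_letters: List[Dict] = []
drain(queue, handle, dead_letters)
```

Message 3 is still processed even though message 2 can never succeed, which is the point of the DLQ.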

Saga timeouts. A background sweeper queries for sagas stuck in a non-terminal state beyond a configurable TTL (e.g., 30 minutes for checkout). It moves them to COMPENSATING. This catches coordinator crashes, network black holes, and services that accepted a request but never responded. The sweeper must be idempotent — running it twice on the same saga must not double-compensate.

Pros

+No dual writes — outbox guarantees delivery
+State machine prevents double-processing
+DLQ prevents partition blocking
+Timeout sweeper catches dead sagas

Cons

−Significant infrastructure: state DB + outbox + CDC + DLQ
−Operational complexity — more components to monitor
−Higher latency per step due to outbox relay

Choose this variant when

Mission-critical workflows where data loss is unacceptable
Team has SRE capacity to operate CDC and DLQ infrastructure
Regulatory requirements demand an audit trail of every state transition

Scaling path

Monolith transaction

Ship the feature. One database, one process, one transaction.

Architecture diagram· v1 — Monolith transaction

Single database, single process. BEGIN → order + payment + inventory + shipping → COMMIT.

All writes happen inside a single BEGIN/COMMIT block. Simplest possible approach. Works until deploy coupling or single-DB throughput becomes the bottleneck.

What triggers the next iteration

Deploy coupling — all teams release together
Single DB cannot handle mixed workloads at scale
Cannot use heterogeneous storage

Choreography

Decouple services. Each owns its data and publishes events.

Architecture diagram· v2 — Choreography

Services publish events to a shared bus. Each service reacts independently. No coordinator.

Services communicate via domain events on a shared bus. No coordinator. Each service subscribes to the events it cares about. Compensations are event-driven too.

What triggers the next iteration

Workflow is implicit — debugging requires correlating events across services
Compensation cascades are error-prone at 4+ steps
"What is the state of saga X?" has no single owner

Orchestrator

Make the workflow explicit. Central coordinator drives steps and rollback.

Architecture diagram· v3 — Orchestrator

Central coordinator calls each service in sequence and handles rollback on failure.

A Temporal/Step Functions coordinator calls each service in sequence, persists state, retries transient failures, and invokes compensations on permanent failure. Workflow is readable code.

What triggers the next iteration

Dual-write hazard if orchestrator writes DB then publishes event separately
Dead sagas when coordinator crashes or services go dark
No audit trail of state transitions

Orchestrator + observability

Production-harden with outbox, state machine, DLQ, timeout sweeper, and tracing.

Architecture diagram· v4 — Orchestrator + state machine + outbox + DLQ

Production-grade: state machine tracks saga, outbox guarantees delivery, DLQ catches poison pills.

Outbox eliminates dual writes. Explicit state machine prevents double-processing. DLQ catches poison messages. Timeout sweeper detects dead sagas. Distributed tracing (saga_id as trace context) gives end-to-end visibility.

What triggers the next iteration

Operational complexity of CDC, DLQ, sweeper
Outbox relay adds latency per step
State DB becomes a throughput bottleneck at extreme scale

Deep dives

Choreography vs orchestration — when each wins

Architecture diagram· Choreography event flow

Order → bus → Payment listens → bus → Inventory listens → bus → Shipping listens.

The choreography-vs-orchestration decision is the first fork in saga design. Getting it wrong can cost months of migration work.

Choreography wins when the workflow is simple (2-3 linear steps), the event schemas are stable, and no compensation is needed. Example: on UserRegistered, send a welcome email and create an analytics profile. Two subscribers, both idempotent, no rollback scenario. Choreography also wins when you need extreme decoupling — adding a 10th consumer to UserRegistered requires zero changes to the publisher.

Orchestration wins the moment any of these are true: (a) the workflow has 4+ steps, (b) compensations exist, (c) the workflow has conditional branches (if payment fails, do X; if inventory fails, do Y), (d) you need to query the saga state by business ID, (e) the workflow must complete within an SLA (timeout detection). In practice, this is most production workflows.

Architecture diagram· Orchestrator call sequence with rollback

Coordinator calls each step. On step 3 failure, it walks backwards invoking compensations.

The hybrid model. Some teams use orchestration for the core saga (order → payment → inventory → shipping) and choreography for side-effects (publish OrderConfirmed to Kafka for analytics, email, search indexing). The orchestrator drives the critical path; the event bus fans out to non-critical consumers. This is a common production pattern at scale.

Migration path. Starting with choreography and migrating to orchestration is painful — you must introduce a coordinator, replay in-flight sagas, and deprecate the old event subscriptions without losing messages. Starting with orchestration and adding choreographic side-effects is cheap. When in doubt, start with orchestration.

Interviewer signal. If a candidate says "choreography" for a 5-step order workflow, the interviewer will probe: "How do you know the current state of order #12345?" If the candidate cannot answer without describing an orchestrator, they have proved the point.

Compensating transactions — designing the undo

Architecture diagram· Compensating transaction cascade

Steps 1 ✓, 2 ✓, 3 ✗ → compensate step 2 → compensate step 1 → saga FAILED.

A compensating transaction is not a database rollback. It is a new forward action that semantically reverses a previous action's business effect. The original action's database writes are committed and visible; the compensation writes new data that corrects the outcome.

Design rules for compensations:

1. Every forward step must have a defined compensation before you build it. If you cannot articulate the compensation, the step is not saga-safe. Example: sending a push notification has no compensation, because you cannot un-notify a user. Put non-compensable steps last in the saga so they execute only after all compensable steps have succeeded.

2. Compensations must be idempotent. The compensation may be invoked more than once due to retries, replays, or duplicate event delivery. A refund compensation must check whether a refund has already been issued (by idempotency key) before issuing another one.

3. Compensations must be ordered. If the saga completed steps 1 and 2 and step 3 failed, compensations run in reverse over the completed steps: compensate 2, then compensate 1. Step 3 needs no compensation when its local transaction rolled back; if it may have partially succeeded (for example, a timeout with an unknown outcome), compensate it first. The orchestrator guarantees this ordering. In choreography, ordering is implicit in the event graph and much harder to verify.

4. Compensations must handle concurrent state. Between the original step and the compensation, other processes may have read or acted on the intermediate state. A stock reservation that is compensated (released) may conflict with another saga that reserved the same stock. The compensation must handle this, typically via optimistic concurrency (version checks) or pessimistic locks at the service level.

5. Compensations must never fail permanently. A compensation that throws a non-retryable error leaves the system in a partially compensated state with no automated recovery. Design compensations to retry with exponential backoff. If they still fail after max retries, route to a dead-letter queue for human intervention.
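
One way to make a refund compensation idempotent is to key it on the saga and step, and let a primary key absorb replays. The sketch below uses SQLite only so that it is self-contained; the `refunds` table, the key format, and `refund_payment` are assumptions for illustration, and a real payment provider would receive the same key through its own idempotency mechanism.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE refunds (
        idempotency_key TEXT PRIMARY KEY,  -- saga_id plus step name
        payment_id      TEXT NOT NULL,
        amount_cents    INTEGER NOT NULL
    )
""")


def refund_payment(conn: sqlite3.Connection, saga_id: str,
                   payment_id: str, amount_cents: int) -> bool:
    """Record a refund at most once per saga. True if this call created it."""
    key = f"{saga_id}:refund_payment"
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO refunds VALUES (?, ?, ?)",
            (key, payment_id, amount_cents),
        )
    return cur.rowcount == 1  # 0 means a replay hit the existing row


first = refund_payment(db, "saga-1", "pay-9", 2500)
replay = refund_payment(db, "saga-1", "pay-9", 2500)
refund_count = db.execute("SELECT COUNT(*) FROM refunds").fetchone()[0]
```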

The semantic gap. Some business actions have imperfect compensations. A payment capture can be refunded, but the refund may take 5-10 business days to appear on the customer's statement. An email confirmation was sent and cannot be un-sent. These are "imperfect compensations" — they correct the system state but not the user experience. Acknowledge this in your design and handle it with customer communication (apology email, status page).

The outbox pattern — no more dual writes

Architecture diagram· Outbox pattern for atomic write + publish

Service writes business row + outbox event in one DB tx. Poller/CDC publishes to event bus.

The dual-write problem is the #1 source of silent data drift in event-driven systems. The service writes to its database, then publishes an event to the message bus. If the process crashes between the two operations, the database has the new state but the event was never published — downstream services are permanently out of sync, and nothing detects the drift.

How the outbox works:

1. The service starts a database transaction.
2. It writes the business data (e.g., INSERT INTO orders).
3. In the same transaction, it writes an outbox row (INSERT INTO outbox(event_id, topic, payload, created_at)).
4. It commits the transaction. Both writes succeed or both fail; atomicity is guaranteed by the local database.
5. A separate process, either a CDC connector (Debezium on Postgres WAL, DynamoDB Streams, MongoDB change streams) or a polling loop, reads unpublished outbox rows.
6. It publishes each event to the message bus (Kafka, SQS, Pub/Sub).
7. It marks the outbox row as published (or deletes it).
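
The seven steps can be sketched with SQLite standing in for the service database and a callback standing in for the message bus. The table layout, `place_order`, and `relay_once` are illustrative assumptions; this shows the polling variant, not CDC.

```python
import json
import sqlite3
import uuid
from typing import Callable, List, Tuple

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        seq          INTEGER PRIMARY KEY AUTOINCREMENT,
        event_id     TEXT NOT NULL UNIQUE,
        topic        TEXT NOT NULL,
        payload      TEXT NOT NULL,
        published_at TEXT
    );
""")


def place_order(conn: sqlite3.Connection, order_id: str) -> None:
    # Steps 1-4: the business row and the outbox row commit or roll back together.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'PLACED')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (event_id, topic, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "orders", json.dumps({"order_id": order_id})),
        )


def relay_once(conn: sqlite3.Connection,
               publish: Callable[[str, str], None]) -> int:
    # Steps 5-7: read unpublished rows in order, publish, then mark.
    rows = conn.execute(
        "SELECT seq, topic, payload FROM outbox "
        "WHERE published_at IS NULL ORDER BY seq"
    ).fetchall()
    for seq, topic, payload in rows:
        publish(topic, payload)  # a crash after this line re-publishes on restart
        with conn:
            conn.execute(
                "UPDATE outbox SET published_at = datetime('now') WHERE seq = ?",
                (seq,),
            )
    return len(rows)


sent: List[Tuple[str, str]] = []
place_order(db, "o-1")
first_pass = relay_once(db, lambda topic, payload: sent.append((topic, payload)))
second_pass = relay_once(db, lambda topic, payload: sent.append((topic, payload)))
```

Because the row is marked only after `publish` returns, a crash between the two produces a duplicate on the next pass rather than a lost event.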

CDC vs polling. CDC (Change Data Capture) tails the database's write-ahead log and emits change events in near real time. Debezium + Kafka Connect is the canonical stack. Latency is typically well under a second and depends on connector configuration. Polling queries the outbox table on a schedule (every 100ms-1s). Simpler to operate but higher latency and more database load. Start with polling; switch to CDC when latency matters.

At-least-once at the publish boundary. Because the publisher marks rows as published only after a successful bus write, a crash between publish and mark means the event is published again on restart. The outbox therefore guarantees that no committed event is lost, not that each event is published exactly once. Consumers must still be idempotent.

Outbox table schema. Minimal: event_id (UUID), aggregate_type, aggregate_id, event_type, payload (JSONB), created_at, published_at (nullable). For the poller, add a partial index on created_at restricted to rows WHERE published_at IS NULL. Partition by created_at for cleanup.

Cleanup. Outbox rows are write-once, read-once artifacts. Delete or archive rows where published_at is not null and created_at is older than the retention window (e.g., 7 days). Without cleanup, the outbox table grows unbounded.

Saga state machine — tracking progress and enabling recovery

Architecture diagram· v4 — Orchestrator + state machine + outbox + DLQ

Production-grade: state machine tracks saga, outbox guarantees delivery, DLQ catches poison pills.

Every production saga needs an explicit state machine. Without one, the system cannot answer "what step is this saga on?" and cannot recover from crashes.

State transitions. A checkout saga might have these states: CREATED → PAYMENT_PENDING → PAYMENT_CAPTURED → INVENTORY_RESERVED → SHIPPING_CREATED → COMPLETED. On failure: any state → COMPENSATING → COMPENSATED → FAILED. Each transition is a conditional database update: UPDATE sagas SET state = 'PAYMENT_CAPTURED' WHERE saga_id = ? AND state = 'PAYMENT_PENDING'. The WHERE clause ensures that concurrent duplicate deliveries cannot advance the saga past a state it has already left.

State machine + outbox. The state transition and the outbox event are written in the same database transaction. This means the state machine is always consistent with the events published. If the orchestrator crashes, the new instance reads the saga state from the database and resumes from the last completed step.
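
A minimal sketch of a guarded transition that also records its event, assuming SQLite and simplified `sagas` and `outbox` tables (the `advance` helper and the column set are hypothetical):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE sagas (saga_id TEXT PRIMARY KEY, state TEXT NOT NULL);
    CREATE TABLE outbox (
        seq        INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload    TEXT NOT NULL
    );
    INSERT INTO sagas VALUES ('s-1', 'PAYMENT_PENDING');
""")


def advance(conn: sqlite3.Connection, saga_id: str,
            expected: str, new: str) -> bool:
    """Move the saga from `expected` to `new` and record the event atomically."""
    with conn:  # one transaction for the state change and the outbox row
        cur = conn.execute(
            "UPDATE sagas SET state = ? WHERE saga_id = ? AND state = ?",
            (new, saga_id, expected),
        )
        if cur.rowcount == 0:
            return False  # duplicate or stale delivery: no change, no event
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            (new, json.dumps({"saga_id": saga_id})),
        )
    return True


first = advance(db, "s-1", "PAYMENT_PENDING", "PAYMENT_CAPTURED")
duplicate = advance(db, "s-1", "PAYMENT_PENDING", "PAYMENT_CAPTURED")
state = db.execute("SELECT state FROM sagas WHERE saga_id = 's-1'").fetchone()[0]
events = db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
```

The duplicate delivery matches zero rows, so it neither changes state nor emits a second event.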

Saga table schema. Columns: saga_id (UUID, PK), saga_type, state, payload (JSONB — the input data), created_at, updated_at, completed_at, error (nullable text). Index on (state, updated_at) for the timeout sweeper. Index on (saga_type, state) for operational dashboards.

Timeout sweeper. A background process runs on a schedule (every 30s) and queries: SELECT * FROM sagas WHERE state NOT IN ('COMPLETED', 'FAILED', 'COMPENSATED') AND updated_at < NOW() - INTERVAL '30 minutes'. It moves these sagas to COMPENSATING. This catches: (a) orchestrator crashes, (b) services that accepted a request but never responded, (c) network partitions that silently dropped the response.

Concurrency control. When the sweeper moves a saga to COMPENSATING, it uses the same conditional update pattern: UPDATE ... WHERE state = 'INVENTORY_RESERVED'. If the saga advanced between the sweeper's read and write, the update affects zero rows, and the sweeper skips it. This prevents double-compensation.
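
A sweeper along these lines might look like the sketch below. It assumes SQLite, a numeric `updated_at`, and a fixed `NOW` for determinism. Unlike the query quoted above, it also skips sagas that are already in COMPENSATING, which is a design choice made for this example.

```python
import sqlite3
from typing import List

TERMINAL = ("COMPLETED", "FAILED", "COMPENSATED")
TTL_SECONDS = 30 * 60

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE sagas (saga_id TEXT PRIMARY KEY, state TEXT NOT NULL, "
    "updated_at REAL NOT NULL)"
)


def sweep(conn: sqlite3.Connection, now: float) -> List[str]:
    """Move stale, non-terminal sagas to COMPENSATING; return the ids moved."""
    stale = conn.execute(
        "SELECT saga_id, state FROM sagas "
        "WHERE state NOT IN (?, ?, ?, 'COMPENSATING') AND updated_at < ?",
        (*TERMINAL, now - TTL_SECONDS),
    ).fetchall()
    moved: List[str] = []
    for saga_id, seen_state in stale:
        with conn:
            cur = conn.execute(
                "UPDATE sagas SET state = 'COMPENSATING', updated_at = ? "
                "WHERE saga_id = ? AND state = ?",
                (now, saga_id, seen_state),
            )
        if cur.rowcount == 1:  # zero rows means the saga advanced meanwhile
            moved.append(saga_id)
    return moved


NOW = 1_000_000.0  # fixed clock for the example
db.executemany("INSERT INTO sagas VALUES (?, ?, ?)", [
    ("s-stale", "INVENTORY_RESERVED", NOW - 3600),
    ("s-fresh", "PAYMENT_CAPTURED", NOW - 60),
    ("s-done", "COMPLETED", NOW - 7200),
])
db.commit()
first_pass = sweep(db, NOW)
second_pass = sweep(db, NOW)
```

Running the sweeper twice moves the stale saga only once, which is the idempotency property the sweeper needs.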

Operational dashboards. With an explicit state machine, you can build dashboards: sagas by state (how many are in COMPENSATING?), sagas by age (any older than 1 hour in a non-terminal state?), compensation success rate. These are impossible with implicit choreography state.

Idempotency at every step — at-least-once is the norm

Architecture diagram· Saga timeline — success path + failure + compensation

Step 1 start→complete, step 2 start→complete, step 3 start→FAIL, compensate 2, compensate 1.

Every production message bus delivers at-least-once. Kafka consumers that crash after processing but before committing the offset will re-receive the message. SQS visibility timeout expiry causes re-delivery. Pub/Sub acknowledgement failures cause re-delivery. Temporal retries activities on transient errors. Every saga step and every compensation must be idempotent.

Strategy 1: Idempotency key in a seen-events table. Before processing, the consumer checks: SELECT 1 FROM seen_events WHERE event_id = ?. If the row exists, skip. Otherwise, process and INSERT the event_id in the same transaction. This is the most general approach and works for any operation. The seen_events table needs a TTL-based cleanup (delete rows older than the message bus retention period).

Strategy 2: Database upsert keyed on business ID. Instead of INSERT, use INSERT ... ON CONFLICT (order_id) DO UPDATE SET status = 'CAPTURED'. The upsert is naturally idempotent — replaying the same message produces the same database state. This works when the operation is setting a value, not incrementing one.
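A minimal sketch of the upsert strategy, using SQLite so it runs without a server (the `ON CONFLICT ... DO UPDATE` syntax is the same in PostgreSQL). The `payments` table and the `record_capture` helper are hypothetical names.

```python
import sqlite3


def record_capture(conn, order_id, amount_cents):
    """Upsert keyed on the business ID: a replay converges on the same row."""
    with conn:
        conn.execute(
            """INSERT INTO payments (order_id, amount_cents, status)
               VALUES (?, ?, 'CAPTURED')
               ON CONFLICT (order_id) DO UPDATE SET status = 'CAPTURED'""",
            (order_id, amount_cents),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    "order_id TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL, status TEXT NOT NULL)"
)
record_capture(conn, "order-1", 4999)
record_capture(conn, "order-1", 4999)  # redelivered message: same final state
```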

Strategy 3: Conditional updates with version checks. UPDATE orders SET status = 'SHIPPED', version = version + 1 WHERE order_id = ? AND version = ?. If the version has already advanced, the update affects zero rows. This is optimistic concurrency control and works for state machine transitions.
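The version check can be sketched as follows, again with SQLite standing in for the production database. The `orders` table and `mark_shipped` helper are illustrative assumptions; the point is that the caller learns from the row count whether its transition won.

```python
import sqlite3


def mark_shipped(conn, order_id, expected_version):
    """Apply the transition only if nobody has advanced the row since it was read."""
    with conn:
        cur = conn.execute(
            "UPDATE orders SET status = 'SHIPPED', version = version + 1 "
            "WHERE order_id = ? AND version = ?",
            (order_id, expected_version),
        )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id TEXT PRIMARY KEY, status TEXT NOT NULL, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO orders VALUES ('order-1', 'PAID', 1)")
conn.commit()

first = mark_shipped(conn, "order-1", expected_version=1)   # wins the transition
replay = mark_shipped(conn, "order-1", expected_version=1)  # stale version: no-op
```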

Strategy 4: Naturally idempotent operations. Some operations are idempotent by design. Setting a boolean flag (is_paid = true) is idempotent. Sending a DELETE for a resource that may already be deleted is idempotent (return 204 either way). Decrementing a counter is NOT idempotent — use a reservation model instead (reserve a specific reservation_id, not "decrement stock by 1").

The increment trap. A common bug: a payment step increments a running total. Replayed, it increments again. Fix: use a ledger model. Each payment event creates a ledger entry keyed by (saga_id, step). The balance is SUM(ledger entries). Replaying the event tries to insert a duplicate ledger entry; the unique constraint rejects it.
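A small sketch of the ledger fix, assuming a hypothetical `ledger` table with a unique constraint on (saga_id, step). SQLite is used only to keep the example self-contained.

```python
import sqlite3


def record_payment(conn, saga_id, step, amount_cents):
    """Insert one ledger entry per (saga_id, step); a replay is rejected, not re-applied."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO ledger (saga_id, step, amount_cents) VALUES (?, ?, ?)",
                (saga_id, step, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate entry: the event was already applied


def balance(conn):
    """The balance is derived from the entries, never stored as a running total."""
    return conn.execute("SELECT COALESCE(SUM(amount_cents), 0) FROM ledger").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ledger ("
    "saga_id TEXT NOT NULL, step TEXT NOT NULL, amount_cents INTEGER NOT NULL, "
    "UNIQUE (saga_id, step))"
)
record_payment(conn, "saga-1", "capture_payment", 4999)
record_payment(conn, "saga-1", "capture_payment", 4999)  # redelivery
```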

Testing idempotency. Every saga step should have a test: execute the step, then execute it again with the same input. Assert the side effects are identical. This is the single most important test in a saga codebase.
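One way to express that test is a reusable helper that runs a step twice and compares the state after each run. The helper and the two toy steps below are hypothetical and operate on an in-memory dict rather than a real service.

```python
import copy


def assert_idempotent(step, state, message):
    """Run step twice with the same input; the second run must change nothing."""
    step(state, message)
    after_first = copy.deepcopy(state)
    step(state, message)
    assert state == after_first, f"{step.__name__} is not idempotent"


def reserve_stock(state, message):
    # Reservation model: keyed by reservation_id, so a replay rewrites the same entry.
    state["reservations"][message["reservation_id"]] = message["sku"]


def decrement_stock(state, message):
    # Bare decrement: every replay removes another unit.
    state["stock"][message["sku"]] -= 1
```

Running `assert_idempotent` against `reserve_stock` passes, while `decrement_stock` fails the assertion, which is exactly the bug the test is meant to catch.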

Observability — tracing sagas and detecting dead ones

Architecture diagram· v4 — Orchestrator + state machine + outbox + DLQ

Production-grade: state machine tracks saga, outbox guarantees delivery, DLQ catches poison pills.

A saga spans multiple services, databases, and message brokers. Without purpose-built observability, debugging a failed checkout requires SSH into 4 services, grepping 4 log streams, and mentally correlating timestamps. At 1000 sagas per second, this is impossible.

Distributed tracing. Propagate a saga_id (or use it as the trace_id in OpenTelemetry) across every service call, event header, and log line. When a saga fails, a single query by saga_id returns the complete timeline: which steps succeeded, which failed, what the compensation outcome was, and how long each step took. Without this, you are blind.

Saga state dashboard. With an explicit state machine, build a real-time dashboard: (a) sagas by state — pie chart of COMPLETED / COMPENSATING / FAILED / in-progress, (b) sagas by age — histogram of time-in-flight for non-terminal sagas, (c) saga throughput — completions per second, (d) compensation rate — what percentage of sagas trigger compensations. Alert on: compensation rate > 5% (something upstream is broken), sagas in non-terminal state > 30 minutes (dead saga).
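The two alert conditions can be sketched as a pure function over saga rows. The row shape (dicts with `saga_id`, `state`, `updated_at`) and the function name are assumptions; in practice the inputs would come from the saga table or from metrics.

```python
from datetime import datetime, timedelta

TERMINAL_STATES = {"COMPLETED", "FAILED", "COMPENSATED"}
COMPENSATION_STATES = {"COMPENSATING", "COMPENSATED"}


def evaluate_alerts(sagas, now, max_compensation_rate=0.05, max_age=timedelta(minutes=30)):
    """Return the compensation rate, whether it breaches the threshold, and dead saga IDs."""
    compensated = sum(1 for s in sagas if s["state"] in COMPENSATION_STATES)
    rate = compensated / len(sagas) if sagas else 0.0
    dead = [
        s["saga_id"]
        for s in sagas
        if s["state"] not in TERMINAL_STATES and now - s["updated_at"] > max_age
    ]
    return {
        "compensation_rate": rate,
        "compensation_rate_breached": rate > max_compensation_rate,
        "dead_sagas": dead,
    }


now = datetime(2024, 1, 1, 12, 0)
sample = [
    {"saga_id": "a", "state": "COMPLETED", "updated_at": now - timedelta(hours=2)},
    {"saga_id": "b", "state": "INVENTORY_RESERVED", "updated_at": now - timedelta(minutes=45)},
    {"saga_id": "c", "state": "PAYMENT_CAPTURED", "updated_at": now - timedelta(minutes=5)},
    {"saga_id": "d", "state": "COMPENSATED", "updated_at": now - timedelta(hours=1)},
]
alerts = evaluate_alerts(sample, now)
```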

Dead saga detection. A saga is "dead" when it is stuck in a non-terminal state beyond its expected SLA. The timeout sweeper moves dead sagas to COMPENSATING, but the alert fires first so an engineer can investigate. Common causes: a service is down and not processing messages, a message was routed to the DLQ and nobody noticed, the orchestrator crashed and the new instance has not yet resumed (rare with Temporal, which handles this automatically).

DLQ monitoring. The dead-letter queue is a leading indicator of system health. A message in the DLQ means a saga step failed after max retries. Monitor: DLQ depth (should be zero in steady state), DLQ age (oldest message — indicates how long the issue has been unnoticed), DLQ event types (which step is failing?). Auto-alert on any DLQ message.

Structured logging. Every log line from a saga step includes: saga_id, step_name, step_number, saga_state, event_id, idempotency_key. Use structured JSON logging (not printf-style). Ship to a centralised log aggregator (Elasticsearch, Datadog, Loki). A query like saga_id = "abc123" returns the complete saga history across all services.
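A minimal sketch of such a logger using only the standard library. The field list comes from the paragraph above; `JsonFormatter` and `saga_logger` are hypothetical helper names, and a production setup would more likely use an existing JSON logging library.

```python
import json
import logging

SAGA_FIELDS = (
    "saga_id", "step_name", "step_number", "saga_state", "event_id", "idempotency_key",
)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the saga context fields."""

    def format(self, record):
        entry = {"level": record.levelname, "message": record.getMessage()}
        for field in SAGA_FIELDS:
            entry[field] = getattr(record, field, None)
        return json.dumps(entry)


def saga_logger(name, **context):
    """Return a LoggerAdapter that stamps every line with the given saga context."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logging.LoggerAdapter(logger, context)


log = saga_logger("checkout", saga_id="abc123", step_name="capture_payment", step_number=2)
log.info("payment captured")
```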

Metrics to own. saga_completed_total (counter, by saga_type), saga_failed_total (counter, by saga_type and failure_step), saga_duration_seconds (histogram, by saga_type), saga_compensation_total (counter, by saga_type and compensated_step), saga_inflight (gauge, by saga_type and current_state). These five metrics cover throughput, failures, latency, compensations, and in-flight load.

Case studies

Uber

Trip lifecycle saga

Uber's trip lifecycle is a multi-step saga spanning rider matching, driver dispatch, trip tracking, fare calculation, payment, and rating. Each step is owned by a different service.

Orchestration. Uber built and open-sourced Cadence to orchestrate workflows like these; Temporal is a later fork started by Cadence's original creators. A trip workflow in Cadence is a Go function: match a driver, wait for driver acceptance (with timeout), start the trip, track GPS updates, end the trip, calculate the fare, capture payment, prompt for rating. Each step is an activity with its own retry policy.

Compensations. If payment capture fails after the trip is completed, the workflow retries with backoff. If it permanently fails (card declined), the workflow marks the trip as payment-failed and queues it for alternative payment methods or debt collection. The trip itself is not "rolled back" — it already happened. This illustrates that compensations are business-logic-specific, not mechanical undos.

Scale. Uber processes millions of trips per day. Cadence handles the workflow persistence, retry, and timeout detection. Each activity service (matching, payment, etc.) scales independently. The saga state is durable — if Cadence workers restart, in-flight workflows resume from the last completed activity.

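Idempotency. Every activity is idempotent by workflow_id + activity_name. When Cadence reconstructs workflow state from history it replays the workflow code, not the activities (results of completed activities are read back from the history). An activity can still run more than once when it is retried after a failure or timeout, so it checks whether the work has already been done before executing.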

Takeaway

Uber built Cadence (the project Temporal was later forked from) because choreography could not handle the complexity of trip lifecycle sagas at scale. If your workflow has 5+ steps with timeouts and compensations, orchestration is not optional.

Stripe

Payment + refund saga

Stripe's payment processing is a saga with strict idempotency and compensation guarantees. A payment intent progresses through states: created → processing → succeeded (or failed). A refund is the compensating action for a succeeded payment.

Idempotency keys. Every Stripe API call accepts an Idempotency-Key header. Stripe stores the result keyed by (API key, idempotency key). Replaying the same request with the same key returns the stored result without re-executing the operation. This is strategy 1 (seen-events table) implemented at the API gateway level.
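The store-and-replay behaviour described above can be sketched in a few lines. This is not Stripe's implementation: `IdempotentEndpoint` is a hypothetical in-memory wrapper with no expiry, persistence, or concurrency control, meant only to show the keying on (API key, idempotency key).

```python
class IdempotentEndpoint:
    """Wrap a handler so a repeated (api_key, idempotency_key) returns the stored result."""

    def __init__(self, handler):
        self._handler = handler
        self._results = {}  # (api_key, idempotency_key) -> stored response

    def call(self, api_key, idempotency_key, request):
        cache_key = (api_key, idempotency_key)
        if cache_key not in self._results:
            # First time this key is seen: execute and remember the outcome.
            self._results[cache_key] = self._handler(request)
        return self._results[cache_key]


calls = []


def create_refund(request):
    calls.append(request)
    return {"refund_id": f"re_{len(calls)}", "amount": request["amount"]}


endpoint = IdempotentEndpoint(create_refund)
first = endpoint.call("key_a", "saga-1-refund", {"amount": 500})
second = endpoint.call("key_a", "saga-1-refund", {"amount": 500})  # replay: handler not run
```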

The payment saga. A checkout saga calls Stripe to create a payment intent, then confirms it. If confirmation fails, the intent expires automatically (no explicit compensation). If confirmation succeeds but a downstream step (inventory reservation) fails, the saga issues a refund. The refund is itself idempotent — calling it twice with the same idempotency key returns the same refund object.

Event webhooks. Stripe publishes events (payment_intent.succeeded, charge.refunded) to configured webhook endpoints. These events are at-least-once — Stripe retries failed deliveries for up to 72 hours. Every webhook consumer must be idempotent. Stripe recommends checking the event ID against a seen-events table before processing.

Partial refunds. A refund does not have to undo the full payment. Stripe supports partial refunds, which means the compensation is parameterized — the saga decides how much to refund based on which downstream steps completed. This is more nuanced than a simple "undo."

Takeaway

Stripe proves that idempotency keys are not a nice-to-have — they are the foundation of every saga step. At-least-once delivery is the norm; idempotency is the contract.

Netflix

Conductor — microservices orchestration

Netflix built Conductor to orchestrate complex microservices workflows across its content platform. Conductor is an open-source, general-purpose workflow engine that drives sagas via JSON-defined workflow definitions.

Workflow model. A Conductor workflow is a DAG of tasks. Each task is executed by a worker that polls for work, executes, and reports the result. The conductor server persists workflow state in a database (Dynomite/Redis + Elasticsearch). Workers are stateless and horizontally scalable.

Content encoding saga. When Netflix ingests a new title, a Conductor workflow orchestrates: validate the source media → transcode to multiple resolutions → generate thumbnails → run quality checks → update the content catalog → make available in CDN. If quality checks fail, the workflow compensates by removing partial transcodes and notifying the content team.

Scale. Netflix runs millions of workflow executions per day across hundreds of workflow definitions. Conductor handles task routing, retry, timeout, and state persistence. Workers are polyglot — Java, Python, Go — communicating via HTTP/gRPC.

Observability. Conductor provides a built-in UI showing workflow execution history, task status, input/output payloads, and retry attempts. This is the "time travel" that orchestration gives you for free — choreography has no equivalent without building a custom correlation service.

Failure handling. Tasks have configurable retry counts, timeout policies, and failure workflows. A failed task can trigger an alternative workflow (e.g., re-encode with different settings) or escalate to a human. The DLQ equivalent is a "failed workflow" state that operators can inspect and retry from the UI.

Takeaway

Netflix Conductor demonstrates that orchestration engines scale to millions of workflows per day. The built-in observability (execution history, task-level debugging) is what makes orchestration operationally viable at scale.

Decision levers

Choreography vs orchestration

Start with orchestration when the workflow has >3 steps or compensations are complex. Use choreography only for truly simple 2-step reactive flows. The hybrid model — orchestration for the critical path, choreography for side-effects — is the most common production pattern.

Outbox implementation (CDC vs polling)

Polling is simpler to operate: a background worker queries the outbox table every 100 ms to 1 s. CDC (Debezium) tails the database WAL and publishes events in near real time. Start with polling; switch to CDC when publish latency matters or polling load becomes significant.

Compensation ordering

Compensations run in reverse order of the original steps. Non-compensable steps (send email, push notification) must be placed last in the saga so they execute only after all compensable steps succeed. If a non-compensable step must come early, design an "apologize" compensation instead.
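The ordering rule can be sketched as a tiny in-process runner: each completed step pushes its compensation, and a failure unwinds them in reverse. `run_saga` and the step tuples are hypothetical; a workflow engine gives you the same structure with durability added.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    On failure, run the compensations of already completed steps in reverse
    order, then re-raise. A compensation of None marks a non-compensable step.
    """
    completed = []
    try:
        for name, action, compensation in steps:
            action()
            completed.append((name, compensation))
    except Exception:
        for name, compensation in reversed(completed):
            if compensation is not None:
                compensation()
        raise


log = []


def failing_label():
    raise RuntimeError("label service down")


steps = [
    ("reserve", lambda: log.append("reserve"), lambda: log.append("release")),
    ("pay", lambda: log.append("pay"), lambda: log.append("refund")),
    ("ship", failing_label, lambda: log.append("cancel_label")),
    ("confirm", lambda: log.append("confirm"), None),  # non-compensable: placed last
]
```

With these steps, the shipping failure triggers the refund first and the inventory release second, and the confirmation step never runs.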

Saga timeout policy

Every saga needs a timeout. A checkout saga might allow 30 minutes; a ride-matching saga might allow 60 seconds. The timeout sweeper moves expired sagas to COMPENSATING. Set timeouts conservatively at first and tighten based on p99 completion times.

Idempotency strategy per step

Choose per step: seen-events table for general operations, database upsert for state-setting operations, conditional version checks for state machine transitions, ledger model for financial operations. Never use bare increment/decrement — always use a reservation or ledger.

Failure modes

Dual-write (DB + event bus)

App writes DB then publishes event — crash between the two means the DB has the state but the event is lost. Downstream services drift permanently. Fix: outbox pattern. One DB transaction writes business data + outbox row. Publisher tails the outbox.

Cascading compensation failure

Step 5 fails → compensate 4, 3, 2, 1. Compensation for step 3 itself fails. The system is now partially compensated with no automated recovery. Fix: compensations must retry with exponential backoff until success. After max retries, route to DLQ for human intervention.
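A sketch of that fix, assuming the dead-letter queue can be modelled as a list and that the sleep function is injectable for tests. The function name and parameters are illustrative.

```python
import time


def compensate_with_retry(compensation, dead_letters, max_attempts=5,
                          base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Retry a compensation with exponential backoff; park it in dead_letters on exhaustion."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            compensation()
            return True
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                # Delays grow as 0.5s, 1s, 2s, ... and are capped at max_delay.
                sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
    dead_letters.append({
        "compensation": getattr(compensation, "__name__", repr(compensation)),
        "error": repr(last_error),
        "attempts": max_attempts,
    })
    return False
```

A False return means a human has to look at the dead-letter entry; the saga should be marked accordingly rather than silently left half compensated.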

Lost saga state in choreography

"What is the status of order #12345?" requires querying N services and replaying events. No single owner knows the answer. Fix: add a saga-status materializer — which is halfway to building an orchestrator. Or use orchestration from the start.

Non-idempotent step replays

A payment step increments a running total. Replayed due to redelivery, it increments again. Customer is double-charged. Fix: use idempotency keys (seen-events table) or ledger model (one entry per saga_id+step, balance = SUM).

Dead saga: stuck in non-terminal state (Advanced)

Orchestrator crashes, service goes dark, or response is dropped. The saga sits in INVENTORY_RESERVED forever. Fix: timeout sweeper queries for stale sagas and moves them to COMPENSATING. Alert on sagas older than SLA.

Poison message blocking partition (Advanced)

A malformed event cannot be deserialized. Consumer retries indefinitely, blocking all subsequent messages on the partition. Fix: after N retries, route to DLQ. Monitor DLQ depth — it should be zero in steady state.
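The bounded-retry rule can be sketched with an in-memory message list and a list standing in for the DLQ. `consume` is a hypothetical helper; a real consumer would also commit offsets and publish the dead letter to a separate topic.

```python
import json


def consume(messages, handle, dead_letters, max_retries=3):
    """Process messages in order; park a message that keeps failing so later ones still run."""
    for raw in messages:
        error = None
        for _attempt in range(max_retries):
            try:
                handle(json.loads(raw))
                break  # processed: move on to the next message
            except Exception as exc:
                error = repr(exc)
        else:
            # Every attempt failed: route to the DLQ instead of blocking the partition.
            dead_letters.append({"raw": raw, "error": error, "attempts": max_retries})


handled = []
dlq = []
consume(['{"id": 1}', "not-json", '{"id": 2}'], handled.append, dlq)
```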

Event schema drift

Producer renames a field; 10 consumers break silently. Fix: schema registry (Confluent, Apicurio) + backwards-compatible changes only. Breaking change = new event type with a migration path.

Decision table

Choreography vs orchestration decision matrix

Dimension	Choreography	Orchestration
Workflow steps	2-3 linear	4+ or branching
Compensations	Simple or none	Complex, ordered
State visibility	Reconstruct from events	Query by saga_id
Debugging	Correlate N log streams	Single workflow trace
New consumers	Subscribe — zero upstream changes	Add activity + update workflow
Infrastructure	Event bus only	Workflow engine + DB
Coupling	Loose (event schemas)	Medium (activity interface)

Most production systems at scale use orchestration for the critical path and choreography for side-effect fanout.

Worked example

Worked example: e-commerce order saga

Prompt: Design a checkout system where placing an order involves payment capture, inventory reservation, and shipping label creation — each owned by a different service.

Step 1: Identify the saga steps and compensations

Step	Service	Forward action	Compensation
1	Inventory	Reserve stock	Release reservation
2	Payment	Capture payment	Issue refund
3	Shipping	Create label	Cancel label
4	Order	Confirm order	(terminal — no comp)

Ordering matters. Reserve inventory before capturing payment — it is cheaper to release a reservation than to refund a charge. Confirm order last because it is non-compensable (the customer sees the confirmation).

Architecture diagram· Saga skeleton — Order → Payment → Inventory → Shipping

Each service owns one step. Compensating arrows show the rollback path when a later step fails.

Step 2: Choose orchestration

This is a 4-step workflow with ordered compensations. Choreography would require every service to know which failure events to listen for and in what order to compensate. Orchestration gives us one place to read the workflow, one place to add timeout detection, and one place to query saga state.

Engine choice: Temporal. Alternatives: AWS Step Functions (if fully on AWS), Conductor (if Netflix stack). The workflow is a Python class with a single run method:

python

@workflow.defn
class CheckoutWorkflow:
    @workflow.run
    async def run(self, order: OrderInput) -> OrderResult:
        # Temporal requires a timeout on every activity call, compensations included.
        short = timedelta(seconds=30)
        reservation = await workflow.execute_activity(
            reserve_inventory, order, start_to_close_timeout=short)
        try:
            payment = await workflow.execute_activity(
                capture_payment, order, start_to_close_timeout=short)
        except Exception:
            # Payment failed: undo step 1.
            await workflow.execute_activity(
                release_inventory, reservation, start_to_close_timeout=short)
            raise
        try:
            shipment = await workflow.execute_activity(
                create_shipping_label, order, start_to_close_timeout=timedelta(seconds=60))
        except Exception:
            # Label creation failed: undo steps 2 and 1, in reverse order.
            await workflow.execute_activity(
                refund_payment, payment, start_to_close_timeout=short)
            await workflow.execute_activity(
                release_inventory, reservation, start_to_close_timeout=short)
            raise
        await workflow.execute_activity(
            confirm_order, order, start_to_close_timeout=timedelta(seconds=10))
        return OrderResult(reservation, payment, shipment)

Step 3: Add the outbox pattern

Each service writes its business data + an outbox event in one database transaction. An outbox publisher (a poller, or a CDC connector where publish lag matters) publishes the events to Kafka for side-effect consumers (email, analytics, search indexer).

Step 4: Idempotency

reserve_inventory: upsert reservation by (order_id). Re-reserving the same order_id is a no-op.
capture_payment: Stripe idempotency key = order_id + "capture". Replaying returns the same PaymentIntent.
create_shipping_label: upsert by (order_id). Re-creating returns the existing label.
All compensations: check current state before acting. If already released/refunded/cancelled, return success.

Step 5: Observability

saga_id = order_id, propagated as trace_id in OpenTelemetry.
Dashboard: sagas by state, p99 completion time (target: < 5s), compensation rate (target: < 2%).
Timeout sweeper: sagas in non-terminal state > 30 minutes → COMPENSATING.
DLQ depth alert: any message → PagerDuty.

Numbers

Metric	Value
Saga steps	4
p99 latency (happy path)	~3s
Compensation rate	< 1% (mostly card declines)
Saga timeout	30 minutes
Outbox publish lag (CDC)	< 100ms

This design gives us independent service deploys, atomic event publishing, automatic retry and compensation, queryable saga state by order_id, and end-to-end tracing — all without a single distributed transaction.

Interview playbook

10-12 minutes in a 45-minute interview. Saga design is often the core of the answer, not a tangent.

When it comes up

Multi-service workflow (checkout, booking, trip lifecycle)
Interviewer asks "how do you handle failures across services?"
Prompt mentions "distributed transactions" or "eventual consistency"
Any workflow with 3+ services that must coordinate

Order of reveal

1. Name the saga pattern. This is a saga: a sequence of local transactions with compensating actions instead of a distributed transaction.
2. Choose orchestration. For 4+ steps with compensations, I would use an orchestrator like Temporal. The workflow is code I can read top-to-bottom.
3. List steps + compensations. Here are the forward steps and their compensations: reserve→release, capture→refund, label→cancel. Non-compensable steps go last.
4. Outbox pattern. Every service uses the outbox pattern (business write + outbox row in one DB transaction) to eliminate dual writes.
5. Idempotency. Every step is idempotent. Upserts for state-setting, idempotency keys for payment, seen-events table for general operations.
6. Observability. saga_id as trace_id across all services. Dashboard with sagas by state, compensation rate, and timeout sweeper for dead sagas.
7. Timeouts + DLQ. Timeout sweeper moves stale sagas to COMPENSATING. DLQ catches poison messages. Both alert on arrival.

Signature phrases

“Compensation, not rollback” — Shows understanding that sagas do not undo — they issue new forward actions that correct the effect.
“Outbox pattern eliminates dual writes” — Names the #1 anti-pattern in event-driven systems and the canonical fix.
“At-least-once means every step must be idempotent” — Shows awareness that duplicate delivery is the norm, not an edge case.
“Orchestration for the critical path, choreography for side-effects” — Demonstrates the hybrid model that most production systems use.
“Non-compensable steps go last” — Shows compensation ordering awareness — you cannot un-send an email.
“saga_id as trace_id” — Ties observability to the saga identity — one query returns the full timeline.

Likely follow-ups

“Why not use 2PC?”

2PC holds locks across a network round-trip. One slow participant blocks all others. One crashed coordinator leaves all participants with locks held indefinitely. It kills throughput and availability. Sagas give us eventual consistency without distributed locks.

“When would you use choreography instead?”

For simple 2-3 step reactive flows with no compensations. Example: on user signup, send welcome email and create analytics profile. The moment you need compensations, state visibility, or branches — switch to orchestration.

“What if a compensation fails?”

Compensations retry with exponential backoff until success. After max retries, the message lands in a DLQ for human intervention. A compensation that permanently fails leaves the system in a partially compensated state — this must be treated as a critical incident.

“How do you handle duplicate events?”

Every step is idempotent. Strategy depends on the operation: seen-events table for general ops, database upserts for state-setting, conditional version checks for state machines, ledger model for financial operations. Never bare increment — always reservation or ledger.

“How do you debug a failed saga?”

saga_id propagates as trace_id across all services via OpenTelemetry. One query returns the complete timeline. With Temporal, you get time-travel debugging — replay the workflow step by step. Without orchestration, you are grepping N log streams and correlating timestamps manually.

“What about the outbox table growing unbounded?”

Delete or archive outbox rows where published_at is not null and created_at is older than the retention window (7 days). Partition by created_at for efficient cleanup. The outbox is a write-once, read-once artifact — not a permanent store.
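The cleanup rule can be sketched as a single delete. SQLite is used here so the example is self-contained; the `outbox` columns follow the text, and `purge_outbox` is a hypothetical helper that a scheduler would call periodically.

```python
import sqlite3


def purge_outbox(conn, retention_days=7):
    """Delete published outbox rows older than the retention window; return the row count."""
    with conn:
        cur = conn.execute(
            "DELETE FROM outbox "
            "WHERE published_at IS NOT NULL AND created_at < datetime('now', ?)",
            (f"-{int(retention_days)} days",),
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (event_id TEXT PRIMARY KEY, created_at TEXT NOT NULL, published_at TEXT)"
)
conn.execute(
    "INSERT INTO outbox VALUES ('old-published', datetime('now', '-10 days'), datetime('now', '-10 days'))"
)
conn.execute("INSERT INTO outbox VALUES ('old-pending', datetime('now', '-10 days'), NULL)")
conn.execute("INSERT INTO outbox VALUES ('fresh-published', datetime('now'), datetime('now'))")
conn.commit()
```

Unpublished rows are never deleted regardless of age, so a stalled publisher cannot cause events to be dropped by the cleanup job.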

Code snippets

pythonSaga orchestrator in Temporal (Python SDK)

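from temporalio import workflow
from dataclasses import dataclass
from datetime import timedelta

# Assumed to be defined elsewhere with @activity.defn: reserve_inventory,
# capture_payment, create_shipping_label, confirm_order, release_inventory,
# refund_payment.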

@dataclass
class OrderInput:
    order_id: str
    user_id: str
    items: list
    total_cents: int

@workflow.defn
class CheckoutSaga:
    """Orchestrates checkout: reserve → pay → ship → confirm.
    Compensations run in reverse on permanent failure."""

    @workflow.run
    async def run(self, order: OrderInput) -> str:
        # Step 1: Reserve inventory
        reservation_id = await workflow.execute_activity(
            reserve_inventory,
            order,
            start_to_close_timeout=timedelta(seconds=30),
        )

        # Step 2: Capture payment (compensate step 1 on failure)
        try:
            payment_id = await workflow.execute_activity(
                capture_payment,
                order,
                start_to_close_timeout=timedelta(seconds=30),
            )
        except Exception:
            await workflow.execute_activity(
                release_inventory, reservation_id,
                start_to_close_timeout=timedelta(seconds=30),
            )
            raise

        # Step 3: Create shipping label (compensate steps 2, 1)
        try:
            shipment_id = await workflow.execute_activity(
                create_shipping_label,
                order,
                start_to_close_timeout=timedelta(seconds=60),
            )
        except Exception:
            await workflow.execute_activity(
                refund_payment, payment_id,
                start_to_close_timeout=timedelta(seconds=30),
            )
            await workflow.execute_activity(
                release_inventory, reservation_id,
                start_to_close_timeout=timedelta(seconds=30),
            )
            raise

        # Step 4: Confirm order (non-compensable — last)
        await workflow.execute_activity(
            confirm_order,
            order,
            start_to_close_timeout=timedelta(seconds=10),
        )
        return f"saga completed: {order.order_id}"

pythonCompensating transaction handler with idempotency

import stripe  # third-party: pip install stripe
from uuid import UUID

def compensate_payment(saga_id: UUID, payment_id: str, db_conn) -> None:
    """Idempotent refund: checks whether already refunded before acting.

    db_conn is a psycopg2 connection. The compensations table is assumed to
    have a unique constraint on (saga_id, step).
    """
    with db_conn.cursor() as cur:
        # Fast path: this step was already compensated
        cur.execute(
            "SELECT 1 FROM compensations WHERE saga_id = %s AND step = 'payment'",
            (str(saga_id),),
        )
        if cur.fetchone():
            db_conn.rollback()  # end the read-only transaction
            return  # Already compensated: idempotent no-op

        # Issue the refund. This is an external call, not part of the DB
        # transaction; the idempotency key makes a retry after a crash safe.
        stripe.Refund.create(
            payment_intent=payment_id,
            idempotency_key=f"{saga_id}-refund",
        )

        # Record the compensation; ON CONFLICT absorbs a concurrent duplicate
        cur.execute(
            """INSERT INTO compensations (saga_id, step, completed_at)
               VALUES (%s, 'payment', NOW())
               ON CONFLICT (saga_id, step) DO NOTHING""",
            (str(saga_id),),
        )
    db_conn.commit()

pythonOutbox publisher — polls outbox table and publishes to Kafka

import json
import time
from kafka import KafkaProducer

def outbox_publisher(db_conn, kafka_producer: KafkaProducer, poll_interval: float = 0.1):
    """Polls the outbox table and publishes unpublished events to Kafka.
    Marks each event as published only after the broker acknowledges it,
    so a crash causes a re-publish (at-least-once), never a lost event."""
    while True:
        with db_conn.cursor() as cur:
            cur.execute(
                """SELECT event_id, topic, payload
                   FROM outbox
                   WHERE published_at IS NULL
                   ORDER BY created_at
                   LIMIT 100"""
            )
            rows = cur.fetchall()

            for event_id, topic, payload in rows:
                # send() is asynchronous and returns a future
                future = kafka_producer.send(
                    topic,
                    key=str(event_id).encode(),
                    value=json.dumps(payload).encode(),
                )
                # Block until the broker acks; raises if delivery failed.
                # Simple but slow: batch the waits if throughput matters.
                future.get(timeout=10)
                cur.execute(
                    "UPDATE outbox SET published_at = NOW() WHERE event_id = %s",
                    (event_id,),
                )
            db_conn.commit()

        if not rows:
            time.sleep(poll_interval)

pythonIdempotent step executor using seen-events table

from uuid import UUID

def execute_idempotent_step(event_id: UUID, step_fn, db_conn, *args, **kwargs):
    """Wraps any saga step with idempotency via a seen-events table.
    If the event has already been processed, returns without re-executing.
    Assumes a unique constraint on seen_events.event_id and a step_fn whose
    writes go through db_conn, so they commit together with the marker row."""
    try:
        with db_conn.cursor() as cur:
            # ON CONFLICT DO NOTHING: rowcount is 0 when the event was seen before
            cur.execute(
                """INSERT INTO seen_events (event_id, processed_at)
                   VALUES (%s, NOW())
                   ON CONFLICT (event_id) DO NOTHING""",
                (str(event_id),),
            )
            if cur.rowcount == 0:
                db_conn.rollback()
                return None  # Already processed: idempotent skip

        # Execute the actual step
        result = step_fn(*args, **kwargs)
        db_conn.commit()
        return result
    except Exception:
        # Step failed: discard the marker row so a redelivery can retry it
        db_conn.rollback()
        raise

pythonSaga timeout sweeper — detects and compensates dead sagas

from datetime import timedelta

TERMINAL_STATES = ('COMPLETED', 'FAILED', 'COMPENSATED')
SAGA_TIMEOUT = timedelta(minutes=30)

def sweep_dead_sagas(db_conn, compensate_saga_fn):
    """Finds sagas stuck in non-terminal states beyond the timeout.
    Moves them to COMPENSATING via conditional update (prevents races).
    Returns the number of sagas actually moved."""
    swept = []
    with db_conn.cursor() as cur:
        # psycopg2 adapts the tuple to an IN list and the timedelta to an interval
        cur.execute(
            """SELECT saga_id, state
               FROM sagas
               WHERE state NOT IN %s
               AND updated_at < NOW() - %s
               FOR UPDATE SKIP LOCKED""",
            (TERMINAL_STATES, SAGA_TIMEOUT),
        )
        stale = cur.fetchall()

        for saga_id, state in stale:
            # Conditional update prevents double-compensation
            cur.execute(
                """UPDATE sagas
                   SET state = 'COMPENSATING', updated_at = NOW()
                   WHERE saga_id = %s AND state = %s""",
                (saga_id, state),
            )
            if cur.rowcount == 1:
                swept.append(saga_id)
    db_conn.commit()

    # Trigger compensation only after the state change is durable, and
    # outside the transaction so row locks are not held during slow calls.
    for saga_id in swept:
        compensate_saga_fn(saga_id)
    return len(swept)

Drills

A saga step captures payment and then reserves inventory. The inventory reservation fails. What happens?

The orchestrator invokes the compensation for the payment step — a refund. The refund uses an idempotency key (saga_id + "refund") so replays are safe. After the refund succeeds, the saga transitions to FAILED. The order service marks the order as payment-reversed. Lesson: order your steps so the cheapest-to-compensate steps run first. Reserving inventory before capturing payment is cheaper to compensate (release vs refund).

Your team uses choreography for a 5-step checkout workflow. The PM asks "what is the current state of order #12345?" How do you answer?

You cannot answer from a single service. You must query all 5 services and correlate their state. In practice, you build a saga-status materializer that consumes events from all services and builds a projected view — which is an orchestrator by another name. This is the canonical argument for switching to explicit orchestration.

Explain why a payment increment step (balance += amount) is dangerous in a saga and how to fix it.

Increment is not idempotent. If the message is redelivered, the balance is incremented again (double charge). Fix: use a ledger model. Each payment creates a ledger entry keyed by (saga_id, step_name). The balance is SUM(amount) over all ledger entries. Replaying the message tries to insert a duplicate entry; the unique constraint rejects it.

What is the dual-write problem and how does the outbox pattern solve it?

Dual write: the service writes to its DB then publishes an event to the bus. A crash between the two means the DB has the state but the event is lost — downstream services are permanently out of sync. Outbox: write business data + outbox row in one DB transaction (atomic). A CDC connector or poller publishes from the outbox. If the publisher crashes, it resumes from the last published sequence number.

A compensation itself fails after 3 retries. What should happen?

After max retries, the compensation message lands in a dead-letter queue. The saga state transitions to COMPENSATION_FAILED. An alert fires (PagerDuty). A human investigates — they may manually complete the compensation or apply a corrective action. This is a critical incident because the system is in a partially compensated state. Design compensations to be as simple and reliable as possible to minimize this risk.

When would you use choreography over orchestration?

Simple reactive flows with no compensations: on UserRegistered, send welcome email and create analytics profile. Two consumers, both idempotent, no rollback. Also when extreme decoupling matters — adding a 10th consumer to an event requires zero upstream changes. Switch to orchestration at the first sign of: 4+ steps, compensations, conditional branches, or the need to query workflow state.

6% complete

Current

When to reach for this

Step 1 of 17

Good vs bad answer

Jump to next

| Dimension | Choreography | Orchestration |
| --- | --- | --- |
| Workflow steps | 2-3 linear | 4+ or branching |
| Compensations | Simple or none | Complex, ordered |
| State visibility | Reconstruct from events | Query by saga_id |
| Debugging | Correlate N log streams | Single workflow trace |
| New consumers | Subscribe; zero upstream changes | Add activity + update workflow |
| Infrastructure | Event bus only | Workflow engine + DB |
| Coupling | Loose (event schemas) | Medium (activity interface) |

| Step | Service | Forward action | Compensation |
| --- | --- | --- | --- |
| 1 | Inventory | Reserve stock | Release reservation |
| 2 | Payment | Capture payment | Issue refund |
| 3 | Shipping | Create label | Cancel label |
| 4 | Order | Confirm order | (terminal, no compensation) |
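The forward/compensation pairing in the step table can be driven by a small generic executor. This is a sketch, not a replacement for a workflow engine: `Step` and `run_saga` are hypothetical names, nothing is persisted, and compensations are assumed to succeed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]
    compensate: Optional[Callable[[dict], None]] = None  # None marks a terminal step


def run_saga(steps: list, ctx: dict) -> str:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed = []
    for step in steps:
        try:
            step.action(ctx)
        except Exception:
            # The failed step itself is not compensated, only the ones before it.
            for done in reversed(completed):
                if done.compensate is not None:
                    done.compensate(ctx)
            return "COMPENSATED"
        completed.append(step)
    return "COMPLETED"
```

If the shipping step raises, the executor issues the refund first and releases the reservation second, mirroring the reverse order in the table.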

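from datetime import timedelta

from temporalio import workflow

# The activities (reserve_inventory, capture_payment, ...) and the
# OrderInput / OrderResult types are defined elsewhere.
# Every activity call needs a timeout, including the compensations.

@workflow.defn
class CheckoutWorkflow:
    @workflow.run
    async def run(self, order: OrderInput) -> OrderResult:
        reservation = await workflow.execute_activity(
            reserve_inventory, order,
            start_to_close_timeout=timedelta(seconds=30))
        try:
            payment = await workflow.execute_activity(
                capture_payment, order,
                start_to_close_timeout=timedelta(seconds=30))
        except Exception:
            # Payment failed: undo the reservation.
            await workflow.execute_activity(
                release_inventory, reservation,
                start_to_close_timeout=timedelta(seconds=30))
            raise
        try:
            shipment = await workflow.execute_activity(
                create_shipping_label, order,
                start_to_close_timeout=timedelta(seconds=60))
        except Exception:
            # Shipping failed: undo payment, then the reservation.
            await workflow.execute_activity(
                refund_payment, payment,
                start_to_close_timeout=timedelta(seconds=30))
            await workflow.execute_activity(
                release_inventory, reservation,
                start_to_close_timeout=timedelta(seconds=30))
            raise
        await workflow.execute_activity(
            confirm_order, order,
            start_to_close_timeout=timedelta(seconds=10))
        return OrderResult(reservation, payment, shipment)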

| Metric | Value |
| --- | --- |
| Saga steps | |
| p99 latency (happy path) | ~3s |
| Compensation rate | < 1% (mostly card declines) |
| Saga timeout | 30 minutes |
| Outbox publish lag (CDC) | < 100ms |

from temporalio import workflow, activity
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class OrderInput:
    order_id: str
    user_id: str
    items: list
    total_cents: int


@workflow.defn
class CheckoutSaga:
    """Orchestrates checkout: reserve → pay → ship → confirm.
    Compensations run in reverse on permanent failure."""

    @workflow.run
    async def run(self, order: OrderInput) -> str:
        # Step 1: Reserve inventory
        reservation_id = await workflow.execute_activity(
            reserve_inventory,
            order,
            start_to_close_timeout=timedelta(seconds=30),
        )

        # Step 2: Capture payment (compensate step 1 on failure)
        try:
            payment_id = await workflow.execute_activity(
                capture_payment,
                order,
                start_to_close_timeout=timedelta(seconds=30),
            )
        except Exception:
            await workflow.execute_activity(
                release_inventory,
                reservation_id,
                start_to_close_timeout=timedelta(seconds=30),
            )
            raise

        # Step 3: Create shipping label (compensate steps 2, 1)
        try:
            shipment_id = await workflow.execute_activity(
                create_shipping_label,
                order,
                start_to_close_timeout=timedelta(seconds=60),
            )
        except Exception:
            await workflow.execute_activity(
                refund_payment,
                payment_id,
                start_to_close_timeout=timedelta(seconds=30),
            )
            await workflow.execute_activity(
                release_inventory,
                reservation_id,
                start_to_close_timeout=timedelta(seconds=30),
            )
            raise

        # Step 4: Confirm order (non-compensable, so it runs last)
        await workflow.execute_activity(
            confirm_order,
            order,
            start_to_close_timeout=timedelta(seconds=10),
        )
        return f"saga completed: {order.order_id}"

from uuid import UUID

import stripe  # assumes stripe.api_key is configured at startup


def compensate_payment(saga_id: UUID, payment_id: str, db_conn) -> None:
    """Idempotent refund: checks whether already refunded before acting.
    db_conn is a psycopg2 connection."""
    with db_conn.cursor() as cur:
        # Check if already compensated
        cur.execute(
            "SELECT 1 FROM compensations WHERE saga_id = %s AND step = 'payment'",
            (str(saga_id),),
        )
        if cur.fetchone():
            return  # Already compensated: idempotent no-op

        # Issue refund via payment provider. The provider-side idempotency
        # key keeps a retry safe if we crash before the INSERT below commits.
        stripe.Refund.create(
            payment_intent=payment_id,
            idempotency_key=f"{saga_id}-refund",
        )

        # Record the compensation so later redeliveries short-circuit
        cur.execute(
            """INSERT INTO compensations (saga_id, step, completed_at)
               VALUES (%s, 'payment', NOW())""",
            (str(saga_id),),
        )
    db_conn.commit()

import json
import time

from kafka import KafkaProducer


def outbox_publisher(db_conn, kafka_producer: KafkaProducer, poll_interval: float = 0.1):
    """Polls the outbox table and publishes unpublished events to Kafka.
    Marks each event as published only after the broker acknowledges it."""
    while True:
        with db_conn.cursor() as cur:
            cur.execute(
                """SELECT event_id, topic, payload FROM outbox
                   WHERE published_at IS NULL
                   ORDER BY created_at
                   LIMIT 100"""
            )
            rows = cur.fetchall()
            for event_id, topic, payload in rows:
                future = kafka_producer.send(
                    topic,
                    key=str(event_id).encode(),
                    value=json.dumps(payload).encode(),
                )
                # send() is asynchronous: block until the broker acknowledges,
                # so a failed send raises here instead of being marked published.
                future.get(timeout=10)
                cur.execute(
                    "UPDATE outbox SET published_at = NOW() WHERE event_id = %s",
                    (event_id,),
                )
        db_conn.commit()
        if not rows:
            time.sleep(poll_interval)

from uuid import UUID

from psycopg2 import errors


def execute_idempotent_step(event_id: UUID, step_fn, db_conn, *args, **kwargs):
    """Wraps any saga step with idempotency via a seen-events table.
    If the event has already been processed, returns without re-executing."""
    with db_conn.cursor() as cur:
        # Attempt to insert; fails on duplicate (unique constraint)
        try:
            cur.execute(
                """INSERT INTO seen_events (event_id, processed_at)
                   VALUES (%s, NOW())""",
                (str(event_id),),
            )
        except errors.UniqueViolation:
            db_conn.rollback()
            return None  # Already processed: idempotent skip

        # Execute the actual step
        try:
            result = step_fn(*args, **kwargs)
        except Exception:
            # Undo the seen_events row so a redelivery can retry the step
            db_conn.rollback()
            raise
        db_conn.commit()
        return result

from datetime import timedelta

# COMPENSATION_FAILED is parked for a human, so the sweeper leaves it alone.
TERMINAL_STATES = {'COMPLETED', 'FAILED', 'COMPENSATED', 'COMPENSATION_FAILED'}
SAGA_TIMEOUT = timedelta(minutes=30)


def sweep_dead_sagas(db_conn, compensate_saga_fn):
    """Finds sagas stuck in non-terminal states beyond the timeout.
    Moves them to COMPENSATING via conditional update (prevents races).
    Returns the number of stale sagas found."""
    with db_conn.cursor() as cur:
        cur.execute(
            """SELECT saga_id, state FROM sagas
               WHERE state NOT IN %s
                 AND updated_at < NOW() - INTERVAL %s
               FOR UPDATE SKIP LOCKED""",
            (tuple(TERMINAL_STATES), str(SAGA_TIMEOUT)),
        )
        stale = cur.fetchall()
        for saga_id, state in stale:
            # Conditional update prevents double-compensation
            cur.execute(
                """UPDATE sagas
                   SET state = 'COMPENSATING', updated_at = NOW()
                   WHERE saga_id = %s AND state = %s""",
                (saga_id, state),
            )
            if cur.rowcount == 1:
                compensate_saga_fn(saga_id)
    db_conn.commit()
    return len(stale)