Pattern

Producer-consumer / work queue

Whenever the thing making work and the thing doing work run at different speeds, put a queue between them. Simple, universal, surprisingly easy to get subtly wrong.

Decouple rate of production from rate of consumption with a durable queue and autoscaled workers.

Architecture diagram· Producer → durable queue → N workers → DLQ

Producers append fast; a bounded durable queue decouples throughput from processing speed; autoscaled consumers drain at their own pace. Messages that fail N times land in a dead-letter queue for manual inspection.

You’re looking at this pattern when

Synchronous call does work the user does not need to wait on (email send, image resize, index update)
Producers burst; consumers have steady capacity and need to drain at their own pace
Background side-effects that can tolerate seconds to minutes of latency

Shows up in

Upload → transcode pipeline (YouTube, TikTok)
Email / SMS / push-notification send (Twilio, SendGrid)

What most people get wrong

The user is waiting on the result in real time (use synchronous request-response)

When to reach for this

Reach for this when…

Synchronous call does work the user does not need to wait on (email send, image resize, index update)
Producers burst; consumers have steady capacity and need to drain at their own pace
Background side-effects that can tolerate seconds to minutes of latency
Fan-out to multiple downstream actions from one trigger event
Third-party APIs with tight rate limits that would throttle synchronous callers

Not really this pattern when…

The user is waiting on the result in real time (use synchronous request-response)
Work must be atomic with the triggering transaction (use the outbox pattern within a DB transaction)
Sub-second latency is required end-to-end and you cannot batch
The downstream system is a single stateful resource that cannot be parallelised

Cheat sheet

•Producer enqueues a reference (S3 key, row id), not the payload — messages have size limits (256 KB SQS, 1 MB Kafka).
•Return 202 Accepted immediately — the user never waits for the consumer.
•Visibility timeout = 2× your p99 processing time. Extend dynamically with ChangeMessageVisibility for variable-duration jobs.
•maxReceiveCount = 3-5 on the source queue, pointing to a DLQ. Alarm on DLQ depth > 0, period = 1 minute.
•Idempotency is structural: conditional writes (UPDATE WHERE version = expected), idempotency key tables (UNIQUE constraint on message id), or naturally idempotent ops (S3 PUT).
•Autoscale workers on queue depth: step scaling for predictability, target-tracking for simplicity.
•Backpressure: return 429 when queue depth > high-water mark. HWM = drain_rate × SLA_minutes.
•Use SQS Standard for general async work; FIFO for per-entity ordering; Kafka for > 100K msg/s or replay needs.
•Graceful shutdown: on SIGTERM, stop polling, finish in-flight message, delete it, then exit. Set ECS stopTimeout > p99.
•Outbox pattern for dual-write safety: business row + event row in one DB transaction, poller publishes to queue.
•Fan-out via SNS → N SQS queues when one event triggers multiple independent downstream processes.
•Three dashboard metrics: enqueue rate, dequeue rate, queue depth. Two alarms: DLQ depth > 0, oldest message age > SLA.

Core concept

The producer-consumer pattern is the single most reusable shape in distributed systems. Every time a synchronous path handles work the caller does not need to wait for — sending an email, resizing an image, updating a search index — you are looking at a candidate for this pattern. The shape is deceptively simple: a producer emits work items, a durable queue buffers them, and one or more consumers (workers) drain the queue at their own pace. A dead-letter queue (DLQ) catches messages that fail repeatedly so they do not block the healthy stream.

Architecture diagram· Producer → durable queue → N workers → DLQ

Why a queue, not a direct call?

The fundamental insight is Little's Law: in steady state, the average number of items in a system equals the arrival rate multiplied by the average processing time. If your API receives 500 image uploads per second and each transcode takes 4 seconds, you need 2,000 concurrent transcode slots just to keep up. Without a queue, each of those 2,000 slots is a thread (or a container) that the API server must keep alive — and if the API receives a burst of 2,000/s for 10 seconds, you need 8,000 slots or you start shedding requests. With a queue, the API enqueues and returns 202 Accepted in < 10 ms, and the queue absorbs the burst. Workers drain at whatever rate the GPU fleet can sustain.

The four pieces

1. Producer. The producer is the code that generates work. It should do exactly two things: validate the input and enqueue a message. It must NOT do the work itself — that is the whole point. The message should carry a reference to the payload (an S3 presigned URL, a database row id), not the payload itself. SQS messages have a 256 KB limit; Kafka records default to 1 MB. Passing references keeps messages small and lets you retry without re-uploading large blobs.

2. Queue. The queue is the shock absorber. It must be durable (survives broker restarts), ordered enough for your use case (FIFO or standard), and observable (queue-depth metric). SQS Standard gives you nearly unlimited throughput with at-least-once delivery and best-effort ordering. SQS FIFO gives you exactly-once processing and strict ordering within a message group, but caps at 3,000 messages per second per queue (with batching). Kafka gives you strict partition ordering with configurable retention, but you own the infrastructure. Choose based on your ordering, throughput, and ops-budget requirements.

3. Consumer (worker). The consumer polls the queue, processes the message, and deletes it. Three rules make or break your consumer: (a) idempotency — the same message processed twice must produce the same side-effect once; (b) visibility timeout — you must extend the lease if processing takes longer than the timeout, otherwise the message reappears and another worker picks it up, causing duplicate work; (c) graceful shutdown — on SIGTERM, finish the in-flight message before exiting, or at minimum stop polling and let the visibility timeout expire naturally.

4. Dead-letter queue (DLQ). Messages that exceed the maximum receive count (typically 3-5) are automatically moved to the DLQ by the broker. The DLQ must have an alarm (queue depth > 0 fires within 1 minute) and a replay mechanism. Never silently drop messages. The DLQ is your audit trail for every failure the system could not self-heal.

The crash-between-commit-and-side-effect problem

The subtlest bug in any queue-based system is the window between committing the side-effect and deleting the message. If a worker writes a row to the database and then crashes before calling DeleteMessage, the message reappears and the worker processes it again. This is why idempotency is not optional — it is a structural requirement of at-least-once delivery. Common strategies include idempotency keys (store the message id in a unique-constraint column), conditional writes (UPDATE ... WHERE version = expected), and natural idempotency (re-uploading the same S3 object with the same key is a no-op).

Backpressure

A queue can grow without bound if producers outpace consumers indefinitely. In production, unbounded queues eventually hit memory or cost limits and cause cascading failures. The fix is backpressure: monitor the queue depth, and when it exceeds a high-water mark, have the API return 429 to callers. This pushes pressure upstream instead of letting the queue act as an infinite trash can. The high-water mark should be set so that the current worker fleet can drain the backlog within your SLA — if your SLA is 10 minutes and each worker processes 100 messages per minute with 10 workers, your HWM is 10,000.

A well-designed producer-consumer system has three observable metrics on a single dashboard: enqueue rate, dequeue rate, and queue depth. If enqueue > dequeue, depth grows. If depth > HWM, you either autoscale workers or shed load. If DLQ depth > 0, page on-call. These three metrics tell you everything.

Canonical examples

→Upload → transcode pipeline (YouTube, TikTok)
→Email / SMS / push-notification send (Twilio, SendGrid)
→Webhook dispatcher with retries (Stripe, GitHub)
→Search index update after database write (Elasticsearch)
→Payment post-processing: receipts, loyalty points, analytics

Variants

Simple queue (single producer, single consumer type)

One producer, one queue, one consumer pool — the default starting point.

Architecture diagram· V2 — Basic: producer → SQS → worker pool

The API enqueues and returns 202 Accepted immediately. A pool of workers drains the queue independently. The user no longer waits for processing; the system tolerates spikes by growing the queue backlog.

The simplest incarnation: a single service produces messages, a single SQS Standard queue buffers them, and a pool of identical workers consumes them. This is the right choice when you have one job type, ordering does not matter at the global level, and you need at-least-once delivery.

Architecture diagram· V2 — Basic: producer → SQS → worker pool

How it works. The producer calls SendMessage with a JSON body containing a job reference (e.g., an S3 key). Workers long-poll the queue with WaitTimeSeconds=20 to minimize empty responses. Each worker receives a batch of up to 10 messages, processes them sequentially, and calls DeleteMessageBatch for the successful ones. Failed messages are left in the queue; SQS makes them visible again after the visibility timeout.

Sizing the worker pool. Start with the throughput equation: if each message takes T seconds and you need to sustain R messages per second, you need at least R × T workers. For a transcode pipeline doing 200 msg/s with 3s per transcode, you need 600 concurrent workers. In practice, add 20% headroom for GC pauses and network jitter. Use ECS with Fargate Spot for cost efficiency — transcode workers are inherently interruptible.

Observability. Alarm on ApproximateNumberOfMessagesVisible > 2× your steady-state depth for 5 minutes. Alarm on ApproximateAgeOfOldestMessage > your SLA. Alarm on DLQ depth > 0. These three alarms catch 95% of production issues before customers notice.

Limitations. SQS Standard does not guarantee ordering. If message A must be processed before message B, you need FIFO with message group ids or a different queue technology. SQS Standard also delivers at-least-once, meaning your consumer MUST be idempotent.

Pros

+Trivial to set up — SQS is serverless, no infrastructure to manage
+Scales horizontally by adding workers; no single-consumer bottleneck
+Built-in DLQ and visibility timeout handle failure without custom code
+Cost-efficient — SQS charges per request, not per idle capacity

Cons

−No ordering guarantee — use FIFO if order matters
−At-least-once delivery means duplicate processing without idempotency keys
−Single queue becomes a cost bottleneck at > 100K msg/s (Kafka may be cheaper)
−No priority lanes — all messages compete equally for workers

Choose this variant when

You have a single job type with no ordering requirement
Throughput is under 50K msg/s and you want zero infrastructure ops
You need a quick win to decouple a synchronous call path
You are on AWS and want native integration with Lambda, ECS, and CloudWatch

Competing consumers (horizontal scaling)

Multiple identical consumers race to process messages, scaling throughput linearly.

Competing consumers is the scaling strategy for simple queues. Instead of one consumer processing messages sequentially, N consumers poll the same queue and process in parallel. SQS handles the contention — each ReceiveMessage call returns messages that are invisible to other consumers for the duration of the visibility timeout.

The key contract. Every consumer is identical. They run the same code, stateless, and any consumer can handle any message. If your processing requires affinity (e.g., "all messages for user X must go to the same consumer"), you need partition-based consumption (Kafka) or FIFO with message group ids, not competing consumers.

Scaling formula. Desired workers = queue_depth / (messages_per_worker_per_minute × target_drain_minutes). If the queue has 10,000 messages, each worker processes 100/min, and you want to drain in 5 minutes, you need 20 workers. AWS Application Auto Scaling can use the ApproximateNumberOfMessagesVisible metric as a custom scaling dimension to automate this.

Visibility timeout tuning. Set visibility timeout to 2× your p99 processing time. If p99 is 30s, set timeout to 60s. Too short → messages reappear while still being processed → duplicate work. Too long → failed messages sit invisible for minutes before retry. For variable-duration jobs (some take 2s, some take 120s), use ChangeMessageVisibility to extend the timeout mid-flight — poll-extend-process-delete.

Graceful shutdown. When the autoscaler decides to remove a worker, it sends SIGTERM. The worker must stop polling, finish its current message, delete it, and then exit. If you just kill the process, the in-flight message becomes invisible until the timeout expires — wasted time. ECS gives you a stopTimeout (default 30s); make sure it exceeds your p99 processing time.

Poison message protection. A poison message is one that crashes every consumer that tries to process it (e.g., a corrupted payload, a job that triggers an OOM). Without protection, it cycles through all your workers, crashing each one. The DLQ with maxReceiveCount=3 catches it — after 3 crashes, SQS moves it to the DLQ automatically.

Pros

+Linear horizontal scaling — 2× workers ≈ 2× throughput
+No coordinator needed — the queue is the distributed lock
+Workers are stateless and disposable; perfect for spot instances
+Fault-isolated — one worker crash does not affect others

Cons

−No message affinity — cannot guarantee ordering per entity
−Hot partitions in downstream stores if all workers write to the same shard
−Autoscaling lag — takes 1-3 minutes to launch new ECS tasks on spike
−Each worker holds a DB connection; 200 workers = 200 connections (use a pool proxy)

Choose this variant when

Processing is stateless and any worker can handle any message
You need to scale throughput beyond what a single consumer can achieve
You are using SQS Standard and can tolerate at-least-once delivery
Worker cost is dominated by compute (CPU/GPU), not coordination

Priority queues (fast lane / slow lane)

Route high-priority work to a dedicated queue with more workers and tighter SLAs.

Architecture diagram· V3 — Priority: fast lane + slow lane queues

Paid users hit a high-priority queue with more workers and a lower visibility timeout. Free-tier traffic goes to the standard lane. Both drain into the same result store, but latency SLAs differ by an order of magnitude.

Most queue systems do not support message-level priority natively. SQS, Kafka, and RabbitMQ all treat messages equally within a queue or partition. The standard solution is multiple queues: a fast-lane queue for premium work and a slow-lane queue for best-effort work, each with its own consumer pool.

Architecture diagram· V3 — Priority: fast lane + slow lane queues

Routing logic. The producer (or API gateway) inspects the request metadata — user tier, job type, payment status — and routes to the appropriate queue. Keep routing stateless: a simple lookup of the user's subscription tier in a cache. Do not parse message content to decide priority; that is the producer's responsibility before enqueuing.

Worker allocation. Fast-lane workers are always over-provisioned relative to demand so that p99 latency stays low. A typical split: 60% of workers on the fast lane serving 20% of traffic, 40% on the slow lane serving 80%. This means fast-lane messages get processed in seconds; slow-lane messages may wait minutes during spikes. The asymmetry is intentional — you are spending compute budget on revenue-generating customers.

Starvation prevention. Without guardrails, a burst on the fast lane can starve the slow lane if workers are shared. Keep worker pools separate. If you must share workers, use weighted fair queuing: each poll cycle, a worker checks the fast queue first, and only polls the slow queue if the fast queue is empty. This gives strict priority to fast-lane work without dedicating fixed capacity.

SLA guarantees. Define each lane's SLA explicitly: "fast-lane p99 < 30s from enqueue to completion; slow-lane p99 < 10 minutes." Set CloudWatch alarms on ApproximateAgeOfOldestMessage for each queue independently. If the fast-lane alarm fires, it is a Sev-1 page; if the slow-lane alarm fires, it is a Sev-3 ticket. This operational asymmetry matches the business asymmetry.

When to add more lanes. Two lanes (fast + slow) cover 90% of use cases. Add a third only when you have a genuinely distinct SLA tier — e.g., internal batch jobs that can wait hours. Every additional lane adds operational surface area: separate alarms, dashboards, and capacity planning.

Pros

+Guarantees low latency for revenue-critical work without over-provisioning the whole fleet
+Clean operational model — separate alarms, dashboards, and capacity per lane
+No custom queue internals — standard SQS queues with different worker pools
+Easy to add or remove priority tiers without changing consumer code

Cons

−More infrastructure to operate — N queues × N worker pools × N alarm sets
−Routing errors send premium work to the slow lane (silent SLA violation)
−Total worker count is higher than a single-queue design (dedicated pools)
−Does not solve priority within a single lane (all messages in one lane are equal)

Choose this variant when

You have distinct customer tiers with different latency SLAs
Premium customers are willing to pay for guaranteed processing time
You want operational isolation between priority tiers
The fast-lane volume is < 30% of total traffic so over-provisioning it is affordable

Fan-out via topic (SNS → N queues)

One event triggers multiple independent consumer types via a topic-queue fan-out.

When a single event must be processed by multiple independent systems — e.g., an order placed triggers inventory update, email confirmation, analytics ingestion, and fraud check — you use a topic (SNS) that fans out to multiple queues (SQS), each with its own consumer.

Architecture. The producer publishes to an SNS topic. Each downstream system subscribes its own SQS queue to that topic. SNS copies the message to every subscribed queue. Each consumer pool processes independently, at its own rate, with its own retry policy and DLQ.

Why not one queue with multiple consumers? Because each consumer type has different processing requirements: the email sender needs 2 seconds, the analytics ingester is sub-second, the fraud check needs 10 seconds. With competing consumers on one queue, they would contend for the same messages. With fan-out, each gets its own copy.

Failure isolation. If the analytics ingester falls behind, its queue depth grows but the email sender is unaffected — it drains its own queue independently. This is the primary advantage over a single-queue design: one slow downstream does not block the others.

Filtering. SNS supports message filtering policies so subscribers only receive messages matching their criteria. The inventory service subscribes with a filter on event_type = 'order.placed'; the analytics system subscribes with no filter (receives everything). This reduces consumer-side waste without requiring the producer to publish to multiple topics.

Ordering. SNS FIFO → SQS FIFO preserves ordering within a message group. If global ordering is not needed (each downstream can process independently), use Standard SNS → Standard SQS for higher throughput and lower cost.

Cost. SNS charges per publish + per delivery. With 4 subscribers, each publish costs 4× the delivery fee. At 100K msg/s, that is 400K SQS sends/s — roughly $150/day. Compare this to the engineering cost of building a custom fan-out dispatcher and the choice is obvious.

Pros

+Clean separation of concerns — each downstream owns its queue and consumer
+Failure isolation — one slow consumer does not block others
+Adding a new downstream is a one-line SNS subscription, zero producer changes
+Native AWS integration with filtering, DLQ, and CloudWatch metrics

Cons

−Message duplication across N queues increases storage and SQS costs
−SNS message size limit (256 KB) applies; large payloads need S3 references
−End-to-end latency is higher than direct invocation (SNS + SQS delivery)
−No cross-subscriber transaction — each downstream processes independently

Choose this variant when

A single event must trigger 2+ independent downstream processes
Each downstream has different SLAs, retry policies, or processing speeds
You want to add new consumers without modifying the producer
You need failure isolation between downstream systems

Scaling path

V1 — Synchronous inline processing

Prove the product works with minimal infrastructure.

Architecture diagram· V1 — Synchronous: API → direct call to service

The naive starting point. The caller blocks while the downstream service processes. Any spike overwhelms the service; any service failure cascades back to the user.

The API receives a request and does all the work inline before responding. A POST /upload receives the file, transcodes it, writes the result to S3, updates the database, and returns 200. This is fine at 10 req/s with 500 ms processing time — you need 5 concurrent handlers, well within a single server.

Why it breaks. At 100 req/s with 3s processing, you need 300 threads. At 500 req/s you need 1,500 threads. Thread pools exhaust, latency spikes to 30s, and the load balancer health check fails. Worse: a downstream transcode service outage blocks ALL API requests, even those that do not need transcoding. The synchronous coupling makes every failure a total outage.

What triggers the next iteration

Thread pool exhaustion at > 100 concurrent requests
User-facing latency equals processing time (3-10s)
Downstream failure cascades to the API — no isolation
No retry mechanism — failed jobs are lost

V2 — Basic queue with worker pool

Decouple production from consumption; absorb traffic spikes.

Architecture diagram· V2 — Basic: producer → SQS → worker pool

Replace the synchronous call with an enqueue. The API writes a message to SQS and returns 202 Accepted in < 10 ms. A separate fleet of workers polls the queue and processes jobs. The queue absorbs burst traffic — a 10x spike for 30 seconds adds 15,000 messages to the queue, which the workers drain over the next few minutes.

Architecture diagram· V2 — Basic: producer → SQS → worker pool

Key decisions. (a) SQS Standard vs FIFO — choose Standard unless you need per-entity ordering. (b) Batch size — workers process up to 10 messages per poll to amortize the ReceiveMessage API call cost. (c) Visibility timeout — set to 2× p99 processing time to prevent duplicate deliveries. (d) DLQ — maxReceiveCount = 3, alarm on depth > 0.

Result. API latency drops from 3s to 10 ms. Throughput is limited only by the worker fleet size, which autoscales on queue depth. The system handles 10x traffic spikes with zero API impact.

What triggers the next iteration

All jobs share one queue — no priority differentiation
Autoscaling lag (1-3 min) causes temporary backlog growth during spikes
Single region — cross-region latency for producers in other regions
Worker pool DB connections scale linearly; need connection pooling at 100+ workers

V3 — Priority lanes + autoscaling

Differentiate SLAs by customer tier; scale workers automatically.

Architecture diagram· V3 — Priority: fast lane + slow lane queues

Split the single queue into two: a fast lane for paid customers and a slow lane for free tier. Each lane has its own autoscaling group. The fast lane scales aggressively (scale up at queue depth > 100) with a short cool-down (60s). The slow lane scales conservatively (scale up at depth > 5,000) with a long cool-down (300s).

Architecture diagram· V3 — Priority: fast lane + slow lane queues

Autoscaling strategy. Use ECS Service Auto Scaling with a step-scaling policy based on the SQS ApproximateNumberOfMessagesVisible metric. Step policy: 0-500 messages → 5 workers, 500-2000 → 20 workers, 2000-10000 → 50 workers, 10000+ → 100 workers. Use a target-tracking policy for the fast lane: target = 10 messages per worker (latency-optimised). Use step scaling for the slow lane: cost-optimised, drain within SLA.

Result. Paid-tier p99 drops below 30s. Free-tier p99 stays under 10 minutes. Total cost decreases 25% because slow-lane workers use Fargate Spot at 70% discount.

What triggers the next iteration

Two queues and two worker pools double the operational surface area
Cross-region producers still pay ~100 ms network latency to the queue region
Large jobs (video transcode > 5 min) hold workers idle during scale-down cool-down
Priority misrouting is a silent failure — needs a routing-accuracy metric

V4 — Multi-region queues with global reconciliation

Keep processing local to the producer; handle cross-region consistency.

Architecture diagram· V4 — Multi-region queues

Each region has its own queue + worker fleet, keeping latency local. A global reconciler resolves cross-region ordering conflicts post-fact. Failed messages route to a regional DLQ with centralized alerting.

Deploy a queue and worker fleet in each major region (us-east, eu-west, ap-south). Producers enqueue locally for sub-10 ms latency. Workers process locally, writing results to a regional data store. A global reconciler merges results post-fact for any cross-region aggregation or reporting needs.

Architecture diagram· V4 — Multi-region queues

Cross-region ordering. Within a region, ordering is handled by FIFO queues or Kafka partitions. Across regions, strict ordering is not feasible without a global sequencer (which adds latency and a single point of failure). Instead, use vector clocks or Lamport timestamps and reconcile asynchronously. For most producer-consumer workloads (image transcode, notification send), cross-region ordering does not matter.

DLQ strategy. Each region has its own DLQ. A centralised alarm aggregator (CloudWatch cross-account + Datadog) monitors all regional DLQs. Replay is region-local — the operator replays from the regional DLQ back to the regional source queue.

Result. Producer enqueue latency < 10 ms globally. Processing latency stays local (no cross-region network in the hot path). Blast radius of a regional outage is limited to that region's queue depth — other regions continue processing.

What triggers the next iteration

Global reconciler is a new single point of failure for cross-region consistency
Operational complexity: N regions × 2 queues × 2 worker pools × alarms
Cost of running hot standby worker pools in every region
Debugging cross-region message flow requires distributed tracing (X-Ray / Jaeger)

Deep dives

Visibility timeout tuning

Architecture diagram· Visibility timeout — message lifecycle timeline

A message becomes invisible after a consumer leases it. If the consumer finishes and deletes it, done. If it crashes, the message reappears after the timeout expires. After maxReceiveCount reappearances, SQS moves it to the DLQ.

The visibility timeout is the most misunderstood parameter in SQS. When a consumer calls ReceiveMessage, SQS hides that message from other consumers for the duration of the timeout. If the consumer deletes the message before the timeout expires, everything is clean. If the consumer crashes or takes too long, the message reappears and another consumer picks it up.

Architecture diagram· Visibility timeout — message lifecycle timeline

Setting the timeout. Rule of thumb: set it to 6× your median processing time or 2× your p99, whichever is larger. If median = 5s and p99 = 30s, set timeout to 60s. Too short (10s) means slow-but-successful jobs get duplicated. Too long (600s) means genuinely failed jobs sit invisible for 10 minutes before retry — your time-to-recovery suffers.

Dynamic extension. For variable-duration jobs (some take 1s, some take 120s), do not set a 240s timeout for all messages. Instead, set a conservative default (60s) and call ChangeMessageVisibility every 30s in a heartbeat loop while processing continues. This keeps the timeout tight for fast jobs while preventing premature reappearance for slow ones. AWS SDK v3 has no built-in heartbeat — you need to implement it: start a setInterval that calls ChangeMessageVisibility with a new timeout, and clear it when processing completes.

Interaction with maxReceiveCount. Every time a message reappears (visibility timeout expires without deletion), its receive count increments. When receive count exceeds maxReceiveCount, SQS moves it to the DLQ. If your timeout is too short and jobs occasionally take longer, you will see healthy messages in the DLQ — a false-positive that erodes trust in your alerting. Monitor the ApproximateReceiveCount metric to catch this early.

FIFO queues. In FIFO queues, a timed-out message blocks the entire message group until it is processed or moved to the DLQ. This is by design (strict ordering), but it means a single slow message can halt processing for all messages in that group. Keep message groups small and processing times predictable.

Transactional outbox pattern

Architecture diagram· Transactional outbox pattern

The service writes to its business table and the outbox table inside one DB transaction. A poller (or CDC stream) reads the outbox and publishes to the queue. This guarantees exactly-once publishing without two-phase commit.

The "dual write" problem: your service needs to update the database AND publish a message to the queue. If it writes the DB first and crashes before publishing, the event is lost. If it publishes first and crashes before writing, the DB is inconsistent. You cannot do both atomically without a distributed transaction — which is impractical at scale.

Architecture diagram· Transactional outbox pattern

The solution. Write both the business data and the outbound event into the SAME database, in the SAME transaction. The "outbox" is just a table:

CREATE TABLE outbox (id BIGSERIAL, aggregate_id UUID, event_type TEXT, payload JSONB, created_at TIMESTAMPTZ DEFAULT now(), published BOOLEAN DEFAULT false).

The service inserts into the orders table and the outbox table in one BEGIN...COMMIT. A separate poller process queries the outbox for unpublished rows, publishes them to SQS, and marks them published. If the poller crashes after publishing but before marking, it re-publishes — which is fine because the consumer is idempotent.

CDC alternative. Instead of a poller, use Change Data Capture (Debezium on Postgres WAL, DynamoDB Streams, or Aurora Event Notifications) to stream outbox inserts directly to Kafka or SQS. This eliminates the polling delay (milliseconds instead of seconds) and removes the poller as a component to operate.

Cleanup. The outbox table grows without bound if you do not prune it. Run a nightly job that deletes published rows older than 7 days. Keep unpublished rows forever — they represent data the downstream has never seen, which is a bug you want to investigate, not garbage-collect.

When to skip it. If your queue is SQS and your only concern is idempotency (not dual-write atomicity), you can skip the outbox and rely on consumer-side idempotency. The outbox pattern is specifically for the case where "DB write happened but event was never published" would cause business-visible data loss.

Idempotency layers

At-least-once delivery means the same message may arrive at your consumer twice (or more). Without idempotency, duplicate processing causes double charges, duplicate emails, or corrupted state. Idempotency is not a nice-to-have — it is a structural requirement.

Layer 1: message deduplication at the broker. SQS FIFO provides exactly-once processing within a 5-minute deduplication window using the MessageDeduplicationId. If the producer sends the same dedup id twice within 5 minutes, SQS silently drops the duplicate. This handles producer retries (network timeout → retry → duplicate publish) but does NOT handle consumer-side duplicates caused by visibility timeout expiry.

Layer 2: idempotency key at the consumer. Before processing, the consumer checks whether this message id has already been processed. The simplest implementation: an idempotency_keys table with a unique constraint on (message_id). INSERT INTO idempotency_keys (message_id) — if it succeeds, process the message; if it violates the unique constraint, skip it. Use a TTL (e.g., 7 days) to keep the table from growing unbounded.

Layer 3: naturally idempotent operations. Some operations are idempotent by nature: PUT an object to S3 with the same key overwrites safely. UPDATE users SET email = 'x' WHERE id = 1 is idempotent. INSERT is not. Design your consumer to use idempotent operations wherever possible — it is simpler and cheaper than maintaining an idempotency key store.

Layer 4: conditional writes. Use optimistic concurrency: UPDATE orders SET status = 'shipped', version = 6 WHERE id = ? AND version = 5. If the row was already updated (version is now 6), the UPDATE affects 0 rows and the consumer knows it is a duplicate. This works without an external idempotency store but requires the data model to support versioning.

Combining layers. Use Layer 1 (broker dedup) to catch producer-side duplicates, and Layer 2 or 4 (consumer-side) to catch redeliveries from visibility timeout expiry. Layer 3 (natural idempotency) is always preferable when the operation supports it.

Dead-letter queues and poison messages

Architecture diagram· DLQ retry chain → alert → replay

Messages retry with exponential backoff. After maxReceiveCount they land in the DLQ. An alarm fires on DLQ depth > 0. An operator inspects, fixes the root cause, and replays messages back to the source queue.

A dead-letter queue (DLQ) is a separate queue where messages are sent after exceeding the maximum receive count. It serves three purposes: (1) prevent poison messages from cycling endlessly through the consumer fleet, (2) preserve failed messages for investigation, and (3) provide a replay mechanism after the root cause is fixed.

Architecture diagram· DLQ retry chain → alert → replay

Poison messages. A poison message is one that causes every consumer to fail deterministically — a malformed JSON payload, a reference to a deleted S3 object, or a job that triggers an OOM in the worker. Without a DLQ, it reappears after every visibility timeout, crashes every worker that picks it up, and blocks throughput on the healthy messages behind it. With maxReceiveCount = 3, it moves to the DLQ after 3 failures, and the healthy stream resumes within seconds.

Alerting. Set a CloudWatch alarm on ApproximateNumberOfMessagesVisible on the DLQ with threshold = 1 and period = 60 seconds. Any message in the DLQ is an anomaly — it means something failed that the retry logic could not self-heal. This alarm should page on-call, not just log to a dashboard.

Replay. After fixing the root cause (deploying a code fix, restoring the deleted S3 object), replay messages from the DLQ back to the source queue. AWS provides StartMessageMoveTask API for SQS DLQ redrive. For custom replay, read from the DLQ, re-publish to the source queue, and delete from the DLQ. Always replay to the source queue, not directly to the consumer — this ensures the message goes through the same retry logic.

DLQ retention. Set the DLQ message retention period to 14 days (SQS max). This gives you two weekends to investigate. If the message expires from the DLQ before investigation, the data is gone. For compliance-sensitive systems, archive DLQ messages to S3 before they expire.

DLQ of the DLQ. Never. If the DLQ itself needs a DLQ, your retry logic is broken. Fix the consumer.

Backpressure design

Architecture diagram· Backpressure: full queue → 429 upstream

When the queue depth exceeds a configurable high-water mark, the producer-side middleware starts returning 429 to callers. This pushes backpressure upstream instead of accepting work the system cannot drain in SLA time.

A queue without backpressure is an unbounded buffer. If producers enqueue faster than consumers drain for an extended period, the queue grows until something breaks: SQS charges spike, message age exceeds the retention period and messages are silently deleted, or downstream stores get overwhelmed when the backlog finally drains.

Architecture diagram· Backpressure: full queue → 429 upstream

High-water mark (HWM). Define the maximum acceptable queue depth based on your SLA and drain rate. If your SLA is "process within 15 minutes" and your fleet drains 1,000 msg/min, your HWM is 15,000 messages. When depth exceeds HWM, the system must shed load.

Shedding mechanisms. (1) API-level 429: the API gateway checks queue depth (cached, refreshed every 5s) and returns 429 Too Many Requests to callers. Include a Retry-After header so clients know when to retry. (2) Producer-side circuit breaker: the producer checks an in-memory gauge of the queue depth and stops enqueuing. (3) Admission control: a front-door rate limiter caps the ingest rate at the known drain rate.

Graduated backpressure. Instead of a binary on/off, use multiple thresholds: at 50% HWM, start shedding free-tier traffic. At 80% HWM, shed all non-critical traffic. At 100% HWM, shed everything and page on-call. This preserves availability for premium customers during sustained overload.

Autoscaling as the first defense. Backpressure should be the second defense. The first is autoscaling: when queue depth rises, add workers. Only when autoscaling reaches its maximum (cost cap, resource limit, or scaling lag) should backpressure engage. Configure your autoscaling max to handle your expected peak + 50% headroom; configure backpressure HWM to handle what autoscaling cannot.

Metric: queue drain rate. Monitor dequeue rate (messages deleted per minute) alongside queue depth. If depth is rising but drain rate is at maximum, you have a capacity problem → add workers or engage backpressure. If depth is rising and drain rate is low, you have a consumer problem → investigate failures.

Consumer autoscaling strategies

Architecture diagram· Queue depth → autoscaler → workers

A CloudWatch alarm monitors ApproximateNumberOfMessagesVisible. When depth crosses the scale-up threshold, an autoscaler adds workers. When depth drops, workers are scaled down after a cool-down period.

Autoscaling consumers based on queue depth is the canonical way to match compute supply to demand in a producer-consumer system. The idea is simple: more messages → more workers. But the implementation details determine whether it works well or oscillates between over- and under-provisioning.

Architecture diagram· Queue depth → autoscaler → workers

Target-tracking vs step scaling. Target-tracking: "maintain 100 messages per worker." AWS divides queue depth by desired count and adjusts. This is simple and works for steady workloads but can oscillate with bursty traffic. Step scaling: "0-500 messages → 5 workers, 500-2000 → 15, 2000+ → 50." More predictable because each step is a fixed decision, but you must tune thresholds manually. For most producer-consumer workloads, step scaling gives better control.

Scale-up speed. The critical parameter is how fast you can add workers when a burst arrives. ECS Fargate launch time is 30-90 seconds. EC2 launch time is 2-5 minutes. Lambda is instant (but limited to 15-minute execution). During the scale-up window, the queue backlog grows. Size your minimum fleet to handle steady-state traffic so that bursts only require scaling, not cold starts from zero.

Scale-down cool-down. Aggressive scale-down causes flapping: workers scale down → depth rises → workers scale up → depth drops → repeat. Set a cool-down period of 5-10 minutes after scale-down. Also set a minimum worker count above zero — scaling to zero means the next message waits for a cold start. A minimum of 2-3 workers keeps the system responsive during low-traffic periods.

Cost-aware scaling. Use Fargate Spot for worker pools. Spot interruptions are fine because the visibility timeout ensures interrupted messages reappear. Set your Spot/on-demand mix to 80/20 — 80% Spot for cost savings, 20% on-demand as a baseline that survives Spot capacity shortages.

Queue-depth metric lag. SQS updates ApproximateNumberOfMessagesVisible every ~1 minute. This means your scaling decisions are based on 1-minute-old data. For sub-minute bursts, the queue absorbs the spike before the scaler reacts. This is acceptable — the queue is doing its job as a shock absorber. If you need faster reaction, publish a custom metric from your consumer fleet (messages processed per second) and scale on that.

Case studies

Stripe

Webhook delivery and retry pipeline

Stripe delivers over 1 billion webhook events per day to merchant endpoints. Each event (payment succeeded, subscription cancelled, dispute created) must be delivered reliably even when the merchant's server is down.

Architecture. When a payment event occurs, Stripe's event system publishes to an internal Kafka topic. A fan-out service copies the event to a per-merchant SQS queue (avoiding head-of-line blocking — one slow merchant does not delay others). A fleet of delivery workers polls per-merchant queues and makes HTTPS POST requests to the merchant's configured webhook URL.

Retry strategy. If the merchant returns a non-2xx response, the delivery worker does not retry immediately. Instead, it uses exponential backoff with jitter: 1 min, 5 min, 30 min, 2 hours, 8 hours, up to 3 days. Each retry is a new SQS message with an increasing delay. After the final attempt, the event moves to a DLQ and Stripe marks the endpoint as "disabled" in the dashboard, alerting the merchant via email.

Idempotency. Stripe includes an idempotency key (event id) in every webhook payload. Merchants are expected to deduplicate on this id because network retries and SQS redelivery can both cause duplicates. Stripe's own delivery workers use an idempotency table to avoid double-delivery within a 24-hour window.

Numbers. ~1.2 billion events/day, 99.97% first-attempt delivery success rate, p50 delivery latency 340 ms, p99 delivery latency 4.5s (dominated by slow merchant endpoints). The DLQ processes ~3.6 million events/day (0.3%) which are mostly caused by merchant endpoint downtime.

Takeaway

Per-entity queues prevent head-of-line blocking. Exponential backoff with jitter prevents thundering herd on recovery. The DLQ is a business feature, not just an engineering safeguard.

Uber

Trip event processing pipeline

Uber processes over 500 million trip-related events per day across its global fleet. Events include ride requests, driver assignments, trip start/end, fare calculations, and post-trip actions (receipts, ratings, surge pricing updates).

Architecture. Each microservice (Trip, Dispatch, Pricing, Payments) publishes events to a shared Kafka cluster with 10,000+ partitions. Consumer groups for each downstream service read independently. The Trip Completion consumer group processes trip-end events to trigger receipts, driver payouts, rider charges, and analytics ingestion — all as competing consumers on the same Kafka partition set.

Partitioning strategy. Events are partitioned by trip_id, ensuring all events for a single trip go to the same partition and are processed in order. This eliminates the need for cross-partition coordination — a single consumer handles the complete lifecycle of a trip. Partition count is set to 4× the expected peak consumer count to allow rebalancing headroom.

Backpressure. Kafka consumer lag is the primary backpressure signal. When lag exceeds 5 minutes (measured by Burrow), an autoscaler adds consumers to the group. When lag exceeds 15 minutes, an alarm pages the owning team. The producer side does not throttle — Kafka's retention (7 days) acts as the buffer, and producers are never told to slow down.

Failure handling. Poison messages (events that crash consumers) are detected by a circuit breaker: if a partition's consumer fails 3× on the same offset, the consumer skips the message, publishes it to a DLQ topic, and continues. This prevents one bad event from blocking the entire partition.

Numbers. ~500M events/day, 10,000 Kafka partitions, 800 consumers across 20 consumer groups, p50 end-to-end latency 200 ms, p99 2.1s. Peak during New Year's Eve: 18,000 events/second sustained for 6 hours.

Takeaway

Partition by entity id for ordered processing. Use consumer lag (not queue depth) as the autoscaling signal for Kafka. Skip-and-DLQ individual messages instead of blocking the partition.

GitHub

Background job processing with Aqueduct

GitHub processes over 300 million background jobs per day using its internal job framework (Aqueduct, built on top of Redis-based queues). Jobs include push webhook delivery, CI/CD trigger, code search indexing, dependency graph updates, and notification fan-out.

Architecture. Each job type is a Ruby class that defines a perform method. When a developer calls MyJob.perform_async(args), the framework serializes the job and enqueues it to a Redis-backed queue. Worker processes poll the queue, deserialize the job, and execute the perform method. Each queue has a dedicated worker pool with configurable concurrency.

Priority model. GitHub uses 5 priority levels (critical, high, default, low, bulk). Push webhook delivery is "critical" (processed within 1 second). Dependency graph updates are "bulk" (processed within 30 minutes). Each priority level maps to a separate Redis list. Workers check queues in priority order: drain all critical jobs before touching high, drain all high before touching default. This ensures latency-sensitive jobs are never delayed by bulk work.

Reliability. Redis is not durable by default — a crash loses the in-memory queue. GitHub runs Redis with AOF persistence (fsync every second) and cross-AZ replication. Despite this, they accept that ~0.01% of jobs may be lost in a catastrophic Redis failure. For truly critical jobs (billing, security alerts), they use the outbox pattern with Postgres as the source of truth and Redis as the fast path.

Numbers. ~300M jobs/day, 5 priority queues, 1,200 worker processes across 150 hosts, p50 queue wait time 15 ms (critical), p50 120 ms (default), p99 2.3s (default). Peak during Copilot launches: 850K jobs/minute sustained for 45 minutes.

Takeaway

Priority queues are implemented as separate Redis lists with priority-ordered polling. Redis speed comes at the cost of durability — use the outbox pattern for must-not-lose jobs.

Decision levers

Queue technology

SQS Standard is the default for 80% of use cases: serverless, zero-ops, at-least-once with DLQ. Switch to SQS FIFO when you need strict ordering within an entity (e.g., state machine transitions for an order). Switch to Kafka when throughput exceeds 100K msg/s, you need indefinite retention for replay, or you want consumer-group semantics for multiple independent readers. Switch to RabbitMQ when you need complex routing patterns (topic exchanges, dead-letter exchanges with TTL-based retry).

Visibility timeout vs acknowledgement

SQS uses visibility timeout: the message is hidden for N seconds, and if not deleted, reappears. Kafka uses offset commit: the consumer explicitly marks the offset as processed. RabbitMQ uses ack/nack: the consumer explicitly acknowledges or rejects. The visibility timeout model is simpler (no explicit ack code) but less flexible — you cannot partially acknowledge a batch. Kafka's offset model gives you replay for free (reset offset to re-process) but requires careful commit management to avoid data loss.

Idempotency strategy

Choose based on your data model. If you have a natural unique key (S3 object key, order id + action), use conditional writes. If your operation is naturally idempotent (PUT, UPDATE with fixed value), rely on that. If neither applies, implement an idempotency key table with a unique constraint and a TTL. SQS FIFO provides broker-level deduplication for a 5-minute window, which handles producer retries but not consumer-side redelivery.

Retry policy and backoff

Immediate retry helps with transient errors (network blip). Exponential backoff helps with downstream overload (database connection limit). Jitter prevents thundering herd when many messages fail simultaneously. A good default: retry 3 times with delays of 1s, 5s, 25s (5× exponential). After 3 retries, move to DLQ. For SQS, implement backoff by setting increasing delay on re-enqueue; for Kafka, use a retry topic with a delay consumer.

Consumer scaling strategy

Target-tracking ("maintain 100 msg per worker") is simple but oscillates under bursty load. Step scaling ("0-500 → 5 workers, 500-2000 → 15, 2000+ → 50") is more predictable. For Kafka, scale by partition count — you cannot have more consumers than partitions in a group. For SQS, you can have arbitrarily many consumers. Always set a minimum worker count > 0 to avoid cold-start latency on the first message after idle.

Failure modes

Message loss on dual write

The service writes to the database and then publishes to the queue. If it crashes between the two operations, the event is lost. The queue never receives it; the consumer never processes it. Fix: use the transactional outbox pattern — write the event to an outbox table in the same DB transaction, and let a poller or CDC stream publish it to the queue.

Duplicate processing without idempotency

Visibility timeout expires while a slow consumer is still processing. The message reappears; another consumer picks it up. Now two consumers are processing the same job. The result: double charges, duplicate emails, corrupted aggregates. Fix: implement idempotency at the consumer level — idempotency key table, conditional writes, or naturally idempotent operations.

Poison message blocking the queue

A malformed message crashes every consumer that attempts to process it. Without a DLQ, it cycles endlessly: receive → crash → reappear → receive → crash. Every cycle wastes compute and delays healthy messages. Fix: set maxReceiveCount (3-5) on the queue, configure a DLQ, and alarm on DLQ depth > 0.

Head-of-line blocking in FIFO queuesAdvanced

In SQS FIFO, messages within the same message group are delivered in order. If one message in the group fails and retries, all subsequent messages in that group are blocked until it succeeds or moves to the DLQ. Fix: use granular message group ids (per entity, not per queue), and keep maxReceiveCount low so poison messages move to the DLQ quickly.

Unbounded queue growth (missing backpressure)

Producers enqueue faster than consumers drain for hours. The queue grows to millions of messages. SQS costs spike. When the backlog finally drains, the downstream store gets overwhelmed by the write burst. Fix: implement backpressure — monitor queue depth, return 429 above a high-water mark, and autoscale workers as the first defense.

Autoscaling lag causing SLA breachAdvanced

A traffic spike adds 50,000 messages to the queue in 30 seconds. The autoscaler takes 90 seconds to launch new ECS tasks. During the lag, the queue depth exceeds the SLA drain time. Fix: maintain a warm minimum worker count that handles 2× steady-state traffic, use step scaling with aggressive thresholds, and accept that the queue absorbs sub-minute bursts by design.

Silent message expirationAdvanced

SQS messages have a maximum retention period (14 days default). If a consumer is down for longer than the retention period, messages are silently deleted. There is no alarm for this by default. Fix: alarm on ApproximateAgeOfOldestMessage approaching the retention limit. For critical systems, archive messages to S3 before they expire using an SQS → Lambda → S3 pipeline.

Decision table

Choosing your queue technology

Dimension	SQS Standard	SQS FIFO	Kafka	RabbitMQ
Ordering	Best-effort	Strict per group	Strict per partition	Per queue (single consumer)
Delivery	At-least-once	Exactly-once (5 min window)	At-least-once (configurable)	At-least-once / at-most-once
Throughput	Nearly unlimited	3K msg/s (batch)	Millions/s (partitioned)	50K msg/s (tuned)
Retention	4-14 days	4-14 days	Configurable (days-infinite)	Until consumed
Ops burden	Zero (serverless)	Zero (serverless)	High (clusters, ZK/KRaft)	Medium (Erlang cluster)
Cost model	Per request	Per request (higher)	Per broker-hour + storage	Per node-hour
Best for	General async jobs	Ordered workflows	High-throughput streaming	Complex routing (exchanges)

SQS is the default choice for AWS shops — zero ops, pay-per-use, native DLQ.
Kafka wins at > 100K msg/s sustained or when you need replay from arbitrary offsets.
RabbitMQ wins when you need complex routing (topic exchanges, headers-based routing).

Worked example

Worked example: image upload and transcode pipeline

You are designing a system where users upload images, and the system generates multiple sizes (thumbnail, medium, large) plus a WebP variant. The upload rate is 200 images/second at peak, each transcode takes 2-4 seconds, and users expect the thumbnails to appear within 60 seconds.

Architecture diagram· Producer → durable queue → N workers → DLQ

Step 1: accept the upload (producer)

The client uploads the image to S3 via a presigned URL (bypassing the API server for large payloads). When the upload completes, S3 emits an event to an SNS topic (or the client calls POST /images to confirm). The API validates the upload, writes an image record to the database with status=pending, and enqueues a transcode job to SQS:

{ "image_id": "img-abc123", "s3_key": "uploads/raw/img-abc123.jpg", "sizes": ["thumb", "medium", "large", "webp"], "user_id": "usr-456", "priority": "paid" }

The API returns 202 Accepted with the image_id. The client polls GET /images/img-abc123/status or subscribes to a WebSocket channel for completion notification.

Step 2: configure the queue

Use SQS Standard (ordering does not matter for independent image transcodes). Set visibility timeout to 30 seconds (2× p99 transcode time of 12s, rounded up). Set maxReceiveCount = 3, pointing to a DLQ. Set retention to 4 days (more than enough for investigation).

For the paid-tier use case: create two queues — fast-lane (paid users) and slow-lane (free users). The API checks the user's subscription tier and routes accordingly.

Step 3: consumer implementation

Each worker runs a loop:

1Long-poll the queue (WaitTimeSeconds=20, MaxNumberOfMessages=1 for image jobs since they are CPU-heavy)
2Download the raw image from S3
3Generate all requested sizes using Sharp (Node) or Pillow (Python)
4Upload each size to S3 at a deterministic key: outputs/{size}/img-abc123.{ext}
5Update the database: UPDATE images SET status='done', sizes=jsonb_payload WHERE id='img-abc123' AND status='pending'
6Delete the SQS message

The WHERE status='pending' clause provides natural idempotency — if the message is processed twice, the second UPDATE matches zero rows and the consumer skips the redundant work.

Step 4: autoscaling

Apply step scaling on ApproximateNumberOfMessagesVisible:

0-200 messages → 10 workers (steady state: 200 msg/s ÷ 20 msg/worker/s = 10 workers needed for real-time drain)
200-1000 → 30 workers (burst absorption)
1000-5000 → 60 workers (sustained spike)
5000+ → 100 workers (maximum fleet)

Workers run on ECS Fargate Spot (80%) + On-Demand (20%). Each worker is a single-container task with 2 vCPU / 4 GB RAM (image processing is CPU-bound). Scale-up cool-down: 60 seconds. Scale-down cool-down: 300 seconds.

Step 5: DLQ and alerting

DLQ alarm: ApproximateNumberOfMessagesVisible > 0 for 1 minute → pages on-call. Common DLQ causes: corrupted image files that crash Sharp, S3 permission errors on the output bucket, database connection exhaustion during spike. The operator inspects the DLQ message, fixes the root cause (deploy a fix, increase DB connection limit), and uses StartMessageMoveTask to replay.

Step 6: observability dashboard

Three graphs: (1) Enqueue rate vs dequeue rate (should track closely; divergence means backlog). (2) Queue depth over time (should hover near zero in steady state). (3) p50/p99 message age (should stay well under the 60-second SLA). One alarm: DLQ depth > 0.

The math

Peak: 200 uploads/s × 4 sizes × 3s avg transcode = 2,400 concurrent transcode operations. With 20 msg/worker/s throughput (pipelining downloads while transcoding), you need 120 concurrent workers at peak. The step-scaling policy reaches 100 workers at 5,000 messages — sufficient because the queue absorbs the first 30 seconds of burst while workers scale up. Steady state: 50 uploads/s → 10 workers → queue depth ≈ 0.

Total cost estimate: 100 Fargate Spot tasks × 2 vCPU × $0.012/vCPU-hour × 4 peak hours/day = $9.60/day for burst capacity. Steady-state: 10 tasks × 24 hours × $0.024/vCPU-hour = $11.52/day. Total: ~$21/day or ~$640/month including SQS costs (~$50/month at 200M messages).

Interview playbook

Interview playbook8-12 minutes in a 45-minute system design interview. Spend 2 min on the async boundary decision, 3 min on the skeleton + queue choice, 3 min on failure handling (idempotency, DLQ), 2 min on scaling and backpressure.

When it comes up

The prompt involves a user action that triggers heavy background processing (transcode, report generation)
The interviewer asks "what happens if this downstream service is slow?"
You identify work the user does not need to wait for (emails, notifications, analytics)
The system has bursty traffic patterns and the processing backend has steady capacity
The interviewer probes on reliability: "what if the server crashes mid-processing?"

Order of reveal

1
1. Name the async boundary. The user does not need to wait for this work. I will decouple with a queue — the API enqueues and returns 202 Accepted immediately.
2
2. Draw the skeleton. Producer → durable queue → worker pool → DLQ. Let me sketch that on the whiteboard.
3
3. Choose the queue tech. SQS Standard for this use case — serverless, at-least-once delivery, no ordering requirement. If we needed ordering per entity, I would use FIFO with message group ids.
4
4. Address idempotency. At-least-once means duplicates are possible. I will use conditional writes — UPDATE WHERE status = pending — so re-processing the same message is a no-op.
5
5. Handle failure. maxReceiveCount of 3, then DLQ. Alarm on DLQ depth > 0. The operator replays after fixing the root cause.
6
6. Autoscaling. Step scaling on queue depth: 0-500 → 5 workers, 500-2000 → 15, 2000+ → 50. Workers on Fargate Spot for cost efficiency.
7
7. Backpressure. If queue depth exceeds the high-water mark — say, 15,000 messages — the API returns 429 to shed load upstream instead of growing the queue unboundedly.

Signature phrases

“The queue is the shock absorber.”

“Enqueue and return 202 — the user does not wait for side-effects.”

“At-least-once means idempotency is structural, not optional.”

“DLQ depth > 0 is an anomaly — page on-call, do not just log it.”

“Backpressure pushes the problem upstream instead of hiding it in unbounded queue growth.”

“Visibility timeout equals 2× your p99 processing time.”

“The queue is the shock absorber.” — Frames the queue as a deliberate engineering choice for burst absorption, not just "use a queue because queues are good."
“Enqueue and return 202 — the user does not wait for side-effects.” — Shows you understand the fundamental async boundary decision.
“At-least-once means idempotency is structural, not optional.” — Demonstrates awareness that queue semantics dictate consumer design.
“DLQ depth > 0 is an anomaly — page on-call, do not just log it.” — Shows operational maturity: DLQ is an alarm, not a log line.
“Backpressure pushes the problem upstream instead of hiding it in unbounded queue growth.” — Shows you think about steady-state sustainability, not just happy-path throughput.
“Visibility timeout equals 2× your p99 processing time.” — Concrete, memorable sizing rule that interviewers remember and repeat.

Likely follow-ups

?“What if processing time varies widely — some jobs take 1 second, others take 5 minutes?”Reveal

Use a short default visibility timeout (60s) and extend it dynamically with ChangeMessageVisibility in a heartbeat loop. Every 30 seconds, the worker extends the timeout by another 60 seconds. When processing completes, delete the message. This keeps the timeout tight for fast jobs while preventing premature reappearance for slow ones. For extremely long jobs (> 15 min), consider a separate "long-running" queue with a 15-minute timeout.

?“How do you guarantee exactly-once processing?”Reveal

You cannot guarantee exactly-once at the infrastructure level with SQS Standard. You guarantee it at the application level: (1) SQS FIFO with deduplication handles producer retries within a 5-minute window, (2) consumer-side idempotency keys handle redelivery from visibility timeout expiry. The combination gives you effectively-exactly-once. True exactly-once requires a distributed transaction, which is impractical at scale.

?“How would you handle priority between different job types?”Reveal

Separate queues per priority tier with separate worker pools. Fast lane gets more workers and tighter autoscaling thresholds. Slow lane uses Fargate Spot for cost. Do not try to implement priority within a single SQS queue — SQS has no native priority mechanism. The routing decision happens at enqueue time based on the user's tier or job type.

?“What happens if the queue itself goes down?”Reveal

SQS is a managed, multi-AZ service with 99.999% availability — it effectively does not go down. If you are using self-managed Kafka or RabbitMQ, you need a replication factor of 3, cross-AZ placement, and a failover strategy. The producer should have a circuit breaker: if the queue is unreachable, buffer locally (in-memory with a size cap) and retry with backoff. Never drop the work silently.

?“How do you monitor the health of this system?”Reveal

Three metrics on one dashboard: enqueue rate, dequeue rate, and queue depth. If enqueue > dequeue, depth grows → autoscale or investigate. Two alarms: (1) DLQ depth > 0 → page on-call. (2) ApproximateAgeOfOldestMessage > SLA → the system is falling behind. Optional: consumer error rate (failed deletions / total polls) to catch code bugs before they fill the DLQ.

?“When would you choose Kafka over SQS?”Reveal

Three scenarios: (1) throughput > 100K msg/s sustained — Kafka is cheaper per message at high volume. (2) You need replay — Kafka retains messages for configurable periods; consumers can reset their offset to re-process. SQS deletes on consume. (3) You need multiple independent consumer groups reading the same topic — Kafka is built for this; SQS requires SNS fan-out to separate queues.

Code snippets

typescriptSQS consumer with visibility timeout extension

import { SQSClient, ReceiveMessageCommand, DeleteMessageCommand, ChangeMessageVisibilityCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({});
const QUEUE_URL = process.env.QUEUE_URL!;
const VISIBILITY_TIMEOUT = 60; // seconds
const HEARTBEAT_INTERVAL = 25_000; // ms — extend before timeout expires

async function pollAndProcess(): Promise<void> {
  const { Messages } = await sqs.send(new ReceiveMessageCommand({
    QueueUrl: QUEUE_URL,
    MaxNumberOfMessages: 1,
    WaitTimeSeconds: 20,
  }));
  if (!Messages?.length) return;
  const msg = Messages[0];

  // Start heartbeat to extend visibility while processing
  const heartbeat = setInterval(async () => {
    await sqs.send(new ChangeMessageVisibilityCommand({
      QueueUrl: QUEUE_URL,
      ReceiptHandle: msg.ReceiptHandle!,
      VisibilityTimeout: VISIBILITY_TIMEOUT,
    }));
  }, HEARTBEAT_INTERVAL);

  try {
    const job = JSON.parse(msg.Body!);
    await processJob(job); // your business logic
    await sqs.send(new DeleteMessageCommand({
      QueueUrl: QUEUE_URL,
      ReceiptHandle: msg.ReceiptHandle!,
    }));
  } finally {
    clearInterval(heartbeat);
  }
}

// Graceful shutdown
let running = true;
process.on('SIGTERM', () => { running = false; });
(async () => { while (running) await pollAndProcess(); })();

typescriptTransactional outbox pattern (Postgres + Knex)

import knex from 'knex';

const db = knex({ client: 'pg', connection: process.env.DATABASE_URL });

async function createOrderWithOutbox(order: Order): Promise<string> {
  return db.transaction(async (trx) => {
    // 1. Insert business data
    const [row] = await trx('orders').insert({
      id: order.id,
      user_id: order.userId,
      total: order.total,
      status: 'pending',
    }).returning('id');

    // 2. Insert outbox event in the SAME transaction
    await trx('outbox').insert({
      aggregate_id: row.id,
      event_type: 'order.created',
      payload: JSON.stringify({
        order_id: row.id,
        user_id: order.userId,
        total: order.total,
      }),
      published: false,
    });

    return row.id;
    // Both writes commit atomically — no dual-write risk
  });
}

// Separate poller publishes outbox events to SQS
async function pollOutbox(): Promise<void> {
  const events = await db('outbox')
    .where('published', false)
    .orderBy('created_at')
    .limit(100);

  for (const event of events) {
    await sqs.send(new SendMessageCommand({
      QueueUrl: QUEUE_URL,
      MessageBody: event.payload,
      MessageGroupId: event.aggregate_id, // for FIFO
    }));
    await db('outbox').where('id', event.id).update({ published: true });
  }
}

typescriptIdempotent consumer with dedup table

async function processIdempotent(messageId: string, body: JobPayload): Promise<void> {
  // Attempt to claim this message via unique constraint
  try {
    await db('idempotency_keys').insert({
      message_id: messageId,
      created_at: new Date(),
    });
  } catch (err: any) {
    if (err.code === '23505') { // unique_violation
      console.log(`Duplicate message ${messageId}, skipping`);
      return; // Already processed — idempotent skip
    }
    throw err; // Unexpected error — let it retry
  }

  // First time seeing this message — process it
  await processJob(body);
}

// Cleanup: delete keys older than 7 days (nightly cron)
// DELETE FROM idempotency_keys WHERE created_at < now() - interval '7 days';

typescriptQueue-depth autoscaler (ECS + CloudWatch)

import { ECSClient, UpdateServiceCommand, DescribeServicesCommand } from '@aws-sdk/client-ecs';
import { CloudWatchClient, GetMetricDataCommand } from '@aws-sdk/client-cloudwatch';

const ecs = new ECSClient({});
const cw = new CloudWatchClient({});

async function getQueueDepth(queueName: string): Promise<number> {
  const result = await cw.send(new GetMetricDataCommand({
    StartTime: new Date(Date.now() - 60_000),
    EndTime: new Date(),
    MetricDataQueries: [{
      Id: 'depth',
      MetricStat: {
        Metric: {
          Namespace: 'AWS/SQS',
          MetricName: 'ApproximateNumberOfMessagesVisible',
          Dimensions: [{ Name: 'QueueName', Value: queueName }],
        },
        Period: 60,
        Stat: 'Average',
      },
    }],
  }));
  return result.MetricDataResults?.[0]?.Values?.[0] ?? 0;
}

function desiredWorkers(depth: number): number {
  if (depth < 500) return 5;
  if (depth < 2000) return 15;
  if (depth < 5000) return 50;
  return 100;
}

async function scaleWorkers(cluster: string, service: string, desired: number): Promise<void> {
  const { services } = await ecs.send(new DescribeServicesCommand({
    cluster, services: [service],
  }));
  const current = services?.[0]?.desiredCount ?? 0;
  if (current === desired) return;

  await ecs.send(new UpdateServiceCommand({
    cluster, service, desiredCount: desired,
  }));
  console.log(`Scaled ${service}: ${current} → ${desired} workers`);
}

typescriptPriority router (fast lane / slow lane)

import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({});
const FAST_QUEUE = process.env.FAST_QUEUE_URL!;
const SLOW_QUEUE = process.env.SLOW_QUEUE_URL!;

interface Job {
  jobId: string;
  userId: string;
  tier: 'paid' | 'free';
  payload: Record<string, unknown>;
}

async function routeJob(job: Job): Promise<void> {
  const queueUrl = job.tier === 'paid' ? FAST_QUEUE : SLOW_QUEUE;
  await sqs.send(new SendMessageCommand({
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify(job),
    MessageAttributes: {
      tier: { DataType: 'String', StringValue: job.tier },
      jobId: { DataType: 'String', StringValue: job.jobId },
    },
  }));
  console.log(`Routed job ${job.jobId} to ${job.tier} lane`);
}

// In the API handler:
// const user = await getUser(req.userId);
// await routeJob({ jobId: uuid(), userId: user.id, tier: user.tier, payload: req.body });

Drills

A worker processes an SQS message successfully, writes to the DB, and then crashes before calling DeleteMessage. What happens next, and how do you prevent the duplicate?Reveal

The message reappears after the visibility timeout expires. Another worker picks it up and processes it again, causing a duplicate DB write. Prevention: use an idempotency key (the SQS message id) stored in a unique-constraint column. The second INSERT fails on the constraint, and the worker skips processing. Alternatively, use a conditional write: UPDATE ... WHERE status = 'pending' — the second attempt matches zero rows.

Your SQS queue depth is growing steadily at 1,000 messages per minute, but your dequeue rate is flat at 800/min. What are your options?Reveal

You are accumulating 200 msg/min of backlog. Options: (1) Add workers — scale up the consumer fleet so dequeue rate > 1,000/min. (2) Optimize processing — profile the worker; maybe it is spending time on network I/O that can be parallelized. (3) Engage backpressure — if you cannot add more workers (cost cap, resource limit), return 429 to producers to reduce the enqueue rate to match drain capacity. The right answer depends on whether the imbalance is temporary (burst → wait it out) or sustained (permanent capacity gap → add workers or shed load).

You set a visibility timeout of 10 seconds, but your p99 processing time is 30 seconds. What goes wrong?Reveal

Messages reappear while still being processed. A second worker picks up the same message, causing duplicate work. Both workers finish and try to delete — one succeeds, one gets an error (message already deleted). At scale, this wastes 30% of your compute on duplicate processing and can cause data corruption if the side-effects are not idempotent. Fix: set visibility timeout to at least 2× p99 (60s), or use ChangeMessageVisibility to extend dynamically.

An interviewer asks: "Why not just use a database table as a queue?" What is your answer?Reveal

A database table can be used as a queue (the outbox pattern does exactly this), but it has significant limitations compared to a purpose-built queue: (1) Polling is expensive — SELECT ... FOR UPDATE SKIP LOCKED on every poll cycle puts load on the DB. (2) No native visibility timeout — you must implement lease management manually. (3) No native DLQ — you must build retry counting and dead-lettering. (4) Throughput ceiling — a Postgres table tops out at ~5K dequeues/s before lock contention dominates; SQS handles millions. (5) No autoscaling integration. Use a DB-as-queue only when you need transactional atomicity (outbox pattern) or when adding a queue is not worth the infra cost at very low volumes (< 100 msg/min).

Your system uses SNS → SQS fan-out with 4 subscriber queues. One subscriber falls behind. Does it affect the others?Reveal

No. Each SQS queue is independent. SNS delivers a copy of the message to each subscriber queue. If the analytics queue falls behind, its depth grows, but the email queue, inventory queue, and fraud queue continue draining at their own pace. This is the primary advantage of topic-queue fan-out over a single queue with multiple consumer types. The only shared bottleneck is the SNS publish — if the producer cannot publish, all subscribers are affected.

How would you implement exponential backoff retries in SQS?Reveal

SQS does not natively support per-message delay on retry. Two approaches: (1) Re-enqueue with delay: on failure, the consumer publishes a new message to the same queue with a DelaySeconds value (max 900s / 15 min), then deletes the original. Increasing the delay on each retry (1s, 5s, 25s) gives exponential backoff. (2) Retry topic: publish failed messages to a separate "retry" queue with a fixed delay, and have a delayed-drain consumer that re-publishes to the main queue after the delay. Approach 1 is simpler for short delays; approach 2 is better for delays > 15 minutes (chain multiple retry queues).

6% complete

Current

When to reach for this

Step 1 of 16

Cheat sheet

Jump to next

Dimension

SQS Standard

SQS FIFO

Kafka

RabbitMQ

Ordering

Best-effort

Strict per group

Strict per partition

Per queue (single consumer)

Delivery

At-least-once

Exactly-once (5 min window)

At-least-once (configurable)

At-least-once / at-most-once

Throughput

Nearly unlimited

3K msg/s (batch)

Millions/s (partitioned)

50K msg/s (tuned)

Retention

4-14 days

Configurable (days-infinite)

Until consumed

Ops burden

Zero (serverless)

High (clusters, ZK/KRaft)

Medium (Erlang cluster)

Cost model

Per request

Per request (higher)

Per broker-hour + storage

Per node-hour

Best for

General async jobs

Ordered workflows

High-throughput streaming

Complex routing (exchanges)

import { SQSClient, ReceiveMessageCommand, DeleteMessageCommand, ChangeMessageVisibilityCommand } from '@aws-sdk/client-sqs'; const sqs = new SQSClient({}); const QUEUE_URL = process.env.QUEUE_URL!; const VISIBILITY_TIMEOUT = 60; // seconds const HEARTBEAT_INTERVAL = 25_000; // ms — extend before timeout expires async function pollAndProcess(): Promise<void> { const { Messages } = await sqs.send(new ReceiveMessageCommand({ QueueUrl: QUEUE_URL, MaxNumberOfMessages: 1, WaitTimeSeconds: 20, })); if (!Messages?.length) return; const msg = Messages[0]; // Start heartbeat to extend visibility while processing const heartbeat = setInterval(async () => { await sqs.send(new ChangeMessageVisibilityCommand({ QueueUrl: QUEUE_URL, ReceiptHandle: msg.ReceiptHandle!, VisibilityTimeout: VISIBILITY_TIMEOUT, })); }, HEARTBEAT_INTERVAL); try { const job = JSON.parse(msg.Body!); await processJob(job); // your business logic await sqs.send(new DeleteMessageCommand({ QueueUrl: QUEUE_URL, ReceiptHandle: msg.ReceiptHandle!, })); } finally { clearInterval(heartbeat); } } // Graceful shutdown let running = true; process.on('SIGTERM', () => { running = false; }); (async () => { while (running) await pollAndProcess(); })();

import knex from 'knex'; const db = knex({ client: 'pg', connection: process.env.DATABASE_URL }); async function createOrderWithOutbox(order: Order): Promise<string> { return db.transaction(async (trx) => { // 1. Insert business data const [row] = await trx('orders').insert({ id: order.id, user_id: order.userId, total: order.total, status: 'pending', }).returning('id'); // 2. Insert outbox event in the SAME transaction await trx('outbox').insert({ aggregate_id: row.id, event_type: 'order.created', payload: JSON.stringify({ order_id: row.id, user_id: order.userId, total: order.total, }), published: false, }); return row.id; // Both writes commit atomically — no dual-write risk }); } // Separate poller publishes outbox events to SQS async function pollOutbox(): Promise<void> { const events = await db('outbox') .where('published', false) .orderBy('created_at') .limit(100); for (const event of events) { await sqs.send(new SendMessageCommand({ QueueUrl: QUEUE_URL, MessageBody: event.payload, MessageGroupId: event.aggregate_id, // for FIFO })); await db('outbox').where('id', event.id).update({ published: true }); } }

async function processIdempotent(messageId: string, body: JobPayload): Promise<void> { // Attempt to claim this message via unique constraint try { await db('idempotency_keys').insert({ message_id: messageId, created_at: new Date(), }); } catch (err: any) { if (err.code === '23505') { // unique_violation console.log(`Duplicate message ${messageId}, skipping`); return; // Already processed — idempotent skip } throw err; // Unexpected error — let it retry } // First time seeing this message — process it await processJob(body); } // Cleanup: delete keys older than 7 days (nightly cron) // DELETE FROM idempotency_keys WHERE created_at < now() - interval '7 days';

import { ECSClient, UpdateServiceCommand, DescribeServicesCommand } from '@aws-sdk/client-ecs'; import { CloudWatchClient, GetMetricDataCommand } from '@aws-sdk/client-cloudwatch'; const ecs = new ECSClient({}); const cw = new CloudWatchClient({}); async function getQueueDepth(queueName: string): Promise<number> { const result = await cw.send(new GetMetricDataCommand({ StartTime: new Date(Date.now() - 60_000), EndTime: new Date(), MetricDataQueries: [{ Id: 'depth', MetricStat: { Metric: { Namespace: 'AWS/SQS', MetricName: 'ApproximateNumberOfMessagesVisible', Dimensions: [{ Name: 'QueueName', Value: queueName }], }, Period: 60, Stat: 'Average', }, }], })); return result.MetricDataResults?.[0]?.Values?.[0] ?? 0; } function desiredWorkers(depth: number): number { if (depth < 500) return 5; if (depth < 2000) return 15; if (depth < 5000) return 50; return 100; } async function scaleWorkers(cluster: string, service: string, desired: number): Promise<void> { const { services } = await ecs.send(new DescribeServicesCommand({ cluster, services: [service], })); const current = services?.[0]?.desiredCount ?? 0; if (current === desired) return; await ecs.send(new UpdateServiceCommand({ cluster, service, desiredCount: desired, })); console.log(`Scaled ${service}: ${current} → ${desired} workers`); }

import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs'; const sqs = new SQSClient({}); const FAST_QUEUE = process.env.FAST_QUEUE_URL!; const SLOW_QUEUE = process.env.SLOW_QUEUE_URL!; interface Job { jobId: string; userId: string; tier: 'paid' | 'free'; payload: Record<string, unknown>; } async function routeJob(job: Job): Promise<void> { const queueUrl = job.tier === 'paid' ? FAST_QUEUE : SLOW_QUEUE; await sqs.send(new SendMessageCommand({ QueueUrl: queueUrl, MessageBody: JSON.stringify(job), MessageAttributes: { tier: { DataType: 'String', StringValue: job.tier }, jobId: { DataType: 'String', StringValue: job.jobId }, }, })); console.log(`Routed job ${job.jobId} to ${job.tier} lane`); } // In the API handler: // const user = await getUser(req.userId); // await routeJob({ jobId: uuid(), userId: user.id, tier: user.tier, payload: req.body });