Async messaging & queues
Decoupling, delivery semantics, ordering, DLQs, Kafka vs SQS vs RMQ.
Queues decouple producers from consumers — but the delivery semantics come with sharp edges. Exactly-once is a lie; at-least-once + idempotent consumers is the truth.
Read this if your last attempt…
- You said "we'll put a queue in front of it" without naming the semantics
- You assumed exactly-once delivery was a real thing
- You haven't thought about what happens to messages that fail N times in a row
- You can't name the difference between a queue and a log
The concept
Async messaging is the mechanism that lets you say "the response doesn't have to wait for this". The producer commits the work to a broker and returns to the client; the consumer runs at its own pace.
The three things to name when you propose a queue
- Delivery semantics — at-most-once, at-least-once, or "effectively exactly-once". At-least-once is the pragmatic default; it requires idempotent consumers. Exactly-once within a closed system (Kafka transactions, FoundationDB) exists but is expensive and doesn't cross system boundaries.
- Ordering guarantees — per-partition / per-key ordering vs global ordering. Global ordering doesn't scale; key-based partitioning preserves order for a user/entity while still scaling horizontally.
- Failure handling — visibility timeouts, max-receives, and a Dead Letter Queue (DLQ) for messages that fail N times. Without a DLQ, a poison message blocks the partition forever.
The queue decouples lifetimes: producer commits the work, consumer processes on its own schedule. DLQ catches what can't be processed after N attempts.
Queue vs log — pick the right shape for the workload.
| Queue (SQS, RabbitMQ) | Log (Kafka, Kinesis) | |
|---|---|---|
| Consumption model | One consumer per message | Many independent consumer groups |
| Message lifetime | Deleted on ack | Retained for N days regardless |
| Replay | Not possible — it's gone | Trivial — rewind offset |
| Ordering | Per-queue or none | Per-partition (key-based) |
| Best for | Work distribution, jobs | Event streams, CDC, analytics fan-out |
| Scaling limit | Throughput per queue | Partitions × per-partition throughput |
How interviewers grade this
- You name delivery semantics explicitly: "at-least-once with idempotent consumers".
- You state the ordering guarantee and what partitions it (per-user, per-account, per-aggregate).
- You have a DLQ in the picture with a max-receive count.
- You distinguish queue (work distribution, one consumer per message) from log (event fan-out, multiple independent readers).
- You size the queue: expected depth at peak, retention window, message size, resulting GB.
Variants
Work queue (competing consumers)
Producers enqueue jobs; a pool of workers competes to consume.
Queue (top): one worker per message, message deleted on ack. Log (bottom): every group reads the full stream at its own offset, retained for days.
The canonical async job pattern. Each job is processed exactly once by one worker (assuming ack-on-success). Throughput scales with worker count. Ordering is not preserved — accept that or use a keyed queue.
Pros
- +Simple operational model
- +Linear horizontal scaling
- +Back-pressure is visible (queue depth)
Cons
- −No replay
- −No fan-out — one consumer per job
- −Poison messages need a DLQ
Choose this variant when
- Background jobs
- Email sending, image processing
- Loose-coupling between services
Event log (Kafka-style)
Producers publish; multiple independent consumer groups read the same stream.
Messages with the same key hash to the same partition; consumers see per-key ordering. Different keys process in parallel. Global ordering is sacrificed for scale.
The right shape for event-driven architectures. Each event is durably written to a partition and retained for days. Different consumers (search index, analytics, notifications) each maintain their own offset and replay from any point.
Pros
- +Replay + time-travel debugging
- +Fan-out is free — add a new consumer group
- +Per-partition ordering
Cons
- −Heavier to operate (broker cluster, ZK/KRaft)
- −Per-partition ordering doesn't give global order
- −Rebalancing pauses consumers briefly
Choose this variant when
- Multiple consumers of the same events
- CDC fan-out
- Analytics/audit pipeline alongside ops
Pub/sub (SNS, Google Pub/Sub)
Fire-and-forget broadcast; subscribers get a copy each.
At-most-once drops; at-least-once duplicates; "effectively exactly-once" requires transactions + idempotency. Exactly-once across system boundaries is a lie.
Lightweight fan-out without the retention and operational cost of a log. No replay, but each subscriber (often another queue) gets its own copy. Good for "notify many things when X happens".
Pros
- +Dead simple fan-out
- +Managed-service friendly
- +Decouples publisher count from subscriber count
Cons
- −No replay
- −Best-effort ordering
- −Per-message cost adds up at high fan-out
Choose this variant when
- Notifications, webhooks
- Low-retention fan-out
- Cross-service "something happened" events
Saga (distributed transaction via compensations)
Chain local transactions across services; compensate in reverse on failure.
A distributed transaction implemented as a sequence of local txns. If any step fails, the preceding steps run their compensations in reverse order.
When one user action spans multiple services that each own their own data (order service, payment service, inventory service, shipping service), you cannot use a single ACID transaction. The saga pattern replaces the distributed lock with a sequence of local transactions plus compensations that undo prior steps if a later step fails.
Two orchestration styles:
- Choreography — each service listens for events and publishes the next event. No central coordinator. Simpler to deploy but harder to reason about the full flow ("which services participate?").
- Orchestration — a dedicated orchestrator (state machine, often in something like Temporal / Step Functions / AWS SWF) calls each service in order and triggers compensations on failure. Easier to observe; one place owns the flow.
Key design rules:
- 1Every forward step has a compensation. "Cancel shipment" undoes "schedule shipment". "Refund" undoes "charge".
- 2Compensations must be idempotent — they may be retried.
- 3Not every step can be perfectly compensated (you can refund money but you can't un-send an email). Order steps so the hardest-to-undo step runs last.
- 4Saga state must be durable — if the orchestrator dies mid-flow, it resumes. Temporal, Step Functions, and DB-backed state machines give you this.
What the interviewer looks for: you distinguish saga from 2PC, you name choreography vs orchestration with a reason, you order steps by reversibility, and you name the durable state store.
Worked example
Scenario: User uploads a video. The system has to (a) transcode, (b) extract thumbnails, (c) run content moderation, (d) notify followers.
Bad shape: upload → sync fan-out to all four services. p95 is now the slowest service — fails if any one is slow or down.
Good shape:
- 1Upload endpoint writes the raw file to object storage and inserts a row in
videoswith status=uploaded. - 2Uploader publishes one event
video.uploadedto a log (Kafka topic, partitioned by user_id). - 3Four consumer groups read this topic independently:
- transcoder — reads, transcodes, updates DB, publishes video.transcoded. - thumbnailer — parallel with transcoding. - moderator — parallel; if flagged, publishes video.flagged and sets status. - notifier — listens to video.transcoded + video.cleared (composite gate) and fans out to followers.
Delivery: at-least-once; every consumer is idempotent (keyed by video_id + stage). DLQ per consumer group, max-receives = 5. If a message ends up in DLQ, an alert fires and a human investigates.
Ordering: per-partition (user_id). Two uploads from the same user are processed in order; two users in parallel. Global ordering is not needed here.
Retention: 7 days on the topic — enough to replay a consumer group after a bug fix without losing data.
Good vs bad answer
Interviewer probe
“You want to send emails asynchronously. What does that look like?”
Weak answer
"We put a queue in front of the email service. The producer fires a message, the consumer sends the email. Done."
Strong answer
"Work queue (SQS). Producer inserts a row in an outbox table in the same transaction as the business write, then publishes the message (outbox pattern — guarantees the event is published iff the business change committed). Consumer is idempotent, keyed on (user_id, template_id, send_dedupe_key) — retries won't send duplicates. Visibility timeout: 30s. Max receives: 3. DLQ gets the poison messages. At 1k/s expected, one queue with ~10 workers is plenty. If the provider is down, the queue absorbs up to N minutes of messages; we alert when depth > 10k."
Why it wins: The strong answer names the outbox pattern (how events correctly bridge DB and queue), idempotency key design, operational knobs (visibility, retries, DLQ), sizing, and the alert threshold. The weak answer omits all of it.
When it comes up
- When a slow or unreliable downstream would drag synchronous p99 beyond the budget
- When multiple consumers need the same event (search index, notifications, analytics)
- When the interviewer asks "what if this service is down?" about a dependency
- Any email, notification, video processing, or background-job flow
- When CDC, event-driven architecture, or saga comes up in deep-dive
Order of reveal
- 1Name queue vs log upfront. "Queue for work distribution — one consumer per message. Log for event fan-out and replay — multiple independent consumer groups. Different shapes for different problems."
- 2State delivery semantics explicitly. "At-least-once delivery with idempotent consumers. Exactly-once across network boundaries is a myth; at-least-once + idempotency is the production pattern."
- 3Name the partition key for ordering. "Partitioned by user_id — per-user ordering preserved, global ordering sacrificed. Two users can process in parallel; two events for the same user are ordered."
- 4Bridge DB and queue with outbox. "Outbox pattern: insert event into an 'outbox' table in the same DB transaction as the business write, then a relay publishes from the outbox. No dual-write race."
- 5Specify operational knobs. "Visibility timeout 30s, max-receives 3, then DLQ. Alert when DLQ depth > 0 or queue depth > 10K."
- 6Size the queue. "Peak depth × message size × retention = storage. At 1K/sec with 2 KB messages and 7-day retention, that's ~1.2 TB — Kafka SSD territory."
Signature phrases
- “Exactly-once is a lie; at-least-once + idempotent is the truth” — The single most important semantics fact; candidates who claim exactly-once get flagged.
- “Queue for work, log for events” — Distinguishes the two shapes in one phrase.
- “Per-partition ordering, not global” — Shows you know how scale and ordering trade.
- “Outbox or CDC — never dual-write” — The correct pattern for bridging DB and queue.
- “Every consumer has a DLQ and a page on DLQ depth” — Operational maturity in one line.
- “Async is not free — consumer lag is on the dashboard” — Pushes back on the naive "async means infinite scale" view.
Likely follow-ups
?“Explain the outbox pattern and why it beats a naive transaction.publish() pattern.”Reveal
The naive dual-write:
BEGIN;
INSERT INTO orders ...; -- DB commit
COMMIT;
kafka.publish(event); -- Network publishThree failure modes:
- 1DB commits, network to Kafka fails → event lost, business state drifts.
- 2Kafka publish succeeds, DB commit fails → phantom event, consumers act on a non-existent order.
- 3Process crashes between the two → either of the above, silently.
The outbox fix:
BEGIN;
INSERT INTO orders ...;
INSERT INTO outbox (event_type, payload, aggregate_id) VALUES (...);
COMMIT;A separate relay process reads from outbox (poll or CDC), publishes to Kafka, marks rows as sent.
Why this works:
- Atomicity with business change — either both the order AND the outbox row commit, or neither.
- At-least-once publish — if the relay crashes mid-publish, it retries. Consumers must be idempotent (keyed by event_id).
- No distributed transaction — the relay handles async publish; the DB transaction stays local.
Implementation options:
- 1Polling relay — SELECT FROM outbox WHERE sent=false. Simple but ~100-500ms lag.
- 2Debezium / CDC — tail the DB change log directly to Kafka. Near-realtime (~10 ms lag), but adds operational complexity.
- 3Transactional outbox service — dedicated service per aggregate. Middle ground.
Tradeoff: one extra table, a relay process, ~seconds of publish latency. In exchange you get atomicity between DB and queue that no dual-write scheme provides.
?“A message fails repeatedly and lands in the DLQ. Walk me through the runbook.”Reveal
Step 1 — alert fires on DLQ depth > 0. Page the on-call, not a silent dashboard.
Step 2 — inspect. What kind of failure?
- Transient downstream outage (DB was slow, API was rate-limited) → replay after the outage clears.
- Poison data (malformed payload, missing required field) → consumer or producer bug.
- Schema drift (producer emitted v2 event, consumer only knows v1) → deployment ordering bug.
- Business-rule rejection (event references a deleted aggregate) → business logic needs a branch.
Step 3 — fix and replay. Most brokers support moving messages from DLQ back to the main queue (SQS has ReceiveMessage+DeleteMessage, Kafka needs tooling). After the fix ships, replay.
Step 4 — if it was a consumer bug, consider backfill. Messages may have silently succeeded with the wrong result before hitting DLQ. Check your idempotency logs for those IDs; re-emit if needed.
Step 5 — tune. Was max-receives too low (caused false DLQ during transient blip)? Too high (poison messages thrashed for too long before surfacing)? 3-5 is typical; tune per-consumer.
Step 6 — prevent. Add a schema registry (if drift), add a validation step in the producer (if poison data), add a circuit breaker around the downstream (if outage cascades).
Metrics to have: DLQ depth (alert), DLQ enqueue rate (trend), per-consumer failure rate (by error class). Without the last one you can't tell transient from poison.
?“Your partition key is user_id. A celebrity user generates 10× the message rate. What happens and how do you fix it?”Reveal
What happens: that one partition becomes hot. Its consumer is saturated while others idle. Consumer lag on that partition grows — the celebrity's events are seconds or minutes behind while others are real-time. Eventually you hit partition retention limits and lose data, or the consumer OOMs from lag.
Fixes, in order of pain:
- 1Scale the consumer for the hot partition. Beef up that specific consumer (bigger instance). Buys time but doesn't fix the structural issue. One partition still has one consumer in Kafka.
- 1Sub-partition the hot key. If per-user ordering isn't required for all event types, emit with key
user_id + event_id_hash % 10to spread across 10 partitions. Fast fix, loses ordering for that user.
- 1Two-level partitioning. Hash user_id into a wider space first. Celebrity users map to multiple partitions by construction. Ordering is preserved per (user, sub-partition).
- 1Dedicated consumer group for hot users. Detect hot users, route their events to a separate topic with more partitions and more consumers.
- 1Fix the producer. Often the hot key is a bug — the producer is retrying, amplifying, or fan-out-on-write-ing unnecessarily. Audit before scaling the consumer.
The ordering tradeoff to name explicitly: per-user ordering matters for some event types (like "user blocked another user, then sent message" — order matters). For others (like "user viewed post X") ordering doesn't. Classify events and sub-partition only the ones that tolerate reordering.
?“Kafka or SQS for this? How do you choose?”Reveal
Rule of thumb: SQS when you want a queue; Kafka when you want a log.
Pick SQS when:
- One consumer group; work distribution pattern (jobs, emails, image processing).
- Deleting on ack is the natural lifecycle — once done, gone.
- You don't need replay.
- You want a managed service with near-zero operational overhead.
- Throughput is under ~30K msg/sec (SQS standard) or ~3K msg/sec with FIFO.
- Message lifetime is minutes to hours, not days.
Pick Kafka when:
- Multiple independent consumers of the same events (CDC fan-out, analytics alongside ops, search indexing).
- You need replay (bug fix in a consumer, backfill a new consumer).
- Throughput is >50K msg/sec.
- You want per-partition ordering at scale.
- Retention is measured in days or weeks — the stream is the source of truth.
- You're willing to operate brokers, ZK/KRaft, schema registry.
Common mistake: using Kafka for a simple work queue. You pay 10× the operational cost for replay and fan-out you don't need. SQS + Lambda is often the right answer.
The other common mistake: using SQS where Kafka's multi-consumer replay actually matters. You end up building "replay from S3" infrastructure that Kafka gives you for free.
Newer middle ground: Google Pub/Sub, AWS Kinesis Data Streams. Managed log-shaped services without full Kafka operational weight.
?“How do you guarantee end-to-end ordering when a message flows through 3 services?”Reveal
You usually don't — and you shouldn't try to. End-to-end ordering across services means a global distributed ordering, which collapses to single-partition throughput.
Better framing: where does ordering actually matter?
- Between two events for the same aggregate (same user, same order, same account) — yes, preserve this.
- Between events across aggregates — no, these are independent.
How to preserve per-aggregate ordering across services:
- 1Partition every topic by the same key. If service A emits
order.createdpartitioned by order_id and service B emitsorder.updatedpartitioned by order_id, then a single consumer that groups by order_id sees them in order (assuming at-least-once + idempotency).
- 1Propagate the partition key through the pipeline. Service A consumes from topic X (keyed by user_id), processes, emits to topic Y — also keyed by user_id. The downstream sees per-user ordering maintained.
- 1Use a sequence number per aggregate. If strict causal ordering is required, include a monotonic sequence per aggregate. Consumers buffer and wait for gaps or detect out-of-order delivery. Expensive; reserve for high-stakes flows (financial ledger, audit).
The trap: "serializer" service. Some teams funnel all events through a single ordering service to guarantee ordering. This destroys scale and creates a SPOF. Unless ordering crosses aggregate boundaries (which is usually a design smell), don't do this.
The right interview answer: "Per-aggregate ordering, not global. Every service partitions by the aggregate id. Global ordering is a different problem and usually indicates the boundary is wrong."
Code examples
// At-least-once delivery guarantees you WILL see duplicates.
// The consumer must make processing idempotent by deduping on a
// stable key derived from the event, not just the delivery id.
async function handleOrderEvent(msg: Message) {
const event = JSON.parse(msg.body) as OrderEvent;
const idempotencyKey = `${event.order_id}:${event.type}`;
const inserted = await db.execute(
`INSERT INTO processed_events (key, received_at)
VALUES ($1, now())
ON CONFLICT (key) DO NOTHING
RETURNING key`,
[idempotencyKey],
);
if (inserted.rows.length === 0) {
// Duplicate — already processed. Ack and move on.
await msg.ack();
return;
}
try {
await applyBusinessLogic(event); // safe to run exactly once now
await msg.ack();
} catch (err) {
// Rollback dedup row so a retry can process it.
await db.execute('DELETE FROM processed_events WHERE key = $1', [idempotencyKey]);
throw err; // broker will redeliver
}
}// Problem: publishing to Kafka after the business commit can fail
// (process crashes in between), leading to lost events. Publishing
// BEFORE the commit can lead to ghost events (committed to Kafka,
// rolled back in DB). Neither is acceptable.
//
// Solution: write the event to an `outbox` table in the SAME
// transaction as the business write. A separate relay polls the
// outbox and publishes. At-least-once by construction.
async function createOrder(cmd: CreateOrderCmd) {
await db.transaction(async (tx) => {
const order = await tx.insert('orders', { ...cmd, status: 'pending' });
await tx.insert('outbox', {
aggregate_id: order.id,
topic: 'orders.created',
payload: JSON.stringify(order),
created_at: new Date(),
});
});
// commit -> both order and outbox row are durable
}
// Relay (separate process)
setInterval(async () => {
const batch = await db.query(
'SELECT * FROM outbox WHERE published_at IS NULL ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED');
for (const row of batch) {
await kafka.send({ topic: row.topic, key: row.aggregate_id, value: row.payload });
await db.query('UPDATE outbox SET published_at = now() WHERE id = $1', [row.id]);
}
}, 100);const MAX_RECEIVES = 5;
async function processWithBackoff(msg: Message) {
const receives = msg.attributes.ApproximateReceiveCount ?? 1;
try {
await processMessage(msg);
await msg.ack();
} catch (err) {
if (receives >= MAX_RECEIVES) {
// Poison pill — stop retrying, send to DLQ for human triage.
await dlq.send({ body: msg.body, error: String(err), receives });
await msg.ack();
return;
}
// Requeue with exponential visibility timeout: 1s, 2s, 4s, 8s, 16s.
const visibilityMs = 1000 * Math.pow(2, receives - 1);
await msg.changeVisibilityTimeout(visibilityMs);
throw err; // broker will redeliver after the timeout
}
}Common mistakes
Exactly-once across a network is impossible in the general case. Even "Kafka transactions" are exactly-once within Kafka, not end-to-end. Design for at-least-once + idempotent consumers. The candidate who asserts exactly-once is flagged.
A single poison message can block a partition or a consumer forever. Every consumer must have a DLQ and a max-receive count. Messages in the DLQ should page a human, not silently accumulate.
Writing to the DB and publishing to a queue in two separate operations creates a race: the DB succeeds, the publish fails, the queue never gets the event. Fix with the outbox pattern (publish via CDC from an outbox table) or a transactional queue (a message-broker-in-your-DB).
At-least-once means duplicates happen. Consumers must dedupe using a key — either the business identity (user_id + action_id) or a producer-provided correlation id. Without this, retries cause double-sends, double-charges, double-notifications.
Global ordering across all messages collapses to a single-partition bottleneck. Pick an order-preserving partition key (user_id, account_id, aggregate_id) and accept that messages across keys are not ordered with respect to each other.
Practice drills
Explain the outbox pattern in one paragraph.Reveal
In the same DB transaction as the business write, insert a row into an outbox table describing the event. A separate process (CDC tail, or a polling worker) reads the outbox and publishes to the broker, marking rows as sent. This makes the DB write and the event publish atomic from the outside world's point of view: you never publish without committing, and you never commit without eventually publishing. The cost is seconds of publish latency and one extra table — cheap compared to the correctness it buys.
Interviewer: "a message ends up in the DLQ. What's your runbook?"Reveal
Step 1: inspect — is it a poison-data bug (bad payload) or a transient downstream failure that outlasted the retry window? Step 2: if transient, replay the message back onto the main queue (most brokers support this directly). Step 3: if poison, fix the consumer or sanitise the payload, then replay. Step 4: alert tuning — DLQ depth > 0 should page; DLQ growth rate > X should escalate.
Your partition key is user_id. A celebrity user triggers 10× the rate of everyone else. What happens and how do you fix it?Reveal
One partition becomes hot: its consumer is pegged while others idle. Lag on that partition grows; the celebrity's messages are seconds behind. Fixes: (1) sub-partition within the hot key (user_id + sequence%N, if order-within-user is not needed); (2) use two-level keying (user_id hashed into a wider space); (3) dedicated consumer for the hot partition with more resources. The "right" fix depends on whether ordering within that user matters.
Cheat sheet
- •Queue = one consumer per message (work). Log = many independent consumers + replay (events).
- •Default semantics: at-least-once + idempotent consumer. Never "exactly-once".
- •Always: visibility timeout + max-receives + DLQ + alert on DLQ depth.
- •Ordering: per-partition (key-based), not global. Name the partition key.
- •DB + queue together: outbox pattern. Never dual-write.
- •Size the queue: peak depth × message size × retention = storage.
- •Latency budget: async ≠ free; p99 consumer lag should be on your dashboard.
Practice this skill
These problems exercise Async messaging & queues. Try one now to apply what you just learned.