Long-running tasks
When to reach for this
Reach for this when…
- Processing takes 10 seconds to hours — far longer than HTTP timeout budgets
- Users submit work and come back later for results (report, export, transcode)
- The prompt mentions "progress bar", "job status", or "background processing"
- Multi-step pipelines where individual stages can fail and must be retried independently
- Work that is expensive enough to warrant cancellation and resumability
Not really this pattern when…
- Processing finishes in < 1 second — use synchronous request-response
- The focus is on decoupling throughput (producer-consumer queue pattern instead)
- The client never checks results — fire-and-forget notification pattern
- Real-time streaming output (use WebSocket streaming, not job polling)
Cheat sheet
- •Return 202 Accepted + Location header + job_id for any task > gateway timeout
- •Job FSM: PENDING → QUEUED → RUNNING → SUCCEEDED | FAILED | CANCELLED — enforce with CAS
- •Heartbeat every 30s; zombie reaper at 90s staleness; do not rely on wall-clock timeout alone
- •Progress to Redis (fast path), not the jobs DB — separate read-hot from write-hot
- •Checkpoint every 1,000 items or 60s, whichever first; include schema_version in checkpoint
- •Poll + Retry-After as baseline; add SSE for dashboards, webhooks for B2B
- •Idempotency key on submission: hash(user_id, params) — return existing job_id on duplicate
- •CAS on QUEUED → RUNNING transition: UPDATE ... WHERE status='queued' — prevents double execution
- •Result to S3 with presigned URL (7-day TTL); transition to EXPIRED after TTL
- •Retry budget: 3 retries with exponential backoff; non-retryable errors skip straight to FAILED
- •Cancellation is cooperative: set flag in Redis, worker polls flag at checkpoints
- •Three dashboard panels: submission rate, active jobs by status, p50/p95/p99 duration
Core concept
Long-running tasks are the complement of synchronous request-response. When work takes longer than an HTTP timeout — a PDF report over a million rows, a 4K video transcode, an ML training run — the caller cannot hold an open connection and wait. The system must accept the intent, acknowledge it immediately, and let the caller check back later. This is the 202 contract, and it is the single most important design primitive for any task that outlasts a gateway timeout.
Client submits work, gets 202 with a job_id, and polls for status. The API enqueues the job; a worker processes it and writes results to a result store. The client polls until status is done.
The six pieces
1. Submission endpoint. A POST that validates input, mints a unique job_id (UUIDv7 for time-ordering), persists a job record with status=pending, enqueues the job_id to a work queue, and returns 202 Accepted with a Location header pointing to GET /jobs/{job_id}. The response body carries the job_id so the client can poll immediately. The endpoint must be idempotent — if the client retries the POST with the same idempotency key, return the existing job_id rather than creating a duplicate.
2. Job state machine. Every job follows a deterministic state machine: PENDING → QUEUED → RUNNING → SUCCEEDED | FAILED | CANCELLED. Transitions are one-way except for retry: FAILED can transition back to QUEUED if retry budget remains. Store the current state, last_transition_at, and a transition log (an append-only array of {state, timestamp, reason}). The transition log is your debugging lifeline — it tells you exactly when a job stalled and why.
3. Worker fleet. Workers dequeue job_ids, load the job record, set status=RUNNING, and do the actual work. The worker must send heartbeats (update a last_heartbeat_at timestamp every 30 seconds) so the system can detect zombies. If a worker crashes, its job eventually times out and transitions back to QUEUED for another worker to pick up.
4. Progress reporting. For multi-minute jobs, "it is still running" is not enough. Workers write a progress struct — { percentage, current_stage, items_processed, estimated_eta } — to a fast store (Redis key with 5-minute TTL). The status endpoint reads this struct on every poll. Without progress, users assume the job is hung and resubmit, creating duplicate work.
5. Status channel. How the client learns the job is done. Four options, each with trade-offs: (a) Polling — simplest, works everywhere, but wastes requests when the job is slow; (b) SSE — server pushes progress frames, efficient but requires sticky connections; (c) WebSocket — bidirectional, useful if the client also sends cancellation; (d) Webhook — server-to-server push, no client polling at all, ideal for B2B APIs. Most systems start with polling and add SSE for the dashboard.
6. Result storage. When the job completes, the worker writes the output (a PDF, a CSV, a video file) to blob storage (S3) and stores a presigned URL in the job record. Set a TTL on the result — 7 days is typical. After TTL, the result is garbage-collected and the job record transitions to EXPIRED. Clients that poll after TTL get a 410 Gone.
Checkpointing and resumability
Long-running jobs that process data in stages should checkpoint intermediate state to durable storage (S3, DynamoDB) after each stage. If the worker crashes mid-stage, a new worker reads the last checkpoint and resumes. Without checkpoints, a 2-hour job that crashes at 1h59m restarts from zero — wasted compute and doubled latency. The checkpoint cadence is a trade-off: too frequent and you pay I/O overhead; too infrequent and you repeat work on crash. A good default is every 1,000 items or every 60 seconds, whichever comes first.
Cross-reference
The queue mechanics (enqueue, dequeue, visibility timeout, DLQ) are covered in depth in the producer-consumer queue pattern. This pattern focuses on what sits above the queue: the 202 contract, the job state machine, status channels, progress reporting, checkpointing, and timeout/cancellation — the pieces that make a long-running task observable and manageable rather than a fire-and-forget black hole.
Canonical examples
- →PDF / CSV report generation from large datasets
- →Video transcoding pipeline (YouTube, Vimeo)
- →ML model training and batch inference
- →Shopify bulk product import / export
- →Large database migration or ETL jobs
Variants
Poll-based status tracking
Client polls GET /jobs/:id on an interval until status is terminal.
The simplest and most reliable status channel. After receiving the 202, the client sets up a polling loop: call GET /jobs/:id every N seconds, check the status field, and stop when it is SUCCEEDED, FAILED, or CANCELLED.
API returns 202 immediately. Client polls GET /jobs/:id on an interval. Worker processes the job and updates status in the DB.
Polling interval strategy. Use exponential backoff with a cap: start at 1s, double each poll, cap at 10s. For jobs with predictable durations (e.g., "reports take 30-90s"), start polling after an initial delay equal to the median duration and then poll every 2s. This avoids wasting 30 requests polling a job that will not be done for a minute.
Response shape. The poll response should include: status (the FSM state), progress (percentage, 0-100), current_stage (human-readable), estimated_eta (ISO timestamp or null), result_url (presigned S3 URL, populated only when status=SUCCEEDED), and error (populated only when status=FAILED). Clients render the progress bar from the percentage and stage fields.
Retry-After header. The server should return a Retry-After header with the recommended poll interval. Well-behaved clients respect it. This gives the server control over poll frequency — during high load, increase Retry-After to 30s to reduce poll traffic.
Idempotency of the status endpoint. GET /jobs/:id is naturally idempotent and cacheable. Set Cache-Control: no-store to prevent intermediate proxies from serving stale status. For high-traffic dashboards, add a short TTL (1s) via CDN to absorb thundering-herd polls when many users watch the same job.
Limitations. Polling wastes bandwidth when jobs are slow — 10,000 users polling every 2s generates 5,000 req/s of pure overhead. For dashboard-heavy use cases with many concurrent viewers, push-based channels (SSE, WebSocket) are more efficient. But polling is universally compatible, needs no special infrastructure, and works through every proxy and firewall.
Pros
- +Works through every proxy, CDN, and corporate firewall — no special protocol
- +Trivial client implementation — a setTimeout loop with fetch
- +Stateless server — no connection tracking, no pub/sub infrastructure
- +Easy to add caching (CDN, API gateway) to absorb poll storms
Cons
- −Wastes bandwidth and server resources on empty polls (job still running)
- −Latency between completion and client awareness equals the poll interval
- −Thundering herd when many clients poll the same job at the same frequency
- −Exponential backoff adds seconds of notification delay for fast-completing jobs
Choose this variant when
- You are building an MVP and want the simplest possible status channel
- Clients are mobile apps or CLI tools that cannot hold long-lived connections
- Jobs take minutes to hours and a few seconds of notification delay is acceptable
- You want zero additional infrastructure beyond the API and database
Push-based (SSE)
Server pushes progress frames to the client over a long-lived HTTP connection.
Server-Sent Events (SSE) flip the polling model: instead of the client asking "are you done yet?" every few seconds, the server pushes progress events as they happen. The client opens an EventSource connection to GET /jobs/:id/stream and receives a series of text/event-stream frames.
After submitting the job, the client opens an SSE stream. The worker publishes progress events to Redis Pub/Sub; the API relays them as SSE frames. No polling overhead.
How it works. The worker writes progress updates to a Redis Pub/Sub channel keyed by job_id. The API server subscribes to that channel when the client opens the SSE connection. Each Redis message becomes an SSE frame: data: {"pct":45,"stage":"Rendering page 450/1000","eta":"2026-04-27T14:05:00Z"}. When the job completes, the server sends a final frame with status=SUCCEEDED and the result_url, then closes the connection.
Connection management. SSE uses a standard HTTP connection, so it works through most proxies and load balancers. However, you must configure: (a) disable response buffering in Nginx (proxy_buffering off; X-Accel-Buffering: no) so frames are not batched; (b) set a long read timeout (proxy_read_timeout 3600s) so the proxy does not kill the connection during idle periods; (c) handle reconnection — EventSource auto-reconnects with the Last-Event-ID header, and your server must resume from that point.
Scaling SSE. Each SSE connection holds a server thread (or async task). At 10,000 concurrent viewers, that is 10,000 open connections. Use an async framework (Node.js, Go, or Python asyncio) to handle this without thread-per-connection overhead. The Redis Pub/Sub subscription is per-server, not per-client — one subscription fans out to all local SSE clients watching that job.
Fallback to polling. Not all environments support SSE (some corporate proxies kill long-lived connections). Implement the polling endpoint as a fallback. The client tries SSE first; if the connection drops three times within 30 seconds, it falls back to polling.
When to use WebSocket instead. SSE is unidirectional (server → client). If the client needs to send cancellation or pause commands on the same channel, use a WebSocket. For most "watch my job" use cases, SSE is sufficient and simpler to implement.
Pros
- +Near-instant notification — sub-second latency from worker progress to client
- +No wasted requests — server only sends when there is new information
- +Standard HTTP — works through most proxies with minimal configuration
- +Auto-reconnection built into the EventSource API
Cons
- −Requires async server infrastructure to hold thousands of open connections
- −Some corporate proxies and older browsers do not support SSE
- −Connection affinity needed — the SSE server must subscribe to the right Redis channel
- −More infrastructure than polling: Redis Pub/Sub + async server + proxy config
Choose this variant when
- You have a web dashboard where users watch job progress in real time
- Jobs produce frequent progress updates (every 1-5 seconds)
- You want to eliminate polling overhead for high-concurrency dashboards
- Your infrastructure supports long-lived HTTP connections (no aggressive proxy timeouts)
Webhook callback
Server POSTs the result to the client's callback URL when the job completes.
Webhooks invert the relationship entirely: instead of the client checking on the job, the server pushes the result to a URL the client provides at submission time. This is the standard pattern for B2B APIs where the "client" is another server, not a browser.
Flow. The client includes a callback_url in the POST /jobs request. The server stores it with the job record. When the worker completes the job, it POSTs the result payload to the callback_url with an HMAC signature in the X-Signature-256 header. The client verifies the signature against a shared secret to authenticate the webhook.
Reliability. Webhooks fail when the client's server is down. You need a retry policy: 3-5 attempts with exponential backoff (1 min, 10 min, 1 hour). After exhausting retries, mark the webhook as failed and log it. The client must also be able to manually retrieve the result via GET /jobs/:id as a fallback — webhooks augment polling, they do not replace it.
Idempotency. Network retries and server-side retries can deliver the same webhook twice. Include an idempotency key (the job_id or a delivery_id) in the payload so the client can deduplicate. The client should respond with 200 quickly (within 5 seconds) and process the payload asynchronously — do not let webhook delivery block on the client's own processing.
Security. Always use HTTPS for the callback URL. Sign the payload with HMAC-SHA256 using a per-client secret. Reject callback URLs that point to internal IPs (SSRF protection). Rate-limit outbound webhook requests per client to prevent abuse. Validate the URL scheme and host at submission time, not at delivery time.
Comparison to SSE. Webhooks are server-to-server; SSE is server-to-browser. Webhooks do not require long-lived connections but need retry infrastructure. SSE gives real-time progress but only works for interactive clients. Most production systems support both: SSE for the dashboard, webhooks for API integrations.
Pros
- +No polling, no long-lived connections — fire-and-forget delivery
- +Ideal for server-to-server integrations (B2B, Zapier, internal microservices)
- +Client does not need to be online during processing — just when the webhook fires
- +Decouples client availability from job processing
Cons
- −Client must expose a publicly reachable HTTPS endpoint
- −Retry infrastructure needed on the server side (exponential backoff, DLQ)
- −Security surface: SSRF risk, signature verification, TLS enforcement
- −No progress updates — only final result (unless you send progress webhooks too)
Choose this variant when
- Your API consumers are other servers, not browser clients
- The industry standard for your domain uses webhooks (payments, e-commerce, CI/CD)
- Clients process results asynchronously and do not need real-time progress
- You want to minimize client-side polling complexity
Pipeline with checkpointing
Break the job into stages, checkpoint after each, and resume on failure.
When a job has distinct processing phases — extract, transform, load; or download, transcode, upload — treat each phase as a pipeline stage with its own checkpoint. An orchestrator (AWS Step Functions, Temporal, or a custom DAG runner) coordinates the stages and handles retries at the stage level, not the job level.
A DAG orchestrator breaks the job into stages. Each stage runs on a dedicated worker, checkpoints to S3, and reports completion. On failure, the orchestrator resumes from the last checkpoint.
Checkpoint design. After each stage, the worker writes the intermediate output to durable storage (S3, GCS) along with a checkpoint record in the jobs DB: {job_id, stage: "transform", checkpoint_uri: "s3://...", items_processed: 45000, timestamp}. If the worker crashes during stage 3, the orchestrator reads the latest checkpoint, sees that stage 2 completed, and dispatches a new worker starting at stage 3 with the stage-2 output as input.
Checkpoint granularity. For stages that process data in batches (e.g., "transform 1M rows"), checkpoint every N batches (e.g., every 10,000 rows). The checkpoint stores the offset: "processed rows 0-40,000, intermediate file at s3://…/chunk-4.parquet". On resume, the new worker starts at row 40,001. This limits wasted work on crash to at most N items.
Orchestrator patterns. (a) Sequential pipeline — stages run one after another; the simplest model. (b) DAG pipeline — stages have dependencies; a DAG runner (Airflow, Step Functions) resolves the execution order. (c) Map-reduce pipeline — a single stage fans out to N parallel workers (e.g., transcode N video segments), then a reduce stage stitches the results. Each fan-out worker checkpoints independently.
Timeout per stage. Set a timeout on each stage, not just the overall job. If stage 2 normally takes 5 minutes, set a 15-minute stage timeout. This catches stuck workers faster than a single 2-hour job timeout. The orchestrator marks the stage as FAILED and retries it (up to stage_max_retries) before failing the overall job.
Cost of checkpointing. S3 PUT costs $0.005 per 1,000 requests. At one checkpoint per stage per job, 1M jobs/day with 4 stages = 4M PUTs = $20/day. The cost of NOT checkpointing — restarting a 2-hour job from zero — is far higher in compute. Checkpointing is almost always worth it for jobs > 5 minutes.
Pros
- +Crash recovery resumes from the last checkpoint, not from zero
- +Per-stage retry logic isolates failures to the broken stage
- +Enables parallelism — fan-out stages run concurrently
- +Clear observability: you know exactly which stage a job is in
Cons
- −Orchestrator adds infrastructure complexity (Step Functions, Temporal, custom DAG)
- −Checkpoint I/O adds latency — 100-500ms per S3 PUT
- −Schema evolution: checkpoint format must be forward-compatible across deploys
- −Debugging multi-stage failures requires distributed tracing
Choose this variant when
- Jobs take > 10 minutes and a restart-from-zero on crash is unacceptable
- The job has natural stages (extract, transform, load) with clear boundaries
- You need per-stage timeouts and retry budgets
- You are building a pipeline that will eventually need parallelism (fan-out)
Scaling path
V1 — Synchronous inline
Prove the feature works; no infrastructure beyond the API server.
Client sends request and blocks until work completes. Gateway timeout risk grows linearly with job duration. A 60s ALB timeout kills anything longer.
The API receives a request (e.g., POST /reports), generates the report inline, and returns 200 with the PDF payload. At low traffic with fast jobs (< 5s), this works fine — you need only a handful of server threads.
Client sends request and blocks until work completes. Gateway timeout risk grows linearly with job duration. A 60s ALB timeout kills anything longer.
Why it breaks. ALB default idle timeout is 60s. Nginx default proxy_read_timeout is 60s. CloudFront default origin response timeout is 30s. Any job that takes longer hits a gateway timeout. Even below the timeout, holding threads for 30s means 100 concurrent jobs need 100 threads — and each thread holds memory, a DB connection, and CPU. At 500 concurrent jobs, you are running 500 threads on a single server, and response times spike from thread contention.
What triggers the next iteration
- Gateway timeout kills jobs longer than 30-60 seconds
- Thread pool exhaustion at > 100 concurrent long-running requests
- No retry — if the server crashes mid-job, the work is lost
- User sees a spinner for the entire duration with no progress indication
V2 — Async + poll
Decouple submission from processing; give users a job_id to track.
API returns 202 immediately. Client polls GET /jobs/:id on an interval. Worker processes the job and updates status in the DB.
Replace the synchronous call with the 202 contract. The API validates input, mints a job_id, persists a PENDING job record, enqueues the job_id, and returns 202 with { job_id, status_url: "/jobs/{job_id}" }. A worker fleet processes jobs from the queue. The client polls the status URL until the job reaches a terminal state.
API returns 202 immediately. Client polls GET /jobs/:id on an interval. Worker processes the job and updates status in the DB.
Key decisions. (a) Queue choice: SQS for simplicity, Kafka for ordering or replay. (b) Job DB: Postgres with a jobs table (id, status, created_at, updated_at, result_url, error). (c) Worker concurrency: start with 5 workers, autoscale on queue depth. (d) Poll interval: recommend 2-5s via Retry-After header.
Result. API latency drops from minutes to < 50ms. Jobs can run for hours without hitting any timeout. Users get a job_id they can bookmark and check later. The system handles 10x more submissions because the API is no longer blocked by processing.
What triggers the next iteration
- Polling generates wasted HTTP traffic — 10K users × 2s interval = 5K req/s of overhead
- No progress feedback — users only see "pending" or "done"
- Single worker type — all jobs share the same fleet regardless of priority or size
- No checkpointing — a worker crash restarts the job from scratch
V3 — Push notifications + progress tracking
Eliminate polling overhead; show real-time progress to the user.
After submitting the job, the client opens an SSE stream. The worker publishes progress events to Redis Pub/Sub; the API relays them as SSE frames. No polling overhead.
Add SSE (or WebSocket) for real-time progress delivery. Workers write progress structs to Redis; the API subscribes and pushes frames to the client. The polling endpoint remains as a fallback.
After submitting the job, the client opens an SSE stream. The worker publishes progress events to Redis Pub/Sub; the API relays them as SSE frames. No polling overhead.
Progress design. The worker updates a Redis key (job:{id}:progress) with a JSON struct every 2 seconds: { pct: 65, stage: "Rendering page 650/1000", items_done: 650, items_total: 1000, eta: "2026-04-27T14:05:00Z" }. The key has a 300s TTL so stale progress auto-expires. The worker also publishes to a Redis Pub/Sub channel (job:{id}) so the SSE endpoint receives updates without polling Redis.
Result. Users see a live progress bar with ETA. API poll traffic drops 95% because active users use SSE. The remaining poll traffic is from CLI tools and mobile apps that fall back to polling.
What triggers the next iteration
- SSE connections require async server + sticky routing (connection affinity)
- Redis Pub/Sub is fire-and-forget — if the SSE server misses a message, the client sees a stale percentage
- No checkpointing yet — crashes restart from zero
- Single-stage jobs only — no pipeline decomposition
V4 — Distributed pipeline with checkpointing
Break jobs into stages, checkpoint each, and resume on failure.
A DAG orchestrator breaks the job into stages. Each stage runs on a dedicated worker, checkpoints to S3, and reports completion. On failure, the orchestrator resumes from the last checkpoint.
For jobs that naturally decompose into stages (extract → transform → render → upload), use an orchestrator (Step Functions, Temporal) to run each stage as an independent task. Each stage checkpoints its output to S3. On failure, the orchestrator retries the failed stage from its checkpoint.
A DAG orchestrator breaks the job into stages. Each stage runs on a dedicated worker, checkpoints to S3, and reports completion. On failure, the orchestrator resumes from the last checkpoint.
Stage isolation. Each stage runs in its own container with tailored resource limits. The extract stage needs high network I/O; the transform stage needs CPU; the render stage needs GPU. Dedicated containers prevent a CPU-heavy transform from starving an I/O-bound extract.
Parallel fan-out. For large datasets, the orchestrator fans out the transform stage to N parallel workers, each processing a shard. A final reduce stage merges the results. Checkpointing at the shard level means a single shard failure only replays that shard, not the entire dataset.
Result. Jobs that crashed at 90% now resume from 90% instead of 0%. Stage-level retries catch transient failures without restarting the whole pipeline. Resource utilization improves because each stage uses right-sized containers.
What triggers the next iteration
- Orchestrator is a new dependency — Step Functions has a 25K event-history limit per execution
- Checkpoint format must be forward-compatible across code deploys
- Debugging multi-stage failures requires distributed tracing end-to-end
- Increased infrastructure cost: orchestrator + per-stage containers + S3 checkpoints
Deep dives
Job state machine design
Every job transitions through a well-defined FSM. Pending → queued → running → succeeded or failed. A cancelled state can be entered from pending, queued, or running.
A job state machine is the backbone of any long-running task system. Without it, you end up with ad-hoc status fields that admit impossible transitions (e.g., a job that is simultaneously "running" and "cancelled") and make debugging a nightmare.
Every job transitions through a well-defined FSM. Pending → queued → running → succeeded or failed. A cancelled state can be entered from pending, queued, or running.
States. PENDING: job record created, not yet enqueued. QUEUED: message is in the work queue. RUNNING: a worker has dequeued and is actively processing. SUCCEEDED: work completed, result available. FAILED: work failed after exhausting retries. CANCELLED: user or system requested cancellation. Each state has exactly one entry condition and well-defined exit transitions.
Transition rules. (1) PENDING → QUEUED is triggered by successful enqueue. (2) QUEUED → RUNNING is triggered by the worker setting status=RUNNING with an atomic compare-and-swap (UPDATE jobs SET status='RUNNING' WHERE id=? AND status='QUEUED'). This CAS prevents two workers from running the same job. (3) RUNNING → SUCCEEDED requires the worker to write the result URL first, then update status. (4) RUNNING → FAILED sets an error message and increments a retry counter. If retry_count < max_retries, the transition is RUNNING → QUEUED (re-enqueue). (5) Any non-terminal state → CANCELLED is triggered by the user or a timeout.
Transition log. Store an append-only array of transitions: [{state: 'PENDING', at: '...', reason: 'created'}, {state: 'QUEUED', at: '...', reason: 'enqueued'}, ...]. This log is invaluable for debugging: "Why did this job take 3 hours? Because it transitioned RUNNING → FAILED → QUEUED → RUNNING three times with 30-minute gaps between each."
Terminal state immutability. Once a job reaches SUCCEEDED, FAILED, or CANCELLED, it must never transition again. Enforce this in your state machine code and in the database (a CHECK constraint or application-level guard). A job that flip-flops between SUCCEEDED and FAILED will corrupt downstream systems that rely on the final status.
Status channels: poll vs SSE vs WebSocket vs webhook
Four ways for the client to learn that a job is done. Poll is simplest. SSE is efficient for browser clients. WebSocket adds bidirectional. Webhook pushes to a server callback URL.
Choosing the right status channel depends on who the client is, how latency-sensitive the feedback must be, and how much infrastructure you are willing to add.
Four ways for the client to learn that a job is done. Poll is simplest. SSE is efficient for browser clients. WebSocket adds bidirectional. Webhook pushes to a server callback URL.
Polling. The client calls GET /jobs/:id on an interval. Pros: universally compatible, stateless server, easy to cache. Cons: wasted requests, notification delay equals poll interval. Best for: APIs consumed by scripts, CLIs, or mobile apps behind unreliable networks. Recommended interval: exponential backoff from 1s to 10s, capped.
SSE (Server-Sent Events). The client opens an EventSource to a streaming endpoint. The server pushes text/event-stream frames as progress updates arrive. Pros: sub-second latency, efficient (no wasted requests), auto-reconnect built into the browser API. Cons: unidirectional (server → client only), requires async server, proxy configuration (disable buffering, extend timeouts). Best for: web dashboards showing live progress.
WebSocket. Full-duplex channel over a single TCP connection. Pros: bidirectional (client can send cancel/pause commands on the same connection), lowest latency. Cons: more complex than SSE, requires WebSocket-aware load balancers, connection management overhead. Best for: interactive applications that need both progress push AND client-to-server commands.
Webhook. The server POSTs the result to a URL the client provides at submission time. Pros: no client polling, works for server-to-server integrations. Cons: client must expose a public HTTPS endpoint, retry/DLQ infrastructure needed, SSRF risk. Best for: B2B APIs (Stripe, Shopify, GitHub).
Hybrid approach. Most production systems implement polling as the baseline (it is the fallback everyone can use) and SSE for the dashboard. Webhooks are added for B2B integrations. WebSocket is rarely needed unless the client needs to send real-time commands during processing.
Checkpointing and resumability
A worker writes intermediate state to durable storage at defined intervals. On crash, a new worker reads the last checkpoint and resumes from that point instead of restarting from scratch.
Checkpointing is the difference between "a 2-hour job that crashed at 1h50m restarts from zero" and "it resumes from 1h50m." For any job longer than 5 minutes, checkpointing is not optional — it is a structural requirement.
A worker writes intermediate state to durable storage at defined intervals. On crash, a new worker reads the last checkpoint and resumes from that point instead of restarting from scratch.
What to checkpoint. The checkpoint must capture everything the worker needs to resume: the current offset (row number, page number, byte offset), intermediate state (partial aggregation, in-progress file), and metadata (job_id, stage, timestamp). Store this as a JSON file in S3 with a deterministic key: s3://checkpoints/{job_id}/stage-{n}/checkpoint.json.
When to checkpoint. Two strategies: (a) Time-based — every 60 seconds. Simple, but may lose up to 60 seconds of work. (b) Count-based — every N items (e.g., every 10,000 rows). More predictable replay cost. (c) Hybrid — whichever comes first. This is the recommended approach. The cost of checkpointing (one S3 PUT per interval) is negligible compared to the cost of replaying hours of work.
Atomic checkpoint + progress update. The worker should update the checkpoint and the progress record atomically (or at least in the same code path). If the checkpoint is ahead of the progress, users see inaccurate percentages. If the progress is ahead of the checkpoint, a crash resumes from behind where the user saw — which feels like regression.
Checkpoint versioning. Include a schema_version field in the checkpoint JSON. When you deploy new code that changes the checkpoint format, the new worker checks the schema_version. If it matches, resume. If it is an older version, either migrate the checkpoint in-place or restart from scratch (with a log warning). Never silently consume a checkpoint in an incompatible format — data corruption is worse than a restart.
Cleanup. After a job reaches SUCCEEDED, delete its checkpoints (or archive them to Glacier for audit). Stale checkpoints from abandoned jobs should be cleaned up by a TTL policy on the S3 prefix (lifecycle rule: delete objects older than 7 days under the checkpoints/ prefix).
Timeout and cancellation
A job-level timeout starts when the job enters RUNNING. If the worker does not report completion before the deadline, the orchestrator sets the job to CANCELLED, sends a cancellation signal, and the worker drains gracefully.
Without timeouts, a stuck worker holds a job in RUNNING state forever — consuming resources, blocking the slot, and making the user wait indefinitely. Without cancellation, users have no way to abort jobs that are no longer needed. Both are required for a production-grade system.
A job-level timeout starts when the job enters RUNNING. If the worker does not report completion before the deadline, the orchestrator sets the job to CANCELLED, sends a cancellation signal, and the worker drains gracefully.
Job-level timeout. Set a maximum wall-clock time for the entire job (e.g., 30 minutes for a report, 4 hours for a video transcode). When the job transitions to RUNNING, start a timer. If the timer fires before the job reaches a terminal state, transition to CANCELLED with reason=TIMEOUT. Implementation: the orchestrator (or a cron job) queries for jobs WHERE status='RUNNING' AND updated_at < NOW() - INTERVAL '30 minutes' and cancels them.
Heartbeat-based liveness. The timeout alone does not distinguish between "still working" and "hung." Workers send heartbeats — an update to last_heartbeat_at every 30 seconds. The orchestrator's liveness check is: if last_heartbeat_at < NOW() - 90s AND status = 'RUNNING', the worker is presumed dead. Transition to QUEUED for retry (if retries remain) or FAILED.
User-initiated cancellation. The client calls DELETE /jobs/:id or POST /jobs/:id/cancel. The API sets a cancel flag in the job record (or publishes to a Redis channel). The worker checks the cancel flag on every checkpoint or iteration. If set, the worker stops processing, cleans up intermediate files, and transitions to CANCELLED. The cancellation is cooperative — you cannot kill a running worker mid-computation without risking data corruption, so the worker must poll for the cancel signal.
Drain timeout. After receiving a cancel signal, give the worker a drain period (30 seconds) to finish its current batch and write a checkpoint. If it does not exit within the drain period, force-kill the container. The checkpoint ensures the job can be resumed later if the user changes their mind.
Preventing zombie jobs. A zombie job is one stuck in RUNNING with no live worker. Causes: worker OOM, network partition, or a bug that skips the heartbeat. A zombie reaper cron runs every 5 minutes, finds zombies (RUNNING + stale heartbeat), and transitions them to QUEUED or FAILED. Alert on zombie count > 0 — it indicates a systemic worker health issue.
Progress reporting design
Workers write a progress struct { pct, stage, eta } to a Redis key with a TTL. The API reads this key on poll requests or subscribes to a Redis channel for SSE push.
Progress reporting transforms a black-box job into a transparent process. Without it, users see "pending" for 10 minutes and assume the system is broken. With it, they see "45% — rendering page 450 of 1,000, ETA 2 minutes" and wait patiently.
Workers write a progress struct { pct, stage, eta } to a Redis key with a TTL. The API reads this key on poll requests or subscribes to a Redis channel for SSE push.
Progress data model. The worker writes a struct to a fast store (Redis, not the main DB — you do not want progress writes competing with business queries): { pct: 45, stage: "Rendering", items_done: 450, items_total: 1000, eta: "2026-04-27T14:05:00Z", started_at: "2026-04-27T14:00:00Z" }. The key is job:{id}:progress with a TTL of 300 seconds. If the worker dies, the progress key expires and the status endpoint returns null progress — a signal that the job may be hung.
Calculating ETA. Naive ETA = (items_total - items_done) / (items_done / elapsed_time). This is noisy at the start (0/0) and inaccurate if processing speed varies across items. Better: use a rolling average of the last 30 seconds of throughput. Even better: do not show ETA until items_done > 10% — before that, the estimate is unreliable. Display "Calculating..." instead.
Progress update frequency. Update every 2-5 seconds or every 1% of progress, whichever is less frequent. More frequent updates waste Redis writes and SSE bandwidth. Less frequent updates make the progress bar jerky. For SSE, batch updates into 2-second windows — if three updates arrive within 2 seconds, send only the latest.
Multi-stage progress. For pipeline jobs, report both stage-level and overall progress: { overall_pct: 60, stage: "Transform", stage_pct: 80, stages_done: 2, stages_total: 3 }. The frontend can show a multi-segment progress bar with each stage as a labeled section.
Graceful completion. When the job finishes, set pct=100 and send a final progress update before transitioning to SUCCEEDED. If you transition the status first and the SSE client reads the status before the final progress frame, the UI jumps from 95% to "Done" — a minor but noticeable UX glitch. Sequence: write progress 100% → write result_url → transition to SUCCEEDED → close SSE.
Result storage and TTL
The output of a long-running job — a PDF, a CSV export, a transcoded video — must be stored durably, served efficiently, and garbage-collected eventually. This sounds trivial but has several sharp edges in production.
Where to store results. Blob storage (S3, GCS, Azure Blob) is the canonical choice. Store the output with a deterministic key: s3://results/{job_id}/{filename}. Do not store results in the database — a 50 MB PDF in a Postgres row degrades backup times, replication lag, and query performance.
Presigned URLs. When the job completes, generate a presigned GET URL with a short expiry (1 hour). Store this URL in the job record. When the client polls, return the presigned URL. This avoids routing result downloads through your API server — the client downloads directly from S3, which scales infinitely and is cheaper.
Result TTL. Set a lifecycle rule on the S3 prefix: delete objects older than 7 days. After 7 days, the job record transitions to EXPIRED and the result_url becomes null. Clients that poll after expiry get a 410 Gone response with a message: "Result expired. Re-submit the job." This prevents unbounded storage growth.
Re-download protection. Users sometimes close the tab before downloading. Store a download_count in the job record. If download_count > 0, the user has already downloaded. If download_count = 0 and the result is about to expire, send a reminder email (if the user opted in). This reduces re-submission rates for users who simply forgot to download.
Large result handling. For results > 5 GB (video transcodes, large exports), use S3 multipart upload from the worker. Serve via CloudFront to avoid S3 transfer costs and to provide edge caching for popular results. For results that are accessed by multiple users (a shared team report), cache at the CDN layer with a 1-hour TTL.
Compliance. In regulated industries (healthcare, finance), results may contain PII. Enable S3 server-side encryption (SSE-S3 or SSE-KMS). Log all presigned URL generations to an audit trail. Ensure the lifecycle rule does not delete results before the legal retention period (which may be longer than 7 days — override the TTL per job type if needed).
Case studies
Step Functions: orchestrating multi-stage workflows
AWS Step Functions is a managed orchestrator for long-running, multi-stage workflows. It powers internal AWS services (CodePipeline, Glue ETL, SageMaker training) and thousands of customer workloads.
Architecture. A state machine definition (ASL JSON) describes the stages, transitions, retries, and error handling. When a job is submitted, Step Functions creates an execution, assigns it a unique ARN, and drives it through the state machine. Each state can invoke a Lambda, an ECS task, a Glue job, or wait for a callback. The execution history is durable — Step Functions stores every state transition in DynamoDB, providing a complete audit trail.
Numbers. Step Functions Standard supports up to 25,000 events per execution history (a practical limit of ~2,500 states with retries). Express Workflows support up to 5 minutes of execution with higher throughput (100K starts/sec). At AWS re:Invent 2023, the team reported > 1 billion executions per day across all customers. Typical step latency (Lambda invoke + state transition) is 50-100ms.
Checkpoint semantics. Step Functions provides built-in checkpointing: every state transition is persisted before the next state begins. If the service crashes mid-execution, it resumes from the last persisted state. This is exactly the checkpoint-resume pattern, implemented at the infrastructure layer so customers do not need to build it.
Timeout and retry. Each state supports TimeoutSeconds (per-state timeout), HeartbeatSeconds (liveness check), and Retry with MaxAttempts and BackoffRate. A state that fails 3 times transitions to a Catch block (error handler), which can route to a cleanup state or a human-approval step.
Takeaway
Step Functions proves that the orchestrator + checkpoint + per-stage-timeout model scales to billions of executions. If you are on AWS and your jobs are < 25K states, use it instead of building a custom orchestrator.
Bulk Operations API: async GraphQL for million-row exports
Shopify's Bulk Operations API lets merchants export millions of products, orders, or customers via a single GraphQL mutation. The API returns a job_id, and the merchant polls for completion — a textbook implementation of the 202 contract.
Flow. The merchant calls bulkOperationRunQuery with a GraphQL query. Shopify validates the query, creates a bulk operation record with status=CREATED, and returns the operation id. Workers execute the query in batches against Shopify's sharded MySQL database, writing results as JSONL to a GCS bucket. Each batch checkpoint updates the operation's object_count and status fields. When all batches complete, the worker sets status=COMPLETED and generates a presigned download URL.
Progress. The merchant polls the currentBulkOperation query, which returns: status, objectCount (items processed so far), fileSize (bytes written), and url (download link, populated on completion). There is no percentage — Shopify does not know the total count upfront because the query may span sharded databases. Instead, objectCount increases monotonically, giving merchants a rough sense of progress.
Numbers. Bulk operations process up to 10 million objects per query. Typical export time for 1M products: 5-15 minutes. The system handles > 50K concurrent bulk operations across all merchants. Results are available for download for 7 days, then garbage-collected — matching the TTL pattern.
Cancellation. Merchants can call bulkOperationCancel to abort a running export. The worker checks a cancel flag after each batch and stops processing. Partial results are still available for download — the merchant gets what was processed before cancellation.
Takeaway
Shopify shows that the 202 + poll + TTL pattern works at massive scale (10M objects, 50K concurrent ops) without SSE or webhooks — polling is sufficient when combined with monotonic progress counters.
Video transcoding: multi-stage pipeline at planetary scale
YouTube processes over 500 hours of video uploaded every minute. Each upload triggers a multi-stage transcoding pipeline that produces dozens of renditions (144p to 8K, H.264 and VP9 and AV1) — a long-running task that can take minutes to hours per video.
Pipeline stages. (1) Ingest — the upload service writes the raw file to Colossus (Google's distributed file system) and creates a job record. (2) Analysis — a worker extracts metadata (codec, resolution, frame rate, audio channels) and determines which renditions to produce. (3) Transcode — the job fans out to N workers, each producing one rendition. GPU-accelerated workers handle H.264/VP9; TPU-accelerated workers handle AV1. (4) Assembly — a reducer stitches segments, generates DASH/HLS manifests, and writes the final output to serving storage. (5) Publish — the video status transitions from "processing" to "public" and becomes available to viewers.
Checkpointing. Each transcode worker processes the video in segments (typically 4-second chunks). After each segment, the worker writes the output chunk to Colossus and updates a checkpoint record. On crash, a new worker reads the checkpoint and resumes from the last completed segment. For a 1-hour 4K video, this means a crash at 58 minutes loses at most 4 seconds of work.
Progress. Creators see a percentage-based progress bar on the upload page. The progress is computed from (segments_completed / segments_total) across all renditions, weighted by rendition priority (the first playable rendition — typically 360p — is weighted highest so it completes first and the video becomes watchable quickly).
Numbers. Transcode latency targets: first playable rendition within 1-2 minutes of upload; all renditions within 30 minutes for a typical 10-minute video. The system runs on hundreds of thousands of workers across multiple data centers, with automatic failover between regions.
Takeaway
YouTube demonstrates the pipeline-with-checkpointing pattern at extreme scale: segment-level checkpoints, fan-out parallelism, weighted progress, and a "first playable rendition" priority that optimizes user-perceived latency.
Decision levers
Status channel selection
The choice of poll vs SSE vs WebSocket vs webhook depends on three factors: (1) who the client is (browser, server, mobile), (2) how latency-sensitive the progress feedback must be (seconds vs sub-second), and (3) how much infrastructure you are willing to add (none for poll, Redis Pub/Sub + async server for SSE, WS-aware LB for WebSocket, retry pipeline for webhook). Start with polling. Add SSE when the dashboard UX demands real-time progress. Add webhooks when B2B customers request them. Add WebSocket only if you need bidirectional communication (cancel from the client during processing).
Checkpoint granularity
Fine-grained checkpoints (every 100 items) minimize wasted work on crash but increase I/O overhead. Coarse checkpoints (every 10,000 items) are cheaper but risk replaying more work. The sweet spot depends on item processing cost: if each item takes 1ms (100K items/s), checkpoint every 60 seconds (60K items). If each item takes 10s (ML inference), checkpoint after every item. The rule of thumb: checkpoint so that the maximum replay cost on crash is < 5% of the total job duration.
Job timeout budget
Set the job timeout to 3x the p99 observed duration. If p99 is 10 minutes, set timeout to 30 minutes. Too short and healthy jobs get killed during load spikes. Too long and stuck jobs hold resources for hours. For pipeline jobs, set per-stage timeouts (3x stage p99) AND an overall timeout (sum of stage timeouts + 50% buffer). The heartbeat interval should be timeout / 3 — if timeout is 30 minutes, heartbeat every 10 minutes.
Result TTL
Short TTLs (24 hours) reduce storage cost but increase re-submission rates. Long TTLs (30 days) keep results available but grow storage linearly. 7 days is the industry default (Shopify, GitHub Actions, most CI/CD systems). For compliance-regulated outputs (financial reports, medical records), override the TTL to match the legal retention period. Send an email reminder 24 hours before expiry to reduce "where did my report go?" support tickets.
Retry budget and backoff
Each job should have a retry budget (max 3 retries) with exponential backoff (1 min, 5 min, 15 min). After exhausting retries, transition to FAILED and alert the user. Do not retry indefinitely — it wastes compute and hides bugs. For pipeline jobs, retry at the stage level: a transient network error in the "upload" stage should not re-run the "transform" stage. Differentiate between retryable errors (timeout, 503, OOM) and non-retryable errors (400, invalid input, data corruption). Non-retryable errors go straight to FAILED without retry.
Failure modes
A worker crashes (OOM, network partition) and the job remains in RUNNING state forever. No heartbeat arrives, no progress updates. The user sees "Running..." indefinitely. Fix: a zombie reaper cron checks for jobs WHERE status='RUNNING' AND last_heartbeat_at < NOW() - 90s, and transitions them to QUEUED (for retry) or FAILED. Alert on zombie count > 0.
The visibility timeout expires before the worker finishes, the message reappears, and a second worker picks it up. Both workers are now processing the same job, racing to write results. Fix: use a compare-and-swap when transitioning to RUNNING (UPDATE jobs SET status='RUNNING', worker_id=? WHERE id=? AND status='QUEUED'). The losing worker sees 0 rows affected and aborts.
The worker is running but stuck — waiting on a lock, a slow database query, or an external API that is not responding. Heartbeats keep arriving (the heartbeat loop is on a separate thread) but progress percentage does not advance. Fix: add a progress staleness check in addition to heartbeat liveness. If pct has not changed in 5 minutes despite heartbeats, flag the job for investigation.
The worker writes the result to S3 and crashes before updating the job status to SUCCEEDED. The user sees "Running..." forever while the result already exists. Fix: the zombie reaper should check for results in S3 before transitioning a zombie job to FAILED. If a result exists, transition to SUCCEEDED and backfill the result_url.
A new code deploy changes the checkpoint format. Workers pick up old checkpoints in the new format and either crash or produce garbled output. Fix: include a schema_version field in every checkpoint. The worker checks schema_version on resume. If mismatched, either migrate the checkpoint or restart from scratch with a warning log. Never silently consume an incompatible checkpoint.
When a popular job nears completion, thousands of users poll GET /jobs/:id simultaneously. The status endpoint's database query becomes the bottleneck. Fix: (1) cache the job status in Redis with a 1s TTL so polls read from cache, not the DB. (2) Return Retry-After: 5 to slow down eager clients. (3) Use SSE for active viewers to eliminate polling entirely.
A global timeout of 30 minutes kills jobs that normally take 10 minutes but slow to 35 minutes during a database overload. Fix: base timeouts on heartbeat staleness, not wall clock. If the worker is still sending heartbeats and making progress, extend the deadline dynamically. Only kill when heartbeats stop or progress stalls.
Decision table
Status channel comparison
| Dimension | Polling | SSE | WebSocket | Webhook |
|---|---|---|---|---|
| Notification latency | 1-10s (poll interval) | < 100ms | < 100ms | < 1s + retry delay |
| Client complexity | Trivial (setTimeout) | Low (EventSource) | Medium (ws protocol) | Medium (HTTPS server) |
| Server complexity | None (stateless) | Medium (async + pubsub) | High (connection mgmt) | Medium (retry + DLQ) |
| Proxy compatibility | Universal | Good (needs proxy config) | Needs WS-aware LB | N/A (server-to-server) |
| Progress updates | Yes (per-poll) | Yes (real-time) | Yes (real-time) | Usually final only |
| Bidirectional | No | No | Yes (cancel/pause) | No |
| Best for | APIs, CLIs, mobile | Web dashboards | Interactive apps | B2B integrations |
- Most systems implement polling + SSE. Webhooks are added for B2B. WebSocket is rarely needed.
Worked example
Worked example: PDF report generation service
A SaaS analytics platform lets customers generate PDF reports from their dashboard data. Reports pull from a Postgres data warehouse, render charts via a headless browser, assemble multi-page PDFs, and upload to S3 for download. A typical report takes 30-120 seconds; large enterprise reports can take 10+ minutes.
Step 1: Define the API contract
POST /api/reports accepts a JSON body: { template_id, date_range, filters, callback_url? }. The API validates input, mints a UUIDv7 job_id, inserts a job record (status=PENDING, created_at=now), enqueues the job_id to an SQS queue, and returns:
HTTP 202 Accepted Location: /api/reports/01927abc-def0-7000-8000-000000000001 { "job_id": "01927abc...", "status": "pending", "status_url": "/api/reports/01927abc..." }
The idempotency key is a hash of (user_id, template_id, date_range, filters). If the client retries within 5 minutes, the API returns the existing job_id instead of creating a duplicate.
Client submits work, gets 202 with a job_id, and polls for status. The API enqueues the job; a worker processes it and writes results to a result store. The client polls until status is done.
Step 2: Job state machine
The jobs table: id (UUIDv7), user_id, template_id, status (enum: pending, queued, running, succeeded, failed, cancelled), progress_pct (0-100), current_stage (text), result_url (nullable), error (nullable), retry_count (default 0), created_at, updated_at, last_heartbeat_at. Status transitions are enforced by a check: UPDATE jobs SET status='running' WHERE id=? AND status='queued' — if 0 rows affected, another worker already claimed it.
Step 3: Worker implementation
The worker long-polls SQS (WaitTimeSeconds=20, MaxNumberOfMessages=1). On receiving a message:
- 1Load the job record. CAS transition to RUNNING.
- 2Start a heartbeat goroutine: every 30s, UPDATE last_heartbeat_at=now AND call ChangeMessageVisibility to extend the SQS timeout.
- 3Extract: query the data warehouse for the report data (checkpoint after query completes).
- 4Render: launch a headless browser, render each chart as SVG (progress: 0-60%, update Redis every 2 pages).
- 5Assemble: combine SVGs + data into a multi-page PDF using a PDF library (progress: 60-90%).
- 6Upload: PUT the PDF to S3 at results/{job_id}/report.pdf (progress: 90-100%).
- 7Generate a presigned URL (7-day expiry), update the job record (status=SUCCEEDED, result_url=presigned URL).
- 8Delete the SQS message. Stop the heartbeat goroutine.
On error: increment retry_count. If retry_count < 3, transition to QUEUED and let the message re-appear (do NOT delete it). If retry_count >= 3, transition to FAILED, delete the message, and send a user notification.
Step 4: Status endpoint
GET /api/reports/:id returns: { job_id, status, progress: { pct, stage, eta }, result_url, error, created_at, updated_at }. The progress object is read from Redis (job:{id}:progress) — fast and does not hit the jobs DB. If the Redis key is missing (expired or not yet written), return progress: null.
For SSE: GET /api/reports/:id/stream opens an EventSource connection. The server subscribes to Redis Pub/Sub channel job:{id} and relays messages as SSE frames: data: {"pct":45,"stage":"Rendering page 9/20","eta":"2026-04-27T14:05:00Z"}.
Step 5: Timeout, cancellation, and cleanup
A cron job runs every 5 minutes:
- Zombie reaper: jobs WHERE status='running' AND last_heartbeat_at < NOW() - '90 seconds' → check S3 for result; if exists, transition to SUCCEEDED; otherwise, transition to QUEUED if retry_count < 3, else FAILED.
- Expired results: jobs WHERE status='succeeded' AND updated_at < NOW() - '7 days' → delete S3 object, set result_url=null, status=EXPIRED.
User cancellation: POST /api/reports/:id/cancel → set a cancel flag in Redis (job:{id}:cancel). The worker checks this flag after each render page. If set, stop rendering, clean up partial PDF, transition to CANCELLED.
Metrics dashboard
Three panels: (1) Submission rate vs completion rate (should track closely with a lag equal to median job duration). (2) Active job count by status (PENDING, QUEUED, RUNNING) — a growing RUNNING count without matching completions signals stuck workers. (3) p50/p95/p99 job duration — trending up indicates data warehouse slowdown or increased report complexity.
Interview playbook
When it comes up
- The prompt mentions "generate a report", "export data", "transcode video", or any task > 30s
- The interviewer asks "what happens if this takes too long?"
- You are designing a system where the user submits work and comes back later
- The prompt includes "background job", "async processing", or "job queue"
Order of reveal
- 11. Name the 202 contract. Any task that outlasts a gateway timeout needs the 202 pattern: accept the request, return a job_id, and let the client poll or subscribe for status.
- 22. Draw the skeleton. Client → API → job queue → worker → result store. Client polls GET /jobs/:id. The API returns 202 with a Location header.
- 33. Define the state machine. Every job follows PENDING → QUEUED → RUNNING → SUCCEEDED | FAILED | CANCELLED. Transitions are guarded by compare-and-swap to prevent duplicate execution.
- 44. Add progress and status channels. Workers write progress to Redis. The client polls or subscribes via SSE. I would start with polling and add SSE for the dashboard.
- 55. Add checkpointing. For jobs > 5 minutes, checkpoint intermediate state to S3 after each stage. On crash, resume from the last checkpoint instead of restarting from zero.
- 66. Add timeout and cancellation. A heartbeat-based liveness check detects zombie workers. A zombie reaper cron transitions stale jobs. Users can cancel via a cooperative cancel flag.
- 77. Add result storage with TTL. Results go to S3 with a presigned URL. A 7-day TTL garbage-collects expired results. The job transitions to EXPIRED after TTL.
Signature phrases
- “202 Accepted with a Location header” — Instantly communicates you know the async contract
- “job state machine with compare-and-swap transitions” — Shows you prevent duplicate execution at the DB level
- “heartbeat-based liveness, not just wall-clock timeout” — Distinguishes slow-but-healthy from truly stuck
- “checkpoint every N items or every 60 seconds, whichever comes first” — Shows you have thought about crash recovery cost
- “progress to Redis, not the main DB — separate the read hot path” — Demonstrates operational awareness of DB load
- “presigned URL with TTL for result download” — Proves you know how to serve large files without routing through your API
Likely follow-ups
?“How do you handle a job that is stuck but the worker is still sending heartbeats?”Reveal
Add a progress staleness check on top of heartbeat liveness. If last_heartbeat_at is fresh but progress_pct has not changed in 5 minutes, the worker is alive but stuck — maybe waiting on a lock or a slow dependency. Flag the job for investigation: either alert the user, or automatically kill and retry if retry budget remains. The key insight is that heartbeat proves "process is alive," not "process is making progress." You need both signals.
?“What if the user submits the same report twice?”Reveal
Use an idempotency key: hash(user_id, template_id, date_range, filters). On POST, check if a job with this key exists and is in a non-terminal state (PENDING, QUEUED, RUNNING). If so, return the existing job_id with 200 (not 202). If the previous job FAILED, allow re-submission (new job_id). Store the idempotency key with a TTL equal to the job timeout + result TTL. This prevents duplicate work without blocking legitimate retries.
?“How do you prevent a thundering herd when a popular job completes?”Reveal
Three layers: (1) Cache the job status in Redis with a 1-second TTL. Poll requests read from Redis, not the DB. (2) Return Retry-After: 5 in the poll response to throttle eager clients. (3) Use SSE for dashboard viewers — they get a push notification on completion and stop polling entirely. The combination eliminates 99% of redundant poll traffic.
?“How do you size the worker fleet?”Reveal
Start with the throughput equation: workers_needed = submission_rate × avg_job_duration. If users submit 100 reports/min and each takes 60 seconds, you need 100 concurrent workers. Add 30% headroom for retries and variance. Autoscale on queue depth: if ApproximateNumberOfMessagesVisible > 2x steady-state for 3 minutes, add 50% more workers. Scale down with a 10-minute cool-down to prevent flapping. Use Fargate Spot for cost — report workers are interruptible.
?“What happens when the result store (S3) is down?”Reveal
The worker retries the S3 PUT with exponential backoff (1s, 5s, 25s). If S3 is down for > 2 minutes, the worker checkpoints its progress and transitions the job back to QUEUED. When S3 recovers, the next worker resumes from the checkpoint. The key design choice: the worker must NOT transition to SUCCEEDED without confirming the result is durable. A race between "mark done" and "S3 PUT" can produce jobs that are SUCCEEDED but have no downloadable result — the worst possible UX.
?“How do you handle cancellation mid-pipeline?”Reveal
Set a cancel flag in Redis (job:{id}:cancel). Each pipeline stage checks the flag at stage boundaries (between stages) and within stages (every N items or every checkpoint). On cancel: (1) stop processing, (2) write a final checkpoint (so the job can be resumed if the user changes their mind), (3) clean up partial results, (4) transition to CANCELLED with reason=USER_REQUESTED. The cancellation is cooperative — killing a worker mid-write risks data corruption. Give a 30-second drain period after setting the flag.
Code snippets
// POST /api/jobs — submit a long-running task
import { randomUUIDv7 } from 'crypto';
import { db } from './db';
import { queue } from './queue';
export async function POST(req: Request) {
const body = await req.json();
const idempotencyKey = req.headers.get('Idempotency-Key');
// Check for existing job with same idempotency key
if (idempotencyKey) {
const existing = await db.query(
'SELECT id, status, result_url FROM jobs WHERE idempotency_key = $1',
[idempotencyKey],
);
if (existing.rows.length > 0) {
const job = existing.rows[0];
return Response.json(
{ job_id: job.id, status: job.status, status_url: '/api/jobs/' + job.id },
{ status: job.status === 'pending' ? 202 : 200 },
);
}
}
const jobId = randomUUIDv7();
await db.query(
'INSERT INTO jobs (id, user_id, payload, status, idempotency_key, created_at) ' +
"VALUES ($1, $2, $3, 'pending', $4, NOW())",
[jobId, body.user_id, JSON.stringify(body.payload), idempotencyKey],
);
await queue.send({ jobId });
await db.query(
"UPDATE jobs SET status = 'queued' WHERE id = $1 AND status = 'pending'",
[jobId],
);
return Response.json(
{ job_id: jobId, status: 'queued', status_url: '/api/jobs/' + jobId },
{
status: 202,
headers: { Location: '/api/jobs/' + jobId, 'Retry-After': '2' },
},
);
}// Worker loop: dequeue → process → heartbeat → progress
import { sqs, redis, db, s3 } from './infra';
async function processJob(jobId: string, receiptHandle: string) {
// CAS transition: only one worker can claim the job
const res = await db.query(
"UPDATE jobs SET status = 'running', last_heartbeat_at = NOW() " +
"WHERE id = $1 AND status = 'queued' RETURNING *",
[jobId],
);
if (res.rowCount === 0) return; // another worker claimed it
// Heartbeat: extend visibility + update liveness every 30s
const heartbeat = setInterval(async () => {
await sqs.changeMessageVisibility(receiptHandle, 120);
await db.query(
'UPDATE jobs SET last_heartbeat_at = NOW() WHERE id = $1',
[jobId],
);
}, 30_000);
try {
const job = res.rows[0];
const totalSteps = 100;
for (let step = 1; step <= totalSteps; step++) {
// Check for cancellation
const cancelled = await redis.get('job:' + jobId + ':cancel');
if (cancelled) {
await db.query(
"UPDATE jobs SET status = 'cancelled' WHERE id = $1",
[jobId],
);
return;
}
await doWork(job.payload, step);
// Update progress in Redis (fast path, not the DB)
await redis.set(
'job:' + jobId + ':progress',
JSON.stringify({ pct: step, stage: 'Step ' + step + '/' + totalSteps }),
'EX', 300,
);
}
// Upload result to S3
const key = 'results/' + jobId + '/output.pdf';
await s3.putObject(key, resultBuffer);
const url = await s3.getSignedUrl(key, 7 * 86400);
await db.query(
"UPDATE jobs SET status = 'succeeded', result_url = $2 WHERE id = $1",
[jobId, url],
);
await sqs.deleteMessage(receiptHandle);
} catch (err) {
const retryCount = await db.query(
"UPDATE jobs SET status = 'queued', retry_count = retry_count + 1 " +
"WHERE id = $1 AND retry_count < 3 RETURNING retry_count",
[jobId],
);
if (retryCount.rowCount === 0) {
await db.query(
"UPDATE jobs SET status = 'failed', error = $2 WHERE id = $1",
[jobId, String(err)],
);
await sqs.deleteMessage(receiptHandle);
}
// If retries remain, do NOT delete — let visibility timeout re-deliver
} finally {
clearInterval(heartbeat);
}
}// GET /api/jobs/:id/stream — Server-Sent Events
import { redis } from './infra';
export async function GET(
req: Request,
{ params }: { params: { id: string } },
) {
const jobId = params.id;
const encoder = new TextEncoder();
const stream = new ReadableStream({
start(controller) {
const subscriber = redis.duplicate();
subscriber.subscribe('job:' + jobId);
subscriber.on('message', (_channel: string, message: string) => {
const data = JSON.parse(message);
controller.enqueue(
encoder.encode('data: ' + JSON.stringify(data) + '
'),
);
// Close stream on terminal status
if (['succeeded', 'failed', 'cancelled'].includes(data.status)) {
subscriber.unsubscribe();
subscriber.quit();
controller.close();
}
});
// Heartbeat to keep connection alive through proxies
const keepAlive = setInterval(() => {
controller.enqueue(encoder.encode(': keepalive
'));
}, 15_000);
req.signal.addEventListener('abort', () => {
clearInterval(keepAlive);
subscriber.unsubscribe();
subscriber.quit();
controller.close();
});
},
});
return new Response(stream, {
headers: {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
Connection: 'keep-alive',
'X-Accel-Buffering': 'no', // Nginx: disable response buffering
},
});
}// Worker that checkpoints to S3 and resumes on crash
import { s3, db } from './infra';
interface Checkpoint {
schemaVersion: number;
jobId: string;
stage: string;
offset: number;
intermediateUri: string;
timestamp: string;
}
async function loadCheckpoint(jobId: string): Promise<Checkpoint | null> {
try {
const data = await s3.getObject('checkpoints/' + jobId + '/latest.json');
const ckpt: Checkpoint = JSON.parse(data);
if (ckpt.schemaVersion !== 1) {
console.warn('Incompatible checkpoint v' + ckpt.schemaVersion + ', restarting');
return null;
}
return ckpt;
} catch {
return null; // no checkpoint — start from scratch
}
}
async function saveCheckpoint(ckpt: Checkpoint): Promise<void> {
await s3.putObject(
'checkpoints/' + ckpt.jobId + '/latest.json',
JSON.stringify(ckpt),
);
}
async function processWithCheckpoints(jobId: string, data: any[]) {
const ckpt = await loadCheckpoint(jobId);
const startOffset = ckpt?.offset ?? 0;
for (let i = startOffset; i < data.length; i++) {
await processItem(data[i]);
// Checkpoint every 1000 items or every 60s
if ((i - startOffset) % 1000 === 999) {
await saveCheckpoint({
schemaVersion: 1,
jobId,
stage: 'transform',
offset: i + 1,
intermediateUri: 's3://intermediate/' + jobId + '/batch-' + (i + 1) + '.parquet',
timestamp: new Date().toISOString(),
});
}
}
// Cleanup checkpoints after success
await s3.deleteObject('checkpoints/' + jobId + '/latest.json');
}-- Run every 5 minutes via cron to detect and recover zombie jobs
-- A zombie is a job stuck in RUNNING with no recent heartbeat
-- Step 1: Find zombies (running but heartbeat stale > 90s)
WITH zombies AS (
SELECT id, retry_count
FROM jobs
WHERE status = 'running'
AND last_heartbeat_at < NOW() - INTERVAL '90 seconds'
FOR UPDATE SKIP LOCKED -- avoid contention with workers
)
-- Step 2: Retry if budget remains, else fail
UPDATE jobs
SET
status = CASE
WHEN retry_count < 3 THEN 'queued'
ELSE 'failed'
END,
error = CASE
WHEN retry_count >= 3 THEN 'Zombie: no heartbeat for 90s, retries exhausted'
ELSE NULL
END,
retry_count = CASE
WHEN retry_count < 3 THEN retry_count + 1
ELSE retry_count
END,
updated_at = NOW()
WHERE id IN (SELECT id FROM zombies);
-- Step 3: Expire old results (7-day TTL)
UPDATE jobs
SET status = 'expired', result_url = NULL, updated_at = NOW()
WHERE status = 'succeeded'
AND updated_at < NOW() - INTERVAL '7 days';Drills
A job has been in RUNNING state for 2 hours but the worker is sending heartbeats and progress is stuck at 45%. What do you do?Reveal
Heartbeat proves liveness, not progress. Check the progress staleness: if pct has not changed in 5+ minutes despite heartbeats, the worker is stuck (likely waiting on a lock or a slow dependency). Investigate the worker logs. If the dependency is recoverable, wait. If not, set a cancel flag and let the worker drain gracefully. Then restart from the last checkpoint. Add a progress-staleness alert to catch this automatically: if pct unchanged for > 5 minutes AND heartbeat fresh, page on-call.
Your status endpoint is getting 10,000 req/s from users polling a popular job. How do you handle it?Reveal
Three-layer defense: (1) Cache the job status in Redis with a 1-second TTL — polls read from cache, not the DB. (2) Add Retry-After: 5 header to slow clients. (3) Offer SSE for active dashboard viewers so they stop polling entirely. For CDN-fronted APIs, add Cache-Control: public, max-age=1 so the CDN absorbs repeated requests. This reduces DB load from 10K/s to ~1/s (one cache miss per second).
Two workers pick up the same job due to a visibility timeout race. How do you prevent them from producing duplicate results?Reveal
Use a compare-and-swap (CAS) when transitioning to RUNNING: UPDATE jobs SET status='running', worker_id=? WHERE id=? AND status='queued'. The first worker succeeds (1 row affected); the second sees 0 rows affected and aborts immediately. This is the same optimistic concurrency pattern used in distributed locks. The CAS happens at the DB level, not the queue level, so it works regardless of the queue technology.
A code deploy changes the checkpoint format. Running jobs have old-format checkpoints. What happens?Reveal
If you do not handle this, the new code reads old checkpoints, misinterprets fields, and either crashes or produces garbled output. Solution: include a schema_version field in every checkpoint. On resume, check the version: if it matches, resume normally. If it is an older version, either (a) migrate the checkpoint in-place (if the migration is simple), or (b) restart from scratch with a warning log. Never silently consume an incompatible checkpoint — the data corruption is worse than a restart.
Why should you write progress to Redis instead of the jobs table in Postgres?Reveal
Progress updates are write-heavy and read-heavy but low-durability: if you lose one progress update, nothing breaks — the next one overwrites it. Writing to Postgres means every progress update is a WAL write, fsync, and replication event. At 1,000 concurrent jobs updating every 2 seconds, that is 500 writes/s on a table that also handles status transitions, result URLs, and administrative queries. Redis handles this trivially — SET with EX (TTL) is sub-millisecond and does not compete with the business database. The jobs table stays lean for transactional operations.
How would you add webhook support to a system that currently only supports polling?Reveal
Add a callback_url column to the jobs table (nullable). When the client includes callback_url in the POST /jobs request, store it. When the worker transitions a job to SUCCEEDED or FAILED, check if callback_url is set. If so, enqueue a webhook delivery message to a separate webhook-delivery queue. A dedicated webhook worker POSTs the result payload to the callback_url with an HMAC-SHA256 signature. Retry 3 times with exponential backoff. After 3 failures, mark the webhook as failed but leave the job as SUCCEEDED — the client can still poll. The key insight: webhook delivery is itself a long-running task — do not block the main worker on it.