Large file upload & blob handling
When to reach for this
Reach for this when…
- User-uploaded media (video, images, audio)
- Document processing / attachments
- Backup / archive flows
- Data ingest > 10 MB per file
Not really this pattern when…
- Tiny payloads (<1 MB) embedded in JSON — just post the bytes
- Server-generated content (nothing to upload)
Good vs bad answer
Interviewer probe
“How does video upload work?”
Weak answer
"Client POSTs the file to /upload; server saves it."
Strong answer
"Client requests a pre-signed multipart upload from our API. Server issues short-lived signed URLs per part (5 MB parts). Client uploads in parallel directly to S3 — our fleet sees no bytes. S3 emits an ObjectCreated event to SQS on CompleteMultipartUpload. A worker consumes, AV-scans, runs FFmpeg transcoding for HLS variants, writes thumbnails, and updates the DB row from uploading to ready. The user sees an async progress indicator; frontend polls or uses SSE. Untrusted uploads live in a scan bucket; only post-scan files move to the public CDN-backed bucket. Lifecycle rules abort incomplete multipart after 7 days to avoid phantom storage cost."
Why it wins: Signed multipart, async processing, two-bucket trust boundary, lifecycle hygiene.
Cheat sheet
- Bytes bypass the app. Always.
- Signed URL: method + path + content-type + expiry.
- Multipart > 100 MB. Part-level retries.
- Processing is async via blob event → queue → worker.
- Two-bucket pattern: untrusted → scan → trusted.
- Lifecycle rule: abort incomplete multipart after 7 days.
- Sniff MIME server-side; never trust the header.
Core concept
The bytes bypass your app. Canonical flow:
1. Client requests a signed upload URL: POST /uploads with filename + content-type. Server returns a short-lived (15-min) pre-signed S3/GCS URL + an upload_id.
2. Client PUTs the file directly to blob storage. Can be a single PUT for small files or multipart for large. Your app fleet sees none of the bytes.
3. Blob storage emits an event (S3 event → SQS/EventBridge, GCS → Pub/Sub) on upload completion.
4. Processor worker picks up the event and runs pipeline: virus scan → transcode → thumbnail → metadata extraction → DB row update.
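Step 4 of the flow above can be sketched as a small event handler. This is a minimal illustration, not a production worker: the in-memory `DB` dict stands in for the uploads table, the event shape (`{"upload_id": ...}`) is a simplification of a real storage event, and the stage functions are hypothetical placeholders.

```python
from typing import Callable

# Hypothetical in-memory stand-in for the uploads table.
DB: dict[str, dict] = {
    "up_123": {"key": "incoming/up_123.mp4", "state": "uploading"},
}

def virus_scan(row: dict) -> None:
    # Placeholder: a real worker would call an AV engine here.
    row["scanned"] = True

def transcode(row: dict) -> None:
    # Placeholder for an FFmpeg job producing HLS variants.
    row["variants"] = ["360p", "720p"]

def thumbnail(row: dict) -> None:
    row["thumbnail"] = row["key"] + ".jpg"

def extract_metadata(row: dict) -> None:
    row["metadata"] = {"container": row["key"].rsplit(".", 1)[-1]}

# Order matters: scan first, so nothing downstream touches unscanned bytes.
PIPELINE: list[Callable[[dict], None]] = [
    virus_scan, transcode, thumbnail, extract_metadata,
]

def handle_blob_event(event: dict) -> str:
    """Run the pipeline for one upload-completed event and record the outcome."""
    row = DB[event["upload_id"]]
    row["state"] = "processing"
    try:
        for stage in PIPELINE:
            stage(row)
    except Exception as exc:
        row["state"] = "failed"
        row["error"] = str(exc)
    else:
        row["state"] = "ready"
    return row["state"]

print(handle_blob_event({"upload_id": "up_123"}))
```

The row only reaches `ready` if every stage succeeds; any exception leaves it in `failed` with the error recorded, so the frontend has something definite to show.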
Multipart / resumable for large files:
- S3 multipart: client splits into N parts (each at least 5 MB, except the last, which may be smaller), uploads in parallel with per-part signed URLs, and completes with a CompleteMultipartUpload call. Failed parts are retried individually, with no full restart.
- tus.io / GCS resumable: same idea, different API.
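To make the part-splitting concrete, here is a rough sketch of how a client might plan byte ranges for an S3-style multipart upload. The function name `plan_parts` and the 8 MiB default are assumptions for illustration; the 5 MiB minimum part size and the 10,000-part cap are S3's documented limits.

```python
MIN_PART = 5 * 1024 * 1024   # S3 minimum size for every part except the last
MAX_PARTS = 10_000           # S3 maximum number of parts in one upload

def plan_parts(total_size: int, part_size: int = 8 * 1024 * 1024) -> list[tuple[int, int, int]]:
    """Return (part_number, offset, length) tuples; part numbers start at 1."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART:
        raise ValueError("part_size is below the 5 MiB minimum")
    # Grow the part size if the file would otherwise need more than MAX_PARTS.
    smallest_allowed = -(-total_size // MAX_PARTS)  # ceiling division
    part_size = max(part_size, smallest_allowed)

    parts = []
    offset = 0
    number = 1
    while offset < total_size:
        length = min(part_size, total_size - offset)  # last part may be short
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts

for part in plan_parts(20 * 1024 * 1024):
    print(part)
```

Each tuple maps to one signed URL request and one PUT of `length` bytes starting at `offset`, which is what makes per-part retry possible.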
Security is non-trivial. Signed URLs: narrow to method (PUT), path (bucket + specific key), content-type, content-length. Short expiry (15 min). Otherwise, one leaked URL + no type check = attacker uploads a .exe to your "images" bucket.
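The effect of narrowing a signed URL can be shown with a toy signer. This is deliberately not AWS SigV4 or GCS signing: it is a simplified HMAC scheme, with a hypothetical secret and host, that only illustrates why binding method, key, content-type, content-length, and expiry into the signature limits what a leaked URL can do.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"demo-secret"  # hypothetical key; real systems sign with cloud credentials

def _signature(method: str, key: str, content_type: str,
               content_length: int, expires: int) -> str:
    # Every constraint is part of the signed string, so changing any one of
    # them produces a different signature.
    msg = "\n".join([method, key, content_type, str(content_length), str(expires)])
    return hmac.new(SECRET, msg.encode(), hashlib.sha256).hexdigest()

def sign_upload(key: str, content_type: str, content_length: int,
                ttl: int = 900, now=None) -> dict:
    """Issue a grant for one PUT of one key; ttl defaults to 15 minutes."""
    expires = int(now if now is not None else time.time()) + ttl
    sig = _signature("PUT", key, content_type, content_length, expires)
    query = urlencode({"expires": expires, "sig": sig})
    return {
        "url": f"https://blob.example.com/{key}?{query}",
        "expires": expires,
        "sig": sig,
    }

def verify_upload(method: str, key: str, content_type: str, content_length: int,
                  expires: int, sig: str, now=None) -> bool:
    """Storage-side check: reject expired or altered requests."""
    now = now if now is not None else time.time()
    if now > expires:
        return False
    expected = _signature(method, key, content_type, content_length, expires)
    return hmac.compare_digest(expected, sig)

grant = sign_upload("uploads/u1.png", "image/png", 1024)
print(grant["url"])
```

A request that changes the key or content-type, or arrives after `expires`, fails verification, so the blast radius of a leak is one key for one short window.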
Canonical examples
- YouTube / Vimeo video upload
- Profile photo upload
- Dropbox / Google Drive file sync
- GitHub LFS
- Attachment upload in chat apps
Decision levers
Single PUT vs multipart
Cutoff: 100 MB. Below: single PUT is simpler. Above: multipart wins, with parallel parts, part-level retries, and resumability after a network blip.
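Parallel parts with part-level retries might look like the sketch below. The `upload_part` callable is an assumption standing in for a PUT to a per-part signed URL, and `make_flaky_uploader` is a hypothetical test double that simulates one network blip; nothing here calls a real storage API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def upload_part_with_retry(upload_part, part_number: int, data: bytes,
                           attempts: int = 3) -> str:
    """Retry a single part; other parts are unaffected by its failures."""
    last_error = None
    for _ in range(attempts):
        try:
            return upload_part(part_number, data)
        except ConnectionError as exc:
            last_error = exc
    raise RuntimeError(f"part {part_number} failed after {attempts} attempts") from last_error

def upload_all(upload_part, parts: dict[int, bytes], workers: int = 4) -> list[tuple[int, str]]:
    """Upload parts in parallel and return (part_number, etag) sorted by part number."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            n: pool.submit(upload_part_with_retry, upload_part, n, d)
            for n, d in parts.items()
        }
        return sorted((n, f.result()) for n, f in futures.items())

def make_flaky_uploader(fail_once: set[int]):
    """Hypothetical uploader whose first attempt fails for the given part numbers."""
    seen: set[int] = set()
    lock = threading.Lock()

    def upload_part(part_number: int, data: bytes) -> str:
        with lock:
            first_try = part_number not in seen
            seen.add(part_number)
        if first_try and part_number in fail_once:
            raise ConnectionError("simulated network blip")
        return f"etag-{part_number}-{len(data)}"

    return upload_part

print(upload_all(make_flaky_uploader({2}), {1: b"aaa", 2: b"bbbbb", 3: b"c"}))
```

Only part 2 is retried; parts 1 and 3 are sent once. The sorted list of part numbers and ETags is what a completion call would need.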
Processing: sync vs async
Always async. Upload completion emits an event; workers process. Return immediately to the user with "processing" state. Never block an HTTP connection on transcoding.
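A minimal sketch of the "return immediately, process later" shape, assuming an in-process `queue.Queue` in place of SQS or Pub/Sub and a plain dict in place of the database. The function names are hypothetical.

```python
import queue

jobs: "queue.Queue[str]" = queue.Queue()   # stand-in for a durable message queue
status: dict[str, str] = {}                # stand-in for the uploads table

def on_upload_complete(upload_id: str) -> dict:
    """Request path: record state, enqueue, and return without doing the slow work."""
    status[upload_id] = "processing"
    jobs.put(upload_id)
    return {"upload_id": upload_id, "state": "processing"}

def get_status(upload_id: str) -> dict:
    """What the frontend polls (or what an SSE stream would push)."""
    return {"upload_id": upload_id, "state": status.get(upload_id, "unknown")}

def run_worker_once(process) -> None:
    """Worker path: take one job and run the slow step outside any HTTP request."""
    upload_id = jobs.get_nowait()
    try:
        process(upload_id)
        status[upload_id] = "ready"
    except Exception:
        status[upload_id] = "failed"
    finally:
        jobs.task_done()

print(on_upload_complete("u1"))        # returns right away
run_worker_once(lambda upload_id: None)  # slow step would go here
print(get_status("u1"))
```

The request handler never calls `process`, so its latency does not depend on how long transcoding takes.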
Storage class strategy
Hot: S3 Standard / GCS Standard for the first 30 days. Warm: S3 IA / GCS Nearline for 30–90 days. Cold: Glacier / Archive beyond 90 days. Lifecycle rules automate the transitions; for archive-heavy workloads the storage cost can drop by roughly 10×, depending on the tier chosen and how often data is retrieved.
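The tiering above, plus the abort-after-7-days rule from the cheat sheet, could be expressed as an S3 lifecycle configuration along these lines. The dict follows the shape of S3's lifecycle document; the rule IDs are hypothetical, and `storage_class_for_age` is only a local helper for reasoning about the rules, not part of any SDK.

```python
import json

LIFECYCLE = {
    "Rules": [
        {
            "ID": "tier-by-age",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix: applies to the whole bucket
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        },
        {
            "ID": "abort-stale-multipart",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        },
    ]
}

def storage_class_for_age(days: int) -> str:
    """Return the class an object of this age would sit in under the first rule."""
    current = "STANDARD"
    for transition in LIFECYCLE["Rules"][0]["Transitions"]:
        if days >= transition["Days"]:
            current = transition["StorageClass"]
    return current

print(json.dumps(LIFECYCLE, indent=2))
print(storage_class_for_age(45))
```

With boto3, a dict of this shape is what would be passed as `LifecycleConfiguration` to `put_bucket_lifecycle_configuration`; treat the exact thresholds as a starting point to tune against access patterns.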
Content validation placement
Validate content-type and size in the signed URL (server-enforced). Post-upload: MIME sniff, magic bytes, AV scan before the file is publicly accessible. Two-bucket pattern: untrusted → scanned → trusted.
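The post-upload magic-byte check can be sketched as follows. The signature table covers only a few common formats and is an illustrative subset; `accept_upload` is a hypothetical helper, and a real pipeline would pair this with an AV scan rather than rely on it alone.

```python
from typing import Optional

# Leading bytes ("magic numbers") for a handful of well-known formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
]

def sniff_mime(head: bytes) -> Optional[str]:
    """Guess the real type from the first bytes of the object, or None if unknown."""
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return None

def accept_upload(declared_type: str, head: bytes, allowed: set[str]) -> bool:
    """Accept only if the sniffed type is known, allowed, and matches the claim."""
    actual = sniff_mime(head)
    return actual is not None and actual in allowed and actual == declared_type

allowed = {"image/png", "image/jpeg"}
print(accept_upload("image/png", b"\x89PNG\r\n\x1a\n" + b"\x00" * 8, allowed))
print(accept_upload("image/png", b"<html><script>alert(1)</script>", allowed))
```

The second call is the case the two-bucket pattern guards against: the header claims PNG, the bytes are HTML, and the object never leaves the untrusted bucket.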
Failure modes
App fleet becomes the bandwidth bottleneck; instances OOM on big files; upload latency eats the request timeout. Always signed-URL direct.
Attacker uploads 100 GB to your bucket. Enforce a size limit in the signed request (on S3, a presigned POST policy supports a content-length-range condition); enforce bucket-level quotas; alert on unusual growth.
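One way to cap upload size on S3 is a presigned POST policy with a `content-length-range` condition. The sketch below only builds and encodes the policy document; it omits the credential and signature fields a real POST form needs, and the bucket and key names are hypothetical. `size_allowed` is a local helper that mirrors the check the storage service would apply.

```python
import base64
import json
from datetime import datetime, timedelta, timezone

def build_post_policy(bucket: str, key: str, content_type: str, max_bytes: int,
                      ttl_seconds: int = 900, now=None) -> dict:
    """Build a POST policy that pins bucket, key, type, and a size range."""
    now = now or datetime.now(timezone.utc)
    expiration = (now + timedelta(seconds=ttl_seconds)).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "expiration": expiration,
        "conditions": [
            {"bucket": bucket},
            ["eq", "$key", key],
            ["eq", "$Content-Type", content_type],
            ["content-length-range", 1, max_bytes],  # reject empty and oversized bodies
        ],
    }

def encode_policy(policy: dict) -> str:
    """POST policies travel as base64-encoded JSON in the upload form."""
    return base64.b64encode(json.dumps(policy).encode()).decode()

def size_allowed(policy: dict, size: int) -> bool:
    """Local mirror of the size check, useful for rejecting early on the client."""
    for cond in policy["conditions"]:
        if isinstance(cond, list) and cond[0] == "content-length-range":
            return cond[1] <= size <= cond[2]
    return True

policy = build_post_policy("scan-bucket", "incoming/u1.png", "image/png", 10 * 1024 * 1024)
print(json.dumps(policy, indent=2))
print(size_allowed(policy, 100 * 1024 ** 3))
```

With a policy like this, a 100 GB body is refused by the storage service itself, before any of your own code runs.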
Abandoned uploads leave parts billed forever. Lifecycle rule: AbortIncompleteMultipartUpload after 7 days.
Client declares content-type: image/png, but the file is actually HTML. If it is later rendered as HTML (for example through browser content sniffing), that is XSS via uploaded content. Sniff the real MIME type server-side; serve uploads from a different origin than the app.
Bucket default ACL public; anyone with a URL reads. Default private; serve via signed download URLs or a CDN with signed cookies.
Drills
Why signed URLs, not streaming through the app?
Three reasons. (1) Bandwidth — your app fleet becomes the bottleneck and cost centre. (2) Memory — large uploads can OOM or require streaming disk IO. (3) Timeout — HTTP connections break on slow networks; signed URL + multipart gives the client resumability. Your app issues a tiny signed URL in milliseconds; S3 handles the ingest.
An attacker has a leaked signed URL. Damage?
If the signed URL is constrained to PUT + exact key + content-type + content-length + 15-min expiry: they can overwrite the intended file with content of matching type and size within 15 minutes. Damage scope = one key. Without those constraints: they can upload anything, anywhere in the bucket, for the signed lifetime — potentially pollute the entire bucket with malicious content.