Design a REST API for an image‑inference service that accepts large images and returns class probabilities plus Grad‑CAM heatmaps. Specify endpoint paths/verbs, request/response schemas, idempotency, batching and pagination, async processing with job IDs and webhooks, rate limiting, auth (OAuth2/JWT), versioning, retries/timeouts/circuit breaking, and an error taxonomy. Discuss input validation, content‑type checks, secure storage, and ensuring backward compatibility during model upgrades and rollbacks.
Quick Answer: This question evaluates a candidate's ability to design secure, scalable REST APIs for machine learning inference, emphasizing API surface design, synchronous and asynchronous workflows, data validation, security controls, and operational concerns like idempotency and batching.
Solution
# Overview
This design presents a versioned REST API for image classification with Grad-CAM explainability. It supports:
- Synchronous single-image inference for fast/typical requests.
- Asynchronous jobs for large images/batches with webhooks and polling.
- Strong idempotency, rate limits, OAuth2/JWT auth, and a stable, additive versioning strategy.
Guiding principles:
- Additive, backward-compatible changes only in a given API version.
- Consistent, typed, machine-readable errors.
- Secure-by-default: TLS, signed URLs, encrypted storage, minimal retention.
# 1) Endpoint Surface (v1)
Base URL: /v1
- Health
  - GET /v1/health → 200 OK when service operational.
- Models
  - GET /v1/models → List available models and their versions/labels.
- Synchronous Inference
  - POST /v1/infer → Perform single-image synchronous inference. 200 on success; 202 if auto-upgraded to async due to size.
- Asynchronous Jobs
  - POST /v1/jobs → Submit an inference job (single or batch). Returns job_id (202 Accepted).
  - GET /v1/jobs → List jobs (cursor-based pagination, filters by status, model, created_at).
  - GET /v1/jobs/{job_id} → Get job status and result (when complete).
  - GET /v1/jobs/{job_id}/results → Paginated results for batch jobs.
  - DELETE /v1/jobs/{job_id} → Cancel pending/running job.
- Webhooks
  - POST /v1/webhooks → Register a webhook endpoint (optional; clients may also pass a callback URL per request).
  - GET /v1/webhooks → List registered webhooks.
  - DELETE /v1/webhooks/{webhook_id} → Delete webhook.
- Limits
  - GET /v1/limits → Return per-tenant quotas and limits (rate, size, batch size).
# 2) Authentication & Authorization
- OAuth2 client credentials flow or JWT bearer tokens.
- Scopes (examples):
- inference.read, inference.write
- jobs.read, jobs.write
- webhooks.read, webhooks.write
- Example header: Authorization: Bearer <jwt>
- Tenancy derived from token; access restricted to tenant’s resources.
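The scope and tenancy rules above can be sketched as a small authorization check. This is a minimal illustration, not the service's actual implementation: it assumes the JWT has already been signature-verified upstream, and the claim names (`tenant_id`, space-delimited `scope`), the route table, and the outcome strings are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical mapping of (verb, route) to the scope it requires.
REQUIRED_SCOPE = {
    ("POST", "/v1/infer"): "inference.write",
    ("POST", "/v1/jobs"): "jobs.write",
    ("GET", "/v1/jobs"): "jobs.read",
    ("POST", "/v1/webhooks"): "webhooks.write",
    ("GET", "/v1/webhooks"): "webhooks.read",
}


@dataclass(frozen=True)
class Principal:
    tenant_id: str
    scopes: frozenset


def principal_from_claims(claims: dict) -> Principal:
    """Build a principal from JWT claims that were verified elsewhere."""
    # Assumed claim layout: "tenant_id" plus a space-delimited "scope" string.
    return Principal(claims["tenant_id"], frozenset(claims.get("scope", "").split()))


def authorize(principal, method, route, resource_tenant=None) -> str:
    """Return "ok", "unauthenticated", "scope_insufficient" or "wrong_tenant"."""
    if principal is None:
        return "unauthenticated"
    needed = REQUIRED_SCOPE.get((method, route))
    # Routes missing from the table are denied by default.
    if needed is None or needed not in principal.scopes:
        return "scope_insufficient"
    # Tenancy comes from the token, never from the request body.
    if resource_tenant is not None and resource_tenant != principal.tenant_id:
        return "wrong_tenant"
    return "ok"
```

Returning a named outcome rather than a status code keeps the mapping to HTTP statuses in one place, alongside the error taxonomy.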
# 3) Idempotency
- All POST endpoints accept an Idempotency-Key header. The header is a widely used convention described in an IETF Internet-Draft; it is not defined by RFC 7231.
  - Same key + same payload within 24h returns the original response (status, body).
  - Response includes Idempotency-Replayed: true|false.
- Include Request-Id in every response for tracing.
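One way to picture the replay rule is a store keyed by tenant and Idempotency-Key that remembers a fingerprint of the payload and the first response. The sketch below is in-memory and single-process, so it stands in for whatever shared store a real deployment would use. The class name, the 422 status for a reused key with a different payload, and the `idempotency.payload_mismatch` code are assumptions, since the design above does not specify that case.

```python
import hashlib
import json
import time


class IdempotencyStore:
    """In-memory stand-in for a shared store such as a database table."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # (tenant, key) -> (fingerprint, expires_at, response)

    @staticmethod
    def fingerprint(payload: dict) -> str:
        # Canonical JSON so key order does not change the fingerprint.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def execute(self, tenant, key, payload, handler):
        """Run handler once per (tenant, key); return (status, body, replayed)."""
        now = self._clock()
        fp = self.fingerprint(payload)
        entry = self._entries.get((tenant, key))
        if entry is not None and entry[1] > now:
            if entry[0] != fp:
                # Assumed policy: a reused key with a different payload is rejected.
                return 422, {"error": {"code": "idempotency.payload_mismatch"}}, False
            status, body = entry[2]
            return status, body, True  # maps to Idempotency-Replayed: true
        status, body = handler(payload)
        self._entries[(tenant, key)] = (fp, now + self._ttl, (status, body))
        return status, body, False
```

A production version would also need to handle two requests with the same key arriving concurrently, which this sketch ignores.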
# 4) Request/Response Schemas (representative)
Common types:
- ImageInput: one of
- image.url (https URL, signed OK)
- image.bytes (base64), Content-MD5 optional
- image.storage_id (pre-uploaded object ref)
- GradCAMOptions:
- enabled (bool), layer (string|"auto"), colormap (e.g., "jet"), overlay (bool), format ("png"|"npy"), resolution ("input"|{width,height})
- ClassificationOptions:
- top_k (1–1000, default 5), prob_threshold (0–1, default 0)
Synchronous request (POST /v1/infer):
- Content-Type: application/json or multipart/form-data (file field=image)
- Body (JSON):
- model: string (e.g., "resnet50")
- model_version: string (e.g., "stable" or pinned version "2025-01-15")
- image: ImageInput
- grad_cam: GradCAMOptions { enabled: true }
- classify: ClassificationOptions
- response: { heatmap: "inline"|"url" } (default url)
Synchronous response 200:
- request_id: string
- model: string; model_version: string
- timings_ms: { queue: int, inference: int, total: int }
- classes: [ { id: string, label: string, prob: float } ] (sorted desc)
- heatmap: one of
- { type: "png", data_b64: string, width: int, height: int, colormap: string }
- { type: "url", url: string, expires_at: RFC3339 }
Async submission (POST /v1/jobs):
- Body:
- job_type: "inference"
- model, model_version
- inputs: [ { id: string, image: ImageInput, grad_cam: GradCAMOptions, classify: ClassificationOptions } ] (1..N)
- callback_url: optional (per-job webhook)
- metadata: optional opaque JSON
- ttl_hours: optional (retention policy)
Async submit response 202:
- job_id: string; status: "queued"
- counts: { submitted: N }
- estimated_wait_ms: int
Job status (GET /v1/jobs/{job_id}) 200:
- job_id, status: queued|running|succeeded|failed|canceled|expired
- submitted_at, started_at, completed_at
- model, model_version
- error: nullable ErrorObject
- result_summary: { items: int, succeeded: int, failed: int }
Batch results (GET /v1/jobs/{job_id}/results?cursor=...) 200:
- items: [ { id: string, status, error?, result?: { classes: [...], heatmap: {url|inline}, timings_ms } } ]
- page: { next_cursor?: string, size: int }
ErrorObject (for all endpoints):
- error: { type: string, code: string, message: string, status: int, details?: object, request_id: string }
# 5) Batching & Pagination
- Batching: POST /v1/jobs accepts up to max_batch_size (e.g., 256). Each input has a client-supplied id for correlation.
- Pagination: cursor-based for listing jobs and retrieving batch results.
- Request: ?cursor=opaque&limit=50
- Response: page.next_cursor
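The cursor contract can be illustrated with a small helper that returns the `items` and `page` shape described above. As a simplification, this sketch encodes a plain offset in the cursor; a real service would more likely encode the last-seen sort key and sign the cursor so clients cannot forge it. The function names and the `max_limit` cap are hypothetical.

```python
import base64
import json


def encode_cursor(offset: int) -> str:
    # Opaque to clients: URL-safe base64 around a tiny JSON document.
    raw = json.dumps({"o": offset}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor) -> int:
    if cursor is None:
        return 0
    raw = base64.urlsafe_b64decode(cursor.encode("ascii"))
    return int(json.loads(raw)["o"])


def paginate(items, cursor=None, limit=50, max_limit=200):
    """Return one page shaped like {items: [...], page: {size, next_cursor?}}."""
    limit = max(1, min(limit, max_limit))  # clamp client-supplied limit
    start = decode_cursor(cursor)
    chunk = items[start:start + limit]
    page = {"size": len(chunk)}
    if start + limit < len(items):
        page["next_cursor"] = encode_cursor(start + limit)
    return {"items": chunk, "page": page}
```

Omitting `next_cursor` on the last page gives clients an unambiguous stop condition.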
# 6) Large Image Handling
- Accept multipart/form-data for uploads; max Content-Length enforced (e.g., 100 MB).
- Alternative: pre-signed upload:
- Client uploads to object storage → receives storage_id → uses in API request.
- Images may be downscaled server-side if resize options provided (e.g., resize: { longest_side: 1536 }).
- Async auto-routing: if size/compute estimate exceeds sync thresholds, server returns 202 with job_id.
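The auto-routing decision could look roughly like the following. The thresholds (`sync_max_bytes`, `sync_max_pixels`) and function names are illustrative assumptions, not values taken from the design; the point is only that the estimate is computed after any requested downscale.

```python
def resized_dims(width, height, longest_side):
    """Scale so the longest side is at most longest_side, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= longest_side:
        return width, height
    scale = longest_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def choose_route(size_bytes, width, height, *, longest_side=None,
                 sync_max_bytes=10 * 1024 * 1024,
                 sync_max_pixels=4096 * 4096):
    """Return "sync" or "async" from hypothetical size and pixel thresholds."""
    if longest_side is not None:
        # Estimate compute on the post-resize dimensions.
        width, height = resized_dims(width, height, longest_side)
    if size_bytes > sync_max_bytes or width * height > sync_max_pixels:
        return "async"  # caller responds 202 with a job_id
    return "sync"
```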
# 7) Webhooks (Async Callbacks)
- Delivery: POST to callback_url with body: { job_id, status, items_succeeded, items_failed, results_url }
- Security: HMAC-SHA256 signature in header (X-Signature) using shared secret; timestamp and replay window enforced.
- Retries: exponential backoff with jitter, up to N attempts; idempotency token in X-Event-Id to dedupe on receiver.
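A receiver-side check for the signature and replay window might look like this. The exact string that gets signed is not specified above, so the `timestamp.body` layout here is an assumption, as are the function names and the 300-second tolerance; the `sha256=` prefix follows the header format used in the webhook example.

```python
import hashlib
import hmac
import time


def sign_webhook(secret: bytes, timestamp: int, body: bytes) -> str:
    # Assumed signing input: "<timestamp>.<raw body>".
    message = str(timestamp).encode("ascii") + b"." + body
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return "sha256=" + digest


def verify_webhook(secret, timestamp_header, signature_header, body,
                   *, tolerance_s=300, now=None) -> bool:
    """Check X-Timestamp freshness, then X-Signature in constant time."""
    now = time.time() if now is None else now
    if not isinstance(signature_header, str):
        return False
    try:
        timestamp = int(timestamp_header)
    except (TypeError, ValueError):
        return False
    if abs(now - timestamp) > tolerance_s:
        return False  # outside the replay window
    expected = sign_webhook(secret, timestamp, body)
    return hmac.compare_digest(expected.encode("utf-8"),
                               signature_header.encode("utf-8"))
```

Receivers should verify against the raw request bytes, before any JSON parsing, and separately dedupe on X-Event-Id.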
# 8) Rate Limiting & Quotas
- Per-tenant token bucket limits (requests/s), daily quotas (images/day), and concurrency caps.
- On limit breach → 429 Too Many Requests with headers:
- X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After
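For reference, a per-tenant token bucket can be sketched in a few lines. This is a single-process illustration with an injectable clock; the class and method names are hypothetical, and a real deployment would keep bucket state in a shared store so all API nodes see the same counts.

```python
import math
import time


class TokenBucket:
    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate = float(rate_per_s)      # tokens added per second
        self.capacity = float(capacity)    # maximum burst size
        self._clock = clock
        self._tokens = float(capacity)     # start full
        self._last = clock()

    def _refill(self):
        now = self._clock()
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self, cost=1.0):
        """Return (allowed, retry_after_seconds)."""
        self._refill()
        if self._tokens >= cost:
            self._tokens -= cost
            return True, 0
        # Whole seconds until enough tokens accumulate, for Retry-After.
        retry_after = math.ceil((cost - self._tokens) / self.rate)
        return False, retry_after
```

When `try_acquire` returns `False`, the second value can be sent as the Retry-After header on the 429 response.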
# 9) Retries, Timeouts, Circuit Breaking
Client guidance:
- Sync calls: timeout ~15–30s; prefer async for large inputs.
- Safe retries on network errors/timeouts and 5xx when Idempotency-Key is set.
- Exponential backoff with jitter (e.g., base 500ms, factor 2.0, cap 30s).
- Respect Retry-After on 429/503.
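The client guidance above can be condensed into a small retry policy. This is a sketch under stated assumptions: it uses full jitter (a uniform draw between zero and the capped exponential), treats `status=None` as a network error or timeout, and caps attempts at a hypothetical `max_attempts=5`. The base, factor, and cap defaults mirror the example values above.

```python
import random


def backoff_delay(attempt, *, base=0.5, factor=2.0, cap=30.0, rng=random.random):
    """Full-jitter delay in seconds for a zero-based attempt number."""
    ceiling = min(cap, base * factor ** attempt)
    return rng() * ceiling


def next_delay(attempt, status, *, retry_after=None, has_idempotency_key=False,
               max_attempts=5, rng=random.random):
    """Return seconds to wait before retrying, or None to give up.

    status is the HTTP status, or None for a network error or timeout.
    """
    if attempt + 1 >= max_attempts:
        return None
    server_or_network = status is None or 500 <= status <= 599
    # 429 means the request was not processed, so it is always safe to retry;
    # 5xx and timeouts are only retried when an Idempotency-Key was sent.
    if status != 429 and not (server_or_network and has_idempotency_key):
        return None
    if status in (429, 503) and retry_after is not None:
        return float(retry_after)  # the server's hint wins over local backoff
    return backoff_delay(attempt, rng=rng)
```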
Server-side:
- Shed load with 429/503; include Retry-After.
- Circuit breaking on GPU/backends; health probing, bulkheads per model.
- Queue with fair scheduling; cancel on client abort for sync where possible.
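A per-model circuit breaker around the GPU backend might be sketched as below. The thresholds and names are hypothetical, and this simplified version lets every request through while half-open, whereas production breakers usually admit only a limited number of probe requests.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after_s:
            return "half_open"  # cool-down elapsed: allow probe traffic
        return "open"

    def allow(self):
        """False means fail fast (e.g. 503 with Retry-After) without calling the backend."""
        return self.state != "open"

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        if self.state == "half_open":
            self._opened_at = self._clock()  # probe failed: reopen
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

Keeping one breaker instance per model is one way to get the bulkhead behaviour mentioned above: a failing model trips its own breaker without blocking others.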
# 10) Versioning Strategy
- Path versioning: /v1, /v2 for breaking changes.
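- Only additive changes within a major version (new optional fields, new enum values). Clients are expected to ignore unknown fields and tolerate unknown enum values, otherwise even additive changes can break them.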
- Model versioning is separate from API versioning:
- model_version: "stable" by default; clients may pin to a specific version.
- Deprecation policy: announce new API version, maintain old for a sunset window; return Deprecation and Sunset headers when applicable.
# 11) Error Taxonomy
HTTP status → error.type → error.code (examples):
- 400 Bad Request → validation_error
  - image.too_large, image.invalid_format, param.out_of_range, json.malformed
- 401 Unauthorized → auth_error
  - auth.missing_token, auth.invalid_token
- 403 Forbidden → authorization_error
  - access.denied, auth.scope_insufficient (token is valid but lacks the required scope)
- 404 Not Found → not_found
  - job.not_found, model.not_found
- 409 Conflict → conflict
  - job.already_completed, resource.version_conflict
- 413 Payload Too Large → limit_exceeded
  - upload.too_large, batch.too_large
- 415 Unsupported Media Type → media_type_unsupported
- 422 Unprocessable Entity → unprocessable
  - image.decoding_failed, image.animated_not_supported
- 429 Too Many Requests → rate_limited
- 500 Internal Server Error → server_error
- 502 Bad Gateway → upstream_error
- 503 Service Unavailable → service_unavailable
- 504 Gateway Timeout → gateway_timeout
Error response body (all cases):
- error: { type, code, message, status, details?, request_id }
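To keep the status, type, and code consistent, one option is a single catalog that every handler goes through. The sketch below covers only a subset of the codes listed above; the `ERROR_CATALOG` and `make_error` names and the fallback to 500 for unregistered codes are assumptions.

```python
# Subset of the taxonomy: error.code -> (HTTP status, error.type).
ERROR_CATALOG = {
    "json.malformed": (400, "validation_error"),
    "param.out_of_range": (400, "validation_error"),
    "auth.invalid_token": (401, "auth_error"),
    "access.denied": (403, "authorization_error"),
    "job.not_found": (404, "not_found"),
    "upload.too_large": (413, "limit_exceeded"),
    "image.decoding_failed": (422, "unprocessable"),
}


def make_error(code, message, request_id, details=None):
    """Return (http_status, body) in the ErrorObject shape."""
    # Assumed fallback: codes missing from the catalog surface as a 500.
    status, error_type = ERROR_CATALOG.get(code, (500, "server_error"))
    error = {
        "type": error_type,
        "code": code,
        "message": message,
        "status": status,
        "request_id": request_id,
    }
    if details is not None:
        error["details"] = details  # optional, so omitted when absent
    return status, {"error": error}
```

For example, `make_error("upload.too_large", "Image exceeds 100MB limit", "req_abc", {"limit_mb": 100, "actual_mb": 180})` yields a 413 and a body matching the error example in the Concrete Examples section.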
# 12) Input Validation & Content-Type Checks
- Size limits: e.g., <= 100 MB per image (configurable per tenant).
- Dimensions: max width/height (e.g., 16k px); reject extremely skewed aspect ratios if needed.
- Formats: image/jpeg, image/png, image/tiff, image/bmp. Reject uploads whose declared Content-Type or file extension disagrees with the type sniffed from the file's magic bytes.
- Disallow animated formats (GIF/WebP) unless explicitly supported.
- Reject malformed base64; when Content-MD5 is supplied, verify it against the decoded bytes.
- Security scanning: AV scan, decompression-bomb detection, optional EXIF stripping; reject SVG and other scriptable content.
- Parameter validation: top_k in [1,1000]; prob_threshold in [0,1]; layer must exist or "auto".
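A few of these checks are easy to show concretely. The sketch below sniffs the leading magic bytes for the four accepted formats, compares the result with the declared Content-Type, and range-checks the classification options. It returns error codes from the taxonomy rather than raising; the function names are hypothetical, and dimension checks, decoding, and scanning are out of scope here.

```python
# Leading magic bytes for the accepted formats.
MAGIC = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"II*\x00", "image/tiff"),   # little-endian TIFF
    (b"MM\x00*", "image/tiff"),   # big-endian TIFF
    (b"BM", "image/bmp"),
]


def sniff_mime(data: bytes):
    for prefix, mime in MAGIC:
        if data.startswith(prefix):
            return mime
    return None


def validate_upload(declared_mime, data, *, max_bytes=100 * 1024 * 1024):
    """Return None when the upload passes, else an error code."""
    if len(data) > max_bytes:
        return "upload.too_large"
    declared = declared_mime.split(";")[0].strip().lower()
    sniffed = sniff_mime(data)
    if sniffed is None or sniffed != declared:
        return "image.invalid_format"  # unknown type or MIME spoofing
    return None


def validate_classify(top_k=5, prob_threshold=0.0):
    """Return None when options are in range, else "param.out_of_range"."""
    if isinstance(top_k, bool) or not isinstance(top_k, int) or not 1 <= top_k <= 1000:
        return "param.out_of_range"
    if isinstance(prob_threshold, bool) or not isinstance(prob_threshold, (int, float)):
        return "param.out_of_range"
    if not 0 <= prob_threshold <= 1:
        return "param.out_of_range"
    return None
```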
# 13) Secure Storage & Data Privacy
- TLS 1.2+ in transit; server-side encryption at rest (SSE-KMS) for objects and results.
- Signed URLs for temporary access; short expiry (e.g., 15 min).
- Data minimization: default retention TTL (e.g., 24–72h); configurable per job.
- Access control enforced per tenant; audit logs with request_id.
- Secrets hygiene: don’t log image URLs/bytes; hash or redact PII; encrypt webhook secrets; rotate keys.
# 14) Grad-CAM Options & Outputs
- Default layer: auto-select final convolutional layer; allow explicit layer override.
- Output options:
- Inline PNG base64 for small responses
- URL to signed object for large heatmaps or NPY arrays
- Metadata: heatmap width/height, normalization (0–1), colormap, overlay flag.
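The 0–1 normalization mentioned in the metadata can be illustrated without any imaging library. This sketch assumes the raw Grad-CAM map arrives as a non-empty 2-D list of floats; it applies a ReLU and then min-max scaling, which is a common convention but not necessarily the one a given model server uses. Function names are hypothetical.

```python
def normalize_heatmap(cam):
    """ReLU, then min-max scale a non-empty 2-D list of floats into [0, 1]."""
    rectified = [[max(0.0, float(v)) for v in row] for row in cam]
    flat = [v for row in rectified for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Flat map: nothing to highlight, avoid dividing by zero.
        return [[0.0 for _ in row] for row in rectified]
    span = hi - lo
    return [[(v - lo) / span for v in row] for row in rectified]


def heatmap_metadata(cam, colormap="jet", overlay=False):
    """Metadata fields returned alongside the heatmap payload."""
    return {
        "width": len(cam[0]),
        "height": len(cam),
        "normalization": "0-1",
        "colormap": colormap,
        "overlay": overlay,
    }
```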
# 15) Backward Compatibility During Model Upgrades/Rollbacks
- Expose model_version, labels_version, and calibration_version in responses.
- Allow clients to pin model_version per request or via tenant setting.
- Upgrade strategy:
- Blue/green or canary by tenant/percent; monitor distribution drift, latency.
- Keep previous model hot for instant rollback.
- Maintain label set stability; if labels change, version labels and expose mapping. Never reorder without version bump.
- Keep response shape stable; only add optional fields.
- Rollbacks:
- Continue honoring pinned versions.
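- Keep golden-input compatibility tests so every upgrade or rollback is checked for identical response shapes and outputs within tolerance.
- Resolve the model version once at job submission and store it on the job, so a replayed Idempotency-Key returns results from the same model version for the lifetime of the job.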
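The pinning and rollback behaviour can be sketched with a tiny registry in which "stable" is an alias with a promotion history and pinned versions always resolve to themselves. The class and method names, and the second version string in the usage example, are hypothetical.

```python
class ModelRegistry:
    def __init__(self):
        self._versions = {}  # model -> set of registered versions
        self._history = {}   # (model, alias) -> promoted versions, newest last

    def register(self, model, version):
        self._versions.setdefault(model, set()).add(version)

    def promote(self, model, version, alias="stable"):
        if version not in self._versions.get(model, set()):
            raise KeyError(f"unknown version {version!r} for model {model!r}")
        self._history.setdefault((model, alias), []).append(version)

    def rollback(self, model, alias="stable"):
        """Point the alias back at the previously promoted version."""
        history = self._history.get((model, alias), [])
        if len(history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        history.pop()
        return history[-1]

    def resolve(self, model, requested="stable"):
        """Pinned versions resolve to themselves; aliases to their current target."""
        if requested in self._versions.get(model, set()):
            return requested
        history = self._history.get((model, requested))
        if not history:
            raise KeyError(f"unknown version or alias {requested!r} for {model!r}")
        return history[-1]
```

Usage, with illustrative version strings:

```python
registry = ModelRegistry()
registry.register("resnet50", "2025-01-15")
registry.register("resnet50", "2025-03-01")
registry.promote("resnet50", "2025-01-15")
registry.promote("resnet50", "2025-03-01")
registry.rollback("resnet50")            # "stable" points at 2025-01-15 again
registry.resolve("resnet50", "2025-03-01")  # pinned clients are unaffected
```

Calling `resolve` once at submission time and storing the result on the job is what keeps retries and replays on a single model version.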
# 16) Concrete Examples (abridged)
Synchronous (POST /v1/infer):
- Request JSON:
- model: "resnet50"
- model_version: "stable"
- image: { url: "https://signed.example.com/cat.jpg" }
- classify: { top_k: 5 }
- grad_cam: { enabled: true, overlay: true, format: "png" }
- Response 200:
- classes: [ { id: "n02124075", label: "Egyptian cat", prob: 0.87 }, ... ]
- heatmap: { type: "url", url: "https://signed...", expires_at: "..." }
Async batch (POST /v1/jobs) with Idempotency-Key:
- Body:
- job_type: "inference"
- model: "resnet50"
- inputs: [ { id: "img1", image: { storage_id: "obj_abc" } }, { id: "img2", image: { url: "https://..." } } ]
- callback_url: "https://client.example.com/hooks/infer"
- Response 202:
- job_id: "job_123", status: "queued"
Webhook delivery to callback_url:
- Headers: X-Event-Id, X-Timestamp, X-Signature: sha256=...
- Body: { job_id: "job_123", status: "succeeded", items_succeeded: 2, items_failed: 0, results_url: "https://..." }
Error example (413):
- error: { type: "limit_exceeded", code: "upload.too_large", message: "Image exceeds 100MB limit", status: 413, details: { limit_mb: 100, actual_mb: 180 }, request_id: "req_abc" }
# 17) Guardrails & Pitfalls
- Encourage async for large images; enforce server-side sync timeout (e.g., 10–15s).
- Return an explicit Retry-After when queues are saturated so clients spread out their retries instead of causing a thundering herd.
- Ensure webhook verification and idempotency to avoid duplicate processing.
- Protect against MIME spoofing and decompression bombs.
- Avoid breaking changes: only add fields/enums in v1; use /v2 for breaking schema changes.
- Keep label sets and version identifiers stable; announce changes ahead of time and let clients pin versions.
This design balances usability (simple sync calls) with robustness (async jobs, idempotency, secure storage, and strong versioning) suitable for production workloads handling large images and explainability artifacts.