PracHub

Design and secure a REST inference API

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's ability to design secure, scalable REST APIs for machine learning inference, emphasizing API surface design, synchronous and asynchronous workflows, data validation, security controls, and operational concerns like idempotency and batching.


Company: NVIDIA

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: hard

Interview Round: HR Screen

Design a REST API for an image‑inference service that accepts large images and returns class probabilities plus Grad‑CAM heatmaps. Specify endpoint paths/verbs, request/response schemas, idempotency, batching and pagination, async processing with job IDs and webhooks, rate limiting, auth (OAuth2/JWT), versioning, retries/timeouts/circuit breaking, and an error taxonomy. Discuss input validation, content‑type checks, secure storage, and ensuring backward compatibility during model upgrades and rollbacks.


Solution

# Overview

This design presents a versioned REST API for image classification with Grad-CAM explainability. It supports:

- Synchronous single-image inference for fast/typical requests.
- Asynchronous jobs for large images/batches with webhooks and polling.
- Strong idempotency, rate limits, OAuth2/JWT auth, and a stable, additive versioning strategy.

Guiding principles:

- Additive, backward-compatible changes only within a given API version.
- Consistent, typed, machine-readable errors.
- Secure by default: TLS, signed URLs, encrypted storage, minimal retention.

# 1) Endpoint Surface (v1)

Base URL: /v1

- Health
  - GET /v1/health → 200 OK when the service is operational.
- Models
  - GET /v1/models → List available models and their versions/labels.
- Synchronous inference
  - POST /v1/infer → Perform single-image synchronous inference. 200 on success; 202 if auto-upgraded to async due to size.
- Asynchronous jobs
  - POST /v1/jobs → Submit an inference job (single or batch). Returns job_id (202 Accepted).
  - GET /v1/jobs → List jobs (cursor-based pagination; filters by status, model, created_at).
  - GET /v1/jobs/{job_id} → Get job status and result (when complete).
  - GET /v1/jobs/{job_id}/results → Paginated results for batch jobs.
  - DELETE /v1/jobs/{job_id} → Cancel a pending/running job.
- Webhooks
  - POST /v1/webhooks → Register a webhook endpoint (optional; clients may also pass a callback URL per request).
  - GET /v1/webhooks → List registered webhooks.
  - DELETE /v1/webhooks/{webhook_id} → Delete a webhook.
- Limits
  - GET /v1/limits → Return per-tenant quotas and limits (rate, size, batch size).

# 2) Authentication & Authorization

- OAuth2 client-credentials flow or JWT bearer tokens.
- Scopes (examples):
  - inference.read, inference.write
  - jobs.read, jobs.write
  - webhooks.read, webhooks.write
- Example header: Authorization: Bearer <jwt>
- Tenancy is derived from the token; access is restricted to the tenant's resources.
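To make the auth layer concrete, here is a minimal sketch of HS256 bearer-token verification with a scope check, using only the Python standard library. The function name `verify_jwt_scope`, the space-separated `scope` claim, and the shared secret are illustrative assumptions, not part of the API above; a real deployment would use a vetted JWT library (e.g., PyJWT) and typically asymmetric keys (RS256) with a published key set.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_decode(seg: str) -> bytes:
    """Decode base64url with the padding JWTs strip off."""
    pad = "=" * (-len(seg) % 4)
    return base64.urlsafe_b64decode(seg + pad)


def verify_jwt_scope(token: str, secret: bytes, required_scope: str) -> bool:
    """Verify an HS256 JWT signature and expiry, then check a scope.

    Hypothetical helper: assumes a space-separated `scope` claim and a
    numeric `exp` claim, as in common OAuth2 deployments.
    """
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
    except ValueError:
        return False  # malformed token
    signing_input = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    # Constant-time comparison to avoid timing side channels.
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        return False
    claims = json.loads(_b64url_decode(payload_b64))
    if claims.get("exp", 0) < time.time():
        return False  # expired
    return required_scope in claims.get("scope", "").split()
```

A gateway would call this once per request and map a failure to the 401 `auth_error` codes defined in the error taxonomy below.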
# 3) Idempotency

- All non-GET creation endpoints accept an Idempotency-Key header (per the IETF Idempotency-Key header draft; GET is already idempotent under standard HTTP semantics):
  - The same key + same payload within 24h returns the original response (status, body).
  - The response includes Idempotency-Replayed: true|false.
- Include Request-Id in every response for tracing.

# 4) Request/Response Schemas (representative)

Common types:

- ImageInput: one of
  - image.url (https URL, signed OK)
  - image.bytes (base64), Content-MD5 optional
  - image.storage_id (pre-uploaded object ref)
- GradCAMOptions:
  - enabled (bool), layer (string|"auto"), colormap (e.g., "jet"), overlay (bool), format ("png"|"npy"), resolution ("input"|{width,height})
- ClassificationOptions:
  - top_k (1–1000, default 5), prob_threshold (0–1, default 0)

Synchronous request (POST /v1/infer):

- Content-Type: application/json or multipart/form-data (file field=image)
- Body (JSON):
  - model: string (e.g., "resnet50")
  - model_version: string (e.g., "stable" or a pinned version such as "2025-01-15")
  - image: ImageInput
  - grad_cam: GradCAMOptions { enabled: true }
  - classify: ClassificationOptions
  - response: { heatmap: "inline"|"url" } (default url)

Synchronous response 200:

- request_id: string
- model: string; model_version: string
- timings_ms: { queue: int, inference: int, total: int }
- classes: [ { id: string, label: string, prob: float } ] (sorted descending)
- heatmap: one of
  - { type: "png", data_b64: string, width: int, height: int, colormap: string }
  - { type: "url", url: string, expires_at: RFC3339 }

Async submission (POST /v1/jobs):

- Body:
  - job_type: "inference"
  - model, model_version
  - inputs: [ { id: string, image: ImageInput, grad_cam: GradCAMOptions, classify: ClassificationOptions } ] (1..N)
  - callback_url: optional (per-job webhook)
  - metadata: optional opaque JSON
  - ttl_hours: optional (retention policy)

Async submit response 202:

- job_id: string; status: "queued"
- counts: { submitted: N }
- estimated_wait_ms: int

Job status (GET /v1/jobs/{job_id}) 200:

- job_id, status: queued|running|succeeded|failed|canceled|expired
- submitted_at, started_at, completed_at
- model, model_version
- error: nullable ErrorObject
- result_summary: { items: int, succeeded: int, failed: int }

Batch results (GET /v1/jobs/{job_id}/results?cursor=...) 200:

- items: [ { id: string, status, error?, result?: { classes: [...], heatmap: {url|inline}, timings_ms } } ]
- page: { next_cursor?: string, size: int }

ErrorObject (for all endpoints):

- error: { type: string, code: string, message: string, status: int, details?: object, request_id: string }

# 5) Batching & Pagination

- Batching: POST /v1/jobs accepts up to max_batch_size inputs (e.g., 256). Each input carries a client-supplied id for correlation.
- Pagination: cursor-based for listing jobs and retrieving batch results.
  - Request: ?cursor=opaque&limit=50
  - Response: page.next_cursor

# 6) Large Image Handling

- Accept multipart/form-data for uploads; enforce a max Content-Length (e.g., 100 MB).
- Alternative: pre-signed upload: the client uploads to object storage, receives a storage_id, and uses it in the API request.
- Images may be downscaled server-side if resize options are provided (e.g., resize: { longest_side: 1536 }).
- Async auto-routing: if the size/compute estimate exceeds sync thresholds, the server returns 202 with a job_id.

# 7) Webhooks (Async Callbacks)

- Delivery: POST to callback_url with body { job_id, status, items_succeeded, items_failed, link_to_results }.
- Security: HMAC-SHA256 signature in a header (X-Signature) using a shared secret; timestamp and replay window enforced.
- Retries: exponential backoff with jitter, up to N attempts; idempotency token in X-Event-Id so receivers can dedupe.

# 8) Rate Limiting & Quotas

- Per-tenant token-bucket limits (requests/s), daily quotas (images/day), and concurrency caps.
- On limit breach → 429 Too Many Requests with headers:
  - X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After

# 9) Retries, Timeouts, Circuit Breaking

Client guidance:

- Sync calls: timeout ~15–30s; prefer async for large inputs.
- Retry safely on network errors/timeouts and 5xx responses only when an Idempotency-Key is set.
- Exponential backoff with jitter (e.g., base 500 ms, factor 2.0, cap 30 s).
- Respect Retry-After on 429/503.

Server-side:

- Shed load with 429/503; include Retry-After.
- Circuit breaking on GPU backends; health probing, bulkheads per model.
- Queue with fair scheduling; cancel on client abort for sync requests where possible.

# 10) Versioning Strategy

- Path versioning: /v1, /v2 for breaking changes.
- Only additive changes within a major version (add fields, new enum values).
- Model versioning is separate from API versioning:
  - model_version: "stable" by default; clients may pin a specific version.
- Deprecation policy: announce the new API version and maintain the old one for a sunset window; return Deprecation and Sunset headers when applicable.
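The client guidance above can be sketched as a small retry helper. Everything here is an illustrative assumption rather than a prescribed client SDK: `backoff_delays`, `call_with_retries`, and the `(status, retry_after, body)` tuple that `send` is assumed to return. It shows full-jitter exponential backoff that honors Retry-After and treats only 429 and 5xx as retryable.

```python
import random
import time


def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   cap: float = 30.0, attempts: int = 5):
    """Yield "full jitter" delays: uniform in [0, min(cap, base * factor**n)]."""
    for n in range(attempts):
        yield random.uniform(0, min(cap, base * factor ** n))


def call_with_retries(send, attempts: int = 5):
    """Call `send()` with retries on 429 and 5xx responses.

    `send` is assumed to return (status, retry_after_seconds_or_None, body).
    Only use this pattern for requests carrying an Idempotency-Key, so a
    retried POST cannot create duplicate jobs.
    """
    delays = backoff_delays(attempts=attempts - 1)
    while True:
        status, retry_after, body = send()
        if status < 500 and status != 429:
            return status, body  # success or non-retryable client error
        delay = next(delays, None)
        if delay is None:
            return status, body  # retries exhausted; surface the last error
        # A server-provided Retry-After wins over the computed backoff.
        time.sleep(retry_after if retry_after is not None else delay)
```

Jitter matters here: without it, many clients that failed together retry together, recreating the thundering-herd spike the rate limiter just shed.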
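The Idempotency-Key semantics in §3 (same key + same payload replays the original response; same key + different payload is a conflict) can be illustrated with an in-memory cache. `IdempotencyCache` and its 24h TTL are assumptions for this sketch; a production service would back it with a shared store such as Redis so replays work across instances.

```python
import hashlib
import json
import time


class IdempotencyCache:
    """In-memory sketch of Idempotency-Key handling (illustrative only)."""

    def __init__(self, ttl_s: float = 24 * 3600):
        self.ttl_s = ttl_s
        self._store = {}  # key -> (payload_hash, response, stored_at)

    @staticmethod
    def _hash(payload: dict) -> str:
        # Canonical JSON so semantically equal payloads hash identically.
        return hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()).hexdigest()

    def handle(self, key: str, payload: dict, create):
        """Return (response, replayed).

        `create(payload)` runs at most once per (key, payload) within the
        TTL. Reusing a key with a different payload is a client error that
        the API layer would map to a 409-style conflict.
        """
        now = time.time()
        entry = self._store.get(key)
        if entry is not None and now - entry[2] < self.ttl_s:
            if entry[0] != self._hash(payload):
                raise ValueError("idempotency key reused with different payload")
            return entry[1], True  # replay the stored response
        response = create(payload)
        self._store[key] = (self._hash(payload), response, now)
        return response, False
```

The `replayed` flag corresponds to the `Idempotency-Replayed: true|false` response header described in §3.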
# 11) Error Taxonomy

HTTP status → error.type → error.code (examples):

- 400 Bad Request → validation_error
  - image.too_large, image.invalid_format, param.out_of_range, json.malformed
- 401 Unauthorized → auth_error
  - auth.missing_token, auth.invalid_token, auth.scope_insufficient
- 403 Forbidden → authorization_error
  - access.denied
- 404 Not Found → not_found
  - job.not_found, model.not_found
- 409 Conflict → conflict
  - job.already_completed, resource.version_conflict
- 413 Payload Too Large → limit_exceeded
  - upload.too_large, batch.too_large
- 415 Unsupported Media Type → media_type_unsupported
- 422 Unprocessable Entity → unprocessable
  - image.decoding_failed, image.animated_not_supported
- 429 Too Many Requests → rate_limited
- 500 Internal Server Error → server_error
- 502/503/504 → upstream_error / service_unavailable / gateway_timeout

Error response body (all cases):

- error: { type, code, message, status, details?, request_id }

# 12) Input Validation & Content-Type Checks

- Size limits: e.g., ≤ 100 MB per image (configurable per tenant).
- Dimensions: max width/height (e.g., 16k px); reject extremely skewed aspect ratios if needed.
- Formats: image/jpeg, image/png, image/tiff, image/bmp. Reject requests whose declared MIME type does not match the sniffed content.
- Disallow animated formats (GIF, animated WebP) unless explicitly supported.
- Validate base64 bytes; enforce Content-MD5 if provided.
- Security scanning: AV scan, decompression-bomb detection, optional EXIF stripping; disallow SVG and other scriptable content.
- Parameter validation: top_k in [1, 1000]; prob_threshold in [0, 1]; layer must exist or be "auto".

# 13) Secure Storage & Data Privacy

- TLS 1.2+ in transit; server-side encryption at rest (SSE-KMS) for objects and results.
- Signed URLs for temporary access with short expiry (e.g., 15 min).
- Data minimization: default retention TTL (e.g., 24–72h); configurable per job.
- Access control enforced per tenant; audit logs keyed by request_id.
- Secrets hygiene: don't log image URLs/bytes; hash or redact PII; encrypt webhook secrets; rotate keys.

# 14) Grad-CAM Options & Outputs

- Default layer: auto-select the final convolutional layer; allow an explicit layer override.
- Output options:
  - Inline PNG base64 for small responses.
  - URL to a signed object for large heatmaps or NPY arrays.
- Metadata: heatmap width/height, normalization (0–1), colormap, overlay flag.

# 15) Backward Compatibility During Model Upgrades/Rollbacks

- Expose model_version, labels_version, and calibration_version in responses.
- Allow clients to pin model_version per request or via a tenant setting.
- Upgrade strategy:
  - Blue/green or canary by tenant/percentage; monitor distribution drift and latency.
  - Keep the previous model hot for instant rollback.
- Maintain label-set stability; if labels change, version the label set and expose a mapping. Never reorder labels without a version bump.
- Keep the response shape stable; only add optional fields.
- Rollbacks:
  - Continue honoring pinned versions.
  - Persist compatibility tests (golden inputs) to ensure identical shapes and tolerances.
  - Ensure an idempotency key routes to the same model version for the lifetime of a job.

# 16) Concrete Examples (abridged)

Synchronous (POST /v1/infer):

- Request JSON:
  - model: "resnet50"
  - model_version: "stable"
  - image: { url: "https://signed.example.com/cat.jpg" }
  - classify: { top_k: 5 }
  - grad_cam: { enabled: true, overlay: true, format: "png" }
- Response 200:
  - classes: [ { id: "n02124075", label: "Egyptian cat", prob: 0.87 }, ... ]
  - heatmap: { type: "url", url: "https://signed...", expires_at: "..." }

Async batch (POST /v1/jobs) with Idempotency-Key:

- Body:
  - job_type: "inference"
  - model: "resnet50"
  - inputs: [ { id: "img1", image: { storage_id: "obj_abc" } }, { id: "img2", image: { url: "https://..." } } ]
  - callback_url: "https://client.example.com/hooks/infer"
- Response 202:
  - job_id: "job_123", status: "queued"

Webhook delivery to callback_url:

- Headers: X-Event-Id, X-Timestamp, X-Signature: sha256=...
- Body: { job_id: "job_123", status: "succeeded", items_succeeded: 2, items_failed: 0, results_url: "https://..." }

Error example (413):

- error: { type: "limit_exceeded", code: "upload.too_large", message: "Image exceeds 100MB limit", status: 413, details: { limit_mb: 100, actual_mb: 180 }, request_id: "req_abc" }

# 17) Guardrails & Pitfalls

- Encourage async for large images; enforce a server-side sync timeout (e.g., 10–15s).
- Provide a clear Retry-After with queuing to reduce thundering herds.
- Verify webhook signatures and enforce idempotency to avoid duplicate processing.
- Protect against MIME spoofing and decompression bombs.
- Avoid breaking changes: only add fields/enums in v1; use /v2 for breaking schema changes.
- Keep label and versioning surfaces stable; document changes early and offer pinning.

This design balances usability (simple sync calls) with robustness (async jobs, idempotency, secure storage, and strong versioning), making it suitable for production workloads that handle large images and explainability artifacts.
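The webhook scheme in §7 and §16 (HMAC-SHA256 over a timestamp plus the body, checked against a replay window) might look like this on both the sending and receiving sides. The helper names `sign_event`/`verify_event` and the 5-minute window are illustrative assumptions; the X-Timestamp/X-Signature header pairing follows the example delivery above.

```python
import hashlib
import hmac
import time

REPLAY_WINDOW_S = 300  # assumed: reject events older than 5 minutes


def sign_event(secret: bytes, timestamp: str, body: bytes) -> str:
    """Sender side: produce the X-Signature value over timestamp + body.

    Binding the timestamp into the MAC means an attacker cannot take a
    captured signature and reuse it later with a fresh timestamp.
    """
    mac = hmac.new(secret, f"{timestamp}.".encode() + body, hashlib.sha256)
    return "sha256=" + mac.hexdigest()


def verify_event(secret: bytes, timestamp: str, body: bytes,
                 signature: str, now=None) -> bool:
    """Receiver side: check freshness first, then the signature.

    `timestamp` is assumed to be a Unix-epoch string from X-Timestamp.
    """
    now = time.time() if now is None else now
    if abs(now - float(timestamp)) > REPLAY_WINDOW_S:
        return False  # stale or far-future: possible replay
    return hmac.compare_digest(sign_event(secret, timestamp, body), signature)
```

Receivers should additionally dedupe on X-Event-Id, since the sender's retry policy can legitimately deliver the same event more than once.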

Oct 13, 2025, 9:49 PM

Design a REST API for Image Inference with Grad-CAM

You are designing a public REST API for an image-inference service that accepts large images and returns both class probabilities and Grad-CAM heatmaps. Assume this is a multi-tenant service with both synchronous and asynchronous workflows and that clients may submit images via URL or upload.

Specify and justify the following:

Functional Scope

  • Accept large images; return top-K class probabilities and Grad-CAM heatmaps.
  • Support single-image sync requests and batch/async processing.

API Surface

  • Endpoint paths and HTTP verbs for:
    1. Synchronous inference
    2. Asynchronous jobs (submit, poll status, cancel)
    3. Batch processing and listing/paginating job results
    4. Webhook registration/usage (or per-request callback)
    5. Ancillary endpoints (e.g., models listing, health)
  • Request/response schemas (include fields, types, and examples)
  • Idempotency strategy for create/submit endpoints
  • Batching semantics and pagination scheme
  • Async processing with job IDs and webhooks

Cross-Cutting Concerns

  • Rate limiting and headers
  • Authentication/authorization (OAuth2/JWT, scopes)
  • API versioning strategy
  • Retries, timeouts, circuit breaking (client and server guidance)
  • Error taxonomy (structured errors, error codes, HTTP statuses)

Data Handling & Safety

  • Input validation rules (size, dimensions, formats)
  • Content-type checks and malware/zip-bomb defenses
  • Secure storage (encryption at rest, TTL, signed URLs)
  • Ensuring backward compatibility during model upgrades/rollbacks (pinning versions, deprecation)

Keep the design clear, minimal, and production-ready.

