This question evaluates competency in ML system design for real-time, low-latency inference APIs, including multitenancy, SLO/SLI definition, feature retrieval, model serving and rollout strategies, observability, cost control, and security/compliance within the ML System Design category.
Design a low-latency ML inference API for real-time predictions. Specify target SLOs (p50/p95 latency, availability), request/response schema, authentication, rate limiting, and multitenancy. Propose an architecture covering load balancing, stateless API tier, feature retrieval, model serving (CPU/GPU), batching, quantization, caching, and autoscaling strategies. Explain model versioning, canary/rollbacks, online A/B, observability (metrics, tracing, drift, data-quality checks), cost controls, and fallback behavior during partial outages. Address security, PII handling, regionalization, and disaster recovery.
System Design: Low-Latency ML Inference API (Real-Time)
Context
You are designing an in-region, synchronous ML inference API that sits on the critical path of product surfaces (e.g., ranking, fraud checks, personalization) that require tight latency and high availability. The service must support multiple tenants, safe model rollouts, and strong observability, while controlling cost.
This is an open-ended design discussion. State your assumptions, propose concrete numerical targets, and walk through the design so that the latency budget provably adds up to the target. The interviewer is looking for your reasoning and trade-off analysis as much as the final architecture.
Address each of the seven Parts below. State explicit numerical targets and trade-offs where applicable, and call out any assumptions that materially shape the design.
Constraints & Assumptions
These are anchoring assumptions to scope the discussion; confirm or adjust them with the interviewer, but design against a concrete operating point rather than leaving everything open:
Workload: synchronous request/response; one or a small list of candidates scored per call. No long-running / streaming generation.
Models: a mix of classical models (logistic regression, GBDT/XGBoost) servable on CPU and deep models (DNN/transformer) servable on GPU.
Traffic: roughly 2k-10k RPS per region at steady state, with occasional 2-3x correlated spikes; bursty and skewed by tenant.
Clients are in-region (geo-routed), so cross-region round-trip time is not part of the hot-path latency budget.
Features are mostly precomputed and read from an online feature store; a few may be derived at request time from the request payload.
Clarifying Questions to Ask
Before designing, scope the whole problem with the interviewer:
Latency contract: what p95/p99 do downstream callers actually require, and is the budget end-to-end (edge → response) or service-internal only?
Traffic shape: steady-state and peak RPS per region, the spike multiplier, and how skewed traffic is across tenants?
Model mix: what fraction of requests hit deep (GPU) models vs. classical (CPU) models? This drives GPU sizing directly.
Feature freshness: how stale can features be? Is the online store eventually consistent with the offline/streaming pipeline that populates it, and what is the acceptable freshness SLA?
Multitenancy: how many tenants, do any require hard isolation, and can tenants pin their own fine-tuned model versions?
Compliance footprint: which jurisdictions / data-sovereignty rules apply, and what PII (if any) is in the request payload?
Part 1 — Target SLOs
Propose p50 / p95 (and optionally p99) end-to-end latency targets and an availability target.
Define your SLIs (how each SLO is measured) and the error budget that follows.
What This Part Should Cover
Defensible, concrete numbers for p50/p95 (and ideally p99) latency and a stated availability target, with the percentiles tied back to what downstream callers need.
Disjoint SLIs where a slow-but-successful response dings only the latency budget and a degraded-but-correct response is treated as a separate signal, not an availability failure.
An error-budget policy derived from the availability target, with multi-window burn-rate alerting and a release-freeze rule when the budget is exhausted.
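To make the error-budget arithmetic concrete, here is a minimal Python sketch. The 99.9% availability target, the 30-day window, and the 14.4x paging threshold over a paired long and short window are assumptions chosen for illustration (14.4x corresponds to spending 2% of a 30-day budget in one hour); the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Slo:
    availability_target: float  # assumed example: 0.999
    window_days: int = 30       # assumed rolling window

    @property
    def error_budget_fraction(self) -> float:
        # Fraction of requests allowed to fail within the window.
        return 1.0 - self.availability_target

    def budget_minutes(self) -> float:
        # Equivalent "full outage" minutes the budget allows per window.
        return self.window_days * 24 * 60 * self.error_budget_fraction


def burn_rate(bad: int, total: int, slo: Slo) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    if total == 0:
        return 0.0
    return (bad / total) / slo.error_budget_fraction


def should_page(long_window: Tuple[int, int],
                short_window: Tuple[int, int],
                slo: Slo,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast: the long window shows the burn
    is significant, the short window shows it is still happening."""
    return (burn_rate(*long_window, slo) >= threshold
            and burn_rate(*short_window, slo) >= threshold)


slo = Slo(availability_target=0.999)
print(round(slo.budget_minutes(), 1))            # about 43.2 minutes per 30 days
print(should_page((20, 1000), (3, 100), slo))    # both windows above threshold
```

Each window is passed as a (bad, total) request count. A release-freeze rule could be layered on top by comparing cumulative bad requests in the window against the total budget.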
Part 2 — API Design
Define the request/response schema, including idempotency, model/version selection, and metadata for traceability.
Specify the authentication and authorization approach.
Specify rate limiting and quotas.
Describe multitenancy: tenant isolation, quotas, and model routing.
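One possible shape for the request/response contract is sketched below in Python. Every field name, the alias table, and the in-process idempotency cache are hypothetical; a real service would likely resolve aliases from a model registry and keep idempotency keys in a shared store with a TTL. The scorer is a placeholder, not a model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical alias table; in practice this would come from the model registry.
ALIASES: Dict[str, str] = {"fraud:prod": "fraud:v42", "fraud:canary": "fraud:v43"}


@dataclass(frozen=True)
class PredictRequest:
    tenant_id: str
    request_id: str               # idempotency key chosen by the caller
    model: str                    # alias ("fraud:prod") or pinned version ("fraud:v42")
    candidates: List[List[float]]  # one feature vector per candidate to score
    deadline_ms: int = 50         # caller's remaining latency budget (assumed default)


@dataclass(frozen=True)
class PredictResponse:
    request_id: str               # echoed for traceability
    model_version: str            # the immutable version that actually served
    scores: List[float]
    degraded: bool = False        # True if a fallback path produced the scores
    trace_id: Optional[str] = None


def resolve_model(model: str) -> str:
    """Map an alias to an immutable version; pinned versions pass through."""
    return ALIASES.get(model, model)


_IDEMPOTENCY_CACHE: Dict[Tuple[str, str], PredictResponse] = {}


def handle(req: PredictRequest) -> PredictResponse:
    key = (req.tenant_id, req.request_id)
    cached = _IDEMPOTENCY_CACHE.get(key)
    if cached is not None:
        return cached  # a retry with the same key gets the same answer
    version = resolve_model(req.model)
    # Placeholder scorer: mean of each feature vector stands in for inference.
    scores = [sum(c) / len(c) if c else 0.0 for c in req.candidates]
    resp = PredictResponse(req.request_id, version, scores,
                           trace_id=f"trace-{req.request_id}")
    _IDEMPOTENCY_CACHE[key] = resp
    return resp
```

Echoing the resolved model_version rather than the alias lets a caller tell which artifact produced a score even while an alias is being moved during a rollout.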
What This Part Should Cover
A clean request/response contract with idempotency, alias-vs-version model selection, and traceability metadata echoed back to the caller.
AuthN/AuthZ for external callers and a separate internal service-to-service identity story.
Rate limiting / quotas with a defensible primitive (e.g., per-tenant token buckets) and explicit overflow behavior (429s, priority classes).
A multitenancy model that decides per tenant where hard isolation is required vs. shared-pool-with-quota, and how tenant_id routes to a model.
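As an illustration of the per-tenant token-bucket primitive, here is a small single-process Python sketch. The rates, burst sizes, and the TenantLimiter wrapper are assumptions; a production limiter would usually need shared state across API replicas, which this sketch does not attempt.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Refills at rate_per_s tokens per second, up to burst tokens."""

    def __init__(self, rate_per_s: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_s = rate_per_s
        self.burst = burst
        self.clock = clock
        self.tokens = burst          # start full so short bursts are allowed
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Lazily add the tokens earned since the last call, capped at burst.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate_per_s)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantLimiter:
    """One bucket per tenant; returns the HTTP status the API tier would send."""

    def __init__(self, rate_per_s: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._make = lambda: TokenBucket(rate_per_s, burst, clock)
        self._buckets: Dict[str, TokenBucket] = {}

    def check(self, tenant_id: str) -> int:
        bucket = self._buckets.setdefault(tenant_id, self._make())
        return 200 if bucket.allow() else 429
```

The clock is injectable so the limiter can be tested deterministically. Priority classes could be modeled by giving each class its own bucket or by charging a different cost per request.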
Part 3 — Architecture
Feature retrieval from the online store: consistency model and TTLs.
Model serving choices (CPU vs GPU), dynamic batching, quantization, and caching.
Autoscaling strategies for the API tier, feature store, and model servers.
What This Part Should Cover
An additive per-stage latency budget that sums to the p95 target, with the two highest-risk stages (feature fetch, inference) given hard deadlines and circuit breakers.
A stateless API tier plus global/edge protections (LB, WAF, DDoS, schema validation).
A feature-retrieval design with a stated consistency model, TTL-bounded staleness, and batched reads (single round trip).
A justified CPU-vs-GPU serving split with an explicit batching/quantization choice tied to the latency budget, not throughput alone.
Per-tier autoscaling keyed on the earliest leading indicator of tail pain (not just CPU), plus a cold-start mitigation.
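The additive latency budget can be written down and checked mechanically. The Python sketch below uses an assumed 50 ms p95 target and assumed per-stage numbers; the stage names are hypothetical. The reasoning is that if each stage enforces its budget as a hard timeout, the total for any completed request is bounded by the sum of the stage budgets.

```python
from typing import Dict

P95_TARGET_MS = 50.0  # assumed end-to-end p95 target

# Assumed per-stage budgets in milliseconds; the numbers are illustrative only.
STAGE_BUDGET_MS: Dict[str, float] = {
    "edge_lb_tls": 3.0,
    "auth_validate": 2.0,
    "feature_fetch": 10.0,   # high-risk stage: hard deadline
    "inference": 25.0,       # high-risk stage: hard deadline
    "postprocess": 3.0,
    "serialize_return": 4.0,
}


def slack_ms(budget: Dict[str, float], target_ms: float) -> float:
    """Positive slack means the stage budgets fit inside the target."""
    return target_ms - sum(budget.values())


def stage_timeout_ms(stage: str, elapsed_ms: float,
                     budget: Dict[str, float] = STAGE_BUDGET_MS,
                     target_ms: float = P95_TARGET_MS) -> float:
    """Timeout to hand the next stage: its own budget, or whatever is left
    of the overall target if earlier stages already ran long."""
    remaining = target_ms - elapsed_ms
    return max(0.0, min(budget[stage], remaining))


print(slack_ms(STAGE_BUDGET_MS, P95_TARGET_MS))   # 3.0 ms of unallocated slack
print(stage_timeout_ms("inference", 40.0))        # 10.0: earlier stages ran long
```

Propagating the remaining deadline this way means a slow feature fetch shrinks the inference timeout instead of silently blowing the end-to-end target.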
Part 4 — Release Safety and Experimentation
Model versioning and registry (what metadata an immutable version stores).
Canary / shadow deployment and rollback criteria.
Online A/B: assignment, per-arm metrics, and guardrails.
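Rollback criteria are easier to discuss when written as an explicit decision function. The following Python sketch is one hedged way to express them; the thresholds (0.5 percentage points of extra error rate, a 10% p99 regression, 5,000 canary requests before promotion) and all names are assumptions, and a real gate would also apply a statistical test rather than raw comparisons.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmStats:
    requests: int
    errors: int
    p99_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline: ArmStats, canary: ArmStats,
                    min_requests: int = 5000,
                    max_error_delta: float = 0.005,
                    max_p99_ratio: float = 1.10) -> str:
    """Return 'rollback', 'hold', or 'promote'. Guardrails are checked first,
    so a latency or error regression wins over any business-metric gain."""
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return "rollback"
    if canary.p99_ms > baseline.p99_ms * max_p99_ratio:
        return "rollback"
    if canary.requests < min_requests:
        return "hold"      # not enough traffic yet to promote safely
    return "promote"


baseline = ArmStats(requests=100_000, errors=100, p99_ms=40.0)
print(canary_decision(baseline, ArmStats(6_000, 6, 41.0)))   # promote
print(canary_decision(baseline, ArmStats(6_000, 6, 50.0)))   # rollback
```

Because the guardrail checks run before the promotion check, a canary whose business KPI improves while p99 regresses past the ratio is still rolled back under this policy.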
What This Part Should Cover
A registry of immutable versions storing the metadata needed to reproduce and gate a model (schema signature, training-data hash, offline metrics, provenance), with aliases decoupled from artifacts.
A clear shadow-vs-canary distinction with explicit auto-promote / auto-rollback criteria and a warm last-known-good for sub-second rollback.
An online A/B design with deterministic assignment, per-arm business and infra/calibration metrics, and a kill-switch.
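Deterministic assignment is commonly done by hashing a stable unit ID together with the experiment name. The Python sketch below shows that idea with a simple in-memory kill-switch; the experiment names, the KILLED set, and the 10,000-bucket resolution are assumptions for illustration.

```python
import hashlib
from typing import Dict, Set

BUCKETS = 10_000
KILLED: Set[str] = set()   # hypothetical kill-switch: experiments forced to control


def bucket(experiment: str, unit_id: str) -> int:
    """Stable bucket in [0, BUCKETS). Salting with the experiment name keeps
    assignments independent across experiments."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def assign(experiment: str, unit_id: str, weights: Dict[str, float],
           control: str = "control") -> str:
    """Map a unit to an arm according to cumulative weights (summing to 1.0)."""
    if experiment in KILLED:
        return control
    point = bucket(experiment, unit_id) / BUCKETS
    cumulative = 0.0
    for arm, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return arm
    return control   # guards against weights that sum to slightly under 1.0


weights = {"control": 0.5, "treatment": 0.5}
arm = assign("ranker_v2", "user-123", weights)
assert arm == assign("ranker_v2", "user-123", weights)   # same unit, same arm
```

Since assignment depends only on the inputs, any API replica computes the same arm without shared state, and per-arm metrics can be joined later on the experiment name and unit ID.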
Part 5 — Observability and Quality
Metrics, logs, and tracing, end-to-end and per stage.
Data / feature quality checks and drift detection.
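One distribution-distance metric often used for drift is the Population Stability Index (PSI). The Python sketch below computes it over fixed bins; PSI is named here as an illustrative choice rather than the required metric, and the bin edges, the epsilon floor, and the 0.25 alert threshold (a commonly quoted rule of thumb) are assumptions.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values in each bin; len(edges) + 1 bins, edges ascending."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        index = sum(1 for e in edges if v >= e)   # number of edges at or below v
        counts[index] += 1
    total = len(values)
    return [c / total for c in counts]


def psi(expected: Sequence[float], actual: Sequence[float],
        eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions.
    Empty bins are floored at eps so the logarithm stays finite."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


DRIFT_THRESHOLD = 0.25   # assumed alerting threshold

edges = [0.25, 0.5, 0.75]
training = bin_fractions([i / 100 for i in range(100)], edges)
serving = bin_fractions([0.5 + i / 200 for i in range(100)], edges)
print(psi(training, training))                      # 0.0 for identical distributions
print(psi(training, serving) > DRIFT_THRESHOLD)     # True: serving shifted upward
```

The same function can be applied to the model's score distribution by binning scores instead of a feature.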
What This Part Should Cover
Per-stage latency/error visibility (not just end-to-end) and distributed traces sliceable by tenant / model / experiment arm.
Online data-quality checks (nulls, type/range/cardinality violations) with a defined action on violation.
Drift detection via a distribution-distance metric on features and on the score distribution.
A train/serve-parity check (e.g., feature schema-hash match) that fails closed rather than scoring on malformed input.
Part 6 — Cost Controls and Fallback Behavior
Fallback behavior under partial outages or capacity shortfalls.
What This Part Should Cover
A steady-state utilization target with spike headroom and right-sizing, plus a unit-economics view (cost per 1k predictions) to make tiering/caching decisions data-driven.
Traffic tiering under pressure (cheaper/quantized models for low-value traffic, full-fat GPU reserved for high-value).
A fixed fallback ladder where every dependency has a timeout, circuit breaker, and defined fallback, with the invariant that no single stage failure fails the whole request.
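A fallback ladder can be sketched as an ordered list of scorers, each guarded by a circuit breaker. In the Python sketch below the rung names, the failure threshold of 3, and the static default score are assumptions. Timeouts are represented by the scorer raising TimeoutError, and half-open recovery of the breaker is deliberately left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


class CircuitBreaker:
    """Opens after N consecutive failures; a success resets the count.
    Half-open probing after a cool-down is omitted from this sketch."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.failure_threshold

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1


@dataclass
class Rung:
    name: str
    score: Callable[[Dict[str, float]], float]
    breaker: CircuitBreaker = field(default_factory=CircuitBreaker)


def score_with_fallback(features: Dict[str, float], ladder: List[Rung],
                        default_score: float = 0.0) -> Tuple[float, str, bool]:
    """Walk the ladder top-down. Returns (score, rung_name, degraded)."""
    for position, rung in enumerate(ladder):
        if rung.breaker.is_open:
            continue                      # skip a dependency known to be failing
        try:
            value = rung.score(features)
        except Exception:                 # timeouts and dependency errors alike
            rung.breaker.record(False)
            continue
        rung.breaker.record(True)
        return value, rung.name, position > 0
    # Last rung: a static default, so a stage failure never fails the request.
    return default_score, "static_default", True
```

The degraded flag in the return value is what lets a degraded-but-correct response be counted as its own signal instead of an availability failure.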
Part 7 — Security and Compliance
Request security: mTLS, secrets management.
PII handling, retention, and auditability.
Regionalization / data-sovereignty and a disaster recovery plan.
What This Part Should Cover
Defense in depth: external TLS, internal mTLS with service identities, secrets in a KMS, and data minimization at the edge.
Concrete PII handling: tokenization, field-level encryption, named retention windows, and audit trails, not just "encrypt everything."
Regionalization / data-sovereignty: where PII and models are allowed to live, and how failover avoids cross-region PII copy.
Concrete RPO/RTO DR targets backed by tested failover and chaos drills.
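As a small illustration of tokenization and data minimization at the edge, the Python sketch below drops fields that are not on an allowlist and replaces PII fields with keyed HMAC-SHA256 tokens. The field names and the inline key are hypothetical; in a real deployment the key would be fetched from a KMS or secrets manager and rotated. Keyed hashing of this kind is pseudonymization, not anonymization.

```python
import hashlib
import hmac
from typing import Any, Dict, FrozenSet

# Hypothetical field policy for one endpoint.
ALLOWED_FIELDS: FrozenSet[str] = frozenset({"amount", "country", "user_id", "email"})
PII_FIELDS: FrozenSet[str] = frozenset({"user_id", "email"})


def tokenize(value: str, key: bytes) -> str:
    """Deterministic keyed token: equal inputs map to equal tokens, so joins
    and per-user features still work without storing the raw value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize(payload: Dict[str, Any], key: bytes) -> Dict[str, Any]:
    """Drop fields not on the allowlist and tokenize the PII that remains."""
    cleaned: Dict[str, Any] = {}
    for name, value in payload.items():
        if name not in ALLOWED_FIELDS:
            continue                       # data minimization: never forward it
        cleaned[name] = tokenize(str(value), key) if name in PII_FIELDS else value
    return cleaned


demo_key = b"demo-key-not-for-production"   # placeholder; fetch from a KMS in practice
raw = {"email": "a@example.com", "amount": 12.5, "device_fingerprint": "xyz"}
safe = minimize(raw, demo_key)
assert "device_fingerprint" not in safe and safe["email"] != raw["email"]
```

Using a per-region key is one way to keep tokens from being joinable across regions, which supports the data-sovereignty goal of not copying PII during failover.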
What a Strong Answer Covers
These dimensions span all seven parts; the interviewer is listening for them throughout, not in any single section:
A back-of-envelope capacity estimate that sizes each tier against the traffic it actually serves (not assuming every request hits a GPU).
End-to-end coherence: the SLOs, latency budget, capacity estimate, and fallback ladder are mutually consistent rather than designed in isolation.
Explicit trade-off reasoning throughout (batching vs. tail latency, dedicated vs. shared pools, caching vs. staleness, quantization vs. accuracy).
Clearly stated assumptions that materially influence the design, surfaced rather than buried.
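A back-of-envelope capacity estimate might look like the Python sketch below. Every number in it is an assumption picked for illustration: 10k RPS steady state, a 3x spike, 20% of requests hitting GPU models, per-replica throughputs of 1,400 RPS (GPU) and 350 RPS (CPU), a 70% utilization ceiling at peak, and a GPU replica price of $2.50 per hour.

```python
import math


def replicas_needed(rps: float, per_replica_rps: float,
                    target_utilization: float) -> int:
    """Replicas required so each runs at or below the utilization ceiling."""
    return math.ceil(rps / (per_replica_rps * target_utilization))


def cost_per_1k(hourly_cost_per_replica: float, replicas: int, rps: float) -> float:
    """Dollars per 1,000 predictions at a given sustained request rate."""
    predictions_per_hour = rps * 3600
    return hourly_cost_per_replica * replicas / predictions_per_hour * 1000


# Assumed operating point (all values are illustrative).
steady_rps = 10_000
spike_multiplier = 3
gpu_fraction = 0.2

peak_rps = steady_rps * spike_multiplier
gpu_peak = peak_rps * gpu_fraction            # only deep-model traffic hits GPUs
cpu_peak = peak_rps - gpu_peak

gpu_replicas = replicas_needed(gpu_peak, per_replica_rps=1400, target_utilization=0.7)
cpu_replicas = replicas_needed(cpu_peak, per_replica_rps=350, target_utilization=0.7)
print(gpu_replicas, cpu_replicas)             # 7 GPU replicas, 98 CPU replicas

# Unit economics at steady state if the spike-sized GPU pool stays provisioned.
gpu_steady = steady_rps * gpu_fraction
print(round(cost_per_1k(2.50, gpu_replicas, gpu_steady), 6))
```

Sizing the GPU pool against only the deep-model share of traffic, rather than total RPS, is what keeps the estimate consistent with the CPU-vs-GPU serving split.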
Follow-up Questions
Be ready for deeper probes after the main design:
How does this change at 10x scale (or 100x)? What breaks first (the feature store, the GPU pool, or the API tier), and how would you re-architect?
A canary's business KPI improves but p99 latency regresses. How does your rollout policy resolve that conflict (Part 4), and what's the automated action?
Walk through what happens when the online feature store loses a shard mid-request. Trace it through your fallback ladder (Part 6) and say which SLI (if any) it dings.
Where would you add caching, and where is it actively harmful? Justify against hit rate and staleness for personalized, per-request scoring.