ML Feature Pipelines And Training-Serving Architecture

What's being tested

Candidates must demonstrate end-to-end thinking about building reliable, low-skew ML pipelines: consistent feature computation and storage, repeatable training pipelines, and robust serving architecture that preserves offline/online parity. Interviewers probe tradeoffs between materializing features vs. compute-on-read, how to monitor and detect data/model drift, and pragmatic deployment patterns (canarying, shadow traffic) that an ML Engineer owns.

Core knowledge

Feature store: centralizes feature definitions and lineage; separates batch store (historical joins for training) from online store (low-latency lookups). Popular implementations: `Feast`, custom `Redis`-backed stores.
Training-serving skew: difference between feature values available at training time and at inference time; minimize with identical transformation code, shared feature definitions, and replayable offline joins.
Materialize vs compute-on-read: materialized features (precomputed) improve latency and determinism; compute-on-read reduces storage and staleness but increases per-request compute and tail latency. Pick based on QPS, feature cost, and freshness needs.
Freshness and staleness: define feature timestamp semantics; staleness = now - feature_ts. For time-sensitive signals use streaming ingestion (e.g., `Kafka`) and windowed aggregations; for offline-only signals accept larger windows.
Online feature store design: low-latency stores (`Redis`, `DynamoDB`) with TTLs and versioning; design keys as (entity_type, entity_id, feature_set_id, version). Prefetch and local caches reduce `p99` latency.
Consistency & idempotency: ensure event ordering or use watermarking to bound late-arriving events; for streaming transforms prefer exactly-once semantics when possible. At MLE scope, at minimum require tests that validate that features can be reproduced from raw events.
Model training pipeline: reproducible DAGs with immutable inputs, deterministic feature joins, hyperparameter tracking, and artifact versioning (model + feature-set + preprocessing). Use metadata store and tie model version to feature-set version.
Serving architecture: size the design around model size, latency budget, and throughput. A typical request path is edge cache → online feature fetch → model inference → post-processing. Latency decomposition: $L_{total} = L_{network} + L_{feature\_fetch} + L_{model\_compute} + L_{postproc}$.
Offline evaluation parity: simulate online lookup latency and missing features during validation; use holdout windows and backtesting to estimate staleness effects.
Monitoring & drift detection: log inputs and outputs and compute telemetry on them: feature distributions, feature missing-rate, prediction distribution, and business metrics. Add automated alerts that fire when a drift test's p-value falls below its significance threshold or when a divergence measure such as KL divergence exceeds its threshold.
Safe rollout patterns: shadow traffic, canary percentage rollouts, and automatic rollback triggers on key telemetry (e.g., `p99` latency, feature missing-rate, negative business impact).
Privacy & compliance: avoid leaking PII in feature materialization, support feature deletion/retirement workflows, and ensure feature lineage for audits.
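
To make the drift-alert item above concrete, here is a minimal sketch of a KL-divergence check on one numeric feature, using only the standard library. The function names, the bin edges, the smoothing constant, and the 0.1 threshold are all illustrative assumptions; in practice thresholds are tuned per feature and the direction of the divergence (live against reference here) is a design choice.

```python
import bisect
import math
from typing import List, Sequence


def bin_proportions(values: Sequence[float], edges: Sequence[float],
                    eps: float = 1e-6) -> List[float]:
    """Histogram `values` over `edges`; return smoothed proportions summing to 1."""
    n_bins = len(edges) - 1
    counts = [0.0] * n_bins
    for v in values:
        # Out-of-range values are clamped into the first or last bin.
        idx = min(max(bisect.bisect_right(edges, v) - 1, 0), n_bins - 1)
        counts[idx] += 1.0
    smoothed = [c + eps for c in counts]  # eps avoids log(0) on empty bins
    total = sum(smoothed)
    return [s / total for s in smoothed]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def drift_alert(reference: Sequence[float], live: Sequence[float],
                edges: Sequence[float], threshold: float = 0.1) -> bool:
    """Alert when KL(live || reference) exceeds the (assumed) threshold."""
    ref_dist = bin_proportions(reference, edges)
    live_dist = bin_proportions(live, edges)
    return kl_divergence(live_dist, ref_dist) > threshold
```

The reference sample would typically come from the training window and the live sample from recent serving logs, binned with the same edges so the two histograms are comparable.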

Worked example — Design a video recommendation system

First 30 seconds: clarify scope (offline vs real-time recommendations), latency SLO (e.g., 100–300ms), personalization degree, and cold-start constraints (new users/videos). Declare assumptions about available signals: watch events, impressions, likes, video metadata, and user profile.

Organize answer into pillars: (1) data collection & logging with immutable event logs for replay; (2) feature pipeline & store: real-time aggregations (session watch time, recency-weighted counts) materialized to an online store and batch features in `BigQuery` for training; (3) model training: offline ranking model (e.g., pairwise or pointwise gradient-boosted tree or retrieval+rerank with embeddings) with reproducible pipelines, feature-set versioning, and offline AUC/PR and business metric estimations; (4) serving: two-stage serving — fast candidate retrieval (ANN on embeddings) then reranker using online features; (5) monitoring & rollouts: shadow traffic, canaries, and drift alerts.

Flag a tradeoff: whether to precompute heavy session-level features (materialize) or compute at request time using a fast stream join. For high QPS and tight latency, prefer materialization with frequent refresh; if freshness is paramount, prefer compute-on-read with optimized caches.
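
One way to illustrate this tradeoff is a hybrid read path: serve the materialized value while it is fresh enough, and fall back to compute-on-read once it exceeds a staleness bound. The sketch below is an assumption-laden toy, with a plain dict standing in for the online store and hypothetical names (`FeatureRecord`, `get_feature`, `max_staleness_s`); it is not the API of any particular feature store.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class FeatureRecord:
    value: float
    feature_ts: float  # time (seconds) at which the value was computed


def get_feature(store: Dict[str, FeatureRecord], key: str, now: float,
                max_staleness_s: float,
                compute_on_read: Callable[[str], float]) -> Tuple[float, str]:
    """Return (value, source), where staleness = now - feature_ts."""
    record = store.get(key)
    if record is not None and now - record.feature_ts <= max_staleness_s:
        return record.value, "materialized"
    # Missing or too stale: pay the per-request compute cost, then write back
    # so later requests inside the staleness window hit the store again.
    value = compute_on_read(key)
    store[key] = FeatureRecord(value=value, feature_ts=now)
    return value, "computed"
```

Logging the returned source per request gives a direct measure of how often the expensive path runs, which is the number that decides whether the refresh cadence or the staleness bound needs to change.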

Close by listing next steps: propose key metrics for canaries (`CTR`, `watch-time`, feature missing-rate), plan for A/B testing, and if more time — detail online embedding refresh cadence and cold-start strategies.
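
The canary metrics can be turned into an automatic rollback rule. The following is a rough sketch under stated assumptions: the metric names, the relative-regression limits, and the `Guardrail` structure are hypothetical, and a real rollout would also account for statistical noise (sample size, confidence intervals) before acting.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Guardrail:
    metric: str
    max_relative_regression: float  # e.g. 0.05 allows a 5% worsening
    higher_is_better: bool


def violated_guardrails(control: Dict[str, float], canary: Dict[str, float],
                        guardrails: Sequence[Guardrail]) -> List[str]:
    """Return the metrics on which the canary regresses past its limit."""
    violations = []
    for g in guardrails:
        base, cand = control[g.metric], canary[g.metric]
        # A positive regression always means "canary is worse than control".
        regression = (base - cand) if g.higher_is_better else (cand - base)
        relative = regression / max(abs(base), 1e-9)  # guard a zero baseline
        if relative > g.max_relative_regression:
            violations.append(g.metric)
    return violations


def should_rollback(control: Dict[str, float], canary: Dict[str, float],
                    guardrails: Sequence[Guardrail]) -> bool:
    return bool(violated_guardrails(control, canary, guardrails))
```

Returning the list of violated metrics, rather than only a boolean, keeps the rollback decision explainable in the incident log.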

A second angle — tight-latency mobile feed

Now assume a sub-50ms tail-latency budget for a mobile feed with intermittent connectivity. Emphasize smaller models (via distillation or quantization) and minimal online feature fetches. Shift heavy state to precomputed per-user embeddings synced to the device or fetched via a `CDN`; use lightweight on-device ranking or server-side sparse scoring with cached features. Replace exact nearest-neighbor search with highly optimized approximate (ANN) indices, and consider batching requests when possible to amortize lookup cost. For cold start, emphasize content-side features and popularity baselines; plan for incremental embedding updates rather than full retraining on every content add.

Common pitfalls

Pitfall: Treating raw logs as interchangeable between training and serving. Many candidates assume the offline joins reproduce what the model sees online; in practice, ignoring event-time semantics and differences in transformation code causes training-serving skew. Always tie model artifacts to explicit feature-set versions.
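
Respecting event-time semantics usually means a point-in-time join: each training row may only see the latest feature value whose timestamp is at or before the label event. The sketch below shows the idea with the standard library; the data layout and function names are hypothetical, and treating a feature stamped exactly at the event time as available is an assumption that depends on how the online store is actually written.

```python
import bisect
from typing import Dict, List, Optional, Tuple

# entity_id -> [(feature_ts, value), ...] sorted by feature_ts ascending
FeatureHistory = Dict[str, List[Tuple[float, float]]]


def point_in_time_value(history: FeatureHistory, entity_id: str,
                        event_ts: float) -> Optional[float]:
    """Latest value with feature_ts <= event_ts, or None if none exists yet."""
    rows = history.get(entity_id, [])
    timestamps = [ts for ts, _ in rows]
    idx = bisect.bisect_right(timestamps, event_ts) - 1
    return rows[idx][1] if idx >= 0 else None


def build_training_rows(
    labels: List[Tuple[str, float, int]],  # (entity_id, event_ts, label)
    history: FeatureHistory,
) -> List[Tuple[str, float, Optional[float], int]]:
    """Attach to each label the feature value that was knowable at event time."""
    return [
        (entity_id, event_ts,
         point_in_time_value(history, entity_id, event_ts), label)
        for entity_id, event_ts, label in labels
    ]
```

A naive join on entity id alone would attach the newest value to every row, leaking future information into training; the `None` case also surfaces the missing-feature rate the model will face online.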

Pitfall: Proposing a high-dimensional, heavy model without deployment constraints. Interviewers expect tradeoffs: a model that wins offline but breaks `p99` SLO is a failure. Quantify latency and memory and propose distillation, quantization, or tiered models.
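
Quantifying latency can be as simple as summing the components of the decomposition $L_{total} = L_{network} + L_{feature\_fetch} + L_{model\_compute} + L_{postproc}$ and comparing against the SLO. The numbers below are made-up placeholders, and adding per-component tail latencies is only a rough planning estimate, since percentiles do not add exactly.

```python
from typing import Dict, Tuple


def total_latency_ms(components: Dict[str, float]) -> float:
    """Sum the per-stage latencies of a single request path."""
    return sum(components.values())


def budget_report(components: Dict[str, float],
                  slo_ms: float) -> Tuple[float, float, str]:
    """Return (total, headroom, largest_stage); negative headroom breaks the SLO."""
    total = total_latency_ms(components)
    largest_stage = max(components, key=lambda name: components[name])
    return total, slo_ms - total, largest_stage


# Placeholder estimates in milliseconds (assumptions, not measurements).
estimates = {
    "network": 20.0,
    "feature_fetch": 15.0,
    "model_compute": 60.0,
    "postproc": 5.0,
}
total, headroom, largest = budget_report(estimates, slo_ms=80.0)
```

With these placeholder values the total is 100 ms against an 80 ms budget, so the headroom is negative and `model_compute` is the stage to attack first, for example with distillation, quantization, or a tiered model.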

Pitfall: Confusing ownership boundaries. MLEs should not design low-level ingestion plumbing, but must define validation tests, replayable transforms, and SLIs for upstream pipelines. Avoid specifying `Kafka` partitioning or DAG retries; instead specify required guarantees and tests for data consumers.

Connections

This topic commonly pivots to model monitoring & observability, online experimentation (rollout strategies and guardrails), and retrieval systems / ANN for candidate generation. Be prepared to discuss model interpretability and fairness checks for features and outputs.
