ML Feature Pipelines And Training-Serving Architecture
Asked of: Machine Learning Engineer
What's being tested
Candidates must demonstrate end-to-end thinking about building reliable, low-skew ML pipelines: consistent feature computation and storage, repeatable training pipelines, and robust serving architecture that preserves offline/online parity. Interviewers probe tradeoffs between materializing features vs. compute-on-read, how to monitor and detect data/model drift, and pragmatic deployment patterns (canarying, shadow traffic) that an ML Engineer owns.
Core knowledge
- Feature store: centralizes feature definitions and lineage; separates batch store (historical joins for training) from online store (low-latency lookups). Popular implementations: `Feast`, custom `Redis`-backed stores.
- Training-serving skew: difference between feature values available at training time and at inference time; minimize with identical transformation code, shared feature definitions, and replayable offline joins.
- Materialize vs compute-on-read: materialized features (precomputed) improve latency and determinism; compute-on-read reduces storage and staleness but increases per-request compute and tail latency. Pick based on QPS, feature cost, and freshness needs.
- Freshness and staleness: define feature timestamp semantics; staleness = now - feature_ts. For time-sensitive signals use streaming ingestion (e.g., `Kafka`) and windowed aggregations; for offline-only signals accept larger windows.
- Online feature store design: low-latency stores (`Redis`, `DynamoDB`) with TTLs and versioning; design keys as (entity_type, entity_id, feature_set_id, version). Prefetch and local caches reduce `p99` latency.
- Consistency & idempotency: ensure event ordering or use watermarking; for streaming transforms prefer exactly-once semantics when possible, but at MLE scope require tests that validate reproducibility of features from raw events.
- Model training pipeline: reproducible DAGs with immutable inputs, deterministic feature joins, hyperparameter tracking, and artifact versioning (model + feature-set + preprocessing). Use metadata store and tie model version to feature-set version.
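- Serving architecture: consider model size, latency budget, and throughput. A typical request path is edge cache → online feature fetch → model inference → post-processing. Latency decomposition: end-to-end latency ≈ cache lookup + feature fetch + model inference + post-processing.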
- Offline evaluation parity: simulate online lookup latency and missing features during validation; use holdout windows and backtesting to estimate staleness effects.
- Monitoring & drift detection: log inputs/outputs and compute telemetry: feature distributions, feature missing-rate, prediction distribution, and business metrics. Add automated alerts that fire when a drift test's p-value falls below its threshold or KL divergence rises above its threshold.
- Safe rollout patterns: shadow traffic, canary percentage rollouts, and automatic rollback triggers on key telemetry (e.g., `p99` latency, feature missing-rate, negative business impact).
- Privacy & compliance: avoid leaking PII in feature materialization, support feature deletion/retirement workflows, and ensure feature lineage for audits.
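The monitoring bullet above mentions alerting on KL divergence between feature distributions. A minimal sketch of that check follows, assuming fixed histogram bins and add-one smoothing; the function names, the bin edges, and the 0.1 threshold are illustrative assumptions, not values from any particular monitoring tool.

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Return smoothed bin proportions for values over fixed bin edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Values below the first edge or above the last are clamped
        # into the first or last bin respectively.
        idx = 0
        for i in range(len(edges) - 1):
            if v >= edges[i]:
                idx = i
        counts[idx] += 1
    # Add-one smoothing keeps every bin non-zero so the log is defined.
    total = len(values) + len(counts)
    return [(c + 1) / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def drift_alert(reference: Sequence[float], live: Sequence[float],
                edges: Sequence[float], threshold: float = 0.1) -> bool:
    """True when the live distribution diverges from the reference.

    The default threshold is an assumed placeholder; tune it per feature.
    """
    p = histogram(live, edges)
    q = histogram(reference, edges)
    return kl_divergence(p, q) > threshold
```

In practice the reference sample would come from the training window and the live sample from logged serving inputs, with one threshold per feature.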
Worked example — Design a video recommendation system
First 30 seconds: clarify scope (offline vs real-time recommendations), latency SLO (e.g., 100–300ms), personalization degree, and cold-start constraints (new users/videos). Declare assumptions about available signals: watch events, impressions, likes, video metadata, and user profile.
Organize answer into pillars: (1) data collection & logging with immutable event logs for replay; (2) feature pipeline & store: real-time aggregations (session watch time, recency-weighted counts) materialized to an online store and batch features in `BigQuery` for training; (3) model training: offline ranking model (e.g., pairwise or pointwise gradient-boosted tree or retrieval+rerank with embeddings) with reproducible pipelines, feature-set versioning, and offline AUC/PR and business metric estimations; (4) serving: two-stage serving — fast candidate retrieval (ANN on embeddings) then reranker using online features; (5) monitoring & rollouts: shadow traffic, canaries, and drift alerts.
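The two-stage serving pillar (candidate retrieval, then reranking with online features) can be sketched roughly as below. This is an illustration under stated assumptions: brute-force cosine similarity stands in for a real ANN index, and the linear scorer, the `recent_ctr` feature name, and all function names are hypothetical.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(user_vec: List[float], video_vecs: Dict[str, List[float]],
             k: int) -> List[str]:
    """Stage 1: top-k videos by embedding similarity (brute-force stand-in for ANN)."""
    ranked = sorted(video_vecs,
                    key=lambda vid: cosine(user_vec, video_vecs[vid]),
                    reverse=True)
    return ranked[:k]


def rerank(candidates: List[str],
           online_features: Dict[str, Dict[str, float]],
           weights: Dict[str, float]) -> List[str]:
    """Stage 2: reorder candidates with a hypothetical linear scorer.

    Missing features default to 0.0 so a failed lookup degrades the
    score instead of failing the request.
    """
    def score(vid: str) -> float:
        feats = online_features.get(vid, {})
        return sum(w * feats.get(name, 0.0) for name, w in weights.items())
    return sorted(candidates, key=score, reverse=True)


video_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
candidates = retrieve([1.0, 0.0], video_vecs, k=2)   # ["a", "b"]
online = {"a": {"recent_ctr": 0.02}, "b": {"recent_ctr": 0.10}}
ranking = rerank(candidates, online, {"recent_ctr": 1.0})  # ["b", "a"]
```

The split keeps the expensive model off the full catalog: retrieval narrows the catalog to a small candidate set, and only that set pays for online feature fetches.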
Flag a tradeoff: whether to precompute heavy session-level features (materialize) or compute at request time using a fast stream join. For high QPS and tight latency, prefer materialization with frequent refresh; if freshness is paramount, prefer compute-on-read with optimized caches.
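One way to frame this tradeoff in code is a lookup that serves the materialized value while it is fresh enough and falls back to compute-on-read otherwise. The sketch below is a minimal illustration: the in-memory dict stands in for an online store, and `get_feature`, `max_staleness_s`, and the key format are assumptions made for the example.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# Hypothetical stand-in for an online store: key -> (value, feature_ts).
OnlineStore = Dict[str, Tuple[float, float]]


def get_feature(store: OnlineStore, key: str,
                compute_on_read: Callable[[], float],
                max_staleness_s: float,
                now: Optional[float] = None) -> Tuple[float, str]:
    """Return (value, source), preferring a fresh materialized value."""
    now = time.time() if now is None else now
    entry = store.get(key)
    if entry is not None:
        value, feature_ts = entry
        # staleness = now - feature_ts, as defined in the core knowledge list.
        if now - feature_ts <= max_staleness_s:
            return value, "materialized"
    # Missing or too stale: pay the per-request compute cost and write back.
    value = compute_on_read()
    store[key] = (value, now)
    return value, "computed"


store: OnlineStore = {"user:1:session_watch_time": (12.0, 100.0)}
fresh = get_feature(store, "user:1:session_watch_time", lambda: 15.0, 60.0, now=130.0)
stale = get_feature(store, "user:1:session_watch_time", lambda: 15.0, 60.0, now=200.0)
```

Logging the returned source alongside the prediction gives a direct measure of how often requests hit the slow path.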
Close by listing next steps: propose key metrics for canaries (`CTR`, `watch-time`, feature missing-rate), plan for A/B testing, and if more time — detail online embedding refresh cadence and cold-start strategies.
A second angle — tight-latency mobile feed
Now assume sub-50ms tail latency for a mobile feed with intermittent connectivity. Emphasize smaller models (distillation, quantization) and minimal online feature fetches. Shift heavy state to precomputed per-user embeddings synced to the device or fetched via a `CDN`; use lightweight on-device ranking or server-side sparse scoring with cached features. Replace exact nearest-neighbor search with highly optimized approximate (ANN) indices, and consider batching requests when possible to amortize lookup cost. For cold start, emphasize content-side features and popularity baselines; plan for incremental embedding updates rather than full retraining for every content add.
Common pitfalls
Pitfall: Treating raw logs as interchangeable between training and serving. Many candidates assume the joins are identical; in practice, ignoring event-time semantics and transformation differences causes training-serving skew. Always tie model artifacts to explicit feature-set versions.
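Respecting event-time semantics usually means a point-in-time join: each training row sees only the feature value that existed at or before its event timestamp. A minimal stdlib sketch follows; the `FeatureHistory` layout and function names are assumptions for illustration, and a feature store would normally provide this join itself.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: entity_id -> [(feature_ts, value), ...] sorted by feature_ts.
FeatureHistory = Dict[str, List[Tuple[float, float]]]


def point_in_time_lookup(history: FeatureHistory, entity_id: str,
                         event_ts: float) -> Optional[float]:
    """Latest feature value with feature_ts <= event_ts, or None if none exists."""
    rows = history.get(entity_id, [])
    timestamps = [ts for ts, _ in rows]
    idx = bisect_right(timestamps, event_ts)
    if idx == 0:
        return None  # No value existed yet; never borrow one from the future.
    return rows[idx - 1][1]


def build_training_rows(labels: List[Tuple[str, float, int]],
                        history: FeatureHistory):
    """Attach the as-of feature value to each (entity_id, event_ts, label) row."""
    return [
        (entity_id, event_ts,
         point_in_time_lookup(history, entity_id, event_ts), label)
        for entity_id, event_ts, label in labels
    ]


history: FeatureHistory = {"u1": [(10.0, 1.0), (20.0, 2.0), (30.0, 3.0)]}
rows = build_training_rows([("u1", 25.0, 1), ("u1", 5.0, 0)], history)
```

A naive join on the latest value would give both rows the value 3.0, which leaks information from after the labeled events.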
Pitfall: Proposing a high-dimensional, heavy model without deployment constraints. Interviewers expect tradeoffs: a model that wins offline but breaks the `p99` SLO is a failure. Quantify latency and memory and propose distillation, quantization, or tiered models.
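Quantifying latency and memory can start as back-of-envelope arithmetic. The sketch below is illustrative only: the parameter count, per-stage numbers, and the 100 ms SLO are made-up inputs, and summing per-stage `p99` values is a rough, usually pessimistic planning estimate rather than a guarantee about end-to-end `p99`.

```python
from typing import Dict


def model_memory_mb(num_params: int, bytes_per_param: int) -> float:
    """Approximate weight memory in MB (weights only, no activations)."""
    return num_params * bytes_per_param / 1e6


def fits_latency_budget(stage_p99_ms: Dict[str, float], slo_ms: float) -> bool:
    """Planning check: do the summed per-stage p99 estimates fit the SLO?"""
    return sum(stage_p99_ms.values()) <= slo_ms


# Assumed example: a 50M-parameter model in fp32 (4 bytes) vs int8 (1 byte).
fp32_mb = model_memory_mb(50_000_000, 4)   # 200.0
int8_mb = model_memory_mb(50_000_000, 1)   # 50.0

# Assumed per-stage p99 estimates in milliseconds.
heavy = {"feature_fetch": 15.0, "inference": 90.0, "post_processing": 5.0}
distilled = {"feature_fetch": 15.0, "inference": 60.0, "post_processing": 5.0}
heavy_ok = fits_latency_budget(heavy, slo_ms=100.0)          # False
distilled_ok = fits_latency_budget(distilled, slo_ms=100.0)  # True
```

Numbers like these make the case for distillation or quantization concrete before any load test is run.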
Pitfall: Confusing ownership boundaries. MLEs should not design low-level ingestion plumbing, but must define validation tests, replayable transforms, and SLIs for upstream pipelines. Avoid specifying `Kafka` partitioning or DAG retries; instead specify required guarantees and tests for data consumers.
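A consumer-side guarantee such as "features are reproducible from raw events" can be written as a test rather than as plumbing. The sketch below assumes a hypothetical event schema (`event_id`, `user_id`, `watch_ms`) and checks that a transform gives the same result when events are replayed out of order and with duplicate deliveries.

```python
import random
from typing import Dict, Iterable, List


def watch_time_by_user(events: Iterable[dict]) -> Dict[str, int]:
    """Sum watch milliseconds per user, ignoring duplicate event_ids.

    Integer milliseconds keep the sum independent of arrival order;
    float sums can differ slightly when the order changes.
    """
    seen = set()
    totals: Dict[str, int] = {}
    for e in events:
        if e["event_id"] in seen:
            continue  # Duplicate delivery: count each event once.
        seen.add(e["event_id"])
        totals[e["user_id"]] = totals.get(e["user_id"], 0) + e["watch_ms"]
    return totals


def check_replayable(events: List[dict], trials: int = 5, seed: int = 0) -> bool:
    """True if shuffled replays with duplicates reproduce the baseline."""
    baseline = watch_time_by_user(events)
    rng = random.Random(seed)
    for _ in range(trials):
        # Re-deliver half of the events, then scramble the order.
        replay = events + rng.sample(events, k=len(events) // 2)
        rng.shuffle(replay)
        if watch_time_by_user(replay) != baseline:
            return False
    return True


events = [
    {"event_id": "e1", "user_id": "u1", "watch_ms": 1200},
    {"event_id": "e2", "user_id": "u1", "watch_ms": 800},
    {"event_id": "e3", "user_id": "u2", "watch_ms": 500},
]
```

A test like this states the required guarantee (order-insensitive, duplicate-safe) without prescribing how the upstream pipeline achieves it.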
Connections
This topic commonly pivots to model monitoring & observability, online experimentation (rollout strategies and guardrails), and retrieval systems / ANN for candidate generation. Be prepared to discuss model interpretability and fairness checks for features and outputs.
Further reading
- Hidden Technical Debt in Machine Learning Systems (Sculley et al., 2015): foundational read on maintenance burdens and pipeline coupling.
- Feast Feature Store: practical reference for feature store patterns, online/batch separation, and feature registry concepts.