ML System Design: Infra Stack, Feature Store, Reproducibility, and Monitoring
Context: You are designing and operating a machine learning platform that powers real-time, high-traffic use cases (for example: delivery ETA, dispatch/matching, ranking, fraud prevention). The system must support batch training, real-time inference, and stringent latency/availability SLAs.
1) Modern ML Infrastructure Stack
Describe the key components of a modern ML infrastructure stack and how they interact end-to-end from data generation to model impact in production.
2) Scalable Feature Store
Design a feature store that supports both:
- Offline training (historical, point-in-time correct feature computation and backfills).
- Online inference (low-latency feature retrieval, high freshness, and consistency with offline definitions).
Explain the architecture, data model, consistency model, and pipelines required.
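To make "point-in-time correct" concrete, here is a minimal sketch of an as-of lookup: for a training label observed at `label_ts`, it returns the most recent feature value written at or before that time, so no future information leaks into training. The class name `PointInTimeFeatureTable`, its methods, the integer timestamps, and the inclusive `<=` boundary are all assumptions made for illustration. A real offline store would run this as a partitioned as-of join over a warehouse table.

```python
from bisect import bisect_right
from collections import defaultdict


class PointInTimeFeatureTable:
    """Toy offline store (hypothetical): per-entity feature values ordered by event time."""

    def __init__(self):
        # entity_id -> parallel lists of event timestamps and values, kept sorted by time
        self._times = defaultdict(list)
        self._values = defaultdict(list)

    def write(self, entity_id, event_ts, value):
        """Insert a feature value, keeping the per-entity history sorted by event time."""
        times = self._times[entity_id]
        idx = bisect_right(times, event_ts)
        times.insert(idx, event_ts)
        self._values[entity_id].insert(idx, value)

    def as_of(self, entity_id, label_ts):
        """Return the latest value with event_ts <= label_ts, or None if there is none."""
        times = self._times.get(entity_id, [])
        idx = bisect_right(times, label_ts)
        return self._values[entity_id][idx - 1] if idx else None


table = PointInTimeFeatureTable()
table.write("courier_1", 100, 4.0)   # e.g. average delivery minutes known at t=100
table.write("courier_1", 200, 6.5)   # updated value at t=200

print(table.as_of("courier_1", 150))  # value visible at t=150 is the one written at t=100
print(table.as_of("courier_1", 50))   # nothing was known yet at t=50
```

The same `as_of` semantics are what the online store approximates by always serving the latest value. Keeping one shared feature definition behind both paths is one way to limit training/serving skew.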
3) Reproducibility and Versioning
Explain strategies to ensure reproducibility and versioning of data, code, configurations, features, and models throughout the ML pipeline.
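One possible way to tie these versions together is to record them in a single run manifest and derive a deterministic fingerprint from it. The sketch below assumes a hypothetical manifest layout (`code_commit`, `data_snapshot`, `feature_set_version`, `config`, `seed`). The field names and values are illustrative only and do not belong to any particular tool.

```python
import hashlib
import json


def run_fingerprint(manifest: dict) -> str:
    """Hash a training-run manifest into a stable identifier."""
    # sort_keys and fixed separators make the serialization independent of dict ordering
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical manifest: every value here is a placeholder for illustration.
manifest = {
    "code_commit": "abc1234",
    "data_snapshot": "orders/2024-01-01",
    "feature_set_version": "v3",
    "config": {"learning_rate": 0.05, "max_depth": 6},
    "seed": 42,
}

print(run_fingerprint(manifest))
```

Storing the fingerprint alongside the model artifact means that two runs with the same fingerprint were launched from identical inputs, and any change to data, code, configuration, or features produces a new identifier.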
4) Monitoring and Troubleshooting in Production
Describe how you would monitor and troubleshoot production ML services for:
- Latency and availability (P50/P95/P99, error rates),
- Data/feature drift and concept drift,
- Model degradation (online metrics and delayed labels).
Include alerting, debugging playbooks, and safeguard strategies.
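As one illustration of a feature-drift check, the sketch below computes the Population Stability Index (PSI) between a training-time sample and a recent serving sample of a single numeric feature. The function name, the equal-width binning derived from the training sample, the clamping of out-of-range values into the edge bins, and the `eps` floor for empty bins are all assumptions of this sketch. Other binning schemes, such as quantile bins, are equally valid.

```python
import math


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a baseline sample (expected) and a recent sample (actual)."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the baseline range; fall back to width 1.0 if the range is zero
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values outside the baseline range into the first or last bin
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each proportion at eps so the logarithm below is always defined
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


baseline = [float(v) for v in range(100)]   # stand-in for training-time feature values
shifted = [v + 50.0 for v in baseline]      # stand-in for drifted serving values

print(population_stability_index(baseline, baseline))
print(population_stability_index(baseline, shifted))
```

A commonly quoted rule of thumb treats a PSI above roughly 0.2 as worth investigating, but any alert threshold should be tuned per feature against historical variation before it pages anyone.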