System Design: Feature Store for Offline Training and Low‑Latency Online Inference
Context
You are designing a feature store to support machine learning models that require:
- Offline training on historical data.
- Low-latency online inference at high QPS.
Assume an internet-scale social platform with entities such as user, item/content, and community. Models include recommendations, feed ranking, and abuse/spam detection.
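To make the entity model concrete, one way to address keys in such a store is a composite of entity type, entity ID, and feature name. The names below (`FeatureKey`, `storage_key`, the `ctr_7d` feature) are illustrative assumptions, not part of the stated requirements — a minimal sketch:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureKey:
    """Addresses one feature value in the store.

    entity_type: one of the platform's entities, e.g. "user",
    "item", or "community"; entity_id and feature_name complete
    the composite key.
    """
    entity_type: str
    entity_id: str
    feature_name: str

    def storage_key(self) -> str:
        # Flattened composite key, e.g. for a key-value online store.
        return f"{self.entity_type}:{self.entity_id}:{self.feature_name}"

# Example: a hypothetical 7-day click-through-rate feature for user 42.
key = FeatureKey("user", "42", "ctr_7d")
```

A frozen dataclass keeps keys hashable, so they can also serve directly as dictionary or cache keys.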
Requirements
Specify and design the following:
- Functional and non-functional requirements (SLOs, scale assumptions).
- Feature ingestion for batch and streaming data.
- Computation and storage layers (offline/online) with point-in-time correctness.
- Feature serving APIs for training data generation and online inference.
- Offline–online consistency strategy.
- Freshness SLAs, backfills, and bootstrap flows.
- Lineage/metadata, governance, and access control.
- Deployment architecture and environments.
- CI/CD for feature definitions and pipelines.
- Testing strategies (unit, integration, end-to-end, data validation).
- Caching layers and eviction policies.
- Reliability/SLOs and fault tolerance.
- Monitoring/alerting and rollback processes.
- Cost and rate-limiting considerations.
- Component diagrams and justified technology choices.
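The point-in-time correctness requirement above is the subtle one: when generating training data, each label must be joined only with feature values that were already observed at the label's timestamp, or the model trains on leaked future information. A minimal sketch of such an as-of join (function and data shapes are assumptions for illustration, not a prescribed API):

```python
from bisect import bisect_right
from collections import defaultdict

def point_in_time_join(feature_log, label_events):
    """Attach to each label event the most recent feature value
    observed at or before the label timestamp (no future leakage).

    feature_log:  iterable of (entity_id, ts, value) observations.
    label_events: iterable of (entity_id, ts) label occurrences.
    Returns a list of (entity_id, ts, value_or_None) training rows.
    """
    # Build a per-entity feature history sorted by timestamp.
    history = defaultdict(list)
    for entity_id, ts, value in sorted(feature_log, key=lambda r: r[1]):
        history[entity_id].append((ts, value))

    rows = []
    for entity_id, label_ts in label_events:
        series = history.get(entity_id, [])
        # Index of the last snapshot with ts <= label_ts.
        idx = bisect_right([ts for ts, _ in series], label_ts) - 1
        value = series[idx][1] if idx >= 0 else None
        rows.append((entity_id, label_ts, value))
    return rows
```

For example, a label at t=15 sees the feature snapshot from t=10, not a later one from t=20; an entity with no prior observations gets `None`, which the training pipeline must then impute or drop. At production scale this join is typically done in a batch engine over partitioned logs, but the semantics are the same.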