Feature Store and Training-Serving Consistency

What's being tested

Interviewers are probing your ability to design, operate, and debug feature pipelines so that models see the same inputs in production as they did during training — i.e., training-serving consistency. They want to hear practical tradeoffs between compute-on-demand and materialized features, concrete strategies for point-in-time correctness, versioning, latency/SLA targets, and observability that an MLE would own. Expect to justify design choices against latency, cost, and correctness constraints and to outline how you'd detect and mitigate drift or skew.

Core knowledge

Feature store: a system for storing and serving ML features that separates offline (batch) computation for training from low-latency online serving for inference. Popular implementations: `Feast`, `Tecton`, and in-house stores.
Point-in-time joins / time travel: joins must use event-time semantics to prevent label leakage; training features must be constructed using only data available at prediction-time (use event_time <= cutoff_time).
Materialized vs compute-on-read: materialized features (precomputed and cached) give low latency and stable behavior; compute-on-read (online compute) reduces storage but risks inconsistent logic and higher p99 latency.
Feature versioning & schema: version features and transformations (semantic versions, stored code hashes) and enforce schemas (protobuf/Avro) to prevent silent drift in types or nullability.
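Consistency window / staleness: define a TTL or freshness bound per feature, with staleness = now - last_update. The staleness threshold trades accuracy against cost: a tighter bound means more frequent updates. For streaming pipelines, decide how late events are handled, for example micro-batching with an allowed-lateness watermark (e.g., 5–60s or longer).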
Normalization & transformation parity: store or serve the same transformation parameters (means/stds, encodings, bins). A model should not depend on a preprocessing pipeline that is only applied offline.
Backfills & retraining: ensure reproducible backfills using the exact offline features with point-in-time joins; maintain reproducible pipeline artifacts (Docker image + commit hash + feature registry entry).
Monitoring & drift detection: monitor feature distributions (KL-divergence, PSI), missingness rates, p99 serving latency, and model output drift; trigger alerts on threshold breaches.
Operational mechanics: online stores often use Redis, RocksDB, or specialized low-latency stores; offline stores use BigQuery, Snowflake, Postgres, or object storage plus batch compute (Spark, Beam).
Late-arriving events and idempotency: design pipelines to handle out-of-order events (watermarks, upserts) and idempotent writes (idempotency keys) to avoid inconsistent aggregations between training and serving.
Aggregation windows & alignment: clearly define sliding/fixed windows used for features (e.g., 7-day count ending at event_time). Mismatch between training window logic and serving runtime logic is a common root cause of skew.
Cost vs correctness calculus: quantify cost impacts — precomputing features increases storage and ETL costs but reduces inference-time variability; compute-on-read raises CPU and latency costs, complicating SLA compliance for p99.
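
The point-in-time rule from the list above (use only feature values with event_time <= cutoff_time) can be sketched in plain Python. This is a minimal illustration, not how any particular feature store implements it: the function name `point_in_time_join`, the tuple layouts, and the integer timestamps are all assumptions made for the example.

```python
from bisect import bisect_right
from collections import defaultdict


def point_in_time_join(label_rows, feature_rows):
    """Attach to each label row the latest feature value with event_time <= cutoff_time.

    label_rows:   iterable of (entity_id, cutoff_time, label)
    feature_rows: iterable of (entity_id, event_time, value)
    Returns a list of (entity_id, cutoff_time, feature_value_or_None, label).
    """
    # Group feature observations per entity and sort them by event time.
    by_entity = defaultdict(list)
    for entity_id, event_time, value in feature_rows:
        by_entity[entity_id].append((event_time, value))
    for rows in by_entity.values():
        rows.sort(key=lambda r: r[0])

    joined = []
    for entity_id, cutoff_time, label in label_rows:
        rows = by_entity.get(entity_id, [])
        times = [t for t, _ in rows]
        # bisect_right gives the number of rows with event_time <= cutoff_time.
        idx = bisect_right(times, cutoff_time)
        # No row at or before the cutoff means the feature was unknown then.
        value = rows[idx - 1][1] if idx > 0 else None
        joined.append((entity_id, cutoff_time, value, label))
    return joined


# Hypothetical data: timestamps are plain integers for readability.
features = [("u1", 10, 3), ("u1", 20, 5), ("u1", 30, 9), ("u2", 15, 1)]
labels = [("u1", 25, 1), ("u1", 5, 0), ("u2", 15, 1)]
training_rows = point_in_time_join(labels, features)
```

For the label at cutoff 25 the join picks the value written at time 20 and ignores the later value at time 30, which is exactly the leakage the rule is meant to prevent. The label at cutoff 5 gets `None` because no feature value existed yet.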

Worked example — Design a feature store for low-latency personalization ensuring training-serving consistency

First 30 seconds: clarify the workload (QPS, p99 latency SLO, size of state per user) and the allowable staleness for personalization signals. State assumptions: 10k QPS, p99 < 50ms, features include per-user counts (7d), last-played timestamp, and categorical embeddings.

Skeleton answer pillars:

Offline pipeline: batch-compute historical features with point-in-time joins into an offline store (BigQuery) for training; persist the exact training datasets as artifacts and record the feature versions used.
Online store: materialize frequently-used features into a low-latency store (Redis/RocksDB) updated by streaming ETL (Flink/Beam) with upserts keyed by user_id; ensure event-time processing and allowed lateness.
Feature registry & contracts: register features with schemas and versions in a Feast-style registry so that training and serving use identical transformation code and versions.
Observability & drift: instrument checks for freshness, null rates, distribution drift, and p99 latency.
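
The online-store pillar above (upserts keyed by user_id, event-time processing, late events) can be sketched with an in-memory stand-in. Everything here is an assumption for illustration: `OnlineStore` is a hypothetical class, not a Redis or RocksDB client, and a real store would also expire old events rather than keep them all.

```python
class OnlineStore:
    """In-memory stand-in for a key-value online store (hypothetical)."""

    def __init__(self):
        self.last_played = {}  # user_id -> latest event_time seen
        self.play_events = {}  # user_id -> {event_id: event_time}

    def upsert_play(self, user_id, event_id, event_time):
        # Idempotent: replaying the same event_id overwrites the same entry,
        # so a retried write cannot inflate counts.
        self.play_events.setdefault(user_id, {})[event_id] = event_time
        # Out-of-order safe: a late event never moves last_played backwards.
        current = self.last_played.get(user_id)
        if current is None or event_time > current:
            self.last_played[user_id] = event_time

    def plays_in_window(self, user_id, as_of, window):
        # Count distinct events with as_of - window < event_time <= as_of.
        events = self.play_events.get(user_id, {})
        return sum(1 for t in events.values() if as_of - window < t <= as_of)


store = OnlineStore()
store.upsert_play("u1", "e1", 100)
store.upsert_play("u1", "e2", 300)
store.upsert_play("u1", "e3", 200)  # arrives late, after e2
store.upsert_play("u1", "e2", 300)  # duplicate delivery of e2
recent_plays = store.plays_in_window("u1", as_of=300, window=150)
```

Keying writes by event id and comparing event times, rather than trusting arrival order, is what keeps the streaming aggregate equal to what a batch recomputation over the same events would produce.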

Tradeoff flagged: compute per-request aggregates on read (less storage) or maintain materialized aggregates (more storage and streaming complexity). For strict p99 SLOs and high QPS, materialize aggregates. Close: with more time, I'd prototype cold-start behavior, load-test the online store including failover, and add a canary pipeline to validate feature parity end-to-end.
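
The feature-parity validation mentioned in the close could start as a simple diff between offline and online feature vectors for the same requests. This is a rough sketch under assumptions: the function name `feature_parity_report`, the dict-of-dicts layout keyed by request id, and the feature names are hypothetical.

```python
import math


def feature_parity_report(offline_rows, online_rows, rel_tol=1e-6):
    """Return (request_id, feature, offline_value, online_value) for every mismatch."""
    mismatches = []
    for request_id, offline in offline_rows.items():
        online = online_rows.get(request_id)
        if online is None:
            # The online path never produced features for this request.
            mismatches.append((request_id, "<row>", "present", "missing"))
            continue
        for name, expected in offline.items():
            actual = online.get(name)
            if isinstance(expected, float) and isinstance(actual, float):
                # Tolerate tiny floating-point differences between systems.
                same = math.isclose(expected, actual, rel_tol=rel_tol)
            else:
                same = expected == actual
            if not same:
                mismatches.append((request_id, name, expected, actual))
    return mismatches


# Hypothetical sample: r2 disagrees on the 7-day play count.
offline = {
    "r1": {"plays_7d": 4, "avg_len": 12.5},
    "r2": {"plays_7d": 0, "avg_len": 3.0},
}
online = {
    "r1": {"plays_7d": 4, "avg_len": 12.5000001},
    "r2": {"plays_7d": 1, "avg_len": 3.0},
}
report = feature_parity_report(offline, online)
```

A canary would run this on a sample of live traffic and alert when the mismatch rate for any feature exceeds a chosen threshold. The tolerance and the threshold are tuning choices, not fixed values.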

A second angle — Diagnose an offline/online metric gap (offline AUC high, online AUC drops)

Frame the problem by first checking simple mismatches: feature distribution shift, missing features in serving, inconsistent normalization, or label leakage in offline eval. Steps: compare histograms, null/missingness rates, and joint distributions of top features between training data and live features; verify that the feature version used in serving matches training; validate event-time correctness for aggregated features. If discrepancies are subtle, run a replay: export live feature vectors for a sample of inference requests, score them offline with the same model artifact, and compare against the scores logged online. If the scores match, the features themselves differ from training; if they diverge, the mismatch is in preprocessing or serving-time logic.
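
The histogram comparison step can be made quantitative with the Population Stability Index (PSI). The sketch below assumes bin edges fixed from the training data and non-empty samples; the function name `psi`, the `eps` floor for empty bins, and the sample values are choices made for this example.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bin edges."""

    def bin_fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges at or below v.
            idx = sum(1 for e in edges if v >= e)
            counts[idx] += 1
        total = len(values)
        # Floor each fraction at eps so empty bins do not produce log(0).
        return [max(c / total, eps) for c in counts]

    p = bin_fractions(expected)  # training distribution
    q = bin_fractions(actual)    # live distribution
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical feature values; edges would normally come from training quantiles.
train_values = [1, 2, 3, 4, 5, 6, 7, 8]
live_same = [1, 2, 3, 4, 5, 6, 7, 8]
live_shifted = [6, 7, 8, 9, 6, 7, 8, 9]
edges = [3, 6]

psi_same = psi(train_values, live_same, edges)
psi_shifted = psi(train_values, live_shifted, edges)
```

Identical samples give a PSI of zero and the shifted sample gives a large positive value. Where to set the alert threshold is a judgment call that depends on the feature and how sensitive the model is to it.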

Common pitfalls

Pitfall: trusting names alone. Teams often assume feature names match semantics — e.g., user_clicks_7d computed on ingestion-time versus event-time. Always ask for exact aggregation window definitions and compare event-time semantics.
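
A small example shows how the same feature name can hide two different values. The field names `event_time` and `ingest_time`, the helper `count_in_window`, and the timestamps are assumptions for illustration.

```python
DAY = 24 * 60 * 60  # seconds


def count_in_window(events, as_of, time_field, window=7 * DAY):
    """Count events whose chosen timestamp falls in (as_of - window, as_of]."""
    return sum(1 for e in events if as_of - window < e[time_field] <= as_of)


# Hypothetical clicks for one user; the second was uploaded a week late.
clicks = [
    {"event_time": 1 * DAY, "ingest_time": 1 * DAY},
    {"event_time": 2 * DAY, "ingest_time": 9 * DAY},  # delayed upload
    {"event_time": 8 * DAY, "ingest_time": 8 * DAY},
]
as_of = 9 * DAY

# Both would plausibly be called user_clicks_7d, yet they disagree.
by_event_time = count_in_window(clicks, as_of, "event_time")
by_ingest_time = count_in_window(clicks, as_of, "ingest_time")
```

Here the event-time count is 1 and the ingestion-time count is 2, because the delayed click happened outside the window but arrived inside it. If training uses one definition and serving the other, the model sees skewed inputs even though the feature name and schema match.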

Pitfall: overlooking normalization parameters. A tempting quick fix is to recompute means/stddev from live data; that obscures whether drift or a bug caused the change. Instead, serve the same parameters used at training and separately track live statistics.
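
One way to keep the two concerns apart is to normalize with frozen training-time parameters while a separate accumulator tracks live statistics for monitoring only. This is a sketch under assumptions: `ScalerParams`, `RunningStats`, `normalize`, and the version string are hypothetical names, and the running statistics use Welford's online algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalerParams:
    """Normalization parameters captured at training time; immutable by design."""
    mean: float
    std: float
    version: str


class RunningStats:
    """Welford's online mean/variance, used for monitoring and never for scaling."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Population standard deviation of the values seen so far.
        return (self.m2 / self.n) ** 0.5 if self.n else 0.0


def normalize(x, params, live_stats):
    live_stats.update(x)                   # observe the raw value for drift checks
    return (x - params.mean) / params.std  # scale with training-time parameters only


params = ScalerParams(mean=10.0, std=2.0, version="v3")
live = RunningStats()
scaled = [normalize(x, params, live) for x in (12.0, 14.0, 16.0)]
```

In this example the live mean (14.0) has moved away from the training mean (10.0), which is a signal to investigate. The model input is still scaled exactly as it was during training, so a drift alert and a preprocessing bug remain distinguishable.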

Pitfall: focusing only on latency or only on cost. Designing purely for lower p99 by materializing everything can explode storage and ETL complexity; optimizing only for cost by computing on read can introduce inconsistent logic between training and serving and push up tail latency. Explain the concrete SLO-cost-correctness tradeoffs and propose an incremental rollout.

Connections

This topic naturally connects to model observability (feature/model drift, root-cause), data engineering (streaming joins, watermarking, upsert semantics), and CI/CD for ML (reproducible training pipelines, artifact/version management). Interviewers may pivot to dataset lineage, A/B testing of feature changes, or retraining cadence design.