How would you design delay and watchlist models?
Company: Capital One
Role: Data Scientist
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
You may be asked one or both of the following machine-learning case questions:
1. Flight-delay prediction case
An airline wants a model that predicts each flight's departure delay in minutes, with the prediction made 2 hours before scheduled departure. You have historical flight operations data, airport congestion, aircraft and route information, weather forecasts, and crew or maintenance signals. Propose a regression-based approach and explain:
- how you define the target and avoid label leakage;
- which features you would engineer;
- how you would split training and validation data over time;
- which evaluation metrics you would use, such as MAE, RMSE, or quantile loss, and why;
- how you would handle missing data, outliers, and highly correlated variables;
- whether multicollinearity is harmful for prediction, interpretability, or both;
- what threshold would make you call a correlation high, and why;
- alternatives to dropping correlated features, such as regularization, feature clustering, PCA, or tree-based models;
- if you remove a feature, how you would estimate that feature's business impact;
- how you would turn model outputs into concrete operational recommendations for the airline.
Assume delays are right-skewed, severe delays are rare but costly, and airport-specific operational policies differ across hubs.
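To make the leakage, time-split, and metric points above concrete, here is a minimal Python sketch using only the standard library. The field names (`sched_dep`, `observed_at`), the helper names, and the toy numbers are assumptions for illustration and are not part of the question. It is one possible way to frame an answer, not a reference solution.

```python
import math
from datetime import datetime, timedelta

# Assumption: the prediction is issued 2 hours before scheduled departure.
HORIZON = timedelta(hours=2)

def usable_at_prediction_time(observed_at, sched_dep):
    """A signal is leak-free only if it was observed by sched_dep - HORIZON."""
    return observed_at <= sched_dep - HORIZON

def time_split(flights, cutoff):
    """Train on flights scheduled before `cutoff`, validate on the rest."""
    train = [f for f in flights if f["sched_dep"] < cutoff]
    valid = [f for f in flights if f["sched_dep"] >= cutoff]
    return train, valid

def mae(y_true, y_pred):
    """Mean absolute error: robust to a few extreme delays."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: dominated by large misses."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss for quantile q in (0, 1)."""
    total = 0.0
    for a, b in zip(y_true, y_pred):
        diff = a - b
        # Under-prediction (diff > 0) costs q per minute; over-prediction costs 1 - q.
        total += max(q * diff, (q - 1) * diff)
    return total / len(y_true)

# Toy right-skewed example: three small delays and one severe delay (minutes).
y_true = [0.0, 5.0, 10.0, 180.0]
y_pred = [5.0, 5.0, 5.0, 5.0]  # a constant "typical delay" forecast

print("MAE :", mae(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
print("Pinball(q=0.9):", pinball_loss(y_true, y_pred, 0.9))
```

On this toy data the MAE is 46.25 minutes, while the RMSE is much larger because the single 180-minute delay dominates the squared error. That gap is one way to argue for reporting both, plus a high-quantile loss when severe delays are the costly case.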
2. Watchlist face-recognition case
A bank wants to use branch camera feeds to flag whether an entering customer matches a watchlist of known robbers. Describe how you would design the model and decision system. Address:
- closed-set versus open-set recognition;
- data collection and labeling;
- low base rates and class imbalance;
- false-positive versus false-negative costs;
- threshold selection, calibration, and human review;
- fairness, privacy, consent, and legal risk;
- latency and on-device versus server inference;
- monitoring for drift, spoofing, and adversarial attacks.
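The open-set and low-base-rate points above can be illustrated with a short Python sketch. The two-dimensional embeddings, the similarity threshold, the watchlist IDs, and the prevalence and error rates are all hypothetical values chosen for illustration. A real system would use learned face embeddings and measured error rates.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def open_set_match(probe, gallery, threshold):
    """Return (watchlist_id, similarity), or (None, similarity) when no
    gallery entry is similar enough. The reject option is what makes this
    open-set: most people entering a branch are not on the watchlist."""
    best_id, best_sim = None, -1.0
    for ident, emb in gallery.items():
        sim = cosine(probe, emb)
        if sim > best_sim:
            best_id, best_sim = ident, sim
    if best_sim >= threshold:
        return best_id, best_sim
    return None, best_sim

def positive_predictive_value(tpr, fpr, prevalence):
    """Share of alerts that are true matches, via Bayes' rule."""
    true_alerts = tpr * prevalence
    false_alerts = fpr * (1.0 - prevalence)
    return true_alerts / (true_alerts + false_alerts)

# Hypothetical gallery of two watchlist embeddings.
gallery = {"w1": [1.0, 0.0], "w2": [0.0, 1.0]}

print(open_set_match([0.9, 0.1], gallery, threshold=0.8))  # close to w1
print(open_set_match([1.0, 1.0], gallery, threshold=0.8))  # rejected as unknown

# Assumed rates: 1 in 100,000 entrants is on the watchlist,
# 99% true-positive rate, 0.1% false-positive rate.
print(positive_predictive_value(tpr=0.99, fpr=0.001, prevalence=1e-5))
```

Under these assumed rates, only about 1% of alerts would be true matches even with a seemingly accurate matcher. That is the usual argument for routing every alert through human review rather than acting on it automatically.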
For both cases, explain not only the modeling approach but also the business and ethical tradeoffs.
Quick Answer: This question evaluates end-to-end machine learning system design. It covers time-series regression and label-leakage concerns, feature engineering, skewed targets and rare but costly events, imbalanced and open-set face-recognition classification, evaluation and calibration, thresholding and decision systems, deployment and monitoring, and ethical and privacy trade-offs. It is commonly asked to assess whether a candidate can balance statistical modeling with operational, business, and legal constraints. The domain is Machine Learning for a Data Scientist role, and the required level spans both conceptual understanding and practical application.