This question evaluates competency in building a reproducible, production-ready, end-to-end binary classification pipeline. It covers data validation, missing-value handling, encoding and scaling, stratified splits, pipeline-based preprocessing to prevent leakage, baseline training, hyperparameter tuning, and test-set reporting. It belongs to the ML System Design category and targets practical application skills for Machine Learning Engineer roles. Interviewers commonly use it to assess a candidate's ability to design modular, leakage-resistant workflows, demonstrate reproducibility and monitoring practices, and reason about trade-offs in preprocessing, model selection, evaluation metrics, and training reliability in real-world ML pipelines.
You are given a single CSV file that fits in memory. The dataset contains:
Your task is to implement a modular, reproducible, and command-line runnable training pipeline that prevents data leakage and produces a test-set report. Use Python with common ML tooling.
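One possible shape for such a pipeline is sketched below with pandas and scikit-learn. It is a minimal illustration and not a reference solution. The prompt's column list is not available here, so the sketch assumes that feature types can be inferred from dtypes (numeric versus everything else) and that the target column name is supplied by the caller. The function names `build_pipeline`, `run`, and `main`, the `--csv`, `--target`, and `--seed` flags, the logistic-regression baseline, the small grid over `C`, and ROC AUC as the tuning metric are all illustrative choices. The key idea is that imputation, scaling, and encoding live inside one `Pipeline`, so they are fitted only on training folds and never see the held-out test rows.

```python
import argparse

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(numeric_cols, categorical_cols, seed):
    """Preprocessing and model in one estimator, so every fit uses training data only."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=1000, random_state=seed)),
    ])


def run(df, target, seed=42, test_size=0.2):
    """Validate, split, tune on the training set, and report on the test set."""
    # Basic validation: the target must exist and be binary.
    if target not in df.columns:
        raise ValueError(f"target column {target!r} not found")
    df = df.dropna(subset=[target])
    y = df[target]
    if y.nunique() != 2:
        raise ValueError("target must have exactly two classes")
    X = df.drop(columns=[target])

    # Assumption: numeric dtypes are numeric features, all others are categorical.
    numeric_cols = X.select_dtypes(include="number").columns.tolist()
    categorical_cols = [c for c in X.columns if c not in numeric_cols]

    # Stratified hold-out split, made before any statistics are computed.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )

    # Hyperparameter search with stratified CV on the training set only.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(
        build_pipeline(numeric_cols, categorical_cols, seed),
        param_grid={"model__C": [0.1, 1.0, 10.0]},
        scoring="roc_auc",
        cv=cv,
    )
    search.fit(X_train, y_train)

    # The test set is touched once, for the final report.
    proba = search.predict_proba(X_test)[:, 1]
    return {
        "best_params": search.best_params_,
        "test_roc_auc": roc_auc_score(y_test, proba),
        "report": classification_report(y_test, search.predict(X_test)),
    }


def main(argv=None):
    """Hypothetical command-line entry point; the flag names are illustrative."""
    parser = argparse.ArgumentParser(description="Train a binary classifier from a CSV file.")
    parser.add_argument("--csv", required=True, help="path to the input CSV")
    parser.add_argument("--target", required=True, help="name of the binary target column")
    parser.add_argument("--seed", type=int, default=42, help="random seed")
    args = parser.parse_args(argv)

    result = run(pd.read_csv(args.csv), target=args.target, seed=args.seed)
    print("best params:", result["best_params"])
    print("test ROC AUC:", round(result["test_roc_auc"], 4))
    print(result["report"])
```

Calling `main()` under an `if __name__ == "__main__":` guard would make the file runnable as a script. Passing the same seed to the split, the cross-validation folds, and the model is what keeps repeated runs on the same data comparable. A fuller answer would typically also log library versions and persist the fitted pipeline.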