Analyze CTR Data and Train Model
Company: Reddit
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: Onsite
You are given a notebook-based live coding task for click prediction.
A tabular dataset contains the following columns:
- `post_id`: unique identifier for a post
- `cat_hour`, `dog_hour`, `rabbit_hour`: numeric features representing recent hourly activity signals
- `current_tag`: categorical tag for the current post
- `is_click`: binary label indicating whether the post was clicked
Your task is to:
1. Perform exploratory data analysis on the dataset.
2. Identify important data quality issues, feature distributions, correlations, and potential leakage risks.
3. Prepare the data for modeling, including handling categorical features and deciding what to do with `post_id`.
4. Train a model to predict the click probability for each row (the row-level estimate of CTR).
5. Compare a simple baseline model with at least one stronger model.
6. Explain your evaluation strategy, model selection reasoning, and how you would interpret the results.
Assume this is an interview setting, so clarity of reasoning, sensible tradeoffs, and a clean end-to-end workflow matter as much as final model performance.
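One way to approach steps 1 and 2 is sketched below. The real dataset is not shown in this question, so `make_synthetic_ctr` is a hypothetical helper that builds a synthetic stand-in with the same column names. The distributions, the click signal, and the injected missing values are all assumptions made for illustration. `eda_summary` is likewise an illustrative name, not part of any library.

```python
import numpy as np
import pandas as pd


def make_synthetic_ctr(n_rows: int = 2000, seed: int = 0) -> pd.DataFrame:
    """Build a SYNTHETIC stand-in for the dataset (assumed shapes and signal)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "post_id": np.arange(n_rows),
        "cat_hour": rng.poisson(3.0, n_rows).astype(float),
        "dog_hour": rng.poisson(5.0, n_rows).astype(float),
        "rabbit_hour": rng.poisson(1.0, n_rows).astype(float),
        "current_tag": rng.choice(["cat", "dog", "rabbit"], n_rows),
    })
    # Assumed signal: cat activity only helps when the current post is tagged "cat".
    is_cat = (df["current_tag"] == "cat").to_numpy().astype(float)
    logit = -2.0 + 0.3 * df["cat_hour"].to_numpy() * is_cat
    df["is_click"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
    # Inject missing values so the data quality checks have something to find.
    missing_rows = rng.choice(n_rows, size=50, replace=False)
    df.loc[missing_rows, "dog_hour"] = np.nan
    return df


def eda_summary(df: pd.DataFrame, label: str = "is_click",
                id_col: str = "post_id") -> dict:
    """Collect the first-pass checks an interviewer usually expects to see."""
    numeric = [c for c in df.select_dtypes("number").columns
               if c not in (label, id_col)]
    return {
        "n_rows": len(df),
        # Fraction of missing values per column.
        "missing_frac": df.isna().mean().to_dict(),
        # Repeated IDs can leak rows across train and test splits.
        "duplicate_ids": int(df[id_col].duplicated().sum()),
        # Class balance: the overall click rate.
        "base_ctr": float(df[label].mean()),
        # Click rate per tag, a quick view of the categorical feature.
        "ctr_by_tag": df.groupby("current_tag")[label].mean().to_dict(),
        # A near-perfect correlation with the label is a leakage warning sign.
        "label_corr": df[numeric].corrwith(df[label]).to_dict(),
    }


summary = eda_summary(make_synthetic_ctr())
for key, value in summary.items():
    print(key, value)
```

`post_id` is deliberately excluded from the correlation check. As a unique identifier it should carry no generalizable signal, and any apparent relationship with the label would more likely reflect row ordering or leakage than a real effect.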
Quick Answer: This question evaluates end-to-end supervised machine learning competencies for tabular click-through rate prediction, including exploratory data analysis, data quality and leakage assessment, categorical feature handling, model training, baseline comparison, and evaluation interpretation.