Design a lead-scoring model
Company: IBM
Role: Data Scientist
Category: Machine Learning
Difficulty: easy
Interview Round: Technical Screen
##### Question
You are interviewing for a Data Scientist role on a marketing/growth team. Sales has limited outreach capacity, so the business wants a **lead-scoring** system that ranks or scores incoming leads (users or accounts arriving through ads, email, organic traffic, etc.) so Sales/Marketing can prioritize who to contact.
Assume you have a historical dataset of leads with a `lead_id`, a `created_at` timestamp, features available at scoring time (acquisition channel/campaign/geo/device, firmographics such as company size and industry, behavioral signals such as pages viewed, pricing-page hits, demo requests, email engagement), and one or more outcome labels (e.g. `converted` within a defined window, and optionally `time_to_convert_days`).
Design an end-to-end approach. Be explicit about assumptions (conversion window, label definition, scoring cadence) and call out key pitfalls and edge cases.
1. Define the prediction **target (label)** and the **prediction time** (when the score is computed). Address how you handle leads that are too recent to have observed the outcome window.
2. Propose **feature sets and data sources**, and explain how you would handle feature availability and **leakage**.
3. Propose both a **statistical (baseline) model and a more advanced machine-learning model**, and explain the interpretability/performance tradeoffs.
4. The stakeholder may care only about predictive performance, or may need to understand **which features are important and why**. Explain what you would deliver in each scenario.
5. Explain what **multicollinearity** is, why it matters (or doesn't) for different model families, how you would detect it, and how you would mitigate it.
6. Define how you would **evaluate** the model: a primary metric (and why), diagnostic metrics/plots, and guardrails (fairness, stability, operational constraints). Tie metrics to business constraints such as top-K capacity, lift, calibration, and revenue.
7. Describe how you would pick an **operating threshold / routing policy** to turn scores into actions.
8. Discuss key **risks**, including class imbalance, selection bias (sales touches are not random), fairness, and drift, and explain how you would monitor and iterate.
9. Describe how you would **deploy and monitor** the score in production and how you would update it over time.
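
For part 1, the label definition and the handling of too-recent leads can be made concrete with a small sketch. It assumes a 30-day conversion window, a data snapshot date `snapshot_at`, and a hypothetical helper `build_label`; none of these are given in the question, so treat them as placeholders to agree with the stakeholder. It uses the `created_at` and `time_to_convert_days` fields from the dataset description, with `time_to_convert_days` set to `None` when no conversion has been observed.

```python
from datetime import datetime, timedelta
from typing import Optional

CONVERSION_WINDOW_DAYS = 30  # assumption: agree the window with Sales/Marketing


def build_label(
    created_at: datetime,
    time_to_convert_days: Optional[float],
    snapshot_at: datetime,
    window_days: int = CONVERSION_WINDOW_DAYS,
) -> Optional[int]:
    """Return 1/0 for leads with a fully observed window, None otherwise."""
    window = timedelta(days=window_days)
    # Too recent: the outcome window has not fully elapsed at the snapshot.
    # Drop these leads (even early converters) so that fast conversions are
    # not over-represented among recent leads.
    if snapshot_at - created_at < window:
        return None
    converted_in_window = (
        time_to_convert_days is not None and time_to_convert_days <= window_days
    )
    return int(converted_in_window)


# Toy data for illustration only.
snapshot = datetime(2024, 6, 30)
leads = [
    {"lead_id": "a", "created_at": datetime(2024, 3, 1), "time_to_convert_days": 12.0},
    {"lead_id": "b", "created_at": datetime(2024, 3, 5), "time_to_convert_days": 45.0},
    {"lead_id": "c", "created_at": datetime(2024, 4, 1), "time_to_convert_days": None},
    {"lead_id": "d", "created_at": datetime(2024, 6, 20), "time_to_convert_days": 3.0},
]
labels = {
    lead["lead_id"]: build_label(
        lead["created_at"], lead["time_to_convert_days"], snapshot
    )
    for lead in leads
}
```

Here lead `b` is a negative because it converted after the window, and lead `d` is excluded from training because only 10 of its 30 days have been observed. Treating such leads as negatives would bias the conversion rate downward for the most recent cohort.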
Quick Answer: Design an end-to-end lead-scoring model for a marketing/growth data science team: define the label and scoring time, engineer leakage-safe features, compare a logistic-regression baseline with gradient-boosted trees, and evaluate with capacity-aware ranking metrics. The question also probes multicollinearity, model interpretability (SHAP/reason codes), routing-threshold design, and operational risks like selection bias, class imbalance, fairness, and drift.
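
As an illustration of a capacity-aware ranking metric, the sketch below computes precision and lift among the top-K scored leads, where K stands for the number of leads Sales can contact. The function name `precision_and_lift_at_k` and the toy scores and labels are hypothetical, chosen only for the example; this is one reasonable formulation, not the only way to define these metrics.

```python
from typing import Sequence, Tuple


def precision_and_lift_at_k(
    scores: Sequence[float], labels: Sequence[int], k: int
) -> Tuple[float, float]:
    """Precision among the k highest-scored leads, and its ratio to the base rate."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of leads")
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positives")
    # Rank leads from highest to lowest score; ties keep their input order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_k_labels = [label for _, label in ranked[:k]]
    precision = sum(top_k_labels) / k
    lift = precision / base_rate
    return precision, lift


# Toy example: 8 leads, 3 of which converted; Sales can contact 3.
scores = [0.91, 0.80, 0.75, 0.40, 0.35, 0.30, 0.20, 0.10]
labels = [1, 0, 1, 0, 0, 1, 0, 0]
precision, lift = precision_and_lift_at_k(scores, labels, k=3)
```

A lift above 1 means the top-K list converts at a higher rate than contacting leads at random. In practice this would be computed on a held-out, later time period so that the estimate reflects how the score is actually used.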