Context
You are interviewing for a Data Scientist role on a marketing/growth team. The business wants lead scoring: ranking or scoring incoming leads so Sales/Marketing can prioritize outreach.
Data
Assume you have a historical dataset of leads with:
- lead_id (string/int)
- created_at (timestamp)
- Features available at scoring time, e.g.
  - acquisition channel, campaign, geography, device
  - firmographics (company size, industry)
  - behavioral signals (pages viewed, demo request, email opens)
  - any other engineered features available at lead creation time
- Outcome label(s), e.g.
  - converted (boolean): whether the lead converted within a defined window
  - time_to_convert_days (numeric, optional)
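Because the converted label depends on "a defined window", it can help to pin down one candidate definition before modeling. The sketch below is illustrative only: the 30-day window, the snapshot_at argument, and the function name label_lead are assumptions, not part of the dataset description. It treats a lead as a negative only once its full window has elapsed, and returns None for leads that are still too recent to judge.

```python
from datetime import datetime, timedelta
from typing import Optional


def label_lead(
    created_at: datetime,
    time_to_convert_days: Optional[float],
    snapshot_at: datetime,
    window_days: int = 30,  # assumed conversion window, not given in the prompt
) -> Optional[bool]:
    """Return True/False for a lead with an observable outcome, else None."""
    converted_in_window = (
        time_to_convert_days is not None
        and time_to_convert_days <= window_days
    )
    if converted_in_window:
        return True
    # A lead that has not converted in time counts as a negative only after
    # the whole window has passed; otherwise its outcome is still unknown.
    if snapshot_at - created_at < timedelta(days=window_days):
        return None
    return False
```

Leads that come back as None would typically be excluded from training and evaluation rather than counted as non-converters, since treating them as negatives biases the conversion rate of recent cohorts downward.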
Task
- Propose an end-to-end approach to build a statistical model and a machine learning model for lead scoring.
- Discuss what kinds of variables/features you would use and how you would handle feature availability and leakage.
- The stakeholder may either:
  - only care about predictive performance, or
  - require understanding which features are important and why.
  Explain what you would deliver in each scenario.
- Explain what multicollinearity is, why it matters (or doesn’t) for different model families, how you would detect it, and how you would mitigate it.
- Define how you would evaluate the model, including:
  - a primary metric (and why)
  - diagnostic metrics/plots
  - guardrails (fairness, stability, or operational constraints)
- Describe how you would deploy and monitor the lead score in production and how you would update it over time.
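As one illustration of the multicollinearity detection step in the task list, the sketch below screens numeric features for highly correlated pairs using Pearson correlation. It is a simple first-pass check under stated assumptions, not a full diagnostic: the 0.9 threshold and the function names pearson and collinear_pairs are hypothetical. A pairwise screen can miss collinearity that involves three or more features, which is where variance inflation factors are usually preferred; when a model has exactly two predictors, the VIF reduces to 1 / (1 - r^2).

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length numeric columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


def collinear_pairs(
    columns: Dict[str, List[float]],
    threshold: float = 0.9,  # assumed cutoff; tune to the use case
) -> List[Tuple[str, str, float]]:
    """Return (feature_a, feature_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for name_a, name_b in combinations(columns, 2):
        r = pearson(columns[name_a], columns[name_b])
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged
```

Calling collinear_pairs on a mapping of feature name to values (for example pages viewed alongside a hypothetical session-length feature) returns the pairs worth reviewing before fitting a coefficient-based model such as logistic regression.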
Be explicit about assumptions (conversion window, label definition, scoring cadence) and call out key pitfalls/edge cases.
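For the evaluation part of the task, one candidate primary metric (an assumption here, chosen because Sales can only contact a limited number of leads) is precision among the top k scored leads, together with its lift over the base conversion rate. The sketch below is a minimal illustration; the function names precision_at_k and lift_at_k are hypothetical.

```python
from typing import Sequence


def precision_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Share of converters among the k highest-scored leads."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of leads")
    # Rank leads from highest to lowest score and keep the top k.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def lift_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Precision at k divided by the overall conversion rate."""
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when no lead converted")
    return precision_at_k(scores, labels, k) / base_rate
```

A lift above 1 means the top k leads convert more often than a random selection of the same size. The value of k would normally be set from an operational constraint such as weekly outreach capacity.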