PracHub

Design a lead-scoring model

Last updated: Jun 15, 2026

Quick Overview

Design an end-to-end lead-scoring model for a marketing/growth data science team: define the label and scoring time, engineer leakage-safe features, compare a logistic-regression baseline with gradient-boosted trees, and evaluate with capacity-aware ranking metrics. The question also probes multicollinearity, model interpretability (SHAP/reason codes), routing-threshold design, and operational risks like selection bias, class imbalance, fairness, and drift.


Company: IBM

Role: Data Scientist

Category: Machine Learning

Difficulty: easy

Interview Round: Technical Screen



Jan 17, 2026, 12:00 AM
Question

You are interviewing for a Data Scientist role on a marketing/growth team. Sales has limited outreach capacity, so the business wants a lead-scoring system that ranks or scores incoming leads (a user or account arriving through ads, email, organic, etc.) so Sales/Marketing can prioritize who to contact.

Assume you have a historical dataset of leads with a `lead_id`, a `created_at` timestamp, features available at scoring time (acquisition channel/campaign/geo/device, firmographics such as company size and industry, behavioral signals such as pages viewed, pricing-page hits, demo requests, email engagement), and one or more outcome labels (e.g. `converted` within a defined window, and optionally `time_to_convert_days`).
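To make "converted within a defined window" concrete, here is a minimal sketch of one way such a label could be built. The 30-day window, the `as_of` snapshot date, the `converted_at` field, and the function name `make_label` are all assumptions for illustration; the question itself leaves the window and label definition open.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed conversion window; the question does not fix a value.
CONVERSION_WINDOW_DAYS = 30


def make_label(
    created_at: datetime,
    converted_at: Optional[datetime],
    as_of: datetime,
    window_days: int = CONVERSION_WINDOW_DAYS,
) -> Optional[int]:
    """Return 1 or 0 for leads whose window has fully elapsed, else None.

    Leads whose window is still open at `as_of` are returned as None so
    they can be excluded from training. Early converters in that group are
    excluded too: keeping them while dropping their not-yet-converted
    peers would inflate the observed conversion rate.
    """
    deadline = created_at + timedelta(days=window_days)
    if as_of < deadline:
        return None  # outcome window not fully observed yet
    if converted_at is not None and created_at <= converted_at <= deadline:
        return 1
    return 0  # never converted, or converted only after the window closed


# Hypothetical example rows, evaluated against a snapshot taken on 2026-03-01.
snapshot = datetime(2026, 3, 1)
example_labels = [
    make_label(datetime(2026, 1, 1), datetime(2026, 1, 10), snapshot),
    make_label(datetime(2026, 1, 1), None, snapshot),
    make_label(datetime(2026, 2, 20), None, snapshot),
]
```

Dropping immature leads is only one option under these assumptions; a time-to-event model that treats them as censored, using `time_to_convert_days`, is an alternative.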

Design an end-to-end approach. Be explicit about assumptions (conversion window, label definition, scoring cadence) and call out key pitfalls and edge cases.

  1. Define the prediction target (label) and the prediction time (when the score is computed). Address how you handle leads that are too recent to have observed the outcome window.
  2. Propose feature sets and data sources, and explain how you would handle feature availability and leakage.
  3. Propose both a statistical (baseline) model and a more advanced machine-learning model, and explain the interpretability/performance tradeoffs.
  4. The stakeholder may either only care about predictive performance, or require understanding which features are important and why. Explain what you would deliver in each scenario.
  5. Explain what multicollinearity is, why it matters (or doesn't) for different model families, how you would detect it, and how you would mitigate it.
  6. Define how you would evaluate the model: a primary metric (and why), diagnostic metrics/plots, and guardrails (fairness, stability, operational constraints). Tie metrics to business constraints such as top-K capacity, lift, calibration, and revenue.
  7. Describe how you would pick an operating threshold / routing policy to turn scores into actions.
  8. Discuss key risks — class imbalance, selection bias (sales touches are not random), fairness, and drift — and how you would monitor and iterate.
  9. Describe how you would deploy and monitor the score in production and how you would update it over time.
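
Item 6 asks for metrics tied to top-K capacity and lift. As a rough illustration of what those terms mean, the sketch below computes precision at K and lift at K in plain Python. The function names and the toy scores and labels are hypothetical, and K stands in for however many leads Sales can contact in a period.

```python
from typing import Sequence


def precision_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Fraction of the k highest-scored leads that actually converted."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if k <= 0 or k > len(scores):
        raise ValueError("k must be between 1 and the number of leads")
    # Sort by score, highest first. Python's sort is stable, so tied scores
    # keep their input order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    return sum(label for _, label in top_k) / k


def lift_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Precision at k divided by the overall conversion rate."""
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when no lead converted")
    return precision_at_k(scores, labels, k) / base_rate


# Toy data: 8 leads, 3 conversions, capacity to contact 2 leads.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0]
p_at_2 = precision_at_k(toy_scores, toy_labels, k=2)  # 1 of the top 2 converted
lift_2 = lift_at_k(toy_scores, toy_labels, k=2)       # 0.5 / 0.375
```

A lift above 1 at the capacity K means the ranking beats contacting leads at random. These two numbers say nothing about calibration or revenue, which the question asks to be assessed separately.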

