How do I approach Machine Learning interview questions?

Machine Learning questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master machine learning interviews.

What difficulty level is this interview question?

This is a medium difficulty Machine Learning question, commonly asked during Onsite rounds at Squarespace.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Squarespace during technical interviews.

Predict Seller Intent From Subscription Data | Squarespace Interview Question

Q: Predict Seller Intent From Subscription Data

Solve a Squarespace-style seller-intent take-home by defining a leakage-safe target, handling timestamp anomalies and censoring, engineering subscription and user-agent features, training interpretable logistic regression, evaluating ranking and calibration, and explaining predictive drivers without causal overclaiming.

You are given a take-home dataset, seller_intent_take_home_dataset.csv, containing about 5,000 new subscription records from a website-building platform. The business goal is to understand which factors are associated with a new user becoming an active seller and to build an interpretable prediction model for seller intent.

The dataset includes subscription_id, subscription_start, nullable first_sale, days_in_trial, subscription_plan, subscription_period, discount_amount, country, site_topic, site_need, and raw user_agent.

Important issues include missing seller labels, inconsistent timestamps, missing and long-tail onboarding fields, and raw user-agent strings.

Constraints & Assumptions

Define the target from first_sale and subscription_start; do not use future information as features.
State how you handle rows where first_sale occurs before subscription_start.
Prioritize practical data reasoning, feature engineering, leakage prevention, interpretability, and presentation clarity over complex model tuning.
Distinguish predictive associations from causal claims.

Clarifying Questions to Ask (Guidance)

At what time will the model score a subscriber: at subscription start, after trial, or after onboarding?
What prediction horizon matters: any future sale, 7-day seller, 30-day seller, or another window?
What is the data extraction date so right-censoring can be handled?
What business action will use the prediction: lifecycle messaging, sales outreach, onboarding personalization, or product analysis?
What cost tradeoff matters more: missing potential sellers or contacting low-intent users?

Part 1 - Define The Target And Handle Timestamp Edge Cases

How would you define the seller-intent label and handle timestamp anomalies?

What This Part Should Cover (Guidance)

A fixed-horizon target such as seller within 30 days after subscription start.
Exclusion or flagging of records with first_sale < subscription_start, with a clear rule for minor timezone-like differences if used.
Right-censoring treatment for subscriptions without a full observation window.
Avoidance of leakage from first_sale or days_to_first_sale as features.
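
One way to make these rules concrete is a small labeling function. This is a minimal sketch under stated assumptions, not the expected solution: the 30-day horizon, the 24-hour tolerance for timezone-like gaps, and the names label_row, HORIZON, TOLERANCE, and extract_date are all illustrative choices that are not given in the prompt.

```python
from datetime import datetime, timedelta
from typing import Optional

HORIZON = timedelta(days=30)     # assumed prediction window
TOLERANCE = timedelta(hours=24)  # assumed slack for timezone-like offsets


def label_row(start: datetime,
              first_sale: Optional[datetime],
              extract_date: datetime) -> Optional[int]:
    """Return 1 (seller within horizon), 0 (not a seller), or None (exclude row)."""
    # Right-censoring: the full window must be observable at extraction time.
    if extract_date < start + HORIZON:
        return None
    # No sale recorded and the window is fully observed: a true negative.
    if first_sale is None:
        return 0
    # Sale well before subscription start: treat as a data anomaly and exclude.
    if first_sale < start - TOLERANCE:
        return None
    # Small negative gaps are clamped to the start (timezone-like noise).
    first_sale = max(first_sale, start)
    return 1 if first_sale <= start + HORIZON else 0
```

Rows that return None are dropped from training but should still be counted and reported. The censoring check runs first so that recent cohorts are excluded regardless of outcome, which avoids inflating the seller rate with early converters from incomplete windows.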

Part 2 - Perform Focused EDA

What exploratory analysis would you perform?

What This Part Should Cover (Guidance)

Data quality checks, uniqueness, timestamp parsing, missingness, class balance, and outliers.
Conversion by plan, period, country, trial length, discount, topic, need, and device/browser features.
Long-tail category analysis and high-signal tables or plots that support modeling decisions.
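
A conversion table per categorical field is often the highest-signal EDA artifact. The sketch below uses only the standard library and assumes each record is a dict that already carries a binary label; the label name seller_30d and the function name conversion_by are hypothetical.

```python
from collections import defaultdict


def conversion_by(rows, key, label="seller_30d"):
    """Return (value, n, conversion_rate) tuples, largest groups first.

    Intended for categorical fields: empty or missing values are bucketed
    as "MISSING" so that missingness shows up as its own group.
    """
    counts = defaultdict(lambda: [0, 0])  # value -> [n, positives]
    for row in rows:
        value = row.get(key) or "MISSING"
        counts[value][0] += 1
        counts[value][1] += row[label]
    table = [(value, n, pos / n) for value, (n, pos) in counts.items()]
    return sorted(table, key=lambda t: -t[1])
```

Reporting the group size next to the rate matters here: with about 5,000 rows, a long-tail country or topic with a handful of records can show an extreme rate that is mostly noise.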

Part 3 - Engineer Features

What features would you create?

What This Part Should Cover (Guidance)

Timestamp, trial, discount, plan, billing period, country, topic, need, and missingness features.
Rare-category grouping and one-hot encoding.
User-agent-derived device type, operating system, and browser.
Train-only preprocessing to prevent validation leakage.
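
The user-agent and rare-category items above can be sketched as follows. This is a rough heuristic, not a complete parser: the substring rules, the min_count default, and the function names are assumptions for illustration, and a maintained user-agent library would likely be more robust in practice.

```python
from collections import Counter


def parse_user_agent(ua):
    """Derive coarse device, OS, and browser buckets from a raw user-agent string."""
    if not ua:
        return {"device": "unknown", "os": "unknown", "browser": "unknown"}
    # Order matters: iPhone/iPad strings contain "like Mac OS X",
    # and Android strings contain "Linux".
    if "iPhone" in ua or "iPad" in ua:
        os_name = "iOS"
    elif "Android" in ua:
        os_name = "Android"
    elif "Windows" in ua:
        os_name = "Windows"
    elif "Mac OS X" in ua:
        os_name = "macOS"
    elif "Linux" in ua:
        os_name = "Linux"
    else:
        os_name = "other"
    # Order matters: Edge and Opera also advertise "Chrome/" and "Safari/".
    if "Edg/" in ua:
        browser = "Edge"
    elif "OPR/" in ua:
        browser = "Opera"
    elif "Firefox/" in ua:
        browser = "Firefox"
    elif "Chrome/" in ua or "CriOS/" in ua:
        browser = "Chrome"
    elif "Safari/" in ua:
        browser = "Safari"
    else:
        browser = "other"
    if "iPad" in ua or "Tablet" in ua or ("Android" in ua and "Mobi" not in ua):
        device = "tablet"
    elif "Mobi" in ua or "iPhone" in ua:
        device = "mobile"
    else:
        device = "desktop"
    return {"device": device, "os": os_name, "browser": browser}


def fit_frequent_levels(train_values, min_count=20):
    """Learn which category levels are frequent, using training rows only."""
    counts = Counter(v for v in train_values if v)
    return {v for v, n in counts.items() if n >= min_count}


def group_rare(value, frequent):
    """Map a raw category to itself, "OTHER" (rare or unseen), or "MISSING"."""
    if not value:
        return "MISSING"
    return value if value in frequent else "OTHER"
```

Fitting the frequent-level set on the training split and then applying group_rare to validation rows keeps the grouping decision from seeing validation data, and it also gives unseen categories a defined bucket at scoring time.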

Part 4 - Train And Evaluate An Interpretable Model

Which model and metrics would you use?

What This Part Should Cover (Guidance)

Baseline model, regularized logistic regression, and optional comparison to a tree-based model.
Stratified or temporal validation based on the business use case.
ROC-AUC, PR-AUC, log loss or Brier score, calibration, precision/recall, and lift or precision at top K.
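
The ranking and calibration metrics above are simple enough to compute by hand, which helps when explaining them. A dependency-free sketch follows; the function names are illustrative, and in a real notebook a library such as scikit-learn would normally supply ROC-AUC, PR-AUC, and log loss.

```python
def precision_at_k(y_true, scores, k):
    """Share of true sellers among the k highest-scored subscribers."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    top = ranked[:k]
    return sum(y for _, y in top) / len(top)


def lift_at_k(y_true, scores, k):
    """Precision at k divided by the overall seller rate."""
    base_rate = sum(y_true) / len(y_true)
    return precision_at_k(y_true, scores, k) / base_rate


def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)
```

Precision and lift at top K answer the ranking question (who to contact first), while the Brier score reflects whether the predicted probabilities can be taken at face value. Both should be computed on held-out data only.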

Part 5 - Explain Drivers And Limitations

How would you explain the main drivers of seller intent?

What This Part Should Cover (Guidance)

Coefficient or odds-ratio interpretation for logistic regression.
Clear language that features are predictive associations, not proven causes.
Limitations around right-censoring, timestamp quality, small sample size, missing fields, and observational data.
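
For the odds-ratio interpretation, the arithmetic is just exponentiating each logistic regression coefficient. The sketch below assumes coefficients are available as a name-to-value mapping; the function name and any feature names passed to it are hypothetical.

```python
import math


def odds_ratios(coefficients):
    """Convert logistic regression coefficients to odds ratios.

    An odds ratio of 1.0 means no association; 2.0 means the odds of
    being a seller are twice as high per one-unit increase in the
    feature (or versus the reference level for a one-hot feature),
    holding the other features fixed.
    """
    return {name: math.exp(beta) for name, beta in coefficients.items()}
```

These ratios describe associations conditional on the other features in the model. They depend on feature scaling and on the chosen reference category, and regularization shrinks them toward 1.0, so they are better read as a ranking of drivers than as effect sizes.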

Part 6 - Prepare The Final Presentation

What should the final notebook or presentation include?

What This Part Should Cover (Guidance)

Goal, target definition, EDA findings, feature engineering, model choice, metrics, drivers, business recommendations, limitations, and next steps.
Concise visuals and an executive summary.
Recommendations for improved tracking, experiments, monitoring, and productionization.

What a Strong Answer Covers (Guidance)

Defines the prediction problem precisely before modeling.
Prevents leakage and handles censoring and timestamp anomalies honestly.
Uses interpretable modeling and business-relevant metrics.
Explains what the model can and cannot claim.

Follow-up Questions (Guidance)

What if the seller rate is only 5%?
How would you choose a threshold for outreach?
What if annual plans are highly predictive but not causal?
How would you monitor the model after launch?
How would you improve the data collection process?
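
For the outreach-threshold question, one possible framing is a break-even rule: contact a subscriber when the expected value of reaching a seller exceeds the cost of the contact. The sketch below assumes calibrated probabilities and a constant value per seller and cost per contact; both quantities and the function names are hypothetical and would come from the business.

```python
def break_even_threshold(value_per_seller, cost_per_contact):
    """Probability above which contacting has positive expected value."""
    return cost_per_contact / value_per_seller


def net_value_at_threshold(y_true, probs, threshold,
                           value_per_seller, cost_per_contact):
    """Realized net value on held-out data when contacting everyone at or above threshold."""
    contacted = [y for y, p in zip(y_true, probs) if p >= threshold]
    return sum(contacted) * value_per_seller - len(contacted) * cost_per_contact
```

Comparing net_value_at_threshold across a few candidate thresholds on a validation set is a useful sanity check on the break-even value, since poor calibration would make the two disagree.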
