Predict Seller Intent From Subscription Data
Company: Squarespace
Role: Data Scientist
Category: Machine Learning
Difficulty: medium
Interview Round: Onsite
You are given a take-home dataset, `seller_intent_take_home_dataset.csv`, containing about 5,000 new subscription records from a website-building platform. The business goal is to understand which factors are associated with a new user becoming an active seller and to build an interpretable prediction model for seller intent.
The dataset has the following columns:
| Column | Type | Description |
|---|---:|---|
| `subscription_id` | string | Unique subscription identifier. |
| `subscription_start` | timestamp | When the subscription officially started. |
| `first_sale` | timestamp, nullable | When the user made their first sale, if any. |
| `days_in_trial` | integer | Number of days the user spent in trial. |
| `subscription_plan` | categorical | Subscription plan selected. |
| `subscription_period` | categorical | Billing period, such as monthly or annual. |
| `discount_amount` | numeric | Discount amount applied to the subscription. |
| `country` | categorical | User country. |
| `site_topic` | categorical/free text, nullable | User-provided site topic from onboarding. |
| `site_need` | categorical/free text, nullable | User-provided reason or need for creating the site. |
| `user_agent` | string | Raw browser user-agent string. |
Important data issues:
- There is no existing `is_seller` label. You must define the target variable using `first_sale` and `subscription_start`.
- Some rows have inconsistent timestamps, including cases where `first_sale` occurs before `subscription_start`.
- `site_topic` and `site_need` have substantial missingness, roughly 28% to 32%, and many long-tail categories.
- `user_agent` is a raw string, so useful features such as device type, operating system, and browser must be extracted.
- The final notebook or report must be concise enough to present clearly and should prioritize data reasoning, feature engineering, and interpretability over complex hyperparameter tuning.
Deliverables:
1. Define the target variable and explicitly state how you handle timestamp edge cases.
2. Perform exploratory data analysis focused on data quality, missingness, class balance, and feature distributions.
3. Engineer useful features from categorical fields, timestamps, discounts, trial length, and `user_agent`.
4. Train an interpretable model, such as logistic regression, and evaluate it using appropriate classification and ranking metrics.
5. Explain the main drivers of predicted seller intent, while distinguishing predictive associations from causal claims.
6. Prepare a short presentation explaining your assumptions, modeling choices, results, limitations, and recommended next steps.
Quick Answer: This question evaluates a data scientist's competency in defining target labels from temporal data, handling timestamp edge cases and missingness, engineering features from categorical, free-text and user-agent fields, and building and interpreting an interpretable binary classification model.