##### Question
You are a Data Scientist on a US C2C marketplace app (like Facebook Marketplace) where users buy and sell second-hand products.
**Current product behavior**
- Users browse product listings.
- If a buyer is interested in a listing, they can click **“Send message”** to contact the seller.
- Each message sent counts as **one listing interaction**.
**Proposed feature**
On a product listing, buyers can opt into **reminders/notifications** for **“similar listings you may like.”** When similar products become available, the buyer receives a notification.
Answer the following:
1. **Pre-launch / decision framing.** How would you decide whether this feature is a good idea for the product? Cover:
   - The user problem and hypothesis you are testing.
   - What success metrics you would expect to move (and why), and how you would distinguish primary vs. diagnostic vs. guardrail metrics.
   - Key tradeoffs and risks (e.g., notification fatigue, adverse selection, cannibalization of search).
   - What data you would analyze *before* building to validate demand and size the opportunity (e.g., backtesting against historical logs).
   - What MVP / phased rollout plan you would propose if you were uncertain.
2. **Post-implementation impact evaluation.** Assume engineers have shipped the functionality (or it can be enabled for some users). How would you measure its impact and determine whether it is successful? Be specific about:
   - The recommended experiment or causal design, the unit of randomization, control vs. treatment, and duration.
   - Primary success metric(s) vs. secondary/diagnostic metrics vs. guardrail metrics.
   - Key pitfalls (opt-in selection bias, notification fatigue, interference/network effects, seasonality, attribution) and how you would handle them.
   - How you would interpret results and decide to iterate, roll out, or roll back.
Quick Answer: A Meta Data Scientist analytics-and-experimentation interview question: a US C2C second-hand marketplace is considering opt-in “similar listings” notifications. You must (1) decide whether to build it — hypotheses, opportunity sizing, risks, MVP plan — and (2) design the post-launch measurement: a user-randomized A/B test with a primary/diagnostic/guardrail metric stack, handling of opt-in selection bias, and a clear launch/rollback rule.
Solution
### Part 1 — Decide whether it’s worth building
#### 1) Clarify the goal and articulate hypotheses
A marketplace feature must ultimately improve marketplace health — liquidity and buyer–seller match rate — without harming the user experience. The candidate goal here: increase buyer-to-seller connections and purchases by helping buyers discover relevant inventory when it appears.
Example hypotheses:
- **H1 (engagement/liquidity):** Similar-listing notifications increase buyer re-engagement and listing interactions (messages) per buyer.
- **H2 (conversion efficiency):** Notifications increase downstream conversions (purchases / completed transactions), reduce time-to-purchase, and/or raise message-per-view among high-intent sessions.
- **H3 (retention):** Notifications bring users back, improving 7/28-day buyer retention.
- **H4 (risk / counter-hypothesis):** Excess or irrelevant notifications increase mute/opt-out/uninstall and reduce long-run engagement (fatigue).
Other risks to name: **adverse selection** (only highly engaged users opt in, so the effect may not generalize), **cannibalization** (users delay purchases waiting for a “better” similar listing, or notifications merely shift demand away from organic search without growing total transactions), and **marketplace interference** (promoting some listings reduces exposure for others — network effects / fairness concerns; more buyer messages can also overload sellers with low-quality inquiries).
#### 2) Pre-build opportunity sizing (use existing data)
The aim is to estimate *headroom* and where the feature could matter most before investing heavily.
- **Unmet demand:** sessions where a buyer views many listings but sends 0 messages (high intent, low match).
- **Inventory arrival rate:** for common categories, how often do “similar” items appear after a user views an item? If similar inventory is sparse, notifications won’t trigger enough to matter.
- **Time-to-message / time-to-purchase:** if buyers often return days later to message, reminders could accelerate actions.
- **Repeat-interest patterns:** % of users who view/save/search the same category or keywords repeatedly over days.
- **Notification baseline:** existing push/email volume and opt-out rates — can we add more without harming?
- **Backtest against logs:** identify users with repeated intent signals, then simulate — if we had notified them when similar inventory appeared, how often would there have been a plausible “match”? Evaluate a relevance proxy offline (e.g., precision@k using historic co-click / co-message patterns).
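
The offline relevance proxy in the backtest can be sketched as a precision@k calculation. This is a minimal illustration, assuming each candidate notification list is an ordered list of listing IDs and that "relevant" means the user later clicked or messaged that listing in the historical logs; the function name and the sample IDs are hypothetical.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended listings that the user actually engaged with."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = sum(1 for listing_id in top_k if listing_id in relevant)
    # Dividing by k (not len(top_k)) penalizes lists shorter than k.
    return hits / k


# Hypothetical backtest row: listings we would have notified about, in rank order,
# and listings the user later co-clicked or co-messaged according to the logs.
would_have_notified = ["bike_17", "bike_42", "bike_08", "bike_99"]
later_engaged = {"bike_42", "bike_99", "bike_55"}

p_at_3 = precision_at_k(would_have_notified, later_engaged, k=3)
```

Averaging this value over users and categories gives a rough read on whether the similarity rule is good enough to justify sending notifications at all.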
**Back-of-the-envelope sizing.** Let N = daily users who view listings, p = fraction with high intent but no interaction, r = fraction who would opt in, t = expected notifications per opted-in user per day, c = incremental click-through-to-view rate, and m = incremental message rate per notification-driven view. Then estimated incremental messages/day ≈ N · p · r · t · c · m. If this is tiny, deprioritize.
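
The sizing formula above can be written as a small funnel calculation. This is a sketch only: the function name is hypothetical and every input value is a made-up placeholder, not a real marketplace number.

```python
def incremental_messages_per_day(n_viewers, p_high_intent, r_opt_in,
                                 t_notifs_per_user, c_click_rate, m_message_rate):
    """Evaluate N * p * r * t * c * m, one funnel stage at a time."""
    fractions = {"p": p_high_intent, "r": r_opt_in, "c": c_click_rate, "m": m_message_rate}
    for name, value in fractions.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a fraction in [0, 1], got {value}")
    opted_in_users = n_viewers * p_high_intent * r_opt_in   # users receiving notifications
    notifications = opted_in_users * t_notifs_per_user      # notifications sent per day
    driven_views = notifications * c_click_rate             # incremental listing views
    return driven_views * m_message_rate                    # incremental messages


# Placeholder inputs for illustration only.
estimate = incremental_messages_per_day(
    n_viewers=1_000_000, p_high_intent=0.30, r_opt_in=0.10,
    t_notifs_per_user=0.5, c_click_rate=0.08, m_message_rate=0.05,
)
print(f"Estimated incremental messages/day: {estimate:.0f}")
```

With these placeholder inputs the estimate comes out at about 60 messages per day, which shows how quickly the funnel shrinks and why the opt-in rate and trigger rate deserve the most scrutiny.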
#### 3) Define “similar listing” and feasibility constraints
This is both a product and a data/ML problem:
- **Similarity definition:** category + price band + location radius + attributes (brand/size) + embeddings.
- **Cold start:** new users and sparse categories.
- **Latency and triggering:** real-time vs. batch; per-user/day caps.
#### 4) MVP / rollout if uncertain
- **MVP:** rule-based similarity (same category + price band + geo), opt-in on the listing page.
- **Safeguards:** frequency caps (e.g., ≤2/day), quiet hours, easy unsubscribe.
- **Phased rollout:** internal → 1% → 10% → 50% with monitoring.
- **Pre-registered success criteria:** decide ahead of time what lift and guardrail bounds are required.
---
### Part 2 — Measure impact after shipping
#### 1) Metric stack: primary, diagnostic, guardrails
Messages are an intermediate outcome, not the end goal, so use a metric hierarchy rather than optimizing a single engagement number such as CTR (CTR can rise while marketplace health falls if notifications are spammy).
**Primary (pick 1–2, pre-registered):**
- **Incremental purchases / GMV per active (or eligible) buyer** — best if reliably measured.
- If purchases are rare or delayed: **listing interactions (messages) per eligible user** over a fixed window (e.g., 7 days) as a proxy, validated against downstream purchase. A robustness option: count only messages in threads that pass an intent threshold (e.g., seller reply), to avoid rewarding low-quality inquiries.
**Diagnostic / secondary (explain the “why”):**
- Notification deliveries, open rate, CTR to listing, view-to-message rate.
- Funnel: notification → listing view → message → seller reply → purchase.
- Time-to-next-session after a listing view; sessions per user.
- Search-usage change (does the feature complement or cannibalize search?).
- Seller-side effects: messages received per seller, response rate, conversion rate.
**Guardrails (must not worsen):**
- Notification opt-out / settings-disable rate, mute rate.
- App uninstall rate; DAU among exposed users.
- Spam/report/block rate; support tickets.
- Notification volume per user (the distribution, not just the mean).
- Seller burden (response rate, seller churn) and marketplace fairness (e.g., exposure concentration / Gini of impressions).
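
The exposure-concentration guardrail (Gini of impressions) can be computed as follows. This is a minimal sketch, assuming per-listing impression counts are available as a plain list; the function name and the sample counts are illustrative.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 = perfectly even, near 1 = concentrated."""
    xs = sorted(values)
    if any(x < 0 for x in xs):
        raise ValueError("values must be non-negative")
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-based formula on ascending values, with ranks i = 1..n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Illustrative impression counts per listing in each arm.
even_exposure = gini([5, 5, 5, 5])           # every listing gets the same exposure
concentrated_exposure = gini([0, 0, 0, 10])  # one listing gets everything
```

Comparing this statistic between control and treatment indicates whether notifications funnel attention toward a small set of listings.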
#### 2) Preferred approach: randomized controlled experiment (A/B test)
**Unit of randomization:** the **user** (buyer), to avoid cross-session/device contamination.
- **Control:** no similar-listing notifications (feature hidden, or placebo messaging if needed).
- **Treatment:** feature enabled and notifications sent.
- **Eligibility / denominator:** define it clearly — e.g., users who viewed ≥ N listings in a category, or messaged a seller but did not transact, or saved items.
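
User-level randomization is commonly implemented with a deterministic hash, so the same buyer lands in the same arm on every session and device. The sketch below assumes a string user ID; the experiment name, function name, and 50/50 split are hypothetical choices.

```python
import hashlib


def assign_arm(user_id, experiment="similar_listing_notifs_v1", treatment_share=0.5):
    """Deterministically map a user to 'treatment' or 'control'."""
    # Salting with the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # first 32 bits mapped to [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# The same user always gets the same arm.
arm_first_call = assign_arm("user_12345")
arm_second_call = assign_arm("user_12345")

# Sanity check on the split over hypothetical user IDs.
arms = [assign_arm(f"user_{i}") for i in range(10_000)]
observed_share = arms.count("treatment") / len(arms)
```

A sample-ratio check like `observed_share` is worth monitoring during the test, since a skewed split usually points to a logging or eligibility bug.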
**Handling opt-in selection bias.** If the user must opt in, do **not** naively compare opt-in vs. non-opt-in users — that confounds the feature with user intent. Instead:
- **Encouragement design:** randomize who *sees* the opt-in prompt (or who is eligible), and measure the **ITT (intent-to-treat)** effect of offering the feature. This is the clean primary readout.
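- Optionally recover the **TOT** (effect on those who actually opt in) via instrumental variables, using the randomized prompt/eligibility as the instrument for opt-in. When the control group cannot opt in at all, this reduces to the ITT effect divided by the opt-in rate among prompted users. State the exclusion restriction explicitly: the prompt must affect outcomes only through opt-in.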
- Alternatively, randomize **notification sending** among already-opted-in users — cleaner for measuring notification value, but it does not measure the value of the opt-in UI itself.
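
The ITT readout and the instrumental-variable rescaling can be sketched with the Wald estimator (ITT divided by the difference in opt-in rates). This assumes per-user outcome totals and 0/1 opt-in flags for each randomized arm; the function name and the toy data are hypothetical, and a real analysis would add standard errors.

```python
from statistics import mean


def itt_and_tot(outcome_prompted, outcome_holdout, optin_prompted, optin_holdout=None):
    """Return (ITT, Wald estimate) for an encouragement design."""
    itt = mean(outcome_prompted) - mean(outcome_holdout)
    takeup_prompted = mean(optin_prompted)
    # If the holdout never sees the prompt, nobody there can opt in.
    takeup_holdout = mean(optin_holdout) if optin_holdout else 0.0
    first_stage = takeup_prompted - takeup_holdout
    if first_stage <= 0:
        raise ValueError("the prompt did not increase opt-in; the estimate is undefined")
    return itt, itt / first_stage


# Toy data: messages per user over the test window, and opt-in flags (1 = opted in).
messages_prompted = [2, 0, 1, 3, 0, 2]
messages_holdout = [1, 0, 1, 1, 0, 0]
opted_in = [1, 0, 0, 1, 0, 1]

itt, tot = itt_and_tot(messages_prompted, messages_holdout, opted_in)
```

The ITT answers "what happens if we offer the feature", while the rescaled number answers "what happens to users who take it up", and the latter only holds under the exclusion restriction.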
**Duration:** long enough to capture repeat visits and delayed purchases (typically 2–4 weeks minimum) and to see past novelty effects.
**Power / MDE:** size the test from the primary-metric variance. “Messages per user” is often zero-inflated, so consider a longer window, stratification by baseline activity, and **CUPED** (use pre-period messaging as a covariate) to cut variance. Use robust standard errors and compare per-user outcomes over the window.
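
A minimal CUPED sketch, assuming each user has a pre-period message count `x` and an in-experiment message count `y`. The coefficient is estimated on pooled data from both arms; the function name and toy numbers are illustrative.

```python
from statistics import mean, variance


def cuped_adjust(y, x):
    """Return y - theta * (x - mean(x)), with theta = cov(x, y) / var(x)."""
    x_bar, y_bar = mean(x), mean(y)
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    if ss_x == 0:
        return list(y)  # covariate carries no information; leave outcomes unchanged
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    theta = ss_xy / ss_x
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


# Toy pooled data: pre-period and in-experiment messages per user.
pre_messages = [0, 1, 2, 3, 4, 5]
exp_messages = [1, 1, 3, 4, 4, 6]

adjusted = cuped_adjust(exp_messages, pre_messages)
raw_variance = variance(exp_messages)
adjusted_variance = variance(adjusted)
```

The adjustment leaves the overall mean unchanged and shrinks variance in proportion to how well pre-period activity predicts in-experiment activity; the treatment effect is then the difference in adjusted means between arms.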
**Attribution:** rely on **user-level totals** (messages/purchases per user) to capture net lift; report notification-driven sessions only as interpretive color. Avoid crediting outcomes purely on last-click.
#### 3) Pitfalls and how to handle them
- **Interference / network effects:** a treated buyer messaging a seller changes seller behavior and inventory, which can spill over to control users. Mitigate with a **cluster-randomized (geo-market) sensitivity arm** or a holdout geo, and measure seller-level spillovers.
- **Seasonality / holidays:** always use a contemporaneous control; avoid pre/post without a control group.
- **Multiple testing:** pre-register the primary metric; adjust or clearly label exploratory metrics.
- **Fatigue over time:** examine the treatment effect by week (week 1 vs. week 4) and by notification-frequency bucket.
#### 4) If a clean A/B test isn’t possible
Fall back to quasi-experiments, naming residual confounding:
- **Difference-in-differences** with a staggered rollout across geos/platforms/time.
- **Interrupted time series** with a control series.
- **Regression discontinuity** if notifications trigger above a threshold (e.g., a saved-search count).
- **Propensity matching** only as a supplementary check. It is weak here because opt-in is driven by purchase intent that observed covariates rarely capture in full.
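
The simplest form of the difference-in-differences fallback is the two-group, two-period estimate. The sketch below assumes per-geo average outcomes before and after launch and relies on the parallel-trends assumption; the function name and the numbers are made up.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """(Change in treated geos) minus (change in control geos)."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Made-up purchases per 1,000 buyers, one value per geo.
did_estimate = diff_in_diff(
    treated_pre=[10, 12, 11], treated_post=[15, 16, 14],
    control_pre=[9, 10, 11], control_post=[11, 12, 13],
)
```

Here the treated geos rise by 4 and the control geos by 2, so the estimate is 2; plotting pre-period trends for both groups is the usual check on whether parallel trends is plausible.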
#### 5) Decision rule and segmentation
Set thresholds beforehand, e.g.: roll out if the **primary-metric lift** is statistically and practically meaningful (e.g., +1–2% purchases per buyer, or +X% messages) **and** guardrails stay within bounds (e.g., opt-out ≤ +0.2pp, uninstall not up, seller reply rate stable). Common readouts:
- **CTR up, messages/purchases flat:** clickbait / low-intent notifications — check view-to-message and seller reply.
- **Messages up, seller reply down / reports up:** low-quality inquiries — refine relevance, add friction (e.g., saved search), or cap frequency.
- **Short-term lift, long-term retention decline:** fatigue — enforce caps, personalization, snooze, category controls.
- **Heterogeneous effects:** segment by category supply density, price tier, intent (new vs. returning), and geography (urban vs. rural inventory density); roll out only to net-positive segments.
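
One way to make the pre-registered rule concrete is to encode it before the test starts. This is a sketch, not a prescribed policy: the class and field names are hypothetical, the default thresholds reuse the example values above, and the seller-reply tolerance is an assumed placeholder.

```python
from dataclasses import dataclass


@dataclass
class Readout:
    lift_pct: float               # point estimate of primary-metric lift, in %
    lift_ci_low_pct: float        # lower bound of its confidence interval, in %
    opt_out_delta_pp: float       # change in opt-out rate, percentage points
    uninstall_delta_pp: float     # change in uninstall rate, percentage points
    seller_reply_delta_pp: float  # change in seller reply rate, percentage points


def decide(r, min_lift_pct=1.0, max_opt_out_pp=0.2, reply_tolerance_pp=0.5):
    """Return 'roll out', 'iterate', or 'roll back' from pre-set thresholds."""
    # In practice each guardrail would be judged against a noise-aware bound,
    # not a raw point estimate.
    guardrails_ok = (
        r.opt_out_delta_pp <= max_opt_out_pp
        and r.uninstall_delta_pp <= 0.0
        and r.seller_reply_delta_pp >= -reply_tolerance_pp
    )
    if not guardrails_ok:
        return "roll back"
    if r.lift_ci_low_pct > 0.0 and r.lift_pct >= min_lift_pct:
        return "roll out"  # statistically and practically meaningful
    return "iterate"


decision = decide(Readout(lift_pct=1.5, lift_ci_low_pct=0.3, opt_out_delta_pp=0.1,
                          uninstall_delta_pp=0.0, seller_reply_delta_pp=0.0))
```

Running the same function per segment (category density, price tier, geography) supports a segment-level rollout when effects are heterogeneous.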
Overall, success is judged by **incremental marketplace outcomes** (purchases/GMV, match rate), supported by a healthy notification funnel and strong guardrail protection against fatigue and negative marketplace spillovers.
Explanation
Rubric: the strongest answers (1) tie success to marketplace health (liquidity/GMV/match rate) rather than CTR, (2) lay out a primary/diagnostic/guardrail metric hierarchy, (3) propose a user-randomized A/B test and explicitly defuse opt-in selection bias via an encouragement/ITT design, and (4) name marketplace-specific threats — interference/network effects, fatigue, cannibalization, seasonality — with concrete mitigations and a pre-registered decision rule.