Measure feature impact with switchback, PSM, and CACE
Company: Uber
Role: Data Scientist
Category: Analytics & Experimentation
Difficulty: easy
Interview Round: Technical Screen
You work at a ridesharing company and want to measure the impact of a new **membership** feature on **rides-per-user (RPU)**. Across the parts below you will measure this effect under three different evidentiary situations: a switchback experiment, an observational launch with no experiment, and a randomized experiment with non-compliance.
For each measurement approach, you are expected to **state the key assumptions, name the likely pitfalls, and propose at least one robustness or sensitivity check**.
### Constraints & Assumptions
- The outcome of interest is **rides-per-user (RPU)**; in Part C you also estimate impact on **profit-per-user (PPU)**, a derived metric (e.g. revenue per user minus cost per user).
- The marketplace is two-sided and subject to **interference**: a user's experience (price, ETA, availability) depends on other users and on the supply of drivers, so under naive user-level randomization treatment effects can spill over across units. *Exception:* in **Part B you may assume supply is unlimited**, so supply constraints do not confound the outcome through availability.
- Treatment in Part A is the membership *experience/eligibility* being switched on or off at the (city × time) level; in Parts B and C treatment is an individual user actually *holding* membership.
- You have access to standard pre-treatment user history (past rides, spend, tenure, app activity, geography) and city-day operational signals (weather, surge, marketing spend).
### Clarifying Questions to Ask
- What exactly is the decision the measurement will inform — a global launch/no-launch call, or sizing the expected lift for forecasting?
- Is membership **reversible** (a per-trip benefit that can be toggled) or a **sticky enrollment** state that persists once a user joins? This determines whether a switchback is even valid.
- What is the **minimum detectable effect** and budget (number of cities, weeks of runway) we are working with?
- How is RPU defined — over what window, over which denominator (all eligible users, active users, or members)?
- Is the estimand of interest the effect on **everyone eligible** (ATE) or on **the users who actually take up membership** (ATT / effect on members)?
### Part A — Switchback experimentation
You run a **switchback experiment** with randomization at **(day × city)** granularity.
1. Propose a practical switchback design: what unit is randomized, the assignment scheme over time, and the duration.
2. Explain why you generally **should not** take the raw aggregated city-day results and run a vanilla two-sample **t-test**.
3. Describe an analysis approach that simultaneously accounts for time trends / seasonality, city-level heterogeneity, autocorrelation induced by switchbacks, and covariate adjustment.
```hint Why switchback at all
Ask what problem switchback solves that user-level randomization does not. The key word is *interference*: in a marketplace, one user's treatment changes prices/ETAs for others, breaking the independence of units. Randomizing a whole city for a time block makes the unit self-contained — what does that imply for how the marketplace equilibrium is captured?
```
```hint The t-test trap
A vanilla t-test assumes i.i.d. observations. Think about what is violated: outcomes on adjacent days within the same city are not independent of each other, and cities have very different baseline RPU. How do both of those facts break the t-test's guarantees, and in which direction does each failure push your error rate?
```
```hint Estimator direction
You need a model that can simultaneously absorb city-level baselines, calendar-level shocks, and within-city temporal dependence. Think about what class of panel regression handles fixed differences across units and fixed differences across time periods — and what kind of standard errors respect within-unit serial correlation.
```
```hint Reversibility pitfall
Before any of this, sanity-check whether membership is *reversible*. What happens to a "control" block if users who joined during a prior treated block remain members? Is that a problem for the design, and how would you address it?
```
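One possible way to make the design question concrete is a schedule that randomizes arms within each city while balancing every weekday across weeks. The sketch below is an illustration only: the function name, the city labels, and the four-week horizon are assumptions, not part of the question. It also assumes the membership experience is reversible at the day level and does not model washout periods.

```python
import random

def switchback_schedule(cities, n_weeks=4, seed=0):
    """Return {(city, day_index): arm} with arm 1 = treated, 0 = control.

    Within each city, every weekday is treated in exactly half of the weeks,
    so weekday seasonality is balanced across arms by construction.
    """
    if n_weeks % 2:
        raise ValueError("n_weeks must be even so each weekday splits evenly")
    rng = random.Random(seed)
    schedule = {}
    for city in cities:
        for weekday in range(7):
            # Half of the weeks are treated for this weekday, in random order.
            arms = [1] * (n_weeks // 2) + [0] * (n_weeks // 2)
            rng.shuffle(arms)
            for week, arm in enumerate(arms):
                schedule[(city, week * 7 + weekday)] = arm
    return schedule

# Hypothetical city labels, used only for illustration.
schedule = switchback_schedule(["city_a", "city_b", "city_c"], n_weeks=4, seed=42)
treated_days = sum(schedule[("city_a", day)] for day in range(28))
print(f"city_a treated days: {treated_days} of 28")
```

Because each city draws its own sequence, cities switch on different days, which helps separate the treatment effect from calendar shocks that hit all cities at once. If carryover is a concern, one option is to drop or down-weight the first day after each switch in the analysis.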
#### What This Part Should Cover
- A concrete design: randomized/balanced time sequence within each city (not a single switch), multiple alternations, duration spanning weekday/weekend seasonality, and an explicit stance on washout/carryover.
- Correctly names the t-test failures: serial autocorrelation (understated SEs → false positives), city heterogeneity, and unequal exposure/weighting.
- A two-way fixed-effects (or mixed-effects) model with cluster-robust SEs at the city level, plus user-volume weighting and a diagnostic (pre-trend check, holiday exclusion).
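The two-way fixed-effects model with city-clustered standard errors named above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions: a balanced city-by-day panel, simulated data with made-up parameters, a single treatment indicator, and no volume weighting or extra covariates. The function name `twfe_cluster` is hypothetical.

```python
import numpy as np

def twfe_cluster(y, d):
    """Two-way fixed-effects estimate on a balanced city x day panel.

    y, d: arrays of shape (n_cities, n_days). Returns (tau_hat, se), where se
    is cluster-robust with one cluster per city.
    """
    def demean(a):
        # Two-way within transformation: removes city and day fixed effects.
        a = np.asarray(a, dtype=float)
        return (a - a.mean(axis=1, keepdims=True)
                  - a.mean(axis=0, keepdims=True) + a.mean())

    y_t, d_t = demean(y), demean(d)
    sxx = (d_t ** 2).sum()
    tau = (d_t * y_t).sum() / sxx
    resid = y_t - tau * d_t
    # Sum scores within each city so arbitrary serial correlation inside a
    # city is allowed; cities are treated as independent of each other.
    scores = (d_t * resid).sum(axis=1)
    g = y_t.shape[0]
    var = g / (g - 1) * (scores ** 2).sum() / sxx ** 2
    return tau, np.sqrt(var)

# Simulated panel (all numbers are invented for illustration).
rng = np.random.default_rng(0)
n_cities, n_days, true_tau = 40, 28, 0.3
d = rng.integers(0, 2, size=(n_cities, n_days))      # randomized city-day arms
city_fe = rng.normal(5.0, 2.0, size=(n_cities, 1))   # baseline RPU per city
day_fe = rng.normal(0.0, 0.5, size=(1, n_days))      # calendar shocks
noise = np.zeros((n_cities, n_days))
noise[:, 0] = rng.normal(0.0, 0.3, size=n_cities)
for t in range(1, n_days):                            # AR(1) errors within city
    noise[:, t] = 0.6 * noise[:, t - 1] + rng.normal(0.0, 0.3, size=n_cities)
y = city_fe + day_fe + true_tau * d + noise

tau_hat, se = twfe_cluster(y, d)
print(f"tau_hat={tau_hat:.3f}, cluster-robust SE={se:.3f}")
```

Cluster-robust standard errors rely on having a reasonably large number of clusters. With only a handful of cities, a wild cluster bootstrap or randomization inference over the assignment schedule is often suggested as a more reliable alternative.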
### Part B — No A/B test available (observational measurement)
Assume the membership feature was launched **without an A/B test**, and assume **supply is unlimited** (so availability does not confound the outcome).
1. How would you estimate the causal impact of membership on **RPU** using **propensity score matching (PSM)**, or a closely related propensity-score method?
2. How would you assess whether your matching/weighting is "good enough" to trust the estimate?
```hint Identifying assumption
What load-bearing assumption makes the comparison of members vs. non-members interpretable as causal? Members self-select, so you must think carefully about which variables drive *both* adoption and RPU — and the flip side: what must be true about the probability of treatment for every covariate profile?
```
```hint Feature hygiene
The single biggest pitfall involves the timing of the features you use in the propensity model. Consider: if a variable was measured after a user joined, what role does it play in the causal graph relative to treatment and outcome? Why does including it bias the estimate rather than help it?
```
```hint Beyond plain matching
Matching on the propensity score is one option. What other propensity-score-based estimators exist, and how do they differ in their robustness properties? If you happen to have pre-period outcome data, what design upgrade would additionally cancel time-invariant unobservables?
```
```hint "Good enough" diagnostics
You cannot prove the identifying assumption holds. What observable quantities would give you converging evidence that the adjustment worked? Think about the covariate distributions between matched groups, a placebo test on a pre-treatment window where the true effect is known to be zero, and how you'd quantify robustness to unobserved confounding.
```
#### What This Part Should Cover
- A clear estimand (ATT vs ATE) and a propensity model built strictly from pre-treatment confounders, with explicit warning against post-treatment / collider features.
- The choice among matching, IPW, doubly-robust, and PSM+DiD, with a reason.
- A concrete validation bundle: overlap, covariate balance (SMD), placebo/negative-control tests, and an unmeasured-confounding sensitivity analysis — plus the honest caveat that observational identification rests on an untestable assumption.
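As a rough illustration of the propensity workflow above, the sketch below estimates the ATT by weighting non-members with odds weights and then checks overlap and covariate balance via the standardized mean difference (SMD). It uses weighting rather than one-to-one matching for brevity. Everything here is an assumption for demonstration: the data are simulated with a single pre-treatment confounder `x` (think standardized pre-period rides), the true effect is set to 0.5, and the helper names are hypothetical.

```python
import numpy as np

def fit_propensity(X, d, n_iter=25):
    """Logistic regression by Newton's method; returns fitted P(D=1 | X)."""
    X1 = np.column_stack([np.ones(len(d)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        grad = X1.T @ (d - p)
        hess = (X1 * (p * (1.0 - p))[:, None]).T @ X1
        beta += np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-X1 @ beta))

def att_ipw(y, d, p):
    """ATT via weighting: controls receive odds weights p / (1 - p)."""
    w = p[d == 0] / (1.0 - p[d == 0])
    return y[d == 1].mean() - np.average(y[d == 0], weights=w)

def smd(x, d, w_control=None):
    """Standardized mean difference of x: members vs (weighted) controls."""
    xt, xc = x[d == 1], x[d == 0]
    pooled_sd = np.sqrt((xt.var(ddof=1) + xc.var(ddof=1)) / 2.0)
    return (xt.mean() - np.average(xc, weights=w_control)) / pooled_sd

# Simulated observational data (all parameters invented for illustration).
rng = np.random.default_rng(1)
n, true_att = 20_000, 0.5
x = rng.normal(size=n)                                # pre-treatment confounder
p_true = 1.0 / (1.0 + np.exp(-(-0.5 + x)))            # heavier riders join more
d = (rng.random(n) < p_true).astype(int)
y = 2.0 + 1.5 * x + true_att * d + rng.normal(size=n)

p_hat = fit_propensity(x, d)
w_c = p_hat[d == 0] / (1.0 - p_hat[d == 0])
naive = y[d == 1].mean() - y[d == 0].mean()
att = att_ipw(y, d, p_hat)
smd_before, smd_after = smd(x, d), smd(x, d, w_c)

print(f"overlap: p_hat in [{p_hat.min():.3f}, {p_hat.max():.3f}]")
print(f"naive={naive:.3f}, IPW ATT={att:.3f}")
print(f"SMD before={smd_before:.3f}, after={smd_after:.3f}")
```

A common rule of thumb treats an absolute SMD below roughly 0.1 as acceptable balance. Good balance on observed covariates says nothing about unobserved confounders, which is why the placebo and sensitivity checks listed above remain necessary.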
### Part C — A/B test exists but with non-compliance
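Now assume you ran a **user-level randomized experiment**, but assignment does not determine take-up: some users assigned to treatment never join membership (one-sided non-compliance), and possibly some users assigned to control obtain membership anyway (two-sided non-compliance).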
1. How would you estimate the causal effect on RPU using the **Complier Average Causal Effect (CACE / LATE)**?
2. How would you compute a **confidence interval** for (a) the impact on RPU and (b) the impact on **profit-per-user (PPU)**, a derived metric?
```hint CACE vs ITT
Random assignment $Z$ affects take-up $D$ but doesn't force it. Think about what causal quantity you can identify from this setup, how it differs from simply comparing assigned-treatment vs. assigned-control groups, and what assumptions you need about the relationship between $Z$, $D$, and $Y$ to make that identification work.
```
```hint It's a ratio
The CACE estimator is a ratio of two estimated quantities. Why does that matter for constructing a confidence interval — and what happens to the stability of the ratio when the denominator is small?
```
```hint Derived metric PPU
PPU combines revenue and cost components at the user level. Think about whether it is cleaner to construct a single per-user number first and then run inference on it, versus estimating component means separately and combining — and what covariance structure that choice affects.
```
#### What This Part Should Cover
- The IV/LATE setup with its core assumptions (random assignment, a non-zero first stage, exclusion restriction, monotonicity) and the Wald estimator ITT$_Y$ / ITT$_D$, which coincides with 2SLS when the instrument is binary and there are no covariates.
- Distinguishing the **ITT** estimand (effect of being offered) from **CACE** (effect on compliers), and noting that ITT is the right number for some launch decisions.
- Correct CI machinery for a **ratio** estimand (2SLS robust SE / delta method / bootstrap) and a clean treatment of the derived PPU metric via per-user aggregation, including a first-stage strength check.
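The ratio-aware inference described above can be sketched as a Wald estimator with a delta-method standard error. The variance uses the linearization $\widehat{\mathrm{Var}}(\hat\tau) \approx \widehat{\mathrm{Var}}\big(\overline{r}_{Z=1} - \overline{r}_{Z=0}\big) / \widehat{\mathrm{ITT}}_D^{2}$ with $r_i = Y_i - \hat\tau D_i$. For PPU, the sketch builds one profit number per user first and then reuses the same function, so the covariance between revenue and cost is handled automatically. The data are simulated with one-sided non-compliance, and the function name, take-up rate, and effect sizes are all assumptions made for illustration.

```python
import numpy as np

def wald_cace(y, d, z, z_crit=1.96):
    """Wald / IV estimate of the CACE with a delta-method confidence interval.

    y: per-user outcome, d: take-up (0/1), z: random assignment (0/1).
    Returns (cace, se, (ci_low, ci_high), itt_d).
    """
    y, d, z = (np.asarray(a, dtype=float) for a in (y, d, z))
    t, c = z == 1, z == 0
    itt_y = y[t].mean() - y[c].mean()      # intent-to-treat effect on outcome
    itt_d = d[t].mean() - d[c].mean()      # first stage: take-up gap
    cace = itt_y / itt_d
    # Delta method: variance of the difference in means of r = y - cace * d,
    # scaled by the squared first stage.
    r = y - cace * d
    var = (r[t].var(ddof=1) / t.sum() + r[c].var(ddof=1) / c.sum()) / itt_d ** 2
    se = np.sqrt(var)
    return cace, se, (cace - z_crit * se, cace + z_crit * se), itt_d

# Simulated experiment (all parameters invented for illustration).
rng = np.random.default_rng(2)
n = 20_000
z = rng.integers(0, 2, size=n)                 # random assignment
complier = rng.random(n) < 0.3                 # 30% would join if offered
d = z * complier                               # control users cannot join
rides = 4.0 + 1.0 * complier + 1.0 * d + rng.normal(size=n)
# Per-user profit built first: hypothetical margin of 3 per ride minus a
# hypothetical benefit cost of 2 per member, plus noise.
profit = 3.0 * rides - 2.0 * d + rng.normal(size=n)

cace_rpu, se_rpu, ci_rpu, first_stage = wald_cace(rides, d, z)
cace_ppu, se_ppu, ci_ppu, _ = wald_cace(profit, d, z)
print(f"first stage (take-up gap) = {first_stage:.3f}")
print(f"CACE on RPU = {cace_rpu:.3f}, 95% CI = ({ci_rpu[0]:.3f}, {ci_rpu[1]:.3f})")
print(f"CACE on PPU = {cace_ppu:.3f}, 95% CI = ({ci_ppu[0]:.3f}, {ci_ppu[1]:.3f})")
```

The delta-method interval leans on the first stage being well away from zero. When the take-up gap is small, a bootstrap or a weak-instrument-robust interval such as Anderson-Rubin or Fieller is usually a safer choice, and reporting the ITT alongside the CACE keeps the result interpretable.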
### What a Strong Answer Covers
These dimensions span all three parts.
- **Estimand discipline:** the candidate states *what* causal quantity (ATE / ATT / ITT / CACE) is being estimated and matches it to the business decision, rather than reporting an undefined "lift".
- **Assumption honesty:** each method's identifying assumption is named (interference/SUTVA in A, unconfoundedness+overlap in B, exclusion+monotonicity in C), and the candidate is explicit about which are testable vs untestable.
- **Inference rigor:** standard errors respect the data-generating structure (clustering for serial correlation; ratio-aware inference for CACE; per-user aggregation for derived metrics).
- **A robustness/sensitivity check per approach**, as the prompt explicitly requires.
### Follow-up Questions
1. Suppose in Part A you find membership lift is much larger in a handful of dense cities. How would you test whether this is real treatment-effect heterogeneity versus an artifact of unbalanced switchback assignment?
2. In Part B, your placebo test on a pre-period outcome comes back significantly non-zero. What does that tell you, and what do you do next?
3. In Part C, the first-stage take-up gap $\mathbb{E}[D\mid Z=1]-\mathbb{E}[D\mid Z=0]$ is only a few percentage points. How does this affect your CACE estimate and its confidence interval, and how would you communicate the result to stakeholders?
4. Suppose the unlimited-supply assumption from Part B is dropped, so marketplace interference is present throughout. How would a finite driver supply change your approach across all three parts?
Quick Answer: This question evaluates proficiency in causal inference and experimental design for product metrics, specifically testing knowledge of switchback experiments, time-series adjustments (seasonality and autocorrelation), propensity score methods for observational comparisons, and complier-average causal effect (LATE/CACE) estimation; it belongs to the Analytics & Experimentation and Data Science domain. It is commonly asked because it probes handling of real-world complications—time and city heterogeneity, non-compliance, and derived-metric inference—and tests both conceptual understanding of causal assumptions and practical application of statistical adjustment and inference.