A/B Testing And Experiment Design

What's being tested

PayPal is testing whether you can design, diagnose, and communicate online experiments where the business outcome is not just clicks, but money movement, fraud risk, user trust, and regulatory sensitivity. A strong Data Scientist should define the causal question, choose the right randomization unit, select primary and guardrail metrics, estimate power, analyze heterogeneous effects, and make a launch recommendation under uncertainty. Interviewers are probing whether you can move beyond “compare treatment vs. control p-value” into real product experimentation: interference, rare-event metrics, sequential monitoring, noisy revenue outcomes, and stakeholder-ready tradeoff framing. For PayPal specifically, experiment mistakes can mean lost checkout conversion, increased account-takeover exposure, false declines, promotional overspend, or biased treatment of customer segments.

Core knowledge

Randomization unit is often the most important design choice. For checkout cashback, user-level randomization may work; for an ATO rule, account-, device-, merchant-, or network-cluster randomization may be needed to avoid spillovers where fraudsters adapt across accounts or merchants.
Primary metric should map to the decision. For cashback, candidates might choose incremental checkout_conversion, TPV, net_revenue, or profit = take_rate * TPV - cashback_cost - fraud_loss. For an ATO rule, use avoided fraud loss or confirmed takeover rate, but include false-positive friction.
Guardrail metrics protect against local optimization. At PayPal, likely guardrails include login_success_rate, checkout_success_rate, false_decline_rate, customer_support_contacts, dispute_rate, chargeback_rate, latency, and segment-level degradation for new users, high-value users, or specific geographies.
Power and minimum detectable effect connect statistics to business stakes. For a two-sample proportion test, a common approximation for the sample size per arm is:
$n \approx \frac{2(z_{1-\alpha/2}+z_{1-\beta})^2p(1-p)}{\Delta^2}$
where $n$ is the number of units in each arm, $p$ is the baseline rate, and $\Delta$ is the absolute detectable lift. For rare fraud outcomes, required sample size can become impractically large, pushing you toward longer tests, composite metrics, variance reduction, or quasi-experimental evidence.
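As a rough illustration of the approximation above, the following sketch computes the per-arm sample size with only the Python standard library. The function name `per_arm_sample_size` and the baseline rates and lifts are illustrative assumptions, not PayPal figures.

```python
from math import ceil
from statistics import NormalDist


def per_arm_sample_size(p, delta, alpha=0.05, power=0.80):
    """Approximate units needed in EACH arm for a two-sided two-proportion test.

    p     : baseline rate (assumed similar in both arms)
    delta : absolute lift to detect, e.g. 0.005 for +0.5 pp
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1 - alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # z_{1 - beta}
    return ceil(2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / delta ** 2)


# Hypothetical inputs: a 5% conversion metric and a 0.1% fraud metric,
# each targeting a 10% relative change.
common_n = per_arm_sample_size(p=0.05, delta=0.005)
rare_n = per_arm_sample_size(p=0.001, delta=0.0001)
print(f"conversion metric: {common_n:,} per arm")
print(f"rare fraud metric: {rare_n:,} per arm")
```

Under these assumed inputs the rare metric needs roughly 50 times more units per arm than the conversion metric for the same relative change, which is the practical reason rare-event tests run longer or lean on proxies.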
Variance reduction can materially improve sensitivity. CUPED uses pre-period behavior as a covariate:
$Y_{adj}=Y-\theta(X-\bar X), \quad \theta=\frac{Cov(Y,X)}{Var(X)}$
This works well for stable user-level metrics like historical TPV, prior checkout activity, or baseline fraud-risk score, but less well for brand-new users.
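A minimal sketch of the CUPED adjustment above, run on synthetic data. The helper name `cuped_adjust`, the simulated pre-period covariate, and the strength of the relationship are all assumptions chosen for illustration.

```python
import random
from statistics import mean, variance


def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes and the estimated theta = Cov(Y, X) / Var(X)."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    y_adj = [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]
    return y_adj, theta


# Synthetic data (assumption): pre-period TPV x predicts in-experiment TPV y.
random.seed(7)
x = [random.gauss(100, 30) for _ in range(5000)]
y = [0.8 * xi + random.gauss(0, 10) for xi in x]

y_adj, theta = cuped_adjust(y, x)
print(f"theta = {theta:.3f}")
print(f"variance before: {variance(y):.1f}, after: {variance(y_adj):.1f}")
```

The adjustment leaves the mean unchanged and removes the variance explained by the covariate. In a real experiment, $\theta$ and $\bar X$ are usually estimated on data pooled across arms so the adjustment does not bias the treatment contrast.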
Sequential monitoring must be planned before launch. Repeatedly checking p-values inflates Type I error. Use pre-specified looks with alpha spending, group sequential testing, or a Bayesian decision framework; do not say “we will stop when p < 0.05” unless you explicitly control false positives.
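One way to pre-specify looks is an alpha-spending function. The sketch below uses the Lan-DeMets O'Brien-Fleming-type function, $\alpha(t) = 2\,[1 - \Phi(z_{1-\alpha/2}/\sqrt{t})]$, where $t$ is the fraction of planned information collected. The four equally spaced looks are a hypothetical schedule, and the function name is an assumption.

```python
from math import sqrt
from statistics import NormalDist

_NORMAL = NormalDist()


def obf_type_alpha_spent(t, alpha=0.05):
    """Cumulative two-sided alpha spent at information fraction t (0 < t <= 1)."""
    z = _NORMAL.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _NORMAL.cdf(z / sqrt(t)))


looks = [0.25, 0.50, 0.75, 1.00]  # hypothetical pre-registered looks
spent_so_far = 0.0
for t in looks:
    cumulative = obf_type_alpha_spent(t)
    print(f"t={t:.2f}  cumulative alpha={cumulative:.5f}  "
          f"incremental={cumulative - spent_so_far:.5f}")
    spent_so_far = cumulative
```

This schedule spends very little alpha at early looks and reaches the full 0.05 only at the final analysis. Converting spent alpha into stopping boundaries requires accounting for the correlation between looks, which group-sequential software handles and which this sketch does not attempt.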
Sample ratio mismatch is a mandatory integrity check. If the intended split is 50/50, test observed assignment counts using a chi-square test before interpreting outcomes. SRM can indicate assignment bugs, eligibility drift, bot filtering asymmetry, or post-treatment exclusion bias.
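The SRM check described above can be sketched for two arms as follows. For one degree of freedom the chi-square tail probability reduces to `erfc(sqrt(chi2 / 2))`, so no external library is needed. The function name, the counts, and the 0.001 alert threshold are assumptions; teams choose their own threshold.

```python
from math import erfc, sqrt


def srm_check(n_control, n_treatment, expected_treatment_share=0.5, threshold=0.001):
    """Chi-square goodness-of-fit test on assignment counts (two arms, 1 df).

    Returns (chi2, p_value, srm_flag).
    """
    total = n_control + n_treatment
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    p_value = erfc(sqrt(chi2 / 2))  # survival function of chi-square with 1 df
    return chi2, p_value, p_value < threshold


# Hypothetical assignment counts for an intended 50/50 split
for counts in [(50_000, 50_600), (50_000, 51_200)]:
    chi2, p_value, flagged = srm_check(*counts)
    print(f"{counts}: chi2={chi2:.2f}, p={p_value:.5f}, SRM flagged={flagged}")
```

A flagged result is a reason to pause outcome analysis and debug assignment, eligibility, and filtering, not a result to adjust for after the fact.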
Intent-to-treat analysis preserves causal validity. Analyze users based on assigned variant, not only those who saw or used the feature. A “cashback exposed users only” analysis can be biased because exposure is often affected by treatment, checkout behavior, device, or merchant routing.
Treatment effect estimation should include uncertainty and business magnitude. Report absolute lift, relative lift, confidence interval, p-value or posterior probability, and translated dollars. “Conversion increased 0.12 pp” is weaker than “+$1.8M annualized net profit, 95% CI [$0.4M, $3.2M], with no guardrail breach.”
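A hedged sketch of that reporting pattern for a conversion metric: compute the lift, a Wald confidence interval, and a z-test p-value, then scale each to dollars. All counts, the annual volume, and the profit per conversion are hypothetical inputs, and the linear dollar translation assumes the effect holds at full scale.

```python
from math import sqrt
from statistics import NormalDist


def lift_summary(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Absolute and relative lift, Wald CI on the difference, pooled z-test p-value."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    # Unpooled standard error for the confidence interval
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    # Pooled standard error for the two-sided hypothesis test
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se_pool))
    return {"abs_lift": diff, "rel_lift": diff / p_c, "ci": ci, "p_value": p_value}


def to_dollars(lift, annual_eligible_checkouts=50_000_000, net_profit_per_conversion=1.20):
    """Scale an absolute conversion lift to annualized net profit (assumed inputs)."""
    return lift * annual_eligible_checkouts * net_profit_per_conversion


res = lift_summary(conv_c=5_000, n_c=100_000, conv_t=5_300, n_t=100_000)
low, high = res["ci"]
print(f"lift: {res['abs_lift'] * 100:.2f} pp ({res['rel_lift']:.1%} relative), "
      f"p = {res['p_value']:.4f}")
print(f"annualized: ${to_dollars(res['abs_lift']):,.0f} "
      f"(95% CI ${to_dollars(low):,.0f} to ${to_dollars(high):,.0f})")
```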
Heterogeneous treatment effects matter, but subgroup analysis is dangerous if unplanned. Segment by risk score, tenure, geography, platform, merchant category, transaction size, or new vs. returning users, but control for multiple comparisons with Benjamini-Hochberg, Bonferroni, or hierarchical modeling, or treat segment results as directional unless they were pre-registered.
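To make the multiple-comparisons point concrete, here is a small sketch of the Benjamini-Hochberg step-up procedure applied to per-segment p-values. The segment names and p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * q / m
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical segment-level p-values
segments = ["new_users", "high_value", "mobile", "cross_border", "low_risk"]
p_values = [0.001, 0.008, 0.039, 0.041, 0.200]
for name, p, keep in zip(segments, p_values, benjamini_hochberg(p_values)):
    print(f"{name:13s} p={p:.3f}  significant after BH: {keep}")
```

In this made-up example, two segments that would pass an unadjusted 0.05 threshold (p = 0.039 and p = 0.041) no longer count as discoveries after the correction.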
Interference and network effects are common in payments. Merchant-level promotions can affect both treatment and control users through shared checkout pages; fraud rules can change attacker behavior across accounts. If SUTVA is violated, consider cluster randomization, geo-level rollout, switchback designs, or difference-in-differences.
Decision quality is not identical to statistical significance. A small p-value on gross conversion may still be a bad launch if cashback cost exceeds margin. Conversely, a non-significant but directionally strong fraud reduction with low downside may justify a ramp, especially if the cost of waiting is high and guardrails are clean.

Worked example

For “Design an A/B for ATO rule”, start by clarifying what the new rule does: does it block logins, step up authentication, hold transactions, or only add risk scoring? Then define the causal estimand: the effect of enabling the rule on confirmed account-takeover losses, customer friction, and net business value among eligible traffic. A strong answer would organize around four pillars: randomization, metrics, power, and monitoring. For randomization, you would flag that user-level randomization may be insufficient if fraudsters reuse devices, IPs, funding instruments, or merchants; a cluster-aware design by risk cluster, device graph, or merchant/account family may better prevent contamination. For metrics, use a primary value metric such as net avoided loss minus friction cost, with guardrails like login_success_rate, checkout_conversion, false_positive_rate, and support contacts. For power, explain that confirmed ATO is rare and delayed, so you may need a longer test, a high-risk eligible population, variance reduction using pre-period risk scores, or proxy metrics like step-up challenge success while still validating against confirmed loss. A key tradeoff is safety versus statistical purity: if the rule likely prevents severe fraud, you may not want a 50/50 global holdout; instead, use a smaller control, high-risk ramp, or sequential design with pre-defined stopping rules. Close by saying that if you had more time, you would add segment-level fairness checks, delayed outcome windows for disputes, and sensitivity analysis for underreported fraud labels.
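The cluster-aware assignment and smaller control group described above could be sketched with deterministic hashing. Everything here is hypothetical: the salt, the 10% control share, the cluster identifiers, and the idea that an upstream job has already mapped accounts to device-graph clusters.

```python
import hashlib


def assign_variant(cluster_id, salt="ato_rule_v1", control_share=0.10):
    """Deterministically map a cluster (not an individual account) to a variant."""
    digest = hashlib.sha256(f"{salt}:{cluster_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # approximately uniform in [0, 1)
    return "control" if bucket < control_share else "treatment"


# Hypothetical mapping from accounts to device-graph clusters
account_to_cluster = {
    "acct_1": "cluster_A",
    "acct_2": "cluster_A",  # shares a device with acct_1
    "acct_3": "cluster_B",
}
assignments = {acct: assign_variant(cluster) for acct, cluster in account_to_cluster.items()}
print(assignments)
```

Because the hash input is the cluster, accounts that share a cluster always land in the same arm, which limits contamination. The cost is fewer independent units, so the analysis should treat the cluster as the unit when computing standard errors.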

A second angle

For “Design and Analyze A/B Test for Cashback Program”, the same experimental discipline applies, but the dominant risks shift from security harm to promotional incrementality and margin. Randomization is more likely to be user-level or session-level, but you must prevent contamination from users seeing the offer on one device and converting on another. The primary metric should not be raw conversion alone; it should account for incremental TPV, PayPal take rate, cashback expense, and possible cannibalization of transactions that would have happened anyway. Analysis should separate short-term checkout lift from longer-term retention, repeat purchase, and reward liability. Unlike an ATO rule, where rare labels and safety dominate, cashback experiments often hinge on unit economics, heterogeneous response by user value, and whether lift persists after the promotion ends.
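The unit-economics point can be shown with simple arithmetic. This sketch assumes cashback is paid on all treatment TPV, including purchases that would have happened anyway, and uses a made-up 3% take rate and 1% cashback rate.

```python
def incremental_profit_per_user(tpv_control, tpv_treatment, take_rate,
                                cashback_rate, fraud_loss_delta=0.0):
    """Per-user profit of the cashback arm relative to control.

    Cashback is assumed to apply to ALL treatment TPV, so non-incremental
    purchases still incur the reward cost (cannibalization).
    """
    incremental_revenue = take_rate * (tpv_treatment - tpv_control)
    cashback_cost = cashback_rate * tpv_treatment
    return incremental_revenue - cashback_cost - fraud_loss_delta


def break_even_relative_lift(take_rate, cashback_rate):
    """Relative TPV lift at which incremental revenue equals cashback cost."""
    if take_rate <= cashback_rate:
        raise ValueError("no break-even lift exists when cashback_rate >= take_rate")
    # take_rate * lift = cashback_rate * (1 + lift)
    return cashback_rate / (take_rate - cashback_rate)


# Hypothetical averages: a 4% TPV lift looks like a win on gross metrics
profit = incremental_profit_per_user(100.0, 104.0, take_rate=0.03, cashback_rate=0.01)
print(f"incremental profit per user: ${profit:.2f}")
print(f"break-even TPV lift: {break_even_relative_lift(0.03, 0.01):.0%}")
```

Under these assumptions a 4% TPV lift loses money per user, and the program would need a 50% lift to break even, which is why targeting the offer to incremental users or capping rewards often matters more than the headline lift.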

Common pitfalls

Pitfall: Treating every experiment as a simple 50/50 user-level test.

This misses interference, delayed outcomes, and operational risk. A better answer explicitly asks whether users, devices, merchants, or fraud networks can affect each other, then chooses the randomization unit accordingly and computes standard errors at that same level (for example, cluster-robust standard errors when assignment is by cluster).
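One quick way to quantify what cluster randomization costs is the Kish design effect for equal-sized clusters, $1 + (m - 1)\rho$, where $m$ is the cluster size and $\rho$ is the intra-cluster correlation. The cluster size and correlation below are assumed values, not estimates from real data.

```python
from math import sqrt


def design_effect(avg_cluster_size, icc):
    """Kish design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (avg_cluster_size - 1) * icc


def effective_sample_size(n_units, avg_cluster_size, icc):
    """Number of independent units the clustered sample is roughly worth."""
    return n_units / design_effect(avg_cluster_size, icc)


# Hypothetical: 200,000 users grouped into merchant clusters of 50, ICC = 0.02
deff = design_effect(50, 0.02)
print(f"design effect: {deff:.2f}")
print(f"effective sample size: {effective_sample_size(200_000, 50, 0.02):,.0f}")
print(f"standard error inflation: {sqrt(deff):.2f}x")
```

Even a small intra-cluster correlation roughly halves the effective sample size here, so analyzing clustered assignments with user-level standard errors would overstate confidence.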

Pitfall: Optimizing for a single attractive metric like checkout_conversion.

At PayPal, conversion gains can be bought with excessive cashback, higher fraud loss, or worse customer trust. Strong candidates define a decision metric and guardrails together, then translate the observed effect into expected net dollars and risk.

Pitfall: Over-communicating statistical mechanics and under-communicating the recommendation.

Interviewers want rigor, but they also want a launch decision. Do not stop at “p = 0.03”; say whether you would launch, ramp, iterate, or hold, and explain the confidence interval, downside risk, and unresolved checks.

Connections

Interviewers may pivot from this topic into causal inference, especially difference-in-differences, matching, inverse propensity weighting, or synthetic controls when randomization is infeasible. They may also probe metric design, fraud/risk model evaluation, uplift modeling, or experimentation platforms from an analyst’s perspective: assignment integrity, exposure logging, variance reduction, and decision governance.
