A/B Testing for Data Scientists: Conquering the Experimentation Interview
Quick Overview
Master the A/B testing round in Data Science interviews. A deep dive into statistical significance, hypothesis testing, p-values, sample size calculation, and handling complex network effects in experimentation.
For Data Scientists and Product Analysts, the A/B Testing (Experimentation) interview is virtually unavoidable. Tech giants like Netflix, Uber, and Meta run thousands of concurrent experiments, making rigorous statistical knowledge a strict prerequisite.
While many candidates can comfortably run a SQL query, they often completely unravel when an interviewer asks, "Your p-value is 0.04. Should we launch the feature?" In this comprehensive guide, we will break down the mathematical foundations of A/B testing and the complex edge cases interviewers use to separate junior analysts from senior data scientists.
1. The Statistical Foundation: Hypothesis Testing
An A/B test is fundamentally a Two-Sample Hypothesis Test. Before running any experiment, you must clearly define:
- The Null Hypothesis (H₀): There is no difference between the Control (A) and the Variant (B). Any observed difference is purely due to random chance.
- The Alternative Hypothesis (H₁): There is a statistically significant difference between the Control and the Variant.
Demystifying the P-Value
The p-value is the most misunderstood metric in data science. A p-value is the probability of obtaining test results at least as extreme as the results actually observed, assuming that the null hypothesis is correct. It is not the probability that the Variant is better. If your p-value is 0.03, it means there is a 3% chance you would observe a difference at least this extreme if the two versions were actually identical. Because 0.03 is less than the standard significance level (α = 0.05), you reject the null hypothesis.
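To make this concrete, here is a minimal sketch of a two-proportion z-test in Python. The conversion counts, sample sizes, and use of scipy below are illustrative assumptions, not numbers from a real experiment:

```python
import numpy as np
from scipy.stats import norm

# Illustrative (made-up) conversions and sample sizes per group
conv_a, n_a = 480, 10_000   # Control
conv_b, n_b = 540, 10_000   # Variant

p_a, p_b = conv_a / n_a, conv_b / n_b

# Pooled proportion under the null hypothesis (no true difference between groups)
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))   # two-sided p-value

alpha = 0.05
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```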
2. Sample Size and Minimum Detectable Effect (MDE)
A classic interview trap is asking, "How long should we run this test?" You cannot answer with "two weeks." You must calculate the required Sample Size.
To calculate sample size, you need four parameters:
- Baseline Conversion Rate: The current performance of the Control group.
- Minimum Detectable Effect (MDE): The smallest improvement you care about. If the MDE is a 0.1% increase, you will need a massive sample size to prove it wasn't random noise. If the MDE is 10%, you need far fewer users.
- Significance Level (α): Usually 5% (0.05). This is your tolerance for a False Positive (Type I Error).
- Statistical Power (1 − β): Usually 80% (0.80). This is your ability to detect an effect if it truly exists, minimizing False Negatives (Type II Error).
Interview Answer: "I would determine the required sample size using a power calculation based on our MDE of X% and our baseline rate. Dividing that sample size by our daily active traffic will tell us exactly how many days the test needs to run."
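As a rough sketch of that power calculation, statsmodels can solve for the required sample size per group. The 5% baseline, 0.5-percentage-point MDE, and daily traffic figure below are illustrative assumptions:

```python
import math
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05          # current Control conversion rate (illustrative)
mde = 0.005              # minimum detectable effect: +0.5 pp absolute (illustrative)
alpha, power = 0.05, 0.80

# Cohen's h effect size for comparing two proportions
effect_size = proportion_effectsize(baseline + mde, baseline)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, alternative="two-sided"
)

daily_traffic = 5_000    # eligible users per day (illustrative)
days = math.ceil(2 * n_per_group / daily_traffic)
print(f"~{n_per_group:,.0f} users per group, roughly {days} days of traffic")
```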
3. The Senior Trap: Network Effects & Interference
If you are interviewing at Uber or Airbnb, standard A/B testing fails due to Network Effects (Interference).
Assume you are testing a new pricing algorithm for Uber drivers in New York. You randomly assign Driver A to the Variant and Driver B to the Control. If the Variant algorithm makes Driver A work more hours, Driver A will take more rides. This artificially decreases the number of rides available for Driver B in the Control group. The Variant looks incredibly successful, but it merely cannibalized the Control group.
Solutions for Network Effects
When asked how to test in a two-sided marketplace, senior candidates must propose:
- Geo-based Testing (Switchbacks): Instead of randomizing by user, randomize by geography or time. Apply the Variant to all of Manhattan on Monday, and the Control to all of Manhattan on Tuesday, comparing the aggregated marketplace metrics.
- Cluster Randomization: Group highly connected nodes (e.g., friend groups on Facebook) and randomize the entire cluster into the Control or Variant together.
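As a rough illustration of the switchback approach described above, assignment happens at the geography × time-window level rather than per user. The cities, time buckets, and random seed here are assumptions made for the sketch:

```python
import random
from itertools import product

random.seed(42)

# Switchback design: randomize (city, time-window) units, not individual users
cities = ["manhattan", "brooklyn", "queens"]
windows = [f"2024-06-{day:02d}T{hour:02d}"
           for day in range(1, 8) for hour in (0, 6, 12, 18)]

assignment = {
    (city, window): random.choice(["control", "variant"])
    for city, window in product(cities, windows)
}

def arm_for_ride(city: str, window: str) -> str:
    """Every ride in the same city and time window shares one arm,
    so supply and demand are never split within a single marketplace."""
    return assignment[(city, window)]

print(arm_for_ride("manhattan", "2024-06-01T06"))
```

Metrics are then aggregated per (city, window) unit and compared across arms, which keeps interference contained inside each unit.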
4. The Peeking Problem
"The test has been running for 3 days. The p-value is 0.02. The PM wants to stop the test early and launch. What do you do?"
The Answer is NO. This is called the "Peeking Problem." Because p-values fluctuate wildly in the early days of an experiment, continuously checking the p-value and stopping as soon as it drops below 0.05 drastically inflates your False Positive rate. You must commit to running the test for the pre-calculated duration, or use advanced Sequential Testing frameworks.
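A small A/A simulation makes the peeking effect visible. The traffic numbers and daily-check schedule below are illustrative assumptions; both arms share the same true conversion rate, so any "significant" result is a false positive:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_experiments, days, users_per_day, p = 2_000, 14, 1_000, 0.05

false_positives = 0
for _ in range(n_experiments):
    # A/A test: Control and Variant have identical true conversion rates
    a = rng.binomial(users_per_day, p, size=days)
    b = rng.binomial(users_per_day, p, size=days)
    for day in range(1, days + 1):
        n = day * users_per_day
        pa, pb = a[:day].sum() / n, b[:day].sum() / n
        pool = (a[:day].sum() + b[:day].sum()) / (2 * n)
        se = np.sqrt(pool * (1 - pool) * 2 / n)
        p_value = 2 * norm.sf(abs((pb - pa) / se))
        if p_value < 0.05:          # stop as soon as "significance" appears
            false_positives += 1
            break

print(f"False positive rate with daily peeking: {false_positives / n_experiments:.1%}")
```

With daily peeking over two weeks, the observed false-positive rate typically lands well above the nominal 5%, which is exactly why committing to the pre-calculated duration matters.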
Master Data Science Interviews on PracHub
Understanding the math behind a Z-test is the easy part. Pushing back against a hypothetical aggressive Product Manager who wants to launch a feature on compromised data is the kind of friction you will actually face in a real interview.
PracHub is the ultimate practice ground for Data Scientists. You can practice with real interview questions, and the platform connects you with experienced data professionals for rigorous mock interviews. Practice explaining complex statistical concepts like Network Effects and Power Calculations in simple terms, so you are fully prepared to conquer the experimentation round at any major tech company.