Analyze data duplication effects in linear regression
Company: Google
Role: Data Scientist
Category: Statistics & Math
Difficulty: medium
Interview Round: Technical Screen
Consider OLS y = Xβ + ε with full-rank X and i.i.d. ε ~ N(0, σ²I). Suppose you duplicate the entire dataset k times (stack X and y vertically k identical times) and refit OLS.
- Prove algebraically that β̂ is unchanged, while the reported Var(β̂) scales by 1/k (even though no new information was added), so each reported standard error scales by 1/√k; explain the impact on t-statistics, p-values, and confidence-interval widths.
- Numerical check: originally, for a single coefficient, SE = 0.20 and t = 5.0. If you (incorrectly) treat 4 duplicates as independent (k=4), what are the new SE, t, and two-sided p approximately? Show calculations.
- Interpret a p-value correctly in this context, and explain why duplication violates its assumptions. When might analyst actions (e.g., oversampling, data augmentation, bootstrapping) inadvertently mimic duplication effects?
- Chi-square tests: explain how very large n can yield tiny p-values for negligible effects. Propose two remedies: (1) report and threshold on an effect size (e.g., Cramér’s V or odds ratio with CI), and (2) a penalized or Bayesian alternative that shrinks spurious significance (e.g., penalized likelihood, Firth correction for small cells, or weakly informative priors). Describe when each is preferable and how you would calibrate thresholds.
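The algebraic claim in the first bullet is easy to verify numerically. Below is a minimal stdlib-Python sketch (simulated data, one-slope simple regression; all names and parameters are illustrative): stacking the data k = 4 times leaves the slope untouched, while the naively reported standard error shrinks by roughly √k. The shrinkage is exactly √k when σ² is known; the small gap here comes from the residual degrees of freedom.

```python
import math
import random

# Hypothetical simulated dataset (seed, size, and coefficients are illustrative).
random.seed(0)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 + 1.5 * xi + random.gauss(0, 1) for xi in x]

def simple_ols(xs, ys):
    """Slope and its naively reported SE for y = a + b*x (classic closed forms)."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    sxx = sum((xi - mx) ** 2 for xi in xs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(xs, ys))
    se_b = math.sqrt(rss / (m - 2) / sxx)   # sqrt(sigma_hat^2 / Sxx)
    return b, se_b

k = 4
b1, se1 = simple_ols(x, y)
bk, sek = simple_ols(x * k, y * k)   # stack the dataset k times

print(b1, bk)      # identical slope estimates
print(se1 / sek)   # ~ sqrt(k) = 2, up to a degrees-of-freedom correction
```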
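For the numerical check in the second bullet: the reported SE becomes 0.20/√4 = 0.10, the t-statistic becomes 5.0 × 2 = 10.0, and the two-sided p-value (normal approximation, reasonable at large df) drops from about 6 × 10⁻⁷ to about 10⁻²³. A quick stdlib verification:

```python
import math

def two_sided_p(z):
    """Two-sided p-value under a normal approximation: P(|Z| > z)."""
    return math.erfc(abs(z) / math.sqrt(2))

se, t, k = 0.20, 5.0, 4
se_dup = se / math.sqrt(k)   # 0.20 / 2 = 0.10
t_dup = t * math.sqrt(k)     # 5.0 * 2 = 10.0
print(se_dup, t_dup)
print(two_sided_p(t))        # ~ 5.7e-07
print(two_sided_p(t_dup))    # ~ 1.5e-23: wildly overstated significance
```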
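On the third bullet, bootstrapping is the clearest case of deliberate duplication done correctly. Each resample contains repeated rows, exactly the situation the question warns about, but the procedure stays valid because the standard error is read off the spread of the estimate across resamples, never from a naive within-resample formula. A hedged stdlib sketch with simulated data (all names hypothetical):

```python
import math
import random

# Hypothetical simulated data; the setup mirrors the OLS question above.
random.seed(1)
n = 100
x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]

def slope(xs, ys):
    """OLS slope of ys on xs (simple regression)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((xi - mx) ** 2 for xi in xs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    return sxy / sxx

# Resample rows with replacement (duplicates by construction) and use the
# ACROSS-resample spread as the standard error.
B = 500
boots = []
for _ in range(B):
    idx = [random.randrange(n) for _ in range(n)]
    boots.append(slope([x[i] for i in idx], [y[i] for i in idx]))
mean_b = sum(boots) / B
se_boot = math.sqrt(sum((b - mean_b) ** 2 for b in boots) / (B - 1))
print(se_boot)   # close to the theoretical SE, roughly sigma / sqrt(Sxx) ~ 0.1
```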
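The large-n chi-square bullet can be illustrated with a stdlib sketch over hypothetical 2×2 counts: multiplying every cell by 100 leaves the association identical (Cramér's V stays at 0.02) while the p-value collapses from about 0.78 to about 0.005, which is why an effect-size threshold belongs alongside the test.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square, its 1-df p-value, and Cramér's V for [[a, b], [c, d]]."""
    n = a + b + c + d
    obs = [[a, b], [c, d]]
    rows, cols = [a + b, c + d], [a + c, b + d]
    chi2 = sum((obs[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(2) for j in range(2))
    p = math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-square with 1 df
    v = math.sqrt(chi2 / n)             # Cramér's V; min(r-1, c-1) = 1 for 2x2
    return chi2, p, v

# Same 51/49 vs 49/51 split, with the sample size scaled 100x.
chi2_s, p_s, v_s = chi2_2x2(51, 49, 49, 51)
chi2_l, p_l, v_l = chi2_2x2(5100, 4900, 4900, 5100)
print(p_s, v_s)   # p ~ 0.78, V = 0.02: not significant, negligible effect
print(p_l, v_l)   # p ~ 0.005, V = 0.02: "significant", still negligible effect
```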
Quick Answer: This question evaluates understanding of OLS estimation, how duplicated observations distort reported estimator variance and downstream inference, and the interplay between sample size, p-values, and effect-size measures.