Analyze data duplication effects in linear regression
Company: Google
Role: Data Scientist
Category: Statistics & Math
Difficulty: medium
Interview Round: Technical Screen
Consider OLS y = Xβ + ε with full-rank X and i.i.d. ε ~ N(0, σ²I). Suppose you duplicate the entire dataset k times (stack X and y vertically k identical times) and refit OLS.
- Prove algebraically that β̂ is unchanged, while the reported Var(β̂) scales by 1/k (even though no new information was added), so each reported standard error scales by 1/√k; explain the impact on t-statistics, p-values, and confidence-interval widths.
- Numerical check: originally, for a single coefficient, SE = 0.20 and t = 5.0. If you (incorrectly) treat 4 duplicates as independent (k=4), what are the new SE, t, and two-sided p approximately? Show calculations.
- Interpret a p-value correctly in this context, and explain why duplication violates its assumptions. When might analyst actions (e.g., oversampling, data augmentation, bootstrapping) inadvertently mimic duplication effects?
- Chi-square tests: explain how very large n can yield tiny p-values for negligible effects. Propose two remedies: (1) report and threshold on an effect size (e.g., Cramér’s V or odds ratio with CI), and (2) a penalized or Bayesian alternative that shrinks spurious significance (e.g., penalized likelihood, Firth correction for small cells, or weakly informative priors). Describe when each is preferable and how you would calibrate thresholds.
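The algebraic claim in the first bullet is easy to verify numerically. Below is a minimal stdlib-Python sketch (simulated data, one-slope simple regression; all names and parameters are illustrative): stacking the data k = 4 times leaves the slope untouched, while the naively reported standard error shrinks by roughly √k. The shrinkage is exactly √k when σ² is known; the small gap here comes from the residual degrees of freedom.

```python
import math
import random

# Hypothetical simulated dataset (seed, size, and coefficients are illustrative).
random.seed(0)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 + 1.5 * xi + random.gauss(0, 1) for xi in x]

def simple_ols(xs, ys):
    """Slope and its naively reported SE for y = a + b*x (classic closed forms)."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    sxx = sum((xi - mx) ** 2 for xi in xs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(xs, ys))
    se_b = math.sqrt(rss / (m - 2) / sxx)   # sqrt(sigma_hat^2 / Sxx)
    return b, se_b

k = 4
b1, se1 = simple_ols(x, y)
bk, sek = simple_ols(x * k, y * k)   # stack the dataset k times

print(b1, bk)      # identical slope estimates
print(se1 / sek)   # ~ sqrt(k) = 2, up to a degrees-of-freedom correction
```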
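For the numerical check in the second bullet: the reported SE becomes 0.20/√4 = 0.10, the t-statistic becomes 5.0 × 2 = 10.0, and the two-sided p-value (normal approximation, reasonable at large df) drops from about 6 × 10⁻⁷ to about 10⁻²³. A quick stdlib verification:

```python
import math

def two_sided_p(z):
    """Two-sided p-value under a normal approximation: P(|Z| > z)."""
    return math.erfc(abs(z) / math.sqrt(2))

se, t, k = 0.20, 5.0, 4
se_dup = se / math.sqrt(k)   # 0.20 / 2 = 0.10
t_dup = t * math.sqrt(k)     # 5.0 * 2 = 10.0
print(se_dup, t_dup)
print(two_sided_p(t))        # ~ 5.7e-07
print(two_sided_p(t_dup))    # ~ 1.5e-23: wildly overstated significance
```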
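On the third bullet, bootstrapping is the clearest case of deliberate duplication done correctly. Each resample contains repeated rows, exactly the situation the question warns about, but the procedure stays valid because the standard error is read off the spread of the estimate across resamples, never from a naive within-resample formula. A hedged stdlib sketch with simulated data (all names hypothetical):

```python
import math
import random

# Hypothetical simulated data; the setup mirrors the OLS question above.
random.seed(1)
n = 100
x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]

def slope(xs, ys):
    """OLS slope of ys on xs (simple regression)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((xi - mx) ** 2 for xi in xs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    return sxy / sxx

# Resample rows with replacement (duplicates by construction) and use the
# ACROSS-resample spread as the standard error.
B = 500
boots = []
for _ in range(B):
    idx = [random.randrange(n) for _ in range(n)]
    boots.append(slope([x[i] for i in idx], [y[i] for i in idx]))
mean_b = sum(boots) / B
se_boot = math.sqrt(sum((b - mean_b) ** 2 for b in boots) / (B - 1))
print(se_boot)   # close to the theoretical SE, roughly sigma / sqrt(Sxx) ~ 0.1
```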
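The large-n chi-square bullet can be illustrated with a stdlib sketch over hypothetical 2×2 counts: multiplying every cell by 100 leaves the association identical (Cramér's V stays at 0.02) while the p-value collapses from about 0.78 to about 0.005, which is why an effect-size threshold belongs alongside the test.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square, its 1-df p-value, and Cramér's V for [[a, b], [c, d]]."""
    n = a + b + c + d
    obs = [[a, b], [c, d]]
    rows, cols = [a + b, c + d], [a + c, b + d]
    chi2 = sum((obs[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(2) for j in range(2))
    p = math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-square with 1 df
    v = math.sqrt(chi2 / n)             # Cramér's V; min(r-1, c-1) = 1 for 2x2
    return chi2, p, v

# Same 51/49 vs 49/51 split, with the sample size scaled 100x.
chi2_s, p_s, v_s = chi2_2x2(51, 49, 49, 51)
chi2_l, p_l, v_l = chi2_2x2(5100, 4900, 4900, 5100)
print(p_s, v_s)   # p ~ 0.78, V = 0.02: not significant, negligible effect
print(p_l, v_l)   # p ~ 0.005, V = 0.02: "significant", still negligible effect
```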
Quick Answer: This question evaluates understanding of OLS estimation, how duplicated observations distort reported estimator variance and downstream inference, and the interplay between sample size, p-values, and effect-size measures.