OLS With Duplicated Observations: Estimator, Variance, and Inference Pitfalls
Context: You have the linear model y = Xβ + ε with full-rank X ∈ ℝ^{n×p} and i.i.d. errors ε ~ N(0, σ²I). You form a new dataset by stacking the original data k times (k ≥ 2): X_k = [X; X; …; X] and y_k = [y; y; …; y], then refit OLS as if all kn rows were independent observations.
Tasks
- Algebraic proof
  - Show that the OLS estimate is unchanged by exact duplication: β̂_k = β̂.
  - Under the usual OLS variance formula Var(β̂ | X) = σ² (X'X)^{-1} (i.i.d. errors across rows), prove that the nominal variance satisfies Var(β̂_k) = (1/k) Var(β̂). Conclude that each reported standard error shrinks by a factor of 1/√k.
  - Explain the resulting impact on t-statistics, p-values, and confidence interval widths.
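Both identities in the algebraic task can be verified numerically. The sketch below (simulated data; NumPy) checks that the coefficient estimate is invariant under duplication and that the nominal covariance, with σ² factored out, shrinks by exactly 1/k:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 50, 3, 4
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(scale=0.3, size=n)

# Fit OLS on the original data and on the k-fold duplicated data
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
Xk = np.vstack([X] * k)
yk = np.concatenate([y] * k)
beta_hat_k = np.linalg.lstsq(Xk, yk, rcond=None)[0]

# Invariance: (X_k'X_k)^{-1} X_k'y_k = (k X'X)^{-1} (k X'y) = (X'X)^{-1} X'y
assert np.allclose(beta_hat, beta_hat_k)

# Nominal covariance with sigma^2 factored out:
# (X_k'X_k)^{-1} = (1/k)(X'X)^{-1}, so each naive SE shrinks by 1/sqrt(k)
cov = np.linalg.inv(X.T @ X)
cov_k = np.linalg.inv(Xk.T @ Xk)
assert np.allclose(cov_k, cov / k)
```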
- Numerical check
  - Originally, for one coefficient, SE = 0.20 and t = 5.0. If you (incorrectly) treat k = 4 stacked copies as independent, compute the new SE, the new t, and the approximate two-sided p-value. Show your calculations.
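As a sanity check on the arithmetic (using the normal approximation for the two-sided p-value, reasonable given the residual degrees of freedom here), a minimal sketch:

```python
import math

se, t, k = 0.20, 5.0, 4

se_k = se / math.sqrt(k)   # 0.20 / 2 = 0.10
t_k = t * math.sqrt(k)     # 5.0 * 2 = 10.0

# Two-sided p-value under the normal approximation: p = erfc(|t| / sqrt(2))
p_k = math.erfc(t_k / math.sqrt(2))   # ≈ 1.5e-23
```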
- Interpreting p-values under duplication
  - State the correct interpretation of a p-value in this setting.
  - Explain why duplication violates the assumptions behind that interpretation (e.g., independence), and why this invalidates the resulting p-values.
  - Give examples of analyst actions that can inadvertently mimic duplication (e.g., random oversampling of a minority class, certain data augmentation schemes, expanding counts into micro-rows, naïve use of bootstrap outputs for classical p-values).
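One of these actions is easy to demonstrate on simulated data: random oversampling with replacement to 4× the original size puts each row in the sample roughly four times, so the naive slope SE comes out near half the honest one, mimicking k = 4 duplication. A sketch (the `naive_se_slope` helper is illustrative, not from any library):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

# Oversample with replacement to 4x the original size:
# each original row appears ~4 times on average
idx = rng.integers(0, n, size=4 * n)
x_over, y_over = x[idx], y[idx]

def naive_se_slope(x, y):
    """Classical OLS slope SE, computed as if all rows were i.i.d."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - 2)
    return np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])

se_orig = naive_se_slope(x, y)
se_over = naive_se_slope(x_over, y_over)
# se_over / se_orig is close to 1/sqrt(4) = 0.5, up to resampling noise
```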
- Chi-square tests and large n
  - Explain why very large sample sizes can produce tiny p-values for negligible effects in chi-square tests of independence.
  - Propose two remedies:
    1. Report and threshold on an effect size (e.g., Cramér's V or an odds ratio with a confidence interval). State when this is preferable and how you would calibrate thresholds.
    2. Use a penalized or Bayesian alternative that shrinks spurious significance (e.g., penalized likelihood, Firth correction for small cells, or weakly informative priors). Describe when each is preferable and how to set hyperparameters or thresholds.
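The contrast between the n-sensitive p-value and the n-invariant effect size is easy to exhibit: scale a 2×2 table with a fixed, negligible association by 100 and the chi-square p-value collapses while Cramér's V does not move. A sketch using `scipy.stats.chi2_contingency` (made-up counts):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Identical cell proportions at two sample sizes; the association is tiny
small = np.array([[52, 48], [48, 52]])   # n = 200
large = small * 100                      # n = 20,000

chi2_s, p_s, _, _ = chi2_contingency(small, correction=False)
chi2_l, p_l, _, _ = chi2_contingency(large, correction=False)

# Cramér's V = sqrt(chi2 / (n * (min(r, c) - 1))); for a 2x2 table the
# denominator factor is 1
v_s = np.sqrt(chi2_s / small.sum())
v_l = np.sqrt(chi2_l / large.sum())
# p_s is far from significant, p_l is astronomically small,
# yet V = 0.04 in both cases
```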