Analyze duplicating data in linear regression

This question evaluates understanding of linear regression theory and statistical inference, specifically how duplicating observations affects OLS coefficient estimates, estimated standard errors, t-statistics, R^2, and adjusted R^2.

Question

You fit a standard linear regression model (with intercept) using ordinary least squares (OLS). Suppose you have:

Design matrix $X$ of size $n \times p$ ($p$ parameters, including the intercept).
Response vector $y$ of length $n$.

You now duplicate every observation once, forming a new dataset by stacking the original data under itself:

New design matrix $X^* = \begin{bmatrix} X \\ X \end{bmatrix}$ of size $2n \times p$.
New response vector $y^* = \begin{bmatrix} y \\ y \end{bmatrix}$ of length $2n$.
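
As a possible starting point for the matrix argument, the stacked structure gives two block-product identities. This is only a sketch, and it assumes $X$ has full column rank so that $X^\top X$ is invertible (an assumption not stated in the question):

$$
X^{*\top} X^* = X^\top X + X^\top X = 2\,X^\top X, \qquad X^{*\top} y^* = X^\top y + X^\top y = 2\,X^\top y .
$$

Every quantity in the OLS summary is built from these two products, the residual sum of squares, and the degrees of freedom, so tracking the factor of 2 and the change from $n - p$ to $2n - p$ is one way to organize the answer.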

You refit the same regression model on this duplicated dataset using OLS and compute the usual summary statistics.

How do the following quantities change, if at all, compared with the original fit?

The OLS coefficient estimates $\hat{\beta}$.
The standard errors of the coefficients.
The t-statistics for the coefficients.
$R^2$.
Adjusted $R^2$.

Explain your reasoning mathematically (you may use matrix notation) and also interpret the result intuitively.
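
Before working through the algebra, it can help to check the behavior numerically. The following is a minimal sketch using NumPy on simulated data; the helper `ols_summary`, the sample size, the true coefficients, and the random seed are all illustrative assumptions and are not part of the question. It uses the textbook formulas $\hat{\sigma}^2 = \mathrm{RSS}/(n-p)$ and adjusted $R^2 = 1 - (1 - R^2)(n-1)/(n-p)$.

```python
import numpy as np

def ols_summary(X, y):
    """Fit OLS and return the usual summary statistics (hypothetical helper)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)           # assumes X has full column rank
    beta = XtX_inv @ X.T @ y                   # coefficient estimates
    resid = y - X @ beta
    rss = resid @ resid                        # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)          # total sum of squares
    sigma2 = rss / (n - p)                     # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(XtX_inv))    # coefficient standard errors
    r2 = 1 - rss / tss
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)
    return {"beta": beta, "se": se, "t": beta / se, "r2": r2, "adj_r2": adj_r2}

# Simulated data (illustrative values only).
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 predictors
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# Duplicate every observation by stacking the data under itself.
X_dup = np.vstack([X, X])
y_dup = np.concatenate([y, y])

orig = ols_summary(X, y)
dup = ols_summary(X_dup, y_dup)

for key in orig:
    print(key, "original:", orig[key], "duplicated:", dup[key])
```

In this sketch the coefficients and $R^2$ should match across the two fits, while the standard errors shrink by the factor $\sqrt{(n-p)/(2n-p)}$ and the t-statistics grow by its reciprocal. The apparent gain in precision comes only from the inflated degrees of freedom, since the duplicated rows carry no new information.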
