Estimate OLS via streaming sufficient statistics

This question evaluates proficiency in streaming/out-of-core linear regression, including computing sufficient statistics with an intercept, assessing numerical stability of normal equations versus QR/SVD or incremental methods, incorporating ridge penalties, and designing parallel fault-tolerant computations.

Question

Streaming OLS and Ridge for Out-of-Core, High-Dimensional Linear Regression

You need to estimate linear regression coefficients when the dataset is too large to fit in memory. Assume we can read data in mini-batches of rows. Let X ∈ R^{n×p} be the feature matrix and y ∈ R^{n} the target. Include an intercept.

1. Show how to compute the sufficient statistics XᵀX and Xᵀy in streaming mini-batches (with an intercept), then recover β and standard errors.
2. Discuss numerical stability of using the normal equations vs. more stable QR/SVD or incremental/online methods.
3. Extend to ridge regression and show how to incorporate the λI penalty in the out-of-core computation.
4. Explain how you would checkpoint for fault tolerance and parallelize the computation across workers.
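
A minimal sketch of how the streaming computation asked for above might look, assuming NumPy is available and that each mini-batch arrives as a pair of arrays (X_batch, y_batch). The class name StreamingOLS and its methods (update, merge, solve, standard_errors) are hypothetical names chosen for illustration; they are not part of the question. The sketch solves the normal equations directly, so it inherits their conditioning problem (forming XᵀX squares the condition number of X); it is not the more stable QR/SVD route. Leaving the intercept unpenalized in the ridge solve is also an assumption.

```python
import numpy as np


class StreamingOLS:
    """Hypothetical accumulator for X'X, X'y, y'y and n over mini-batches."""

    def __init__(self, p):
        d = p + 1  # one extra column for the intercept
        self.xtx = np.zeros((d, d))
        self.xty = np.zeros(d)
        self.yty = 0.0
        self.n = 0

    def update(self, X_batch, y_batch):
        """Fold one mini-batch into the running sums."""
        X_batch = np.asarray(X_batch, dtype=float)
        y_batch = np.asarray(y_batch, dtype=float)
        # Prepend a column of ones so the first coefficient is the intercept.
        Z = np.column_stack([np.ones(len(X_batch)), X_batch])
        self.xtx += Z.T @ Z
        self.xty += Z.T @ y_batch
        self.yty += y_batch @ y_batch
        self.n += len(y_batch)

    def merge(self, other):
        """Combine with another worker's accumulator; every statistic is a sum."""
        self.xtx += other.xtx
        self.xty += other.xty
        self.yty += other.yty
        self.n += other.n
        return self

    def solve(self, lam=0.0):
        """Solve (X'X + lam * I) beta = X'y; lam = 0 gives plain OLS."""
        penalty = lam * np.eye(self.xtx.shape[0])
        penalty[0, 0] = 0.0  # assumption: the intercept is not penalized
        return np.linalg.solve(self.xtx + penalty, self.xty)

    def standard_errors(self):
        """Classical OLS standard errors (lam = 0), from the sums alone."""
        beta = self.solve()
        # At the OLS solution, RSS = y'y - beta'X'y.
        rss = self.yty - beta @ self.xty
        sigma2 = rss / (self.n - len(beta))
        return np.sqrt(np.diag(sigma2 * np.linalg.inv(self.xtx)))
```

In use, one would call update once per mini-batch and solve at the end. Because all four accumulators are plain sums, each worker can process its own shard and the results can be combined with merge; for fault tolerance, the accumulators plus the index of the last consumed batch would be the state to checkpoint. The subtraction in the RSS formula can lose precision when the fit is very tight, which is one reason a second pass over the data or a QR-based update may be preferable in practice.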
