
Estimate OLS via streaming sufficient statistics

Last updated: Mar 29, 2026

Quick Overview

This question evaluates proficiency in streaming/out-of-core linear regression, including computing sufficient statistics with an intercept, assessing numerical stability of normal equations versus QR/SVD or incremental methods, incorporating ridge penalties, and designing parallel fault-tolerant computations.

  • hard
  • Citadel
  • Machine Learning
  • Data Scientist

Company: Citadel

Role: Data Scientist

Category: Machine Learning

Difficulty: hard

Interview Round: Technical Screen

You must estimate OLS coefficients β for very high-dimensional linear regression with data too large to fit in memory. (1) Derive how to compute XᵀX and Xᵀy in streaming mini-batches (include an intercept), then recover β and standard errors. (2) Discuss numerical stability vs. using QR or incremental/online methods. (3) Extend to ridge regression and show how to update with λI. (4) Explain how you would checkpoint and parallelize the computation.
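
The batch decomposition behind part (1) can be written out as a short derivation. This is a sketch under the usual homoskedastic OLS assumptions. The symbols \tilde{X} (the design matrix with a leading column of ones), B (the number of mini-batches), d = p + 1, and D (the identity with its intercept entry zeroed) are notation introduced here for illustration and do not come from the question. Leaving the intercept unpenalized in the ridge case is a common convention, assumed here; replace D with I to penalize it as well.

```latex
\begin{align*}
\tilde{X} &= [\,\mathbf{1} \;\; X\,] \in \mathbb{R}^{n \times d}, \qquad d = p + 1 \\
\tilde{X}^\top \tilde{X} &= \sum_{b=1}^{B} \tilde{X}_b^\top \tilde{X}_b, \qquad
\tilde{X}^\top y = \sum_{b=1}^{B} \tilde{X}_b^\top y_b, \qquad
y^\top y = \sum_{b=1}^{B} y_b^\top y_b \\
\hat{\beta} &= (\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top y \\
\mathrm{RSS} &= y^\top y - \hat{\beta}^\top \tilde{X}^\top y, \qquad
\hat{\sigma}^2 = \frac{\mathrm{RSS}}{n - d} \\
\widehat{\mathrm{Var}}(\hat{\beta}) &= \hat{\sigma}^2 (\tilde{X}^\top \tilde{X})^{-1}, \qquad
\mathrm{se}(\hat{\beta}_j) = \sqrt{\widehat{\mathrm{Var}}(\hat{\beta})_{jj}} \\
\hat{\beta}_\lambda &= (\tilde{X}^\top \tilde{X} + \lambda D)^{-1} \tilde{X}^\top y, \qquad
D = \mathrm{diag}(0, 1, \dots, 1)
\end{align*}
```

Because every statistic is a plain sum over batches, memory use is O(d^2) regardless of n, and the penalty \lambda D is added once at solve time rather than per batch.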

Related Interview Questions

  • Analyze Correlations and Generate Gaussians - Citadel (medium)
  • Determine When a Quadratic Has Finite Minimum - Citadel (medium)
  • Choose models for trading tasks - Citadel (hard)
  • Design city home-price prediction system - Citadel (hard)
  • Diagnose outliers and influence in linear regression - Citadel (hard)
Oct 13, 2025, 9:49 PM

Streaming OLS and Ridge for Out-of-Core, High-Dimensional Linear Regression

You need to estimate linear regression coefficients when the dataset is too large to fit in memory. Assume we can read data in mini-batches of rows. Let X ∈ R^{n×p} be the feature matrix and y ∈ R^{n} the target. Include an intercept.

  1. Show how to compute the sufficient statistics XᵀX and Xᵀy in streaming mini-batches (with an intercept), then recover β and standard errors.
  2. Discuss numerical stability of using the normal equations vs. more stable QR/SVD or incremental/online methods.
  3. Extend to ridge regression and show how to incorporate the λI penalty in the out-of-core computation.
  4. Explain how you would checkpoint for fault tolerance and parallelize the computation across workers.
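
The four parts above can be sketched in one small accumulator. This is a minimal illustration, not a reference solution: the class name `StreamingOLS` and its methods are hypothetical, the ridge penalty is assumed to skip the intercept, and the standard errors assume homoskedastic errors and apply to the unpenalized fit only.

```python
import numpy as np


class StreamingOLS:
    """Hypothetical accumulator for X'X, X'y, y'y and n over mini-batches."""

    def __init__(self, p):
        d = p + 1  # one extra column for the intercept
        self.n = 0
        self.xtx = np.zeros((d, d))
        self.xty = np.zeros(d)
        self.yty = 0.0

    def update(self, X, y):
        """Fold one mini-batch (X: rows x p, y: rows) into the statistics."""
        Xa = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
        self.xtx += Xa.T @ Xa
        self.xty += Xa.T @ y
        self.yty += float(y @ y)
        self.n += len(y)

    def merge(self, other):
        """Add another worker's statistics; sums are order-independent."""
        self.n += other.n
        self.xtx += other.xtx
        self.xty += other.xty
        self.yty += other.yty
        return self

    def solve(self, lam=0.0):
        """Return coefficients; lam > 0 gives ridge with an unpenalized intercept."""
        penalty = lam * np.eye(self.xtx.shape[0])
        penalty[0, 0] = 0.0  # assumption: do not shrink the intercept
        return np.linalg.solve(self.xtx + penalty, self.xty)

    def std_errors(self):
        """OLS standard errors from the accumulated statistics alone."""
        beta = self.solve()
        d = len(beta)
        # RSS = y'y - 2 b'X'y + b'X'Xb; subject to cancellation when RSS << y'y
        rss = self.yty - 2.0 * beta @ self.xty + beta @ self.xtx @ beta
        sigma2 = rss / (self.n - d)
        return np.sqrt(np.diag(sigma2 * np.linalg.inv(self.xtx)))
```

The entire state is `n`, `xtx`, `xty`, and `yty`, so a checkpoint only needs to persist those four values together with the offset of the last batch consumed, and parallel workers can each accumulate over a disjoint shard and be combined with `merge`. On stability: forming XᵀX squares the condition number of X, so for ill-conditioned designs a QR- or SVD-based route on the data (or on per-batch triangular factors) is the safer choice, and centering or scaling features before accumulation also helps.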

