Explain core probability and ML statistics concepts
Company: Bank of America
Role: Data Scientist
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
Answer the following short theory questions (you may use equations and brief examples):
## Probability
1. You roll two fair six-sided dice.
- What is the probability that one die shows a strictly larger value than the other (i.e., the two values are different and one is greater)?
- What is the probability that a *specific* die (e.g., the first die) is strictly larger than the other?
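
As a quick sanity check for question 1, the 36 equally likely ordered outcomes can be enumerated directly. This is only a minimal sketch; the variable names are illustrative and not part of the original question.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered outcomes (first die, second die).
outcomes = list(product(range(1, 7), repeat=2))

# Either die strictly larger than the other: the two values differ.
p_either_larger = Fraction(sum(a != b for a, b in outcomes), len(outcomes))

# A specific die (here the first) strictly larger than the other.
p_first_larger = Fraction(sum(a > b for a, b in outcomes), len(outcomes))

print(p_either_larger)  # 5/6
print(p_first_larger)   # 5/12
```

By symmetry the second probability is half of the first: ties have probability 1/6, and the remaining 5/6 splits evenly between the two dice.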
## Basic statistics
2. Define **mean** and **variance** of a random variable.
3. Is the sample variance estimator that divides by \(n\) biased? If yes, what correction makes it unbiased?
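
One way to make the bias in question 3 concrete is to compute the exact expectation of both estimators over every possible sample. The sketch below assumes a fair die as the population and a sample size of 2; both choices are illustrative, not taken from the question.

```python
from fractions import Fraction
from itertools import product

population = range(1, 7)  # faces of a fair die, each with probability 1/6
n = 2                     # sample size (assumed for illustration)

mu = Fraction(sum(population), 6)
sigma2 = sum((x - mu) ** 2 for x in population) / 6  # true variance, 35/12

def sample_var(xs, ddof):
    """Sum of squared deviations from the sample mean, divided by (len - ddof)."""
    m = Fraction(sum(xs), len(xs))
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)

# Every i.i.d. sample of size n is equally likely, so a plain average
# over all of them is the exact expectation of the estimator.
samples = list(product(population, repeat=n))
exp_biased = sum(sample_var(s, 0) for s in samples) / len(samples)
exp_unbiased = sum(sample_var(s, 1) for s in samples) / len(samples)

print(sigma2, exp_biased, exp_unbiased)  # 35/12 35/24 35/12
```

Dividing by \(n\) gives an expectation of \(\frac{n-1}{n}\sigma^2\), which underestimates the variance; dividing by \(n-1\) (Bessel's correction) removes the bias.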
## Correlation vs. independence
4. Let \(X, Y\) be (marginally) normally distributed with \(\mathrm{Corr}(X,Y)=0\). Are \(X\) and \(Y\) necessarily independent? State the condition under which zero correlation *does* imply independence.
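
A standard counterexample for question 4 takes \(X \sim N(0,1)\) and \(Y = SX\), where \(S = \pm 1\) is an independent fair sign. Both marginals are standard normal and the correlation is zero, yet \(|Y| = |X|\), so the variables are clearly dependent. The simulation below is a rough sketch of that construction; the seed, sample size, and helper `corr` are assumptions made for illustration.

```python
import random

def corr(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000

x = [rng.gauss(0.0, 1.0) for _ in range(n)]
s = [rng.choice((-1.0, 1.0)) for _ in range(n)]  # independent random sign
y = [si * xi for si, xi in zip(s, x)]            # marginally N(0, 1) as well

corr_xy = corr(x, y)                                  # close to 0
corr_sq = corr([a * a for a in x], [b * b for b in y])  # essentially 1

print(corr_xy, corr_sq)
```

Zero correlation does imply independence when \((X, Y)\) is *jointly* (bivariate) normal; the pair above has normal marginals but is not jointly normal.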
## Linear regression / OLS
5. Explain linear regression and list common assumptions behind Ordinary Least Squares (OLS).
6. Write the closed-form OLS estimator.
7. Why is (multi)collinearity a problem in regression? How can you detect it, and how can you mitigate it?
8. Briefly explain **Ridge** and **Lasso** regression and how they relate to collinearity.
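
For questions 6 to 8, the closed-form OLS estimator is \(\hat\beta = (X^\top X)^{-1} X^\top y\), and ridge regression replaces \(X^\top X\) with \(X^\top X + \lambda I\). The sketch below implements both with a small hand-written linear solver so that it needs no external libraries; the helper names (`transpose`, `matmul`, `solve`, `fit`) and the toy data are hypothetical, and a real analysis would normally use a numerical library and leave the intercept unpenalized.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]  # fails if the pivot is exactly zero
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(X, y, lam=0.0):
    """Return (X^T X + lam * I)^{-1} X^T y; lam = 0 is OLS, lam > 0 is ridge."""
    Xt = transpose(X)
    A = matmul(Xt, X)
    for i in range(len(A)):
        A[i][i] += lam
    b = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
    return solve(A, b)

# Well-posed case: y = 1 + 2x with an explicit intercept column.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [1, 3, 5, 7]
beta_ols = fit(X, y)  # approximately [1.0, 2.0]

# Perfect collinearity: the same feature appears twice, so X^T X is singular.
X_dup = [[0, 0], [1, 1], [2, 2], [3, 3]]
y_dup = [0, 2, 4, 6]
try:
    fit(X_dup, y_dup)
    ols_failed = False
except ZeroDivisionError:
    ols_failed = True  # no unique OLS solution exists

beta_ridge = fit(X_dup, y_dup, lam=0.1)  # unique; weight split equally
print(beta_ols, ols_failed, beta_ridge)
```

With duplicated columns, OLS has no unique solution, while the ridge penalty makes the system invertible and shares the coefficient between the two copies. Lasso has no closed form in general and tends to keep one of a set of correlated features while driving the others to exactly zero.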
## PCA
9. Explain Principal Component Analysis (PCA).
10. What are eigenvalues and eigenvectors in this context, and what do they represent?
11. List key limitations of PCA.
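
To illustrate questions 9 and 10, PCA on two-dimensional data reduces to the eigen-decomposition of a 2x2 covariance matrix, which has a closed form. The following is a minimal sketch on made-up data points; the eigenvector formula used here assumes the off-diagonal covariance is nonzero.

```python
import math

# Toy 2-D data lying close to a line (values invented for illustration).
data = [(2.0, 1.1), (4.0, 1.9), (6.0, 3.2), (8.0, 3.8), (10.0, 5.0)]
n = len(data)

# Step 1: center the data.
mx = sum(p[0] for p in data) / n
my = sum(p[1] for p in data) / n
centered = [(x - mx, y - my) for x, y in data]

# Step 2: sample covariance matrix [[a, b], [b, c]].
a = sum(x * x for x, _ in centered) / (n - 1)
b = sum(x * y for x, y in centered) / (n - 1)
c = sum(y * y for _, y in centered) / (n - 1)

# Step 3: eigenvalues of a symmetric 2x2 matrix in closed form.
mid = (a + c) / 2
rad = math.hypot((a - c) / 2, b)
lam1, lam2 = mid + rad, mid - rad  # lam1 >= lam2 >= 0

# Step 4: unit eigenvector for lam1 (valid because b != 0 here).
vx, vy = b, lam1 - a
norm = math.hypot(vx, vy)
pc1 = (vx / norm, vy / norm)

# Step 5: project onto the first principal component.
scores = [x * pc1[0] + y * pc1[1] for x, y in centered]
score_var = sum(t * t for t in scores) / (n - 1)

explained = lam1 / (lam1 + lam2)
print(pc1, lam1, lam2, explained)
```

Each eigenvector gives a direction of a principal component, and its eigenvalue equals the variance of the data projected onto that direction, so `score_var` matches `lam1` and `explained` is the share of total variance captured by the first component.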
Quick Answer: This question tests core probability and statistics for machine learning: basic probability, mean and variance, correlation versus independence, linear regression and regularization, and dimensionality reduction with PCA. It falls under the Machine Learning domain for Data Scientist roles and probes both conceptual understanding and practical application. Multi-part theory questions like this one check that a candidate can reason about uncertainty, estimator properties, multicollinearity, and eigenstructure rather than relying only on implementation details.