Explain Core ML Interview Concepts
Company: Amazon
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
You are in a phone screen for a machine learning engineer role and are asked to verbally explain a set of machine learning fundamentals. For each part, give a precise, conceptually correct answer and be ready to justify *why*, not just *what*. Treat each question as an invitation to demonstrate depth: state the core idea, then explain the reasoning or intuition behind it.
### Constraints & Assumptions
- This is a conceptual / whiteboard-style discussion, not a coding exercise. No data, libraries, or runnable code are provided.
- Answers are expected to be verbal explanations with light math notation where helpful (e.g. loss functions, update rules).
- Assume standard supervised-learning settings unless a part specifies otherwise.
- Depth and correctness of reasoning matter more than breadth; the interviewer probes the "why" behind each answer.
### Clarifying Questions to Ask
- For the regression and classification parts, should I focus on the modeling assumptions, the estimation/optimization view, or both?
- When discussing loss functions, do you want the probabilistic (maximum-likelihood) justification or just the optimization properties?
- For the optimizer comparison, are you interested in a specific regime (e.g. large-scale vision, NLP, sparse features), or a general comparison?
- For the neural-network part, are we reasoning about classical small-network intuition or modern overparameterized deep-learning theory?
### What a Strong Answer Covers
The interviewer is listening for these signals across the five parts (this is a checklist of *dimensions*, not the answers):
- **Assumptions stated explicitly** for linear and logistic regression, and awareness of which ones matter for point estimates vs. inference.
- **Probabilistic grounding**: connecting squared loss and log-loss to maximum likelihood under specific noise/label models.
- **Mechanism of randomness** in ensembles and *why* it helps (variance reduction / decorrelation), not just "it's a bunch of trees."
- **Optimizer internals**: what state Adam maintains, the update rule, and honest trade-offs vs. SGD (memory, generalization, tuning).
- **Non-convex optimization intuition** for narrow vs. wide networks, including capacity, local minima, and overfitting risk.
- **Calibrated nuance**: acknowledging where the textbook answer is incomplete or where practice diverges from theory.
---
### Part 1 — Linear Regression
What are the main assumptions of linear regression? Why is squared loss commonly used?
```hint Where to start
List the classical assumptions one at a time (think: form of the model, the error term's mean, correlation/independence of errors, error variance, relationships among features). Then separate "needed for unbiased point estimates" from "needed for valid inference."
```
```hint Why squared loss
Consider what probabilistic noise model makes least-squares the **maximum-likelihood** estimator. Also think about convexity, differentiability, and which statistic of $y$ squared loss ends up estimating.
```
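The discussion itself is verbal, but when rehearsing, a small numerical check can make the "which statistic" question concrete. The sketch below assumes the simplest possible model, a single constant prediction `c`, and compares the constant that minimizes average squared error with the one that minimizes average absolute error on skewed synthetic data. The helper name `best_constant`, the grid search, and the exponential data are illustrative assumptions, not part of the question.

```python
import numpy as np

def best_constant(y, loss, grid):
    """Return the grid value c that minimizes the average of loss(y - c)."""
    totals = [loss(y - c).mean() for c in grid]
    return grid[int(np.argmin(totals))]

rng = np.random.default_rng(0)
# Skewed synthetic targets, chosen so that the mean and median clearly differ.
y = rng.exponential(scale=2.0, size=2001)
grid = np.linspace(y.min(), y.max(), 4001)

c_sq = best_constant(y, np.square, grid)   # minimizer under squared loss
c_abs = best_constant(y, np.abs, grid)     # minimizer under absolute loss

print("squared-loss minimizer :", c_sq, " sample mean  :", y.mean())
print("absolute-loss minimizer:", c_abs, " sample median:", np.median(y))
```

Up to the grid resolution, the two minimizers should line up with the sample mean and the sample median respectively, which is one way to motivate the link between the choice of loss and the statistic being estimated.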
### Part 2 — Logistic Regression
What is logistic regression? Why do logarithms appear in its formulation or loss function?
```hint Where the log enters
The log isn't there by accident — it shows up in more than one place once you write the model out. Trace the path from a raw probability in $(0,1)$ to the linear score, and separately think about how the model's parameters are actually fit. Ask what each step would look like *without* a log and why that breaks.
```
```hint The loss
For independent Bernoulli labels, the likelihood is a *product* of per-example probabilities, and maximum likelihood maximizes that product. What does taking a $\log$ do to a product, and why is that helpful both mathematically and numerically?
```
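The numerical side of the hint above can be checked with a few lines of NumPy. This is a minimal sketch under stated assumptions: the predicted probabilities `p` are random placeholders standing in for a model's outputs, and the labels are sampled from them. It compares the raw Bernoulli likelihood (a product) with the log-likelihood (a sum) in float64.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical predicted P(y = 1) for each example; not from a real model.
p = rng.uniform(0.3, 0.7, size=n)
# Labels drawn so that each y_i is Bernoulli(p_i).
y = (rng.uniform(size=n) < p).astype(float)

# Probability the model assigns to the label that was actually observed.
per_example = np.where(y == 1, p, 1 - p)

likelihood = np.prod(per_example)             # product of 5000 numbers below 1
log_likelihood = np.sum(np.log(per_example))  # same information, as a sum
nll = -log_likelihood / n                     # average log-loss (binary cross-entropy)

print("likelihood     :", likelihood)
print("log-likelihood :", log_likelihood)
print("average log-loss:", nll)
```

Because every factor here is at most 0.7, the product falls far below the smallest representable float64 and is reported as `0.0`, while the sum of logs stays finite and can still be differentiated and optimized.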
### Part 3 — Random Forest
What is a random forest? During tree construction, how is the set of candidate features selected?
```hint Two sources of randomness
A random forest injects randomness in two ways: how the *data* for each tree is drawn, and how *features* are considered at each split. Name both.
```
```hint Feature selection at a split
Consider whether each split gets to look at *all* features or only a restricted set of them, and what a tuning knob controlling that count would be. Then push on *why* deliberately hiding features from a split could make the overall ensemble better rather than worse.
```
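For rehearsal, the two sources of randomness can be sketched in isolation, without building any trees. The function names `bootstrap_indices` and `candidate_features` are hypothetical, and the square-root rule for the subset size is a commonly cited default for classification rather than something the question specifies (scikit-learn exposes the same knob as `max_features`).

```python
import numpy as np

def bootstrap_indices(n_samples, rng):
    """Row indices for one tree: n_samples draws WITH replacement."""
    return rng.integers(0, n_samples, size=n_samples)

def candidate_features(n_features, max_features, rng):
    """Feature indices one split may consider: drawn WITHOUT replacement."""
    return rng.choice(n_features, size=max_features, replace=False)

rng = np.random.default_rng(0)
n_samples, n_features = 100, 16
m = int(np.sqrt(n_features))  # assumed subset size: sqrt of the feature count

rows = bootstrap_indices(n_samples, rng)             # drawn once per tree
split_a = candidate_features(n_features, m, rng)     # redrawn at every split
split_b = candidate_features(n_features, m, rng)

print("distinct rows in this bootstrap sample:", np.unique(rows).size)
print("features visible to split A:", np.sort(split_a))
print("features visible to split B:", np.sort(split_b))
```

The point to notice is that the feature subset is resampled at each split rather than fixed per tree, so a single dominant feature cannot appear at the top of every tree.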
### Part 4 — Adam vs. SGD
Explain the Adam optimizer. What are its advantages and disadvantages compared with vanilla stochastic gradient descent?
```hint What state Adam keeps
Adam combines two ideas you've likely seen in other optimizers, and it does so by keeping per-parameter running statistics of the gradient stream. What two quantities about the recent gradients would each idea want to track, and how would the update use them together? Once you've named them, write the moving-average updates and the final parameter update.
```
```hint Trade-offs to weigh
Be honest about both sides: faster early convergence and per-parameter adaptive rates vs. extra optimizer memory and the frequently reported tendency to generalize worse than well-tuned SGD with momentum on some tasks. Mention how weight decay interacts with Adam (L2 regularization folded into the gradient vs. decoupled weight decay, as in AdamW).
```
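When practicing the update rule, it can help to write one step out in NumPy. The sketch below follows the standard Adam formulation with its usual default hyperparameters. The function names `adam_step` and `sgd_step` and the badly scaled toy quadratic are illustrative assumptions, and the sketch deliberately omits weight decay.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad        # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad**2     # moving average of squared gradients
    m_hat = m / (1 - beta1**t)                # correct the bias from zero init
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

def sgd_step(theta, grad, lr=1e-3):
    """One vanilla SGD update, for comparison."""
    return theta - lr * grad

# Toy objective f(theta) = 0.5 * sum(scales * theta**2) with very different curvatures.
scales = np.array([100.0, 0.01])
theta0 = np.array([1.0, 1.0])
grad0 = scales * theta0

theta_adam, m, v = adam_step(theta0, grad0, np.zeros(2), np.zeros(2), t=1)
theta_sgd = sgd_step(theta0, grad0)

print("Adam step sizes:", theta0 - theta_adam)  # about lr in both coordinates
print("SGD step sizes :", theta0 - theta_sgd)   # proportional to the raw gradient
```

On the first step the bias-corrected ratio reduces to roughly the sign of the gradient, so Adam moves both coordinates by about `lr` despite the 10,000x difference in gradient scale, whereas the SGD step inherits that scale directly. The two buffers `m` and `v` are also the source of Adam's extra memory cost per parameter.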
### Part 5 — Narrow vs. Wide Networks and Local Minima
Consider two neural networks with the same two-layer structure. One has only a few neurons per layer, while the other has many neurons per layer. Which one is more likely to get trapped in a poor local minimum, and why?
```hint Frame it as capacity
Both objectives are non-convex. Think about how the number of parameters affects the *number of low-loss configurations* and how "connected" the good solutions are in the loss landscape.
```
```hint Don't forget the trade-off
A complete answer names which network is more prone to poor local minima / underfitting, but also flags the *cost* of the easier-to-optimize one (what does extra capacity risk if data or regularization is limited?).
```
---
### Follow-up Questions
- For squared loss: how would your answer change if the noise were heavy-tailed (e.g. Laplacian) instead of Gaussian — what loss would maximum likelihood give you then?
- For random forests: how do `n_estimators` and the feature-subset size $m_{\text{try}}$ trade off bias, variance, and decorrelation between trees?
- For Adam: in what concrete settings have you seen (or would you expect) SGD with momentum to generalize better, and what would you try to close the gap?
- For the narrow-vs-wide question: how does the modern overparameterization view (loss-landscape connectivity, flat minima) reconcile with classical "more parameters → more overfitting" intuition?
Quick Answer: This question evaluates core machine learning fundamentals including statistical modeling assumptions and loss functions (linear and logistic regression), ensemble methods and feature sampling in random forests, optimization algorithms (Adam versus stochastic gradient descent), and neural network capacity and training dynamics.