Explain PCA and L2 Normalization in Machine Learning
Scenario
Experian DataLabs Data Scientist technical screen — a machine-learning deep-dive on the modelling choices used in your project, mixed with conceptual questions (some OA / multiple-choice style).
Question
Walk through the core ML concepts behind a binary-classification project, covering preprocessing, modelling, optimization, and evaluation:
- Explain how PCA achieves dimensionality reduction and why you would (or would not) apply L2 normalization before training. Distinguish per-column standardization from per-sample (row) L2 normalization, and say when each matters.
- Derive the logistic-regression gradient via back-propagation, then generalize: describe how backpropagation works in modern multi-layer neural nets.
- What baseline models did you compare against, and why did you ultimately choose logistic regression?
- Define knowledge-informed machine learning and give a concrete example.
- When and how would you move the classification threshold to improve FPR or TPR? Can you improve both FPR and TPR at the same time by moving a single threshold?
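The threshold question above can be made concrete with a small sketch. It assumes a toy set of labels and scores (made up for illustration, not from any real project) and two hypothetical helpers, `rates_at` and `pick_threshold`. It shows that raising a single threshold lowers FPR and TPR together, so one threshold cannot improve both at once, and that a cost-sensitive choice depends on the relative cost of false positives and false negatives.

```python
import numpy as np

def rates_at(y_true, scores, threshold):
    """Return (FPR, TPR, FP, FN) when predicting positive for score >= threshold.

    Assumes both classes are present, so neither denominator is zero.
    """
    y = np.asarray(y_true).astype(bool)
    pred = np.asarray(scores) >= threshold
    tp = int(np.sum(pred & y))
    fp = int(np.sum(pred & ~y))
    fn = int(np.sum(~pred & y))
    tn = int(np.sum(~pred & ~y))
    return fp / (fp + tn), tp / (tp + fn), fp, fn

def pick_threshold(y_true, scores, cost_fp, cost_fn):
    """Pick the observed score that minimizes cost_fp * FP + cost_fn * FN.

    Hypothetical helper: the costs are illustrative business assumptions.
    Ties are broken in favor of the lowest threshold.
    """
    best_t, best_cost = None, float("inf")
    for t in np.unique(scores):  # candidate thresholds, sorted ascending
        _, _, fp, fn = rates_at(y_true, scores, t)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = float(t), cost
    return best_t

# Toy data, invented for this sketch: 5 negatives and 5 positives.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.85, 0.95])

# Sweep the threshold upward: FPR and TPR both fall (or stay flat).
sweep = [rates_at(y_true, scores, t)[:2] for t in (0.3, 0.5, 0.75)]
for t, (fpr, tpr) in zip((0.3, 0.5, 0.75), sweep):
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")

# Expensive misses push the threshold down; expensive false alarms push it up.
t_recall_focused = pick_threshold(y_true, scores, cost_fp=1, cost_fn=5)
t_precision_focused = pick_threshold(y_true, scores, cost_fp=5, cost_fn=1)
print(t_recall_focused, t_precision_focused)
```

Improving both rates at once requires a better-ranking model (a higher ROC curve), not a different operating point on the same curve.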
Hints
Discuss eigenvectors/explained variance, maximum-likelihood gradients (prediction error × input), the chain rule through layers, ROC/PR curves and cost-sensitive thresholds, model-selection criteria, and domain priors/constraints.
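For the eigenvectors and explained-variance hint, the following is a minimal NumPy sketch rather than a reference implementation. It assumes synthetic data with one feature on a much larger scale than the others, and the helper names (`standardize_columns`, `l2_normalize_rows`, `pca`) are hypothetical. It illustrates why per-column standardization usually matters before PCA, and how it differs from per-row L2 normalization, which rescales each sample to unit length.

```python
import numpy as np

def standardize_columns(X):
    """Per-feature z-scoring: each column ends up with mean 0 and std 1."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # leave constant columns unscaled
    return (X - mu) / sigma

def l2_normalize_rows(X):
    """Per-sample scaling: each row ends up with Euclidean norm 1."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return X / norms

def pca(X, k):
    """Project onto the top-k eigenvectors of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)                   # PCA requires centered data
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # sort by variance, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()       # explained-variance ratios
    return Xc @ eigvecs[:, :k], eigvecs[:, :k], explained

# Synthetic data (an assumption for this sketch): columns 0 and 1 share a
# latent factor z, but column 0 is measured on a scale 1000x larger.
rng = np.random.default_rng(0)
z = rng.normal(size=500)
X = np.column_stack([
    1000.0 * (z + 0.5 * rng.normal(size=500)),
    z + 0.5 * rng.normal(size=500),
    rng.normal(size=500),
])

_, _, explained_raw = pca(X, k=2)
_, _, explained_std = pca(standardize_columns(X), k=2)
row_norms = np.linalg.norm(l2_normalize_rows(X), axis=1)

print("raw explained variance:         ", np.round(explained_raw, 4))
print("standardized explained variance:", np.round(explained_std, 4))
```

On the raw data the first component mostly tracks the large-scale column; after standardization it reflects the correlation structure instead. Row L2 normalization does not remove scale differences between columns; it suits cases where only the direction of each sample matters, such as cosine-similarity features on text or embeddings.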
Constraints & Assumptions
- Preserve the scope, facts, inputs, and requested outputs from the prompt above.
- If the prompt leaves a detail unspecified, state a reasonable assumption before relying on it.
- Keep the answer interview-ready: concise enough to present, but concrete enough to implement or evaluate.
Clarifying Questions to Ask
- Clarify the task, data shape, labels, constraints, and evaluation metric.
- State assumptions behind the math or modeling technique you choose.
- Connect theory to practical training, debugging, and deployment implications.
What a Strong Answer Covers
- Correct definitions and formulas where the prompt requires them.
- A practical explanation of how the method behaves on real data.
- Trade-offs, failure modes, diagnostics, and mitigation strategies.
- Evaluation choices that match the product or modeling objective.
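As one instance of the formulas a strong answer would state, the logistic-regression gradient can be derived with the chain rule and checked numerically. The sketch below assumes mean binary cross-entropy loss, random synthetic data, and hypothetical helper names. It shows that the gradient with respect to the weights is the prediction error (p - y) times the input, averaged over samples; the same error term is what gets propagated backward layer by layer in a multi-layer network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(w, b, X, y):
    """Mean binary cross-entropy and its gradient for logistic regression."""
    n = X.shape[0]
    z = X @ w + b                      # forward pass: linear score
    p = sigmoid(z)                     # forward pass: predicted probability
    eps = 1e-12                        # guards against log(0)
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    # Backward pass via the chain rule:
    #   dL/dp = -(y/p - (1-y)/(1-p)),  dp/dz = p(1-p)  =>  dL/dz = p - y
    dz = (p - y) / n                   # prediction error, averaged over samples
    grad_w = X.T @ dz                  # dz/dw = x, so gradient = error x input
    grad_b = dz.sum()                  # dz/db = 1
    return loss, grad_w, grad_b

def numerical_grad_w(w, b, X, y, h=1e-6):
    """Central finite differences, used only to verify the analytic gradient."""
    g = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = h
        g[j] = (loss_and_grad(w + e, b, X, y)[0]
                - loss_and_grad(w - e, b, X, y)[0]) / (2 * h)
    return g

# Synthetic data, an assumption for this sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w = rng.normal(size=3)
b = 0.1

loss, grad_w, grad_b = loss_and_grad(w, b, X, y)
max_abs_diff = np.max(np.abs(grad_w - numerical_grad_w(w, b, X, y)))
print("max |analytic - numerical| =", max_abs_diff)
```

A gradient check like this is also a practical debugging step: a large mismatch usually points to a sign error or a missing averaging factor in the backward pass.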
Follow-up Questions
- How would noisy labels, class imbalance, or distribution shift affect the answer?
- What would you monitor after deployment?
- Which baseline would you compare against first?