Context
You built a binary sentiment classification model (e.g., positive vs. negative) and need to deploy it in a product where actions depend on the model’s output.
Questions
- Walk through your ML pipeline **end-to-end** (a minimal sketch follows this question):
  - Data sourcing/labeling and dataset construction (train/validation/test splits).
  - Feature design or model choice (e.g., TF-IDF + linear model vs. transformer).
  - Training procedure and evaluation setup.
  - Key practical challenges (class imbalance, noisy labels, distribution shift) and how you handled them.
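A minimal sketch of one such pipeline, assuming a TF-IDF + logistic regression baseline in scikit-learn and a hypothetical `reviews.csv` with `text` and `label` columns (none of these names come from the question itself):

```python
# Baseline sentiment pipeline: TF-IDF features + logistic regression.
# `reviews.csv`, `text`, and `label` are illustrative placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("reviews.csv")

# Stratified 60/20/20 split so both classes are represented in every set.
train_df, temp_df = train_test_split(
    df, test_size=0.4, stratify=df["label"], random_state=42
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.5, stratify=temp_df["label"], random_state=42
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_features=100_000)),
    # class_weight="balanced" is one simple lever against class imbalance.
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
pipeline.fit(train_df["text"], train_df["label"])

# Tune and analyze on the validation set; touch the test set only once, at the end.
val_pred = pipeline.predict(val_df["text"])
print(classification_report(val_df["label"], val_pred, digits=3))
```

The later sketches in this section reuse `df`, `pipeline`, `val_df`, and `train_test_split` from this block.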
- Modeling choices:
  - Why did you choose method/model **X** over alternatives?
  - What assumptions does it make, and what trade-offs does it introduce (latency, interpretability, cost, robustness)? (The coefficient-inspection sketch below illustrates the interpretability point.)
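One way to make the interpretability trade-off tangible: the linear baseline exposes per-n-gram weights that can be read directly, which a transformer does not offer out of the box. This continues the hypothetical `pipeline` from the previous sketch:

```python
# Inspect the linear model's strongest sentiment cues (illustrative only).
import numpy as np

vectorizer = pipeline.named_steps["tfidf"]
clf = pipeline.named_steps["clf"]

feature_names = vectorizer.get_feature_names_out()  # scikit-learn >= 1.0
coefs = clf.coef_[0]  # one weight per TF-IDF feature in the binary model

top_positive = np.argsort(coefs)[-10:][::-1]
top_negative = np.argsort(coefs)[:10]
print("positive cues:", feature_names[top_positive])
print("negative cues:", feature_names[top_negative])
```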
- Iterative refinement:
  - Describe how you **iteratively improved** the system (e.g., error analysis → new features/data → retrain → re-evaluate); a small error-analysis sketch follows this question.
  - What were your biggest learnings, and what would you do differently next time?
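One step in that loop might look like the following, again continuing the hypothetical names from the pipeline sketch: rank the validation mistakes by model confidence and review them by hand to decide what data or features to add next.

```python
# Surface the most confidently wrong validation examples for manual review.
val_proba = pipeline.predict_proba(val_df["text"])[:, 1]

analysis = val_df.copy()
analysis["p_positive"] = val_proba
analysis["pred"] = (val_proba >= 0.5).astype(int)

mistakes = analysis[analysis["pred"] != analysis["label"]].copy()
# High-confidence mistakes often point at label noise, sarcasm, negation,
# or domain vocabulary the features miss.
mistakes["confidence"] = (mistakes["p_positive"] - 0.5).abs()
print(
    mistakes.sort_values("confidence", ascending=False)
    .head(20)[["text", "label", "p_positive"]]
)
```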
- Threshold selection for deployment:
  - Your model outputs a probability score. How do you choose the **decision threshold**? (A sketch covering the cases below follows this question.)
  - Which metrics would you consider (precision, recall, F1, ROC-AUC, PR-AUC, cost-weighted loss), and which would be **primary vs. diagnostic vs. guardrail**?
  - How would the answer change under:
    - Severe class imbalance
    - Different costs for false positives vs. false negatives
    - A fixed review/ops capacity (e.g., only 1,000 items/day can be escalated)
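A sketch of how the threshold choice plays out under each of these regimes, using the validation scores from the earlier pipeline sketch; the costs, capacity, and traffic numbers are placeholders, not values from the question:

```python
# Threshold selection on validation scores (all costs/volumes are placeholders).
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_val = val_df["label"].to_numpy()
scores = pipeline.predict_proba(val_df["text"])[:, 1]

# Diagnostic metric that doesn't depend on a threshold.
print(f"PR-AUC (useful under imbalance): {average_precision_score(y_val, scores):.3f}")

# 1) Metric-driven: threshold that maximizes F1 on validation.
precision, recall, thresholds = precision_recall_curve(y_val, scores)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best_f1_threshold = thresholds[np.argmax(f1[:-1])]

# 2) Cost-driven: minimize expected cost with asymmetric FP/FN costs.
COST_FP, COST_FN = 1.0, 5.0  # placeholder business costs

def expected_cost(threshold):
    flagged = scores >= threshold
    false_positives = np.sum(flagged & (y_val == 0))
    false_negatives = np.sum(~flagged & (y_val == 1))
    return COST_FP * false_positives + COST_FN * false_negatives

best_cost_threshold = min(np.linspace(0.01, 0.99, 99), key=expected_cost)

# 3) Capacity-driven: flag only as many items as reviewers can handle,
#    which turns the threshold question into a top-K ranking question.
DAILY_CAPACITY = 1_000   # placeholder review capacity
DAILY_VOLUME = 20_000    # placeholder expected traffic
capacity_threshold = np.quantile(scores, 1 - DAILY_CAPACITY / DAILY_VOLUME)

print(f"F1-optimal threshold:           {best_f1_threshold:.3f}")
print(f"Cost-optimal threshold:         {best_cost_threshold:.3f}")
print(f"Capacity-constrained threshold: {capacity_threshold:.3f}")
```

Under severe imbalance, accuracy and ROC-AUC can look healthy while precision at the deployed threshold is poor, which is why PR-based metrics and the cost-weighted view tend to carry more weight in the final choice.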
- Metric definition:
  - If a stakeholder proposes defining “success” as metric XXX, how do you evaluate whether that definition is appropriate?
  - What data issues (label leakage, delayed labels, sampling bias) could make the metric misleading? (A quick split-comparison check is sketched below.)
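One quick diagnostic for such issues, assuming the dataset has a timestamp column (here a hypothetical `created_at`): compare a random split against a temporal split with the same model; a large gap is a common symptom of leakage or distribution shift that would make the offline metric misleading.

```python
# Compare a random split vs. a temporal split to surface leakage/shift.
# Reuses `df`, `pipeline`, and `train_test_split` from the pipeline sketch;
# the "created_at" column is a hypothetical timestamp.
from sklearn.base import clone
from sklearn.metrics import roc_auc_score

def fit_and_score(train, test):
    model = clone(pipeline)  # fresh, unfitted copy of the same pipeline
    model.fit(train["text"], train["label"])
    return roc_auc_score(test["label"], model.predict_proba(test["text"])[:, 1])

# Random split: optimistic if near-duplicates or the same users/products
# end up on both sides.
rand_train, rand_test = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=0
)

# Temporal split: train on the past, evaluate on the most recent 20%.
df_sorted = df.sort_values("created_at")
cutoff = int(0.8 * len(df_sorted))
time_train, time_test = df_sorted.iloc[:cutoff], df_sorted.iloc[cutoff:]

print(f"ROC-AUC, random split:   {fit_and_score(rand_train, rand_test):.3f}")
print(f"ROC-AUC, temporal split: {fit_and_score(time_train, time_test):.3f}")
```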