How do I approach Machine Learning interview questions?

Machine Learning interview questions require a solid understanding of core concepts, plus practice. PracHub provides solutions with explanations to help you prepare for machine learning interviews.

What difficulty level is this interview question?

This is an easy-difficulty Machine Learning question, commonly asked during Onsite rounds at Amazon.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Amazon during technical interviews.

Evaluate NLP Classification Models | Amazon Interview Question

Q: Evaluate NLP Classification Models

This question evaluates competency in NLP classification model evaluation, covering understanding of confusion matrices, precision/recall/F1/AUC metrics, thresholding and trade-offs, handling class imbalance, and connecting metric choices to operational actions and business costs.

You are interviewing for a Data Scientist internship at Amazon. The interviewer asks you to walk through how you think about an NLP classification project — for example, classifying customer messages, search queries, or support tickets into categories — and then probes your understanding of model evaluation fundamentals.

Work through the parts below. The goal is to demonstrate that you can explain core metrics clearly, reason about tradeoffs in terms of business cost, and connect evaluation choices to real product decisions.

Constraints & Assumptions

The setting is a real-world classification system (binary or multi-class), not a clean academic benchmark.
Classes may be imbalanced — the category you care most about (e.g. a severe policy violation, fraud, an escalation) is often rare.
The model outputs scores or probabilities, and a decision threshold converts those into actions.
Predictions may feed an automated action (banning, blocking, routing) or a human-in-the-loop review queue.
Some answers (Part 1) must be understandable by a non-technical audience; others (Parts 2–5) expect precise definitions and formulas.

Part 1 — Explain a confusion matrix to high school students

Explain what a confusion matrix is to a group of high school students. Avoid jargon; use a concrete, relatable example.

Part 2 — Define precision, recall, F1, and AUC

Give the precise definition of precision, recall, F1 score, and AUC, including formulas where they exist. State in one sentence what each one tells you.
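As a hedged illustration of the definitions this part asks for, the sketch below computes precision, recall, and F1 from confusion-matrix counts, and ROC AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The function names, the zero-denominator convention (return 0.0), and the toy numbers are assumptions made for the example, not part of the question.

```python
def precision(tp, fp):
    # Of everything predicted positive, the fraction that is truly positive.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Of everything truly positive, the fraction the model found.
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if (p + r) else 0.0

def roc_auc(labels, scores):
    # Pairwise form of ROC AUC; assumes both classes are present.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy counts (assumed): 8 true positives, 2 false positives, 8 false negatives.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=8)
f = f1(p, r)

# Toy scores (assumed): 3 of the 4 positive/negative pairs are ranked correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

The pairwise AUC is quadratic in the number of examples, which is fine for building intuition; a rank-based computation is the usual choice on large evaluation sets.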

Part 3 — Precision vs. recall: which to prioritize when

Explain when you would prioritize precision over recall, and when you would prioritize recall over precision. Tie your reasoning to a decision criterion, not just examples.
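One way to make the decision criterion concrete is to assign a cost to each error type and choose the threshold that minimizes total cost on a validation set. The sketch below does this under assumed, hypothetical costs (`cost_fp`, `cost_fn`) and toy data; the names and numbers are illustrative only.

```python
def total_cost(labels, scores, threshold, cost_fp, cost_fn):
    # Sum the cost of every false positive and false negative at this threshold.
    cost = 0.0
    for y, s in zip(labels, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 0:
            cost += cost_fp
        elif pred == 0 and y == 1:
            cost += cost_fn
    return cost

def best_threshold(labels, scores, cost_fp, cost_fn, candidates):
    # Pick the candidate threshold with the lowest total cost.
    return min(candidates,
               key=lambda t: total_cost(labels, scores, t, cost_fp, cost_fn))

# Toy validation data (assumed).
labels = [0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.9]
candidates = [0.3, 0.5, 0.65]

# Missing a positive is 10x worse than a false alarm: favors recall.
recall_first = best_threshold(labels, scores, cost_fp=1, cost_fn=10,
                              candidates=candidates)

# A false alarm is 10x worse than a miss: favors precision.
precision_first = best_threshold(labels, scores, cost_fp=10, cost_fn=1,
                                 candidates=candidates)
```

With these toy numbers the recall-first setting selects the lowest threshold (0.3) and the precision-first setting selects the highest (0.65), which is the tradeoff the question asks you to justify in cost terms.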

Part 4 — Evaluating a multi-class classification model

How would you evaluate a multi-class classification model (one of $K$ categories)? Go beyond a single accuracy number.
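To show why a single accuracy number can hide failures, here is a minimal sketch that computes per-class F1 and macro F1 next to accuracy. The category names and predictions are invented for illustration, and classes with no predictions and no true examples are scored 0.0 by assumption.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    # Count true positives, false positives, and false negatives per class.
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN), equivalent to the harmonic-mean form.
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    # Unweighted mean over classes, so rare classes count as much as common ones.
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data (assumed): the model predicts the majority class for everything.
y_true = ["billing"] * 8 + ["shipping"] + ["fraud"]
y_pred = ["billing"] * 10

acc = accuracy(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
by_class = per_class_f1(y_true, y_pred)
```

Here accuracy is 0.8 while macro F1 is below 0.3, because the two rare classes are never predicted. In single-label multi-class problems, micro-averaged F1 equals accuracy, so it inherits the same blind spot.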

Part 5 — Cross-entropy loss

What is cross-entropy loss, and why is it so commonly used for classification? Give the binary and multi-class forms.
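For reference, the two forms this part asks for can be written as follows. The notation is an assumption chosen for this sketch: $N$ examples, true label $y_i \in \{0, 1\}$ and predicted positive-class probability $\hat{p}_i$ in the binary case; one-hot label $y_{ik}$ and predicted probability $\hat{p}_{ik}$ for class $k$ of $K$ in the multi-class case.

$$
\mathcal{L}_{\text{binary}} = -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \Big]
$$

$$
\mathcal{L}_{\text{multi}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log \hat{p}_{ik}
$$

With $K = 2$ and $\hat{p}_{i1} = 1 - \hat{p}_{i2}$, the multi-class form reduces to the binary one. Because $y_{ik}$ is one-hot, the inner sum keeps only the log-probability assigned to the true class.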

Part 6 — When human evaluation beats an automatic metric

In what situations can human evaluation be better than using an automatic objective function or metric? Also note the costs and pitfalls of human evaluation.

Part 7 — Describing an NLP project you've done

If asked to describe an NLP project you have done, what technical details and tradeoffs should you discuss? Lay out the structure of a strong answer.

Clarifying Questions to Ask

Is the task binary or multi-class, and how many categories are there?
How imbalanced are the classes, and which class matters most to the business?
What happens downstream with a prediction — an automated action, or routing to a human reviewer?
What are the relative costs of a false positive vs. a false negative in this product?
Is there a latency or cost budget that constrains the model choice?

What a Strong Answer Covers

Clarity for the audience: Part 1 is genuinely accessible to non-experts; Parts 2–5 are precise and formula-correct.
Cost-based reasoning: precision/recall and threshold choices are justified by the cost of each error type, not by rote examples.
Imbalance awareness: recognizes that accuracy misleads on rare classes, and reaches for macro F1, PR AUC, or per-class recall.
Correct, well-stated formulas for precision, recall, F1, $F_\beta$, and cross-entropy (binary and multi-class).
Evaluation depth: confusion matrix + per-class metrics + averaging choice + calibration + subgroup slices.
Honest tradeoffs: automatic vs. human evaluation, model complexity vs. interpretability, offline metric vs. online impact.
Connection to decisions: the best metric is the one that captures the product's cost of mistakes, not the most sophisticated one.
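The $F_\beta$ score named in the list above generalizes F1 by weighting recall $\beta$ times as heavily as precision. A minimal sketch, with an assumed function name and toy precision and recall values:

```python
def f_beta(precision, recall, beta):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives F1.
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0

# Toy operating point (assumed): precision 0.8, recall 0.5.
f1_score = f_beta(0.8, 0.5, beta=1)
f2_score = f_beta(0.8, 0.5, beta=2)      # recall-weighted, pulled toward 0.5
f_half_score = f_beta(0.8, 0.5, beta=0.5)  # precision-weighted, pulled toward 0.8
```

Choosing $\beta$ is another way of stating the relative cost of a missed positive versus a false alarm.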

Follow-up Questions

Your model has ROC AUC of 0.95 but business stakeholders complain it's "useless" on the rare class. What's likely going on, and what would you measure instead?
Predictions feed an automated action. The model's probabilities are used as confidence to decide whether to auto-action or send to human review. How do you check the probabilities are trustworthy?
Suppose your offline macro F1 improved after a model change but the online business metric got worse. How do you reconcile and diagnose that?
How would you design a human evaluation study so its results are reproducible and not just one annotator's opinion?
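For the follow-up about whether probabilities are trustworthy, one common check is a reliability table: bin predictions by predicted probability and compare the mean predicted probability in each bin with the observed positive rate. The sketch below is a simplified binary version with equal-width bins; the function names, bin count, and toy data are assumptions for illustration.

```python
def reliability_bins(labels, probs, n_bins=5):
    # Group examples into equal-width probability bins.
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    rows = []
    for b in bins:
        if b:
            mean_conf = sum(p for _, p in b) / len(b)
            pos_rate = sum(y for y, _ in b) / len(b)
            rows.append((len(b), mean_conf, pos_rate))
    return rows

def expected_calibration_error(labels, probs, n_bins=5):
    # Size-weighted average gap between predicted probability and observed rate.
    n = len(labels)
    return sum(count / n * abs(conf - rate)
               for count, conf, rate in reliability_bins(labels, probs, n_bins))

# Toy data (assumed): the 0.9-confidence predictions are right only half the time.
labels = [0, 0, 1, 0]
probs = [0.1, 0.1, 0.9, 0.9]

rows = reliability_bins(labels, probs)
ece = expected_calibration_error(labels, probs)
```

A large gap in the high-confidence bins is the signal that matters most here, since those are the predictions that would be auto-actioned rather than sent to human review.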
