PracHub

Evaluate NLP Classification Models

Last updated: Jun 18, 2026

Quick Overview

This question tests how well you can evaluate NLP classification models: reading confusion matrices, defining precision, recall, F1, and AUC, choosing thresholds and weighing their trade-offs, handling class imbalance, and tying metric choices to operational actions and business costs.


Company: Amazon

Role: Data Scientist

Category: Machine Learning

Difficulty: easy

Interview Round: Onsite

### Hints

Hints for each part of the question stated below.

**Part 1**

```hint Anchor on one example
Pick a single binary yes/no scenario they already understand (spam-or-not, sick-or-healthy) and name the four cells in plain words before you ever say "true positive."
```

**Part 2**

```hint Build on the four cells
Every one of these is a ratio or summary of the TP/FP/FN/TN counts from Part 1, except AUC, which is a threshold-free ranking measure. Think about which mistake each ratio puts in its denominator.
```

```hint AUC interpretation
ROC AUC has a clean one-sentence probabilistic reading that uses the words "randomly chosen positive" and "randomly chosen negative". Can you state it? Also think about when PR AUC is the better choice than ROC AUC.
```

**Part 3**

```hint The deciding question
Frame it as: which is more expensive here, a false positive or a false negative? Then connect that to where you set the decision **threshold** on the model's score; moving it trades one metric for the other. An $F_\beta$ score lets you bake the chosen tradeoff into a single number.
```

**Part 4**

```hint Don't trust one number
A $K\times K$ confusion matrix plus **per-class** precision/recall/F1 reveals failures that overall accuracy hides, especially on rare classes.
```

```hint Averaging matters
Micro vs. macro vs. weighted averaging answer different questions. Decide which one surfaces rare-class performance, and consider calibration and subgroup slices too.
```

**Part 5**

```hint What it measures
Think about what two things you are comparing when you write the formula, and consider how severely the loss behaves when the model is very confident but completely wrong. Why does that behavior, combined with differentiability, make it gradient-descent-friendly?
```

**Part 6**

```hint Where proxies break down
Think about tasks where the automatic metric only measures *surface* similarity, not meaning (summarization, search relevance, generation). Then weigh human eval's own weaknesses (cost, rater inconsistency) and how you'd control for them.
```

**Part 7**

```hint Walk the lifecycle
Cover problem framing, data and labeling, baselines vs. complex models, the metric you chose *and why*, error analysis, and deployment/impact. Interviewers reward the candidate who justifies a *simpler* model and shows what they learned from mistakes.
```

Apr 3, 2026, 12:00 AM

You are interviewing for a Data Scientist internship at Amazon. The interviewer asks you to walk through how you think about an NLP classification project — for example, classifying customer messages, search queries, or support tickets into categories — and then probes your understanding of model evaluation fundamentals.

Work through the parts below. The goal is to demonstrate that you can explain core metrics clearly, reason about tradeoffs in terms of business cost, and connect evaluation choices to real product decisions.

Constraints & Assumptions

  • The setting is a real-world classification system (binary or multi-class), not a clean academic benchmark.
  • Classes may be imbalanced — the category you care most about (e.g. a severe policy violation, fraud, an escalation) is often rare.
  • The model outputs scores or probabilities, and a decision threshold converts those into actions.
  • Predictions may feed an automated action (banning, blocking, routing) or a human-in-the-loop review queue.
  • Some answers (Part 1) must be understandable by a non-technical audience; others (Parts 2–5) expect precise definitions and formulas.

Part 1 — Explain a confusion matrix to high school students

Explain what a confusion matrix is to a group of high school students. Avoid jargon; use a concrete, relatable example.
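
One way to make the four cells concrete is to count them for a small made-up spam filter. This is a minimal sketch: the function name `confusion_counts`, the label convention (1 = spam, 0 = not spam), and the eight example messages are assumptions for illustration, not part of the question.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count the four confusion-matrix cells for a yes/no classifier."""
    counts = Counter({"TP": 0, "FP": 0, "FN": 0, "TN": 0})
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            counts["TP"] += 1  # said "spam", and it was spam
        elif guess == positive:
            counts["FP"] += 1  # said "spam", but it was a normal message
        elif truth == positive:
            counts["FN"] += 1  # said "not spam", but it was spam
        else:
            counts["TN"] += 1  # said "not spam", and it was a normal message
    return counts


# Hypothetical data: 3 spam messages and 5 normal ones.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
print(dict(counts))
```

In plain words, this filter caught 2 spam messages, missed 1, wrongly flagged 1 normal message, and correctly left 4 normal messages alone.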

Part 2 — Define precision, recall, F1, and AUC

Give the precise definition of precision, recall, F1 score, and AUC, including formulas where they exist. State in one sentence what each one tells you.
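
The standard definitions are precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2PR / (P + R); ROC AUC can be read as the probability that a randomly chosen positive is scored above a randomly chosen negative. The sketch below computes them directly from those definitions. The function names are hypothetical, returning 0.0 for an undefined ratio is an assumed convention, and the pairwise AUC loop is quadratic, so it is meant for illustration only.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    # Assumption: report 0.0 when a ratio is undefined (zero denominator).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical counts and scores, chosen only to exercise the functions.
precision, recall, f1 = precision_recall_f1(tp=2, fp=1, fn=1)
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(precision, recall, f1, auc)
```

Note that precision puts false positives in its denominator and recall puts false negatives there, which is why each responds to a different kind of mistake.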

Part 3 — Precision vs. recall: which to prioritize when

Explain when you would prioritize precision over recall, and when you would prioritize recall over precision. Tie your reasoning to a decision criterion, not just examples.
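
One possible decision criterion is expected cost: assign a cost to each false positive and each false negative, then pick the threshold that minimizes the total on a validation set. The sketch below illustrates that idea together with $F_\beta = (1+\beta^2)PR / (\beta^2 P + R)$. The function names, the candidate thresholds, the costs, and the six example scores are all assumptions for illustration.

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 weights recall more, beta < 1 weights precision more."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


def pick_threshold(y_true, scores, cost_fp, cost_fn, thresholds):
    """Return (threshold, total_cost) minimizing cost_fp * FP + cost_fn * FN."""
    best = None
    for th in thresholds:
        # An example is predicted positive when its score is >= the threshold.
        fp = sum(1 for y, s in zip(y_true, scores) if s >= th and y == 0)
        fn = sum(1 for y, s in zip(y_true, scores) if s < th and y == 1)
        cost = cost_fp * fp + cost_fn * fn
        if best is None or cost < best[1]:
            best = (th, cost)
    return best


# Hypothetical validation data.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1]
candidates = [0.25, 0.5, 0.8]

# Missing a positive is 10x worse than a false alarm: a low threshold wins.
recall_first = pick_threshold(y_true, scores, cost_fp=1, cost_fn=10, thresholds=candidates)
# A false alarm is 10x worse than a miss: a high threshold wins.
precision_first = pick_threshold(y_true, scores, cost_fp=10, cost_fn=1, thresholds=candidates)
print(recall_first, precision_first)
```

With these made-up numbers, the recall-first costs select threshold 0.25 and the precision-first costs select 0.8, which is the trade-off the question asks you to articulate.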

Part 4 — Evaluating a multi-class classification model

How would you evaluate a multi-class classification model (one of $K$ categories)? Go beyond a single accuracy number.
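
As an illustration of going beyond accuracy, the sketch below derives per-class precision, recall, and F1 from a $K \times K$ confusion matrix and compares macro, weighted, and micro averages. It assumes rows are true classes and columns are predicted classes, and that the matrix is non-empty; the function names and the three-class matrix (with a rare third class) are made up for the example.

```python
def per_class_metrics(cm):
    """Per-class precision/recall/F1 from a confusion matrix (rows = true, cols = predicted)."""
    k = len(cm)
    results = []
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # true class c, predicted as something else
        fp = sum(cm[r][c] for r in range(k)) - tp  # predicted c, actually something else
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append({"precision": precision, "recall": recall, "f1": f1, "support": tp + fn})
    return results


def averaged_f1(cm):
    """Macro, support-weighted, and micro F1 for single-label multi-class predictions."""
    metrics = per_class_metrics(cm)
    total = sum(m["support"] for m in metrics)
    macro = sum(m["f1"] for m in metrics) / len(metrics)
    weighted = sum(m["f1"] * m["support"] for m in metrics) / total
    # With exactly one label per example, micro-averaged F1 equals accuracy.
    micro = sum(cm[c][c] for c in range(len(cm))) / total
    return {"macro": macro, "weighted": weighted, "micro": micro}


# Hypothetical matrix: classes 0 and 1 are common, class 2 is rare and mostly missed.
cm = [
    [90, 5, 5],
    [10, 80, 10],
    [8, 1, 1],
]
per_class = per_class_metrics(cm)
averages = averaged_f1(cm)
print(per_class[2], averages)
```

In this made-up case micro F1 (accuracy) stays above 0.8 while the rare class has recall 0.1, and only the macro average drops enough to expose it.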

Part 5 — Cross-entropy loss

What is cross-entropy loss, and why is it so commonly used for classification? Give the binary and multi-class forms.
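
For reference, the binary form for one example with label $y \in \{0, 1\}$ and predicted probability $p$ is $-[y \log p + (1 - y)\log(1 - p)]$, and the multi-class form with a one-hot label is $-\sum_{k} y_k \log p_k$, which reduces to $-\log p_{\text{true}}$. The sketch below evaluates both for a single example. The function names and the `eps` clipping value are assumptions; clipping is only there to avoid `log(0)`.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy for one example: y in {0, 1}, p = predicted P(y = 1)."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def categorical_cross_entropy(true_class, probs, eps=1e-12):
    """Multi-class cross-entropy for one example with a one-hot label."""
    # Only the true class has y_k = 1, so the sum collapses to one term.
    return -math.log(max(probs[true_class], eps))


# The loss is small when the model is confident and right,
# and grows sharply when it is confident and wrong.
confident_right = binary_cross_entropy(1, 0.9)
confident_wrong = binary_cross_entropy(1, 0.01)
three_class = categorical_cross_entropy(2, [0.1, 0.2, 0.7])
print(confident_right, confident_wrong, three_class)
```

The steep penalty for confident mistakes is the behavior the question points at when it asks why this loss is so widely used.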

Part 6 — When human evaluation beats an automatic metric

In what situations can human evaluation be better than using an automatic objective function or metric? Also note the costs and pitfalls of human evaluation.
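
One pitfall of human evaluation is rater inconsistency, and one common way to quantify it (among several) is Cohen's kappa, which corrects the raw agreement between two raters for the agreement expected by chance. The sketch below is a minimal version for two raters; the function name and the four example ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must label the same non-empty set of items")
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of four model outputs.
rater_a = ["good", "good", "bad", "bad"]
rater_b = ["good", "bad", "bad", "bad"]
kappa = cohens_kappa(rater_a, rater_b)
print(kappa)
```

Here the raters agree on 3 of 4 items, but half of that agreement is expected by chance, so kappa is 0.5 rather than 0.75.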

Part 7 — Describing an NLP project you've done

If asked to describe an NLP project you have done, what technical details and tradeoffs should you discuss? Lay out the structure of a strong answer.

Clarifying Questions to Ask

  • Is the task binary or multi-class, and how many categories are there?
  • How imbalanced are the classes, and which class matters most to the business?
  • What happens downstream with a prediction — an automated action, or routing to a human reviewer?
  • What are the relative costs of a false positive vs. a false negative in this product?
  • Is there a latency or cost budget that constrains the model choice?

What a Strong Answer Covers

  • Clarity for the audience: Part 1 is genuinely accessible to non-experts; Parts 2–5 are precise and formula-correct.
  • Cost-based reasoning: precision/recall and threshold choices are justified by the cost of each error type, not by rote examples.
  • Imbalance awareness: recognizes that accuracy misleads on rare classes, and reaches for macro F1, PR AUC, or per-class recall.
  • Correct, well-stated formulas for precision, recall, F1, $F_\beta$, and cross-entropy (binary and multi-class).
  • Evaluation depth: confusion matrix + per-class metrics + averaging choice + calibration + subgroup slices.
  • Honest tradeoffs: automatic vs. human evaluation, model complexity vs. interpretability, offline metric vs. online impact.
  • Connection to decisions: the best metric is the one that captures the product's cost of mistakes, not the most sophisticated one.
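
To illustrate the imbalance point above, the sketch below builds a made-up ranking with one positive among 100 examples and compares ROC AUC with average precision (a common summary of the precision-recall curve). The function names and data are assumptions, and this simple average-precision loop does not handle tied scores specially.

```python
def pairwise_roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of precision-at-rank taken at the rank of each positive example."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


# Hypothetical data: 5 negatives outrank the single positive; 94 negatives sit below it.
y_true = [0] * 5 + [1] + [0] * 94
scores = [0.99, 0.98, 0.97, 0.96, 0.95, 0.90] + [0.5 - 0.005 * i for i in range(94)]

auc = pairwise_roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
print(round(auc, 3), round(ap, 3))
```

In this constructed case ROC AUC is about 0.95 because the positive outranks 94 of 99 negatives, yet average precision is only about 0.17 because five false alarms come before the first true hit.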

Follow-up Questions

  • Your model has ROC AUC of 0.95 but business stakeholders complain it's "useless" on the rare class. What's likely going on, and what would you measure instead?
  • Predictions feed an automated action. The model's probabilities are used as confidence to decide whether to auto-action or send to human review. How do you check the probabilities are trustworthy?
  • Suppose your offline macro F1 improved after a model change but the online business metric got worse. How do you reconcile and diagnose that?
  • How would you design a human evaluation study so its results are reproducible and not just one annotator's opinion?
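
For the follow-up about trustworthy probabilities, one common check is to bin predictions by confidence and compare the average predicted probability in each bin with the observed positive rate. The sketch below computes a binary expected calibration error (ECE) along those lines. The function name, the ten equal-width bins, and the toy data are assumptions, and this is only one of several calibration diagnostics.

```python
def expected_calibration_error(y_true, probs, n_bins=10):
    """Binary ECE: weighted gap between mean predicted P(y = 1) and observed positive rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for _, p in bucket) / len(bucket)
        positive_rate = sum(y for y, _ in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(mean_prob - positive_rate)
    return ece


# Hypothetical cases: predictions of 0.75 that are right 75% of the time,
# and predictions of 0.9 that are right only 25% of the time.
well_calibrated = expected_calibration_error([1, 1, 1, 0], [0.75] * 4)
overconfident = expected_calibration_error([1, 0, 0, 0], [0.9] * 4)
print(well_calibrated, overconfident)
```

A large gap in the high-confidence bins would argue against auto-actioning on those scores until the model is recalibrated or the auto-action threshold is raised.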

© 2026 PracHub. All rights reserved.