PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates a data scientist's competency in model evaluation for imbalanced binary classification, covering understanding of precision–recall dynamics, calibration, AUPRC, cost‑sensitive metrics, and practical implementation concerns like computing thresholds and handling ties and numerical edge cases; domain: Coding & Algorithms and machine learning model evaluation. It is commonly asked in technical interviews because it assesses both conceptual understanding of evaluation trade‑offs and the practical application of efficient, robust metric computation and threshold selection under asymmetric costs and operational constraints.

  • Medium
  • Google
  • Coding & Algorithms
  • Data Scientist

Compute precision–recall curve on imbalanced data

Company: Google

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Technical Screen

You receive a CSV with columns: actual_label ∈ {0,1} and predicted_prob ∈ [0,1]; the positive class rate is ≈5%. a) Which evaluation metrics would you prioritize and why (PR curve/AUPRC, calibration, cost‑sensitive metrics; pitfalls of accuracy/ROC in heavy imbalance)? b) Write a Python function that returns thresholds, precision, recall, and F1 across all unique predicted_prob values; handle ties, empty denominators, and enforce monotonic precision if needed. c) Compute AUPRC efficiently and discuss the effect of score calibration on the PR curve. d) Describe how you would pick an operating threshold given asymmetric costs and volume constraints.

Quick Answer: This question evaluates a data scientist's competency in model evaluation for imbalanced binary classification, covering understanding of precision–recall dynamics, calibration, AUPRC, cost‑sensitive metrics, and practical implementation concerns like computing thresholds and handling ties and numerical edge cases; domain: Coding & Algorithms and machine learning model evaluation. It is commonly asked in technical interviews because it assesses both conceptual understanding of evaluation trade‑offs and the practical application of efficient, robust metric computation and threshold selection under asymmetric costs and operational constraints.

Part 1: Metric Priority Report for Imbalanced Binary Classification

You are given true binary labels and predicted probabilities for a binary classifier. Implement a deterministic evaluator that summarizes class imbalance, compares threshold accuracy against the majority-class baseline, and recommends which metric families should be prioritized. The report should also flag common pitfalls such as relying on accuracy under heavy imbalance or trusting ROC-style summaries when precision is operationally important.

Constraints

  • 0 <= len(actual_labels) <= 100000
  • len(actual_labels) == len(predicted_probs)
  • Each actual label is either 0 or 1
  • 0.0 <= each predicted probability <= 1.0
  • 0.0 <= threshold <= 1.0

Examples

Input: ([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.4], 0.5)

Expected Output: ['positive_rate=0.050000', 'majority_accuracy=0.950000', 'threshold_accuracy=0.950000', 'balanced_accuracy=0.500000', 'primary=AUPRC|precision_recall_curve|calibration|cost_sensitive_utility', 'pitfalls=accuracy_can_match_majority_baseline|roc_auc_can_hide_low_precision_under_imbalance|model_accuracy_not_better_than_majority_baseline']

Explanation: The dataset has only 5% positives. Predicting all negatives achieves 95% accuracy, so accuracy is misleading.

Input: ([0, 1, 0, 1], [0.1, 0.8, 0.6, 0.7], 0.5)

Expected Output: ['positive_rate=0.500000', 'majority_accuracy=0.500000', 'threshold_accuracy=0.750000', 'balanced_accuracy=0.750000', 'primary=ROC_AUC|AUPRC|calibration|cost_sensitive_utility', 'pitfalls=none']

Explanation: The data is balanced and threshold accuracy is better than the majority baseline.

Input: ([], [], 0.5)

Expected Output: ['positive_rate=0.000000', 'majority_accuracy=0.000000', 'threshold_accuracy=0.000000', 'balanced_accuracy=0.000000', 'primary=none', 'pitfalls=empty_dataset']

Explanation: The empty-input edge case cannot support metric prioritization.

Input: ([0, 0, 0], [0.2, 0.7, 0.1], 0.5)

Expected Output: ['positive_rate=0.000000', 'majority_accuracy=1.000000', 'threshold_accuracy=0.666667', 'balanced_accuracy=0.666667', 'primary=calibration|class_distribution_check', 'pitfalls=single_class_dataset_auc_undefined|accuracy_can_match_majority_baseline|model_accuracy_not_better_than_majority_baseline']

Explanation: A single-class dataset makes ROC/PR AUC undefined as a model-comparison signal.

Hints

  1. The majority-class baseline is max(number_of_positives, number_of_negatives) / n.
  2. When one class is very rare, accuracy can be high even for a useless classifier; prioritize PR/AUPRC-style metrics and cost-sensitive evaluation.

Part 2: Compute Precision, Recall, and F1 at Every Unique Probability Threshold

Given true binary labels and predicted probabilities, compute the precision, recall, and F1 score for every unique predicted probability used as a threshold. For a threshold t, predict positive for every row whose predicted probability is at least t. Equal scores must be handled as a group, so tied probabilities enter the confusion counts together. Optionally, return an interpolated monotonic precision envelope.

Constraints

  • 0 <= len(actual_labels) <= 100000
  • len(actual_labels) == len(predicted_probs)
  • Each actual label is either 0 or 1
  • 0.0 <= each predicted probability <= 1.0
  • enforce_monotonic is either True or False

Examples

Input: ([1, 0, 1, 0], [0.9, 0.8, 0.8, 0.1], False)

Expected Output: [[0.9, 1.0, 0.5, 0.666667], [0.8, 0.666667, 1.0, 0.8], [0.1, 0.5, 1.0, 0.666667]]

Explanation: The two rows with score 0.8 are added together when the threshold reaches 0.8.

Input: ([0, 1, 1], [0.9, 0.8, 0.7], True)

Expected Output: [[0.9, 0.666667, 0.0, 0.0], [0.8, 0.666667, 0.5, 0.571429], [0.7, 0.666667, 1.0, 0.8]]

Explanation: Raw precision would rise from 0 to 0.5 to 0.666667; the monotonic envelope replaces earlier precision values with the best later precision.

Input: ([0, 0], [0.7, 0.2], False)

Expected Output: [[0.7, 0.0, 0.0, 0.0], [0.2, 0.0, 0.0, 0.0]]

Explanation: There are no actual positives, so recall and F1 are defined as 0.0.

Input: ([], [], False)

Expected Output: []

Explanation: Empty input has no thresholds.

Input: ([1, 0, 1], [0.5, 0.5, 0.5], False)

Expected Output: [[0.5, 0.666667, 1.0, 0.8]]

Explanation: All examples have the same score, so there is exactly one threshold row.

Hints

  1. Sort unique scores in descending order and update TP/FP counts once per score group.
  2. For the monotonic precision envelope, scan precision values from right to left and keep the maximum precision seen so far.

Part 3: Efficient AUPRC and Calibration-Ranking Impact

Compute the area under the precision-recall curve using a threshold-grouped average precision definition. You are also given a second set of scores, representing calibrated scores. Return both AUPRC values and whether calibration preserved the complete score ordering and tie structure. A strictly monotonic calibration transformation should preserve the PR curve, while a transformation that changes ordering or ties can change it.

Constraints

  • 0 <= len(actual_labels) <= 100000
  • len(actual_labels) == len(raw_scores) == len(calibrated_scores)
  • Each actual label is either 0 or 1
  • Scores are finite floats
  • Tied scores must be treated as one threshold group

Examples

Input: ([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1], [0.99, 0.64, 0.16, 0.01])

Expected Output: [0.833333, 0.833333, 1.0]

Explanation: The calibrated scores are a monotonic transformation of the raw scores, so the PR curve is unchanged.

Input: ([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1], [0.6, 0.9, 0.5, 0.1])

Expected Output: [0.833333, 0.583333, 0.0]

Explanation: The calibrated scores move a negative example above a positive one, changing the ranking and lowering AUPRC.

Input: ([1, 0, 1], [0.5, 0.5, 0.1], [0.7, 0.7, 0.2])

Expected Output: [0.583333, 0.583333, 1.0]

Explanation: The tie structure is preserved, so both grouped PR computations match.

Input: ([0, 0], [0.9, 0.1], [0.2, 0.8])

Expected Output: [0.0, 0.0, 0.0]

Explanation: With no positives, AUPRC is defined as 0.0; the score ordering is not preserved.

Input: ([], [], [])

Expected Output: [0.0, 0.0, 1.0]

Explanation: Empty score lists have no ordering conflict.

Hints

  1. Sort examples by score descending and process equal-score groups together.
  2. AUPRC depends on ranking, not on the numeric calibration values themselves, unless calibration changes the ranking or tie structure.

Part 4: Choose an Operating Threshold with Costs and Alert Volume Limits

You are deploying a binary classifier where false positives and false negatives have different costs, and only a limited number of positive alerts can be reviewed. Given labels, predicted probabilities, asymmetric costs, and a maximum alert volume, choose the threshold that minimizes empirical cost while satisfying the volume constraint. A threshold includes all rows with predicted probability greater than or equal to it, so tied scores must be selected together.

Constraints

  • 0 <= len(actual_labels) <= 100000
  • len(actual_labels) == len(predicted_probs)
  • Each actual label is either 0 or 1
  • 0.0 <= each predicted probability <= 1.0
  • 0.0 <= cost_fp, cost_fn <= 1000000.0
  • 0 <= max_alerts <= 100000

Examples

Input: ([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.3], 1.0, 5.0, 4)

Expected Output: [0.4, 2, 1, 0, 1, 3, 1.0]

Explanation: Selecting down to 0.4 catches both positives with one false positive, giving cost 1.

Input: ([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.3], 1.0, 5.0, 2)

Expected Output: [0.9, 1, 0, 1, 2, 1, 5.0]

Explanation: The cost-minimizing unconstrained threshold would send 3 alerts, so it is not allowed.

Input: ([1, 0, 1], [0.8, 0.8, 0.2], 1.0, 2.0, 1)

Expected Output: [-1.0, 0, 0, 2, 1, 0, 4.0]

Explanation: The top score is tied across two rows, so selecting threshold 0.8 would exceed max_alerts=1.

Input: ([1, 0], [0.9, 0.1], 1.0, 10.0, 0)

Expected Output: [-1.0, 0, 0, 1, 1, 0, 10.0]

Explanation: No alerts are allowed, so the only feasible policy predicts everything negative.

Input: ([], [], 1.0, 5.0, 10)

Expected Output: [-1.0, 0, 0, 0, 0, 0, 0.0]

Explanation: The empty-input edge case has zero cost and zero confusion-matrix counts.

Hints

  1. Sort unique scores descending and update TP/FP counts by score group; FN and TN can be derived from totals.
  2. Always include the no-alert policy, because volume constraints or high false-positive cost can make it optimal.
Last updated: Jun 22, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Infection Spread on a Grid (Cellular Automaton) - Google (hard)
  • Boolean Expression Tree with Leaf Flips - Google (medium)
  • Streaming Points: Remove Any Pair Within a Distance - Google (medium)
  • Most Active Users in a Live Communication Stream - Google (medium)
  • Solve Rooms and Top-K Streams - Google (medium)