Compute precision–recall curve on imbalanced data
Company: Google
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Quick Answer: This question evaluates a data scientist's competency in model evaluation for imbalanced binary classification, covering understanding of precision–recall dynamics, calibration, AUPRC, cost‑sensitive metrics, and practical implementation concerns like computing thresholds and handling ties and numerical edge cases; domain: Coding & Algorithms and machine learning model evaluation. It is commonly asked in technical interviews because it assesses both conceptual understanding of evaluation trade‑offs and the practical application of efficient, robust metric computation and threshold selection under asymmetric costs and operational constraints.
Part 1: Metric Priority Report for Imbalanced Binary Classification
Constraints
- 0 <= len(actual_labels) <= 100000
- len(actual_labels) == len(predicted_probs)
- Each actual label is either 0 or 1
- 0.0 <= each predicted probability <= 1.0
- 0.0 <= threshold <= 1.0
Examples
Input: ([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.4], 0.5)
Expected Output: ['positive_rate=0.050000', 'majority_accuracy=0.950000', 'threshold_accuracy=0.950000', 'balanced_accuracy=0.500000', 'primary=AUPRC|precision_recall_curve|calibration|cost_sensitive_utility', 'pitfalls=accuracy_can_match_majority_baseline|roc_auc_can_hide_low_precision_under_imbalance|model_accuracy_not_better_than_majority_baseline']
Explanation: The dataset has only 5% positives. Predicting all negatives achieves 95% accuracy, so accuracy is misleading.
Input: ([0, 1, 0, 1], [0.1, 0.8, 0.6, 0.7], 0.5)
Expected Output: ['positive_rate=0.500000', 'majority_accuracy=0.500000', 'threshold_accuracy=0.750000', 'balanced_accuracy=0.750000', 'primary=ROC_AUC|AUPRC|calibration|cost_sensitive_utility', 'pitfalls=none']
Explanation: The data is balanced and threshold accuracy is better than the majority baseline.
Input: ([], [], 0.5)
Expected Output: ['positive_rate=0.000000', 'majority_accuracy=0.000000', 'threshold_accuracy=0.000000', 'balanced_accuracy=0.000000', 'primary=none', 'pitfalls=empty_dataset']
Explanation: The empty-input edge case cannot support metric prioritization.
Input: ([0, 0, 0], [0.2, 0.7, 0.1], 0.5)
Expected Output: ['positive_rate=0.000000', 'majority_accuracy=1.000000', 'threshold_accuracy=0.666667', 'balanced_accuracy=0.666667', 'primary=calibration|class_distribution_check', 'pitfalls=single_class_dataset_auc_undefined|accuracy_can_match_majority_baseline|model_accuracy_not_better_than_majority_baseline']
Explanation: A single-class dataset makes ROC/PR AUC undefined as a model-comparison signal.
Hints
- The majority-class baseline is max(number_of_positives, number_of_negatives) / n.
- When one class is very rare, accuracy can be high even for a useless classifier; prioritize PR/AUPRC-style metrics and cost-sensitive evaluation.
Part 2: Compute Precision, Recall, and F1 at Every Unique Probability Threshold
Constraints
- 0 <= len(actual_labels) <= 100000
- len(actual_labels) == len(predicted_probs)
- Each actual label is either 0 or 1
- 0.0 <= each predicted probability <= 1.0
- enforce_monotonic is either True or False
Examples
Input: ([1, 0, 1, 0], [0.9, 0.8, 0.8, 0.1], False)
Expected Output: [[0.9, 1.0, 0.5, 0.666667], [0.8, 0.666667, 1.0, 0.8], [0.1, 0.5, 1.0, 0.666667]]
Explanation: The two rows with score 0.8 are added together when the threshold reaches 0.8.
Input: ([0, 1, 1], [0.9, 0.8, 0.7], True)
Expected Output: [[0.9, 0.666667, 0.0, 0.0], [0.8, 0.666667, 0.5, 0.571429], [0.7, 0.666667, 1.0, 0.8]]
Explanation: Raw precision would rise from 0 to 0.5 to 0.666667; the monotonic envelope replaces earlier precision values with the best later precision.
Input: ([0, 0], [0.7, 0.2], False)
Expected Output: [[0.7, 0.0, 0.0, 0.0], [0.2, 0.0, 0.0, 0.0]]
Explanation: There are no actual positives, so recall and F1 are defined as 0.0.
Input: ([], [], False)
Expected Output: []
Explanation: Empty input has no thresholds.
Input: ([1, 0, 1], [0.5, 0.5, 0.5], False)
Expected Output: [[0.5, 0.666667, 1.0, 0.8]]
Explanation: All examples have the same score, so there is exactly one threshold row.
Hints
- Sort unique scores in descending order and update TP/FP counts once per score group.
- For the monotonic precision envelope, scan precision values from right to left and keep the maximum precision seen so far.
Part 3: Efficient AUPRC and Calibration-Ranking Impact
Constraints
- 0 <= len(actual_labels) <= 100000
- len(actual_labels) == len(raw_scores) == len(calibrated_scores)
- Each actual label is either 0 or 1
- Scores are finite floats
- Tied scores must be treated as one threshold group
Examples
Input: ([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1], [0.99, 0.64, 0.16, 0.01])
Expected Output: [0.833333, 0.833333, 1.0]
Explanation: The calibrated scores are a monotonic transformation of the raw scores, so the PR curve is unchanged.
Input: ([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1], [0.6, 0.9, 0.5, 0.1])
Expected Output: [0.833333, 0.583333, 0.0]
Explanation: The calibrated scores move a negative example above a positive one, changing the ranking and lowering AUPRC.
Input: ([1, 0, 1], [0.5, 0.5, 0.1], [0.7, 0.7, 0.2])
Expected Output: [0.583333, 0.583333, 1.0]
Explanation: The tie structure is preserved, so both grouped PR computations match.
Input: ([0, 0], [0.9, 0.1], [0.2, 0.8])
Expected Output: [0.0, 0.0, 0.0]
Explanation: With no positives, AUPRC is defined as 0.0; the score ordering is not preserved.
Input: ([], [], [])
Expected Output: [0.0, 0.0, 1.0]
Explanation: Empty score lists have no ordering conflict.
Hints
- Sort examples by score descending and process equal-score groups together.
- AUPRC depends on ranking, not on the numeric calibration values themselves, unless calibration changes the ranking or tie structure.
Part 4: Choose an Operating Threshold with Costs and Alert Volume Limits
Constraints
- 0 <= len(actual_labels) <= 100000
- len(actual_labels) == len(predicted_probs)
- Each actual label is either 0 or 1
- 0.0 <= each predicted probability <= 1.0
- 0.0 <= cost_fp, cost_fn <= 1000000.0
- 0 <= max_alerts <= 100000
Examples
Input: ([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.3], 1.0, 5.0, 4)
Expected Output: [0.4, 2, 1, 0, 1, 3, 1.0]
Explanation: Selecting down to 0.4 catches both positives with one false positive, giving cost 1.
Input: ([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.3], 1.0, 5.0, 2)
Expected Output: [0.9, 1, 0, 1, 2, 1, 5.0]
Explanation: The cost-minimizing unconstrained threshold would send 3 alerts, so it is not allowed.
Input: ([1, 0, 1], [0.8, 0.8, 0.2], 1.0, 2.0, 1)
Expected Output: [-1.0, 0, 0, 2, 1, 0, 4.0]
Explanation: The top score is tied across two rows, so selecting threshold 0.8 would exceed max_alerts=1.
Input: ([1, 0], [0.9, 0.1], 1.0, 10.0, 0)
Expected Output: [-1.0, 0, 0, 1, 1, 0, 10.0]
Explanation: No alerts are allowed, so the only feasible policy predicts everything negative.
Input: ([], [], 1.0, 5.0, 10)
Expected Output: [-1.0, 0, 0, 0, 0, 0, 0.0]
Explanation: The empty-input edge case has zero cost and zero confusion-matrix counts.
Hints
- Sort unique scores descending and update TP/FP counts by score group; FN and TN can be derived from totals.
- Always include the no-alert policy, because volume constraints or high false-positive cost can make it optimal.