How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a Medium difficulty Coding & Algorithms question, commonly asked during Technical Screen rounds at Google.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Google during technical interviews.

Compute precision–recall curve on imbalanced data

Quick Overview

This question evaluates a data scientist's competency in model evaluation for imbalanced binary classification, covering understanding of precision–recall dynamics, calibration, AUPRC, cost‑sensitive metrics, and practical implementation concerns like computing thresholds and handling ties and numerical edge cases; domain: Coding & Algorithms and machine learning model evaluation. It is commonly asked in technical interviews because it assesses both conceptual understanding of evaluation trade‑offs and the practical application of efficient, robust metric computation and threshold selection under asymmetric costs and operational constraints.

Compute precision–recall curve on imbalanced data

Company: Google

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Technical Screen

You receive a CSV with columns: actual_label ∈ {0,1} and predicted_prob ∈ [0,1]; the positive class rate is ≈5%. a) Which evaluation metrics would you prioritize and why (PR curve/AUPRC, calibration, cost‑sensitive metrics; pitfalls of accuracy/ROC in heavy imbalance)? b) Write a Python function that returns thresholds, precision, recall, and F1 across all unique predicted_prob values; handle ties, empty denominators, and enforce monotonic precision if needed. c) Compute AUPRC efficiently and discuss the effect of score calibration on the PR curve. d) Describe how you would pick an operating threshold given asymmetric costs and volume constraints.

Quick Answer: This question evaluates a data scientist's competency in model evaluation for imbalanced binary classification, covering understanding of precision–recall dynamics, calibration, AUPRC, cost‑sensitive metrics, and practical implementation concerns like computing thresholds and handling ties and numerical edge cases; domain: Coding & Algorithms and machine learning model evaluation. It is commonly asked in technical interviews because it assesses both conceptual understanding of evaluation trade‑offs and the practical application of efficient, robust metric computation and threshold selection under asymmetric costs and operational constraints.

Part 1: Metric Priority Report for Imbalanced Binary Classification

You are given true binary labels and predicted probabilities for a binary classifier. Implement a deterministic evaluator that summarizes class imbalance, compares threshold accuracy against the majority-class baseline, and recommends which metric families should be prioritized. The report should also flag common pitfalls such as relying on accuracy under heavy imbalance or trusting ROC-style summaries when precision is operationally important.

Constraints

0 <= len(actual_labels) <= 100000
len(actual_labels) == len(predicted_probs)
Each actual label is either 0 or 1
0.0 <= each predicted probability <= 1.0
0.0 <= threshold <= 1.0

Examples

Input: ([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.4], 0.5)

Expected Output: ['positive_rate=0.050000', 'majority_accuracy=0.950000', 'threshold_accuracy=0.950000', 'balanced_accuracy=0.500000', 'primary=AUPRC|precision_recall_curve|calibration|cost_sensitive_utility', 'pitfalls=accuracy_can_match_majority_baseline|roc_auc_can_hide_low_precision_under_imbalance|model_accuracy_not_better_than_majority_baseline']

Explanation: The dataset has only 5% positives. Predicting all negatives achieves 95% accuracy, so accuracy is misleading.

Input: ([0, 1, 0, 1], [0.1, 0.8, 0.6, 0.7], 0.5)

Expected Output: ['positive_rate=0.500000', 'majority_accuracy=0.500000', 'threshold_accuracy=0.750000', 'balanced_accuracy=0.750000', 'primary=ROC_AUC|AUPRC|calibration|cost_sensitive_utility', 'pitfalls=none']

Explanation: The data is balanced and threshold accuracy is better than the majority baseline.

Input: ([], [], 0.5)

Expected Output: ['positive_rate=0.000000', 'majority_accuracy=0.000000', 'threshold_accuracy=0.000000', 'balanced_accuracy=0.000000', 'primary=none', 'pitfalls=empty_dataset']

Explanation: The empty-input edge case cannot support metric prioritization.

Input: ([0, 0, 0], [0.2, 0.7, 0.1], 0.5)

Expected Output: ['positive_rate=0.000000', 'majority_accuracy=1.000000', 'threshold_accuracy=0.666667', 'balanced_accuracy=0.666667', 'primary=calibration|class_distribution_check', 'pitfalls=single_class_dataset_auc_undefined|accuracy_can_match_majority_baseline|model_accuracy_not_better_than_majority_baseline']

Explanation: A single-class dataset makes ROC/PR AUC undefined as a model-comparison signal.

Hints

The majority-class baseline is max(number_of_positives, number_of_negatives) / n.
When one class is very rare, accuracy can be high even for a useless classifier; prioritize PR/AUPRC-style metrics and cost-sensitive evaluation.

Part 2: Compute Precision, Recall, and F1 at Every Unique Probability Threshold

Given true binary labels and predicted probabilities, compute the precision, recall, and F1 score for every unique predicted probability used as a threshold. For a threshold t, predict positive for every row whose predicted probability is at least t. Equal scores must be handled as a group, so tied probabilities enter the confusion counts together. Optionally, return an interpolated monotonic precision envelope.

Constraints

0 <= len(actual_labels) <= 100000
len(actual_labels) == len(predicted_probs)
Each actual label is either 0 or 1
0.0 <= each predicted probability <= 1.0
enforce_monotonic is either True or False

Examples

Input: ([1, 0, 1, 0], [0.9, 0.8, 0.8, 0.1], False)

Expected Output: [[0.9, 1.0, 0.5, 0.666667], [0.8, 0.666667, 1.0, 0.8], [0.1, 0.5, 1.0, 0.666667]]

Explanation: The two rows with score 0.8 are added together when the threshold reaches 0.8.

Input: ([0, 1, 1], [0.9, 0.8, 0.7], True)

Expected Output: [[0.9, 0.666667, 0.0, 0.0], [0.8, 0.666667, 0.5, 0.571429], [0.7, 0.666667, 1.0, 0.8]]

Explanation: Raw precision would rise from 0 to 0.5 to 0.666667; the monotonic envelope replaces earlier precision values with the best later precision.

Input: ([0, 0], [0.7, 0.2], False)

Expected Output: [[0.7, 0.0, 0.0, 0.0], [0.2, 0.0, 0.0, 0.0]]

Explanation: There are no actual positives, so recall and F1 are defined as 0.0.

Input: ([], [], False)

Expected Output: []

Explanation: Empty input has no thresholds.

Input: ([1, 0, 1], [0.5, 0.5, 0.5], False)

Expected Output: [[0.5, 0.666667, 1.0, 0.8]]

Explanation: All examples have the same score, so there is exactly one threshold row.

Hints

Sort unique scores in descending order and update TP/FP counts once per score group.
For the monotonic precision envelope, scan precision values from right to left and keep the maximum precision seen so far.

Part 3: Efficient AUPRC and Calibration-Ranking Impact

Compute the area under the precision-recall curve using a threshold-grouped average precision definition. You are also given a second set of scores, representing calibrated scores. Return both AUPRC values and whether calibration preserved the complete score ordering and tie structure. A strictly monotonic calibration transformation should preserve the PR curve, while a transformation that changes ordering or ties can change it.

Constraints

0 <= len(actual_labels) <= 100000
len(actual_labels) == len(raw_scores) == len(calibrated_scores)
Each actual label is either 0 or 1
Scores are finite floats
Tied scores must be treated as one threshold group

Examples

Input: ([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1], [0.99, 0.64, 0.16, 0.01])

Expected Output: [0.833333, 0.833333, 1.0]

Explanation: The calibrated scores are a monotonic transformation of the raw scores, so the PR curve is unchanged.

Input: ([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1], [0.6, 0.9, 0.5, 0.1])

Expected Output: [0.833333, 0.583333, 0.0]

Explanation: The calibrated scores move a negative example above a positive one, changing the ranking and lowering AUPRC.

Input: ([1, 0, 1], [0.5, 0.5, 0.1], [0.7, 0.7, 0.2])

Expected Output: [0.583333, 0.583333, 1.0]

Explanation: The tie structure is preserved, so both grouped PR computations match.

Input: ([0, 0], [0.9, 0.1], [0.2, 0.8])

Expected Output: [0.0, 0.0, 0.0]

Explanation: With no positives, AUPRC is defined as 0.0; the score ordering is not preserved.

Input: ([], [], [])

Expected Output: [0.0, 0.0, 1.0]

Explanation: Empty score lists have no ordering conflict.

Hints

Sort examples by score descending and process equal-score groups together.
AUPRC depends on ranking, not on the numeric calibration values themselves, unless calibration changes the ranking or tie structure.

Part 4: Choose an Operating Threshold with Costs and Alert Volume Limits

You are deploying a binary classifier where false positives and false negatives have different costs, and only a limited number of positive alerts can be reviewed. Given labels, predicted probabilities, asymmetric costs, and a maximum alert volume, choose the threshold that minimizes empirical cost while satisfying the volume constraint. A threshold includes all rows with predicted probability greater than or equal to it, so tied scores must be selected together.

Constraints

0 <= len(actual_labels) <= 100000
len(actual_labels) == len(predicted_probs)
Each actual label is either 0 or 1
0.0 <= each predicted probability <= 1.0
0.0 <= cost_fp, cost_fn <= 1000000.0
0 <= max_alerts <= 100000

Examples

Input: ([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.3], 1.0, 5.0, 4)

Expected Output: [0.4, 2, 1, 0, 1, 3, 1.0]

Explanation: Selecting down to 0.4 catches both positives with one false positive, giving cost 1.

Input: ([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.3], 1.0, 5.0, 2)

Expected Output: [0.9, 1, 0, 1, 2, 1, 5.0]

Explanation: The cost-minimizing unconstrained threshold would send 3 alerts, so it is not allowed.

Input: ([1, 0, 1], [0.8, 0.8, 0.2], 1.0, 2.0, 1)

Expected Output: [-1.0, 0, 0, 2, 1, 0, 4.0]

Explanation: The top score is tied across two rows, so selecting threshold 0.8 would exceed max_alerts=1.

Input: ([1, 0], [0.9, 0.1], 1.0, 10.0, 0)

Expected Output: [-1.0, 0, 0, 1, 1, 0, 10.0]

Explanation: No alerts are allowed, so the only feasible policy predicts everything negative.

Input: ([], [], 1.0, 5.0, 10)

Expected Output: [-1.0, 0, 0, 0, 0, 0, 0.0]

Explanation: The empty-input edge case has zero cost and zero confusion-matrix counts.

Hints

Sort unique scores descending and update TP/FP counts by score group; FN and TN can be derived from totals.
Always include the no-alert policy, because volume constraints or high false-positive cost can make it optimal.

Quick Overview

Compute precision–recall curve on imbalanced data

Company: Google

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Technical Screen

Part 1: Metric Priority Report for Imbalanced Binary Classification

Constraints

0 <= len(actual_labels) <= 100000
len(actual_labels) == len(predicted_probs)
Each actual label is either 0 or 1
0.0 <= each predicted probability <= 1.0
0.0 <= threshold <= 1.0

Examples

Input: ([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.4], 0.5)

Explanation: The dataset has only 5% positives. Predicting all negatives achieves 95% accuracy, so accuracy is misleading.

Input: ([0, 1, 0, 1], [0.1, 0.8, 0.6, 0.7], 0.5)

Explanation: The data is balanced and threshold accuracy is better than the majority baseline.

Input: ([], [], 0.5)

Expected Output: ['positive_rate=0.000000', 'majority_accuracy=0.000000', 'threshold_accuracy=0.000000', 'balanced_accuracy=0.000000', 'primary=none', 'pitfalls=empty_dataset']

Explanation: The empty-input edge case cannot support metric prioritization.

Input: ([0, 0, 0], [0.2, 0.7, 0.1], 0.5)

Explanation: A single-class dataset makes ROC/PR AUC undefined as a model-comparison signal.

Hints

The majority-class baseline is max(number_of_positives, number_of_negatives) / n.
When one class is very rare, accuracy can be high even for a useless classifier; prioritize PR/AUPRC-style metrics and cost-sensitive evaluation.

Part 2: Compute Precision, Recall, and F1 at Every Unique Probability Threshold

Constraints

0 <= len(actual_labels) <= 100000
len(actual_labels) == len(predicted_probs)
Each actual label is either 0 or 1
0.0 <= each predicted probability <= 1.0
enforce_monotonic is either True or False

Examples

Input: ([1, 0, 1, 0], [0.9, 0.8, 0.8, 0.1], False)

Expected Output: [[0.9, 1.0, 0.5, 0.666667], [0.8, 0.666667, 1.0, 0.8], [0.1, 0.5, 1.0, 0.666667]]

Explanation: The two rows with score 0.8 are added together when the threshold reaches 0.8.

Input: ([0, 1, 1], [0.9, 0.8, 0.7], True)

Expected Output: [[0.9, 0.666667, 0.0, 0.0], [0.8, 0.666667, 0.5, 0.571429], [0.7, 0.666667, 1.0, 0.8]]

Explanation: Raw precision would rise from 0 to 0.5 to 0.666667; the monotonic envelope replaces earlier precision values with the best later precision.

Input: ([0, 0], [0.7, 0.2], False)

Expected Output: [[0.7, 0.0, 0.0, 0.0], [0.2, 0.0, 0.0, 0.0]]

Explanation: There are no actual positives, so recall and F1 are defined as 0.0.

Input: ([], [], False)

Expected Output: []

Explanation: Empty input has no thresholds.

Input: ([1, 0, 1], [0.5, 0.5, 0.5], False)

Expected Output: [[0.5, 0.666667, 1.0, 0.8]]

Explanation: All examples have the same score, so there is exactly one threshold row.

Hints

Sort unique scores in descending order and update TP/FP counts once per score group.
For the monotonic precision envelope, scan precision values from right to left and keep the maximum precision seen so far.

Part 3: Efficient AUPRC and Calibration-Ranking Impact

Constraints

0 <= len(actual_labels) <= 100000
len(actual_labels) == len(raw_scores) == len(calibrated_scores)
Each actual label is either 0 or 1
Scores are finite floats
Tied scores must be treated as one threshold group

Examples

Input: ([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1], [0.99, 0.64, 0.16, 0.01])

Expected Output: [0.833333, 0.833333, 1.0]

Explanation: The calibrated scores are a monotonic transformation of the raw scores, so the PR curve is unchanged.

Input: ([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1], [0.6, 0.9, 0.5, 0.1])

Expected Output: [0.833333, 0.583333, 0.0]

Explanation: The calibrated scores move a negative example above a positive one, changing the ranking and lowering AUPRC.

Input: ([1, 0, 1], [0.5, 0.5, 0.1], [0.7, 0.7, 0.2])

Expected Output: [0.583333, 0.583333, 1.0]

Explanation: The tie structure is preserved, so both grouped PR computations match.

Input: ([0, 0], [0.9, 0.1], [0.2, 0.8])

Expected Output: [0.0, 0.0, 0.0]

Explanation: With no positives, AUPRC is defined as 0.0; the score ordering is not preserved.

Input: ([], [], [])

Expected Output: [0.0, 0.0, 1.0]

Explanation: Empty score lists have no ordering conflict.

Hints

Sort examples by score descending and process equal-score groups together.
AUPRC depends on ranking, not on the numeric calibration values themselves, unless calibration changes the ranking or tie structure.

Part 4: Choose an Operating Threshold with Costs and Alert Volume Limits

Constraints

0 <= len(actual_labels) <= 100000
len(actual_labels) == len(predicted_probs)
Each actual label is either 0 or 1
0.0 <= each predicted probability <= 1.0
0.0 <= cost_fp, cost_fn <= 1000000.0
0 <= max_alerts <= 100000

Examples

Input: ([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.3], 1.0, 5.0, 4)

Expected Output: [0.4, 2, 1, 0, 1, 3, 1.0]

Explanation: Selecting down to 0.4 catches both positives with one false positive, giving cost 1.

Input: ([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.3], 1.0, 5.0, 2)

Expected Output: [0.9, 1, 0, 1, 2, 1, 5.0]

Explanation: The cost-minimizing unconstrained threshold would send 3 alerts, so it is not allowed.

Input: ([1, 0, 1], [0.8, 0.8, 0.2], 1.0, 2.0, 1)

Expected Output: [-1.0, 0, 0, 2, 1, 0, 4.0]

Explanation: The top score is tied across two rows, so selecting threshold 0.8 would exceed max_alerts=1.

Input: ([1, 0], [0.9, 0.1], 1.0, 10.0, 0)

Expected Output: [-1.0, 0, 0, 1, 1, 0, 10.0]

Explanation: No alerts are allowed, so the only feasible policy predicts everything negative.

Input: ([], [], 1.0, 5.0, 10)

Expected Output: [-1.0, 0, 0, 0, 0, 0, 0.0]

Explanation: The empty-input edge case has zero cost and zero confusion-matrix counts.

Hints

Sort unique scores descending and update TP/FP counts by score group; FN and TN can be derived from totals.
Always include the no-alert policy, because volume constraints or high false-positive cost can make it optimal.