Healthcare Fraud, Waste, and Abuse Risk Scoring

What's being tested

Interviewers are probing your ability to design, evaluate, and operationalize a risk-scoring model for detecting healthcare fraud, waste, and abuse (FWA) at scale. They expect fluency in handling severely imbalanced labels, working with noisy and delayed investigation feedback, calibrating scores to business value, and designing experiments and metrics that align model outputs with limited investigator capacity and ROI. At CVS Health this maps to measurable reductions in improper payments and higher investigator productivity rather than just classifier accuracy.

Core knowledge

Class imbalance: FWA prevalence is typically well below 1%, so a standard classifier's loss is dominated by negatives. Use resampling or class weighting (gradient-boosted trees such as XGBoost and LightGBM expose a scale_pos_weight parameter for this), keeping in mind that both shift predicted probabilities and call for recalibration afterwards. Evaluate with precision-oriented metrics, not only ROC AUC.
Evaluation metrics aligned to business: prefer precision@k, lift, and expected value per alert, $EV(k)=\text{precision@k}\times B - (1-\text{precision@k})\times C$, where $B$ is the benefit per true detection and $C$ is the cost per false alert. Report recall only in the context of investigator capacity.
Ranking vs classification: operational systems often consume a prioritized list; optimize ranking metrics (precision@k, NDCG) and build well-calibrated scores so thresholds correspond to expected ROI.
Labeling bias & censoring: investigation labels are the outcome of a selection process, and unreviewed claims are unlabeled rather than negative. Use positive-unlabeled (PU) learning, inverse probability weighting (IPW) to correct selection bias, or randomized auditing to estimate true precision.
Temporal validation & leakage: always use time-based splits to avoid leakage, training on older claims and validating on later windows. Watch for features that implicitly encode future outcomes (investigation results, paid-amount adjustments).
Weak/noisy labels: labels may be noisy or delayed. Model label noise with noisy-label methods, treat investigations as a noisy oracle, and consider distant supervision or label propagation from provider histories.
Utility-driven thresholding: select the operating point by maximizing expected utility under an investigator capacity constraint. Either solve the constrained optimization (maximize EV subject to alerts ≤ capacity) or compute the marginal benefit of each additional alert and stop once it falls below zero or capacity is reached.
Monitoring & drift: monitor the score distribution, precision@k over time, the label-delay distribution, and feature drift; alert on rapid drops in lift or sudden calibration shifts. Use holdout sampling to re-estimate true precision periodically.
Explainability & triage signals: produce SHAP or rule-based signals for investigator triage; prioritize features that map to actionable evidence (billing codes, provider behavior).
Experiment design: measure causal impact using randomized assignment at the investigator or queue level (to avoid interference between units), or a stepped-wedge design; the key metric should be net recovered amount per investigator-hour rather than classifier metrics.
Sample size & power: when the expected precision improvement is small but each detection is high-value, compute power for detecting changes in expected recovered dollars; for rare events, randomized audits on the order of 1,000 to 5,000 claims may be required to estimate baseline precision with an acceptably narrow confidence interval.
Scalability constraints: complex models are acceptable, but training on >10M claims may require distributed training or sampling. Tree ensembles scale well, but make sure precomputed features can be served within the latency budget of the backing store (Postgres or a feature store) for real-time scoring.
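
The expected-value formula and the capacity-constrained operating point described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the function names, the toy labels and scores, and the values B = 1000 and C = 200 are all assumptions made up for the example. It picks k from labeled validation data; with well-calibrated scores the same logic could be applied to predicted probabilities instead.

```python
from itertools import accumulate

def rank_labels(y_true, scores):
    """Return labels reordered from highest to lowest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [y_true[i] for i in order]

def precision_at_k(y_true, scores, k):
    """Fraction of confirmed FWA among the k highest-scored claims."""
    top = rank_labels(y_true, scores)[:k]
    return sum(top) / len(top)

def ev_per_alert(precision, benefit, cost):
    """EV(k) = precision@k * B - (1 - precision@k) * C."""
    return precision * benefit - (1.0 - precision) * cost

def best_k_under_capacity(y_true, scores, benefit, cost, capacity):
    """Return (k, total_value) maximizing total value subject to k <= capacity."""
    top = rank_labels(y_true, scores)[:capacity]
    # Marginal value of each alert: +B for a true detection, -C for a false alert.
    marginal = [benefit if y == 1 else -cost for y in top]
    totals = list(accumulate(marginal))
    if not totals or max(totals) <= 0:
        return 0, 0.0  # No alert list of any length is worth working.
    k = totals.index(max(totals)) + 1
    return k, float(totals[k - 1])

# Toy validation set (hypothetical): 1 = confirmed FWA, 0 = cleared.
y = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
s = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]

k, total = best_k_under_capacity(y, s, benefit=1000, cost=200, capacity=6)
p_at_k = precision_at_k(y, s, k)
print(k, total, p_at_k, ev_per_alert(p_at_k, 1000, 200))
```

Note that total value at k equals k times EV(k), so maximizing the cumulative sum is the same constrained problem stated in terms of marginal benefit per additional alert.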

Worked example — "Design a claims-level FWA risk scoring model"

First 30 seconds: clarify the unit of prediction (claim, line item, or provider), the label definition (what counts as confirmed FWA), investigation turnaround and capacity, and the business values B and C (benefit per true detection, cost per false alert). Skeleton answer pillars: (1) Label strategy: use historical confirmed investigations plus randomized audit labels to estimate true positives; (2) Features: claim metadata, provider history, peer-group deviation, network features; (3) Model & objective: gradient-boosted trees optimized for ranking, with calibrated probabilities; (4) Evaluation & thresholding: precision@k, the EV curve, time-based validation; (5) Monitoring & feedback loop: periodic randomized audits and a retraining cadence. Key tradeoff to call out: prioritize precision at the top of the list because investigation cost is high; giving up some recall is usually worth it when it raises ROI. If given more time: quantify B and C, run a power calculation to size randomized audits, and prototype an IPW estimator to correct selection bias from investigator-driven labels.
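
The audit-sizing step can be illustrated with a simpler relative of a full power calculation: choosing the number of randomly audited claims so that the baseline rate is estimated within a target margin. The sketch below uses the normal-approximation sample-size formula for planning and a Wilson score interval for reporting. The function names and the planning inputs (a 2% expected rate, a margin of plus or minus 0.5 percentage points, 60 hits in 3,000 audits) are hypothetical.

```python
import math
from statistics import NormalDist

def audit_sample_size(expected_rate, margin, confidence=0.95):
    """Claims to audit so the rate estimate has roughly +/- margin half-width.

    Normal approximation: n = z^2 * p * (1 - p) / margin^2.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil(z ** 2 * expected_rate * (1.0 - expected_rate) / margin ** 2)

def wilson_interval(hits, n, confidence=0.95):
    """Wilson score interval for a proportion; better behaved than the
    plain normal interval when the rate is close to zero."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p = hits / n
    denom = 1.0 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

# Hypothetical planning inputs, not figures from any real audit.
n_audit = audit_sample_size(expected_rate=0.02, margin=0.005)
low, high = wilson_interval(hits=60, n=3000)
print(n_audit, round(low, 4), round(high, 4))
```

With these inputs the required sample lands at roughly three thousand claims. Sizing an experiment on recovered dollars would additionally need a variance estimate for dollar outcomes, which this sketch does not cover.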

A second angle — "Evaluating model when labels are scarce and biased"

When investigator-reviewed labels are scarce and non-random, the focus shifts to label estimation and unbiased evaluation. Use PU learning to train with positives and unlabeled claims, and deploy randomized auditing: sample unreviewed claims to obtain an unbiased estimate of precision. Alternatively, build a propensity model that predicts which claims were selected for review, then apply IPW to reweight observed labels when estimating population metrics. This framing emphasizes rigorous offline performance estimation and uncertainty quantification (confidence intervals for precision@k) rather than just improving cross-validated AUC.
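
The IPW reweighting can be sketched as a self-normalized (Hájek) estimator over the reviewed claims. The example assumes the review propensities are already known; in practice they would come from a fitted propensity model and carry their own error. The function name, the clipping floor, and the two-stratum toy data are assumptions chosen so the arithmetic is easy to follow.

```python
def ipw_positive_rate(labels, propensities, floor=0.01):
    """Self-normalized IPW estimate of the population FWA rate.

    labels: 0/1 outcomes for reviewed claims only.
    propensities: probability that each of those claims was selected for review.
    floor: lower clip on propensities to keep weights from exploding.
    """
    weights = [1.0 / max(e, floor) for e in propensities]
    weighted_hits = sum(w * y for w, y in zip(weights, labels))
    return weighted_hits / sum(weights)

# Toy population (hypothetical): 100 high-risk claims with a 50% FWA rate,
# reviewed with probability 0.8, and 900 low-risk claims with a 10% FWA rate,
# reviewed with probability 0.1. The true overall rate is 140 / 1000 = 0.14.
labels = [1] * 40 + [0] * 40 + [1] * 9 + [0] * 81
propensities = [0.8] * 80 + [0.1] * 90

naive_rate = sum(labels) / len(labels)               # biased toward reviewed claims
ipw_rate = ipw_positive_rate(labels, propensities)   # reweighted to the population
print(round(naive_rate, 3), round(ipw_rate, 3))
```

The naive rate overstates prevalence because investigators preferentially review high-risk claims; the weighted estimate recovers the population rate in this constructed example. The same weighting can be restricted to reviewed claims within the top k to estimate precision@k.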

Common pitfalls

Pitfall: Evaluating with random cross-validation and reporting a high AUC. Random folds let later claims inform predictions on earlier ones, so the estimate is optimistic; time-split evaluation with label-delay handling is required.
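
A time-split that also respects label delay can be sketched as follows. The assumption here is a fixed investigation delay, so a claim's label counts as mature once that many days have passed since the service date; real delays vary and would be better modeled as a distribution. The field name service_date, the cutoff dates, and the 90-day delay are hypothetical.

```python
from datetime import date, timedelta

def time_split(claims, train_cutoff, eval_cutoff, label_delay_days):
    """Split claims by service date, keeping only claims with mature labels.

    Train: label was available by train_cutoff.
    Valid: serviced after train_cutoff, label available by eval_cutoff.
    """
    delay = timedelta(days=label_delay_days)
    train, valid = [], []
    for claim in claims:
        serviced = claim["service_date"]
        label_ready = serviced + delay
        if label_ready <= train_cutoff:
            train.append(claim)
        elif serviced > train_cutoff and label_ready <= eval_cutoff:
            valid.append(claim)
        # Otherwise the label was not yet mature at the relevant cutoff, so
        # the claim is dropped rather than treated as a negative.
    return train, valid

# Hypothetical claims; only the service date matters for the split.
claims = [
    {"id": 1, "service_date": date(2023, 1, 10)},
    {"id": 2, "service_date": date(2023, 5, 20)},   # label immature at train cutoff
    {"id": 3, "service_date": date(2023, 7, 5)},
    {"id": 4, "service_date": date(2023, 11, 30)},  # label immature at eval cutoff
]
train, valid = time_split(
    claims,
    train_cutoff=date(2023, 6, 30),
    eval_cutoff=date(2023, 12, 31),
    label_delay_days=90,
)
print([c["id"] for c in train], [c["id"] for c in valid])
```

Dropping immature claims trades data volume for label quality; treating them as negatives instead would bias the model toward missing recent fraud patterns.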

Pitfall: Optimizing for recall or AUC without mapping to investigator capacity. A high-recall model may flood teams with low-value alerts and reduce net recovered dollars.

Pitfall: Ignoring label-selection bias. Treating investigated labels as ground truth without correction leads to biased precision estimates and misguided thresholding.

Connections

Interviewers may pivot to experiment design (how to measure the causal impact of alerts), adversarial robustness (providers gaming the model), or operational ML (feature-store and scoring-latency tradeoffs). Be prepared to translate model outputs into measurable operational metrics such as recovered dollars per investigator-hour.