CVS Health Data Scientist Interview Prep Guide
Everything CVS Health actually asks Data Scientist candidates — concept walkthroughs, worked examples, and the real interview questions, drawn from candidate reports. Free to read.
You're broadly comfortable with SQL/Python/Power BI, so focus most on CVS/pharma audit scenarios, PBM controls, healthcare fraud-risk scoring, CAPA recurrence, supply-chain risk, and SOX-style compliance analytics. Review core SQL and pandas patterns only briefly, and give MMM/PCA a lighter pass, because they are not your stated ramp-up area and your wizard concept ratings default to solid. The CVS Health-specific emphasis is pharmacy-benefit/PBM audit analytics, PHI-aware claims review, FWA detection, payment accuracy, and internal control evidence. With less than a week, budget about 70–75 focused minutes for this screen cheatsheet, then use the remaining time on scenario drills.
Technical Screen — 72 min
Machine Learning
Focus area — Your fraud-risk-scoring focus fits CVS claims analytics: suspicious billing patterns, provider outliers, rare positives, review queues, and precision-oriented thresholds.
What's being tested
Interviewers are probing your ability to design, evaluate, and operationalize a risk-scoring model for detecting Healthcare Fraud, Waste, and Abuse (FWA) at scale. They expect fluency in handling severely imbalanced labels, noisy and delayed investigation feedback, calibration to business value, and experiment/metric design that aligns model outputs to limited investigator capacity and ROI. At CVS Health this maps to measurable reductions in improper payments and higher investigator productivity rather than just classifier accuracy.
Core knowledge
- Class imbalance: FWA prevalence is typically far below 1%, so standard classifiers optimize a loss dominated by negatives. Use resampling, class-weighting, or algorithms robust to imbalance (`XGBoost`, `LightGBM`) and evaluate with precision-oriented metrics, not only `AUC`.
- Evaluation metrics aligned to business: prefer `precision@k`, lift, and expected value per alert, $EV = p \cdot B - (1 - p) \cdot C$, where $p$ is the probability that an alert is a true detection, $B$ is benefit per true detection, and $C$ is cost per false alert; include `Recall` only in context of capacity.
- Ranking vs classification: operational systems often consume a prioritized list; optimize ranking metrics (`precision@k`, `NDCG`) and build well-calibrated scores so thresholds correspond to expected ROI.
- Labeling bias & censoring: investigation labels are the outcome of selection; unreviewed claims are unlabeled. Use positive-unlabeled (PU) learning, inverse probability weighting (IPW) to correct selection bias, or randomized auditing to estimate true precision.
- Temporal validation & leakage: always use time-based splits to avoid leakage, training on older claims and validating on later windows. Watch features that implicitly encode future outcomes (investigation results, paid amount adjustments).
- Weak/noisy labels: labels may be noisy or delayed; model label noise with noisy-label methods, treat investigations as a noisy oracle, and consider distant supervision or label propagation from provider histories.
- Utility-driven thresholding: select the operating point by maximizing expected utility under the investigator capacity constraint; solve the constrained optimization (maximize EV subject to alerts ≤ capacity) or compute marginal benefit per additional alert.
- Monitoring & drift: monitor score distribution, `precision@k` over time, label delay distribution, and feature drift; set alerting for a rapid drop in lift or sudden calibration shifts. Use holdout sampling to re-estimate true precision periodically.
- Explainability & triage signals: produce SHAP or rule-based signals for investigator triage; prioritize features that map to actionable evidence (billing codes, provider behavior).
- Experiment design: measure causal impact using randomized assignment at the investigator/queue level (to avoid interference), or stepped-wedge designs; the key metric should be net recovered amount per investigator-hour rather than classifier metrics.
- Sample size & power: when the expected precision improvement is small but high-value, compute power for detecting changes in expected recovered dollars; for rare events, randomized audits of O(1–5k) claims may be required to estimate baseline precision with an acceptable CI.
- Scalability constraints: model complexity is allowed, but training on >10M claims may require distributed training or sampling; tree ensembles scale well, but ensure feature precomputation fits `Postgres`/feature-store latencies for real-time scoring.
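The `precision@k` and capacity-constrained expected-value ideas in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production scorer: the function names, the toy scores and labels, and the benefit/cost figures (`benefit=1000`, `cost=200`) are all hypothetical, and the scores are assumed to be calibrated probabilities.

```python
from typing import List


def precision_at_k(scores: List[float], labels: List[int], k: int) -> float:
    """Fraction of confirmed positives among the k highest-scored claims."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


def alerts_to_send(probs: List[float], benefit: float, cost: float,
                   capacity: int) -> List[int]:
    """Indices to alert on: highest expected value first, positive EV only,
    and never more than `capacity` alerts."""
    ev = [(p * benefit - (1 - p) * cost, i) for i, p in enumerate(probs)]
    ev.sort(reverse=True)
    return [i for value, i in ev[:capacity] if value > 0]


# Hypothetical calibrated scores and investigation outcomes for six claims.
scores = [0.90, 0.10, 0.75, 0.40, 0.05, 0.60]
labels = [1, 0, 1, 0, 0, 0]

p_at_3 = precision_at_k(scores, labels, k=3)
chosen = alerts_to_send(scores, benefit=1000.0, cost=200.0, capacity=4)
```

With a single benefit and cost for every claim, ranking by expected value is the same as ranking by probability, and the break-even probability is `cost / (benefit + cost)`; the ranking only changes once benefit or cost varies per claim.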
Worked example — "Design a claims-level FWA risk scoring model"
First 30 seconds: clarify the unit of prediction (claim, line-item, provider), label definition (what counts as confirmed FWA), investigation turnaround and capacity, and business costs B and C per true/false alert. Skeleton answer pillars: (1) Label strategy — use historical confirmed investigations plus randomized audit labels to estimate true positives; (2) Features — claim metadata, provider history, peer-group deviation, network features; (3) Model & objective — gradient-boosted trees optimized for ranking and calibrated probabilities; (4) Evaluation & thresholding — precision@k, EV curve, time-based validation; (5) Monitoring & feedback loop — periodic randomized audits and retraining cadence. Key tradeoff to call out: prioritize precision at the top of the list because investigation cost is high; sacrificing some recall increases ROI. If given more time: quantify B and C, run a power calculation to size randomized audits, and prototype an IPW estimator to correct selection bias from investigator-driven labels.
A second angle — "Evaluating model when labels are scarce and biased"
When investigator-reviewed labels are scarce and non-random, the focus shifts to label-estimation and unbiased evaluation. Use PU learning to train with positives and unlabeled claims, and deploy randomized auditing: sample unreviewed claims to obtain an unbiased estimate of precision. Alternatively, build a propensity model that predicts which claims were selected for review, then apply IPW to reweight observed labels when estimating population metrics. This framing emphasizes rigorous offline performance estimation and uncertainty quantification (confidence intervals for precision@k) rather than just improving cross-validated AUC.
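As a rough sketch of the IPW idea described above, the snippet below compares a naive precision estimate with a self-normalized (Hajek-style) inverse-propensity estimate. The propensities are assumed to come from a separately fitted selection model; the data and the function name `ipw_precision` are hypothetical.

```python
def ipw_precision(reviewed):
    """Self-normalized IPW estimate of precision among flagged claims.

    reviewed: list of (label, propensity) pairs for flagged claims that were
    actually investigated; propensity is the estimated probability that the
    claim was selected for review.
    """
    num = sum(label / pi for label, pi in reviewed)
    den = sum(1.0 / pi for _, pi in reviewed)
    return num / den


# Hypothetical data: investigators favoured "obvious" claims (propensity 0.8)
# and rarely reviewed the rest (propensity 0.2).
reviewed = [(1, 0.8), (1, 0.8), (0, 0.8), (1, 0.8), (0, 0.2), (0, 0.2)]

naive = sum(label for label, _ in reviewed) / len(reviewed)
corrected = ipw_precision(reviewed)
```

Here the naive estimate is 0.5 while the reweighted estimate is 0.25, because the rarely reviewed claims each stand in for more of the unreviewed population. Very small propensities make the estimate unstable, which is one reason a randomized audit sample remains useful as a cross-check.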
Common pitfalls
Pitfall: Evaluating with random cross-validation and reporting high `AUC`. This ignores temporal leakage and is optimistic; time-split evaluation with label-delay handling is required.
Pitfall: Optimizing for `Recall` or `AUC` without mapping to investigator capacity. A high-recall model may flood teams with low-value alerts and reduce net recovered dollars.
Pitfall: Ignoring label-selection bias — treating investigated labels as ground-truth without correction leads to biased precision estimates and misguided thresholding.
Connections
Interviewers may pivot to experiment design (how to measure causal impact of alerts), adversarial robustness (providers gaming the model), or operational ML (feature-store and scoring latency tradeoffs) — be prepared to translate model outputs into measurable operational metrics like recovered dollars per investigator-hour.
Further reading
- [Liu, et al. "Learning from Positive and Unlabeled Examples" (2003)]: foundational PU-learning methods useful for biased investigator labels.
- [S. B. K. Scott, "A Tutorial on Positive Unlabeled Learning"]: practical methods and pitfalls for PU scenarios.
- [Kohavi et al., "Practical Guide to Controlled Experiments on the Web"]: experiment design patterns applicable to randomized auditing and treatment assignment.
Practice questions
Focus area — You explicitly selected probability of CAPA recurrence, so focus on quality-event features, recurrence labels, controls, and explainable risk scoring.
What's being tested
Interviewers probe whether you can turn regulated-quality problems into rigorous, actionable analytics: define recurrence precisely, choose monitoring and causal methods that respect censoring and repeated events, and evaluate CAPA effectiveness with defensible metrics. For CVS Health this demonstrates you can reduce repeat failures without over-alerting operations, while producing auditable, statistically sound evidence for regulators and stakeholders.
Core knowledge
- GxP and CAPA context: understand that analyses must be reproducible, auditable, and time-stamped; models and metrics are evidence used in regulatory review, not just operational signals.
- Unit-of-analysis choice: decide between site-level, batch-level, or product-unit metrics; different aggregation changes recurrence counts, denominators, and censoring behavior.
- Event definition & labeling: precisely define recurrence (same failure mode within a time window, same root cause tag, or same CAPA ID) and document deterministic rules; ambiguous labels break downstream inference.
- Censoring and left-truncation: handle right-censoring for ongoing follow-up and left truncation for systems with incomplete history; use survival frameworks to avoid biased rates.
- Recurrent-event models: use Andersen–Gill, Prentice–Williams–Peterson, or gap-time models when multiple failures per unit occur; each encodes different risk assumptions about independence and ordering.
- Monitoring & detection: use CUSUM and EWMA for low-latency recurrence detection, and Shewhart/u-charts when counts are independent per period; calibrate control limits to desired false-alarm rates.
- Survival analysis basics: Kaplan–Meier estimator for event-free survival, and Cox proportional hazards for covariate effects; hazard function and hazard ratios quantify time-varying risk.
- Time-dependent covariates: model process changes, CAPA implementation dates, and seasonal effects as time-varying covariates in hazard or recurrent-event models to avoid immortal time bias.
- Causal attribution: for CAPA effectiveness use difference-in-differences, interrupted time series, or randomized trials when feasible; control confounding via matching or fixed effects.
- Metric design & alerting: define key metrics (recurrence rate per K units, mean time to recurrence, hazard ratio) and operationalize thresholds with precision/recall tradeoffs and business cost weights.
- Model evaluation: prefer concordance index (C-index), calibration plots, lift curves for risk models, and alarm-level precision/false-alarm-rate for monitoring systems; quantify statistical uncertainty (CI, bootstrap).
- Sample size & power: size the study to detect a given relative risk reduction against the baseline recurrence rate using standard binomial power approximations; small base rates require large N or longer follow-up to detect modest effects.
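To make the censoring point in the list above concrete, here is a minimal Kaplan–Meier sketch in plain Python. The follow-up times and event flags are hypothetical; in practice a survival library would normally be used, and this version only shows how right-censored units stay in the risk set until they drop out.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival)] at each distinct observed event time.

    durations: follow-up time per unit (e.g., days since CAPA closure).
    events: 1 if a recurrence was observed at that time, 0 if right-censored.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        # Recurrences observed exactly at time t.
        d = sum(1 for dur, ev in zip(durations, events) if dur == t and ev == 1)
        # Everyone leaving the risk set at t (events and censored units).
        leaving = sum(1 for dur in durations if dur == t)
        if d > 0:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up data for six CAPAs.
durations = [30, 45, 45, 60, 90, 90]
events = [1, 1, 0, 1, 0, 0]
km_curve = kaplan_meier(durations, events)
```

A raw recurrence proportion (3 of 6) would suggest 50% event-free, whereas the Kaplan–Meier estimate at day 60 is 4/9, because units censored early contribute less follow-up time.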
Worked example — "Detecting CAPA recurrence from operational metrics"
Frame: ask how the interviewer defines a recurrence (same root cause tag? same CAPA ID?), the unit of observation, the observation window, and data latency or censoring. Skeleton approach: (1) operationalize event label and denominator; (2) build descriptive cohorts and Kaplan–Meier curves to show event-free probability over time; (3) fit a recurrent-event Cox model (or Andersen–Gill) with CAPA implemented as a time-varying covariate to estimate effect size; (4) build an EWMA or CUSUM monitor on recurrence counts for near-real-time detection and set thresholds by expected false-alarm rate. Tradeoff to flag: sensitivity vs alarm fatigue — tighter thresholds detect smaller recurrences but increase operational cost and noise. Closing: if time allowed, propose a randomized pilot across sites or an interrupted-time-series with matched controls to strengthen causal claims and sketch a monitoring dashboard with drilldown per-site event attribution.
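The CUSUM monitor mentioned in step (4) could look roughly like the following one-sided upper CUSUM. The target rate, slack, threshold, and weekly counts are hypothetical placeholders; in practice they would be tuned to the desired false-alarm rate.

```python
def cusum_upper(counts, target, slack, threshold):
    """One-sided upper CUSUM: S_t = max(0, S_{t-1} + x_t - target - slack).

    Returns the S_t path and the index of the first period where S_t exceeds
    the threshold (None if it never does).
    """
    s = 0.0
    path = []
    first_alarm = None
    for i, x in enumerate(counts):
        s = max(0.0, s + x - target - slack)
        path.append(s)
        if first_alarm is None and s > threshold:
            first_alarm = i
    return path, first_alarm


# Hypothetical weekly recurrence counts with an upward shift in the last weeks.
weekly_recurrences = [1, 2, 1, 0, 2, 4, 5, 4]
path, alarm = cusum_upper(weekly_recurrences, target=1.5, slack=0.5, threshold=4.0)
```

Raising `slack` or `threshold` lowers the false-alarm rate at the cost of slower detection, which is the sensitivity versus alarm-fatigue tradeoff flagged above.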
A second angle — "Attributing CAPA recurrence to ineffective CAPA versus new causes"
Same analytic toolkit applies but the framing shifts to causal decomposition: create a competing-risks setup or use multi-state models where transitions are labelled by cause. Use propensity-score matching or synthetic controls to compare sites that implemented the CAPA versus similar sites that did not, and run an interrupted time series to rule out temporal confounders. For repeated events, model cause-specific hazards or use multilevel logistic regression for per-event attribution, and include process-change indicators as time-dependent covariates. This emphasizes isolating CAPA effectiveness from background drift and new failure modes.
Common pitfalls
Pitfall: treating any repeat incident as recurrence without linking to the same root cause. This inflates recurrence rates and misattributes CAPA failure.
Document the matching rules and use deterministic or fuzzy linkage logic (same CAPA ID, same failure code, or text-similarity thresholds) to define recurrence precisely.
Pitfall: ignoring censoring and follow-up time differences across units. Comparing raw counts biases conclusions if exposure windows differ.
Use survival or rate-based metrics (events per exposure time) and include right-censoring to yield unbiased estimates.
Pitfall: proposing black-box ML predictions without interpretable attribution for regulators. High AUC isn't enough for remediation decisions.
Prefer interpretable models or provide post-hoc explanations (feature importance, SHAP) and always report effect sizes with confidence intervals, not just p-values.
Connections
Analysts may be asked next about process mining for workflow bottlenecks, root-cause analysis using causal graphs, or production anomaly detection for early-warning that feeds CAPA triage. Be ready to show how your recurrence metrics feed experiments or prioritization.
Further reading
- Applied Survival Analysis, Hosmer / Lemeshow / May: practical guide for Kaplan–Meier and Cox models, including time-dependent covariates.
- The Statistical Analysis of Recurrent Events, Cook & Lawless (2007): covers Andersen–Gill and gap-time formulations for repeated failures.
Practice questions
Pharmacy Supply Chain Risk Analytics
Focus area — You selected supply-chain risk, so practice shortages, vendor reliability, cold-chain exceptions, inventory anomalies, and operational risk prioritization.
What's being tested
Interviewers probe your ability to turn pharmacy operations data into actionable risk signals: forecasting demand and lead-time, defining and validating stockout/risk metrics, diagnosing anomalies, and separating correlation from causation. At CVS Health this maps directly to patient continuity and regulatory/financial risk, so expect emphasis on model robustness, rare-event handling, and explainability for stakeholders.
Core knowledge
- Stockout / fill rate definitions: fill rate = shipped units / ordered units; measure at SKU-store-day and aggregated by week, supplier, and RDC (regional distribution center) for operational actionability.
- Lead time demand & safety stock formula: $SS = z \cdot \sigma_{LTD}$, where $\sigma_{LTD} = \sqrt{L\,\sigma_D^2 + \bar{D}^2\,\sigma_L^2}$ is the standard deviation of lead-time demand, $L$ is the mean lead time, $\bar{D}$ is the mean demand per period, and $\sigma_D$, $\sigma_L$ are the standard deviations of per-period demand and of lead time; take $z$ from the desired service level (e.g., 95%).
- Intermittent demand handling: use Croston's method or intermittent-aware scaled errors (MASE or RMSSE); avoid naive MAPE for zero-heavy series, since it is undefined whenever actual demand is zero, and note that sMAPE is also unstable at zeros.
- Time-series CV: use rolling-origin (walk-forward) cross-validation with embargoed windows to avoid leakage across promotions or seasonality; evaluate multiple horizons (1, 7, 30 days).
- Forecast models & tradeoffs: statistical (ETS/`ARIMA`) good for interpretability; `Prophet` for holiday effects; tree models (`XGBoost`, `LightGBM`) for cross-sectional features and cold-start. For N up to ~100k SKUs, ensemble + reconciliation scales; millions require hierarchical or demand-compression.
- Hierarchical forecasting & reconciliation: forecast at SKU-store and SKU-region levels, reconcile with MinT or bottom-up/top-down to preserve aggregate accuracy and avoid inventory inconsistencies.
- Rare-event classification: predicting stockout events needs class-imbalance tactics (resampling, focal loss, class weights) and metrics like PR-AUC, precision@K, and recall for top-risk stores; optimize for the operational cost of false negatives.
- Causal vs predictive: use causal inference tools (DiD, synthetic control, ITS) when evaluating supplier changes or policy; predictive models can flag risk but not prove impact of interventions.
- Anomaly detection & monitoring: use statistical control charts (CUSUM, EWMA) on lead time, fill rate, and outbound volume; pair automated detection with supplier-level drilldowns and changepoint analysis.
- Feature engineering signals: lagged demand, promo flags, inventory on hand, days-of-supply, supplier reliability (late shipments rate), PO lead time distribution, and external signals (flu season indices, ICD-coded claims).
- Evaluation tied to business cost: translate model outcomes to dollar or patient-impact metrics (stockout-days avoided, lost-dispensations, emergency substitutions); use cost-sensitive thresholds for deployment.
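The safety-stock formula in the list above can be checked numerically with the standard library alone. The demand and lead-time figures below are hypothetical, and the sketch assumes demand and lead time are independent and that lead-time demand is approximately normal.

```python
from math import sqrt
from statistics import NormalDist


def safety_stock(service_level, mean_demand, sd_demand,
                 mean_lead_time, sd_lead_time):
    """Safety stock = z * sigma_LTD, with
    sigma_LTD = sqrt(L * sd_D^2 + mean_D^2 * sd_L^2)."""
    z = NormalDist().inv_cdf(service_level)  # e.g. about 1.645 for 95%
    sigma_ltd = sqrt(mean_lead_time * sd_demand ** 2
                     + mean_demand ** 2 * sd_lead_time ** 2)
    return z * sigma_ltd


# Hypothetical SKU: 20 units/day (sd 5), lead time 4 days (sd 1 day).
ss = safety_stock(0.95, mean_demand=20.0, sd_demand=5.0,
                  mean_lead_time=4.0, sd_lead_time=1.0)
```

In this example the lead-time variability term (400) dominates the demand variability term (100), which illustrates why supplier reliability can matter more than forecast error for some SKUs.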
Worked example — Predict pharmacy stockouts 7 days ahead
Clarify scope: ask whether prediction is per SKU-store, per store (all-SKU), and whether the label is "any stockout" or "X% fill rate below threshold"; confirm available features and labeling delay. Frame answer around three pillars: (1) target definition and exploratory analysis to understand intermittency and seasonality, (2) modeling approach (time-series models for high-volume SKUs, gradient-boosted trees for cross-sectional cold-start), and (3) evaluation and deployment (rolling-origin CV, PR-AUC and recall at fixed operational budgets, and weekly alerting). For features, emphasize lead-time-aware variables (days-of-supply, incoming POs) and calendar effects (holidays, known health events). Flag the key tradeoff: optimize for recall (avoid missed stockouts) at the expense of more false positives that operational teams can triage, or optimize precision if triage capacity is limited. Close by saying: if more time, I’d run an ablation to quantify feature importance, build a simple causal check for supplier-level confounding, and run a pilot with human-in-the-loop threshold tuning.
A second angle — Causal impact of supplier consolidation on fill rates
If asked to evaluate whether consolidating suppliers caused a drop in fill rate, pivot from prediction to causal design: define pre/post windows, identify comparable control pharmacies (matching on baseline fill rate, volume, geography), and run a difference-in-differences or synthetic control. Emphasize checks: parallel trends test, placebo periods, and sensitivity to spillovers (a supplier outage may affect multiple stores). Data limitations (non-random consolidation assignments) drive choice of method; if randomization isn’t possible, rely on rich covariates and robustness checks rather than claiming definitive causality.
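A minimal 2x2 difference-in-differences calculation for this scenario might look like the sketch below. The weekly fill rates are invented for illustration, and the estimate is only meaningful if the parallel-trends assumption is plausible; a real analysis would use a regression with standard errors rather than raw means.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """(treated post - treated pre) minus (control post - control pre)."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical weekly fill rates before/after supplier consolidation.
treated_pre = [0.96, 0.95, 0.97]
treated_post = [0.90, 0.91, 0.89]
control_pre = [0.95, 0.94, 0.96]
control_post = [0.93, 0.94, 0.92]

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
```

Here the treated pharmacies dropped by 6 points and the controls by 2, so the estimated effect of consolidation is about a 4-point drop in fill rate, net of the shared background trend.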
Common pitfalls
Pitfall: Treating forecasting like a generic ML problem.
Building a single global model without accounting for intermittent demand, hierarchical structure, and SKU heterogeneity often yields misleading average metrics and misses tail risks.
Pitfall: Using mean-based error metrics for zero-inflated series.
Reporting MAPE or RMSE on intermittent SKU series understates practical failure modes; evaluate with MASE, service-level attainment, and event-based recall/precision.
Pitfall: Overstating causality from correlated signals.
Claiming a supplier caused stockouts because of concurrent demand spikes or seasonality will harm credibility; always present robustness checks, alternative explanations, and confidence intervals.
Connections
Interviewers may pivot to inventory optimization and ask about reorder policies (s,Q,R) or to ML-engineering topics like feature-store design and real-time scoring for alerts. They may also move into prescriptive analytics (optimization to minimize stockouts given budget).
Further reading
- Forecasting: Principles and Practice (Hyndman & Athanasopoulos): practical time-series methods, intermittent demand, and cross-validation guidance.
- Brodersen et al., “Inferring causal impact using Bayesian structural time-series models” (CausalImpact): good primer for evaluation of interventions in time series.
Practice questions
Scikit-Learn Classification Pipelines
Focus area — Your selected risk-modeling and fraud-scoring topics map directly to leak-free classification pipelines, temporal splits, calibration, and thresholding.
What's being tested
Candidates must demonstrate building leak-free, reproducible classification pipelines using `scikit-learn`: correct temporal splitting to avoid future information leakage; preprocessing (imputation, encoding, scaling) inside a pipeline; handling missingness and class imbalance; choosing evaluation metrics (ROC AUC, PR AUC, Brier score) and applying calibration and hyperparameter tuning without overfitting. Interviewers at CVS Health care about robust, auditable models for sensitive business and clinical decisions where leakage, miscalibration, or biased evaluation can cause wrong actions and poor patient/member outcomes.
Core knowledge
- Temporal leakage: any feature containing future information (timestamps derived after label), or aggregations computed with full history, will inflate validation performance; always split by time and compute aggregations using only past windows.
- `Pipeline` and `ColumnTransformer`: use `Pipeline` to chain `Transformer`s and a final `Estimator` so `fit_transform` is applied to the training portion of each CV split; use `ColumnTransformer` to apply different transforms to numeric/categorical groups without leaking preprocessing.
- Imputation patterns: distinguish MCAR/MAR/MNAR; prefer `SimpleImputer` or model-based imputation inside the pipeline; include missing indicator features (`add_indicator=True`) when missingness is informative.
- Encoding categorical vars: use `OneHotEncoder(handle_unknown='ignore')` for nominals; `OrdinalEncoder` only if order is meaningful; avoid leaking rare-category statistics computed across folds.
- Scaling and regularization: apply `StandardScaler` inside the pipeline before regularized models (`LogisticRegression` with `penalty='l2'`); scaling must be fitted on the training fold only.
- Class imbalance tactics: prefer `class_weight='balanced'` or penalized loss over synthetic resampling when probability outputs matter; reweighting still shifts predicted probabilities toward the minority class, so check calibration afterwards. Use resampling (`SMOTE`, random oversample) carefully and inside the CV pipeline to avoid synthetic samples leaking.
- Calibration & probabilities: evaluate calibration with the Brier score, $BS = \frac{1}{N}\sum_{i=1}^{N}(p_i - y_i)^2$ (with $p_i$ the predicted probability and $y_i \in \{0, 1\}$ the observed label), and reliability diagrams; use `CalibratedClassifierCV` (sigmoid/isotonic) nested inside CV to avoid overfitting calibration.
- Evaluation metrics: use PR AUC when positives are rare (gives precision/recall tradeoff); ROC AUC can be misleading with extreme imbalance; report both and calibration metrics, plus business thresholds (precision@k).
- Cross-validation designs: for churn/time-series, use `TimeSeriesSplit` or holdout by customer cohort/time window; hyperparameter tuning should be nested CV to avoid optimistic estimates.
- Feature engineering safely: compute aggregations (e.g., recency, frequency) with explicit lookback windows; store metadata (window length, anchor time) so features are reproducible at scoring time.
- Threshold selection & cost-sensitivity: map threshold to business costs; choose the threshold by maximizing expected utility or by precision/recall at the operating point rather than raw accuracy.
- Operational considerations that matter for DS: keep a backtest (chronological holdout) that mirrors the deployment horizon; monitor population and calibration drift over time post-deployment.
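The Brier score and utility-based threshold selection from the list above can be sketched without any dependencies. The probabilities, labels, candidate thresholds, and the benefit/cost values are hypothetical; for binary labels, `sklearn.metrics.brier_score_loss` computes the same mean squared difference.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def best_threshold(probs, labels, benefit_tp, cost_fp, candidates):
    """Pick the candidate threshold with the highest total utility,
    where utility = benefit_tp * TP - cost_fp * FP on a validation fold."""
    def utility(t):
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        return benefit_tp * tp - cost_fp * fp
    return max(candidates, key=utility)


# Hypothetical validation-fold predictions.
probs = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]

bs = brier_score(probs, labels)
threshold = best_threshold(probs, labels, benefit_tp=5.0, cost_fp=1.0,
                           candidates=[0.1, 0.25, 0.5, 0.85])
```

Because a true positive is worth five false positives in this toy setup, the chosen threshold (0.25) is lower than the default 0.5; the threshold should be selected on a validation fold, never on the final test set.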
Worked example — Build a leak-free sklearn churn pipeline
First 30 seconds: ask clarifying questions — what is the label definition and its lookback/horizon (e.g., churn within 30 days), what event timestamps are available, and whether customers can appear in multiple folds. Frame the solution around three pillars: (1) data splitting: create a chronological train/validation/test split by anchor date or customer cohort, avoiding overlap; (2) pipeline composition: construct a `ColumnTransformer` that imputes (`SimpleImputer` with indicator), encodes (`OneHotEncoder(handle_unknown='ignore')`), scales (`StandardScaler`), and then fits a classifier; (3) evaluation and calibration: use nested `TimeSeriesSplit` CV for hyperparameter search, evaluate with both PR AUC and ROC AUC, and apply `CalibratedClassifierCV` if probabilities are used for risk scoring. A concrete tradeoff to call out: oversampling (e.g., `SMOTE`) can improve recall but often harms probability calibration — prefer `class_weight` for calibrated scores unless you recalibrate. Close by saying: if more time, I would add feature lookback validation, business-cost-based threshold selection, and monitor post-deployment calibration drift.
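The three pillars above could be assembled roughly as follows. This is a sketch on synthetic data, assuming rows are already sorted by anchor date; the column names (`tenure_days`, `monthly_spend`, `plan`) and the label pattern are hypothetical, and hyperparameter search and calibration are left out for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; rows are assumed to be in chronological order.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "tenure_days": rng.normal(400, 100, n),
    "monthly_spend": rng.normal(50, 15, n),
    "plan": rng.choice(["basic", "plus", "premium"], n),
})
X.loc[X.index % 10 == 3, "monthly_spend"] = np.nan  # inject missing values
y = (np.arange(n) % 5 == 0).astype(int)             # ~20% positives, toy label

numeric = ["tenure_days", "monthly_spend"]
categorical = ["plan"]

# All preprocessing lives inside the pipeline, so it is refit per CV split.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median", add_indicator=True)),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Chronological folds: each split trains on earlier rows, tests on later ones.
cv = TimeSeriesSplit(n_splits=4)
scores = cross_val_score(model, X, y, cv=cv, scoring="average_precision")
```

Since the toy label is unrelated to the features, the PR AUC values will hover near the base rate; the point of the sketch is the structure, in which imputation, scaling, and encoding statistics are learned from each training window only.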
A second angle — Design classification under missingness and imbalance
Start by diagnosing missingness patterns (per-feature missing rates and correlation with label). For MAR patterns, use model-based or KNN imputation inside the pipeline and keep missing indicators; for MNAR, consult domain owners — sometimes missingness itself is a predictive signal. For severe class imbalance, prioritize metrics that reflect operational goals: PR AUC and precision at fixed recall or top-k precision. Prefer calibrated `LogisticRegression` with regularization and `class_weight` to produce reliable probabilities; if resampling is used, do it only inside the training fold and follow with calibration on a validation fold. This reframes the same pipeline principles but emphasizes diagnostic steps and calibration-first choices given noisy/missing data.
Common pitfalls
Pitfall: fitting preprocessing outside cross-validation
A common error is imputing or scaling on the full dataset before CV, leaking summary statistics. Always encapsulate transforms inside a `Pipeline` so each fold learns transforms from training data only.
Pitfall: optimizing for ROC AUC on highly imbalanced data
ROC AUC can mask poor precision for rare positives; report PR AUC and precision@k or thresholded metrics that reflect the action you’ll take (e.g., outreach capacity).
Pitfall: ignoring calibration after using resampling
Synthetic oversampling can distort predicted probabilities. If resampling is necessary, explicitly recalibrate (e.g., `CalibratedClassifierCV`) using a validation split consistent with production scoring.
Connections
Interviewers may pivot to model monitoring and drift detection (population shift, calibration drift) or to causal inference/experiment design for measuring the impact of interventions on churn. They might also ask about moving the pipeline to production and what metrics you'd monitor post-deployment.
Further reading
- `scikit-learn` Pipelines and `ColumnTransformer`: authoritative guide to building leak-free preprocessing.
- `imbalanced-learn` documentation: practical patterns for resampling methods and integration with `scikit-learn`.
Practice questions
Focus area — Audit and risk-scoring cases depend on missingness, rare outcomes, leakage checks, deduplication, and defensible data-quality validation.
What's being tested
Candidates must show practical competence diagnosing and handling missing data, class imbalance, and data-quality issues inside a predictive workflow. Interviewers look for safe, leak-free pipeline construction, appropriate imputation/resampling choices, and correct metric selection and interpretation for skewed classes — all from a data-scientist (analysis/modeling) lens.
Patterns & templates
- Temporal / group-safe splitting: use time-based holdouts to prevent future leakage and `GroupKFold` to keep the same entity out of both train and validation; O(n) data copy, check max(event_time) per fold.
- Scikit-learn pipeline pattern: `Pipeline` + `ColumnTransformer` + `SimpleImputer`/`StandardScaler` for a reproducible fit/transform lifecycle.
- Missingness strategy matrix: use `SimpleImputer` for MCAR, `IterativeImputer`/model-based for MAR, and flag missingness with indicator columns for MNAR.
- Imputation pragmatics: prefer median for skewed numeric, mode for categorical; always fit imputers on train only (`fit`/`transform` split).
- Imbalance handling: take an evaluation-first approach with `class_weight='balanced'` or `CalibratedClassifierCV`; apply `SMOTE`/undersampling only to the training portion, after the train/validation split, to avoid leakage.
- Metrics for skew: report `roc_auc_score` plus `average_precision_score` (PR AUC); calibrate probabilities and report the Brier score for calibration.
- Data audit checklist: `pd.to_datetime`, `astype` casts, `isnull().sum()`, `describe()`; in SQL use `COALESCE` and explicit casts to avoid silent truncation.
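The "fit imputers on train only" pattern from the list above is sketched below in plain Python, using a time-based cutoff and a missing-indicator flag. The rows, the cutoff date, and the helper names are hypothetical; with `scikit-learn`, `SimpleImputer(strategy="median", add_indicator=True)` inside a `Pipeline` plays the same role.

```python
from statistics import median


def fit_median_imputer(train_values):
    """Learn the fill value from training rows only, ignoring missing entries."""
    return median(v for v in train_values if v is not None)


def apply_imputer(values, fill):
    """Return (imputed values, missing-indicator flags)."""
    imputed = [fill if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return imputed, flags


# Hypothetical (event_date, paid_amount) rows; ISO dates sort as strings.
rows = [
    ("2023-01-05", 120.0), ("2023-02-10", None), ("2023-03-15", 80.0),
    ("2023-04-20", 100.0), ("2023-05-25", None), ("2023-06-30", 5000.0),
]
cutoff = "2023-05-01"
train = [amount for date, amount in rows if date < cutoff]
test = [amount for date, amount in rows if date >= cutoff]

fill = fit_median_imputer(train)  # uses only pre-cutoff amounts
train_imputed, train_flags = apply_imputer(train, fill)
test_imputed, test_flags = apply_imputer(test, fill)
```

Pooling train and test before computing the median would give 110 instead of 100 here, because the post-cutoff outlier (5000) leaks into the fill value.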
Common pitfalls
Pitfall: Imputing using test+train pooled statistics — produces optimistic performance and feature leakage.
Pitfall: Relying solely on ROC AUC for 1% positive-rate problems — PR AUC and calibration matter more.
Pitfall: Running `SMOTE` or feature engineering before splitting. Synthetic samples or leaked statistics inflate validation scores.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions
Data Manipulation (SQL/Python)
Focus area — CVS analytics is healthcare-flavored, and your requested audit nuance makes claims cohorts, paid amounts, and trend analysis high-value.
What's being tested
Candidates must show practical cohort analysis and claims spend aggregation skills using SQL or Python: correct deduplication, time-bucketing (calendar vs fiscal), and clean YoY/percentage computations. Interviewers probe for robust data-manipulation idioms (joins, window functions, group-by) and defensible handling of NULLs, zeros, and attribution windows.
Patterns & templates
- Last-event-per-claim / de-dup using window functions: `ROW_NUMBER() OVER (PARTITION BY claim_id ORDER BY paid_date DESC)` and keep `rn = 1` as the canonical record.
- Conditional aggregation for percentages: `SUM(CASE WHEN condition THEN amount ELSE 0 END) / NULLIF(SUM(amount), 0)` to avoid divide-by-zero.
- Cohort-by-first-event: assign cohort as `DATE_TRUNC('year', first_service_date)` then `GROUP BY` cohort, fiscal_month for retention/spend.
- Fiscal month derivation: `(EXTRACT(month FROM date) + offset - 1) % 12 + 1` or `date_trunc('month', date - interval 'X months')` for fiscal-year shifts.
- Pandas equivalents: `pd.to_datetime`, `df.drop_duplicates(subset='claim_id', keep='last')` (after sorting by `paid_date`), `df.groupby(['cohort','year']).agg(...)` and `pivot_table` for cross-tabs.
- Performance rule of thumb: single-pass aggregations are O(N); avoid repeated joins on large tables and pre-aggregate before joining.
- Edge-case joins: prefer `LEFT JOIN` + IS NULL checks to detect missing enrollment/payer rows; use `COUNT(DISTINCT member_id)` carefully for large cardinality.
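The de-duplication and guarded-percentage patterns above can be tried end to end with Python's built-in `sqlite3` module as a stand-in engine (window functions need SQLite 3.25 or newer). The `claims` table, its columns, and the sample rows are hypothetical, and `yoy_change` is an illustrative helper rather than part of any library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE claims (
    claim_id TEXT, member_id TEXT, paid_date TEXT, drug_class TEXT, amount REAL
);
INSERT INTO claims VALUES
    ('c1', 'm1', '2023-01-10', 'specialty', 500.0),
    ('c1', 'm1', '2023-01-20', 'specialty', 450.0),  -- adjusted version of c1
    ('c2', 'm2', '2023-02-05', 'generic',    50.0),
    ('c3', 'm1', '2023-03-01', 'generic',   100.0);
""")

query = """
WITH latest AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY claim_id ORDER BY paid_date DESC
           ) AS rn
    FROM claims
)
SELECT SUM(amount) AS total_paid,
       SUM(CASE WHEN drug_class = 'specialty' THEN amount ELSE 0 END)
           / NULLIF(SUM(amount), 0) AS specialty_share
FROM latest
WHERE rn = 1  -- keep only the most recent record per claim
"""
total_paid, specialty_share = conn.execute(query).fetchone()


def yoy_change(current, prior):
    """Year-over-year change; None when the prior-year value is zero or missing."""
    return None if not prior else (current - prior) / prior
```

Without the `rn = 1` filter the superseded 500.0 row for `c1` would be counted as well, inflating both total spend and the specialty share.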
Common pitfalls
Pitfall: Using raw `COUNT(*)` instead of `COUNT(DISTINCT)` when duplicates exist, inflating denominators and misreporting percentages.
Pitfall: Failing to define cohort anchoring (first claim vs first paid) so YoY comparisons mix different populations.
Pitfall: Dividing by zero when prior-year spend is zero. Always wrap the denominator with `NULLIF(..., 0)` or explicit guard logic.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions
Focus area — You asked for pharma audit nuance; practice PBM claim adjudication, formulary, rebate, prior authorization, and network pharmacy audit scenarios.
What's being tested
Interviewers are probing your ability to turn claims and audit outcomes into actionable, measurable models for fraud/waste/abuse prioritization, anomaly detection, and audit ROI optimization. They expect clear metric design (what success looks like), careful handling of extreme class imbalance and selection bias, and defensible evaluation tied to business cost/savings. At CVS Health this matters because small percentage improvements in detection precision at top ranks directly translate to large recoveries and reduced downstream manual work.
Core knowledge
-
Label generation & selection bias: Audit-confirmed labels come from prioritized reviews, creating labeling bias; correct with randomized holdouts, inverse-propensity weighting, or targeted random audits before model rollout.
-
Cost-aware objective: Optimize expected net savings, $\mathrm{ENS}_i = p_i S_i - C_i$, where $p_i$ is the predicted fraud probability, $S_i$ the expected savings, and $C_i$ the investigation cost; rank by marginal ROI, not raw probability.
-
Imbalanced classification: Use precision@k, AUC-PR, and recall@cost rather than AUC-ROC; calibrate probabilities with Platt scaling or isotonic regression for monetary decisioning.
-
Top-k evaluation & business KPIs: Report precision@k, recovery rate, cost-per-detection, and lift versus random sampling; compute cumulative savings curves for operational capacity.
-
Supervised vs unsupervised detection: Supervised XGBoost/LightGBM work well with historical labels; unsupervised IsolationForest, LOF, or density-estimation and graph algorithms detect novel schemes without labels.
-
Graph/network features: Construct bipartite provider–pharmacy graphs; use PageRank, community detection, and network centrality to capture collusion patterns and feature-engineer suspicious relationships.
-
Temporal/sequence modeling: Use rolling features, deltas, and sequence models (RNN, aggregated time-window features) to detect sudden behavior changes; beware look-ahead bias when building features.
-
Human-in-the-loop & active learning: Use uncertainty sampling to get labels on borderline cases; balance exploration (discover new fraud modes) with exploitation (recoveries).
-
Explainability & auditability: Provide SHAP explanations or monotonic constraints so auditors can understand model decisions and document rationale for regulatory review.
-
Drift & monitoring: Monitor precision@k and expected savings over time; trigger retraining when top-k precision drops or feature distributions shift beyond thresholds.
-
Causal evaluation for interventions: When testing new audit rules, use randomized controlled trials or difference-in-differences to estimate incremental recovery versus business-as-usual.
-
Operational constraints: Incorporate capacity, seasonal patterns, and legal/regulatory limits into evaluation and optimization; treat case selection as a constrained knapsack problem: maximize expected savings subject to capacity.
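The cost-aware objective and the capacity constraint can be combined in a few lines. The sketch below takes expected net savings to be probability times expected savings minus investigation cost, which is one reading of the objective above, and uses a greedy ratio heuristic rather than an exact knapsack solver. The `Case` fields and the hours-based capacity are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Case:
    case_id: str
    p_fraud: float           # calibrated fraud probability (assumed available)
    expected_savings: float  # dollars recovered if the case is confirmed
    cost: float              # investigation cost in dollars
    hours: float             # investigator hours required, must be > 0

    @property
    def net_savings(self) -> float:
        return self.p_fraud * self.expected_savings - self.cost


def select_cases(cases, capacity_hours):
    """Greedy heuristic: take positive-value cases by net savings per hour until capacity runs out."""
    ranked = sorted(
        (c for c in cases if c.net_savings > 0),
        key=lambda c: c.net_savings / c.hours,
        reverse=True,
    )
    chosen, used = [], 0.0
    for case in ranked:
        if used + case.hours <= capacity_hours:
            chosen.append(case)
            used += case.hours
    return chosen
```

Ranking by raw probability would put a 95%-likely but low-dollar case first; ranking by net savings per hour pushes it out when its expected recovery does not cover the investigation cost. The greedy rule is not guaranteed optimal, so an exact solver may be worth mentioning if the interviewer pushes on it.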
Worked example — "Design an audit-score model to prioritize pharmacy claims for manual review"
First 30 seconds: clarify objective (maximize recovered dollars per audit, or maximize number of confirmed frauds?) and operational constraints (daily audit capacity, required precision). Ask what ground-truth labels exist (post-audit dispositions) and whether a randomized audit pool exists.
Skeleton of approach: (1) Define the target metric (precision@k and expected-net-savings curve) and optimization objective (maximize expected net savings under capacity). (2) Build features from claim-level (NDC, quantity, days’ supply), provider history (average claim size, change-in-rate), and network features (provider–pharmacy co-occurrence). (3) Train a cost-sensitive classifier (LightGBM) with sample weighting = expected savings per case; calibrate probabilities. (4) Evaluate on a temporally separated holdout and a randomized audit sample to measure real-world precision and ROI.
Explicit tradeoff: flag that maximizing AUC-ROC can hurt business outcomes — instead, choose loss or sampling that emphasizes top-ranked precision (e.g., focal loss or optimizing directly for NDCG/precision@k). Report model performance as gains in expected net savings, not only statistical metrics.
Close: say you'd pilot with a randomized holdout audit to measure true lift, instrument for continuous monitoring, and iterate on feature generation and human feedback loops.
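Precision@k and lift over random sampling are simple enough to define from scratch in an interview. A minimal sketch, assuming `scores` and binary `labels` are parallel lists from a holdout set:

```python
def precision_at_k(scores, labels, k):
    """Share of confirmed positives among the k highest-scored cases."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of cases")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in order[:k]) / k


def lift_at_k(scores, labels, k):
    """Precision@k divided by the base rate, i.e. the gain over random selection."""
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("no positives in the evaluation set")
    return precision_at_k(scores, labels, k) / base_rate
```

Setting `k` to the daily or weekly audit capacity keeps the metric tied to what investigators can actually review.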
A second angle — "Detect anomalous prescribing patterns for a provider"
Same core skills apply but constraints shift: labels are rare or absent, and discovery of novel fraud methods is prioritized over immediate recoveries. Frame as an unsupervised + investigative workflow: build provider-level time-series aggregates, compute z-scores and peer-group deviations, then apply graph embedding (e.g., node2vec) to reveal suspicious clusters. Use ensemble anomaly scores combining density, isolation, and peer-deviation; prioritize cases by a business-weighted score combining anomaly magnitude and potential financial exposure. Emphasize validating findings with a small randomized investigator panel to create labels for bootstrapping a supervised model.
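The peer-group deviation step can be sketched as below. It swaps the plain z-score for a median/MAD "modified z-score", a robust variant chosen here because a single extreme provider would otherwise inflate the group's standard deviation. The record layout (provider, peer group, metric) is a hypothetical example.

```python
from collections import defaultdict
from statistics import median


def robust_z_scores(values):
    """values: dict of provider_id -> metric within one peer group."""
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        # no spread in the peer group, so nothing can be called an outlier
        return {k: 0.0 for k in values}
    # 0.6745 rescales MAD so scores are comparable to standard z-scores
    return {k: 0.6745 * (v - med) / mad for k, v in values.items()}


def peer_group_scores(records):
    """records: iterable of (provider_id, peer_group, metric)."""
    groups = defaultdict(dict)
    for provider_id, peer_group, metric in records:
        groups[peer_group][provider_id] = metric
    scores = {}
    for members in groups.values():
        scores.update(robust_z_scores(members))
    return scores
```

Multiplying each score by the provider's dollar exposure gives one simple form of the business-weighted priority described above.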
Common pitfalls
Pitfall: Optimizing for AUC-ROC in a 0.1% positive-rate problem. This yields optimistic-looking models that perform poorly at top-k; instead use AUC-PR, precision@k, and cost-weighted metrics.
Pitfall: Evaluating only on audited (biased) historical cases. That perpetuates selection bias; always hold back a randomized audit sample or use causal estimators like inverse-propensity weighting to estimate real-world lift.
Pitfall: Presenting probability outputs without calibration or business mapping. Uncalibrated scores mislead ROI estimates; calibrate and translate probabilities into expected savings before prioritization.
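To make the selection-bias pitfall concrete, here is a minimal inverse-propensity-weighting sketch. It assumes each audited case carries a known, strictly positive probability of having been selected for audit (in practice that propensity usually has to be estimated), and it uses the self-normalized (Hajek) form of the estimator.

```python
def naive_positive_rate(audited):
    """Unweighted confirmed-fraud rate among audited cases (biased if audits were targeted)."""
    labels = [label for label, _ in audited]
    return sum(labels) / len(labels)


def ipw_positive_rate(audited):
    """audited: iterable of (label, propensity); returns the weighted population-rate estimate."""
    numerator = denominator = 0.0
    for label, propensity in audited:
        if not 0 < propensity <= 1:
            raise ValueError("propensity must be in (0, 1]")
        weight = 1.0 / propensity  # rarely audited cases stand in for many unaudited ones
        numerator += weight * label
        denominator += weight
    if denominator == 0:
        raise ValueError("no audited cases")
    return numerator / denominator
```

When high-risk cases are audited far more often than low-risk ones, the naive rate overstates how common fraud is in the full population, and the weighted estimate pulls it back down.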
Connections
Interviewers may pivot to causal inference (measuring the effect of an audit program), uplift modeling (who to audit to change future behavior), or MLE/monitoring topics (serving, latency, drift detection) if the conversation moves toward deployment or evaluation.
Further reading
-
Interpretable Machine Learning, Christoph Molnar: practical guidance on SHAP, feature importance, and explainability for auditors.
-
Practical Lessons in Fraud Detection, various KDD/industry talks: look for conference tutorials on fraud detection pipelines and evaluation pitfalls (search KDD/NDSS/ACM workshops).
Practice questions
Behavioral & Leadership
Healthcare Privacy And PHI Compliance
Focus area — CVS work involves PHI, minimum-necessary access, secure sharing, and auditability, directly supporting your pharma-compliance preparation.
What's being tested
Interviewers are probing your ability to design and evaluate analyses and models that respect patient privacy and regulatory constraints while still delivering valid, actionable results. Expect questions about tradeoffs between utility and risk (e.g., model accuracy vs. re-identification risk), how you reduce exposure of protected health information (PHI) during feature engineering and outputs, and how you reason about threat models for data leakage. CVS Health cares because analytics and models must support clinical and business decisions without creating legal, ethical, or reputational risk.
Core knowledge
-
PHI: Know the HIPAA definition: identifiers (e.g., names, MRNs, addresses, dates) that, alone or in combination, can identify individuals; any dataset with those is regulated and requires safeguards.
-
HIPAA minimum necessary: Principle to limit data to the smallest set needed for the objective; document justification and alternatives whenever you request broader fields.
-
De-identification techniques: Understand Safe Harbor (remove 18 identifiers) vs Expert Determination (statistical assessment of re-identification risk) and their different guarantees and operational costs.
-
Quasi-identifiers & linkage risk: High-cardinality feature combinations (zip+dob+sex) can enable linkage attacks; mitigate via generalization (bin ages), suppression (drop small-count cells), or salted hashing; each impacts model signal.
-
Statistical privacy models: Differential privacy provides formal bounds: a mechanism $M$ is $(\varepsilon,\delta)$-DP if, for all datasets $D$, $D'$ differing by one record and all output sets $S$, $\Pr[M(D) \in S] \le e^{\varepsilon} \Pr[M(D') \in S] + \delta$. Smaller $\varepsilon$ means stronger privacy and lower utility.
-
Privacy budget & aggregation: DP consumes privacy budget per query; plan budgets for training, validation, and repeated reporting. For count queries, add Laplace noise with scale $b = \Delta f / \varepsilon$, where $\Delta f$ is the sensitivity.
-
Membership and model inversion attacks: Know the threat: an adversary infers whether an individual was in the training set or reconstructs attributes from model outputs; defenses include DP training, output clipping, temperature scaling, or restricting confidence scores.
-
Feature engineering tradeoffs: Prefer aggregated features (counts, flags over windows) to raw identifiers; quantify utility loss from aggregation by comparing AUC/precision before/after with held-out non-sensitive validation.
-
Synthetic data & utility testing: Synthetic EHR can help prototyping; validate by measuring distributional similarity (KS-test) and downstream model performance; synthetic data is not a drop-in privacy guarantee.
-
Model outputs & downstream sharing: Avoid releasing per-patient model scores with high granularity; consider cohort-level insights, top-k anonymized lists, or DP-noised summaries. Log and justify every external data product.
-
Access & governance responsibilities: As a Data Scientist you must request appropriate data, attach documented justification, follow IRB/legal approvals, and work with compliance on Data Use Agreements; you are not responsible for building access-control systems, but must follow them.
-
Monitoring for leakage: Monitor metrics such as unexpected distributional shifts, unusually high performance on small cohorts (overfitting to identifiers), and audit model explanations (SHAP/feature importances) for leaking quasi-identifiers.
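The Laplace mechanism for a count query is short enough to write out. The sketch below assumes a sensitivity of 1 (adding or removing one patient changes the count by at most 1) and samples Laplace noise as the difference of two exponential draws. It uses Python's `random` module for illustration only; that generator is not cryptographically secure, and a production release would rely on a vetted DP library.

```python
import random


def laplace_scale(sensitivity, epsilon):
    """Noise scale b = sensitivity / epsilon for the Laplace mechanism."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    return sensitivity / epsilon


def dp_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Return the count plus Laplace(0, b) noise; each call spends epsilon of the budget."""
    if rng is None:
        rng = random.Random()
    scale = laplace_scale(sensitivity, epsilon)
    # the difference of two independent Exp(1) draws follows Laplace(0, 1)
    noise = scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
    return true_count + noise
```

With `epsilon=0.5` the noise scale is 2, so a typical released count is off by about 2 in either direction; halving epsilon doubles that error, which is the privacy-utility tradeoff in one line.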
Worked example — "Design a predictive model using PHI-containing EHR data"
First 30 seconds: clarify the business goal, the minimum patient-level granularity needed, and whether individual predictions or cohort-level risk scores are required; ask about legal constraints (HIPAA, IRB, data use agreements).
Skeleton of approach: (1) define minimum necessary features, (2) design safe feature transforms (age bins, aggregated counts), (3) select an evaluation strategy that tests privacy-utility (holdout without PHI linkage), (4) choose mitigations for leakage (DP-SGD or output clipping), (5) set a deployment/output policy (who sees scores, how access is logged).
Explicit tradeoff: quantify how much predictive power you expect to lose from generalizing a key high-signal feature (e.g., exact dob → age bucket) and propose experiments: train a baseline with de-identified features and compare it against a synthetic authorized dataset to estimate the gap.
Close: say that with more time you'd run a formal re-identification risk assessment (expert determination), tune DP epsilon via privacy-utility curves, and coordinate with compliance for sign-offs.
A second angle — "Running an A/B test that touches PHI (care pathway change)"
Frame: focus on metric design that preserves privacy and obeys minimum necessary. Instead of raw patient-level outcomes, pre-aggregate primary endpoints at cohort level (counts, rates) and add DP noise if outputs leave secure environments. Ensure randomization unit is appropriate (patient vs. clinic) to avoid leaking individual allocation through sparse cells. Analysis: pre-register metrics, specify suppression thresholds for small n (e.g., any cell < 10 suppressed), and run sensitivity checks for imbalance introduced by data masking. You’d flag the operational constraint: frequent interim looks consume DP budget; prefer group-level monitoring and fewer looks or use alpha-spending frameworks combined with DP accounting.
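Small-cell suppression is easy to show in code. The sketch below follows the example threshold above literally (any cell below 10 is suppressed, including zeros) and does not handle complementary suppression, where a hidden cell could be recovered from row or column totals. The cell keys are hypothetical.

```python
def suppress_small_cells(counts, min_cell=10):
    """Replace any count below min_cell with None before results leave the secure environment."""
    return {cell: (n if n >= min_cell else None) for cell, n in counts.items()}


def safe_rate(numerator, denominator):
    """Return a rate only when both inputs survived suppression and the denominator is non-zero."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator
```

Reporting `None` rather than a small number keeps sparse arm-by-outcome cells from revealing which individual patients were in which arm.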
Common pitfalls
Pitfall: Over-generalizing feature warnings — removing all timestamps wholesale. Temporal granularity is often key to clinical prediction; instead, coarsen to week/month windows and test utility loss while documenting why coarser bins still meet the objective.
Pitfall: Treating synthetic data as equivalent to de-identification. Synthetic datasets can leak if trained on small or unique cases; always validate utility and run re-identification stress tests before relying on them for production decisions.
Pitfall: Communicating technical fixes as governance. Saying "we used hashing" without addressing salt management, or "we trained with DP" without reporting ε, δ, and the impact on utility undermines trust; give concrete parameters and rationale.
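On the hashing point, one way to make "salted" concrete is a keyed hash. The sketch below uses HMAC-SHA256 from the standard library as a stand-in for whatever scheme the team actually uses; the key shown in usage is a placeholder, and where the real key lives and who can read it is the governance question the pitfall is about. Whether the output counts as de-identified is a compliance decision that this code does not settle.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Keyed hash of an identifier: stable for joins, not reversible without the key."""
    if not secret_key:
        raise ValueError("a non-empty secret key is required")
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
```

The same identifier and key always produce the same token, so joins across tables still work, while an unkeyed `sha256` of a member ID could be reversed by hashing every plausible ID.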
Connections
Interviewers may pivot to experiment design under privacy constraints (sequential testing with DP), model monitoring for fairness and leakage, or to operational roles (Data Engineering) around secure enclaves and audit logging — be ready to explain interfaces and responsibilities, not low-level infra.
Further reading
-
Dwork & Roth, The Algorithmic Foundations of Differential Privacy — formal DP definitions and mechanisms.
-
NIST De-identification Guidance for Health Data — practical recommendations and risk assessment frameworks.
Practice questions
Focus area — SOX compliance was explicitly selected; prepare analytics for control testing, evidence trails, exceptions, access review, and financial reporting risk.
What's being tested
Interviewers are probing whether a Data Scientist can design statistically sound analytics to detect, monitor, and quantify failures in SOX (Sarbanes–Oxley) internal controls — without owning pipeline or remediation work. Expect to show sampling strategy, hypothesis testing for exceptions, anomaly-detection framing, metric-definition, and how to make outputs auditable and explainable to internal audit. CVS cares because automated, statistically defensible control monitoring reduces audit effort and financial risk while preserving explainability.
Core knowledge
-
Control types: Understand the difference between preventive and detective controls; monitoring frequency (daily/weekly/monthly) drives sample size and timeliness of analytics outputs.
-
Population vs sample: Use stratified sampling when exception rates vary by known strata (business unit, vendor, dollar-band). Compute sample size for proportions as $n = z^2 p(1-p) / e^2$, where $z$ is the critical value for the chosen confidence level and $e$ is the margin of error, with conservative $p = 0.5$ if unknown.
-
Exception rate metrics: Define numerator/denominator precisely (e.g., exceptions per 1,000 transactions), time-windowed rates, and normalize for transaction volume and seasonality (workdays, month-end).
-
Statistical tests: Use binomial or chi-square tests for proportions, t-tests for continuous control metrics, and adjust for multiple comparisons via Bonferroni or false discovery rate (BH) corrections when testing many rules.
-
Control charts and change detection: Apply EWMA or CUSUM charts for shifts; set control limits at $\mu \pm k\sigma$ and choose $k$ based on Type I/II tradeoffs; use p-charts for proportions.
-
Anomaly detection framing: Prefer scoring anomalies (probabilistic) over hard rules; evaluate with precision/recall and precision@k when labeled failures are scarce. For unsupervised detection, use isolation forest or density estimation plus manual review.
-
Explainability & auditability: Provide reproducible code notebooks, deterministic SQL queries, data snapshots, and concise feature-level explanations (feature importances, rule contributions) for auditors.
-
Dealing with drift and config changes: Instrument detection of upstream schema or business-process changes; control baseline windows must exclude rollout periods to avoid false positives.
-
Cost-sensitive thresholds: Quantify reviewer cost per alert and missed-risk cost; choose the threshold that minimizes expected cost = (FP_cost * FP_rate + FN_cost * FN_rate).
-
Temporal aggregation & lookback: Short windows increase variance; use rolling windows (e.g., 7/30/90 days) and decompose seasonality with STL or differencing before anomaly detection.
-
Graph analytics: For segregation-of-duties checks, model user-role-activity as a bipartite graph; compute centrality/connected components to find unexpected cross-role access.
-
Reconciliation to financials: For controls that impact reported numbers, quantify control effectiveness as reduction in error-rate and show sensitivity of financial statements under worst-case control failure.
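For the multiple-comparisons point, the Benjamini-Hochberg step-up procedure fits in a dozen lines. A minimal sketch, assuming one p-value per monitored rule and independent or positively dependent tests:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the rule is flagged at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # find the largest rank k with p_(k) <= (k / m) * alpha
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # step-up rule: everything at or below that rank is rejected
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected
```

Compared with Bonferroni, which would test every rule at `alpha / m`, this keeps more true alerts when dozens of control rules are checked each period.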
Tip: Always start with a one-line operational definition of the control and the precise numerator/denominator you will monitor.
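The sample-size calculation for proportions can be sketched with the standard normal-approximation formula n = z²·p(1−p)/e², where e is the margin of error. The optional finite-population correction is an addition here, not something the list above mentions; it matters when the transaction population is only a few thousand items.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin, confidence=0.95, p=0.5, population=None):
    """Items to sample so the exception-rate estimate is within +/- margin at the given confidence."""
    if not 0 < margin < 1 or not 0 < confidence < 1:
        raise ValueError("margin and confidence must be in (0, 1)")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    n = z * z * p * (1 - p) / margin ** 2
    if population is not None:
        # finite-population correction (optional assumption, see lead-in)
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)
```

At 95% confidence, a 5-point margin, and the conservative p = 0.5, this gives the familiar 385; for a population of 2,000 transactions the corrected figure drops to 323.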
Worked example — "Design analytics to monitor a journal-entry approval control"
Frame: Ask clarifying questions in the first 30 seconds: what constitutes an approved journal entry, the SLA for approval time, relevant attributes (amount, user, role, business unit), and existing labeled exceptions.
Skeleton answer pillars: (1) metric and SLAs (exception rate, approval lag), (2) sampling and alert thresholds (stratified by high-dollar entries), (3) detection methods (control charts + anomaly scoring) and (4) explainability/tooling for auditors. I’d propose a p-chart for daily exception rate with EWMA for sensitivity to small shifts, plus an unsupervised score (isolation forest) on entry attributes to rank high-risk entries for review.
A tradeoff to call out: optimizing sensitivity (catch all risky entries) increases reviewer workload, so quantify reviewer-hours per 100 alerts and pick thresholds to keep expected weekly reviews feasible.
Close by stating next steps: implement a 90-day pilot, collect feedback and labeled outcomes to build a supervised classifier and compute ROC/precision@k; provide reproducible SQL and notebook for audit trail.
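The p-chart and EWMA pieces of this answer can be sketched as follows. The baseline rate `p_bar` is assumed to come from a clean reference window, and the EWMA function returns only the smoothed series; its control limits are left out to keep the example short.

```python
import math


def p_chart_flags(daily, p_bar, k=3.0):
    """daily: list of (exceptions, n) per day, n > 0; returns indexes of days outside p_bar +/- k*sigma."""
    flagged = []
    for day, (exceptions, n) in enumerate(daily):
        sigma = math.sqrt(p_bar * (1 - p_bar) / n)  # limits widen on low-volume days
        upper = min(1.0, p_bar + k * sigma)
        lower = max(0.0, p_bar - k * sigma)
        rate = exceptions / n
        if rate > upper or rate < lower:
            flagged.append(day)
    return flagged


def ewma(rates, lam=0.2, start=0.0):
    """Exponentially weighted moving average: z_t = lam * x_t + (1 - lam) * z_{t-1}."""
    smoothed, z = [], start
    for x in rates:
        z = lam * x + (1 - lam) * z
        smoothed.append(z)
    return smoothed
```

The p-chart catches a single bad day, while the EWMA series drifts upward under a small sustained shift that never crosses the daily limit.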
A second angle — "Detect segregation-of-duties (SoD) violations across user-role assignments"
Same statistical principles apply but different data shape and constraints. Frame as a graph problem: build a bipartite user-role matrix and derive role-pair co-occurrence frequencies; test unusual role-pair assignments using chi-square or z-scores after controlling for role prevalence. Use anomaly scores to prioritize investigations and produce human-readable evidence (which transactions, timestamps, approving user). Constraints like low label counts push you toward unsupervised ranking and rule-based thresholds; emphasize explainability (show the path enabling the violation) over opaque model scores.
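One way to score role pairs after controlling for role prevalence is to compare each pair's observed co-assignment count with what independence would predict. The sketch below uses a binomial z-score and a hypothetical `user_roles` mapping; with small user counts the scores are rough and serve only to order the review queue.

```python
import math
from itertools import combinations


def role_pair_scores(user_roles):
    """user_roles: dict of user -> set of roles. Returns co-assigned pairs, rarest-than-expected first."""
    n = len(user_roles)
    role_counts, pair_users = {}, {}
    for user, roles in user_roles.items():
        for role in roles:
            role_counts[role] = role_counts.get(role, 0) + 1
        for pair in combinations(sorted(roles), 2):
            pair_users.setdefault(pair, []).append(user)
    results = []
    for pair, users in pair_users.items():
        # probability a user holds both roles if assignments were independent
        p_joint = (role_counts[pair[0]] / n) * (role_counts[pair[1]] / n)
        expected = n * p_joint
        variance = n * p_joint * (1 - p_joint)
        z = (len(users) - expected) / math.sqrt(variance) if variance > 0 else 0.0
        results.append({
            "pair": pair,
            "observed": len(users),
            "expected": expected,
            "z": z,
            "users": sorted(users),  # evidence for the reviewer
        })
    return sorted(results, key=lambda r: r["z"])
```

A strongly negative z marks a pair that is held together far less often than the popularity of each role would suggest, so the few users who do hold both are natural candidates for review alongside any pairs policy already lists as incompatible.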
Common pitfalls
Pitfall: Normalizing by raw counts — Monitoring raw exception counts without adjusting for transaction volume or seasonality will produce misleading alerts; always use rates and adjust for business-cycle effects.
Pitfall: Overclaiming causality — Reporting a correlated spike as a control failure without investigating upstream process changes or deployments will erode auditor trust; present suspicion with supporting evidence, not certainty.
Pitfall: Black-box models without audit trail — Delivering a complex ML model that flags transactions but cannot show feature contributions and deterministic SQL to reproduce results will fail auditability requirements.
Connections
Interviewers may pivot to fraud detection techniques (time-series anomaly detection, graph-based fraud rings), model risk management (validation and documentation), or to practical sampling questions (statistical auditing sampling vs. monetary-unit sampling).
Further reading
-
Benjamini & Hochberg (1995), "Controlling the false discovery rate": foundational method for multiple-testing adjustments when monitoring many controls.
-
D.C. Montgomery, "Introduction to Statistical Quality Control": practical coverage of control charts (CUSUM, EWMA) and process-shift detection methods.
Practice questions