Healthcare Privacy And PHI Compliance

What's being tested

Interviewers are probing your ability to design and evaluate analyses and models that respect patient privacy and regulatory constraints while still delivering valid, actionable results. Expect questions about tradeoffs between utility and risk (e.g., model accuracy vs. re-identification risk), how you reduce exposure of protected health information (PHI) during feature engineering and outputs, and how you reason about threat models for data leakage. CVS Health cares because analytics and models must support clinical and business decisions without creating legal, ethical, or reputational risk.

Core knowledge

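PHI: Know the HIPAA definition: individually identifiable health information held or transmitted by a covered entity or business associate. Health data becomes PHI when it is tied to identifiers (e.g., names, MRNs, addresses, dates more specific than the year) that, alone or in combination, can identify an individual; any dataset containing them is regulated and requires safeguards.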
HIPAA minimum necessary: Principle to limit data to the smallest set needed for the objective; document justification and alternatives whenever you request broader fields.
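De-identification techniques: Understand Safe Harbor (remove all 18 HIPAA-listed identifier types, with no actual knowledge that the remaining data could identify a person) vs. Expert Determination (a qualified expert's statistical assessment that re-identification risk is very small), and their different guarantees and operational costs.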
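Quasi-identifiers & linkage risk: Fields that look harmless alone can be nearly unique in combination (ZIP + date of birth + sex) and enable linkage attacks against outside datasets; mitigate via generalization (bin ages, truncate ZIP codes), suppression (drop small-count cells), or keyed/salted hashing of direct identifiers. Each mitigation costs model signal, and hashing an MRN does nothing for quasi-identifiers left in the clear.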
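Statistical privacy models: Differential privacy provides formal bounds: a mechanism $M$ is (ε,δ)-DP if, for all datasets $D, D'$ differing in one record and all sets of outputs $S$, $\Pr[M(D) \in S] \le e^{\epsilon} \Pr[M(D') \in S] + \delta$. Smaller ε means stronger privacy and, typically, lower utility.
Privacy budget & aggregation: Each DP query consumes privacy budget, and under basic sequential composition the ε values of successive queries add up; plan budgets across training, validation, and repeated reporting. For count queries, add Laplace noise with scale $b = \Delta f/\epsilon$, where $\Delta f$ is the query's sensitivity ($\Delta f = 1$ for a count in which each patient contributes at most one record).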
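Membership and model inversion attacks: Know the threat: an adversary infers whether an individual was in the training set (membership inference) or reconstructs sensitive attributes from model outputs (model inversion). Defenses include DP training, output clipping, and restricting or coarsening confidence scores; temperature scaling softens scores but carries no formal guarantee, so present it as a heuristic.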
Feature engineering tradeoffs: Prefer aggregated features (counts, flags over windows) to raw identifiers; quantify the utility loss from aggregation by comparing AUC/precision before and after on the same held-out validation set, itself stripped of direct identifiers.
Synthetic data & utility testing: Synthetic EHR data can help with prototyping; validate it by measuring distributional similarity (e.g., per-feature KS tests) and downstream model performance. Synthetic data is not a drop-in privacy guarantee.
Model outputs & downstream sharing: Avoid releasing per-patient model scores with high granularity; consider cohort-level insights, top-k anonymized lists, or DP-noised summaries. Log and justify every external data product.
Access & governance responsibilities: As a Data Scientist you must request appropriate data, attach documented justification, follow IRB/legal approvals, and work with compliance on Data Use Agreements; you are not responsible for building access-control systems, but must follow them.
Monitoring for leakage: Monitor for unexpected distributional shifts and unusually high performance on small cohorts (a sign of overfitting to identifiers), and audit model explanations (SHAP/feature importances) for reliance on quasi-identifiers.
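
To make the Laplace mechanism and the idea of a privacy budget concrete, here is a minimal sketch using only the Python standard library. The names `PrivacyBudget`, `laplace_noise`, and `dp_count` are hypothetical, the budget uses basic sequential composition (ε values simply add), and sensitivity is assumed to be 1 (each patient contributes at most one record). It is an illustration for interview discussion; a real release would rely on a vetted DP library rather than hand-rolled noise.

```python
import random


class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon: float) -> None:
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon: float) -> None:
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Refuse the query before any noise is drawn if it would overspend.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(true_count: int, epsilon: float, budget: PrivacyBudget,
             rng: random.Random, sensitivity: float = 1.0) -> float:
    """Return a noisy count with Laplace scale = sensitivity / epsilon."""
    budget.spend(epsilon)
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Example: four weekly reports at epsilon = 0.25 each use up a budget of 1.0.
rng = random.Random(0)  # seeded only so the sketch is reproducible
budget = PrivacyBudget(total_epsilon=1.0)
noisy_weekly_count = dp_count(120, epsilon=0.25, budget=budget, rng=rng)
```

The point to make out loud: halving ε doubles the noise scale, and every repeated report draws from the same finite budget.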

Worked example — "Design a predictive model using PHI-containing EHR data"

First 30s: clarify the business goal, the finest patient-level granularity actually needed, and whether individual predictions or cohort-level risk scores are required; ask about legal constraints (HIPAA, IRB, data use agreements).

Skeleton: (1) define the minimum necessary features, (2) design safe feature transforms (age bins, aggregated counts), (3) select an evaluation strategy that tests the privacy-utility tradeoff (a holdout without PHI linkage), (4) choose mitigations for leakage (DP-SGD or output clipping), (5) set the deployment and output policy (who sees scores, how access is logged).

Tradeoff: explicitly quantify how much predictive power you expect to lose from generalizing a key high-signal feature (e.g., exact date of birth → age bucket), and propose an experiment: train the same model on the de-identified features and on a synthetic or otherwise authorized dataset that keeps the exact feature, then compare the two to estimate the gap.

Close: with more time, you would run a formal re-identification risk assessment (expert determination), tune the DP epsilon using privacy-utility curves, and coordinate with compliance for sign-offs.
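
The safe feature transforms in step (2) can be sketched in a few lines. The function names, the 10-year bucket width, and the 365-day lookback are illustrative assumptions, not prescribed values; the 90+ top-code mirrors the Safe Harbor treatment of ages over 89.

```python
from datetime import date, timedelta


def age_bucket(dob: date, as_of: date, width: int = 10, top_code: int = 90) -> str:
    """Replace an exact date of birth with a coarse age band."""
    # Subtract one year if the birthday has not yet occurred in the as_of year.
    age = as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))
    if age >= top_code:
        return f"{top_code}+"
    low = (age // width) * width
    high = min(low + width - 1, top_code - 1)
    return f"{low}-{high}"


def encounters_in_window(encounter_dates, as_of: date, days: int = 365) -> int:
    """Count encounters in the trailing window instead of keeping raw dates."""
    start = as_of - timedelta(days=days)
    return sum(1 for d in encounter_dates if start < d <= as_of)


# Example: the model sees ("40-49", 2) rather than a DOB and a list of visit dates.
features = (
    age_bucket(date(1980, 6, 15), as_of=date(2024, 3, 1)),
    encounters_in_window(
        [date(2024, 1, 1), date(2023, 6, 1), date(2022, 1, 1)],
        as_of=date(2024, 3, 1),
    ),
)
```

Running the same model with the exact feature and with the bucketed one, on the same holdout, gives the utility gap the tradeoff discussion asks for.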

A second angle — "Running an A/B test that touches PHI (care pathway change)"

Frame: focus on metric design that preserves privacy and obeys the minimum necessary principle. Instead of raw patient-level outcomes, pre-aggregate primary endpoints at the cohort level (counts, rates), and add DP noise if outputs leave the secure environment. Choose the randomization unit (patient vs. clinic) so that sparse cells do not reveal an individual's allocation.

Analysis: pre-register metrics, specify suppression thresholds for small n (e.g., suppress any cell with fewer than 10 patients), and run sensitivity checks for imbalance introduced by data masking.

Operational constraint to flag: every interim look at DP-noised results consumes privacy budget, so prefer group-level monitoring with fewer looks, or combine an alpha-spending framework with DP accounting.
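
A minimal sketch of cohort-level aggregation with small-cell suppression follows. The function name, the `(arm, cohort, outcome)` row layout, and the default threshold of 10 are assumptions for illustration. It suppresses on cell size only; stricter policies that also suppress small event counts, or apply complementary suppression, are not shown.

```python
from collections import defaultdict


def aggregate_with_suppression(rows, min_cell: int = 10):
    """Aggregate binary outcomes per (arm, cohort) and hide small cells.

    rows: iterable of (arm, cohort, outcome) tuples with outcome in {0, 1}.
    Returns a dict mapping (arm, cohort) to summary stats, or None if suppressed.
    """
    counts = defaultdict(lambda: [0, 0])  # (arm, cohort) -> [n, events]
    for arm, cohort, outcome in rows:
        cell = counts[(arm, cohort)]
        cell[0] += 1
        cell[1] += outcome

    report = {}
    for key, (n, events) in sorted(counts.items()):
        if n < min_cell:
            report[key] = None  # suppressed: too few patients to report safely
        else:
            report[key] = {"n": n, "events": events, "rate": events / n}
    return report


# Example: the control cell has 12 patients and is reported;
# the treatment cell has only 3 and is suppressed.
rows = (
    [("control", "65+", 1)] * 4
    + [("control", "65+", 0)] * 8
    + [("treatment", "65+", 1)] * 3
)
report = aggregate_with_suppression(rows)
```

Only the aggregated report, never the patient-level rows, would leave the secure environment.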

Common pitfalls

Pitfall: Over-applying de-identification rules, e.g., removing all timestamps wholesale. Temporal granularity is often key to clinical prediction; instead, coarsen to week/month windows, measure the utility loss, and document why the coarser bins still meet the objective.
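
Coarsening rather than dropping timestamps might look like the sketch below. The helper names are hypothetical. Note the assumption that this runs inside an authorized environment: week- or month-level dates are still more specific than the year, so they would not by themselves satisfy Safe Harbor.

```python
from datetime import date, timedelta


def coarsen_to_week(d: date) -> date:
    """Map a date to the Monday of its week."""
    return d - timedelta(days=d.weekday())


def coarsen_to_month(d: date) -> date:
    """Map a date to the first day of its month."""
    return d.replace(day=1)


def weeks_since_index(event: date, index: date) -> int:
    """Relative timing in whole weeks from an index event (e.g., admission).

    Floor division means events before the index date get negative weeks.
    """
    return (event - index).days // 7


# Example: a lab drawn on 2024-03-14, relative to an admission on 2024-03-01.
lab_date = date(2024, 3, 14)
week_bin = coarsen_to_week(lab_date)
month_bin = coarsen_to_month(lab_date)
relative_week = weeks_since_index(lab_date, index=date(2024, 3, 1))
```

Relative offsets such as `weeks_since_index` often preserve most of the clinical signal (time since admission, time between refills) while removing the calendar date entirely.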

Pitfall: Treating synthetic data as equivalent to de-identification. A synthetic data generator can memorize and leak real records when it is trained on small cohorts or unique cases; always validate utility and run re-identification stress tests before relying on synthetic data for production decisions.
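
One crude re-identification stress test is to check how many synthetic rows exactly reproduce a real patient who is unique on the quasi-identifiers. The function name and column names below are hypothetical, and this is only a first screen: a rate of zero does not show the data is safe, and a non-zero rate is a signal to investigate rather than proof of a leak.

```python
from collections import Counter


def exact_match_leak_rate(real_rows, synthetic_rows, quasi_ids):
    """Fraction of synthetic rows matching a real row that is unique on quasi_ids.

    real_rows, synthetic_rows: lists of dicts keyed by column name.
    """
    def key(row):
        return tuple(row[q] for q in quasi_ids)

    real_counts = Counter(key(r) for r in real_rows)
    unique_real = {k for k, c in real_counts.items() if c == 1}
    if not synthetic_rows:
        return 0.0
    hits = sum(1 for s in synthetic_rows if key(s) in unique_real)
    return hits / len(synthetic_rows)


# Example: one of four synthetic rows copies the only real 90+ male in ZIP3 945.
quasi_ids = ["zip3", "age_band", "sex"]
real = [
    {"zip3": "021", "age_band": "40-49", "sex": "F"},
    {"zip3": "021", "age_band": "40-49", "sex": "F"},
    {"zip3": "945", "age_band": "90+", "sex": "M"},
]
synthetic = [
    {"zip3": "945", "age_band": "90+", "sex": "M"},
    {"zip3": "021", "age_band": "40-49", "sex": "F"},
    {"zip3": "100", "age_band": "20-29", "sex": "F"},
    {"zip3": "100", "age_band": "20-29", "sex": "M"},
]
leak_rate = exact_match_leak_rate(real, synthetic, quasi_ids)
```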

Pitfall: Communicating technical fixes as governance. Saying "we used hashing" without addressing salt management, or "we trained with DP" without reporting ε, δ, and the impact on utility undermines trust; give concrete parameters and rationale.
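
To show what "addressing salt management" can mean in practice, the sketch below swaps a plain salted hash for a keyed hash (HMAC-SHA256) from the Python standard library. The function name `pseudonymize`, the normalization rule, and the 32-byte minimum key length are assumptions; the key is generated inline only for the demo and would normally live in a secrets manager outside the analytics environment. The output is still a stable pseudonym for one patient, so it should not be presented as de-identification on its own.

```python
import hashlib
import hmac
import secrets


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from an identifier using HMAC-SHA256.

    Without the key, an attacker cannot rebuild the mapping by hashing
    every possible MRN, which is the weakness of an unkeyed hash.
    """
    if len(secret_key) < 32:
        raise ValueError("use a key of at least 32 random bytes")
    # Normalize so trivially different spellings map to the same pseudonym.
    normalized = identifier.strip().upper().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


# Example (demo key only; do not generate or store keys next to the data).
demo_key = secrets.token_bytes(32)
token_a = pseudonymize("mrn-001 ", demo_key)
token_b = pseudonymize("MRN-001", demo_key)
same_patient = hmac.compare_digest(token_a, token_b)
```

In an interview, the parameters worth stating are who holds the key, how it is rotated, and whether pseudonyms must stay joinable across datasets.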

Connections

Interviewers may pivot to experiment design under privacy constraints (sequential testing with DP), model monitoring for fairness and leakage, or to operational roles (Data Engineering) around secure enclaves and audit logging — be ready to explain interfaces and responsibilities, not low-level infra.