Privacy-Preserving Analytics And Governance
Asked of: Data Scientist

What's being tested
Understanding of technical privacy primitives, tradeoffs between utility and privacy, and practical governance controls for safe analytics. Interviewers expect clear threat-modeling, mechanism selection (e.g., DP vs. secure computation), and budget/accounting considerations.
Core knowledge
- Differential privacy: epsilon/delta, central (global) vs. local DP, Laplace and Gaussian mechanisms, composition theorems.
- DP-SGD (per-example gradient clipping plus Gaussian noise) for model training; TensorFlow Privacy and Opacus are the common tools.
- Secure computation: private set intersection/union (PSI/PSU), MPC (secret sharing, SPDZ), and homomorphic encryption (e.g., Paillier, which is additively homomorphic), with their tradeoffs in latency and scalability.
- De-identification limits: k-anonymity, l-diversity, and why re-identification risk remains.
- Privacy budget management: accounting, per-user vs. per-query budgets, and amplification by subsampling.
- Governance controls: RBAC/ABAC, logging, data lineage, retention, privacy reviews, automated auditing.
- Practical heuristics: suppress small cells, clamp/clip per-user contributions, and test utility vs. epsilon.
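The clipping heuristic and the Laplace mechanism from the list above can be sketched together. This is a minimal illustration using only the standard library, assuming each user contributes exactly one value and that neighboring datasets differ by adding or removing one user; the function names (`laplace_noise`, `clip`, `dp_sum`) are hypothetical. Naive floating-point Laplace sampling like this has known side channels, so a vetted library is the safer choice outside of study purposes.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def clip(value: float, lower: float, upper: float) -> float:
    """Clamp one user's contribution into [lower, upper]."""
    return max(lower, min(upper, value))


def dp_sum(per_user_values, lower, upper, epsilon, rng):
    """Epsilon-DP sum of one clipped value per user (illustrative sketch)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    clipped_total = sum(clip(v, lower, upper) for v in per_user_values)
    # Adding or removing one user changes the clipped sum by at most this much.
    sensitivity = max(abs(lower), abs(upper))
    return clipped_total + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(42)
    spend = [10.0, 250.0, 40.0]  # the 250.0 outlier is clipped to 100.0
    print(dp_sum(spend, lower=0.0, upper=100.0, epsilon=1.0, rng=rng))
```

Tighter clipping bounds reduce the noise scale but bias the sum, which is the utility vs. epsilon tradeoff the last bullet asks you to test.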
Worked example — "Design a privacy-preserving pipeline to compute daily active users with differential privacy"
First clarify requirements: the unit of privacy (user vs. event), the required fidelity (per-region vs. per-product), and the query frequency (daily vs. streaming). Identify the threat model: internal analyst access, an external attacker, and regulatory constraints. Choose a mechanism: for daily counts, use central DP with per-user contribution clipping (count each user at most once per cell and cap the number of cells a user can affect), the Gaussian mechanism, and an epsilon budget per day, accounting for composition over days. Describe the accounting: pick an overall (epsilon, delta) budget, apply advanced composition or a moments accountant, and consider amplification from subsampling where the pipeline samples users. Finally, explain governance: RBAC for raw data, logging of every DP release, automated delta/epsilon tracking, and procedures for re-running experiments without exceeding the budget.
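The mechanism step of this answer could look roughly like the sketch below. It assumes user-level privacy under add/remove-one-user neighboring, a fixed public list of regions (so the set of released keys does not depend on the data), and the classic Gaussian calibration sigma = sqrt(2 ln(1.25/delta)) * sensitivity / epsilon, which is only valid for 0 < epsilon < 1. The names `gaussian_sigma` and `dp_daily_active_users` are hypothetical, and this is not a substitute for a vetted DP library.

```python
import math
import random
from collections import defaultdict


def gaussian_sigma(epsilon: float, delta: float, l2_sensitivity: float) -> float:
    """Classic Gaussian-mechanism noise scale (assumes 0 < epsilon < 1)."""
    if not 0 < epsilon < 1:
        raise ValueError("classic calibration assumes 0 < epsilon < 1")
    if not 0 < delta < 1:
        raise ValueError("delta must be in (0, 1)")
    return math.sqrt(2 * math.log(1.25 / delta)) * l2_sensitivity / epsilon


def dp_daily_active_users(events, regions, epsilon, delta,
                          max_regions_per_user, rng):
    """Noisy per-region DAU for one day from (user_id, region) events."""
    # Deduplicate: a user counts at most once per region per day.
    user_regions = defaultdict(set)
    for user_id, region in events:
        user_regions[user_id].add(region)

    counts = {region: 0 for region in regions}
    for seen in user_regions.values():
        # Contribution bounding: keep at most k regions per user.
        for region in sorted(seen)[:max_regions_per_user]:
            if region in counts:
                counts[region] += 1

    # One user changes at most k counts by 1 each, so L2 sensitivity is sqrt(k).
    sigma = gaussian_sigma(epsilon, delta, math.sqrt(max_regions_per_user))
    # Noise is added to every public region, including those with zero users.
    return {region: count + rng.gauss(0.0, sigma)
            for region, count in counts.items()}


if __name__ == "__main__":
    day = [("u1", "EU"), ("u1", "EU"), ("u2", "US"), ("u3", "EU")]
    print(dp_daily_active_users(day, ["APAC", "EU", "US"], epsilon=0.5,
                                delta=1e-6, max_regions_per_user=2,
                                rng=random.Random(7)))
```

Each call spends one day's (epsilon, delta); the overall budget across days still has to be tracked by a separate accountant.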
A common pitfall
The tempting but wrong approach is to treat simple de-identification (removing PII) as sufficient. Small counts are a second leak: releasing raw small counts exposes individuals, and suppressing cells based on the raw (un-noised) count still leaks membership, because whether a cell appears can depend on a single person's presence. Another trap is choosing an arbitrary epsilon without accounting for composition over repeated queries, which produces a false sense of safety. The opposite error is defaulting to local DP where central DP would suffice, which costs far more utility than necessary.
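To see why composition matters, the sketch below compares two standard bounds on the total privacy loss after k releases, each (epsilon, delta)-DP: basic composition (the parameters add up) and the advanced composition theorem from Dwork & Roth. The function names and the slack parameter name `delta_slack` are hypothetical; production accountants (moments/RDP) generally give tighter numbers than either bound.

```python
import math


def basic_composition(epsilon: float, delta: float, k: int):
    """Total (epsilon, delta) after k releases under basic composition."""
    return k * epsilon, k * delta


def advanced_composition(epsilon: float, delta: float, k: int,
                         delta_slack: float):
    """Total (epsilon, delta) after k releases under advanced composition.

    delta_slack is the extra failure probability traded for a smaller epsilon.
    """
    eps_total = (epsilon * math.sqrt(2 * k * math.log(1 / delta_slack))
                 + k * epsilon * (math.exp(epsilon) - 1))
    return eps_total, k * delta + delta_slack


if __name__ == "__main__":
    # A year of daily releases at a seemingly small per-day epsilon of 0.1.
    print(basic_composition(0.1, 1e-8, 365))
    print(advanced_composition(0.1, 1e-8, 365, delta_slack=1e-6))
```

With these illustrative inputs, a per-day epsilon of 0.1 composes to 36.5 under the basic bound and roughly 14 under the advanced bound, so a "small" daily epsilon does not imply a small annual one.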
Further reading
- Cynthia Dwork & Aaron Roth, "The Algorithmic Foundations of Differential Privacy" (book).
- Google Differential Privacy Library docs and OpenDP project for practical implementations.