Design system to detect privacy-leak records

Q: Design system to detect privacy-leak records

This question evaluates a candidate's ability to design scalable, ML-driven systems that detect and classify privacy-sensitive records, including those containing PII, across structured and unstructured data. It tests competencies in data engineering, machine learning (including deep learning and LLM/RAG integrations), system architecture, and privacy and security. Interviewers commonly use it to probe reasoning about functional and non-functional requirements, trade-offs among detection and classification approaches, evaluation metrics such as precision and recall, feedback loops, and operational scaling. It falls under ML system design, with a practical, application-level focus rather than a purely conceptual one.

Q: How do I approach ML System Design interview questions?

ML System Design questions require both an understanding of core concepts and practice. PracHub provides solutions with explanations to help you master ML system design interviews.

Question

You are given a very large database that contains user data (both structured fields and unstructured text such as logs, messages, and documents). The company wants to automatically:

Identify records that may contain privacy-sensitive data or PII (personally identifiable information), such as names, phone numbers, email addresses, or more subtle leaks (e.g., combinations of attributes that uniquely identify a person).
Classify these records by type and severity of privacy risk.

You may use traditional ML, deep learning, and LLM-based approaches (e.g., retrieval-augmented generation, RAG).

Design an end-to-end system that solves this problem. In your design, describe:

Functional and non-functional requirements.
High-level architecture and main components.
How you detect and classify privacy leaks (including any rule-based, ML, and LLM/RAG parts).
How the system scales to large datasets.
How you evaluate quality (precision/recall) and build a feedback loop.
Any privacy or security concerns in the detection pipeline itself.

Assume the database could have billions of rows, with multiple data sources and schemas.
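
To make the rule-based and evaluation parts of the question concrete, here is a minimal sketch of what a first-pass detector and a precision/recall check might look like. It is not a reference solution: the regex patterns (US-style phone and SSN formats), the severity labels, and the names `PATTERNS`, `SEVERITY`, `scan`, and `precision_recall` are all assumptions made for illustration. A real system would layer ML or LLM-based classifiers on top of rules like these and would run them in a distributed job rather than record by record.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns and severity labels, chosen only for illustration.
# The phone and SSN patterns assume US-style formatting.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
SEVERITY = {"email": "medium", "phone": "medium", "ssn": "high"}
RANK = {"none": 0, "medium": 1, "high": 2}


@dataclass
class Finding:
    pii_types: list  # names of the patterns that matched, sorted
    severity: str    # highest severity among the matched types


def scan(text: str) -> Finding:
    """Flag a text record and classify it by PII type and severity."""
    types = sorted(name for name, pat in PATTERNS.items() if pat.search(text))
    # The record's severity is the worst severity of any matched type.
    severity = max((SEVERITY[t] for t in types), key=RANK.get, default="none")
    return Finding(types, severity)


def precision_recall(predicted, actual):
    """Compute precision and recall from parallel lists of booleans.

    predicted[i] is the detector's flag for record i; actual[i] is the
    human-labeled ground truth for the same record.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    records = [
        "contact alice@example.com for access",
        "SSN 123-45-6789, call 555-123-4567",
        "nightly batch job finished without errors",
    ]
    for record in records:
        print(scan(record))
```

In a feedback loop, the `actual` labels would come from human review of sampled records, and the resulting precision and recall would guide changes to the rules or retraining of downstream models.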
