Design system to detect privacy-leak records

This is an ML System Design interview question from ByteDance for Software Engineer roles.

Question

You are given a very large database that contains user data (both structured fields and unstructured text such as logs, messages, and documents). The company wants to automatically:

Identify records that may contain privacy-sensitive data or PII (personally identifiable information), such as names, phone numbers, and email addresses, as well as more subtle leaks (e.g., combinations of attributes that uniquely identify a person).
Classify these records by type and severity of privacy risk.
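
To make the two tasks above concrete, here is a minimal sketch of one possible rule-based first pass. Regular expressions tag obvious identifiers with a type and a severity, and a small helper flags attribute combinations shared by fewer than k rows (a k-anonymity-style check for the "subtle leak" case). The patterns, severity labels, field names, and the names `PATTERNS`, `Finding`, `scan_text`, and `rare_combinations` are assumptions made for illustration, not part of the question; a real design would likely layer ML or LLM-based detection on top.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Hypothetical patterns and severity labels, chosen only for illustration.
# Production rules would need locale-aware formats and validation.
PATTERNS = {
    "email": (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "medium"),
    "phone": (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "medium"),
}


@dataclass(frozen=True)
class Finding:
    pii_type: str
    severity: str
    span: tuple  # (start, end) offsets; the matched value itself is not stored


def scan_text(text):
    """Return a Finding for every pattern match in a piece of unstructured text."""
    findings = []
    for pii_type, (pattern, severity) in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append(Finding(pii_type, severity, match.span()))
    return findings


def rare_combinations(rows, quasi_identifiers, k=2):
    """Return quasi-identifier value combinations shared by fewer than k rows."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return {combo for combo, n in counts.items() if n < k}
```

For example, `scan_text("Contact alice@example.com or +1 415-555-0100")` yields one `email` finding and one `phone` finding. Recording only offsets rather than the matched value is a deliberate choice here, so that the detector's own output does not become a second copy of the sensitive data.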

You may use traditional ML, deep learning, and LLM-based approaches (e.g., retrieval-augmented generation, RAG).

Design an end-to-end system that solves this problem. In your design, describe:

Functional and non-functional requirements.
High-level architecture and main components.
How you detect and classify privacy leaks (including any rule-based, ML, and LLM/RAG parts).
How the system scales to large datasets.
How you evaluate quality (precision/recall) and build a feedback loop.
Any privacy or security concerns in the detection pipeline itself.
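
The evaluation item in the list above can be grounded with a small sketch of how precision and recall might be computed against a human-labeled sample of records. The function name `precision_recall` and the use of record-ID sets are assumptions for illustration; the question does not prescribe an evaluation procedure.

```python
def precision_recall(predicted, labeled):
    """Compute precision and recall from two sets of record IDs.

    predicted: IDs the detector flagged as containing PII.
    labeled:   IDs that human reviewers marked as containing PII.
    """
    true_positives = len(predicted & labeled)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(labeled) if labeled else 0.0
    return precision, recall
```

With `predicted = {1, 2, 3, 4}` and `labeled = {2, 3, 5}`, two of the four flagged records are correct and two of the three true leaks are found, so precision is 0.5 and recall is 2/3. Reviewer corrections to the labeled set are one natural input to the feedback loop the question asks about.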

Assume the database could have billions of rows, with multiple data sources and schemas.
