You are given a very large database that contains user data (both structured fields and unstructured text such as logs, messages, and documents). The company wants to automatically:
- Identify records that may contain privacy-sensitive data or PII (personally identifiable information), such as names, phone numbers, email addresses, or more subtle leaks (e.g., combinations of attributes that uniquely identify a person).
- Classify these records by type and severity of privacy risk.
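To make the "subtle leak" case concrete, here is a minimal sketch of how a combination of attributes can single out one person even when no individual field is PII. The toy records, the field names (`zip`, `birth_year`, `gender`), and the threshold `k` are illustrative assumptions, not part of the problem statement.

```python
from collections import Counter

# Hypothetical toy records; the field names are illustrative assumptions.
records = [
    {"zip": "02139", "birth_year": 1985, "gender": "F"},
    {"zip": "02139", "birth_year": 1985, "gender": "F"},
    {"zip": "02139", "birth_year": 1990, "gender": "M"},
    {"zip": "94105", "birth_year": 1985, "gender": "F"},
]


def rare_combinations(rows, quasi_identifiers, k=2):
    """Return indices of rows whose quasi-identifier combination
    is shared by fewer than k rows in the dataset."""
    def key(row):
        return tuple(row[field] for field in quasi_identifiers)

    counts = Counter(key(row) for row in rows)
    return [i for i, row in enumerate(rows) if counts[key(row)] < k]


# Rows 2 and 3 each have a combination that no other row shares.
risky = rare_combinations(records, ["zip", "birth_year", "gender"])
```

None of the three fields names a person, yet the last two rows are each the only record with their combination of values, so anyone who knows those three facts about a person can find that person's record. At billions of rows the same counting would have to run as a distributed group-by rather than in memory.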
You may use traditional ML, deep learning, and LLM-based approaches (e.g., retrieval-augmented generation, RAG).
Design an end-to-end system that solves this problem. In your design, describe:
- Functional and non-functional requirements.
- High-level architecture and main components.
- How you detect and classify privacy leaks (including any rule-based, ML, and LLM/RAG parts).
- How the system scales to large datasets.
- How you evaluate quality (precision/recall) and build a feedback loop.
- Any privacy or security concerns in the detection pipeline itself.
Assume the database could have billions of rows, with multiple data sources and schemas.
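As a starting point for the rule-based part and the precision/recall evaluation mentioned above, the following is a minimal sketch rather than a complete design. The regular expressions are deliberately simple, the severity labels are hypothetical placeholders, and the `Finding` type and function names are assumptions introduced for illustration. Findings carry only offsets, not the matched value, so the detector's output does not itself copy PII.

```python
import re
from dataclasses import dataclass

# Simplistic illustrative patterns; real systems need locale-aware rules.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

# Hypothetical severity mapping; actual levels depend on company policy.
SEVERITY = {"email": "medium", "phone": "medium"}


@dataclass(frozen=True)
class Finding:
    pii_type: str
    severity: str
    start: int  # character offsets only; the raw value is not stored
    end: int


def detect(text):
    """Run every pattern over the text and return typed findings."""
    findings = []
    for pii_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append(
                Finding(pii_type, SEVERITY[pii_type], match.start(), match.end())
            )
    return findings


def precision_recall(predicted, actual):
    """Compute precision and recall over sets of flagged record IDs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


sample = "Contact jane@example.com or call +1 415-555-0100."
findings = detect(sample)

# Record IDs flagged by the detector vs. IDs labeled by human reviewers.
precision, recall = precision_recall({1, 2}, {2, 3, 4})
```

Rules like these are cheap enough to run over every row, which is why a design might use them as a first-pass filter and reserve ML or LLM-based classifiers for the ambiguous records. The labeled set used in `precision_recall` would come from sampled human review, which also feeds the feedback loop.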