Design a system to detect privacy-leak records
Company: TikTok
Role: Software Engineer
Category: ML System Design
Difficulty: medium
Interview Round: Technical Screen
You are given a very large database that contains user data (both structured fields and unstructured text such as logs, messages, and documents). The company wants to automatically:
1. **Identify records that may contain privacy-sensitive data or PII (personally identifiable information)**, such as names, phone numbers, email addresses, or more subtle leaks (e.g., combinations of attributes that uniquely identify a person).
2. **Classify** these records by type and severity of privacy risk.
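To make the two goals above concrete, here is a minimal sketch of what a rule-based first pass might look like. It is only one possible starting point, not a full answer: the regular expressions, the `SEVERITY` labels, and the function names `detect_pii` and `rare_combinations` are all assumptions introduced for illustration. The second function approximates the "combinations of attributes" case by counting how many records share each combination, in the spirit of a k-anonymity check.

```python
import re
from collections import Counter

# Hypothetical patterns and severity labels, chosen for illustration only.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}
SEVERITY = {"email": "medium", "phone": "medium"}


def detect_pii(text):
    """Return (type, severity, matched text) for each regex hit in text."""
    findings = []
    for pii_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((pii_type, SEVERITY[pii_type], match.group()))
    return findings


def rare_combinations(records, keys, k=2):
    """Return attribute combinations shared by fewer than k records."""
    counts = Counter(tuple(record[key] for key in keys) for record in records)
    return [combo for combo, n in counts.items() if n < k]
```

A real design would likely treat these rules as a cheap, high-precision layer and route ambiguous text (names, free-form context) to an NER model or an LLM-based classifier.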
You may use traditional ML, deep learning, and LLM-based approaches (e.g., retrieval-augmented generation, RAG).
Design an end-to-end system that solves this problem. In your design, describe:
- Functional and non-functional requirements.
- High-level architecture and main components.
- How you detect and classify privacy leaks (including any rule-based, ML, and LLM/RAG parts).
- How the system scales to large datasets.
- How you evaluate quality (precision/recall) and build a feedback loop.
- Any privacy or security concerns in the detection pipeline itself.
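For the evaluation bullet above, precision and recall can be computed from a human-labeled sample. The sketch below assumes that both the detector's flags and the reviewers' labels are available as sets of record IDs; the function name `precision_recall` is hypothetical.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall from two sets of record IDs.

    predicted: IDs the detector flagged as sensitive.
    actual: IDs that reviewers labeled as truly sensitive.
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

For example, if the detector flags records `{1, 2, 3, 4}` and reviewers mark `{2, 3, 5}` as sensitive, precision is 2/4 and recall is 2/3. Reviewer corrections collected this way could also serve as new training labels for the feedback loop.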
Assume the database could have billions of rows, with multiple data sources and schemas.
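At this scale, one common pattern a candidate might propose is a cascade: a cheap filter runs over every row, and only the surviving candidates reach an expensive model. The sketch below illustrates that idea in a single process, assuming rows arrive as an iterable; `scan`, `cheap_filter`, and `expensive_classifier` are hypothetical names, and a production system would distribute the batches across workers rather than loop locally.

```python
from itertools import islice


def batched(rows, size):
    """Yield successive lists of up to `size` rows from any iterable."""
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch


def scan(rows, cheap_filter, expensive_classifier, batch_size=1000):
    """Yield (row, label) for rows that pass the cheap filter.

    cheap_filter: row -> bool, e.g. a regex or keyword prefilter.
    expensive_classifier: list of rows -> list of labels, e.g. a model call.
    """
    for batch in batched(rows, batch_size):
        candidates = [row for row in batch if cheap_filter(row)]
        if not candidates:
            continue  # skip the model call when nothing survives the filter
        labels = expensive_classifier(candidates)
        yield from zip(candidates, labels)
```

Batching keeps memory bounded and lets the expensive classifier amortize per-call overhead, while the prefilter keeps most rows away from the model entirely.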
Quick Answer: This question evaluates a candidate's ability to design scalable, ML-driven systems that detect and classify privacy-sensitive and PII-containing records across structured and unstructured data. It tests competencies in data engineering, machine learning (including deep learning and LLM/RAG integrations), system architecture, and privacy and security. Interviewers commonly use it to probe reasoning about functional and non-functional requirements, trade-offs among detection and classification approaches, evaluation metrics such as precision and recall, feedback loops, and operational scaling. It falls under ML system design, with a practical, application-level focus rather than a purely conceptual one.