This question evaluates a candidate's ability to design scalable ML-driven systems that detect and classify privacy-sensitive and PII-containing records across structured and unstructured data. It tests competencies in data engineering, machine learning (including deep learning and LLM/RAG integrations), system architecture, and privacy and security. It is commonly asked to probe reasoning about functional and non-functional requirements, trade-offs between detection and classification approaches, evaluation metrics such as precision and recall, feedback loops, and operational scaling. It falls under ML system design, with a practical, application-level focus rather than purely conceptual abstraction.
You are given a very large database that contains user data (both structured fields and unstructured text such as logs, messages, and documents). The company wants to automatically:
You may use traditional ML, deep learning, and LLM-based approaches (e.g., retrieval-augmented generation, RAG).
Design an end-to-end system that solves this problem. In your design, describe:
Assume the database could have billions of rows, with multiple data sources and schemas.
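As a starting point for discussion, a rule-based baseline and its evaluation could be sketched as below. This is only an illustration under stated assumptions, not a reference answer: the regex patterns, the `detect_pii` and `precision_recall` helpers, and the five labeled sample records are all hypothetical and far simpler than what a production system over billions of rows would need.

```python
import re

# Hypothetical, deliberately simple patterns (assumption: US-style formats only).
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def detect_pii(text: str) -> set[str]:
    """Return the set of PII category labels whose pattern matches the text."""
    return {label for label, pattern in PII_PATTERNS.items() if pattern.search(text)}


def precision_recall(predicted: list[bool], actual: list[bool]) -> tuple[float, float]:
    """Compute record-level precision and recall for binary 'contains PII' flags."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labeled sample: (record text, does it truly contain PII?).
SAMPLE = [
    ("Contact me at jane.doe@example.com", True),
    ("SSN on file: 123-45-6789", True),
    ("Order 555-123-4567 shipped", False),        # order ID that looks like a phone number
    ("My passport number is X1234567", True),     # no pattern covers passports
    ("The meeting is at noon", False),
]

predicted = [bool(detect_pii(text)) for text, _ in SAMPLE]
actual = [label for _, label in SAMPLE]
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On this toy sample the baseline produces one false positive (the order ID) and one false negative (the passport number). Gaps of this kind are what the ML, deep learning, or LLM-based stages mentioned in the prompt would be expected to close.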