This question evaluates a candidate's competency in designing data-quality validation pipelines and in assessing whether a pretrained language model needs fine-tuning for spreadsheet-oriented tasks. It covers schema and content validation, semantic correctness checks, sampling and manual review, dataset splitting, evaluation metrics, baseline experiments, and leakage detection. Commonly asked in the ML System Design domain, it measures both conceptual understanding of data integrity and model-evaluation principles and the practical skills in dataset engineering and experimental methodology needed to decide whether a pretrained model is already sufficient or requires task-specific fine-tuning.
You are given a dataset for a spreadsheet assistant. Each example contains:
Design a data-quality validation pipeline for this dataset. The pipeline should detect malformed records, duplicates, inconsistent labels, low-value examples, and train/test leakage.
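A minimal sketch of what the record-level checks in such a pipeline might look like. The field names `input` and `output` are assumptions (the example schema is not specified here); duplicates and leakage are detected by hashing a whitespace- and case-normalized fingerprint of each example, which catches trivial near-copies but not deeper paraphrases.

```python
import hashlib

def normalize(text):
    # Lowercase and collapse whitespace so trivial variants hash identically.
    return " ".join(text.lower().split())

def fingerprint(example):
    # Stable hash of an example's normalized input/output pair.
    key = normalize(example.get("input", "")) + "\x1f" + normalize(example.get("output", ""))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def validate(train, test, required_fields=("input", "output")):
    """Report malformed train records, within-train duplicates, and
    test examples whose fingerprint also appears in train (leakage)."""
    report = {"malformed": [], "duplicates": [], "leakage": []}
    seen = {}  # fingerprint -> first train index
    for i, ex in enumerate(train):
        # Malformed: a required field is missing, non-string, or empty.
        if any(not isinstance(ex.get(f), str) or not ex[f].strip()
               for f in required_fields):
            report["malformed"].append(i)
            continue
        fp = fingerprint(ex)
        if fp in seen:
            report["duplicates"].append((seen[fp], i))
        else:
            seen[fp] = fp and i  # store first index
    for j, ex in enumerate(test):
        if fingerprint(ex) in seen:
            report["leakage"].append(j)
    return report
```

Inconsistent labels can be flagged with the same fingerprinting idea applied to inputs only: two examples with identical normalized inputs but different outputs are candidates for manual review.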
Then explain how you would use the cleaned dataset to decide whether a pretrained Hugging Face model is already good enough for these tasks, or whether task-specific fine-tuning is needed.
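One way to frame that decision is to measure a zero-shot baseline and a few-shot (prompted) baseline on the held-out split, then fine-tune only if neither clears the quality bar. The sketch below is illustrative: the exact-match metric, the `predict` callable (which would wrap a Hugging Face pipeline in practice), and the threshold values are all assumptions, not prescribed by the question.

```python
def exact_match_accuracy(predict, examples):
    # examples: list of (input, expected_output) pairs;
    # predict: any callable mapping an input string to an output string.
    correct = sum(1 for x, y in examples if predict(x).strip() == y.strip())
    return correct / len(examples)

def finetune_decision(zero_shot_acc, few_shot_acc, target=0.90, gap=0.05):
    """Illustrative decision rule:
    - zero-shot already meets the target -> ship the pretrained model as-is
    - few-shot prompting gets close enough -> invest in prompt engineering
    - otherwise -> task-specific fine-tuning is warranted."""
    if zero_shot_acc >= target:
        return "use pretrained as-is"
    if few_shot_acc >= target - gap:
        return "prompt engineering / few-shot"
    return "fine-tune"
```

The thresholds (`target`, `gap`) should come from product requirements, and a per-task breakdown matters more than a single aggregate number, since a spreadsheet assistant typically mixes several task types.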
Your answer should cover: