Meta trains or fine-tunes coding agents using private source-code repositories. These agents may later be used to answer coding questions, generate code snippets, or assist developers.
Design a system and research strategy to ensure that the coding agent does not output verbatim or near-verbatim code from private repositories. Your answer should cover:
-
What types of leakage you are trying to prevent.
-
How you would change the training data pipeline, model training, inference-time safeguards, and evaluation process.
-
How you would detect both exact and near-duplicate private-code outputs.
-
Pros, cons, and trade-offs of each mitigation.
-
How you would respond if an interviewer stress-tested your assumptions, for example by asking whether your approach can provide a true guarantee.