You are given DNA sequences composed of the characters A, C, G, and T. The task is to predict whether a sequence is real biological DNA or a fake / synthetic negative example.
Available data:
- A small labeled dataset containing both real and fake DNA sequences.
- A large additional training dataset containing only real DNA sequences.
Discuss how you would approach this problem. In particular:
- How would you frame the learning setup given that most of the extra data are positive examples only?
- How would you generate or augment fake DNA samples without creating unrealistic artifacts that make the classification problem artificially easy?
- How would you represent DNA sequences and choose a model, from simple baselines to more advanced sequence models?
- How would you evaluate the classifier, choose metrics, and avoid leakage from duplicate or highly similar sequences?
- What risks would you watch for, such as class imbalance, overfitting, calibration, and distribution shift between synthetic negatives and real-world fake DNA?
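
As a concrete starting point for the negative-generation and representation questions, here is a minimal Python sketch. It is one possible baseline, not a prescribed answer. The function names `shuffled_negative` and `kmer_counts`, the choice of `k = 3`, and the example sequence are all hypothetical. Shuffling preserves base composition exactly, so a classifier cannot separate the classes on single-nucleotide frequencies alone. It also destroys all higher-order structure, which may itself make the task too easy, so it is best treated as a sanity-check negative rather than a realistic one.

```python
import random
from collections import Counter

ALPHABET = set("ACGT")

def shuffled_negative(seq: str, rng: random.Random) -> str:
    """Return a random permutation of seq (assumed negative generator).

    Base composition is preserved exactly; dinucleotide and longer-range
    structure is not.
    """
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count overlapping k-mers, skipping windows with non-ACGT characters."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= ALPHABET:
            counts[kmer] += 1
    return counts

# Hypothetical example sequence, for illustration only.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
real = "ACGTACGTGGCCAATT"
fake = shuffled_negative(real, rng)

real_features = kmer_counts(real, k=3)
fake_features = kmer_counts(fake, k=3)
```

The k-mer counts can feed a simple linear baseline before moving to sequence models. If that baseline already separates shuffled negatives from real sequences almost perfectly, that suggests the negatives are too easy rather than that the model is good.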